Tea Time Numerical Analysis
Experiences in Mathematics, 2nd edition
the first in a series of tea time textbooks

Leon Q. Brin
Southern Connecticut State University
The code printed within and accompanying Tea Time Numerical Analysis electronically is distributed under the
GNU Public License (GPL).
This code is free software: you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version.
The code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details. For a copy of the GNU General Public License, see GPL.
To
Victorija, Cecelia, and Amy
Contents
Preface
    About Tea Time Numerical Analysis
    How to Get Octave
    How to Get the Code
    Acknowledgments

1 Preliminaries
    1.1 Accuracy
        Measuring Error
        Sources of Error
        Key Concepts
        Octave
        Exercises
    1.2 Taylor Polynomials
        Key Concepts
        Octave
        Exercises
    1.3 Speed
        Key Concepts
        Octave
        Exercises
    1.4 Recursive Procedures
        The Mathemagician
        Trominos
        Octave
        Exercises

2 Root Finding
    2.1 Bisection
        The Bisection Method (pseudo-code)
        Analysis of the bisection method
        Exercises
    2.2 Fixed Point Iteration
        Root Finding
        The Fixed Point Iteration Method (pseudo-code)
        Key Concepts
        Exercises
    2.3 Order of Convergence for Fixed Point Iteration
        Convergence Diagrams
        Steffensen's Method (pseudo-code)
        Key Concepts
        Octave
        Exercises
    2.4 Newton's Method
        A Geometric Derivation of Newton's Method
        Newton's Method (pseudo-code)
        Secant Method
        Secant Method (pseudo-code)
        Seeded Secant Method (pseudo-code)
        Key Concepts
        Exercises
    2.5 More Convergence Diagrams
        Exercises
    2.6 Roots of Polynomials
        Synthetic division revisited
        Finding all the roots of polynomials
        Newton's method and polynomials
        Müller's Method
        Key Concepts
        Exercises
    2.7 Bracketing
        Bracketing
        Inverse Quadratic Interpolation
        Stopping
        Key Concepts
        Exercises
        Answers

3 Interpolation
    3.1 A root-finding challenge
        The function f and its antiderivative
        The derivative of f and more graphs
        Octave
    3.2 Lagrange Polynomials
        An application of interpolating polynomials
        Neville's Method
        Uniqueness
        Octave
        Key Concepts
        Exercises
    3.3 Newton Polynomials
        Sidi's Method
        Octave
        More divided differences
        Key Concepts
        Exercises
        Answers
        Answers
    6.4 Error Analysis
        A Note About Convention and Practice
        Higher Order Methods
        Key Concepts
        Exercises
    6.5 Adaptive Runge-Kutta Methods
        Adaptive Runge-Kutta (pseudo-code)
        General Runge-Kutta Schemes
        Key Concepts
        Exercises

Bibliography
Index
Preface
dynamical system should be eyebrow-raising and question-provoking even if only tangentially important. There are,
of course, other examples of somewhat less critical content, but each is there to enhance the reader’s understanding
or appreciation of the subject, even if the material is not strictly necessary for an introductory study of numerical
analysis.
Along the way, implementation of the numerical methods in the form of computer code will also be discussed.
While one could simply ignore the programming sections and exercises and still get something out of this text, it
is my firm belief that full appreciation for the content cannot be achieved without getting one's hands "dirty" by
doing some programming. It would be nice if readers have had at least some minimal exposure to programming,
whether it be Java, C, web programming, or just about anything else. But I have made every effort to give
enough detail so that even those who have never written a one-line program will be able to participate in this
part of the study.
In keeping with the desire to produce a completely free learning experience, GNU Octave was chosen as the
programming language for this book. GNU Octave (Octave for short) is offered freely to anyone and everyone! It
is free to download and use. Its source code is free to download and study. And anyone is welcome to modify or
add to the code if so inclined. As an added bonus, users of the much better-known MATLAB will not be burdened
by learning a new language. Octave is a MATLAB clone. By design, nearly any program written in MATLAB will
run in Octave without modification. So, if you have access to MATLAB and would prefer to use it, you may do so
without worry. I have made considerable effort to ensure that every line of Octave in this book will run verbatim
under MATLAB. Even with this earnest effort, though, it is possible that some of the code will not run under
MATLAB. It has only been tested in Octave! If you find any code that does not run in MATLAB, please let me
know.
I hope you enjoy your reading of Tea Time Numerical Analysis. It was my pleasure to write it. Feedback is
always welcome.
Leon Q. Brin
[email protected]
http://lqbrin.github.io/tea-time-numerical/more.html.
http://lqbrin.github.io/tea-time-numerical/ancillaries.html.
The code printed within and accompanying Tea Time Numerical Analysis electronically is distributed under the
GNU Public License (GPL). Details are available at the website.
Acknowledgments
I gratefully acknowledge the generous support I received during the writing of this textbook, from the patience
my immediate family, Amy, Cecelia, and Victorija exercised while I was absorbed by my laptop’s screen, to the
willingness of my Spring 2013 Seminar class, Elizabeth Field, Rachael Ivison, Amanda Reyher, and Steven Warner
to read and criticize an early version of the first chapter. In between, the Woodbridge Public Library staff, especially
Pamela Wilonski, helped provide a peaceful and inspirational environment for writing the bulk of the text. Many
thanks to Dick Pelosi for his extensive review and many kind words and encouragements throughout the endeavor.
Chapter 1
Preliminaries
1.1 Accuracy
Measuring Error
Numerical methods are designed to approximate one thing or another. Sometimes roots, sometimes derivatives
or definite integrals, or curves, or solutions of differential equations. As numerical methods produce only approx-
imations to these things, it is important to have some idea how accurate they are. Sometimes accuracy comes
down to careful algebraic analysis—sometimes careful analysis of the calculus, and often careful analysis of Taylor
polynomials. But before we can tackle those details, we should discuss just how error and, therefore, accuracy are
measured.
There are two basic measurements of accuracy: absolute error and relative error. Suppose that p is the value
we are approximating, and p̃ is an approximation of p. Then p̃ misses the mark by exactly the quantity p̃ − p, the
so-called error. Of course, p̃ − p will be negative when p̃ misses low. That is, when the approximation p̃ is less
than the exact value p. On the other hand, p̃ − p will be positive when p̃ misses high. But generally, we are not
concerned with whether our approximation is too high or too low. We just want to know how far off it is. Thus,
we most often talk about the absolute error, |p̃ − p|. You might recognize the expression |p̃ − p| as the distance
between p̃ and p, and that’s not a bad way to think about absolute error.
The absolute error in approximating p = π by the rational number p̃ = 22/7 is |22/7 − π| ≈ 0.00126. The absolute
error in approximating π^5 by the rational number 16525/54 is |16525/54 − π^5| ≈ 0.00116. The absolute errors in these
two approximations are nearly equal. To make the point more transparent, π ≈ 3.14159 and 22/7 ≈ 3.14285, while
π^5 ≈ 306.01968 and 16525/54 ≈ 306.01851. Each approximation begins to differ from its respective exact value in the
thousandths place. And each is off by only 1 in the thousandths place.

But there is something more going on here. π is near 3 while π^5 is near 300. To approximate π accurate to the
nearest one hundredth requires the approximation to agree with the exact value in only 3 place values—the ones,
tenths, and hundredths. To approximate π^5 accurate to the nearest one hundredth requires the approximation
to agree with the exact value in 5 place values—the hundreds, tens, ones, tenths, and hundredths. To use more
scientific language, we say that 22/7 approximates π accurate to 3 significant digits while 16525/54 approximates π^5
accurate to 5 significant digits. Therein lies the essence of relative error—weighing the absolute error against the
magnitude of the number being approximated. This is done by computing the ratio of the error to the exact value.
Hence, the relative error in approximating π by 22/7 is |22/7 − π| / |π| ≈ 4.02(10)^-4 while the relative error in
approximating π^5 by 16525/54 is |16525/54 − π^5| / |π^5| ≈ 3.81(10)^-6. The relative errors differ by a factor of about
100 (equivalent to about two significant digits of accuracy) even though the absolute errors are nearly equal. In
general, the relative error in approximating p by p̃ is given by |p̃ − p| / |p|.
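If you already have Octave at hand (it is introduced later in this section), these error computations are one-liners. A quick sketch of the sort of session you might run (the comments give the values quoted above):

octave:1> p = pi; ptilde = 22/7;
octave:2> abs(ptilde - p)          % absolute error, about 0.00126
octave:3> abs(ptilde - p)/abs(p)   % relative error, about 4.02(10)^-4
octave:4> p = pi^5; ptilde = 16525/54;
octave:5> abs(ptilde - p)          % absolute error, about 0.00116
octave:6> abs(ptilde - p)/abs(p)   % relative error, about 3.81(10)^-6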
Sources of Error
There are two general categories of error: algorithmic error and floating-point error. Algorithmic error is any error
due to the approximation method itself. That is, these errors are unavoidable even if we do exact calculations at
every step. Floating-point error is error due to the fact that computers and calculators generally do not do exact
arithmetic, but rather do floating-point arithmetic.
Floating-point values are stored in binary. According to the IEEE Standard 754, which most computers use, the
mantissa (or significand) is stored using 52 bits, or binary places. Since the leading bit is always assumed to
be 1 (and, therefore, not actually stored), each floating point number is represented using 53 consecutive binary
place values.

Now let's consider how 1/7 is represented exactly. In binary, one seventh is equal to 0.001001001 . . . because
1/7 = Σ_{i=1}^{∞} 2^(-3i) = 1/8 + 1/64 + 1/512 + · · ·. To see that this is true, remember from calculus that

    Σ_{i=1}^{∞} 2^(-3i) = Σ_{i=1}^{∞} (2^(-3))^i = 2^(-3) / (1 − 2^(-3)) = (1/8) / (7/8) = 1/7.

But in IEEE Standard 754, 1/7 is chopped to

    1.0010010010010010010010010010010010010010010010010010 × 2^(-3),

or Σ_{i=1}^{18} 2^(-3i), which is exactly 2573485501354569/18014398509481984. The floating point error in calculating 1/7 is, therefore,

    |2573485501354569/18014398509481984 − 1/7| = 1/126100789566373888 ≈ 7.93(10)^(-18).
In floating-point arithmetic, a calculator or computer typically stores its values with about 16 significant digits.
For example, in a typical computer or calculator (using double precision arithmetic), the number 1/7 is stored as
about 0.1428571428571428, while the exact value is 0.1428571428571428 . . .. In the exact value, the pattern of
142857 repeats without cease, while in the floating point value, the repetition ceases after the third 8. The value
is chopped to 16 decimal places in the floating-point representation. So the floating point error in calculating 1/7
is around 5(10)−17 . I say “around” or “about” in this discussion because these claims are not precisely true, but
the point is made. There is a small error in representing 1/7 as a floating point real number. And the same is true
about all real numbers save a finite set.
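If you have Octave handy, you can see the stored value directly by asking for more digits than the default display shows; a minimal sketch:

octave:1> fprintf('%.20f\n', 1/7)
0.14285714285714284921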
Yes, there is some error in the floating-point representation of real numbers, but it is always small in comparison
to the size of the real number being represented. The relative error is around 10−17 , so it may seem that the
consideration of floating-point error is strictly an academic exercise. After all, what’s an error of 7.93(10)−18 among
friends? Is anyone going to be upset if they are sold a ring that is .14285714285714284921 inches wide when it
should be .14285714285714285714 inches wide? Clearly not. But it is not only the error in a single calculation (sum,
difference, product, or quotient) that you should be worried about. Numerical methods require dozens, thousands,
and even millions of computations. Small errors can be compounded. Try the following experiment.
Experiment 1
Use your calculator or computer to calculate the numbers p0 , p1 , p2 , . . . , p7 as prescribed here:
• p0 = π
• p1 = 10p0 − 31
• p2 = 100p1 − 41
1.1. ACCURACY 3
• p3 = 100p2 − 59
• p4 = 100p3 − 26
• p5 = 100p4 − 53
• p6 = 100p5 − 58
• p7 = 100p6 − 97
According to your calculator or computer, p7 is probably something like one of these:
0.93116 (Octave)
.9311599796346854 (Maxima)
1 (CASIO fx-115ES)
However, a little algebra will show that p7 = 10000000000000π − 31415926535897 exactly (which is approximately
0.932384). Even though p0 is a very accurate approximation of π, after just a few (carefully selected) computations,
round-off error has caused p7 to have only one or two significant digits of accuracy!
This experiment serves to highlight the most important cause of floating-point error: subtraction of nearly equal
numbers. We repeatedly subtract numbers whose tens and ones digits agree. Their two leading significant digits
match. For example, 10π −31 = 31.415926 . . .−31. 10π is held accurate to about 16 digits (31.41592653589793) but
10π − 31 is held accurate to only 14 significant digits (0.41592653589793). Each subsequent subtraction decreases
the accuracy by two more significant digits. Indeed, p7 is represented with only 2 significant digits. We have
repeatedly subtracted nearly equal numbers. Each time, some accuracy is lost. The error grows.
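The digit loss is easy to watch in Octave. A small sketch (the comments show roughly what to expect):

octave:1> format('long')
octave:2> 10*pi        % 31.41592653589793..., accurate to about 16 digits
octave:3> 10*pi - 31   % 0.41592653589793..., now only about 14 significant digits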
In computations that don’t involve the subtraction of nearly equal quantities, there is the concern of algorithmic
error. For example, let f(x) = sin x. Then one can prove from the definition of the derivative that

    f'(1) = lim_{h→0} [sin(1 + h) − sin(1 − h)] / (2h).

So, for small values of h, the quotient p̃(h) = [sin(1 + h) − sin(1 − h)] / (2h) ought to be a good approximation of f'(1).
Experiment 2
Using a calculator or computer, compute p̃(h) for h = 10^-2, h = 10^-3, and so on through h = 10^-7. Your results
should be something like this:

    h        p̃*(h)
    10^-2    0.5402933008747335
    10^-3    0.5403022158176896
    10^-4    0.5403023049677103
    10^-5    0.5403023058569989
    10^-6    0.5403023058958567
    10^-7    0.5403023056738121
The second column is labeled p̃*(h) to indicate that the approximation p̃(h) is calculated using approximate
(floating-point) arithmetic, so it is technically an approximation of the approximation. Since f'(1) = cos(1) ≈
.5403023058681398, each approximation is indeed reasonably close to the exact value. Taking a closer look, though,
there is something more to be said. First, the algorithmic error of p̃(10^-2) is

    |p̃(10^-2) − f'(1)| = |50 [sin(101/100) − sin(99/100)] − cos(1)| ≈ 9.00(10)^-6,

accurate to three significant digits. That is, if we compute p̃(10^-2) using exact arithmetic, the value still misses
f'(1) by about 9(10)^-6. The floating-point error is only how far the computed value of p̃(10^-2), what we have
labeled p̃*(10^-2) in the table, deviates from the exact value of p̃(10^-2). That is, the floating-point error is given by
|p̃* − p̃|:

    |0.5402933008747335 − 50 [sin(101/100) − sin(99/100)]| ≈ 1.58(10)^-17,

as small as one could expect. The absolute error |p̃*(10^-2) − f'(1)| = |0.5402933008747335 − cos(1)| is essentially
all algorithmic. The round-off error is dwarfed by the algorithmic error. The fact that we have used floating-point
arithmetic is negligible.
On the other hand, the algorithmic error of p̃(10^-7) is

    |p̃(10^-7) − f'(1)| = |5000000 [sin(10000001/10000000) − sin(9999999/10000000)] − cos(1)| ≈ 9.00(10)^-16,

accurate to three significant digits. But we should be a little bit worried about the floating-point error since
sin(10000001/10000000) ≈ 0.8414710388 and sin(9999999/10000000) ≈ .8414709307 are nearly equal. We are subtracting numbers
whose five leading significant digits match! Indeed, the floating-point error is, again, |p̃* − p̃|, or about 1.94(10)^-10.
Perhaps this error seems small, but it is very large compared to the algorithmic error of about 9(10)^-16. So, in
this case, the error is essentially all due to the fact that we are using floating-point arithmetic! This time, the
algorithmic error is dwarfed by the round-off error. Luckily, this will not often be the case, and we will be free to
focus on algorithmic error alone.
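The table of p̃*(h) values is easy to reproduce. Here is one short Octave sketch, assuming p̃(h) = [sin(1 + h) − sin(1 − h)]/(2h) as defined above:

format('long')
for k = 2:7
  h = 10^(-k);                             % h = 10^-2, 10^-3, ..., 10^-7
  ptilde = (sin(1+h) - sin(1-h))/(2*h);    % floating-point value of p-tilde(h)
  fprintf('10^-%d  %.16f\n', k, ptilde)
end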
Crumpet 2: Chaos
Edward Lorenz, a meteorologist at the Massachusetts Institute of Technology, was among the first to recognize
and study the mathematical phenomenon now called chaos. In the early 1960’s he was busy trying to model
weather systems in an attempt to improve weather forecasting. As one version of the story goes, he wanted to
repeat a calculation he had just made. In an effort to save some time, he used the same initial conditions he
had the first time, only rounded off to three significant digits instead of six. Fully expecting the new calculation
to be similar to the old, he went out for a cup of coffee and came back to look. To his astonishment, he
noticed a completely different result! He repeated the procedure several times, each time finding that small
initial variations led to large long-term variations. Was this a simple case of floating-point error? No. Here’s a
rather simplified version of what happened. Let f (x) = 4x(1 − x) and set p0 = 1/7. Now compute p1 = f (p0 ),
p2 = f (p1 ), p3 = f (p2 ), and so on until you have p40 = f (p39 ). You should find that p40 ≈ 0.080685. Now set
p0 = 1/7 + 10−12 (so we can run the same computation only with an initial value that differs from the original
by the tiny amount, 10−12 ). Compute as before, p1 = f (p0 ), p2 = f (p1 ), p3 = f (p2 ), and so on until you have
p40 = f (p39 ). This time you should find that p40 ≈ 0.91909—a completely different result! If you go back and
run the two calculations using 100 significant digit arithmetic, you will find that beginning with p0 = 1/7 leads
to p40 ≈ .080736 while beginning with p0 = 1/7 + 10−12 leads to p40 ≈ 0.91912. In other words, it is not the
fact that we are using floating-point approximations that makes these two computations turn out drastically
different. Using 1000 significant digit arithmetic would not change the conclusion, nor would any more precise
calculation. This is a demonstration of what’s known as sensitivity to initial conditions, a feature of all chaotic
systems including the weather. Tiny variations at some point lead to vast variations later on. And the “errors”
are algorithmic. This is the basic principle that makes long-range weather forecasting impossible. In the words
of Edward Lorenz, “In view of the inevitable inaccuracy and incompleteness of weather observations, precise
very-long-range forecasting would seem non-existent.”
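If you want to try the crumpet's computation yourself, here is one way to set it up in Octave (a sketch; the comments quote the approximate results reported above):

f = inline('4*x*(1-x)');   % the map from the crumpet
p = 1/7;                   % first initial value
for n = 1:40
  p = f(p);
end
p                          % approximately 0.080685

p = 1/7 + 1e-12;           % initial value perturbed by 10^-12
for n = 1:40
  p = f(p);
end
p                          % approximately 0.91909 -- a completely different result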
Experiment 3
Let a = 77617 and b = 33096, and compute

    333.75b^6 + a^2(11a^2 b^2 − b^6 − 121b^4 − 2) + 5.5b^8 + a/(2b).
You will probably get a number like −1.180591620717411(10)^21 even though the exact value is

    −54767/66192 ≈ −.8273960599468214.

That's an incredible error! But it's not because your calculator or computer has any problem calculating each term
to a reasonable degree of accuracy. Try it.

    333.75b^6 = 438605750846393161930703831040
    a^2(11a^2 b^2 − b^6 − 121b^4 − 2) = −7917111779274712207494296632228773890
    5.5b^8 = 7917111340668961361101134701524942848
    a/(2b) = 77617/66192 ≈ 1.172603940053179

The reason the calculation is so poor is that nearly equal values are subtracted after each term is calculated.
a^2(11a^2 b^2 − b^6 − 121b^4 − 2) and 5.5b^8 have opposite signs and match in their greatest 7 significant digits, so
calculating their sum decreases the accuracy by about 7 significant digits. To make matters worse, a^2(11a^2 b^2 − b^6 −
121b^4 − 2) + 5.5b^8 = −438605750846393161930703831042, which has the opposite sign of 333.75b^6 and matches it in
every place value except the ones. That's 29 digits! So we lose another 29 significant digits of accuracy in adding
this sum to 333.75b^6. Doing the calculation exactly, the sum 333.75b^6 + a^2(11a^2 b^2 − b^6 − 121b^4 − 2) + 5.5b^8 is −2.
But the computation needs to be carried out to 37 significant digits to realize this. Calculation using only about
16 significant digits, as most calculators and computers do, results in 0 significant digits of accuracy since 36 digits
of accuracy are lost during the calculation. That's why you can get a number like −1.180591620717411(10)^21 for
your final answer instead of the exact answer a/(2b) − 2 ≈ −.8273960599468214.

What may be even more surprising is that a simple rearrangement of the expression leads to a completely
different result. Try computing

    (333.75 − a^2)b^6 + a^2(11a^2 b^2 − 121b^4 − 2) + 5.5b^8 + a/(2b)

instead. This time you will likely get a number like 1.172603940053179. Again the result is entirely inaccurate, and
the reason is the same. This time the individual terms are
Key Concepts
p: the exact value being approximated.

Absolute error: |p̃ − p| is known as the absolute error in using p̃ to approximate the value p.

Relative error: |p̃ − p| / |p| is known as the relative error in using p̃ to approximate the value p.
Floating-point arithmetic: Arithmetic using numbers represented by a fixed number of significant digits.
Algorithmic error: Error caused solely by the algorithm or equation involved in the approximation, |p̃ − p| where
p̃ is an approximation of p and is computed using exact arithmetic.
Truncation error: Algorithmic error due to use of a partial sum in place of a series. In this type of error, the tail
of the series is truncated—thus the name.
Floating-point error: Error caused solely by the fact that a computation is done using floating-point arithmetic,
|p̃∗ − p̃| where p̃∗ is computed using floating-point arithmetic, p̃ is computed using exact arithmetic, and both
are computed according to the same formula or algorithm.
Round-off error: Another name for floating-point error.
Octave
The computations of this section can easily be done using Octave. All you need are arithmetic operations and a
few standard functions like the absolute value and sine and cosine. Luckily, none of these is very difficult using
Octave. The arithmetic operations are done much like they would be on a calculator. There is but one important
distinction. Most calculators will accept an expression like 3x and understand that you mean 3 × x, but Octave
will not. The expression 3x causes a syntax error in Octave. Octave needs you to specify the operation as in 3*x.
Standard functions like absolute value, sine, and cosine (and many others) have simple abbreviations in Octave.
They all take one argument, or input. Think function notation and it will become clear how to find the sine or
absolute value of a number. You need to type the name of the function, a left parenthesis, the argument, and a right
parenthesis, as in sin(7.2). Some common functions and their abbreviations are listed in Table 1.1. Functions and
arithmetic operations can be combined in the obvious way. A few examples from this section appear in Table 1.2.
There are two things to observe. First, Octave notation is very much like calculator notation. Second, by default
Octave displays results using 5 significant digits. Don't be fooled into thinking Octave has only computed those
five digits of the result, though. In fact, Octave has computed at least 15 digits correctly. And if you want to know
what they are, use the format('long') command. This command only needs to be used once per session. All
numbers printed after this command is run will be shown with 15 significant digits. For example, 1/7 will produce
0.142857142857143 instead of just 0.14286. If you would like to go back to the default format, use the format()
command with no arguments. We will discuss finer control over output later. For now, here are a few ways you
might do experiment 1 using Octave. The only differences are the amount of output and the format of the output.
The numbers are being calculated exactly the same way and with exactly the same precision.
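One simple variant (a sketch; the .m-file version of the same computation appears in the Octave discussion of Section 1.2) is to type the recurrence directly at the prompt:

octave:1> format('long')
octave:2> p = pi;
octave:3> p = 10*p - 31
p = 0.415926535897931
octave:4> p = 100*p - 41
p = 0.592653589793116
octave:5> p = 100*p - 59
p = 0.265358979311600
octave:6> p = 100*p - 26
p = 0.535897931159980
octave:7> p = 100*p - 53
p = 0.589793115997963
octave:8> p = 100*p - 58
p = 0.979311599796347
octave:9> p = 100*p - 97
p = 0.931159979634685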
Experiment 3 in Octave
octave:1> a=77617;
octave:2> b=33096;
octave:3> t1=333.75*b^6;
octave:4> t2=a^2*(11*a^2*b^2-b^6-121*b^4-2);
octave:5> t3=5.5*b^8;
octave:6> t4=a/(2*b);
octave:7> t1+t2+t3+t4
ans = -1.18059162071741e+21
octave:8> t1=(333.75-a^2)*b^6;
octave:9> t2=a^2*(11*a^2*b^2-121*b^4-2);
octave:10> t1+t2+t3+t4
ans = 1.17260394005318
In the end, the way you choose to complete an exercise in Octave will be a matter of preference, and will depend on
your goal. You should ask yourself questions like the following. How many significant digits do I need? How many
intermediate results do I need to see? Which ones? The answers to such questions should guide your solution.
When needed, Octave has abbreviations for most common constants. Table 1.3 shows the three most common.
Exercises

3. Calculate the absolute error in approximating p by p̃. [A]
   (a) p = 123; p̃ = 1106/9
   (b) p = 1/e; p̃ = .3666
   (c) p = 2^10; p̃ = 1000 [S]
   (d) p = 24; p̃ = 48 [S]
   (e) p = π^-7; p̃ = 10^-4
   (f) p = (0.062847)(0.069234); p̃ = 0.0042

4. Calculate the relative errors in the approximations of question 3. [S]

5. How many significant digits of accuracy do the approximations of question 3 have? [S]

6. Compute the absolute error and relative error in approximations of p by p̃.
   (a) p = √2, p̃ = 1.414
   (b) p = 10^π, p̃ = 1400
   (c) p = 9!, p̃ = √(18π) (9/e)^9

7. Calculate 1103√8 / 9801 using Octave.

8. The number in question 7 is an approximation of 1/π. Using Octave, find the absolute and relative errors
   in the approximation.

9. Using Octave, calculate
   (a) ⌊ln(234567)⌋
   (b) e^⌈ln(234567)⌉
   (c) ∛⌊sin(e^5.2)⌋
   (d) −e^(iπ)
   (e) 4 tan^-1(1)
   (f) ⁵√⌊cos(3) − ln(3)⌋ / ⌈arctan(3) − e^3⌉

10.
   (d) f(x) = x − tan^-1(0.429) [S]
   (e) f(x) = 10^x / 5!
   (f) f(x) = 5!/x^10

11. All of these equations are mathematically true. Nonetheless, floating point error causes some of them
    to be false according to Octave. Which ones? HINT: Use the boolean operator == to check. For example,
    to check if sin(0) = 0, type sin(0)==0 into Octave. ans=1 means true (the two sides are equal according to
    Octave—no round-off error) and ans=0 means false (the two sides are not equal according to Octave—round-off
    error).
   (a) (2)(12) = 9^2 − 4(9) − 21
   (b) e^(3 ln(2)) = 8
   (c) ln(10) = ln(5) + ln(2)
   (d) g((1 + √5)/2) = (1 + √5)/2 where g(x) = ∛(x^2 + x)
   (e) ⌊153465/3⌋ = 153465/3
   (f) 3π^3 + 7π^2 − 2π + 8 = ((3π + 7)π − 2)π + 8

12. Find an approximation p̃ of p with absolute error .001.
   (a) p = π [S]
   (b) p = √5
   (c) p = ln(3) [S]
   (d) p = √23^√23 / 10 [S]
   (e) p = ln(1.1)
   (f) p = tan(1.57079)

13. Find an approximation p̃ of p with relative error .001 for each value of p in question 12. [S]

14. p̃ approximates what value with absolute error .0005? [A]
   (a) p̃ = .2348263818643
   (b) p̃ = 23.89627345677
   (c) p̃ = −8.76257664363

15. Repeat question 14 except with relative error .0005. [A]

16. p̃ approximates p with absolute error 1/100 and relative error 3/100. Find p and p̃. [A]

17. p̃ approximates p with absolute error 3/100 and relative error 1/100. Find p and p̃.

18. Suppose p̃ must approximate p with relative error at most 10^-3. Find the largest interval in which p̃ must
    lie if p = 900.

19. The number e can be defined by e = Σ_{n=0}^{∞} (1/n!). Compute the absolute error and relative error in the
    following approximations of e:
   (a) Σ_{n=0}^{5} 1/n!
   (b) Σ_{n=0}^{10} 1/n!

20. The golden ratio, (1 + √5)/2, is found in nature and in mathematics in a variety of places. For example, if Fn
    is the nth Fibonacci number, then

        lim_{n→∞} F_{n+1}/F_n = (1 + √5)/2.

    Therefore, F11/F10 may be used as an approximation of the golden ratio. Find the relative error in this
    approximation. HINT: The Fibonacci sequence is defined by F0 = 1, F1 = 1, Fn = F_{n−1} + F_{n−2} for n ≥ 2.

21. Find values for p and p̃ so that the relative and absolute errors are equal. Make a general statement about
    conditions under which this will happen. [A]

22. Find values for p and p̃ so that the relative error is greater than the absolute error. Make a general statement
    about conditions under which this will happen.

23. Find values for p and p̃ so that the relative error is less than the absolute error. Make a general statement
    about conditions under which this will happen.

24. Calculate (i) p̃* using a calculator or computer, (ii) the absolute error, |p̃* − p|, and (iii) the relative error,
    |p̃* − p| / |p|. Then use the given value of p̃ to compute (iv) the algorithmic error, |p̃ − p|, and (v) the round-off
    error, |p̃* − p̃|.
   (a) Let f(x) = x^4 + 7x^3 − 63x^2 − 295x + 350 and let p = f'(−2). The value
       p̃ = [f(−2 + 10^-7) − f(−2 − 10^-7)] / (2(10)^-7) is a good approximation of p. p̃ is exactly
       8.99999999999999. [A]
   (b) Let f'(x) = e^x sin(10x) and f(0) = 0, and let p = f(1). It can be shown that
       p = (1/101) e (sin 10 − 10 cos 10) + 10/101. Euler's method produces the approximation
       p̃ = (1/10) Σ_{i=1}^{10} e^(i/10) sin i. Accurate to 28 significant digits, p̃ is 0.2071647018159241499410798569.
   (c) Let a0 = (5 + √5)/8 and a_{n+1} = 4a_n(1 − a_n), and consider p = a51. It can be shown that
       p = a51 = (5 − √5)/8. The most direct algorithm for calculating a51 is to calculate a1, a2, a3, . . . , a51 in
       succession, according to the given recursion relation. Use this algorithm to compute p̃* and p̃.
1.2 Taylor Polynomials
Proof. Let I be the open interval between x and x0 and Ī be the closure of I. Since I ⊂ Ī ⊂ (a, b) and f has n + 1
derivatives on (a, b), we have that f, f', f'', . . . , f^(n) are all continuous on Ī and that f^(n+1) exists on I. We now
define

    F(z) = f(x) − f(z) − Σ_{j=1}^{n} f^(j)(z)/j! (x − z)^j.

Differentiating term by term, the sum telescopes and F'(z) = −f^(n+1)(z)(x − z)^n / n!. Applying Rolle's theorem
to the auxiliary function G(z) = F(z) − ((x − z)/(x − x0))^(n+1) F(x0) on Ī produces a ξ strictly between x and x0
with G'(ξ) = 0, so that

    F(x0) = −F'(ξ) (x − x0)^(n+1) / [(n + 1)(x − ξ)^n]
          = f^(n+1)(ξ)/(n!(n + 1)) (x − x0)^(n+1)
          = f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1).
This completes the proof.
We will use the notation

    Tn(x) = f(x0) + Σ_{j=1}^{n} f^(j)(x0)/j! (x − x0)^j

and call this the nth Taylor polynomial of f expanded about x0. We will also use the notation

    Rn(x) = f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1)

and call this the remainder term for the nth Taylor polynomial of f expanded about x0.
Crumpet 3: ξ
ξ is the (lower case) fourteenth letter of the Greek alphabet and is pronounced ksee. It is customary, but, of
course, not necessary to use this letter for the unknown quantity in Taylor’s theorem. The capital version of ξ is
Ξ, a symbol rarely seen in mathematics.
It will not be uncommon, for sake of brevity, to call Tn (x) the nth Taylor polynomial and Rn (x) the remainder
term when the function and center of expansion, x0 , are either unspecified or clear from context.
In calculus, you likely focused on the Taylor polynomial, or Taylor series, and did not pay much attention to the
remainder term. The situation is quite the reverse in numerical analysis. Algorithmic error can often be ascertained
by careful attention to the remainder term, making it more critical than the Taylor polynomial itself. The Taylor
polynomial will, however, be used to derive certain methods, so won’t be entirely neglected.
The most important thing to understand about the remainder term is that it tells us precisely how well Tn (x)
approximates f(x). From Taylor's theorem, f(x) = Tn(x) + Rn(x), so the absolute error in using Tn(x) to
approximate f(x) is given by |Tn(x) − f(x)| = |Rn(x)|. But |Rn(x)| = |f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1)| for some ξ
between x and x0. Therefore,

    |Tn(x) − f(x)| = |Rn(x)| ≤ max_ξ |f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1)|
                             = [|x − x0|^(n+1) / (n + 1)!] · max_ξ |f^(n+1)(ξ)|.
This leads to a few observations.

1. The remainder term is precisely the error in using Tn(x) to approximate f(x). Hence, it is sometimes referred
   to as the error term.

2. The absolute error in using Tn(x) to approximate f(x) depends on three factors:
   (a) |x − x0|^(n+1)
   (b) 1/(n + 1)!
   (c) max_ξ |f^(n+1)(ξ)|

3. We can find an upper bound on |Tn(x) − f(x)| by finding an upper bound on |f^(n+1)(ξ)|.
Figure 1.2.1: For small n, Tn(x) is a good approximation only for small x. (Graphs of T4(x), T10(x), and cos(x) for x between −10 and 10.)
Because |Rn(x)| measures exactly the absolute error |Tn(x) − f(x)|, we will be interested in conditions that force
|Rn(x)| to be small. According to observation 2, there are three quantities to consider. First, |x − x0|^(n+1), or |x − x0|,
the distance between x and x0. The approximation Tn(x) will generally be better for x closer to x0. Second, 1/(n+1)!.
This suggests that the more terms we use in our Taylor polynomial (the greater n is), the better the approximation
will be. Finally, |f^(n+1)(ξ)|, the magnitude of the (n + 1)st derivative of f. The tamer this derivative, the better
Tn(x) will approximate f(x). Be warned, however, these are just rules of thumb for making |Rn(x)| small. There
are exceptions to these rules.
Figure 1.2.2: The actual error |Tn(x) − f(x)| is often much smaller than the theoretical bound. (Graphs of T2(x), T11(x), and ln(x) for x between 0 and 18, with the point (e^2, 2) marked.)
To see these factors in action, consider f(x) = ln(x) expanded about x0 = e^2. According to Taylor's theorem,

    T2(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4)   and   R2(x) = 1/(3ξ^3) (x − e^2)^3;

    T11(x) = 2 + Σ_{j=1}^{11} (−1)^(j−1) (x − e^2)^j / (j e^(2j))   and   R11(x) = −1/(12ξ^12) (x − e^2)^12.

After you have convinced yourself these formulas are correct, suppose that we are interested in approximating ln(x)
with an absolute error of no more than 0.1. Since |ξ^-3| and |ξ^-12| are decreasing functions of ξ, they attain their
maximum values on a closed interval at the lower endpoint of that interval. Hence, for x ≥ e^2, we have |R2(x)| ≤
max_{ξ∈[e^2,x]} |1/(3ξ^3) (x − e^2)^3| = 1/(3e^6) (x − e^2)^3. But for 0 < x < e^2, we have |R2(x)| ≤ max_{ξ∈[x,e^2]} 1/(3ξ^3) (x − e^2)^3 =
1/(3x^3) (e^2 − x)^3. To determine where these remainders are less than 0.1, we need to solve the equations
1/(3e^6) (x − e^2)^3 = 0.1 and 1/(3x^3) (e^2 − x)^3 = 0.1. The values we seek are x = (1 + ∛(3/10)) e^2 ≈ 12.33 and
x = [(∛8100 + 10∛90 − 30)/(13∛90)] e^2 ≈ 4.427. So Taylor's theorem guarantees that T2(x) will approximate ln(x)
to within 0.1 over the entire interval [4.427, 12.33]. Since e^2 ≈ 7.389, T2(x) approximates ln(x) to within 0.1 from
about 3 below e^2 to about 5 above e^2. In other words, as long as x is close enough to x0 = e^2, the approximation is
good. A similar calculation for R11(x) reveals that T11(x) is guaranteed to approximate ln(x) to within 0.1 over the
interval [3.667, 14.89]. In other words, for a larger value of n, x doesn't need to be as close to x0 to achieve the same
accuracy.

But remember, these are only theoretical bounds on the errors. The actual errors are often much smaller than
the bounds. For example, our analysis gives the upper bound |R2(3)| ≤ 1/(3·3^3) (e^2 − 3)^3 ≈ 1.05 while the actual
error, |T2(3) − ln(3)| = |2 + (3 − e^2)/e^2 − (3 − e^2)^2/(2e^4) − ln(3)| ≈ .131. The bound is about 8 times the actual error. If we
take this point a bit further, the graphs of T2(x) and T11(x) versus ln(x) (and a bit of calculation we will discuss
later) reveal that T2(x) actually approximates ln(x) to within 0.1 over the interval [3.296, 13.13] and T11(x) actually
approximates ln(x) to within 0.1 over the interval [0.9030, 15.33]. These intervals are a bit larger than the theoretically
guaranteed intervals. See Figure 1.2.2. This figure reveals something else too. T2(18) does a much better job of
approximating ln(18) than does T11(18). It's not always the case that more terms means a better approximation.
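The numbers in this example are easy to check with an inline function (a sketch; the expected values in the comments are from the discussion above):

octave:1> T2 = inline('2 + (x-exp(2))/exp(2) - (x-exp(2))^2/(2*exp(4))');
octave:2> abs(T2(3) - log(3))       % actual error, about .131
octave:3> (exp(2)-3)^3/(3*3^3)      % theoretical bound, about 1.05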
We now turn our attention to perhaps the most often analyzed Taylor polynomials—those for the sine and cosine
functions. They provide examples with beautiful visualization and simple analysis. The nth Taylor polynomial for
f(x) = cos(x) expanded about 0 is

    Tn(x) = cos(0) + Σ_{j=1}^{n} [d^j/dx^j (cos(x)) evaluated at x = 0] / j! · (x − 0)^j.

Since the sine and cosine functions are bounded between −1 and 1 we know that

    −|x|^(n+1)/(n + 1)! ≤ Rn(x) ≤ |x|^(n+1)/(n + 1)!.
There are two ways this remainder term will be small. First, if x is close to 0, then |x| is small, making Rn(x)
small. Second, if n is large, then 1/(n+1)! is small, making Rn(x) small. In other words, for small values of n, the
remainder term is small for small values of x. Tn(x) is a good approximation of cos(x) for such combinations of
x and n. On the other hand, for large values of n, the remainder term is small even for large values of x. For
example, |R61(x)| ≤ |x|^62/62!, so |R61(x)| will remain less than 1 for all x with magnitude less than ⁶²√(62!) ≈ 23.933.
Figures 1.2.1 and 1.2.3 illustrate these points.
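That threshold is a one-liner in Octave, for instance:

octave:1> nthroot(factorial(62), 62)   % approximately 23.933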
Figure 1.2.3: For large n, Tn(x) is a good approximation even for large x. (Graphs of T60(x) and cos(x) for x between −30 and 30.)
Key Concepts
Rolle's theorem: Suppose that f(x) is continuous on [a, b] and differentiable on (a, b). If f(a) = f(b), then there
exists ξ ∈ (a, b) such that f'(ξ) = 0.

Taylor's theorem: Suppose that f(x) has n + 1 derivatives on (a, b), and x0 ∈ (a, b). Then for each x ∈ (a, b),
there exists ξ, depending on x, lying strictly between x and x0 such that

    f(x) = f(x0) + Σ_{j=1}^{n} f^(j)(x0)/j! (x − x0)^j + f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1).

nth Taylor polynomial: Tn(x) = f(x0) + Σ_{j=1}^{n} f^(j)(x0)/j! (x − x0)^j.

Maclaurin polynomial: A Taylor polynomial expanded about x0 = 0 is also called a Maclaurin polynomial.

Remainder term: Rn(x) = f^(n+1)(ξ)/(n + 1)! (x − x0)^(n+1) is precisely −(Tn(x) − f(x)).
The original theorem of Brook Taylor was published in his opus magnum Methodus Incrementorum Directa &
Inversa of 1715. In Methodus, it appears as the second corollary to Proposition VII Theorem III, bearing faint
resemblance to any modern statement of the theorem.

There is no mention of a remainder term. There is no use of the familiar f(x)-type function notation. It's written
in Latin. And there is no laundry list of hypotheses.

Here is the original statement of Taylor's theorem in English as translated by Ian Bruce. Proposition VII.
Theorem III: There are two variable quantities, z & x, of which z is regularly increased by the given increment
ż, and nż = v, v − ż = v̀, v̀ − ż = v̏, and thus henceforth. Moreover, I say that in the time z increases to z + v, x
increases likewise to become

    x + ẋ v/(1ż) + ẍ v v̀/(1·2 ż^2) + x⃛ v v̀ v̏/(1·2·3 ż^3) + &c.

Corollary II: If for the evanescent increments, the fluxions of the proportionals themselves are written, now with
all the v̀, v̏, &c. equal to v, as the time z uniformly flows to become z + v, x becomes

    x + ẋ v/(1ż) + ẍ v^2/(1·2 ż^2) + x⃛ v^3/(1·2·3 ż^3) + &c.
Unfortunately, the English translation of Taylor's theorem is only moderately helpful to anyone who is not well
acquainted with early 18th century mathematics. In 1715, function notation was still 20 years in the making.
Today, we would interpret the declaration of the two variables as declaring that x is a function of z. The claim
in Theorem III is that we can rewrite x(z + v) as x + ẋ v/(1ż) + ẍ v v̀/(1·2 ż^2) + x⃛ v v̀ v̏/(1·2·3 ż^3) + &c. Just as x should be
interpreted as a function of z, so should ẋ, ẍ, and x⃛. More precisely, ẋ means x(z + ż) − x(z), the amount x is
incremented as z is incremented by ż. Likewise, ẍ is the amount ẋ is incremented as z is incremented by ż, so
ẍ = [x(z + 2ż) − x(z + ż)] − [x(z + ż) − x(z)] = x(z + 2ż) − 2x(z + ż) + x(z). Similarly, x⃛ is
the amount ẍ is incremented as z is incremented by ż. Now would be a good time to break from reading to verify
that x⃛ = x(z + 3ż) − 3x(z + 2ż) + 3x(z + ż) − x(z), that x⃜ = x(z + 4ż) − 4x(z + 3ż) + 6x(z + 2ż) − 4x(z + ż) + x(z),
and so on. With this understanding and the conventions x0 for x, x1 for ẋ, x2 for ẍ, v0 for v, v1 for v̀, v2 for v̏, and
so on, it is then an algebraic exercise to see that

    x(z + nż) = Σ_{j=0}^{n} (n choose j) xj = x0 + x1 · n/1 + x2 · n(n − 1)/(1·2) + x3 · n(n − 1)(n − 2)/(1·2·3) + · · · + xn · n(n − 1)· · ·1/(1·2·3· · ·n).

Writing n = v/ż and letting the increments become evanescent, so that each product n(n − 1)· · ·(n − j + 1) approaches
(v/ż)^j, we have x + ẋ v/(1ż) + ẍ v^2/(1·2 ż^2) + x⃛ v^3/(1·2·3 ż^3) + &c as claimed.

It is interesting that Theorem III is true for any function x defined on the interval [z, z + v], no matter whether
x is differentiable, or even continuous. It is a statement about finite differences. It is the corollary that requires
many more assumptions because that is where we pass to the limit.
Octave
Two things that will come in handy time and again when using Octave are inline functions and .m files. Creating
an inline function is a simple way to make a “custom” function in Octave. Creating a .m file is an organized way
to execute a number of commands and save your work for later.
In the last section we saw many built-in functions like sin(x), log(x), and abs(x). These have predefined
meaning in Octave. But what if you want to define f(x) = 3x^2? There is no built-in "3 x squared" function. That's
where an inline function is useful. The syntax for an inline function is

    name = inline('function definition')

where name is the name of the function and function definition is its formula. In the case of f(x) = 3x^2,
the Octave code looks like f=inline('3*x^2'). Then you can use f the same way you would use sin or log or
abs. Write the name of the function, left parenthesis, argument, right parenthesis. So, after defining f with the
f=inline('3*x^2') statement, f(7) will result in 147:
octave:1> f=inline('3*x^2');
octave:2> f(7)
ans = 147
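One caveat worth knowing: newer versions of Octave (and MATLAB) consider inline deprecated in favor of anonymous functions, which are created with the @ syntax and called exactly the same way. A minimal sketch:

octave:1> f = @(x) 3*x^2;   % anonymous-function equivalent of inline('3*x^2')
octave:2> f(7)
ans = 147

Either form works in the examples that follow.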
Now we may complete Experiment 1 of section 1.1 a fourth way. Instead of doing the computations on the
command line, we can create a text file with the commands in it. Saved as a .m file, Octave will recognize it as a list
of instructions. If you are familiar with programming, this way of working with Octave will come very naturally.
Writing a .m file is the equivalent of writing a program. After it is written, it needs to be processed. On the Octave
command line, a .m file is run by typing the name of the file, without the .m. That’s it, so it isn’t exactly like
writing a program. There is no compiling. It’s a little bit more like scripting that way.
To begin, use any text editor you like to create the list of commands. Note well, Microsoft Word, LibreOffice,
and other word processors are not text editors. They are word processors. They have font formatting features,
page set up features, and so on. Now imagine your last report or letter to Mom and remove all the formatting,
save separation of paragraphs. That’s a text file. No bold, no centering, no images, no special fonts, no margins,
no pages. Just the typed words. There is no need for all the decorations a word processor allows. All Octave needs
is a list of commands. The only formatting you will need is the line feed (new line) and tabs. If you don’t already
have a favorite text editor (and maybe even if you do), you should use the one that comes with Octave. If you use
this program, you will have no problems. So, first create the text document experiment1.m exactly as shown here:
format('long')
p1 = 10*pi-31
p2 = 100*p1-41
p3 = 100*p2-59
p4 = 100*p3-26
p5 = 100*p4-53
p6 = 100*p5-58
p7 = 100*p6-97
Then, on the Octave command line, type experiment1 to get the results:
octave:1> experiment1
p1 = 0.415926535897931
p2 = 0.592653589793116
p3 = 0.265358979311600
p4 = 0.535897931159980
p5 = 0.589793115997963
p6 = 0.979311599796347
p7 = 0.931159979634685
This way of writing Octave commands has two distinct advantages. First, if you make errors, it’s a simple matter
to correct them. Just edit the text file and save the changes. Second, you have a record of your work. You can
share it, print it, or just save it for later. There is only one real disadvantage. It’s more involved than just executing
a few commands on the command line. So, for simple computations, it is more headache than necessary.
Note well that the .m file has to be saved in the same directory from which Octave was started. This type of
detail will be taken care of for you if you use an IDE, but if you are using a command line and text editor, you
need to be sure .m files are saved to the proper location.
Exercises

4.
   (b) Use the Taylor polynomial to approximate f(4).
   (c) Find a bound on the absolute error of the approximation using the fact that −3 ≤ f^(4)(ξ) ≤ 5 for all
       ξ ∈ [2, 4].

5. Compute the 3rd Taylor Polynomial for f(x) = x^5 − 2x^4 + x^3 − 9x^2 + x − 1 expanded about x0 = 1.

6. Find the second Taylor Polynomial for f(x) = csc x expanded about x0 = π/4. Here are some facts you may
   find useful:

       f'(x) = −csc(x) cot(x)             csc(x) = 1/sin(x)
       f''(x) = csc(x)(1 + 2 cot^2(x))    cot(x) = cos(x)/sin(x)

7. The hyperbolic sine, sinh(x), and hyperbolic cosine, cosh(x), are derivatives of one another. That is,
   d/dx (sinh(x)) = cosh(x) and d/dx (cosh(x)) = sinh(x). Find the remainder term, R43, associated with the 43rd
   Maclaurin polynomial for f(x) = cosh(x).

8. Use an inline function to evaluate the Taylor polynomial T4(x) = 1 − (1/2)x^2 + (1/24)x^4 at the given value of x. [S]
   (a) 0    (b) 1/2    (c) 1    (d) π

9. Use an inline function to evaluate the Taylor polynomial T3(x) = 1 + x + (1/2)x^2 + (1/6)x^3 at the given value of x.
   (a) 0    (b) 3/2

13. Let f(x) = x^3.
   (a) Find the second Taylor polynomial, P2(x), about x0 = 0.
   (b) Find the remainder term, R2(0.5), and the actual error in using P2(0.5) to approximate f(0.5).
   (c) Repeat part (a) using x0 = 1.
   (d) Repeat part (b) using the polynomial from part (c).

14. Find the second Taylor polynomial, P2(x), for f(x) = e^x cos x about x0 = 0.
   (a) Use P2(0.5) to approximate f(0.5). Find an upper bound on the error |f(0.5) − P2(0.5)| using the
       remainder term and compare it to the actual error.
   (b) Find a bound on the error |f(x) − P2(x)| good on the interval [0, 1].
   (c) Approximate ∫_0^1 f(x) dx by calculating ∫_0^1 P2(x) dx instead.
   (d) Find an upper bound for the error in (c) using ∫_0^1 |R2(x)| dx and compare the bound to the actual error.

15. Let f(x) = e^x.
   (a) Find the nth Maclaurin polynomial Pn(x) for f(x).
   (b) Find a bound on the error in using P4(2) to approximate f(2).
   (c) How many terms of the Maclaurin polynomial would you need to use in order to approximate f(2) to
       within 10^-10? In other words, for what n does Pn(2) have an error bound less than or equal to 10^-10?

16. Find the fourth Taylor Polynomial for ln x expanded about x0 = 1.

17. What is the 50th term of T100(e^x) expanded about x0 = 6?

18. The Maclaurin series for the arctangent function converges for −1 < x ≤ 1 and is given by

        arctan x = lim_{n→∞} Pn(x) = lim_{n→∞} Σ_{i=1}^{n} (−1)^(i+1) x^(2i−1)/(2i − 1).

    Find the nth Maclaurin polynomial Pn(x) for f.
21. How many terms of the Maclaurin Series of sin x are needed to guarantee an approximation with error no
    more than 10^-2 for any value of x between 0 and 2π?

22. Suppose you are approximating f(x) = e^x using the tenth Maclaurin polynomial. Find the largest interval
    over which the approximation is guaranteed to be accurate to within 10^-3.

23. Find a bound on the error in approximating e^10 by using the twenty-fifth Taylor polynomial of g(x) = e^x
    expanded about x0 = 0.

24. Find a bound on the error of the approximation

        e^2 ≈ 1 + 2 + (1/2)(2)^2 + (1/6)(2)^3 + (1/24)(2)^4 + (1/120)(2)^5

    according to Taylor's Theorem. Compare this bound to the actual error.

25. Suppose f^(8)(x) = e^x cos x for some function f. Find a bound on the error in approximating f(x) over the
    interval [0, π/2] using T7(x) expanded about x0 = 0.

26. Let f(x) = 1/x, and x0 = 5. [S]
   (a) Find T2(x).
   (b) Find R2(x).
   (c) Use T2(x) to approximate f(1) and f(9).
   (g) Sketch graphs of f(x) and T2(x) on the same set of axes for x ∈ [1, 26].

28. Suppose f(x) is such that −3 ≤ f^(10)(x) ≤ 7 for all x ∈ [0, 10]. Find lower and upper bounds on the absolute
    error in using T9(x) expanded about x0 = 3 to approximate
   (a) f(0).
   (b) f(10).

29. Suppose you wish to approximate the value of −e^4 sin 4 using separate Maclaurin polynomials (Taylor polynomials
    expanded about x0 = 0) for the sine and exponential functions instead of a single Maclaurin polynomial
    for the function f(x) = −e^x sin x. How many terms of each would you need in order to get accuracy within
    10^-20? Ignore round-off error.

30. Find a theoretical upper bound, as a function of x, for the absolute error in using T4(x) to approximate f(x).
   (a) e^x sin x; x0 = 0. [S]
   (b) e^(−x^2); x0 = 0.
   (c) 10/x + sin(10x); x0 = π.

31. The Maclaurin Series for f(x) = e^(−x) is
1.3 Speed
Besides accuracy, there is nothing more important about a numerical method than speed. There is almost always a
trade-off between one and the other, however. Fast computations are often not particularly accurate, and accurate
calculations are often not particularly fast. There are certain algorithms that produce accurate results quickly,
however. Deriving them, or identifying them once derived is what numerical analysis is all about.
The first type of numerical method we will encounter produces a sequence of approximations that, when ev-
erything is working, approach some desired value, say p. With these methods, we will get a sequence hpn i with
limn→∞ pn = p. You should be familiar with the concept of the limit of a sequence from Calculus, but the purpose
there was much different from ours here. Generally, you were concerned with whether a given sequence converged
at all. And when it did converge, and you were very lucky, you were able to determine the limit. In numerical
analysis, we know certain sequences converge, and are only interested in how quickly they do so.
Simple observation (and a little common sense) can tell you which cars on a highway are traveling faster than
which. Simple observation (and a little common sense) will also often tell you which sequences converge faster
than which. Consider the sequences in Table 1.4 which all converge to e ≈ 2.71828182845904. ⟨tn⟩ is accurate
to 15 significant digits by the sixth term; ⟨sn⟩ is accurate to 15 significant digits by the eighth term; ⟨rn⟩ is still
not accurate to 15 significant digits by the eleventh term, but seems likely to gain 15 significant digits of accuracy
on the twelfth term; and ⟨qn⟩ is only accurate to 2 significant digits by the eleventh term, so seems likely to take
considerably more than twelve terms to gain 15 significant digits of accuracy. Since they all started at 3, it seems
reasonable to say that, ordered from fastest to slowest, they are ⟨tn⟩, ⟨sn⟩, ⟨rn⟩, ⟨qn⟩. And that is correct as we will
see soon. But just like knowing which cars are faster than which is different from knowing how fast each is going,
knowing which sequences converge faster than which is different from knowing how quickly each one converges. To
measure the speed of a given car, you need access to its speedometer or a radar gun. To measure the order of
convergence (speed) of a sequence, you need a definition and a little algebra.
The sequence ⟨pn⟩ converges to p with order of convergence α if

    lim_{n→∞} |p_{n+1} − p| / |p_n − p|^α = λ

for some positive constant λ. For n large enough, then, consecutive ratios are both approximately λ:

    |p_{n+1} − p| / |p_n − p|^α ≈ |p_{n+2} − p| / |p_{n+1} − p|^α ≈ λ.

In particular, we can solve for α to find

    α ≈ ln( (p_{n+2} − p)/(p_{n+1} − p) ) / ln( (p_{n+1} − p)/(p_n − p) ).
There is no such thing as an order of convergence less than one because if lim_{n→∞} |p_{n+1} − p|/|p_n − p|^α = λ for some 0 < α < 1, then

    lim_{n→∞} |p_{n+1} − p| / |p_n − p| = lim_{n→∞} ( |p_{n+1} − p| / |p_n − p|^α ) · |p_n − p|^{α−1},

a contradiction. On the one hand, the ratio test implies that lim_{n→∞} |p_{n+1} − p|/|p_n − p| exists and is less than or equal to 1. On the other hand, α < 1 ⟹ α − 1 < 0, so for |p_n − p| small, |p_n − p|^{α−1} is large. Hence, lim_{n→∞} ( |p_{n+1} − p|/|p_n − p|^α ) · |p_n − p|^{α−1} does not exist. To be rigorous, let M be any real number. Then there exists an N1 such that n > N1 implies |p_{n+1} − p|/|p_n − p|^α > 0.9λ. There also exists N2 such that n > N2 implies |p_n − p| < (0.9λ/M)^{1/(1−α)}, so |p_n − p|^{α−1} > M/(0.9λ). Letting N = max{N1, N2} we have that n > N implies both |p_{n+1} − p|/|p_n − p|^α > 0.9λ and |p_n − p|^{α−1} > M/(0.9λ). Hence, for n > N, we have

    |p_{n+1} − p| / |p_n − p| = ( |p_{n+1} − p| / |p_n − p|^α ) · |p_n − p|^{α−1} > 0.9λ · M/(0.9λ) = M.

Therefore, lim_{n→∞} |p_{n+1} − p|/|p_n − p| does not exist. When α = 1, it must be that λ ≤ 1 because otherwise the ratio test
implies that ⟨|p_n − p|⟩ diverges, and, therefore, ⟨pn⟩ diverges.
For example,

    ln( (q2 − e)/(q1 − e) ) / ln( (q1 − e)/(q0 − e) ) = ln( (2.8985 − e)/(2.9436 − e) ) / ln( (2.9436 − e)/(3 − e) ) ≈ 1

and

    ln( (q10 − e)/(q9 − e) ) / ln( (q9 − e)/(q8 − e) ) = ln( (2.7485 − e)/(2.7560 − e) ) / ln( (2.7560 − e)/(2.7655 − e) ) ≈ 1.

And if we try other sets of three consecutive terms of ⟨qn⟩, we get the same results. The order of convergence of ⟨qn⟩ is about 1. Of course, we would
need a formula for |qn − e| to determine whether the limit were truly 1, but we have some evidence. Repeating
the calculations for ⟨rn⟩, ⟨sn⟩, and ⟨tn⟩, we get approximate orders of convergence 1.322, 1.618, and 2, respectively.
Again we see that, ordered from fastest to slowest, they are ⟨tn⟩, ⟨sn⟩, ⟨rn⟩, ⟨qn⟩.
If you attempted to calculate the orders of convergence yourself, you may have noticed that more information is
needed to use sn with n > 6 or tn with n > 4. All of these terms in the table are equal, so the formula for α fails to
produce a real number! A more useful table for calculating orders of convergence is one listing absolute errors (Table 1.5). In
addition to making it easier to calculate α, such a table makes it painfully obvious that our common sense conclusion
about which sequences converge faster than which was quite right. Just compare the accuracy (absolute errors) of
the eleventh terms.
So now we can calculate orders of convergence, but what does it all mean? What does the order of convergence tell us about successive terms in the sequence? Solving the approximation |p_{n+1} − p|/|p_n − p|^α ≈ λ for |p_{n+1} − p| gives |p_{n+1} − p| ≈ λ|p_n − p|^α. So, roughly speaking, convergence of order α means that, for large enough n, the error |p_{n+1} − p| is about λ|p_n − p|^{α−1} times the error |p_n − p|. To rephrase in terms of significant digits of accuracy, a little bit of algebra gives:

1. for linear convergence (α = 1), d(p_{n+1}) ≈ d(p_n) − log λ, so each term has a fixed number more significant digits of accuracy (approximately equal to −log λ) than the previous;

2. for quadratic convergence (α = 2), d(p_{n+1}) ≈ 2d(p_n) − log(λ|p|), so each term has double the number of significant digits of accuracy of the previous, give or take some;

3. for cubic convergence (α = 3), d(p_{n+1}) ≈ 3d(p_n) − log(λ|p|^2), so each term has triple the number of significant digits of accuracy of the previous, give or take some;

and so on. Summarizing, for large n, you can expect that each term will have −log(λ|p|^{α−1}) more than α times as many significant digits of accuracy as the previous term. We can see this claim in action by calculating λ for the sequences ⟨tn⟩, ⟨sn⟩, ⟨rn⟩, and ⟨qn⟩. Using the fact that λ ≈ |p_{n+1} − p|/|p_n − p|^α, we find that λ = 0.8 for each sequence. Therefore, ⟨qn⟩ should show each term having −log 0.8 ≈ .1 more significant digits of accuracy than the previous. More sensibly, this means the sequence will show about one more significant digit of accuracy every ten terms. This is borne out by observing that q0 has error about 3(10)^−1 while q10 has error about 3(10)^−2. For ⟨rn⟩, we should expect each term to have about −log(0.8 · e^{0.322}) ≈ −0.04 more than 1.322 times as many significant digits of accuracy as the previous. For example, r3 has about −log(2.145(10)^{−2}/e) ≈ 2.1 significant digits of accuracy while r4 has about 1.322(2.1) − .04 ≈ 2.73 significant digits of accuracy, r5 has 1.322(2.73) − .04 ≈ 3.57 significant digits of accuracy, and so on until r8 has about 8.1 significant digits of accuracy. Again this is borne out by the table as −log(|r8 − e|/e) = −log(2.113(10)^{−8}/e) ≈ 8.1. Though we can do a similar calculation for ⟨tn⟩, it's easier just to eyeball it since all we need to see is that the exponent in the scientific notation doubles, give or take a little, from one term to the next. Indeed it does as it goes from 1 to 2 to 3 to 6 to 11, and so on.
Note that in all this analysis, we have ignored the requirement that n be “large”. That was acceptable in this
case since these sequences were contrived so that even n = 0 was large enough! In practical applications this will
not be the case.
To appreciate just how much faster one order of convergence is over another, consider the relation d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}) again. Now suppose we know that d(p_{n0}) = d_{n0} for some particular n0 large enough that the approximation is reasonable. Then it can be shown that, for α > 1,

    d(p_{n0+k}) ≈ α^k (d_{n0} − C) + C,   where C = log(λ|p|^{α−1}) / (α − 1).
The relation d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}) is an example of a recurrence relation. In particular, it is a first order linear nonhomogeneous recurrence relation with constant coefficients since it has the form

    a_{n+1} = k1·a_n + k2

where k1 and k2 are constants. Linear nonhomogeneous recurrence relations can be solved by summing a homogeneous solution and a particular solution. For the particular solution, we seek a solution of the form a_n = A (for all n) by substituting this assumed solution into the recurrence relation. Doing so gives A = k1·A + k2, so A = k2/(1 − k1) is such a solution. For the homogeneous solution, we seek a sequence of the form a_n = r^n that satisfies a_{n+1} = k1·a_n + 0. Substituting our assumed solution into the modified (homogeneous) recurrence relation gives r^{n+1} = k1·r^n. Rearranging, r^n(r − k1) = 0 so r = 0 or r = k1. Notice that B·k1^n is also a solution for any constant B. This includes the solution a_n = 0 which would arise from setting r = 0. Finally, putting the particular and homogeneous solutions together, the solution of a_{n+1} = k1·a_n + k2 is a_n = B·k1^n + k2/(1 − k1) for any constant B. In the case of d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}), k1 = α and k2 = −log(λ|p|^{α−1}), so

    d(p_n) ≈ B·α^n + log(λ|p|^{α−1}) / (α − 1).

The value of B is determined by substituting any known element of the sequence into this formula and solving for B. Supposing d(p_{n0}) = d_{n0} yields

    d(p_n) ≈ ( d_{n0} − log(λ|p|^{α−1}) / (α − 1) ) · α^{n−n0} + log(λ|p|^{α−1}) / (α − 1).
The important thing to see here is that d(pn0 +k ) is an exponential function when α > 1. The number of significant
digits of accuracy grows exponentially with base α. As we saw before, for α = 1, the number of significant digits
grows linearly. In calculus you learned that any exponential function grows much faster than any polynomial
function, so it is reasonable and correct to conclude that sequences converging with orders greater than 1 are
markedly faster converging than are sequences converging with linear (α = 1) order.
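To get a feel for the difference, here is a quick sketch (not taken from the text; the values p = e and λ = 0.8 are simply borrowed from the running example, and the starting value of one significant digit is arbitrary) that iterates the relation d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}) for ten steps with α = 1 and α = 2:

p = exp(1); lambda = 0.8;       % hypothetical values borrowed from the running example
for alpha = [1 2]
d = 1;                          % start from one significant digit of accuracy
for n = 1:10
d = alpha*d - log10(lambda*p^(alpha-1));
end%for
printf('alpha = %d: roughly %.1f digits after 10 steps\n', alpha, d);
end%for

With α = 1 the digit count creeps up by about a tenth of a digit per step; with α = 2 it explodes.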
But be careful. Based on this same memory of calculus, you would also conclude that the sequence ⟨2^{−n}⟩ converges to 0 much faster than does ⟨n^{−2}⟩. By some measures, that's true, but not by all measures. Consider the orders of convergence of these two sequences. We seek values α1 and α2 such that

    lim_{n→∞} |2^{−(n+1)} − 0| / |2^{−n} − 0|^{α1} = λ1   and   lim_{n→∞} |(n + 1)^{−2} − 0| / |n^{−2} − 0|^{α2} = λ2

for some real numbers λ1 and λ2. A little bit of algebra will lead to solutions:

    |2^{−(n+1)} − 0| / |2^{−n} − 0|^{α1} = 2^{−n−1} / 2^{−α1·n} = 2^{(α1−1)n − 1}

while

    |(n + 1)^{−2} − 0| / |n^{−2} − 0|^{α2} = n^{2α2} / (n^2 + 2n + 1).

The only way lim_{n→∞} 2^{(α1−1)n−1} will be a nonzero constant is if α1 = 1. The only way lim_{n→∞} n^{2α2}/(n^2 + 2n + 1) will be a nonzero constant is if the degrees of the numerator and denominator are equal. That means α2 must be 1 as well. So ⟨2^{−n}⟩ and ⟨n^{−2}⟩ both converge to zero with linear order. They are equally extremely slow to converge
by this measure! Still, something should not feel quite right about claiming that ⟨2^{−n}⟩ and ⟨n^{−2}⟩ converge at the same speed. Indeed, recall from calculus that

    lim_{n→∞} 2^{−n} / n^{−2} = 0,

indicating that ⟨2^{−n}⟩ approaches 0 much faster than does ⟨n^{−2}⟩. You may also recall comparisons between power functions:

    lim_{n→∞} n^{−p} / n^{−q} = 0

whenever p > q > 0; and between exponential functions:

    lim_{n→∞} a^{−n} / b^{−n} = 0

whenever a > b > 0.
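A quick numerical check (a throwaway illustration, not part of the development) makes the same point:

for n = [5 10 20 40]
printf('n = %2d:  2^-n = %9.2e   n^-2 = %9.2e\n', n, 2^-n, n^-2);
end%for

By n = 40, ⟨2^{−n}⟩ is already below 10^{−12} while ⟨n^{−2}⟩ has only reached 6.25(10)^{−4}.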
Crumpet 8: Approximating π

The sequence

    1103·2^{3/2}/9801,   1130173253125/(313826716467·2^{7/2}),   1029347477390786609545/(1116521080257783321·2^{23/2}),   . . .

converges to 1/π. Its terms are given by the formula

    ⟨ (√8/9801) Σ_{j=0}^{n} (4j)!(1103 + 26390j) / ((j!)^4 · 396^{4j}) ⟩_{n=0,1,2,3,...}

of Srinivasa Ramanujan. For all practical purposes, it converges very quickly. The first term already has about 8 significant digits of accuracy:

    1103·2^{3/2}/9801 ≈ 0.31830987844047012321768445317
    1/π ≈ 0.31830988618379067153776752674,

and the second has about 16:

    |1130173253125/(313826716467·2^{7/2}) − 1/π| ≈ 6.48(10)^{−17},

double the accuracy of the first term. The third term is already more than double-precision accurate.
It's tempting to believe, or hope, the sequence is quadratically convergent, but it is not. The third term has
an accuracy of about 24 significant digits. Each term in the sequence is approximately 8 significant digits more
accurate than the previous—the hallmark of a linearly convergent sequence.
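A minimal double-precision check of the first few partial sums (round-off limits what is visible to about 16 significant digits, so only the first two terms can really be verified this way) might look like:

s = 0;
for j = 0:2
s = s + factorial(4*j)*(1103 + 26390*j)/(factorial(j)^4*396^(4*j));
printf('%.16f\n', sqrt(8)/9801*s);   % partial sum after term j
end%for
printf('%.16f\n', 1/pi);             % target value for comparison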
Key Concepts
Order of convergence: The sequence ⟨pn⟩ converges to p with order of convergence α ≥ 1 if

    lim_{n→∞} |p_{n+1} − p| / |p_n − p|^α = λ

for some positive constant λ.

Absolute error: For a sequence ⟨pn⟩ that converges to p with order α, the absolute errors of consecutive terms are related by the approximation

    |p_{n+1} − p| ≈ λ|p_n − p|^α

for large enough n.

Significant digits of accuracy: For a sequence ⟨pn⟩ that converges to p with order α, the numbers of significant digits of accuracy of consecutive terms are related by the approximation

    d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}) = α(d(p_n) − C) + C,   where C = log(λ|p|^{α−1}) / (α − 1).

Rate of convergence: The sequence ⟨pn⟩ converges to p with rate of convergence O(bn) if ⟨bn⟩ converges to 0 and

    |p_n − p| ≤ λ|bn|

for some λ > 0 and all sufficiently large n.
Octave
An invaluable tool in any kind of programming is looping. When you need to perform some procedure multiple
times for varying input, a loop is probably the right solution. While there are several types of loops available in
Octave, we will discuss only for loops right now. The idea is to have a variable, sometimes called a counter, that
counts how many times the procedure has been performed. When the procedure has been performed the desired
number of times, the looping ends, and the program continues from there. You almost certainly encountered this
idea before you ever wrote a computer program. If you ever went to the fair and paid a dollar to toss a dozen rings
in hopes of landing one on the neck of a soda bottle, you have experienced looping. You may have even counted
the rings as you tossed them. You were the counter! You had to perform the procedure of throwing a ring into
the field of bottles 12 times. So, perhaps you threw one and counted to yourself “1”. Then you threw another and
counted “2”. And another and counted “3”. And so on through “12”. When the last ring was tossed, you continued
about your day at the fair.
The for loop is an abstract analogy of this situation. Suppose you want to calculate 1!, 2!, 3!, and so on through
12!. In Octave, you could create the following .m file and run it.
factorial(1)
factorial(2)
factorial(3)
factorial(4)
factorial(5)
factorial(6)
factorial(7)
factorial(8)
factorial(9)
factorial(10)
factorial(11)
factorial(12)
But this can be tedious and not particularly reader-friendly, especially if we are interested in doing some computation
many more than 12 times. The purpose of the loop is to reduce the repetitiveness of this approach. We want to
perform the procedure of calculating the factorial of 12 different integers, so a loop is appropriate. The syntax for
the loop is to set up the counter, write the code to perform the procedure, and mark the end of the loop. It looks
something like this.
for j=first:last
do something.
end%for
This will cause Octave to perform the procedure once for each integer from first to last, including both first
and last. The value of the counter, j in this case, may be used in the procedure. So to calculate 1! through 12!,
we might write
for j=1:12
factorial(j)
end%for
This will produce exactly the same output as the program with one line for each factorial. And if later you want
to calculate 1! through 20! instead, all you have to do is change the 12 to a 20. The for loop is your friend!
Now suppose we want to calculate α for each set of three consecutive values of |s_n − e| from Table 1.5. Since there are 9 such sets, we need to create a loop that will run through 9 times. And inside the loop, we will need to perform the calculation

    α = ln( (s_{n+2} − e)/(s_{n+1} − e) ) / ln( (s_{n+1} − e)/(s_n − e) ).

But before we can start, we need to tell Octave about the 11 values from the table. The most convenient way to do so is in an array. An array is like a vector. It has components. In this case, each component will hold one value from the table. And the syntax for creating the array is a lot like vector notation. We will use square brackets to delimit the components of the array, and we will separate the components by commas. So, the first line of our Octave code will look like this.
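errs = [2.817*10^(-1), 1.03*10^(-1), 2.022*10^(-2), 1.451*10^(-3), ...
2.046*10^(-5), 2.07*10^(-8), 2.953*10^(-13), 4.263*10^(-21), ...
8.777*10^(-34), 2.608*10^(-54), 1.595*10^(-87)];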
The ellipses (three consecutive dots) at the ends of the first two lines are needed to tell Octave that the command
continues onto the next line. Without them, separating a single command over multiple lines will cause a syntax
error. Starting a new line in Octave is the signal to start a new command as well.
Now Octave knows the values of |sn − e|. Using this vector is a lot like using subscripts. The first value,
2.817(10)−1 , is called errs(1). The second is called errs(2). The third is called errs(3), and so on. The length
of the array errs can be retrieved using the length() function of Octave. The command length(errs)-2 will be
used instead of hard-coding the 9. So we can finish the Octave code like so.
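One way to do it is with a loop like this (note the missing semicolon after the assignment, so that each value of alpha is printed):

for j=1:length(errs)-2
alpha = log(errs(j+2)/errs(j+1))/log(errs(j+1)/errs(j))
end%for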
alpha = 1.6182
alpha = 1.6181
alpha = 1.6176
alpha = 1.6182
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180
Not bad, but we can do better. Let’s calculate α, λ, and d(sn ) by two different methods—directly and using the
formula d(pn+1 ) ≈ αd(pn ) − log λ|p|α−1 . Then let’s display the results in a nicely formatted table.
We will need the disp() command and a two-index array. The disp() command is used to display some text
or some quantity. When used for text, the text needs to be delimited by single quotation marks. When used for
quantities, not. So, we might have an Octave program output the word “hello” with the command disp(’hello’)
or have it output the value of ln(2) with the command disp(log(2)). The disp() command can also handle
variables, so if p1 and p2 have been assigned values, then we can display their difference using disp(p2-p1). A
two-index array can be thought of as a table, or a matrix. It holds values in what can be imagined as rows and
columns. So, instead of having errs(j) as we did before, we may have errs(j,k) where j indicates the row and
k indicates the column. The program
A(2,4) = 7;
disp(A);
produces
0 0 0 0
0 0 0 7
OK, back to the task at hand. We will combine everything we have learned about Octave into one program.
errs = [2.817*10^(-1), 1.03*10^(-1), 2.022*10^(-2), 1.451*10^(-3), ...
2.046*10^(-5), 2.07*10^(-8), 2.953*10^(-13), 4.263*10^(-21), ...
8.777*10^(-34), 2.608*10^(-54), 1.595*10^(-87)];
d = inline(’-log(x/exp(1))/log(10)’);
for j=1:9
% alpha:
T(j,1) = log(errs(j+2)/errs(j+1))/log(errs(j+1)/errs(j));
% lambda:
T(j,2) = errs(j+2)/errs(j+1)^T(j,1);
% d (explicit):
T(j,3) = d(errs(j+2));
end
alpha = 1.61804;
lambda = 0.8;
constant = log(lambda*exp(alpha-1))/log(10);
T(1,4) = T(1,3);
for j=2:9
% d (recursive)
T(j,4) = alpha * T(j-1,4) - constant;
end%for
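Finally, the table can be shown. A bare-bones way to do it (headers and column labels can be added with further disp() or printf() calls) is:

disp(T);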
It is worth taking some time to make sure you understand all the lines of this program. It uses assignment, built-in
functions, inline functions, simple output, arrays, and for loops. The % on a line tells Octave to ignore it and
everything on the line that follows. These tidbits are called comments. They are strictly for the human user to
document what the program does. Lengthy programs should always be documented so any user of the program will
be better able to understand what it does. Here the comments are simple, but they may be much more elaborate.
Exercises

1. Some convergent sequences and their limits are given. Find the order of convergence for each.
   (a) ⟨n!/n^n⟩ → 0
   (b) ⟨1/(3e^n)⟩ → 0 [S]
   (c) ⟨(2^(2^n) − 2)/2^(2^n)⟩ → 1 [S]
   (d) ⟨(1 − n^2)/(1 + n^2)⟩ → −1 [A]
   (e) ⟨e^n/e^(e^n)⟩ → 0

2. Show that the sequence ⟨(n + 1)/(n − 1)⟩ converges to 1 linearly.

3. Show that the sequence p_n = 2^(1−2^n) is quadratically convergent.

4. Give an example of a sequence which converges to 0 with order α = 10.

5. Approximate the order of convergence of the sequence p_n and explain your answer.

   n    |p_{n+1} − p|/|p_n − p|^1.2    |p_{n+1} − p|/|p_n − p|^1.3    |p_{n+1} − p|/|p_n − p|^1.4
   25   9.07(10)^−6                    .0110                          13.39
   26   1.88(10)^−7                    .00303                         48.65
   27   1.01(10)^−9                    .000530                        277.8
   28

6. Some linearly convergent sequences and their limits are given. Find the (fastest) rate of convergence of the form O(1/n^p) or O(1/a^n) for each. If this is not possible, suggest a reasonable rate of convergence.
   (a) 6, 6/7, 6/49, 6/343, 6/2401, . . . → 0
   (b) ⟨(11n − 2)/(n + 3)⟩ → 11
   (c) ⟨sin n/√n⟩ → 0 [S]
   (d) ⟨4/(10^n + 35n + 9)⟩ → 0 [S]
   (e) ⟨4/(10^n − 35n − 9)⟩ → 0 [S]
   (f) ⟨2n/√(n^2 + 3n)⟩ → 2 [A]
   (g) ⟨(5^n − 2)/(5^n + 3)⟩ → 1
   (h) ⟨√(n + 47) − √n⟩ → 0 [A]
   (i) ⟨n^2/(3n^2 + 1)⟩ → 1/3
   (j) ⟨π/(e^n − π^n)⟩ → 0
   (k) ⟨n^2/2^n⟩ → 0 [S]
   (l) ⟨(7 + cos(5n))/(n^3 + 1)⟩ → 0
   (m) ⟨8n^2/(3n^2 + 12) + n/(3n + 10)⟩ → 3
   (n) ⟨(2n^2 + 3n)/(1 − n^2)⟩ → −2 [A]
   (o) ⟨(3n^5 − 5n)/(1 − n^5)⟩ → −3

7. Find the rates of convergence of the following sequences as n → ∞.
   (a) lim_{n→∞} sin(1/n) = 0
   (b) lim_{n→∞} sin(1/n^2) = 0
   (c) lim_{n→∞} (sin(1/n))^2 = 0
   (d) lim_{n→∞} [ln(n + 1) − ln(n)] = 0

For questions on this page and the next page, use the following definition of rate of convergence for a function. For a function f(h), we say lim_{h→a} f(h) = L with rate of convergence g(h) if |f(h) − L| ≤ λ|g(h)| for some λ > 0 and all sufficiently small |h − a|.

8. Use a Taylor polynomial to find the rate of convergence of lim_{h→0} (2 − e^h) = 1.

9. Use a Taylor polynomial to find the rate of convergence of lim_{h→0} (sin(h) − e^h + 1)/h = 0.

10. Find rates of convergence for the following functions as h → 0.
    (a) lim_{h→0} sin(h)/h = 1
    (b) lim_{h→0} (1 − cos h)/h = 0
    (c) lim_{h→0} (sin h − h cos h)/h = 0
    (d) lim_{h→0} (1 − e^h)/h = −1

11. Find the rate of convergence of lim_{h→0} (h^2 + cos h − e^h)/h = −1.
Audience: Two!
Mathemagician: Very good. Allow me to fold it in half again. Now how many layers thick has it become?
Audience: Four!
Mathemagician: Excellent. Watch very closely as I fold it for a third time. Think hard and tell me how many
layers thick is the folded sheet now.
Mathemagician: That’s right. Eight. So much for the warm up. I shall now have my lovely assistant bring
out another perfectly ordinary bed sheet. This time already folded. Crystal! The bed sheet please ...
(Crystal brings out the bed sheet, already folded). Again, an ordinary bed sheet. This time folded.
I shall now fold it in half as I have done before and ask again, how many layers thick has the sheet
become?
Audience: (Laughing)
Mathemagician: ...but I can tell you it is twice as many layers thick as it was before!
Mathemagician: I know. I know. A cheap parlor trick. But wait! Watch as I slowly unfold the sheet, one fold at
a time. One! ... Two! (he peers toward the sky as if in thought) ... Three! ... (again seemingly deep in
thought) ... Four! ... Four times folded in half and now, as you can plainly see, the sheet is three layers
thick. The first fold was in thirds. (he peers off into space, waves his wand, stares deep into the eyes of
the audience) Forty-eight!!!
Mathemagician: The sheet started 3 layers thick, and was doubled in thickness four times ... 3 ... 6 ... 12 ... 24
... 48.
Though it was meant to seem like a wise crack, the observation that folding a sheet in half doubles the number of
layers was the key to counting the layers in the folded sheet. Recursive procedures are magical in the same way.
They seem to hold nothing of value when, in fact, they hold the key. They are based on the principle that no matter
what the current state of affairs (no matter how many layers thick the sheet is), following the procedure (folding it
in half) will produce a predictable result (double the thickness).
Perhaps the simplest numerical example of this idea comes from thinking of a bag of marbles—an opaque bag
with an unknown number of marbles inside. One marble is added, and you are asked how many are inside. Of
course the best you can say is something like “one more than there were before.” Even though you do not know
how many marbles are in the bag to begin with, when one is added to the bag, you know the new total is one more
than the previous total. This is recursive thinking.
Trominos
Connect three squares edge-to-edge in the shape of an L, and you have a tromino. Trominos aren’t used in games
like dominoes are, but are often used in interesting mathematical questions involving tiling. Tiling with trominos
means covering without overlapping trominos and without having any parts of trominos lying outside the shape
being tiled. For example, a 2 × 3 grid can be tiled with trominos as can a 6 × 9 grid. See Figure 1.4.1. If n is a
positive integer, then a 2^n × 2^n grid can almost be tiled with trominos. All but one square can be covered. Try it,
first with a 2 × 2 grid. That one’s not too hard. Then try it with a 4 × 4 grid or an 8 × 8 grid.
How about a 1024 × 1024 grid? I can’t recommend that you actually get yourself a 1024 × 1024 grid of squares
and start filling in with trominos. It would take 349, 525 trominos. You may not finish in your lifetime! Instead,
it is time to start thinking recursively. Use the previous result in your answer. The same way you can just say
the marble bag “has one more than before”, we can phrase the solution to tiling the 1024 × 1024 grid in terms of
the tiling of the 512 × 512 grid. Here’s how it goes. Take a 1024 × 1024 grid and section it off into four 512 × 512
subgrids by dividing it down the middle both horizontally and vertically. In the upper left 512 × 512 grid, tile all
but the bottom right corner. In the lower left 512 × 512 grid, tile all but the upper right corner. In the lower right
grid, tile all but the upper left corner. Finally, in the upper right 512 × 512 grid, tile all but the upper right corner
(Figure 1.4.2). This leaves room for a single L-shaped tromino in the middle, and one square left over. That’s it!
It should feel a little bit like cheating since we didn’t specify how to deal with the 512 × 512 grid, but the same
argument applies to the 512 × 512 grid. You can section it off into four subgrids, tile those and be done.
The same tiling argument can be made for any 2^n × 2^n grid based on the 2^(n−1) × 2^(n−1) tiling, except when n = 1.
You just have to tile the 2 × 2 grid yourself! But once that's done, you have a complete solution for any 2^n × 2^n
grid. A similar exception applies to every recursive procedure. The recursion is only good most of the time. At
some point, you have to get your hands dirty and supply a solution or answer. Such an answer is often called an
initial condition.
Proof by induction also uses a sort of recursive thinking. In the method, one must prove that a claim is true for
some value of the variable. This part is analogous to having an initial condition. Then one must prove that the
truth of the claim for the value n implies the truth of the claim for n + 1. This is analogous to the recursive
relationship between states. In fact, the construction of a tiling for the 2n × 2n grid based on the 2n−1 × 2n−1
grid plus the tiling of the 2 × 2 grid just presented essentially form a proof by induction that the 2n × 2n grid,
save one corner, can be tiled by trominos for any n ≥ 1. In this way, all proofs by induction boil down to the
ability to see the recursive relationship between states.
In 1954, Solomon Golomb published a proof by induction that the 2^n × 2^n grid minus any single square (not
necessarily a corner), called a deficient square, can be tiled by trominos. Can you construct a (recursive) tiling
of a 2^n × 2^n deficient square? You may use the tiling of a 2^k × 2^k grid minus one corner in your construction.
Reference [12]
Octave
Custom functions
As any modern useful programming language does, Octave allows custom functions beyond those that can be written
as a single inline formula. Let’s say you are interested in the maximum value a function takes over an evenly
spaced set of values. That function has a very special purpose and is not commonly used. Consequently, it is not
built into any programming language, so if you really want a function that does that, it is your job to write it.
Similarly, if you want a function that calculates the symmedian point of a triangle, you need to write it. In fact,
most anything computational beyond evaluating basic functions will not be built into Octave.
Custom functions are written around three basic pieces of information: a name for the function, a list of inputs,
and a description of the output. These three things should be well defined before the work of writing the function
begins. Actually writing the function involves simply telling Octave the desired name, inputs, and how to determine
the output. The basic format for a function is this:
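function ans = name(input1,input2,...)
computation resulting in a value for ans
end%function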
The first line holds the name of the function and a list of inputs. The rest of the function is dedicated to computing
the output, ans.
The function that determines the maximum value of a function over an evenly spaced set of values might be
written following these steps. First, we decide to name it “maxOverMesh”. Notice there are no spaces and no special
characters in the name. There’s a very limited supply of non-alphabetic characters that can go into the name of a
function. It’s usually safe to assume an underscore and numbers are acceptable, but you can’t count on anything
else! It’s best to keep it at that. Second, we need to think about what inputs are necessary for this function. Of
course, the function to maximize is required, and somehow the mesh of points where it should be checked needs to
be specified. There are multiple ways to do this, but perhaps the one that is easiest for the user is to require the
lower end point, upper end point, and number of intervals in the mesh. Finally, we need to write some code that
will take those inputs and determine the maximum value of the function over the mesh. One way to do it is this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% maxOverMesh() written by Leon Q. Brin 21 January 2013 %
% INPUT: Interval [a,b]; function f; and number of %
% subintervals n. %
% OUTPUT: maximum value of the function over the end %
% points of the subintervals. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = maxOverMesh(f,a,b,n)
ans = f(a);
for i=1:n
x = (i*b + (n-i)*a)/n;
F = f(x);
if (F>ans) ans = F; end%if
end%for
end%function
It is good practice to preface each function you write with a comment containing a three-point description of the
function—the name, inputs, and output. If you or anyone else looks at it later, you will have a quick summary of
how to use the function and for what.
Whatever the last value assigned to ans when the function is complete will be the output of the function. The
function starts by assigning the value of the function at the left end point to ans. Then it loops through the rest of
the subinterval end points, calculating the value of the function at each one. Each time it finds a value higher than
ans, it (re-)assigns ans to that value. At the end of the loop, the greatest value of the function has been assigned
to ans.
To use a custom function, save it in a .m file with the same name as that of the function. For example, the
maxOverMesh() function would be saved in a file named maxOverMesh.m. Then your custom function can be called
just as any built-in Octave function as long as the .m file is saved in the same directory in which the program using
it is saved. Or, if using it from the command line, the working directory of Octave (the one from which Octave was
started, unless explicitly changed during your session) must be the directory in which the .m file is saved:
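For example, with a small made-up inline function to maximize (any function and mesh will do), a session might look something like this:

octave:1> f = inline('x*exp(-x)');
octave:2> maxOverMesh(f,0,4,40)
ans = 0.3679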
Recursive functions
Thinking recursively, what would you say if I asked you what 10! was? Think about it for a moment before reading
on. That’s right! 10 factorial is just 10 times 9!:
10! = 10 · 9 · 8 · 7 · 6 · 5 · 4 · 3 · 2 · 1
= 10 · (9 · 8 · 7 · 6 · 5 · 4 · 3 · 2 · 1)
= 10 · (9!).
No need to come up with a number. Just a recursive idea, because of course the idea works just as well for 9!, and
so on . . . up to (or should I say down to?) a point. At what point is it no longer true that n! = n · (n − 1)!? When
n = 0. We need to specify that 0! = 1 and not rely on recursive thinking in this case. But only this case!
Let’s see how this recursive calculation works for 5!. According to the recursion, 5! = 5 · 4!. But 4! = 4 · 3!
so we have 5! = 5 · (4 · 3!). But 3! = 3 · 2! so we now have 5! = 5(4(3 · 2!)). Continuing, 2! = 2 · 1! = 2 · 1 · 0!
so we now have 5! = 5(4(3(2(1 · 0!)))). And now the recursion stops and we simply plug in 1 for 0! to find out
that 5! = 5(4(3(2(1(1))))). Maybe you were expecting 5 · 4 · 3 · 2 · 1 for a final result instead. Of course you get
120 either way, so from the standpoint of getting things right, either way is fine. Pragmatically, the point is moot.
Computing factorials recursively is dreadfully inefficient and impossible beyond the maximum depth of recursion
for the programming language in use, so should never be used in practice anyway. Its only value is as an exercise
in recursive thinking and programming.
Generally, a recursive function will look like this:
function ans = recFunction(input1, input2, ... )
if (recursion does not apply)
return appropriate ans
else
return recFunction(i1, i2, ... )
end%if
end%function
Determining whether the recursion applies is the first item of business. If not, an appropriate output must be
supplied. Otherwise, the recursive function simply calls itself with modified inputs. Since the recursive (wise-guy)
definition of n! is n · (n − 1)! and applies whenever n > 0, and 0! = 1, the recursive factorial function might look
like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% recFactorial() written by Leon Q. Brin 21 January 2013 %
% is a recursively defined factorial function. %
% INPUT: nonnegative integer n. %
% OUTPUT: n! %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = recFactorial(n)
if (n==0)
ans = 1;
else
ans = n*recFactorial(n-1);
end%if
end%function
Note the == when checking if n equals 0. This is not a typographical error. This is very important. All programming
languages must distinguish between assignments and conditions. On paper, it may seem natural to write x = 3
when you want to set x equal to 3. It may also seem natural to write “if x = 3, everything is good.” We use
the “equation” x = 3 exactly the same way on paper to mean two very different things. When we set x = 3 we
are making a statement, or assignment of the value 3 to the variable x. But when we write “if x = 3 . . .” we are
making a hypothetical statement, or a conditional statement. The value of x is unknown. In Octave the distinction
is made by using a single equals sign, =, to mean assignment and two equals signs, ==, to mean conditional equals.
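For example:

x = 3;        % assignment: the variable x now holds the value 3
if (x == 3)   % condition: true exactly when x holds the value 3
disp('x is 3')
end%if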
recFactorial.m may be downloaded at the companion website.
Exercises

1. Write a .m file with a function that takes one input, squares it, and returns the result. Your file should
   (a) contain a comment block at the beginning containing your name, the date, and an explanation of what the program does and how to use it.
   (b) have a function of the form foo(x) in it that returns the square of its input (argument) x.
   Make sure to test your function from the Octave command prompt.

2. The Octave function foo(x) is shown below.

   function res = foo(x)
   if (x<1)
   res = 0;
   else
   half = x/2;
   floorhalf = floor(half);
   if (half == floorhalf)
   res = 0 + foo(floorhalf);
   else
   res = 1 + foo(floorhalf);
   endif
   endif
   endfunction

   (a) Find foo(2).
   (b) Find foo(23).

3. Write a recursive Octave function that will calculate Σ_{i=1}^{n} 1/i.

4. Write a recursive Octave function that calculates a_n for any n ≥ 0 given
   a_0 = 100,000
   a_n = 1.05a_{n−1} − 1200, n > 0.

5. The Fibonacci sequence, ⟨Fn⟩, is recursively defined by
   F_{n+1} = F_n + F_{n−1}, n ≥ 1
   F_0 = 1
   F_1 = 1
   so the first few terms are 1, 1, 2, 3, 5, 8.
   (a) Write a recursive function that calculates the nth Fibonacci number. Your function should have one argument, n.
   (b) Write a function that uses a for loop to calculate the nth Fibonacci number. Your function should have one argument, n.
   (c) Write a program that calls the function from 5a to calculate F30.
   (d) Write a program that calls the function from 5b to calculate F30.
   (f) Which code is faster?
   (g) Which code is more accurate?
   NOTE: F30 = 1346269.

6. Let the sequence ⟨an⟩ be defined by
   a_{n+1} = (1/4)(5a_n^2 − 30a_n + 25), n ≥ 0
   a_0 = (17 + 2√41)/5.
   (a) Calculate a_1, a_2 and a_3 exactly.
   (b) Find a_20 and a_51 exactly.
   (c) Write a recursive function that calculates the nth term of the sequence. Your function should have one argument, n. Write a program that calls this function to calculate a_1, a_2, a_3, a_20, and a_51.
   (d) Write a function that uses a for loop to calculate the nth term of the sequence. Your function should have one argument, n. Write a program that calls this function to calculate a_1, a_2, a_3, a_20, and a_51.
   (e) Which code is simpler (recursive or nonrecursive)?
   (f) Which function is faster?
   (g) Which code is more accurate, and why?
   (h) Which function is better, and why?
   (i) Do you trust either function to calculate a_600 accurately? If not, why not?

7. Trominos, part 1. [A]
   (a) Recursively speaking, how many trominos are needed to tile a 2^n × 2^n grid, save one corner?
   (b) What is the greatest (integer) value of n for which the recursive definition does not apply?
   (c) For the value of n of part 7b, how many trominos are needed?

8. Trominos, part 2. [S]
   (a) Write a recursive Octave function for calculating the number of trominos needed to tile a 2^n × 2^n grid, save one corner.
   (b) Use your function to verify that 349,525 trominos are needed to tile a 1024 × 1024 grid, save one corner.

9. The Tower of Hanoi, part 1. The Tower of Hanoi is a game played with a number of different sized disks stacked on a pole in decreasing size, the largest on the bottom and the smallest on top. There are two other poles, initially with no disks on them. The goal is to move the entire stack of disks to one of the initially empty poles following two rules. You are allowed to move only one disk at a time from one pole to another. You may never place a disk upon a smaller one. [S]
   (a) Starting with a stack of three disks, what is the minimum number of moves it takes to complete the game? Answer this question with a number.
   (b) Starting with a stack of four disks, what is the minimum number of moves it takes to complete the game?
       i. Answer this question recursively.
       ii. Answer this question with a number based on your recursive answer.

10. The Tower of Hanoi, part 2. [S]
    (a) Starting with a "stack" of one disk, what is the minimum number of moves it takes to complete the game?
    (b) Use your answer to (a) plus a generalization of your answer to question 9(b)i to write a recursive Octave function for calculating the minimum number of moves it takes to complete the game with a stack of n disks.
    (c) Use your Octave function to verify that it takes a minimum of 1023 moves to complete the game with a stack of 10 disks.

11. The Tower of Hanoi, part 3. The Tower of Hanoi with adjacency requirement. Suppose the rules of The Tower of Hanoi are modified so that each disk may only be moved to an adjacent pole, and the goal is to move the entire stack from the left-most pole to the right-most pole.
    (a) What is the minimum number of moves it takes to complete the game with a "stack" of one disk?
    (b) Find a recursive formula for the minimum number of moves it takes to complete the game with a stack of n disks, n > 1.
    (c) Write a recursive Octave function for the minimum number of moves to complete the game with a stack of n disks.
    (d) Use your Octave function to compute the minimum number of moves it takes to complete the game with a stack of 5 disks. 10 disks.

12. Stirling numbers of the second kind, part 1. Let S(n, k) be the number of ways to partition a set of n elements into k nonempty subsets. A partition of a set A is a collection of subsets of A such that each element of the set A must be an element of exactly one of the subsets. The order of the subsets is irrelevant as the partition is a collection (a set of sets). For example, the partition {{1}, {2, 3}, {4}} is a partition of {1, 2, 3, 4}. {{4}, {1}, {2, 3}} is the same partition of {1, 2, 3, 4}.
    (a) Find S(10, 1). [S]
    (b) Find S(3, 2).
    (c) Find S(4, 3).
    (d) Find S(4, 2). [S]
    (e) Find S(8, 8).

13. Stirling numbers of the second kind, part 2. [S]
    (a) Find S(n, 1).
    (b) Find S(n, n).

14. Stirling numbers of the second kind, part 3. Let A = {1, 2, 3, . . . , n}. [A]
    (a) How many partitions of A into k nonempty subsets include the subset {n}? Give an answer in terms of Stirling numbers of the second kind.
    (b) How many partitions of A into k nonempty subsets do not include the subset {n}? Give an answer in terms of Stirling numbers of the second kind. Hint, consider partitions of B = {1, 2, 3, . . . , n − 1} into k nonempty subsets.

15. Stirling numbers of the second kind, part 4.
    (a) Use your answers to questions 13 and 14 to derive a recursive formula with initial conditions for the number of ways a set of n elements can be partitioned into k subsets.
    (b) Write a recursive Octave function that calculates Stirling numbers of the second kind.
    (c) Use your Octave function to verify that S(10, 4) = 34105.

16. A set of blocks contains some that are 1 inch high and some that are 2 inches high. How many ways are there to make a stack of blocks 15 inches high? [S]

17. A male bee (drone) has only one parent since drones are the unfertilized offspring of a queen bee. A female bee (queen) has two parents. Therefore, 0 generations back, a male bee has one ancestor (the bee himself). 1 generation back, the bee also has 1 ancestor (the bee's mother). 2 generations back, the bee has 2 ancestors (the mother's two parents). How many direct ancestors does a male bee have n generations back?

18. Argue that any polygon can be triangulated (covered with non-overlapping triangles). An example of a triangulation of a dodecagon follows.

19. In questions 5 and 6, you should have noticed that the recursive functions were slower than their for loop counterparts. How many times slower? Why is the Fibonacci recursion so many more times slower than its for loop counterpart?

20. Let the sequences ⟨bn⟩ and ⟨cn⟩ be defined as follows.
    b_0 = 1/3;  b_{n+1} = 4b_n − 1, n ≥ 0
    c_0 = 1/10;  c_{n+1} = 4c_n(1 − c_n), n ≥ 0
    (a) Write a function that uses a for loop to calculate the nth term of ⟨bn⟩. Your function should have one argument, n.
    (b) Write a function that uses a for loop to calculate the nth term of ⟨cn⟩. Your function should have one argument, n.
    (c) Write a program that calls these functions to calculate b_30 and c_30. How accurate are these calculations? HINT b_30 = 1/3 and c_30 = .32034 accurate to 5 decimal places.
    (d) Can you think of a way to make these calculations more dependable (more accurate)?
Chapter 2
Root Finding
2.1 Bisection
In Section 1.2 (page 12), we claimed that “T2 (x) actually approximates ln(x) to within 0.1 over the interval
[3.296, 13.13]”, with a promise that we would discuss the calculation later. It is now later. First, we rephrase
the claim as “the distance between T2 (x) and ln(x) is less than or equal to 0.1 for all x ∈ [3.296, 13.13].” In other
words,
    |T2(x) − ln(x)| ≤ 1/10 for all x ∈ [3.296, 13.13].

One way to begin solving this inequality is to consider the pair of equations T2(x) − ln(x) = ±1/10. With a focus on solving

    T2(x) − ln(x) = 1/10,    (2.1.1)

recall that T2(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4). We are thus looking to solve the equation

    2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) = 1/10.
Finally, having written the equation in full detail, it should come as no surprise that we will not be solving for
x exactly. There is no analytic method for solving such an equation. Generally, equations with both polynomial
terms and transcendental terms will not be solvable. However, from the graph in Figure 1.2.2, we can get a first
approximation of the solution. We are looking for the place where T2(x) exceeds ln(x) by 0.1. Since the two
graphs essentially overlap at x = 6, we might aver that T2(6) exceeds ln(6) by less than 0.1 there. Since there is a
reasonably large gap between the graphs at x = 2, we might also aver that T2(2) exceeds ln(2) by more than 0.1
there. In other words, T2(2) − ln(2) > 1/10 while T2(6) − ln(6) < 1/10. Since T2(x) − ln(x) is continuous on the interval
[2, 6], the Intermediate Value theorem guarantees there is a value c ∈ (2, 6) such that T2(c) − ln(c) = 1/10. It is this
value of c we are after. And we know it is between 2 and 6. It's a start, but we can do better!
What about 4? Well, T2(4) − ln(4) ≈ .04986 < 0.1, so now we know T2(4) exceeds ln(4) by less than 0.1. Now
the Intermediate Value theorem tells us that c is between 2 and 4 (T2(2) exceeds ln(2) by more than 0.1). Shall we
check on x = 3? Yes. T2(3) − ln(3) ≈ .131 > 0.1, so now we know T2(3) exceeds ln(3) by more than 0.1. Recapping,
T2(4) − ln(4) < 0.1 while T2(3) − ln(3) > 0.1. By the Intermediate Value theorem again, we know c is between 3 and
4. And we may continue the process, limited only by our patience. This is the process we call the bisection method:
1. Identify an interval [a, b] such that either a or b overshoots the mark while the other undershoots it.
2. Calculate the midpoint, m = (a + b)/2, and determine whether it overshoots or undershoots the mark.
3. If a and m both overshoot or both undershoot the mark, the desired value lies in [m, b].
4. If b and m both overshoot or both undershoot the mark, the desired value lies in [a, m].
Figure 2.1.1: + indicates T2(x) − ln(x) > 1/10 and − indicates T2(x) − ln(x) < 1/10. [Number line marking the points 2, 3, 3.25, 3.5, 4, and 6, with the midpoints labeled m1 = 4, m2 = 3, m3 = 3.5, m4 = 3.25.]
Using a + sign for values of x for which T2(x) − ln(x) overshoots the desired value 1/10 and a − sign for values of x
for which T2(x) − ln(x) undershoots the desired value 1/10, we may diagram this procedure, including the next two
iterations, as in Figure 2.1.1. We might also reproduce the calculations in a table, writing f(x) = T2(x) − ln(x) − 1/10 so that + and − indicate the sign of f:

    a      m       b      sign of f(a)   sign of f(m)   sign of f(b)
    2      4       6      +              −              −
    2      3       4      +              +              −
    3      3.5     4      +              −              −
    3      3.25    3.5    +              +              −
    ⋮      ⋮       ⋮      ⋮              ⋮              ⋮
Notice two things. The actual values of f (a), f (m), and f (b) are not needed. Only their sign is important because
all we need to do is maintain one endpoint where the function is greater than 0 (overshoots) and one where the
function is less than 0 (undershoots). Furthermore, the f (a) and f (b) columns are not strictly necessary either. If
the procedure is carried out faithfully, they will never change sign. In fact, that’s what it means to carry out the
procedure faithfully! In steps 3 and 4, you choose which subinterval to keep by maintaining opposite signs of the
function on opposite endpoints.
As the last line indicates, the desired value is approximately 3.296 as promised. The other value, 13.13, is
determined by finding a root of the function g(x) = T2(x) − ln(x) + 1/10. Give it a shot! Start with a = 10 and
b = 14, for example. Solution on page 45.
Though it works, the only real point of carrying out the procedure using a table is to make sure you understand
exactly how it works. If we were actually to use the method in practice, we would write a short computer program
instead. Computers are very good at repetitious calculations, something at which humans are not particularly
adept. In this procedure, we need to calculate a midpoint, decide whether this midpoint should then become the
left or right endpoint, make it so, and repeat.
That leaves only one question—how many repetitions, or iterations, should we compute? And that depends on
the user. Perhaps an answer to within 10−2 of the exact value will suffice, and maybe only 10−6 accuracy will do.
The program we write should be flexible enough to calculate the answer to whatever accuracy is desired, within
reason. With that in mind, here is some pseudo-code for the bisection method.
Assumptions: f is continuous on [a, b]. f (a) and f (b) have opposite signs.
Input: Interval [a, b]; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: Set err = |b − a|; L = f (a);
Step 2: For j = 1 . . . N do Steps 3-5:
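Step 3: Set m = (a + b)/2; M = f(m); err = err/2;
Step 4: If M = 0 or err ≤ tol, OUTPUT m and STOP;
Step 5: If L · M < 0, set b = m; otherwise set a = m and L = M;
Step 6: OUTPUT "Method failed. Maximum iterations exceeded."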
As noted earlier, this method should calculate a midpoint (Step 3), decide whether this midpoint should then
become the left or right endpoint (Step 5), make it so (Step 5), and repeat some number of times (Steps 1, 2, and 4).
Much of the code is dedicated to determining when to stop. This is typical of numerical methods. The calculations
are half the battle. Controlling the calculations is the other half. If we didn’t have to worry about stopping, the
pseudo-code might look something like this:
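Step 1: Set L = f(a);
Step 2: Set m = (a + b)/2; M = f(m);
Step 3: If L · M < 0, set b = m; otherwise set a = m and L = M;
Step 4: Return to Step 2.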
There would be no need for j, err, tol, or N , making the algorithm quite a bit simpler. Of course, programmed
this way, the program would never stop, so j, err, tol, and N , are indeed necessary. Nonetheless, this pseudo-code
without the ability to stop is important. It can be thought of as the guts of the program. This is the code that
executes the method. Sometimes it is easiest to start with the guts and then add the controls afterward.
As for determining whether the midpoint should become the left or right endpoint, Step 5 (Step 3 of the
guts) uses a somewhat slick method. By slick, I mean short, efficient, and not immediately obvious. The sign of
LM = f (a) · f (m) is checked. If it is negative (LM < 0) then m should become the right endpoint (should replace
b) because this means f (a) and f (m) have opposite signs. That’s the only way LM can be negative. On the other
hand, if LM > 0 then we know f (a) and f (m) have the same sign, so m should become the left endpoint (should
replace a). In Step 3 the midpoint is calculated without any fanfare.
The rest of the code is there to make sure the program doesn’t do more than necessary and doesn’t end up
spinning its wheels indefinitely. It is important to be able to separate, at least in your mind, the guts of the program
from the stopping logic. As for the stopping logic, in Step 4, we stop if err ≤ tol as we should. But we also check
the unlikely event that M = 0 in which case we happened to hit the root exactly so should quit. Though it could
be argued overkill to set a maximum number of iterations, N , in this program, it’s a good habit to get into. Some
numerical methods provide no guarantee the required tolerance will ever be reached. For these methods, a fallback
exit criterion is needed. Also, if tol were accidentally set to a negative value, it would certainly never be reached.
The algorithm would have no way to stop without N .
Though the bisection method converges only linearly, among those methods with linear order, it should be considered fast. The error decays exponentially—faster than
any polynomial decay.
Key Concepts
The Intermediate Value Theorem: Suppose f is a continuous function on [a, b] and y is between f(a) and f(b).
Then there is a number c between a and b such that f(c) = y.¹
¹The word "between" in this theorem can be interpreted as inclusive or exclusive of the endpoint values as long as the same interpretation is used throughout.
Iteration: (1) Repeating a computation or other process, using the output of one computation as the input of the
next.
Iteration: (2) Any of the intermediate results of an iteration. Also called an iterate.
The bisection method: Produces a sequence of approximations ⟨mj⟩ that converges to some root in [a, b].

Error bound for the bisection method: The error of approximation mj is no more than (b − a)/2^j. That is, |mj − p| ≤ (b − a)/2^j for some root p of f(x).

Convergence for the bisection method: The bisection method converges with linear order and has rate of convergence O(1/2^n).
Octave
Roughly half the work in writing pseudo-code for the bisection method was dedicated to the logic of the method—
the determination of when to stop. In programming, this type of logic is handled by if then [else] statements,
and variations thereof. It is common practice in programming to use square brackets to denote something that is
optional. So the template if then [else] should be read to mean that logic is handled by if then statements or
if then else statements. The exact syntax looks like this:
if (condition)
execute code here
[else
execute code here]
end%if
if (n>10)
disp('n is big')
else
disp('n is small')
end%if
In Octave, if then [else] statements are written almost exactly as they are in pseudo-code. In fact, much of
the pseudo-code in this text will translate nearly verbatim into Octave. One notable exception is the symbol used
in the condition. Octave requires a boolean operator in the condition. That is, an operator that will evaluate to
either true or false. The = operator assigns a value to a variable. It is not a boolean operator so should not be
used in an if condition. Instead, == (two equals signs) should be used. This table summarizes the six most common
boolean operators in Octave.
Comparison Operator
greater than, less than >, <
greater than or equal, less than or equal >=, <=
equal ==
not equal !=
If you needed to check if x ≥ 0, you would use if (x>=0) in Octave. If you needed to check if t equaled 1, you
would use if (t==1) in Octave. And so on. Logical operators are often needed as well.
For example, if you need to check whether x is between a and b, as in a ≤ x < b, a logical operator is needed. In
this case, we need logical “and” since a ≤ x < b means a ≤ x and x < b. The Octave code would be if (a<=x &&
x<b) or something logically equivalent.
Technically, an if then statement is concluded with an end statement. However, to emphasize the type of
statement being ended, we will make a habit of ending an if then statement with end%if and ending a for loop
with end%for. The %if and %for are just comments since they start with %. Consequently, they are not strictly
necessary, but they may aid in the readability of your code, especially when you have nested constructs. When you
have an if statement inside a for loop or vice versa, using end to end both of them is not as informative as using
end%if and end%for.
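For instance, in a small made-up nested construct, the labeled ends make the structure obvious at a glance:

for j=1:5
if (mod(j,2)==0)
disp('even')
else
disp('odd')
end%if
end%for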
An Octave program to find a root of f(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) − 1/10 between 2 and 6 to within 10^−4
using the bisection method with a maximum of 100 iterations might look like this.
f = inline(’2+(x-exp(2))/exp(2)-(x-exp(2))^2/(2*exp(4))-log(x)-1/10’);
a=2;
b=6;
err=b-a;
L=f(a);
for i=1:100
m=(a+b)/2;
M=f(m);
err=err/2;
if (M==0 || err<=10^-4)
disp(m);
return;
end%if
if (L*M<0)
b=m;
else
a=m;
L=M;
end%if
end%for
disp(’Method failed. Maximum iterations exceeded.’);
This code would produce the correct result, 3.2952. Compare this code to the pseudo-code. You will see the main
difference is syntax. However, there is one major disadvantage to writing the code this way. In order to change the
function, the endpoints, the tolerance, or the maximum number of iterations, the code needs to be modified in just
the right place. That is no real disadvantage if you never need to run the bisection method again. But, generally,
we should imagine that we will be running the methods we write many times over with different inputs. Or that we
will be handing our code over to someone else to run many times over with different inputs. Imagine me handing
you this code and asking you to find a root of f (x) = cos x − x between 0 and 3 to within 10−6 . It is not good
practice to hard code the inputs to a method. Instead, they should be given as inputs to a programmed function. In
Octave, this is done in a .m file. That doesn’t mean that we will simply take the code as written and save it in a .m
file. The .m file will assume that the inputs—interval [a, b]; function f ; desired accuracy tol; maximum number of
iterations N —will be supplied from another source—the user. The code inside the .m file should execute properly
regardless of the (yet unknown) inputs. The syntax for an Octave function is:
function result=name(input1,input2,...)
execute these lines
end%function
function is a keyword that tells Octave a function is to be defined. result is the name of the variable that holds
the answer, or result, of the function. name is the name of the function. It must also be the name of the .m file. A
completed bisection.m file might look like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bisection method written by Leon Q. Brin 09 July 2012 %
% Purpose: Implementation of the bisection method %
% INPUT: Interval [a,b]; function f; tolerance tol; and %
% maximum number of iterations maxits. %
% OUTPUT: root res to within tol of exact or message of %
% failure. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res=bisection(a,b,f,tol,maxits)
  err=b-a;
  L=f(a);
  for i=1:maxits
    m=(a+b)/2;
    M=f(m);
    err=err/2;
    if (M==0 || err<=tol)
      res=m;
      return;
    end%if
    if (L*M<0)
      b=m;
    else
      a=m;
      L=M;
    end%if
  end%for
  res='Method failed. Maximum iterations exceeded.';
end%function
Writing this way has not only the advantage of being easily reusable. It is also simpler! No need to worry about what
function the root of which is desired; or over what interval; and so on. And it more closely resembles the pseudo-code.
Once written and properly functioning, it can be saved as a .m file and never be looked at again (except for study).
It just works! If you hand it off to someone to use, they should be able to use it without modification. bisection.m
may be downloaded at the companion website. Now finding a root of f(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) − 1/10 between 2 and 6 to within 10^-4 using the bisection method with a maximum of 100 iterations might look like this.
octave:9> f = inline('2+(x-exp(2))/exp(2)-(x-exp(2))^2/(2*exp(4))-log(x)-1/10');
octave:10> bisection(2,6,f,10^-4,100)
ans = 3.2952
After bisection.m is written, the bisection() function becomes part of the Octave language. It can be called just
like any built-in function. As a second example, we can find a root of f (x) = cos x − x between 0 and 3 like so:
octave:12> bisection(0,3,inline(’cos(x)-x’),10^-5,100)
ans = 0.73909
(e) f(t) = √(4 + 5 sin t) − 2.5; [−6, −5]
(f) g(t) = 3t^2 tan t/(1 − t^2); [21.5, 22.5] [S]
(g) h(t) = ln(3 sin t) − 3t/5; [1, 2]
(j) h(r) = 2^sin r − 3^cos r; [1, 3]

3. Create a table showing three iterations of the bisection method with the function and starting interval indicated in question 2. [S]

4. Use your bisection.m code to find a root of the function in the interval of question 2 to within 10^-8. [A]

5. Use the bisection method to find m3 for the given function on the given interval. Do this without a computer program. Just use a pencil, paper, and a calculator. You may check your answers with a computer program if you wish. [A]

(a) f(x) = √x − cos x on [0, 1]
(b) f(x) = 3(x + 1)(x − 1/2)(x − 1) on [−1.25, 2.5]

6. Use the Bisection Method to find m4 for g(x) = x sin x + 1 on [9, 10].

7. Use the bisection method to find m3 for the equation x cos x − ln x = 0 on the interval [7, 8].

8. Use the bisection method to find a root of g(x) = sin x − x^2 between 0 and 1 with absolute error no more than 1/4.

9. Approximate the root of g(x) = 2 + x − e^x between 1 and 2 to within 0.05 of the exact value using the bisection method.

10. There are 21 roots of the function f(x) = cos(x) on the interval [0, 65]. To which root will the bisection method converge, starting with a = 0 and b = 65? [A]

11. Find a bound on the number of iterations needed to achieve an approximation with accuracy 10^-3 to the solution of x^3 + x − 4 = 0 on the interval [1, 4] using the bisection method. Do not actually compute the approximation. Just find the bound. [S]

14. Suppose you are trying to find the root of f(x) = x − e^−x using the bisection method. Find an integer a such that the interval [a, a + 2] is an appropriate one in which to start the search.

15. Find a lower bound on the number of iterations it would take to guarantee accuracy of 10^-20 in question 6.

16. How many steps (iterations) of the bisection method are necessary to guarantee a solution with 10^-10 accuracy if a root is known to be within [4.5, 5.3]? [A]

17. Suppose you are using the bisection method on an interval of length 3. How many iterations are necessary to guarantee accuracy of the approximation to within 10^-6?

18. Suppose a function g satisfies the assumptions of the bisection method on the given interval. Starting with that interval, how many iterations are needed to approximate the root to within the given tolerance?

(a) [−7, 10]; 10^-6
(b) [5, 9]; 10^-3
(c) [9, 15]; 10^-10
(d) [−6, −1]; 10^-105 (assume the computer calculates with 300 significant digits so round-off error is not a problem)

19. 1 is a root of f(x) = ln(x^4 − x^3 − 7x^2 + 13x − 5) that can not be found by the bisection method.

(a) Use a graph of the function near 1 to explain why. You may use the Octave code below to produce an appropriate graph.
(b) Run the bisection method on f over the interval [0.8, 1.2] anyway. What happens instead of finding the root?

x=3.5:.05:4.5;
f=inline("abs(sin(pi*x))");
plot(x,f(x))

21. Let f(x) = sin(x^2). f is continuous on [4, 5], but f(4) < 0 and f(5) < 0, so the assumptions of the bisection method are not met. Nonetheless, using the bisection method as described in the pseudo-code on f over the interval [4, 5] does produce a root. Explain. [S]

22. The functions in questions 2e, 2f, and 2g all fail to meet the assumptions of the bisection method on the interval [−4, −0.5]. For each one, explain how so.

23. Write an Octave function called collatz that takes one integer input, n, and returns 3n + 1 if n is odd and n/2 if n is even. Save it as a collatz.m file. Use an if then else statement in your function. HINT: Use the Octave ceiling function. If ceil(n/2) equals n/2, then n must be even (no remainder when divided by 2). Use your collatz function to calculate [A]

(a) collatz(17)
(b) collatz(10)
(c) collatz(109)
(d) collatz(344)

24. Write your own absolute value function called absval (abs is already defined by Octave, so it is best to use a different name) that takes a real number input and returns the absolute value of the input. Use an if then else statement in your function. Save it as absval.m and test it on the following computations.

(a) |−3|
(b) |123.2|
(c) |π − 22/7|
(d) |10 − π^2|

25. f(x) = sin(x^2) has five roots on the interval [7, 8]. f(7) < 0, f(8) > 0, and f is continuous on [7, 8], so the assumptions of the bisection method are met. The method will converge to a root.

(a) Use your bisection.m file (Exercise 1) to determine which one. [A]
(b) Find 4 different intervals for which the bisection method will converge to the other four roots in [7, 8].

(a) [2, 3]
(b) [6, 8]
(c) [2, 6]
(d) [5, 9]
(e) [10, 12] Note: the assumptions of the bisection are not met on this interval. Nonetheless, the method as outlined in the pseudo-code will converge to a root!

27. Find an interval of length 1 over which the bisection method may be applied in order to find a root of f(x) = x^4 − 7.6746x^3 − 40.7477022x^2 + 200.9894434x + 319.0914281.

28. The following algorithm is one possible incarnation of the bisection method.

Assumptions: f is continuous on [a, b]. f(a) and f(b) have opposite signs.
Input: Interval [a, b]; function f
Step 1: For j = 1 . . . 15 do Steps 2 and 3:
Step 2: Set m = (a + b)/2;
Step 3: If f(a)f(m) < 0 then set b = m; else set a = m;
Step 4: Print m.
Output: Approximation m.

(a) Apply this algorithm to the function f(x) = (x)(x − 2)(x + 2) over the interval [−3, 3]. Which root will this algorithm approximate?
(b) How accurate is the approximation guaranteed to be according to the formula |pn − p| ≤ (b − a)/2^n?
(c) How accurate is the approximation in reality? Compare this to the bound in (b).
(d) Modify the algorithm so it will approximate a different root using the same starting interval.
(e) Modify the algorithm so it does not use multiplication.

29. Use the following pseudo-code to write a slightly different implementation of the bisection method. Refer to Table 1.1 if you are unsure how to program the quantity ⌈(ln(b − a) − ln(TOL))/ln 2⌉. The while loop is discussed on page 61.

Input function f, endpoints a and b; tolerance TOL.
Return approximate solution p and f(p) and the number of iterations done N0.
Step 1 Set i = 1; FA = f(a); N0 = ⌈(ln|b − a| − ln(TOL))/ln 2⌉;
Step 2 While i ≤ N0 do Steps 3-6.
Step 3 Set p = (a + b)/2; FP = f(p);
Step 4 If FP = 0 then Return(p, f(p), N0); STOP.
Step 5 Set i = i + 1;
Step 6 If FA · FP > 0 then Set a = p; FA = FP; else Set b = p;
Step 7 Return(p, f(p), N0); STOP.

(a) Discuss the advantages/disadvantages of this algorithm compared to the one on page 42.
(b) Where does the calculation N0 = ⌈(ln(b − a) − ln(TOL))/ln 2⌉ come from?

30. Use the code you wrote for question 29 to find solutions accurate to within 10^-5 for the following problems.

(a) x − 2^−x = 0 on [0, 1]
(b) e^x − x^2 + 3x − 2 = 0 on [0, 1]
(c) 2x cos(2x) − (x + 1)^2 = 0 on [−3, −2] and on [−1, 0]

31. Find an approximation of √3 correct to within 10^-4 using the bisection method. Write an essay on how you solved this problem. Include your bisection code, what function and what interval you used and why.

32. A trough of length L has a cross section in the shape of a semicircle with radius r. When filled with water to within a distance h of the top, the volume V of water is

V = L[0.5πr^2 − r^2 arcsin(h/r) − h√(r^2 − h^2)].

Suppose L = 10 ft, r = 1 ft, and V = 12.4 ft^3. Find the depth of the water in the trough to within 0.01 ft. Note: In Octave, use asin(x) for arcsin(x) and pi for π.
Answers
What is the next value?: T2(3.25) − ln(3.25) ≈ .10429, which overshoots the mark. So 3.25 becomes the new left endpoint, and the next value is (3.25 + 3.5)/2 = 3.375, the midpoint of 3.25 and 3.5.
The right endpoint is 13.13: Starting with a = 10 and b = 14, note that g(a) ≈ .088 > 0 and g(b) ≈ −.044 < 0,
so g of the left endpoint should always be positive and g of the right endpoint should always be negative:
a m b g(m)
10 12 14 .044 ⇒ m becomes left endpoint
12 13 14 .006 ⇒ m becomes left endpoint
13 13.5 14 −.017 ⇒ m becomes right endpoint
13 13.25 13.5 −.005 ⇒ m becomes right endpoint
13 13.125 13.25 .0004 ⇒ m becomes left endpoint
13.125 13.1875 13.25 −.002 ⇒ m becomes right endpoint
13.125 13.15625 13.1875 −.0009 ⇒ m becomes right endpoint
13.125 13.140625 13.15625 −.0002 ⇒ m becomes right endpoint
13.125 13.1328125 13.140625
[Figure 2.2.1: (a) the graphs of y = cos(x) and y = x over [0, 1]; (b) the corresponding web diagram.]
Figure 2.2.1(a) shows the graphs of y = cos(x) and y = x over the interval [0, 1]. We can see the intersection at
around (0.75, 0.75) so we should think that the fixed point is around 0.75 (which of course we know is true from our
calculator experiment). Figure 2.2.1(b) illustrates the exercise of computing cos(0), cos(1), cos(0.540302 . . .), . . ..
Following the vertical line segment from (0, 0) to (0, 1) represents calculating cos(0). Following the horizontal
continuation from (0, 1) to (1, 1) and subsequently the vertical line segment from (1, 1) to (1, 0.540302 . . .) rep-
resents calculating cos(1). Following the horizontal line from (1, 0.540302 . . .) to (0.540302 . . . , 0.540302 . . .) and
subsequently the vertical line from (0.540302 . . . , 0.540302 . . .) to (0.540302 . . . , 0.857553 . . .) represents calculating
cos(0.540302 . . .), and so on. With each pair of line segments, one going horizontally from the graph of y = cos(x)
to the graph of y = x followed by one going vertically from the line y = x to the graph of y = cos(x), another
iteration is shown. Figure 2.2.1(b) is sometimes called a web diagram [2], and is commonly used to illustrate the
concept of iteration. That the path of the web diagram tends toward (0.739085 . . . , 0.739085 . . .) is an unavoidable
consequence of the geometry of the graph of cos(x).
What if we start with a number other than 0? Using figure 2.2.1, you should be able to convince yourself that
convergence to the point (0.7390851332 . . . , 0.7390851332 . . .) is assured for any initial value between 0 and 1. Try
it. Start anywhere on the line y = x. Proceed vertically to the graph of y = cos(x). Then horizontally to the line
y = x. And repeat. You should find that the path of the web diagram always tends toward the intersection of the
graphs. Now consider starting with any real number, r. The cosine of any real number is a number in the interval
[−1, 1] so cos(r) ∈ [−1, 1]. And the cosine of any number in the interval [−1, 1] is a number in the interval [0, 1] so
cos(cos(r)) ∈ [0, 1]. That is, the second iteration is in the interval from 0 to 1. So after only two iterations, any
initial value will become a value between 0 and 1. And our web diagram implies that further iteration will lead to
the fixed point. So, regardless of the initial value, iteration leads to the fixed point. And the preceding argument
forms the seed for a proof of this fact.
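A few lines of Octave make this easy to check for yourself (a quick experiment, not code from the text). Starting from r = 100, or any other real number, repeated application of cosine settles on the same value:

x=100;              % any real initial value
for i=1:100
  x=cos(x);
end%for
disp(x)             % approximately 0.739085

After the first two iterations the value is already in [0, 1], and the remaining iterations home in on 0.7390851332 . . . .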
Not all functions are so well behaved, however. For example, 1^2 = 1. In other words, 1 is a fixed point of the function y = x^2. However, iteration starting with any number other than 1 or −1 does not lead to this fixed point. If we start with any number greater than 1 and square it, it becomes greater. And if we square the result, it becomes greater still. And squaring again only increases the value, without bound. Hence, iteration starting with any value greater than 1 (or less than −1) does not lead to convergence to the fixed point 1. Nor does iteration starting with any number of magnitude less than 1. Figure 2.2.2 illustrates iteration of y = x^2 with initial value 0.9. Follow the web diagram from the point (0.9, 0.9) vertically to the graph of y = x^2 and then horizontally back to the line y = x, and so on, to check for yourself. This is a nice illustration of the fact that the square of any number between 0 and 1, exclusive, is smaller than the number itself. With starting values strictly between −1 and 1, iteration gives a sequence converging to 0, not 1. To summarize, excepting −1 and 1, no initial value will produce a sequence converging to 1 under iteration of the function y = x^2.
There is a fundamental difference between the fixed point 0.7390851332 . . . of f (x) = cos(x) and the fixed point
1 of g(x) = x^2. Fixed point iteration converges to 0.7390851332 . . . under f(x) = cos(x) for any initial value. Fixed point iteration fails to converge to 1 under g(x) = x^2 for all initial values but ±1.² Examining the graphs of f(x)
and g(x) each superimposed against the line y = x in the neighborhood of their respective fixed points can give a
clue [Figure 2.2.3] as to the difference. True, f (x) has a negative slope at its fixed point while g(x) has a positive
slope at its fixed point. You can see this from the graphs or you can “do the calculus”. The important difference,
though, is more subtle. It’s not the sign of the slope at the fixed point that matters. It’s the magnitude of the
slope at the fixed point that matters. For smooth functions, neighborhoods of points with slopes of magnitude
2 For a third type of behavior, fixed point iteration converges to 0 under g(x) for initial values near 0, but not for others!
greater than 1 tend to be expansive. That is, points move away from one another under application of the function.
However, neighborhoods of points with slopes of magnitude less than 1 tend to be contractive. That is, points move
toward one another under application of the function.
Proposition 2. If h(x) is differentiable on (a, b) with |h′(x)| < 1 for all x ∈ (a, b), then whenever x1, x2 ∈ (a, b), |h(x2) − h(x1)| < |x2 − x1|.
Proof. Let x1 , x2 ∈ (a, b) and, without loss of generality, let x2 > x1 so that we may properly refer to the interval
from x1 to x2. Since h(x) is continuous on [x1, x2] and differentiable on (x1, x2), the mean value theorem gives us c ∈ (x1, x2) ⊆ (a, b) such that h′(c) = (h(x2) − h(x1))/(x2 − x1). But |h′(c)| < 1 by assumption, so |(h(x2) − h(x1))/(x2 − x1)| < 1, from which it follows that |h(x2) − h(x1)| < |x2 − x1|.
Moreover, a function whose derivative has magnitude less than 1 can only cross the line y = x one time. Once it
has crossed, it can never “catch up” because that would require a slope greater than 1, the slope of the line y = x.
Proposition 3. Suppose h(x) is continuous on [a, b], differentiable on (a, b) with |h′(x)| < 1 for all x ∈ (a, b), and h([a, b]) ⊆ [a, b]. Then h has a unique fixed point in [a, b].
Proof. If h(a) = a or h(b) = b, we have proved existence, so suppose h(a) ≠ a and h(b) ≠ b. Since h([a, b]) ⊆ [a, b] it
must be the case that h(a) > a and h(b) < b. It immediately follows that h(a) − a > 0 and h(b) − b < 0. Since the
auxiliary function f (x) = h(x) − x is continuous on [a, b], the Intermediate Value Theorem guarantees the existence
of c ∈ (a, b) such that f (c) = 0. By substitution, h(c) − c = 0, implying h(c) = c, so c is a fixed point of h. The
existence of a fixed point is established. Now suppose c1 ∈ [a, b] and c2 ∈ [a, b] are distinct fixed points of h. Then
(h(c1) − h(c2))/(c1 − c2) = (c1 − c2)/(c1 − c2) = 1.

By the mean value theorem, there exists c3 between c1 and c2 such that h′(c3) = 1, contradicting the fact that |h′(x)| < 1 for all x ∈ (a, b). Hence, it is impossible that c1 and c2 are distinct.
Hence, we can reasonably expect that when the derivative at a fixed point has magnitude less than 1, iteration is
a viable method for approximating (finding) the fixed point, but when the derivative at a fixed point has magnitude
greater than 1, iteration is not a viable method of approximating the fixed point. We must be careful, though,
not to take this rule of thumb as absolute. It only applies to so-called well-behaved functions. In this case, that
the function has a continuous first derivative in the neighborhood of the fixed point is well-behaved enough. The
following theorem establishes that fixed point iteration will converge in a neighborhood of a fixed point if the
magnitude of the function’s derivative is less than 1 there.
Theorem 4. (Fixed Point Convergence Theorem) Given a function f(x) with continuous first derivative and fixed point x̂, if |f′(x̂)| < 1 then there exists a neighborhood of x̂ in which fixed point iteration converges to the fixed point for any initial value in the neighborhood.
Proof. By continuity, there exists ε > 0 such that |f′(x)| < 1 for all x ∈ (x̂ − ε, x̂ + ε). Let 0 < δ < ε and set M = max_{x∈[x̂−δ, x̂+δ]} |f′(x)|. Now suppose x0 is a particular but arbitrary value in (x̂ − δ, x̂ + δ). As in proposition 2, the Mean Value Theorem is applied. This time, we are guaranteed c ∈ (x̂ − δ, x̂ + δ) such that f′(c) = (f(x̂) − f(x0))/(x̂ − x0). But |f′(c)| ≤ M so |f(x̂) − f(x0)| ≤ M|x̂ − x0|. Furthermore x̂ is a fixed point, so f(x̂) = x̂, from which it follows that |x̂ − f(x0)| ≤ M|x̂ − x0|. Now we define xk = f(xk−1) for all k ≥ 1 and prove by induction that |x̂ − xk| ≤ M^k|x̂ − x0| for all k ≥ 1. Since x1 = f(x0), we have already shown |x̂ − x1| ≤ M|x̂ − x0|, so the claim is true when k = 1. Now suppose |x̂ − xk| ≤ M^k|x̂ − x0| for some particular but arbitrary value k ≥ 1. Note that |x̂ − xk| ≤ M^k|x̂ − x0| implies xk ∈ (x̂ − δ, x̂ + δ), so we apply the Mean Value Theorem as before and conclude that |x̂ − f(xk)| ≤ M|x̂ − xk|. Substituting xk+1 for f(xk) and using the inductive hypothesis, we have |x̂ − xk+1| ≤ M · M^k|x̂ − x0| = M^(k+1)|x̂ − x0|. Hence, we have 0 ≤ |x̂ − xk| ≤ M^k|x̂ − x0|. Of course lim_{k→∞} 0 = 0 and lim_{k→∞} M^k|x̂ − x0| = 0 (since M < 1), so by the squeeze theorem, lim_{k→∞} |x̂ − xk| = 0.
As suggested earlier, we should not expect fixed point iteration to converge when the derivative at a fixed
point has magnitude greater than one. In fact, more or less the opposite happens. There is a neighborhood of
the fixed point in which fixed point iteration is guaranteed to escape the neighborhood for any initial value in the
neighborhood not equal to the fixed point itself. Given that fact, it is tempting to think that perhaps the Fixed
Point Convergence Theorem could be strengthened to a bi-directional implication, an if-and-only-if claim. And it
“almost” can. What can be said here has direct parallels to the ratio test for series. Recall, for any sequence of real
numbers a0, a1, a2, . . ., the limit L = lim_{k→∞} |ak+1/ak| helps determine the convergence of Σ_{k=0}^{∞} ak in the following way:

• If L < 1, then Σ_{k=0}^{∞} ak converges (absolutely).

• If L > 1, then Σ_{k=0}^{∞} ak diverges.

• If L = 1, then Σ_{k=0}^{∞} ak may converge (absolutely or conditionally) or may diverge.
Analogously, for any function f(x) with continuous first derivative and fixed point x̂, the derivative f′(x̂) helps determine the convergence of the fixed point iteration method in the following way:

• If |f′(x̂)| < 1, then fixed point iteration converges to x̂ for any initial value in some neighborhood of x̂.

• If |f′(x̂)| > 1, then fixed point iteration escapes some neighborhood of x̂ for any initial value in the neighborhood other than x̂.

• If |f′(x̂)| = 1, then fixed point iteration may converge to x̂ for any initial value in some neighborhood of x̂; or may escape some neighborhood for any initial value in the neighborhood other than x̂; or may have no neighborhood in which all initial values lead to convergence and no neighborhood in which all values other than x̂ escape.
The graphs in Figure 2.2.4 of functions with derivative equal to one at their fixed point help illustrate this last case.
For one of these functions, fixed point iteration converges for all values in a neighborhood of the fixed point. For
another of these functions, fixed point iteration escapes some neighborhood of the fixed point for all initial values in
the neighborhood except the fixed point itself. And for the third of these functions, fixed point iteration converges
to the fixed point for some initial values and escapes a neighborhood of the fixed point for others (and every
neighborhood of the fixed point will have both types of initial values). Can you tell which is which? Figure it out
by creating web diagrams for each. Answer on page 55.
The proof of the Fixed Point Convergence Theorem can easily be extended to include initial values in any
neighborhood of the fixed point in which the magnitude of the derivative remains less than 1. The size and
symmetry of the interval are not important. For example, f(x) = (1/8)x^3 − x^2 + 2x + 1 has a fixed point at x̂ = 2. The
proof of the Fixed Point Convergence Theorem establishes convergence to 2 in a symmetric interval about 2 such
Figure 2.2.4: Convergence behavior when the derivative at the fixed point is 1.
as [1.9, 2.1]. But this interval is far from the largest neighborhood of initial values for which fixed point iteration
converges to 2. We can find bounds on the largest such interval by solving the equation |f′(x)| = 1. To that end:

(3/8)x^2 − 2x + 2 = ±1
3x^2 − 16x + 16 = ±8
3x^2 − 16x + 24 = 0   or   3x^2 − 16x + 8 = 0
x = (8 ± 2i√2)/3   or   x = (8 ± 2√10)/3

with (8 − 2√10)/3 ≈ 0.558 and (8 + 2√10)/3 ≈ 4.775, so we should expect fixed point iteration to converge to 2 on any closed interval contained in

((8 − 2√10)/3, (8 + 2√10)/3).
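As a quick numerical check (a sketch, not part of the text's development), iterating f from an initial value near the right end of this interval does indeed lead to 2:

f=inline('x^3/8-x^2+2*x+1');
x=4.7;              % inside ((8-2*sqrt(10))/3, (8+2*sqrt(10))/3)
for i=1:100
  x=f(x);
end%for
disp(x)             % essentially 2

Starting far enough outside this interval, from x0 = 7 for example, the iterates instead grow without bound.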
Now, if we have the computer execute fixed point iteration for a large number of evenly spaced initial values, say
100, on the interval [−2, 8] and record the results on a number line where we color an initial value black if it does
not converge to 2 and green if it does converge to 2 (we will call such diagram a convergence diagram), we get
by applying fixed point iteration to the corresponding function f (x) = x + g(x) = αx(1 − x) is a famous exercise
in dynamical systems which has a nasty habit of not working! Complete the following investigation to see what
happens.
[Diagram: long-run values of the iteration; vertical axis from 0 to 1, horizontal axis (α) from 2.4 to 4.]
4. Use the diagram to predict a value of α for which you would expect fixed point iteration to lead to x975
through x1000 cycling through 4 different values. Check your prediction.
Figure 2.2.6: Convergence diagrams for 6 functions with the same fixed points.
[Convergence diagram strips for f1, f2, f3, f4, f5, and f6.]
black: does not converge; green: converges to 3; red: converges to 1 + √3; blue: converges to 1 − √3
Root Finding
When successful, fixed point iteration finds solutions of an equation of the form f (x) = x. A root finding problem
requires the solution of an equation of the form g(x) = 0. However, the equation f (x) = x has exactly the same
solutions as the equation f (x)−x = 0, so finding fixed points of f (x) is equivalent to finding roots of g(x) = f (x)−x.
Indeed, we can rephrase the example of finding fixed points of f(x) = (1/8)x^3 − x^2 + 2x + 1 as the problem of finding roots of g(x) = f(x) − x = (1/8)x^3 − x^2 + x + 1. But it is the opposite problem that is much more common. We have
the question of finding the roots of a function and need to rephrase it in terms of a fixed point problem.
Suppose we want the roots of g(x) = −x^3 + 5x^2 − 4x − 6. We can rephrase the question of solving g(x) = 0 as the problem of finding the fixed points of many different functions! But you will have to ignore some sage advice of your algebra teacher to derive them! The key is to use algebra to rewrite the equation −x^3 + 5x^2 − 4x − 6 = 0 as an equation of the form x = f(x). The simplest way is to add x to both sides of the equation. This manipulation and several others are shown in the following list.

• −x^3 + 5x^2 − 4x − 6 = 0 ⇒ −x^3 + 5x^2 − 3x − 6 = x

• −x^3 + 5x^2 − 4x − 6 = 0 ⇒ −x^3 + 5x^2 − 6 = 4x ⇒ (−x^3 + 5x^2 − 6)/4 = x

• −x^3 + 5x^2 − 4x − 6 = 0 ⇒ −x^3 − 4x − 6 = −5x^2 ⇒ (x^3 + 4x + 6)/5 = x^2 ⇒ ±√((x^3 + 4x + 6)/5) = x

• −x^3 + 5x^2 − 4x − 6 = 0 ⇒ 5x^2 − 4x − 6 = x^3 ⇒ ∛(5x^2 − 4x − 6) = x
Can you see what has been done for each one? Thus, we have five candidates for fixed point iteration, f1(x) = −x^3 + 5x^2 − 3x − 6, f2(x) = (−x^3 + 5x^2 − 6)/4, f3(x) = √((x^3 + 4x + 6)/5), f4(x) = −√((x^3 + 4x + 6)/5), and f5(x) = ∛(5x^2 − 4x − 6), all of which will potentially give roots of g(x). There is a sixth function we will discuss in much more detail later: f6(x) = (2x^3 − 5x^2 − 6)/(3x^2 − 10x + 4).³ The roots of g(x) are 1 − √3 ≈ −0.73, 1 + √3 ≈ 2.73, and 3, so we will consider convergence diagrams over the interval [−2, 5]. Fixed point iteration converges to different fixed points for the different functions despite the fact that all 6 functions have exactly the same three fixed points. The convergence diagrams of Figure 2.2.6 are color-coded to reflect this fact. Black indicates lack of convergence just as before. Green, red, and blue indicate convergence to 3, 1 + √3, and 1 − √3, respectively. Notice that only f6 provides convergence for, as far as we can tell, every initial value in [−2, 5], and is also the only one for which fixed point iteration converges to different fixed points for different initial values. See if you can understand why each function has the convergence behavior it does by looking at the graphs of f1, f2, . . . , f6.

³By calculating f6(1 − √3), f6(1 + √3), and f6(3), you can verify that f6 has these three values as fixed points as well.

Pay special attention to the graphs around 1 + √3 and
3. Looks can be deceiving in that area because the two fixed points are so close together. Also, see if you can find
two initial values in [−2, 5] for which fixed point iteration on f6 does not converge. What happens instead? For an
extra challenge, see if you can find a third point in [−2, 5] for which fixed point iteration on f6 does not converge.
Hint: you may need to use a computer algebra system to find such a point exactly or use fixed point iteration to
approximate it! Answers on page 55.
Assumptions: f is differentiable. f has a fixed point x̂. x0 is in a neighborhood (x̂ − δ, x̂ + δ) where the magnitude of f′ is less than one.
Input: Initial value x0 ; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: For j = 1 . . . N do Steps 2-4:
Step 2: Set x = f (x0 );
Step 3: If |x − x0 | ≤ tol then return x;
Step 4: Set x0 = x;
Step 5: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
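A direct Octave translation of this pseudo-code might look something like the following sketch (writing your own version is exercise 1; the function name fixedpoint and its argument order are not prescribed by the text):

function res=fixedpoint(f,x0,tol,maxits)
  for j=1:maxits
    x=f(x0);
    if (abs(x-x0)<=tol)
      res=x;
      return;
    end%if
    x0=x;
  end%for
  res='Method failed. Maximum iterations exceeded.';
end%function

With a sketch like this saved as fixedpoint.m, a call such as fixedpoint(inline('cos(x)'),1,10^-6,100) returns approximately 0.739085, in agreement with the calculator experiment.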
Key Concepts
Fixed point: x0 is a fixed point of the function f (x) if f (x0 ) = x0 .
Fixed point iteration: Calculating the sequence x0 , x1 = f (x0 ), x2 = f (x1 ), x3 = f (x2 ), . . . given the function f
and initial value x0 .
Attractive fixed point: A fixed point is called attractive (or attracting) if there is a neighborhood of the fixed
point in which fixed point iteration converges for all initial values in the neighborhood.
Repulsive fixed point: A fixed point is called repulsive (or repelling) if fixed point iteration escapes some neigh-
borhood of the fixed point for any initial value in the neighborhood other than the fixed point itself.
Mean Value Theorem: If f is continuous on [a, b] and has a derivative on (a, b), then there exists c ∈ (a, b) such that f′(c) = (f(b) − f(a))/(b − a).
Fixed Point Convergence Theorem: Given a function f(x) with continuous first derivative and fixed point x̂, if |f′(x̂)| < 1 then there exists a neighborhood of x̂ in which fixed point iteration converges to the fixed point for any initial value in the neighborhood.
Exercises

1. Write an Octave implementation of the fixed point iteration method. Save it as a .m file for future use.

2. (i) Decide whether or not the hypotheses of the Mean Value Theorem are met for the function over the interval. (ii) If the hypotheses are met, find a value c as guaranteed by the theorem.

(a) f(x) = 3 − x − sin x; [2, 3]
(b) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 1]
(c) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 0.9] [S]
(d) h(x) = 10 − cosh(x); [−3, −2] [A]
(e) f(t) = √(4 + 5 sin t) − 2.5; [−6, −5]
(f) g(t) = 3t^2 tan t/(1 − t^2); [20, 23] [S]
(g) h(t) = ln(3 sin t) − 3t/5; [2, 4] [A]
(h) f(r) = e^sin r − r; [−20, 20] [A]
(i) g(r) = sin(e^r) + r; [−3, 3]
(j) h(r) = 2^sin r − 3^cos r; [1, 3]

3. Find the fixed point(s) of the function exactly. Use algebra.

(a) f(x) = ∛(2x^3 − x^2 − x)
(b) f(x) = ln(2 . . . )/2
(c) f(x) = log(x^2 − 3x) − 1 + x [A]
(d) g(x) = 3x^2 + 5x + 1 [A]
(e) g(x) = x + 5000/(1 + 2e^(−3t)) − 2500
(f) g(x) = e^(ln(x+1)−3)
(g) h(x) = √(4x^2 + 4x + 1) [S]
(h) h(x) = x − 10 + 3^x + 25 · 3^(−x)
(i) h(x) = x + 6 − 3 log5(2x)

4. Find at least two candidate functions, f1(x) and f2(x), for finding roots of g(x) via fixed point iteration. In other words, convert the problem of finding a root of g into a problem of finding a fixed point of f1 or f2.

(a) g(x) = 7x^2 + 5x − 9
(b) g(x) = x + cos x
(c) g(x) = 6x^5 + 12x^2 − 8 [A]
(d) g(x) = x^2 − e^(3x+4) [S]
(e) g(x) = 7x − 3 cos(πx − 2) + ln|2x^2 + 4x − 8|
(f) g(x) = (3x^2 − 5x + 1)/(x^2 − 5x − 1) − 2^(−x) [A]

5. Compute the first 5 iterations of the fixed point iteration method using the given function and initial value. Based on these 5 iterations, do you expect the method to converge?

9. Use proposition 3 to show that g(x) = 2x(1 − x) has a unique fixed point on [0.3, 0.7].

10. Let f(x) = (3x^2 − 1)/(6x + 4). [S]

(a) Show that f has a unique fixed point on [−4, −0.9].
(b) Use fixed point iteration to find an approximation to the fixed point that is accurate to within 10^-2.

11. Let g(x) = π + 0.5 sin(x/2).

(a) Show that g has a unique fixed point on [0, 2π].
(b) Use fixed point iteration to find an approximation to the fixed point that is accurate to within 10^-2.

12. Show that the fixed point iteration method applied to f(x) = ∛(8 − 4x) will converge to a root of g(x) = x^3 + 4x − 8 for any initial value x0 ∈ [1.2, 1.5]. [S]

13. Show that fixed point iteration is guaranteed to converge to the fixed point of

f(x) = (√2)^x

for any x0 ∈ [1, 3]. HINT: f′(x) = (1/2) ln(2) · (√2)^x.

14. Let g(x) = x^2 − 3x − 2.

(a) Find a function f on which fixed point iteration will converge to a root of g.
(b) Use your function to find a root of g to within 10^-3 of the exact value.
(c) State the initial value you used and how many iterations it took to get the approximation.

15. Use fixed point iteration with p0 = −1 to approximate a root of g(x) = x^3 − 3x + 3 accurate to the nearest 10^-4.

16. Use a fixed point iteration method to find an approximation of √3 that is accurate to within 10^-4. What function and initial value did you use?

17. The function f(x) = x^4 + 2x^2 − x − 3 has two roots. One of them is in [−1, 0] and the other is in [1, 2].

(a) In preparation for finding a root of f(x) using fixed point iteration, one way to manipulate the equation x^4 + 2x^2 − x − 3 = 0 is to add x to both sides. This gives

18. Fixed point iteration on f(x) = ∛(2x^3 − x^2 − x) will not converge to a fixed point. However, fixed point iteration on the function g(x) = ∛(x^2 + x) will converge to approximately 1.618033988749895 for any x0 in [0.5, 3.5]. [A]

(a) How many iterations does it take to achieve 10^-4 accuracy using g(x) with x0 = 2.5?
(b) Explain why f(x) and g(x) have the same fixed points.

19. Find a zero (any zero) of g(x) = x^2 + 10 cos x accurate to within 10^-4 using fixed point iteration. State

(a) the function f to which you applied fixed point iteration
(b) the initial value, x0, you used
(c) how many iterations it took

20. Let c be a nonzero real number. Argue that any fixed point of f(x) = x·e^(c·g(x)) is a root of g.

21. Approximate √3 using the method suggested by question 20.

22. Suppose g(x̂) = 0 and g has a continuous first derivative. Argue that there exists a value c for which fixed point iteration on f(x) = x + cg(x) will converge to x̂ on some neighborhood of x̂.

23. Find a value of c for which fixed point iteration is guaranteed to converge for the function f(x) = x + c(x − 5 cos x) with any initial value x0 ∈ [0, π/2]. Explain. [A]

24. Let g(x) = (1/2)^x + (1/x)^5 − 10^-5.

(a) Show that if g(x) has a zero at p, then the function f(x) = x + cg(x) has a fixed point at p.
(b) Find a value of c for which fixed point iteration of f(x) will successfully converge for any starting value, p0, in the interval [16, 17]. Sketch the graphs that demonstrate that your choice of c is appropriate.
(c) Use the function from part 24b with the value of c you have determined to find a root of g(x) accurate to within 10^-4. State the value you used for p0. Show the last 3 iterations. How many iterations did it take?

25. Prove that for f(x) = cos x, fixed point iteration converges for any initial value.

26. The Fixed Point Convergence Theorem can be strengthened. The requirement that the first derivative be continuous can be replaced. Modify the proof in the text to show the following claim.

Given a differentiable function f(x) with fixed point x̂, if |f′(x)| ≤ M < 1 for all x in some neighborhood of x̂, then fixed point iteration converges to the fixed point for any initial value in the neighborhood.

27. Create three graphs similar to those in Figure 2.2.4 to analyze the situation when the derivative at the fixed point equals −1. Does the situation differ from that when the derivative at the fixed point equals 1?
Answers
Figure 2.2.4: From left to right: every neighborhood of the fixed point will have both types of initial values; fixed point iteration converges for all values in a neighborhood of the fixed point; fixed point iteration escapes some neighborhood of the fixed point for all initial values in the neighborhood except the fixed point itself
Figure 2.2.6: When its denominator is zero, f6(x) will be undefined (there is a vertical asymptote in the graph), so we solve 3x^2 − 10x + 4 = 0 to find two initial values for which fixed point iteration will fail (since the first iteration will be undefined). They are x = (5 ± √13)/3 ≈ 0.4648 and 2.868. To find a third point for which fixed point iteration will fail, we solve the equation f6(x) = (5 + √13)/3 (we could just as easily have solved f6(x) = (5 − √13)/3 instead). Then the second iteration will be undefined since the first iteration will be (5 + √13)/3. The only real solution is approximately 1.055909763230534, which can be found by fixed point iteration on

∛[ ((10 + √13)x^2 − ((50 + 10√13)/3)x + (38 + 4√13)/3) / 2 ].

Prove it. Note, though, the claim that fixed point iteration will fail is based on the assumption of exact arithmetic. The fact that any reasonable implementation of the fixed point iteration method will involve floating point arithmetic might provide just enough error for the method to converge even for these initial values.
|xn+1 − x̂| / |xn − x̂|^1 = | (f(xn) − f(x̂)) / (xn − x̂) |

and

lim_{n→∞} | (f(xn) − f(x̂)) / (xn − x̂) | = |f′(x̂)|.    (2.3.1)
Therefore, fixed point iteration is linearly convergent as long as f′(x̂) ≠ 0. The following proposition could be presented as a corollary to the Fixed Point Convergence Theorem since much of the argument simply repeats what was noted there, but we choose to present it as a separate claim based on equation 2.3.1. To be more precise, we have the following result.

Proposition 5. (Fixed Point Error Bound) Let f be a differentiable function with fixed point x̂ and let [a, b] be an interval containing x̂. If |f′(x)| ≤ M < 1 for all x ∈ [a, b] and f([a, b]) ⊆ [a, b], then for any initial value x0 ∈ [a, b], fixed point iteration, with xk+1 = f(xk) for all k ≥ 0, gives an approximation of x̂ with absolute error no more than M^k|x0 − x̂|.
Proof. An elementary induction proof (requested in the exercises) will establish that xk ∈ [a, b] for all k ≥ 0. We proceed to prove the error bound. The absolute error in approximating x̂ by x0 is |x0 − x̂| = M^0|x0 − x̂| so the claim is true for k = 0. Now suppose the claim is true for some particular but arbitrary k ≥ 0. By the Mean Value Theorem, there is a c in the interval from x̂ to xk such that f′(c) = (f(xk) − f(x̂))/(xk − x̂). Since x̂ and xk are both in [a, b], so is c. It follows that |f′(c)| ≤ M, so |f(xk) − f(x̂)| ≤ M|xk − x̂|. But x̂ is a fixed point of f, so f(x̂) = x̂, from which it follows that |f(xk) − x̂| ≤ M|xk − x̂|, and, therefore, that |xk+1 − x̂| ≤ M|xk − x̂|. By the inductive hypothesis, |xk − x̂| ≤ M^k|x0 − x̂|, so |xk+1 − x̂| ≤ M · M^k|x0 − x̂| = M^(k+1)|x0 − x̂|.
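As a concrete illustration (not part of the text), f(x) = cos x satisfies these hypotheses on [a, b] = [0, 1] with M = sin(1) ≈ 0.84, since |f′(x)| = |sin x| ≤ sin(1) on [0, 1] and cos([0, 1]) ⊆ [0, 1]. A few lines of Octave compare the actual error with the bound M^k|x0 − x̂|:

xhat=0.739085133215161;   % the fixed point of cos(x), for reference
M=sin(1);                 % max of |f'(x)| = |sin(x)| on [0,1]
x=1;                      % x0
for k=1:10
  x=cos(x);
  disp([k abs(x-xhat) M^k*abs(1-xhat)]);  % iteration, actual error, error bound
end%for

Each actual error stays below the corresponding bound, as Proposition 5 guarantees.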
When f′(x̂) = 0, equation 2.3.1 shows that fixed point iteration does not converge linearly. For any sequence ⟨pn⟩ converging to p, if lim_{n→∞} |pn+1 − p|/|pn − p| = 0 we say the sequence is superlinearly convergent or that convergence is faster than linear.
Consider the functions f(x) = (1/8)x^3 − x^2 + 2x + 1 and f1(x) = −x^3 + 5x^2 − 3x − 6 from section 2.2. Recall 2 is a fixed point of f and 3 is a fixed point of f1 and observe that f′(2) = (3/8)·2^2 − 2·2 + 2 = −1/2 and f1′(3) = −3·3^2 + 10·3 − 3 = 0. Consequently, we should expect fixed point iteration of f1 to converge to 3 faster than that of f converges to 2. With
s0 , s1 , s2 , . . . = 1.75, f (1.75), f (f (1.75)), . . . and t0 , t1 , t2 , . . . = 2.75, f1 (2.75), f1 (f1 (2.75)), . . ., table 2.1 shows the
Table 2.1: Comparing order of convergence for fixed point iteration when the derivative at the fixed point is not
zero (sn ) to that when the derivative at the fixed point is zero (tn ).
n |2 − sn | |3 − tn |
0 2.5(10)−1 2.5(10)−1
1 1.074(10)−1 2.343(10)−1
2 5.644(10)−2 2.068(10)−1
3 2.740(10)−2 1.623(10)−1
4 1.388(10)−2 1.010(10)−1
5 6.894(10)−3 3.984(10)−2
6 3.459(10)−3 6.286(10)−3
7 1.726(10)−3 1.578(10)−4
8 8.640(10)−4 9.966(10)−8
9 4.318(10)−4 3.973(10)−14
10 2.159(10)−4 6.317(10)−27
relative speeds of convergence. hsn i is converging linearly as expected, and htn i seems to be converging quadratically.
The last four exponents in the |3 − tn | column are −4, −8, −14, −27, indicating that the number of significant digits
of accuracy is approximately doubling with each iteration. In other words, the error of one term is roughly the
square of the previous error (meaning α = 2 in the definition of order of convergence).
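The two error columns of Table 2.1 can be regenerated with a short Octave loop (a sketch; the last few rows will differ from the table because double precision arithmetic cannot resolve errors as small as 10^-27):

f=inline('x^3/8-x^2+2*x+1');      % derivative -1/2 at its fixed point 2
f1=inline('-x^3+5*x^2-3*x-6');    % derivative 0 at its fixed point 3
s=1.75; t=2.75;
for n=1:10
  s=f(s); t=f1(t);
  disp([n abs(2-s) abs(3-t)]);    % n, |2 - s_n|, |3 - t_n|
end%for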
Taylor's theorem will provide the proof we need that this convergence really is quadratic. Suppose f has a third derivative in a neighborhood of x̂. Define en = x̂ − xn. Then according to Taylor's theorem, x̂ = f(x̂) = f(xn + en) = f(xn) + en·f′(xn) + (1/2)en^2·f″(xn) + O(en^3). But f(xn) = xn+1 so we get

x̂ − xn+1 = en+1 = en·f′(xn) + (1/2)en^2·f″(xn) + O(en^3).    (2.3.2)

Also from Taylor's theorem, f′(x̂) = f′(xn + en) = f′(xn) + en·f″(xn) + O(en^2). But f′(x̂) = 0, so f′(xn) = −en·f″(xn) + O(en^2), and substituting this into equation 2.3.2 gives en+1 = −(1/2)en^2·f″(xn) + O(en^3). Hence (x̂ − xn+1)/(x̂ − xn)^2 = en+1/en^2 = −(1/2)f″(xn) + O(en) and

lim_{n→∞} |x̂ − xn+1| / |x̂ − xn|^2 = lim_{n→∞} | (1/2)f″(xn) + O(en) | = (1/2)|f″(x̂)|,

showing that convergence is at least quadratic. If f″(x̂) happens to be 0, then the convergence is superquadratic.
To summarize, on the off-chance that, at a fixed point x̂, f′(x̂) = 0, fixed point iteration is successful and fast for initial values near x̂. But when f′(x̂) ≠ 0, fixed point iteration may fail to converge to x̂, and when it does converge, the convergence is slow. There is a quick fix (quick to implement, not quick to explain) for some of this deficiency when f′(x̂) ≠ 0, however. We will first concentrate on the speed of convergence.
Let the sequence ⟨cn⟩ be defined by

c0 = 1
ck = cos(ck−1), k > 0.

You should be able to verify that the first few terms of this sequence are (approximately) 1, 0.540302, 0.857553, 0.654290, 0.793480, . . .. This is exactly the sequence you created in the calculator experiment on page 46 of section 2.2. Define a new sequence ⟨an⟩ by

an = cn − (cn+1 − cn)^2 / (cn+2 − 2cn+1 + cn).
Table 2.2 shows the first few terms of each sequence along with some error analysis. As promised, the sequence ⟨an⟩ is converging more quickly than ⟨cn⟩, evidenced by the fact that (an − c)/(cn − c) is tending to zero. The last column of the table indicates that the convergence of ⟨an⟩ to c is not quadratic, however.
More generally, suppose ⟨pn⟩ is any sequence that converges linearly to p. Then we have lim_{n→∞} |p − pn+1|/|p − pn| = λ ≠ 0, so we should expect |p − pn+2|/|p − pn+1| ≈ |p − pn+1|/|p − pn| ≈ λ for large enough n, from which we get |(p − pn+2)(p − pn)| ≈ |p − pn+1|^2. Assuming p − pn+2 and p − pn have the same sign for large n⁴, we can remove the absolute values to find (p − pn+2)(p − pn) ≈ (p − pn+1)^2, and solving this equation for p,

p ≈ (pn+2·pn − pn+1^2) / (pn+2 − 2pn+1 + pn).
Therefore, we may take any three consecutive terms of hpn i and predict p from this formula. For large enough n,
this prediction will be a much better estimate of p than is pn. But just as we were able to claim |(p − pn+2)(p − pn)| ≈ |p − pn+1|^2, it must also be the case that pn+2·pn ≈ pn+1^2, so the numerator of our approximation is nearly zero. Of course, that means the denominator must be nearly zero as well, since the quotient is p, a value that may not be zero. To avoid some of the error inherent in this calculation, it is advisable to compute the algebraically equivalent approximation

p ≈ pn − (pn+1 − pn)^2 / (pn+2 − 2pn+1 + pn)    (2.3.4)
instead. Let’s go back and revisit the sequence hsn i and apply this approximation.
Define an = sn − (sn+1 − sn)^2/(sn+2 − 2sn+1 + sn) and consider table 2.3 comparing the two sequences ⟨sn⟩ and ⟨an⟩. ⟨an⟩
Table 2.3: Comparing fixed point iteration when the derivative at the fixed point is not zero, sn , to the Aitken’s
delta-squared sequence, an .
n sn |2 − sn | an |2 − an |
0 1.75 2.5(10)−1 1.99506842493985 4.931(10)−3
1 2.107421875 1.074(10)−1 1.999022858310434 9.771(10)−4
2 1.943559146486223 5.644(10)−2 1.999737171760319 2.628(10)−4
3 2.027401559734717 2.740(10)−2 1.999937151202653 6.284(10)−5
4 1.986114080555812 1.388(10)−2 1.999983969455146 1.603(10)−5
5 2.006894420349172 6.894(10)−3
6 1.996540947531514 3.459(10)−3
converges significantly faster than the linearly convergent sequence from which it was derived, just as before! The
fact that |2 − an | ≈ |2 − sn+2 |2 is evidence of this claim, but the convergence of han i is still linear. Make sure you
can calculate the an in this table yourself before reading on.
On a practical note, there is no sense in calculating all the terms a0 , a1 , . . . , an−2 as done in the table. The
terms of han i are dependent only on those of hsn i so an−2 can be calculated just as well without having calculated
a0 , a1 , . . . , an−3 . The table shows all of them only for illustrative purposes and so you can get some practice with
formula 2.3.4. The important thing to notice is that an has approximately twice as many significant digits of
accuracy as does sn+2 . Consequently, a0 is a much better approximation than is s2 .
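For instance, a0 in Table 2.3 depends only on s0, s1, and s2. In Octave (just a check of formula 2.3.4, not code from the text):

s0=1.75; s1=2.107421875; s2=1.943559146486223;
a0=s0-(s1-s0)^2/(s2-2*s1+s0);
disp(a0)     % approximately 1.99506842493985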
Crumpet 11: Aitken’s delta-squared method is designed for any linearly convergent sequence, not
just sequences derived from fixed point iteration.
The derivation of 2.3.4, referred to as Aitken’s delta-squared formula, makes no reference to fixed point iteration.
In fact it makes no assumptions about the origin of the sequence. It makes no difference. It may be a sequence of
partial sums, a sequence of partial products, a sequence derived from any recurrence relation, a sequence derived
from number theory, or anything else. The only important characteristics are that the sequence converges and it
does so linearly.
⁴This will happen in the common events that the x̂ − xn all have the same sign or the x̂ − xn have alternating signs, so this is not an unrealistic assumption.
The sum 1 − 1/3 + 1/5 − 1/7 + 1/9 − · · · converges to π/4 linearly, so Aitken's delta-squared method should be helpful. If we let pn = Σ_{k=1}^{n+1} (−1)^(k+1)/(2k − 1) denote its partial sums, then p2 = 13/15, p3 = 76/105, p4 = 263/315, and p5 = 2578/3465. Aitken's extrapolation gives

a2 = 13/15 − (76/105 − 13/15)^2 / (263/315 − 2·76/105 + 13/15) = 1321/1680

and

a3 = 76/105 − (263/315 − 76/105)^2 / (2578/3465 − 2·263/315 + 76/105) = 989/1260.

|π/4 − p4|^2 / |π/4 − a2| ≈ 2.6 and |π/4 − p5|^2 / |π/4 − a3| ≈ 3.5, so extrapolation gives an error less than the square of the error in the original sequence.
Perhaps this fact gives you an idea. Once s2 is calculated, we can use equation 2.3.4, also known as Aitken’s
delta-squared method, to calculate a better approximation than we already have. And once we have this good
approximation, it seems a bit silly to cast it aside and continue computing s3 = f (s2 ), s4 = f (s3 ), and so on. What
if we use a0 in place of s3 in our iteration? In other words, we would have s1 = f (s0 ), s2 = f (s1 ), s3 = a0 , s4 = f (s3 ),
and so on. That should improve s3 , s4 , and s5 . And once we have s5 we again have three consecutive fixed point
iterations, so we can apply Aitken’s delta squared method again. Instead of calculating s6 = f (s5 ), we can get what
should be a better approximation by using equation 2.3.4 on s3 , s4 , and s5 . In other words, s6 = a3 , s7 = f (s6 ),
s8 = f (s7 ). Again, we have three consecutive fixed point iterations, so s9 = a6 , and so on. This gives the sequence
which converges very rapidly! The construction of this subsequence as a sequence in and of itself is called Steffensen’s
method and the convergence is quadratic as long as hsn i is convergent. The following is a heuristic argument that
Steffensen’s method gives quadratic convergence. As seen, the error in s2 is not significantly different from the error
in s0 . But a0 has an error approximately equal to the square of the error in s2 , so the error in a0 is approximately
the square of the error in s0 . Similarly, the error in s5 is not significantly different from that in a0 = s3 . But the
error in a1 is approximately the square of the error in s5 , so the error in a1 is approximately the square of the error
in a0 . Similarly, the error in an+1 is approximately the square of the error in an .
Applying Steffensen’s method to the function f (x) = cos x with x0 = 1, we can accelerate the convergence of the
sequence hcn i dramatically. Table 2.4 shows the first few terms of han i with some error analysis. The last column
of the table indicates that
lim_{n→∞} |an+1 − c| / |an − c|^2 ≈ .148
Figure 2.3.1: Convergence diagrams for 5 functions with the same fixed points—Steffensen’s method.
[Convergence diagram strips for f1, f2, f3, f4, and f5 under Steffensen's method.]
black: does not converge; green: converges to 3; red: converges to 1 + √3; blue: converges to 1 − √3
Convergence Diagrams
Speeding up fixed point iteration only takes care of one deficiency of the method. There is still the problem of diver-
gence from fixed points where the derivative of the function has magnitude equal to or greater than 1. Steffensen’s
method helps. Compare Figure 2.3.1 with Figure 2.2.6. The convergence diagrams for Steffensen’s method show
convergence over larger intervals of initial values. Moreover, where f1 and f2 are concerned, Steffensen’s method
finds all three fixed points, just as fixed point iteration on f6 did.
Assumptions: Fixed point iteration converges to a fixed point of f with initial value x0 .
Input: Initial value x0 ; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: For j = 1 . . . N do Steps 2-6:
Step 2: Set x1 = f (x0 ); x2 = f (x1 )
Step 3: If |x2 − x1 | ≤ tol then return x2
Step 4: Set x = x0 − (x1 − x0)^2 / (x2 − 2x1 + x0)
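In Octave, Steffensen's method might be implemented along these lines (a sketch only; the bookkeeping that would follow Step 4, and the name steffensen.m, are filled in here as reasonable guesses, since the pseudo-code above is shown only through Step 4):

function res=steffensen(f,x0,tol,maxits)
  for j=1:maxits
    x1=f(x0);
    x2=f(x1);
    if (abs(x2-x1)<=tol)
      res=x2;
      return;
    end%if
    x0=x0-(x1-x0)^2/(x2-2*x1+x0);   % Aitken's delta-squared value becomes the next iterate
  end%for
  res='Method failed. Maximum iterations exceeded.';
end%function

For example, steffensen(inline('cos(x)'),1,10^-6,20) reaches approximately 0.739085 in a handful of passes through the loop, far fewer than plain fixed point iteration requires.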
Key Concepts
Aitken's delta-squared method: If ⟨pn⟩ converges to p linearly, the sequence ⟨an⟩ defined by an = pn − (pn+1 − pn)^2/(pn+2 − 2pn+1 + pn) converges to p superlinearly.
Fixed Point Error Bound: Let f be a differentiable function with fixed point x̂ and let [a, b] be an interval containing x̂. If |f′(x)| ≤ M < 1 for all x ∈ [a, b] and f([a, b]) ⊆ [a, b], then for any initial value x0 ∈ [a, b], fixed point iteration, with xk+1 = f(xk) for all k ≥ 0, gives an approximation of x̂ with absolute error no more than M^k|x0 − x̂|.
Fixed Point Iteration Order of Convergence: Suppose f is a function with fixed point x̂ and f′(x̂) exists. Let x0, x1, x2, . . . be a sequence derived from fixed point iteration (xk+1 = f(xk) for all k ≥ 0) such that lim_{k→∞} xk = x̂ and xk ≠ x̂ for all k = 0, 1, 2, . . .. Then the sequence ⟨xn⟩ converges linearly to x̂ if f′(x̂) ≠ 0 and at least quadratically if f′(x̂) = 0.
Steffensen’s method: A modification of fixed point iteration where every third term is calculated using Aitken’s
delta-squared method.
Superlinear convergence: If the sequence p0, p1, p2, . . . converges to p and lim_{k→∞} |pk+1 − p|/|pk − p| = 0, then the sequence is said to converge superlinearly.

Superquadratic convergence: If the sequence p0, p1, p2, . . . converges to p and lim_{k→∞} |pk+1 − p|/|pk − p|^2 = 0, then the sequence is said to converge superquadratically.
Octave
In section 1.3, we learned about for loops. With a for loop, you have to know how many times you want the loop
to run or at least you need a maximum. You can quit a for loop before it is done by exiting (returning) from
the function. There are times, however, when you don’t know how many times you need a loop to run and you
don’t even have a convenient maximum at hand. In this case, a while loop is more appropriate. A while loop will
continue to loop as long as a certain condition is met, and you set the condition. The syntax for a while loop is
while (condition)
do something.
end%while
but must be used with caution. for loops always have an end, but while loops do not if programmed carelessly. If
the condition of a while loop is never met, the loop runs indefinitely! Here is a simple example of a while loop
that never ends. Do not run it!
i=0;
while (i<12)
disp("Help! I’m stuck in a never-ending loop!!")
end%while
The problem is that i starts less than 12 and never changes, so it always remains less than 12. Thus the condition of this while loop is always met. This loop can easily be modified to terminate. If we increment i inside the loop, it will end. This modification of the never-ending loop does end and displays a message 12 times:
i=0;
while (i<12)
disp("That’s better. I can handle a dozen iterations.")
i=i+1;
end%while
Incidentally, any for loop can be replaced by a while loop like this one.
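For instance, the loop for i=1:12 ... end%for could be rewritten as follows (a small illustration, not from the text):

i=1;
while (i<=12)
  disp(i)
  i=i+1;     % a for loop increments i automatically; the while loop must do it explicitly
end%while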
We are human. Inevitably, we will program a while loop that never ends. What to do once it starts running?
Of course, you can power down the machine, but that is a little like bringing your coffee mug to the kitchen using a
bulldozer. There is an easier way. You can simply stop the application in which you are running Octave. If you are
using a command line (terminal) window or the Octave GUI, you can simply close it. But, if you remember, you
can also press Ctrl-c. That is, tap the c key while holding down the Ctrl key. This will interrupt the never-ending
loop.
For a more practical example, the bisection method can easily be re-programmed using a while loop. First, the
pseudo-code:
Assumptions: f is continuous on [a, b]. f (a) and f (b) have opposite signs.
Input: Interval [a, b]; function f ; desired accuracy tol.
Step 1: Set m = (a + b)/2; err = |b − a|/2; L = f(a);
Now the Octave code. If you decide to use this code, it should be saved in a file named bisectionWhile.m.
function p = bisectionWhile(f,a,b,tol)
  p = a + (b-a)/2;
  err = abs(b-a);
  FA = f(a);
  while (err>tol)
    p = a + (b-a)/2;
    FP = f(p);
    err=err/2;
    if (FP == 0)
      return
    end%if
    if (FA*FP > 0)
      a = p;
      FA = FP;
    else
      b = p;
    end%if
  end%while
end%function
Use this code with caution! It can run as a never-ending loop! If the function is called with a negative value for tol,
as in bisectionWhile(g,1,2,-10), it will run until forcibly stopped (using Ctrl-c or shutting down the Octave
app) as err will always be greater than −10.
Error checking
The most useful software includes error checking. In the case of the bisectionWhile function, we want to avoid
the endless loop in every instance we can imagine. Adding a couple lines at the beginning of the function provides
some security:
function p = bisectionWhile(f,a,b,tol)
  if (tol<=0)
    p = "ERROR:tol must be positive.";
    return
  end%if
  p = a + (b-a)/2;
  err = abs(b-a);
  FA = f(a);
  while (err>tol)
    p = a + (b-a)/2;
    FP = f(p);
    err=err/2;
    if (FP == 0)
      return
    end%if
    if (FA*FP > 0)
      a = p;
      FA = FP;
    else
      b = p;
    end%if
  end%while
end%function
In general, having your program check for input errors like this is called error checking or validation. Most
of the time, we will write code assuming the input is valid and will not do any error checking. This makes the
programming simpler, but also allows for problems like never-ending loops! bisectionWhile.m may be downloaded
at the companion website.
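With the check in place, a bad tolerance is caught immediately rather than triggering an endless loop. A session might look like this (hypothetical input and prompt numbers):

octave:1> g = inline('cos(x)-x');
octave:2> bisectionWhile(g,0,3,-10)
ans = ERROR:tol must be positive.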
6. Write an Octave program (.m file) that uses a while loop, an array, and the disp() command to find the values of f(n) = 2n/√(n^2 + 3n) for n = 0, 2, 5, 10, 100, 1000, 20000.

7. The following Octave code is intended to calculate the sum Σ_{k=1}^{30} 1/k^2 but it does not. Find as many mistakes in the code as you can. Classify each mistake as either a compilation error (an error that will prevent the program from running at all) or a bug (an error that will not prevent the program from running, but will cause improper calculation of the sum).

sum=1;
k=1;
while k<30
sum=sum+1.0/k*k;
end
diss(sum)

(a) Find a bound on the number of iterations required to approximate the fixed point to within 10^-11 accuracy using fixed point iteration with any initial value in [−4, −0.9].

(b) Use fixed point iteration with x0 = −4 to find an approximation to the fixed point that is accurate to within 10^-11. The fixed point is x = −1.

(c) Compare the bound to the actual number of iterations needed.

12. Let g(x) = π + 0.5 sin(x/2). In exercise 11 of section 2.2, you were asked to show that g has a unique fixed point on [0, 2π].

(a) Find a bound on the number of iterations required to achieve 10^-2 accuracy using fixed point iteration with any initial value in [0, 2π].

(b) Use fixed-point iteration with x0 = 0 to find an approximation to the fixed point that is accurate to within 10^-2. The fixed point is x = ???.

(c) Compare the bound to the actual number of iterations needed.
15. Compute a0, a1, and a2 of Aitken's delta-squared method for the sequence in problem 2 on page 27. Since the sequence has an undefined term at n = 1, start the sequence ⟨(n + 1)/(n − 1)⟩ with n = 2. In other words, consider the sequence in problem 2 on page 27 to be 3, 2, 5/3, 3/2, 7/5, . . . so p0 = 3, p1 = 2, p2 = 5/3, and so on.

16. The following sequences are linearly convergent. Generate the first five terms of the sequence ⟨an⟩ using Aitken's delta-squared calculation.

(a) p0 = 0.5, pn = (2 − e^(pn−1) + pn−1^2)/3 for n ≥ 1 [S]
(b) p0 = 0.75, pn = √(e^(pn−1)/3) for n ≥ 1

17. Use Aitken's delta squared method to find p = lim_{n→∞} pn accurate to 3 decimal places.

pn = {−2, −1.85271, −1.74274, −1.66045, −1.59884, −1.55266, −1.51804, −1.49208, −1.47261, . . .}

18. The sequence ⟨an⟩ of question 15 converges faster than does the sequence in problem 2 on page 27. If you were to apply Aitken's delta-squared method to the sequence ⟨an⟩, would you expect the convergence to be even faster? Explain. [A]

19. Recall from calculus that lim_{n→∞} n sin(1/n) = 1. Therefore, if we let pn = n sin(1/n), then the sequence ⟨p1, p2, p3, . . .⟩ ≈ ⟨.84147, .95885, .98158, . . .⟩ converges to 1, albeit very slowly. Generate the first three terms of the sequence ⟨an⟩ using Aitken's delta-squared calculation. Does it seem to be approaching 1 faster than does ⟨pn⟩?

20. Fixed point iteration applied to f(x) = sin(x) with x0 = 1 takes 29,992 iterations to reach a number below 0.01 on its way to the fixed point 0. Incidentally, x29992 ≈ 0.0099999. How many iterations does it take Steffensen's method with x0 = 1 to reach a number below 0.01? Comment. [S]

21. Let f(x) = 1 + (sin x)^2 and p0 = 1. Find a1 and a2 of Steffensen's method with a calculator. [A]

22. Compute the first three iterations of Steffensen's method applied to g(x) = (√2)^x using p0 = 3.

23. Steffensen's method is applied to a function f(x) using p0 = 1. If f(f(p0)) = 3 and a1 = 0.75, what is f(p0)? [A]

24. Find the fixed point of f(x) = x − 0.002(e^x cos(x) − 100) in [5, 6] using Steffensen's method. [A]

25. In question 24 you found a fixed point x̂. For what function g(x) is x̂ a root?

26. Write a while loop that outputs the numbers 1, .5, .25, .125, .0625, .03125, .015625, . . . until it reaches a number below 10^-4.
iteration,
x1 = f6 (x0 ) = 3.5
x2 = f6 (x1 ) ≈ 3.217391304347826
x3 = f6 (x2 ) ≈ 3.072749058541597
x4 = f6 (x3 ) ≈ 3.013730618589344
x5 = f6 (x4 ) ≈ 3.000683798275568
x6 = f6 (x5 ) ≈ 3.000001860777997
x7 = f6 (x6 ) ≈ 3.000000000013848.
You can see two things. The sequence x0 , x1 , x2 , . . .
1. is converging to (the fixed point) 3; and
2. it looks like the convergence is quadratic since, from about x4 on, the number of significant digits is roughly doubling with each iteration.
In the analysis in section 2.3 on page 56, we found that fixed point iteration converges quadratically (or better) only when the derivative at the fixed point is zero. These observations should lead you to believe f6'(3) = 0. Let's check.
First, the derivative is f6'(x) = (6x^4 − 40x^3 + 74x^2 − 4x − 60)/(3x^2 − 10x + 4)^2 (you should verify this). Evaluating the numerator at the fixed point, x = 3, we get 6(3)^4 − 40(3)^3 + 74(3)^2 − 4(3) − 60 = 486 − 1080 + 666 − 12 − 60 = 0. So we have convergence to a fixed point where the derivative of the function is zero, and we indeed have that convergence is quadratic.
Starting with x0 = 2, fixed point iteration on f6 converges to 1 + √3, and starting with x0 = −1, fixed point iteration converges to 1 − √3. You should be able to verify this from the convergence diagram in Figure 2.2.6 or
from calculating the first several iterations for each yourself. What you do not get from the convergence diagram
is the speed of convergence. For that, you need to look at the iterates. You should do so. Does convergence look
quadratic in these cases too? Answer on page 72.
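If you would like to compute those iterations in Octave rather than by hand, a few lines suffice; this is just a quick sketch, not code from the companion website:

f6 = @(x) (2*x.^3 - 5*x.^2 - 6)./(3*x.^2 - 10*x + 4);
x = 2;                          % also try x = -1
for n = 1:10
  x = f6(x);
  printf("x%d = %.15f\n", n, x);
end%for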
From the convergence diagram, we see that fixed point iteration will converge for virtually any initial value,
and all three fixed points can be estimated by fixed point iteration. Moreover, from our calculations, it looks like
convergence is quadratic for all three. It’s hard to ask for more from a function. Fast convergence to any fixed
point! So whence did f6 come?
Suppose g(x) is differentiable and g(x̂) = 0 so g has a root at x̂. Consider f(x) = x − g(x)/g'(x). Then x̂ is a fixed point of f as long as g'(x̂) ≠ 0:

f(x̂) = x̂ − g(x̂)/g'(x̂) = x̂ − 0/g'(x̂) = x̂.

Moreover, as long as g has a second derivative near x̂,

f'(x̂) = 1 − [g'(x̂) · g'(x̂) − g(x̂)g''(x̂)]/[g'(x̂) · g'(x̂)]
       = 1 − 1 + [0 · g''(x̂)]/[g'(x̂) · g'(x̂)]
       = 0.

From these calculations, we conclude that if g(x) is twice differentiable, g(x̂) = 0, and g'(x̂) ≠ 0, then fixed point iteration of f(x) with initial value in a neighborhood of x̂ will converge quadratically to x̂. What a great way to turn a root finding problem into a fixed point problem!
Now is a good time to recall that f6 was just one of 6 candidate functions designed to find the roots of g(x) = −x^3 + 5x^2 − 4x − 6 by fixed point iteration. Indeed, g'(x) = −3x^2 + 10x − 4 and

x − g(x)/g'(x) = x − (−x^3 + 5x^2 − 4x − 6)/(−3x^2 + 10x − 4)
               = (2x^3 − 5x^2 − 6)/(3x^2 − 10x + 4)
               = f6(x).
Using fixed point iteration on f6(x) = x − g(x)/g'(x) to find roots of g(x), as done here, is called Newton's method.
To compute x1 , the tangent line to g at (x0 , g(x0 )) is drawn and its intersection with the x-axis is x1 . Similarly,
the tangent line to g at (x1 , g(x1 )) is drawn and its intersection with the x-axis is x2 . And so on. For example,
(x0, g(x0)) = (−2.5, 50.875) and g'(x0) = g'(−2.5) = −47.75. Hence, the "rise" (0 − 50.875) over the "run" (x1 + 2.5) between (−2.5, 50.875) and (x1, 0) must equal −47.75. We thus have −50.875/(x1 + 2.5) = −47.75 so

x1 = −50.875/(−47.75) − 2.5 ≈ −1.43455497382199.
In symbols, the "rise" (−g(x0)) over the "run" (x1 − x0) must equal g'(x0). In other words,

−g(x0)/(x1 − x0) = g'(x0)  ⇒
−g(x0)/g'(x0) = x1 − x0  ⇒
x1 = x0 − g(x0)/g'(x0).

Similar calculation shows x2 = x1 − g(x1)/g'(x1), and more generally x_{n+1} = x_n − g(x_n)/g'(x_n). This recurrence relation describes Newton's method—iterating the function f(x) = x − g(x)/g'(x).
Assumptions: g is twice differentiable. g has a root at x̂. x0 is in a neighborhood (x̂ − δ, x̂ + δ) where the magnitude of f'(x) = 1 − [g'(x) · g'(x) − g(x)g''(x)]/[g'(x) · g'(x)] is less than one.
Input: Initial value x0; function g and its derivative g'; desired accuracy tol; maximum number of iterations N.
Step 1: For j = 1 . . . N do Steps 2-4:
Step 2: Set x = x0 − g(x0)/g'(x0);
Step 3: If |x − x0 | ≤ tol then return x;
Step 4: Set x0 = x;
Step 5: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
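A direct Octave transcription of these steps might look like the following; a minimal sketch (the name newtonSketch is made up), not the code distributed with the text:

function x = newtonSketch(g, gp, x0, tol, N)
  % g and gp are function handles for g and its derivative g'.
  for j = 1:N                     % Step 1
    x = x0 - g(x0)/gp(x0);        % Step 2
    if (abs(x - x0) <= tol)       % Step 3
      return
    end%if
    x0 = x;                       % Step 4
  end%for
  disp("Method failed. Maximum iterations exceeded.");  % Step 5
end%function

For example, newtonSketch(@(x) -x.^3+5*x.^2-4*x-6, @(x) -3*x.^2+10*x-4, -2.5, 1e-10, 20) should approximate the root 1 − √3 discussed in this section.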
Table 2.5: The secant method applied to g(x) = −x3 + 5x2 − 4x − 6 with x0 = 5 and x1 = x0 + g(x0 ) = −21.
n    xn                 |3 − xn|
0    5                  2(10)^0
1    −21                2.4(10)^1
2    4.9415730337078    1.941(10)^0
3    4.8869924815972    1.886(10)^0
4    4.0502898397912    1.050(10)^0
5    3.7088949488497    7.088(10)^−1
6    3.412824115541     4.128(10)^−1
7    3.232292913133     2.322(10)^−1
8    3.1141957095727    1.141(10)^−1
9    3.0465011115969    4.650(10)^−2
10   3.0132833760752    1.328(10)^−2
11   3.0020189248976    2.018(10)^−3
12   3.0001014520965    1.014(10)^−4
13   3.0000008128334    8.128(10)^−7
14   3.0000000003297    3.297(10)^−10
Secant Method
The greatest weakness of Newton’s method is the requirement that g 0 be known and used in the calculation.
The derivative is not always accessible or manageable or even known, though. In such a case, it is better to use
Steffensen’s method or the secant method. The secant method is derived by replacing the g 0 of Newton’s method
with a difference quotient. In order for this to make any sense, though, we will need to restate Newton's method in terms of xn. In Newton's method we are iterating f(x) = x − g(x)/g'(x) so x_{n+1} = x_n − g(x_n)/g'(x_n).
Now suppose you have a function g and some iterate xn−1 . That is enough to locate one point on the graph
of g, namely (xn−1 , g(xn−1 )). But we need another point in order to form a difference quotient (the slope of
the line through two points). So suppose we have a second value, x_n, near x_{n−1}. Then [g(x_n) − g(x_{n−1})]/(x_n − x_{n−1}) ≈ g'(x_n), so we can substitute [g(x_n) − g(x_{n−1})]/(x_n − x_{n−1}) for g'(x_n) in Newton's method. This yields the secant method, x_{n+1} = x_n − g(x_n)/([g(x_n) − g(x_{n−1})]/(x_n − x_{n−1})), which simplifies to

x_{n+1} = x_n − g(x_n) · (x_n − x_{n−1})/(g(x_n) − g(x_{n−1})).     (2.4.1)
Notice this is not quite a fixed point iteration scheme. Each iteration depends on the previous two values, not one.
The analysis we’ve done so far does not apply, but there’s hope that convergence will be fast since this method is a
reasonable approximation of Newton’s method near a root, assuming g is differentiable near there. Table 2.5 provides
evidence that the secant method indeed converges quickly. In the particular case of g(x) = −x3 + 5x2 − 4x − 6 with
x0 = 5 and x1 = x0 + g(x0 ) = −21, it takes a while to settle in, but after the first 8 iterations or so, convergence is
very fast. Not quite quadratic, but superlinear for sure.
Crumpet 12: The secant method converges with order (1 + √5)/2.

Suppose g is a function with root x̂, g'(x̂) ≠ 0, g''(x̂) ≠ 0, and g'''(x) exists in a neighborhood of x̂. Let x0, x1, x2, . . . be a sequence derived from the secant method (x_{n+1} = x_n − g(x_n)(x_n − x_{n−1})/(g(x_n) − g(x_{n−1})) for all n ≥ 2) such that lim_{n→∞} x_n = x̂. Define e_n = x_n − x̂ so x_n = x̂ + e_n. Making this substitution into 2.4.1 we have

e_{n+1} = e_n − g(x̂ + e_n) · (e_n − e_{n−1})/(g(x̂ + e_n) − g(x̂ + e_{n−1})).     (2.4.2)

Taylor's theorem allows g(x̂ + e_n) = g(x̂) + e_n g'(x̂) + (1/2)e_n^2 g''(x̂) + O(e_n^3). Noting that g(x̂) = 0 and substituting
into 2.4.2 and simplifying eventually yields equality 2.4.3.
Using equality 2.4.3 to find a value α for which lim_{n→∞} |x̂ − x_{n+1}|/|x̂ − x_n|^α = λ ≠ 0: since 2.4.3 expresses this ratio in terms of e_n^{1−α} e_{n−1}, and lim_{n→∞} e_n = lim_{n→∞} e_{n−1} = 0, the limit lim_{n→∞} e_n^{1−α} e_{n−1} must not be 0 or divergent, for if it were, lim_{n→∞} |x̂ − x_{n+1}|/|x̂ − x_n|^α would be 0 or divergent, respectively. Consequently, there is a positive constant C such that lim_{n→∞} |e_n^{1−α} e_{n−1}| = lim_{n→∞} |e_{n+1}^{1−α} e_n| = C, so lim_{n→∞} |e_{n+1} e_n^{1/(1−α)}| = C^{1/(1−α)}. Now we have

lim_{n→∞} |e_{n+1}|/|e_n|^α = λ ≠ 0   and   lim_{n→∞} |e_{n+1}|/|e_n|^{1/(α−1)} = C^{1/(1−α)} ≠ 0.
Since the order of convergence of a sequence is unique (Exercise 20 of section 1.3) it must be that α = 1/(α − 1)
or α2 − α − 1 = 0. The quadratic formula supplies the desired result.
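The claimed order can also be estimated numerically. Here is a minimal sketch (not from the text) that regenerates the secant iterates of Table 2.5 and computes the ratios ln|e_{n+1}|/ln|e_n|, which drift toward α as the errors shrink:

g = @(x) -x.^3 + 5*x.^2 - 4*x - 6;
x = [5, 5 + g(5)];                 % x0 = 5 and the seeded x1 = x0 + g(x0) = -21
for n = 2:14
  x(n+1) = x(n) - g(x(n))*(x(n) - x(n-1))/(g(x(n)) - g(x(n-1)));
end%for
e = abs(x - 3);                    % errors |3 - xn|
disp(log(e(5:end))./log(e(4:end-1)))   % estimates of alpha; compare to (1+sqrt(5))/2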
So far we have only applied Newton's method and the secant method to the cubic polynomial g(x) = −x^3 + 5x^2 − 4x − 6, a task not strictly necessary. The rational roots theorem, a basic tool from pre-calculus, would give you the roots exactly. The method would have you check ±1, ±2, ±3, and ±6 as possible roots of g. Assuming you did your checks by synthetic division, your work might look something like this:

3    −1    5    −4    −6
          −3     6     6
     −1    2     2     0

meaning g(x) = (x − 3)(−x^2 + 2x + 2). The other two roots would then come from the quadratic formula applied to −x^2 + 2x + 2 and would be (−2 ± √(4 + 8))/(−2) = 1 ± √3.
The solutions of the quadratic equation ax^2 + bx + c = 0 are given by the well-known quadratic formula. Less well-known, and significantly more involved, is any formula for the solutions of the cubic equation ax^3 + bx^2 + cx + d = 0.
One method of solution follows. First, we let

p = (3ac − b^2)/(3a^2)   and
q = (2b^3 − 9abc + 27a^2 d)/(27a^3).

Then we set

w = ∛( −q/2 − √( q^2/4 + p^3/27 ) ).

Third, we set w1, w2, and w3 to the three possible (complex) values of w. Finally, the three solutions of ax^3 + bx^2 + cx + d = 0 are

x_i = w_i − p/(3w_i) − b/(3a),   i = 1, 2, 3.
This is essentially the method of Cardano, published in the 16th century!
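A minimal Octave sketch of this recipe follows (the function name cubicCardano is made up; the principal complex cube root and the cube roots of unity supply the three values of w):

function r = cubicCardano(a, b, c, d)
  % Roots of a*x^3 + b*x^2 + c*x + d = 0 by the recipe above.
  % (Sketch: assumes the computed w is nonzero.)
  p = (3*a*c - b^2)/(3*a^2);
  q = (2*b^3 - 9*a*b*c + 27*a^2*d)/(27*a^3);
  w = (-q/2 - sqrt(q^2/4 + p^3/27))^(1/3);   % principal cube root
  omega = exp(2i*pi/3);                      % primitive cube root of unity
  ws = w*[1, omega, omega^2];                % the three possible values of w
  r  = ws - p./(3*ws) - b/(3*a);
end%function

For the example that follows, cubicCardano(-1, 5, -4, -6) returns (approximations of) 3, 1 + √3, and 1 − √3.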
For example, to solve the equation −x^3 + 5x^2 − 4x − 6 = 0, we start with

p = (3(−1)(−4) − 5^2)/(3(−1)^2) = −13/3   and
q = (2·5^3 − 9(−1)(5)(−4) + 27(−1)^2(−6))/(27(−1)^3) = 92/27.
Then

w = ∛( −92/(2·27) − √( 92^2/(4·27^2) − 13^3/27^2 ) )
  = ∛( −46/27 − √(92^2 − 4·13^3)/54 )
  = ∛( −46/27 − √(−324)/54 )
  = ∛( −46/27 − (1/3)i ).
In polar form, w^3 = (13√13/27) e^{i(tan⁻¹(9/46) − π)}, so we may set w1 = (√13/3) e^{i(tan⁻¹(9/46) − π)/3}, one of the cube roots of w^3. Unfortunately, finding the angle (tan⁻¹(9/46) − π)/3 exactly amounts to solving a cubic equation! However, with a calculator in hand, one can get the approximation −0.982793723247329, which in the end will be good enough. So, the real part of w1 is approximately (√13/3) cos(−0.982793723247329) ≈ .6666666666666667 and the imaginary part is approximately (√13/3) sin(−0.982793723247329) ≈ −1. w1 is suspiciously close to 2/3 − i. And we can check,

(2/3 − i)^3 = (2/3)^3 + 3(2/3)^2(−i) + 3(2/3)(−i)^2 + (−i)^3 = 8/27 − (4/3)i − 2 + i = −46/27 − (1/3)i.

Therefore, w1 = 2/3 − i and we let

w2 = (2/3 − i)(−1/2 + (√3/2)i) = (3√3 − 2)/6 + ((2√3 + 3)/6)i   and
w3 = (2/3 − i)(−1/2 − (√3/2)i) = −(3√3 + 2)/6 + ((3 − 2√3)/6)i.

Finally,

x1 = w1 + 13/(9w1) + 5/3 = w1 + 13w̄1/(9|w1|^2) + 5/3 = w1 + w̄1 + 5/3 = 3
x2 = w2 + 13/(9w2) + 5/3 = w2 + 13w̄2/(9|w2|^2) + 5/3 = w2 + w̄2 + 5/3 = √3 + 1
x3 = w3 + 13/(9w3) + 5/3 = w3 + 13w̄3/(9|w3|^2) + 5/3 = w3 + w̄3 + 5/3 = −√3 + 1
For an equation you most likely did not see in pre-calculus, or calculus for that matter, consider

x − e^x cos√(e^{2x} − x^2) = 0.

You might try to solve this equation exactly, with a pencil and paper, but you would soon run into a dead end. This equation cannot be solved explicitly. The best you can hope for is to approximate the solutions with a numerical method. To get some idea what we are in for, look at the graph of x − e^x cos√(e^{2x} − x^2) in Figure 2.4.1. The
function oscillates wildly, and only oscillates more wildly as x increases. The graph crosses the x-axis 29 times on
the interval from 0 to 4.5 so has 29 roots there! They are
.3181315052047641, 1.668024051576096, 2.062277729598284,
2.439940377216816, 2.653191974038697, . . .
and can be found by Newton’s method with initial values 0, 1.5, 2, 2.4, 2.6, . . .. Can you find the next root? Answer
on page 72.
Figure 2.4.1: The graph of x − e^x cos√(e^{2x} − x^2) crosses the x-axis infinitely many times.
Assumptions: g has a root at x̂. g is differentiable in a neighborhood of x̂. x0 and x1 are sufficiently close
to x̂.
Input: Initial values x0 and x1 ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: Set y0 = g(x0 ); y1 = g(x1 )
Step 2: For j = 1 . . . N do Steps 3-5:
Step 3: Set x = x1 − y1 · (x1 − x0)/(y1 − y0);
Assumptions: g has a root at x̂. g is differentiable in a neighborhood of x̂. x0 is sufficiently close to x̂.
Input: Initial value x0 ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: Set y0 = g(x0 ); x1 = x0 + y0 ; y1 = g(x1 )
Step 2: For j = 1 . . . N do Steps 3-5:
Step 3: Set x = x1 − y1 · (x1 − x0)/(y1 − y0);
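In Octave, the seeded secant pseudo-code above translates into a few lines; a minimal sketch (seededSecantSketch is an illustrative name), not the code from the companion website:

function x = seededSecantSketch(g, x0, tol, N)
  y0 = g(x0);
  x1 = x0 + y0;                    % the "seed": x1 = x0 + g(x0)
  y1 = g(x1);
  for j = 1:N
    x = x1 - y1*(x1 - x0)/(y1 - y0);
    if (abs(x - x1) <= tol)
      return
    end%if
    x0 = x1;  y0 = y1;             % shift the last two iterates
    x1 = x;   y1 = g(x);
  end%for
  disp("Method failed. Maximum iterations exceeded.");
end%function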
Key Concepts
Rational Roots Theorem: If the polynomial p(x) = a0 + a1x + · · · + ak x^k has rational coefficients, then any rational roots of p are in the set {n/d : n is a factor of a0 and d is a factor of ak}.
Synthetic division: A method for calculating the quotient of a polynomial by a monomial. Example on page 68.
Newton's method: A root finding method that generally converges to a root of g(x) quadratically, but requires the use of the derivative. In this method, x0 is chosen and x_{n+1} = x_n − g(x_n)/g'(x_n) is computed for each n > 0.

Secant method: A root finding method that generally converges to a root of g(x) with order approximately 1.618, but does not require the use of the derivative. In this method, x0 and x1 are chosen and x_{n+1} = x_n − g(x_n) · (x_n − x_{n−1})/(g(x_n) − g(x_{n−1})) is computed for each n > 0.
Seeded secant method: A modification of the secant method where x0 is chosen and x1 = x0 + g(x0 ).
3. Write Octave code that implements the seeded secant method as a function.

4. Use your secant method function from question 2 with a tolerance of 10^−5 to find a solution of
(a) e^x + 2^−x + 2 cos x − 6 = 0 using 1 ≤ x0 ≤ 2.
(b) ln(x − 1) + cos(x − 1) = 0 using 1.3 ≤ x0 ≤ 2.
(c) 2x cos x − (x − 2)^2 = 0 using 2 ≤ x0 ≤ 3. [A]
(d) 2x cos x − (x − 2)^2 = 0 using 3 ≤ x0 ≤ 4. [A]
(e) (x − 2)^2 − ln x = 0 using 1 ≤ x0 ≤ 2.
(f) (x − 2)^2 − ln x = 0 using e ≤ x0 ≤ 4.

5. Repeat exercise 4 using your Newton's method code from question 1. [A]

6. Repeat exercise 4 using your seeded secant method code from question 3. [A]

7. Repeat exercise 4 using a tolerance of 10^−10. Taking this new value as the exact value, did using a tolerance of 10^−5 give a result accurate to within 10^−5 of the exact value? [A]

13. Find a value of x0 for which Newton's method will fail to converge to a root of g(x) = 2 + x − e^x.

14. Explain why Newton's method fails to converge for the function g(x) = x^2 + x + 1 with x0 = 1.

15. Let g(x) = (2 ln(1 + x^2) − x)/(1 + x^2). Using Newton's method to find a root of g(x) with x0 = 5 yields x14 = 8.6624821192 and with x̃0 = 1.2 yields x̃14 = 0. Compare the values of x14 and x̃14 with the fourteenth iterations from question 9 and explain any similarities or differences. [A]

16. Let g(x) = e^{3x} − 27x^6 + 27x^4 e^x − 9x^2 e^{2x} and let p0 = 4. Find p10 using Newton's method. HINT: g'(x) = 3e^{3x} − 18(x + x^2)e^{2x} + 27(x^4 + 4x^3)e^x − 162x^5. [A]

17. Newton's method does not introduce spurious solutions. Suppose f(x) = x − g(x)/g'(x) and g'(x̂) ≠ 0. Prove that x̂ is a root of g if and only if x̂ is a fixed point of f. Hint: one direction is proven in the text of this section.

18. The polynomial g(x) = x^4 + 2x^3 − x − 3 has a root x̂ ≈ 1.097740792. Find the largest neighborhood (a, b) of x̂ such that Newton's method converges to x̂ for any initial value x0 ∈ (a, b). [S]
the two points (x0, f(x0)) and (x1, f(x1)). Find the intersection of this line with the x-axis. The x-coordinate of the intersection is x2. Find x3 by intersecting the line through (x1, f(x1)) and (x2, f(x2)) with the x-axis. And so on. Graph the polynomial p(x) = x^3 − 3x + 3, and demonstrate the first iteration of the secant method graphically for x0 = −1 and x1 = −2. [S]

22. Suppose you are using the secant method with x0 = 1 and x1 = 1.1 to find a root of f(x).
(a) Find x2 given that f(1) = 0.3 and f(1.1) = 0.23.
(b) Create a sketch (graph) that illustrates the calculation. HINT: x2 will be located where the line through (x0, f(x0)) and (x1, f(x1)) crosses the x-axis.

23. Use the graph of g to answer the following questions. g has roots at −2π, −π, π, and 2π. [A]
(a) To which root will Newton's method converge if x0 = 2.5?
(b) What will happen if x0 = 0?
(c) Find a positive integer value of x0 for which Newton's method will converge to 2π.
(d) Find a negative value of x0 for which Newton's

25. Use your code from question 2 to find a root of the function in the interval of question 2 on page 43 to within 10^−8. Compare your answer to that from question 4 on page 43. [A]

26. The sum of two numbers is 20. If each number is added to its square root, the product of the two sums is 172.2. Determine the two numbers to within 10^−4 of their exact values. [S]

27. Find an example of a situation in which Newton's method will fail on the second iteration (i.e., x1 may be calculated but x2 may not). [S]

28. Let h(x) = 2.2x^3 − 6.6x^2 + 4.4x and let g(x) = h∘3(x). That is, g(x) = h(h(h(x))). Approximate a root of g'(x).

29. For what values of x0, approximately, will Newton's method converge to −2.5?

30. For the function shown in question 29, find x2 and x3 for the secant method with x0 = −10 and x1 = 6.

31. Let

f(x) = 10 − ∫_0^x e^t/(1 + t) dt.

Approximate the positive root of f. [A]
Answers
Quadratic convergence?
n    xn (x0 = 2)          xn (x0 = −1)
0    2                    −1
1    2.5                  −.7647058823529411
2    2.666666666666667    −.7326286052763475
3    2.722222222222227    −.7320509933083684
4    2.731741086881274    −.7320508075688965
5    2.732050478023325
6    2.732050807568503
⋮    ⋮                    ⋮
     2.732050807568877    −.7320508075688772
The convergence looks quadratic since the number of significant digits of accuracy roughly doubles with the
last couple of iterations.
Next root? The next root is approximately 2.872257717171606. This can be found using Newton’s method with
x0 = 2.81, for example. Note this computation is very sensitive to initial conditions because there are so many
roots near one another. Starting with x0 = 2.8, for example, leads to the root at 9.662623060421268!
1    −1    0    0    1
          −1   −1   −1
     −1   −1   −1    0

This division shows that g(x) = (x − 1)(−x^2 − x − 1), so the other two roots are the solutions of the equation −x^2 − x − 1 = 0, thus deflating the problem to a quadratic. The solutions are (1 ± √(1 − 4))/(−2) = −1/2 ∓ (√3/2)i. By the way, you may also recognize 1 − x^3 as one of the special forms of polynomials, the difference of cubes.
Of course this is all fascinating, but what does this have to do with numerical analysis? What may surprise
you is that fixed point iteration (and, therefore, Newton’s method), the secant method, and Steffensen’s method
can all be used to find complex roots just as well as real ones! In fact, the algorithms need no modification! The
programming language used to implement the methods, of course, does need to be able to handle complex number
arithmetic. Octave does so without ado.
First, finding a root of g(x) = 1 − x^3 and finding a fixed point of f(x) = 1/x^2 are equivalent. Why? Answer on page 80. Setting x0 = −1 + i and applying Newton's method and the secant method to g(x) = 1 − x^3, and Steffensen's method to f(x) = 1/x^2, we get the following:
xi
i Steffensen’s Secant Newton’s
0 −1 + i −1 + i −1 + i
1 −0.85 + 0.8i −0.66666666 + 0.83333333i −0.66666666 + 0.83333333i
2 −0.60313824 + 0.67770639i −0.55034016 + 0.82376444i −0.50869191 + 0.84109987i
3 −0.39846066 + 0.84671567i −0.49763752 + 0.85554014i −0.49932999 + 0.86626917i
4 −0.51660491 + 0.84998590i −0.49932718 + 0.86627140i −0.49999991 + 0.86602490i
5 −0.49910537 + 0.86543351i −0.50000774 + 0.86602504i −0.50000000 + 0.86602540i
6 −0.50000228 + 0.86602568i −0.49999999 + 0.86602540i
7 −0.50000000 + 0.86602540i −0.50000000 + 0.86602540i
8    ⋮    ⋮    ⋮

Each sequence quickly converges to the complex root −1/2 + (√3/2)i. And this is not a fluke or a contrived example.
Generally, these methods work just as well in the complex plane as they do on the real line. One can find real roots
starting with complex numbers too. If we change the initial value x0 to 1 + i, Newton’s method converges to 1, for
example.
Having expanded our view of the methods to include complex numbers, there is a new type of convergence
diagram to consider. We can now look at convergence patterns for the three methods over a host of initial values
in the complex plane, not just the real line. Figure 2.5.1 shows convergence diagrams for Newton’s method with
g(x) = 1 − x^3, the seeded secant method with g(x) = 1 − x^3, and Steffensen's method with f(x) = 1/x^2. Each
diagram covers the part of the complex plane with real parts in [−5, 5] and imaginary parts in [−3.75, 3.75]. The top
left corner of each diagram represents initial value −5 + 3.75i and the bottom right corner represents initial value
5 − 3.75i. The center of each diagram represents the initial value 0. The colors correspond to the three roots, red to 1, green to −1/2 + (√3/2)i, and blue to −1/2 − (√3/2)i. Black corresponds to failure to converge. The different intensities of red, green, and blue correspond to the number of iterations the method took to converge. The greater the intensity, the fewer iterations. We can see that for x0 = 5 + 3.75i, Newton's method and the seeded secant method both converge to −1/2 + (√3/2)i, because the upper right hand corner of each diagram is colored green. Steffensen's method, on the other hand, fails to converge to any root if begun with x0 = 5 + 3.75i, evidenced by the blackness in the upper right hand corner of the convergence diagram.
The dwell represents the maximum number of iterations allowed, so actually the black dots represent initial
values for which convergence was not achieved within a number of iterations equal to or less than the dwell. That’s
different from claiming the method does not converge at all for these initial values. There’s a chance that some of
the blackened initial values would still lead to convergence if allowed more iterations.
Figure 2.5.2: A vertical line and its image under the exponential function.
Two things are very striking about these convergence diagrams. First, the seeded secant method and Newton’s
method converge for a much larger set of initial values than does Steffensen’s method. This is, at least in part,
due to the function chosen. For other functions, there may be a fixed point scheme for which Steffensen’s method
converges on large sets of initial values too. Second, the patterns of colors are extremely intricate, even fractal
in nature. Predicting to which root a method will converge for a given initial value, and indeed whether it will
converge at all, are very difficult questions! And this analysis is done on a rather benign (simple) function.
Consider now a much more complicated problem—finding the roots of g(z) = e^z − z or, equivalently, finding the fixed points of f(z) = e^z. A graph of f(z) (over the real numbers) will quickly convince you that there are no real
number solutions. It will take some thought to determine the nature of any complex solutions.
To that end, fix a real number a0 and consider the vertical line in the complex plane, L_{a0} = {a0 + ib : b ∈ R}. The image of L_{a0} under the exponential function is a circle with radius e^{a0} centered at the origin. Indeed, e^{a0+ib} = e^{a0} e^{ib} = e^{a0}(cos b + i sin b). Thus b parameterizes the circle about the origin with radius e^{a0}. Now, suppose L_{a0} contains a fixed point, ẑ = a0 + ib̂, of the exponential function, f(z) = e^z. Then ẑ = f(ẑ), or a0 + ib̂ = e^{a0}(cos b̂ + i sin b̂). We conclude that the line and the circle intersect at the fixed point. Every fixed point of f is necessarily an intersection of the line L_{a0} with the circle C_{a0} for some a0. Figure 2.5.2 shows a representative
example. In fact, the diagram shows an interesting case: x = a0 ≈ 2.439940377216816. The coordinates of the two
intersections are
(2.439940377216816, ±11.2098911414971).
The interesting thing is
e2.439940377216816+11.2098911414971i ≈ 2.439940377216816 − 11.2098911414971i
and
e2.439940377216816−11.2098911414971i ≈ 2.439940377216816 + 11.2098911414971i.
The two points are images of one another under the exponential function! What we have found here are called peri-
odic points. If we let z1 = 2.439940377216816−11.2098911414971i and z2 = 2.439940377216816+11.2098911414971i,
then e^{z1} = z2 and e^{z2} = z1. Hence, if we iterate z2 = f(z1), z3 = f(z2), z4 = f(z3), z5 = f(z4), and so on, the
sequence z1 , z2 , z3 , z4 , . . . actually looks like
z1 , z2 , z1 , z2 , z1 , z2 , . . . .
The sequence just flops back and forth between z1 and z2 in a periodic fashion. We call such values period 2 points.
They are not fixed points of f (z) but they are fixed points of f (f (z))!
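A quick Octave check of this flip-flop, using the values quoted above (a sketch; outputs are approximate):

format('long')
z1 = 2.439940377216816 - 11.2098911414971i;
z2 = exp(z1)        % approximately the conjugate of z1
exp(z2)             % approximately z1 again, completing the period 2 cycle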
From left to right: Newton’s method with g(z) = z − ez and dwell 20; secant method with g(z) = z − ez and
dwell 40; Steffensen’s method with f (z) = ez and dwell 40. Each diagram covers the part of the complex plane
with real parts in [−10, 30] and imaginary parts in [0, 73].
On the other hand, ẑ = 2.062277729598284 + 7.588631178472513i is (approximately) a fixed point of f(z) since e^ẑ ≈ ẑ. Moreover, the conjugate of ẑ, 2.062277729598284 − 7.588631178472513i, is also a fixed point. Verify it with a calculator or with Octave!
Generally, if ẑ is a fixed point of e^z then so is its conjugate: e raised to the conjugate of ẑ equals the conjugate of e^ẑ, which equals the conjugate of ẑ. So if we find one fixed point, we actually have found two, the fixed point and its conjugate.
We're ready to get back to considering intersections of L_{a0} and C_{a0}. Assume a0 + ib is a fixed point of e^z. Then a0 + ib = e^{a0+ib} = e^{a0}(cos b + i sin b), so

a0 = e^{a0} cos b
b  = e^{a0} sin b.     (2.5.1)

Now, because a0 + ib is a point of intersection, it is on C_{a0}, so a0^2 + b^2 = e^{2a0} ⇒ b = ±√(e^{2a0} − a0^2). Finally, substituting b = √(e^{2a0} − a0^2) into 2.5.1, we find an intersection point will be a fixed point if and only if

a0 = e^{a0} cos√(e^{2a0} − a0^2)   and
√(e^{2a0} − a0^2) = e^{a0} sin√(e^{2a0} − a0^2).     (2.5.2)

You should pause long enough to consider why it is not necessary to substitute b = −√(e^{2a0} − a0^2) into 2.5.1. Hint: make the substitution and simplify. You should find out that the two equations you get are equivalent to those in 2.5.2.
For example, 2.439940377216816 − 11.2098911414971i and 2.062277729598284 + 7.588631178472513i both sat-
isfy the first equation of 2.5.2, but 2.439940377216816 − 11.2098911414971i does not satisfy the second while
2.062277729598284 + 7.588631178472513i does. So, as observed earlier, 2.439940377216816 − 11.2098911414971i is
not a fixed point but 2.062277729598284 + 7.588631178472513i is.
Do you recognize the first equation of 2.5.2? We first saw it on page 69 in section 2.4. As noted there, the smallest five solutions are .3181315052047641, 1.668024051576096, 2.062277729598284, 2.439940377216816, and 2.653191974038697. The values 2.062277729598284 and 2.439940377216816 provided the examples for this discussion. What about the
other three values in this list? Do they give fixed points of the exponential function? Period two points? Something
else? Take a moment to investigate. Answers are on page 80. Using Octave to investigate 2.062277729598284,
which we know is a fixed point:
octave:1> format(’long’)
octave:2> a0=2.062277729598284
a0 = 2.06227772959828
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 7.58863117847251
octave:4> exp(a0+I*b)
ans = 2.06227772959828 + 7.58863117847251i
verifies that e^{a0+ib} = a0 + ib for a0 = 2.062277729598284, at least to machine precision. The exact value of the fixed point is not known, but that is the nature of numerical analysis.
Figure 2.5.3 shows convergence to 12 of the fixed points of ez , one for each of the 12 different colors. The
coordinates of each fixed point can be approximated by locating the spot of greatest intensity within each colored
band.
As was done in Figure 2.5.3, convergence diagrams for the secant method can be created by setting x1 = x0 + δ
for some small number δ. It does not matter whether δ is real or complex. Selecting x1 automatically this way
allows the diagram to show convergence or divergence based on x0 alone, just as is done for the other convergence
diagrams. You will notice that the convergence diagram for the secant method and the convergence diagram for
Newton’s method are quite similar. For sufficiently small δ, this will be the case in general. The secant method
convergence diagram and the Newton’s method convergence diagram for the same function over the same region will
look very much the same. The only significant difference will be the number of iterations needed for convergence.
The secant method will need more iterations to converge.
Exercises
1. Match the function with its Newton’s method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [S]
(a) (b)
(c) (d)
2. Match the function with its Newton’s method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [A]
f(x) = sin x
g(x) = sin x − e^−x
h(x) = e^x + 2^−x + 2 cos x − 6
l(x) = x^4 + 2x^2 + 4
(a) (b)
(c) (d)
(a) −7, 2, 1 ± 5i
(b) −7, 2, 1 + 5i
[S]
(c) −4, −1, 2, ±2i
[S]
(d) −4, −1, 2, 2i
(e) 0, −1 ± i, 1 ± i
(f) −3 + i, −2 − i, −3i, 1 − 2i
4. Create Newton’s method convergence diagrams for the polynomials of question 3. Make sure you capture a region that
shows at least a small area converging to each root. Octave code may be downloaded at the companion website.
5. The functions f(x) = e^x and g(x) = 1/(x^2 + 1) have no roots, real or complex. Find at least two others that also have no roots.

6. Let f(x) = (x^2 − 7x + 10)/2 + sin(3x).
(a) Find all the real roots of f . This is not a polynomial, so deflation will not work. Instead, graph the function and
use Newton’s method to find the real roots accurate to 10−8 . There are four of them.
(b) Create a Newton’s method convergence diagram for f to see if there are any complex roots. If so, use Newton’s
method to approximate them. Use the convergence diagram to help you choose initial values.
(c) Can you find all the roots of f ?
7. Match the function with its seeded secant method convergence diagram. The real axis passes through the center of
each diagram, and the imaginary axis is represented, but is not necessarily centered. [S]
f(x) = sin x
g(x) = sin x − e^−x
h(x) = e^x + 2^−x + 2 cos x − 6
l(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5
(a) (b)
(c) (d)
8. Match the function with its seeded secant method convergence diagram. The real axis passes through the center of
each diagram, and the imaginary axis is represented, but is not necessarily centered. [A]
f(x) = x^4 + 2x^2 + 4
g(x) = (x^2)(ln x) + (x − 3)e^x
h(x) = 1 + 2x + 3x^2 + 4x^3 + 5x^4 + 6x^5
l(x) = (ln x)(x^3 + 1)
(a) (b)
(c) (d)
9. Create seeded secant method convergence diagrams for the polynomials of question 3. Make sure you capture a region
that shows at least a small area converging to each root. Octave code may be downloaded at the companion website.
10. The Newton’s method convergence diagram for one polynomial is much like the Newton’s method convergence diagram
for another. Interesting changes in the Newton’s method convergence diagrams and seeded secant method convergence
diagrams can be achieved by multiplying a polynomial by a non-polynomial function with no roots. Create Newton’s
method and seeded secant method convergence diagrams for products of functions in question 3 with functions in
question 5.
11. Discuss the relative strengths and weaknesses of Newton’s method, the secant method, and the seeded secant method.
Answers
Why equivalent? The equations g(x) = 0 and f(x) = x have exactly the same solutions. g(x) = 0 ⇔ 1 − x^3 = 0 ⇔ 1 = x^3 ⇔ 1/x^2 = x ⇔ f(x) = x.
octave:1> format(’long’)
octave:2> a0=.3181315052047641;
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 1.33723570143069
octave:4> exp(a0+I*b)
ans = 0.318131505204764 + 1.337235701430689i
octave:1> format(’long’)
octave:2> a0=1.668024051576096;
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 5.03244706448616
octave:4> exp(a0+I*b)
ans = 1.66802405157609 - 5.03244706448616i
octave:5> a0=2.653191974038697;
octave:6> b=sqrt(exp(2*a0)-a0^2)
b = 13.9492083345332
octave:7> exp(a0+I*b)
ans = 2.65319197403878 + 13.94920833453319i
 t          p(x)
−3    −4     2     3    −6
             12   −42   117
      −4    14   −39   111
             q(x)        p(t)

tells us that p(x) = −4x^3 + 2x^2 + 3x − 6 = (−4x^2 + 14x − 39)(x + 3) + 111. While it is a small burden to evaluate the expression −4x^3 + 2x^2 + 3x − 6 when x = −3, it is no burden at all to evaluate (−4x^2 + 14x − 39)(x + 3) + 111 when x = −3. The (x + 3) factor is zero, so it doesn't matter to what (−4x^2 + 14x − 39) evaluates. The product is zero and (−4x^2 + 14x − 39)(x + 3) + 111 evaluates to 111. Therefore, p(−3) = 111. Synthetic division gives a quick way to evaluate a polynomial. The number at the end of the division is the value of the polynomial at the value of the divisor.
More generally, here is a dissection of the division of p(x) = a0 + a1 x + · · · + an xn by x − t using synthetic
division:
Beginning with t in the upper left corner, we end up with p(t) in the lower right corner. It is not only when the number in the lower right corner is zero that we find something of interest. Every synthetic division gives something of interest! The number in the bottom right corner is p(t) whether it turns out to be zero or not. And there is more.
The numbers a_n, a_n t + a_{n−1}, (a_n t + a_{n−1})t + a_{n−2}, and so on, appearing in the bottom row of the synthetic division give the coefficients of the quotient, q(x). Every synthetic division gives a decomposition of the polynomial into quotient and remainder. Thus, with every synthetic division, we get an equivalent expression of the form q(x) · (x − t) + p(t). There is still more.
Differentiating the equation p(x) = q(x) · (x − t) + p(t) with respect to x gives p'(x) = q'(x) · (x − t) + q(x). Hence, p'(t) = q'(t) · (t − t) + q(t) = q(t). So, not only do the numbers in the bottom row give the coefficients of the quotient, they double as coefficients appropriate for evaluating p'(t). Returning to the previous example, if we desire to calculate p'(−3), we simply continue the synthetic division as in
desire to calculate p0 (−3), we simply continue the synthetic division as in
−3 −4 2 3 −6
12 −42 117
−3 −4 14 −39 111
12 −78
−4 26 −117
and find out p0 (−3) = −117. The procedure of calculating p(t) and p0 (t) by simultaneous synthetic divisions
is known as Horner’s method and is especially convenient for use in Newton’s method. If we were trying to
find a root of p(x) = −4x3 + 2x2 + 3x − 6 with initial approximation x0 = −3 we would have, at this point,
0) 111
x1 = x0 − pp(x
0 (x ) = −3 − −117 ≈ −2.05128. Yet there is more.
0
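These two simultaneous synthetic divisions are only a few lines of Octave; a minimal sketch (hornerSketch is an illustrative name; coefficients are assumed to be stored from highest degree to lowest):

function [y, yp] = hornerSketch(c, t)
  % c = [a_n, a_(n-1), ..., a_0]; returns y = p(t) and yp = p'(t).
  y  = c(1);         % running value of p(t), built by nesting
  yp = 0;            % running value of p'(t)
  for k = 2:length(c)
    yp = yp*t + y;   % second synthetic division (derivative)
    y  = y*t + c(k); % first synthetic division (value)
  end%for
end%function

hornerSketch([-4 2 3 -6], -3) returns 111 and −117, matching the tableau above.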
−1.18985    −4         2          3         −6
                    4.7594   −8.04267    6.00002
            −4      6.7594   −5.04267    0.00002

The (near) zero at the bottom right indicates that −1.18985 is approximately a root. There is no appreciable remainder upon division of −4x^3 + 2x^2 + 3x − 6 by x + 1.18985. Moreover, the numbers −4, 6.7594, −5.04267 in the bottom row give the coefficients of q(x). Thus, we find from this division that −4x^3 + 2x^2 + 3x − 6 ≈ (−4x^2 + 6.7594x − 5.04267)(x + 1.18985). We can now find the other two roots by locating the roots of q(x) = −4x^2 + 6.7594x − 5.04267. Using the quadratic formula, they are

(−6.7594 ± √(6.7594^2 − 4(−4)(−5.04267)))/(−8) ≈ .84493 ± .73944i.
Our process will lead us to finding n roots of any nth degree polynomial. It is important to note that some of
these roots may be complex and some of them may be repeated.
The process of finding one root of a given polynomial, deflating, and finding another mirrors quite closely the
mathematical theorems of algebra. The Fundamental Theorem of Algebra states that every polynomial with
complex coefficients and degree at least one has a complex root. Thus our search for a root is not in vain! We can
then write our polynomial in factored form and continue. The Fundamental Theorem says that there is again a
root of the deflated polynomial. And if we keep track of all the roots as we find them, we end up writing our
polynomial in the form
p(x) = a(x − r1 )e1 (x − r2 )e2 · · · (x − rk )ek , (2.6.1)
where a is a nonzero constant, r1 , r2 , . . . , rk are the k distinct complex roots, and e1 , e2 , . . . , ek are the so-called
(positive integer) multiplicities of the roots. From this form, we see that the degree of the polynomial equals the
sum of the multiplicities, e1 + e2 + · · · + ek . This is what we mean when we say the number of roots, counting
multiplicity, is equal to the degree of the polynomial. Thus when searching for the roots of a polynomial of degree
n, we know we are looking for n roots, but not necessarily n distinct roots. Some of them may be repeated and
the repetitions are accounted for in the multiplicities. To formalize the claim in equation 2.6.1, we have the
following theorem.
Proof. Suppose n = 1 so p(x) takes the form ax + b with a ≠ 0. Then p(x) = a(x − (−b/a))^1 and thus takes
the required form. Now suppose all polynomials of some degree n ≥ 1 take the required form and let p be a
polynomial of degree n + 1. By the Fundamental Theorem of Algebra, p has a root. Call it ρ. Then x − ρ is
a factor of p so p can be written as p(x) = (x − ρ) · q(x) for some polynomial q of degree n. By the inductive
hypothesis, we have that q takes the required form, so
In either case, p takes the required form and the proof is complete.
To refine the pseudo-pseudo-code into pseudo-code, we will use Newton’s method, assisted by Horner’s method,
in Step 2. The usual drawback of Newton’s method, the requirement that the derivative be known and calculated, is
but a small inconvenience when Horner’s method is employed. But how do we represent polynomials in a computer
program so that we can accomplish Steps 4 and 5? The same way we implement code to execute Horner’s method.
Pseudo-code for Horner’s method, with an array:
As in synthetic division, there is no need to retain the variable to various exponents. Only the coefficients are
needed to define a polynomial. So, in the program, a polynomial is represented by an array of numbers. Putting
together our pseudo-pseudo code, Newton’s method and Horner’s method into a single program, we have a method
for finding all the roots of a polynomial:
Step 15: If the real part of c2 is negative, then set r_{n−1} = s1/(2c3) and r_n = 2c1/s1; else set r_{n−1} = s2/(2c3) and r_n = 2c1/s2;
Steps 4 through 12 implement Newton's method to find a single root, using Horner's method in Steps 7 through 10 to calculate the value of the polynomial and its derivative at x0. Care is taken to calculate and store the coefficients [d] of the quotient for easy referral in Step 13. It is assumed that the square root calculated in Step 14 is the principal branch of the complex square root. Steps 14 and 15 utilize an alternate form of the quadratic formula that avoids the subtraction of nearly equal quantities as much as possible.
When the roots of p(x) = ax^2 + bx + c are small, the numerator of the quadratic formula, x = (−b ± √(b^2 − 4ac))/(2a), is necessarily small. In this case, it is best to match the signs of −b and ±√(b^2 − 4ac) in order to avoid subtracting quantities of nearly equal value. Choosing the sign of the square root term this way gives one of the roots as accurately as possible, but leaves the other root undetermined. Multiplying both numerator and denominator by the conjugate of the numerator gives an alternate expression of the quadratic formula:

(−b ± √(b^2 − 4ac))/(2a) · (−b ∓ √(b^2 − 4ac))/(−b ∓ √(b^2 − 4ac)) = (b^2 − (b^2 − 4ac))/(2a(−b ∓ √(b^2 − 4ac)))
                                                                   = 4ac/(2a(−b ∓ √(b^2 − 4ac)))
                                                                   = 2c/(−b ∓ √(b^2 − 4ac)).

Expanding, we have

(−b + √(b^2 − 4ac))/(2a) = 2c/(−b − √(b^2 − 4ac))

and

(−b − √(b^2 − 4ac))/(2a) = 2c/(−b + √(b^2 − 4ac)).
However, there is little that can be done at this point if zero happens to be a double root. In this instance, both c1
and c2 will be zero or nearly zero, making both s1 and s2 very small. This is why the set of assumptions includes
the stipulation c1 ≠ 0. This ensures that zero is not a root of p.
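Put together, the two forms give a small quadratic solver that avoids the cancellation; a minimal sketch (quadraticStable is a made-up name; it assumes a ≠ 0 and c ≠ 0):

function [r1, r2] = quadraticStable(a, b, c)
  s = sqrt(b^2 - 4*a*c);   % complex when the roots are complex
  if (real(b) >= 0)
    s = -s;                % match the sign of -b to avoid cancellation
  end%if
  r1 = (-b + s)/(2*a);     % standard form, now free of cancellation
  r2 = 2*c/(-b + s);       % alternate form recovers the other root
end%function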
Müller’s Method
Another very fast method for finding roots of equations is Müller’s method . In principle, it is very much like the
secant method. With the secant method, two initial approximations p0 and p1 are made. The secant line through
the points (p0 , f (p0 )) and (p1 , f (p1 )) is drawn and its intersection with the x-axis gives p2 . With Müller’s method,
three initial approximations p0 , p1 , and, p2 are needed. The parabola through the points (p0 , f (p0 )), (p1 , f (p1 )),
and (p2 , f (p2 )) is drawn and its intersection with the x-axis gives p3 . There are a couple of issues to deal with,
however. First, if the parabola so drawn crosses the x-axis at all, it crosses it twice. We need to choose one of the
zeros for p3 . Second, it is possible the parabola will not cross the x-axis at all.
Solving the problem of which root to choose is simple. We assume the approximation p2 is better than the
others, so we choose the root that is closest to p2 . Actually, that solves the second “problem” too. Even when the
parabola does not cross the x-axis, it has zeros. They are complex. And we do not worry about that. We simply
take the complex root that is closest to p2 . This has the nice advantage that even when the coefficients of p(x) are
all real and p0 , p1 , and, p2 are all real, and all the roots of p(x) are complex, it will find a complex root.
As to the business of finding the parabola passing through (p0 , f (p0 )), (p1 , f (p1 )), and (p2 , f (p2 )), we will seek
a parabola P (x) of the form
P (x) = a(x − p2 )2 + b(x − p2 ) + c.
Making the substitutions x = p_i and P(x) = f(p_i) for i = 0, 1, 2 leads to the three equations f(p0) = a(p0 − p2)^2 + b(p0 − p2) + c, f(p1) = a(p1 − p2)^2 + b(p1 − p2) + c, and f(p2) = c. So we find out immediately that c = f(p2), and we must solve the two remaining equations simultaneously for a and b. Now plugging a, b, and c into the quadratic formula gives us roots x = p2 − 2c/(b ± √(b^2 − 4ac)). To choose the one closest to p2, we compare |b + √(b^2 − 4ac)| with |b − √(b^2 − 4ac)| and use the larger. This gives us the smallest value for |x − p2|, the distance of the root from p2.
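One iteration of this procedure might be coded as follows; a minimal sketch (mullerStep is an illustrative name) that fits the parabola and returns the zero closest to p2:

function p3 = mullerStep(f, p0, p1, p2)
  % Fit P(x) = a*(x-p2)^2 + b*(x-p2) + c through the three points.
  h0 = p0 - p2;  h1 = p1 - p2;
  c  = f(p2);
  d0 = f(p0) - c;  d1 = f(p1) - c;
  a  = (h1*d0 - h0*d1)/(h0*h1*(h0 - h1));
  b  = (h0^2*d1 - h1^2*d0)/(h0*h1*(h0 - h1));
  s  = sqrt(b^2 - 4*a*c);        % complex square root when needed
  if (abs(b - s) > abs(b + s))   % use the larger denominator...
    den = b - s;
  else
    den = b + s;
  end%if
  p3 = p2 - 2*c/den;             % ...so |p3 - p2| is as small as possible
end%function

With f(x) = x^3 + 1 and p0, p1, p2 = 1, 2, 3 as in the example below, mullerStep returns one of the two (equidistant) zeros of the fitted parabola.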
For example, we will use Müller's method with p0 = 1, p1 = 2, and p2 = 3 to find a root of f(x) = x^3 + 1. We calculate h0 = p0 − p2 = −2, h1 = p1 − p2 = −1, h2 = h0 − h1 = −1, δ0 = f(p0) − f(p2) = −26, and δ1 = f(p1) − f(p2) = −19, so we get c = f(p2) = 28, b = (h0^2 δ1 − h1^2 δ0)/(h0 h1 h2) = (4(−19) − 1(−26))/(−2) = 25, and a = (h1 δ0 − h0 δ1)/(h0 h1 h2) = (−1(−26) − (−2)(−19))/(−2) = 6. A close look at the graphs of f(x) and P(x) = 6(x − 3)^2 + 25(x − 3) + 28 shows that they do meet three times (at the required points), and that P(x) does not have real roots:
[Graph of f(x) and P(x) on the interval 0 ≤ x ≤ 3.5, showing three intersections and no real roots of P.]
b ± √(b^2 − 4ac) = 25 ± √(625 − 672) = 25 ± i√47. Since |25 + i√47| = |25 − i√47|, it does not matter which root we take. Selecting p3 = p2 − 2c/(b − √(b^2 − 4ac)), we get p3 = 3 − 56/(25 − i√47) = 11/12 − (√47/12)i. Continuing this process gives the iterates 0.75238 − 0.75810i, 0.57069 − 0.84288i, . . . , 0.50000 − 0.86603i, converging to 1/2 − (√3/2)i.
The order of convergence of Müller's method to a simple root (one that is not repeated) is

∛( √11/(3√3) + 19/27 ) + 4/( 9∛( √11/(3√3) + 19/27 ) ) + 1/3 ≈ 1.839286755214161.
The following chart summarizes the relative strengths and weaknesses of Newton’s method, the secant method,
and Müller’s method.
Key Concepts
Synthetic division: A method for dividing a polynomial p(x) by a monomial (x − x0 ) using only addition, multi-
plication, and the coefficients of p. The process is identical to evaluating a polynomial by nesting. Synthetic
division simply provides an organizational tool so that nesting can be accomplished simply with pencil and
paper.
Horner’s method: A method where the value of a polynomial and its derivative at a single point are calculated
simultaneously via synthetic division.
Müller’s method: A root-finding method similar to the secant method where instead of using a secant line a
parabola is used.
Deflation: The method of replacing a polynomial p(x) by the product of a monomial (x − x0 ) and a polynomial
q(x) of degree one less than that of the original polynomial.
setting y to the value of the polynomial and yy to the value of its derivative. Test your code well by comparing outputs of your function to hand/calculator computations.

3. Write an Octave function that implements Newton's method with Horner's method. The first line of your function should be

function x = newtonhorner(c,x0,tol,N)

(b) g(x) = −7 + 8x − 3x^2 + 5x^3 − 2x^4 [A]

7. Use Horner's method to calculate g(−2) and g'(−2) where g(x) = 4x^4 − 5x^3 + 6x − 7. Do not use a computer.

8. Use your work from question 6 to help execute two iterations of Newton's method using a pencil, paper, calculator, and Horner's method/synthetic division. Use initial value x0 = 2. [S][A]
9. Use your work from question 7 to help execute two iterations of Newton's method using a pencil, paper, calculator, and Horner's method/synthetic division. Use initial value x0 = −2.

10. Compute x2 of Newton's method by hand (using Horner's method/synthetic division) for f(x) = x^3 + 4x − 8 starting with x0 = 0.

11. Find x2 of Newton's method by hand (using Horner's method/synthetic division) for f(x) = x^4 − 2x^3 − 4x^2 + 4x + 4 using x0 = 2.

12. Using Horner's method as an aid, and not using your calculator, find the first iteration of Newton's method for the function f(x) = 2x^3 − 10x + 1 using x0 = 2.

13. Demonstrate two iterations of Newton's method (using Horner's method/synthetic division) applied to f(x) = 5x^3 − 2x^2 + 7x − 3 with p0 = 1 by hand.

14. Find all the roots of the polynomial as follows. Use Newton's method with tolerance 10^−5 to approximate a root of the polynomial. You may use your newtonhorner function from question 3. Then use synthetic division to deflate the polynomial one degree. Do not use a computer for deflation. Then use Newton's method with tolerance 10^−5 to approximate a root of the deflated polynomial. Then use synthetic division to deflate the deflated polynomial one degree. Repeat until the deflated polynomial is quadratic. Once this happens, use the quadratic formula (or alternate quadratic formula) to find the last two roots.
(a) g(x) = x^4 + 6x^3 − 59x^2 + 144x − 144 [S]
(b) g(x) = −280 + 909x − 154x^2 − 178x^3 + 54x^4 + 9x^5 [A]

15. Find all the roots of the polynomial as follows. Use Newton's method with tolerance 10^−5 to approximate a root of the polynomial. You may use your newtonhorner function from question 3. Then use synthetic division to deflate the polynomial one degree. You may use your deflate function from question 4 for deflation. Then use Newton's method with tolerance 10^−5 to approximate a root of the deflated polynomial. Then use synthetic division to deflate the deflated polynomial one degree. Repeat until the deflated polynomial is quadratic. Once this happens, use the quadratic formula to find the last two roots. You may use your quadraticRoots function from question 1 for solving the quadratic.
(a) g(x) = x^4 − 2x^3 − 12x^2 + 16x − 40 [S]
(b) g(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5 [A]

16. For each root you found in question 14 except the first one, use it as an initial approximation in Newton's method with tolerance 10^−5 to see if you can refine your roots. Do they change? [S][A]

17. f(x) = x^3 − 1.255x^2 − .9838x + 1.2712 has a root at x = 1.12.
(a) Use Newton's method with an initial approximation x0 = 1.13 to attempt to find this root. Explain what happens.
(b) Find all the roots of f(x).

18. About 800 years ago John of Palermo challenged mathematicians to find a solution of the equation x^3 + 2x^2 + 10x = 20. In 1224, Fibonacci answered the call in the presence of Emperor Frederick II. He approximated the only real root using a geometric technique of Omar Khayyam (1048-1131), arriving at the estimate

1 + 22(1/60) + 7(1/60)^2 + 42(1/60)^3 + 33(1/60)^4 + 4(1/60)^5 + 40(1/60)^6.

How accurate was his approximation?
Reference [5, pg. 96 ex. 10]

19. Calculate the value of the polynomial at the given value of x in two different ways. (i) Use your horner function from question 2; and (ii) use an inline() function. Then (iii) compare the two results using Octave's == operator.
(a) p(x) = x^4 − 2x^3 − 12x^2 + 16x − 40 at x = √3 [S]
(b) q(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5 at x = π/2 [A]
(c) r(x) = x^6 + 11x^4 − 34x^3 − 130x^2 − 275x + 819 at (1 − √5)/2 [A]
(d) s(x) = 5x^10 + 3x^8 − 46x^6 − 102x^4 + 365x^2 + 1287 at 1/e

20. Write an Octave function that uses your functions from questions 1, 3, and 4 to find all the roots of a polynomial. Test your function well on polynomials of various degrees for which you know the roots. You may base your function on the pseudo-code on page 84, but your code should be significantly simpler since you are calling functions instead of writing their code. [A]

21. Use your code from question 20 to find all the solutions of the equation. [A]
(a) x^5 + 11x^4 − 34x^3 − 130x^2 − 275x + 819 = 0
(b) 5x^5 + 3x^4 − 46x^3 − 102x^2 + 365x + 1287 = 0

22. Find all the roots of g(x) = 25x^3 − 105x^2 + 148x − 174.

23. Recall that there are some similarities between the secant method and Müller's method. They each require multiple initial approximations. They each involve calculating the zero of some function passing through these initial points. They both give superlinear convergence to simple roots. And, of course, they are both root finding methods. Let's tweak the idea in the following way. To find roots of g, start as with the secant method, using two approximations, x0 and x1. Then, instead of using the zero of a line through (x0, g(x0)) and (x1, g(x1)), find the function of the form

h(x) = ax^3 + b

passing through (x0, g(x0)) and (x1, g(x1)). Let x2 be the zero of h. Then repeat with x1 and x2 to get x3, and so on.
(a) Let g(x) = 2 ln(1 + x^2) − x, x0 = 5 and x1 = 6. Find x2 using this method.
(b) Find a formula for x2 given any function g(x) and any initial conditions x0 and x1. Your formula should be in terms of x0, x1, g(x0), and g(x1).
(c) Find a general formula for xn in terms of x_{n−2}, x_{n−1}, g(x_{n−2}), and g(x_{n−1}).
(d) Write an Octave function that implements this method and prints out each iteration.

26. The graph of f(x) is shown. Find distinct sets of values p0, p1, and p2 for which Müller's method
(a) will lead to a complex value for p3.
(b) will lead to the root at x ≈ 4.4.
(c) will lead to the root at x ≈ 2.8.
2.7 Bracketing
Bisection is called a bracketed root-finding method. A root is known to lie within a certain interval. Each iteration
reduces the size of the interval and maintains the guarantee the root is within. At each step of the algorithm, the
root is known to be between the latest estimate and one of the previous. These bounds form a bracket around the
root. As the algorithm proceeds, the bracket decreases in size until it is smaller than some tolerance, at which point
the root is known to be close and the algorithm stops.
The problem with bisection is its linear order of convergence. Compared to superlinear methods like the secant
method and Newton’s method, the bisection method just creeps along. But the bisection method has something
the secant method and Newton’s method do not—certainty of convergence. Yes, the secant method and Newton’s
method are fast when they converge, but there is no guarantee they will converge at all.
Methods combining the virtues of the bisection method (guaranteed convergence) and some higher order method
(speed) are called safeguarded methods. They are guaranteed to converge and can do so quickly when the root is
near. Any superlinear method may be bracketed, producing a safeguarded method.
Bracketing
Bracketing means maintaining an interval in which a root is known to lie. Bracketing is used in the bisection method.
With each iteration, the root is known to lie between the two latest approximations. Bracketing is not used in the
secant method nor Newton’s method. There is no guarantee a root remains near the latest approximations.
It is not difficult, however, to combine the bisection method with the secant method or Newton’s method, or
any other high order method for that matter, to form a hybrid method where the root remains bracketed and there
is a chance for fast convergence. In such a method, a candidate for the next iteration is computed according to the
high order method. If this candidate lies within the bracket, it becomes the next iteration. If the candidate lies
outside the bracket, the bisection method is used to compute the next iteration instead.
Bracketed secant method, better known as the method of false position or regula falsi, provides an elementary
example. In fact, the high order method (the secant method) always produces a value inside the bracket, so checking
that point is not necessary. Where false position and the secant method differ is choosing which of the previous
two iterations to keep. In the secant method, it is always the latest iteration which is kept for the next. In false
position, the latest iteration which maintains a bracket about the root is kept for the next whether that iteration
is the latest or not. Bracketed Newton’s method provides a slightly more advanced example because it is entirely
possible an iteration of Newton’s method will land outside the bracket.
Take the function g(x) = 3 − x − sin(x) over the interval [2, 3]. g is continuous on [2, 3], and g(2) ≈ 0.09 and g(3) ≈ −0.14 have opposite signs. Thus [2, 3] brackets a root of g, so let x0 = 2 and x1 = 3. The table shows the computation of the next iteration for bracketed secant method and bracketed Newton's method.
computation of the next iteration for bracketed secant method and bracketed Newton’s method.
                      x0    x1    candidate x2                                       x2
bracketed secant      2     3     x1 − g(x1)(x1 − x0)/(g(x1) − g(x0)) ≈ 2.3912       2.3912
bracketed Newton's    2     3     x1 − g(x1)/g'(x1) ≈ −11.101                        2.5
In bracketed secant, the candidate x2 is accepted, but in bracketed Newton’s method, the candidate x2 is outside
the bracket so it is discarded and x2 according to the bisection method (2.5) is taken instead.
To set up the next iteration, g(x2 ) is calculated. Since g(x2 ) is negative in both methods, the old x1 , which was
3, is discarded and x0 = 2 is “upgraded” to x1 in order to maintain the bracket. This way, g has opposite signs at
x1 and x2 . The following table demonstrates this decision process plus the computation of the next iteration.
                      g(x2)       x1    x2       candidate x3                                      x3
bracketed secant      −0.073141   2     2.3912   x2 − g(x2)(x2 − x1)/(g(x2) − g(x1)) ≈ 2.2165      2.2165
bracketed Newton's    −0.098472   2     2.5      x2 − g(x2)/g'(x2) ≈ 2.0048                        2.0048
Can you fill in x4 based on the values in the following table? Notice the old x1 must be “upgraded” in bracketed
secant but not in bracketed Newton’s. Why? Answers on page 98.
                      g(x3)       x2     x3       candidate x4                                      x4
bracketed secant      −0.015215   2      2.2165   x3 − g(x3)(x3 − x2)/(g(x3) − g(x2)) ≈ 2.1854      ?
bracketed Newton's    0.087906    2.5    2.0048   x3 − g(x3)/g'(x3) ≈ 2.1565                        ?
The next 5 iterations of each method are given here in case you would like to try your hand at computing a few.
And now is a good time to do so. These values were computed using the subsequent Octave code.
bracketed
secant Newton’s
x5 2.18062942638407 2.17925592233708
x6 2.17988957044102 2.17975682599184
x7 2.17977718322867 2.17975706647997
x8 2.17976012038625 2.17975706648003
x9 2.17975753008587 2.17975706648003
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 20 May 2014 %
% Purpose: Implementation of the Method of %
% False Position. %
% INPUT: function g; initial values a and b; %
% tolerance TOL; maximum iterations N %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = falsePosition(g,a,b,TOL,N)
i=1;
A=g(a);
B=g(b);
while (i<N)
b              % echo the current iterate b
x=b-B*(b-a)/(B-A);
if (abs(x-b)<TOL)
return
end%if
X=g(x);
if ((B<0 && X>0) || (B>0 && X<0))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 20 May 2014 %
% Purpose: Implementation of bracketed Newton’s %
% method. %
% INPUT: function g; its derivative gp; initial %
% values a and b; tolerance TOL; maximum %
% iterations N %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedNewton(g,gp,a,b,TOL,N)
i=1;
A=g(a);
B=g(b);
while (i<N)
b                              % display the latest iterate at the start of each pass
x=b-B/gp(b);
if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if
if (abs(x-b)<TOL)
return
end%if
X=g(x);
if ((B<0 && X>0) || (B>0 && X<0))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function
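Similarly, [x,i] = bracketedNewton(@(x) 3-x-sin(x), @(x) -1-cos(x), 2, 3, 10^-6, 100) should retrace the bracketed Newton's iterates 2.5, 2.0048, . . . from the tables above, the derivative of g(x) = 3 − x − sin(x) being g′(x) = −1 − cos(x).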
if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if
Actually, we could add these three lines to the bracketed secant method and it would run just the same. It is
impossible for the secant method to produce a value of x outside the bracket, so the bisection step would never be
executed. The only essential difference between the two functions is the execution of the high order method.
We can use this observation to create a sort of blueprint for bracketing any high order method. Steffensen’s,
Müller’s (as long as the approximation stays real), or Sidi’s (section 3.2), for example, can be bracketed this way.
The following pseudo-pseudo-code represents such a blueprint, giving guidance on how to safeguard a high order
method by combining it with bisection.
Assumptions: g is continuous on [a, b]. g(a) and g(b) have opposite signs.
Input: Interval [a, b]; function g; desired accuracy tol; maximum number of iterations N ; any other variables,
like g 0 in the case of Newton’s method, needed to iterate the superlinear method.
Step 1: Set A = g(a); B = g(b); i = 2;
Step 2: Initialize any other variables needed for superlinear();
Step 3: While i < N do Steps 4-10:
Step 4: Set x = superlinear(a, b, g, . . .);
Step 5: If (x − a)(x − b) > 0 then x = b + (a − b)/2;
Step 6: If |x − b| < tol then return x
Step 7: Set X = g(x);
Step 8: If BX < 0 then set a = b; A = B;
Step 9: Set b = x; B = X; i = i + 1;
Step 10: Update any other variables needed for superlinear();
Step 11: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x within tol of exact root, or message of failure.
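As a concrete illustration of this blueprint, here is a minimal Octave sketch (a hypothetical bracketed wrapper, not a listing from the companion website) in which the superlinear step is passed in as a function handle step(a,A,b,B):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Sketch of the bracketing blueprint. step is a %
% function handle computing one iteration of    %
% the high order method from the bracket        %
% endpoints and their function values, e.g. the %
% secant step @(a,A,b,B) b-B*(b-a)/(B-A).       %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketed(step,g,a,b,TOL,N)
  i=1;
  A=g(a);
  B=g(b);
  while (i<N)
    x=step(a,A,b,B);          % candidate from the high order method
    if ((x-a)*(x-b)>0)        % candidate escaped the bracket...
      x=b+(a-b)/2;            % ...so take a bisection step instead
    end%if
    if (abs(x-b)<TOL)
      return
    end%if
    X=g(x);
    if ((B<0 && X>0) || (B>0 && X<0))
      a=b; A=B;               % keep the endpoint that preserves the bracket
    end%if
    b=x; B=X;
    i=i+1;
  end%while
  x="Method failed---maximum number of iterations reached";
end%function

Passing the secant step shown in the comment should reproduce falsePosition, and passing @(a,A,b,B) b-B/gp(b), with the derivative gp defined beforehand, should reproduce bracketedNewton.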
As motivation for the need to develop bracketed versions of other high order methods, consider the particularly problematic function g(x) = (1 + 10x)/(1 − 10x). It has a root at −1/10, but the bracketed secant method can be very slow to converge to this root. Figure 2.7.1 illustrates this slow convergence beginning with the bracket [a, b] = [−4, .05]. With this unfortunate choice of bracket, the method takes 45 iterations to achieve 10⁻⁵ accuracy. A smarter
algorithm would not only check that each iterate lands within the brackets, but would also check to see that the
high order method is making quick progress toward the root. If it detected that convergence was slow, say slower
than bisection would be, it would take a bisection step instead. Note that bracketed Newton’s method does not
have a significant problem with this function. Given the same initial bracket, it converges to within 10−5 of the root
in only 10 iterations (the first 4 of which are bisection steps). Alas, Newton’s method requires use of the derivative.
A fast bracketed root-finding method that does not require knowledge of the derivative would be quite useful.
In the early 1970s, Richard Brent built upon the work of van Wijngaarden and Dekker to produce a bracketed
method that combines bisection, the secant method, and inverse quadratic interpolation, all the while checking
to make sure the high order method is making sufficiently quick progress toward a root. The result is what is
now known as Brent’s method [3]. It does not require knowledge of the derivative. It is fast. It is guaranteed to
converge. Consequently, it is a popular all-purpose method for finding a root within a bracket when the derivative
is not accessible. The full details of Brent’s method will not be presented here, but a significant step toward that
method will. The method presented here is similar to the MATLAB function fzero [22].
You may recall, in Müller’s method, three initial approximations, say a, b, and c, are needed. The parabola through the points (a, g(a)), (b, g(b)), and (c, g(c)) is drawn and its intersection with the x-axis gives the next iteration. The key element of this method, the process of fitting a quadratic function to the three points, is called interpolation.
Thus Müller’s method could just as well be called the “quadratic interpolation method”.
As you may have guessed, the method of inverse quadratic interpolation is similar. Instead of fitting a quadratic
function to the points (a, g(a)), (b, g(b)), and (c, g(c)), the roles of x and y are reversed. A quadratic function is
fitted to the points (g(a), a), (g(b), b), and (g(c), c) instead. Since x is a function of y in this case, the quadratic
will cross the x-axis exactly once, when y = 0. Evaluating the quadratic at 0 gives the next iteration. Figure 2.7.2
shows quadratic interpolation and inverse quadratic interpolation on the same set of three points. In quadratic
interpolation, y is treated as a function of x. In inverse quadratic interpolation, x is treated as a function of
y. Inverse quadratic interpolation avoids the main complication of quadratic interpolation—calculating its x-axis
crossings. In quadratic interpolation, the quadratic may cross the x-axis twice or not at all! Either way, some choice
needs to be made at every step, and the roots of the quadratic involve the quadratic formula. In inverse quadratic
interpolation, the quadratic is guaranteed to cross the x-axis exactly once, and finding the crossing is just a matter
of evaluating the quadratic at 0. That is, y = 0. Remember, the quadratic gives x as a function of y.
Referring back to the derivation of Müller’s method on page 86, forcing the parabola to pass through the points
(a, A), (b, B), and (c, C), and swapping the roles of x and y, a formula for the inverse parabola, q, just falls out:
q(y) = q0(y − B)² + q1(y − B) + q2

where

q2 = b
q1 = [(A − B)²(c − b) − (C − B)²(a − b)] / [(A − B)(C − B)(A − C)]
q0 = [(C − B)(a − b) − (A − B)(c − b)] / [(A − B)(C − B)(A − C)].
The method of inverse quadratic interpolation has order of convergence about 1.84 under reasonable assumptions.
If the function whose root is being determined has three continuous derivatives in a neighborhood of the root,
the latest three approximations are sufficiently close, and the root is simple, then the order of convergence is the
real solution of
α³ − α² − α − 1 = 0.
We can use inverse quadratic interpolation to approximate it!
>> format(’long’)
>> f=inline(’x^3-x^2-x-1’)
f = f(x) = x^3-x^2-x-1
>> [res,i]=inverseQuadratic(f,1,2,10^-12,100)
res = 1.83928675521416
i = 8
You may recognize this as the order of convergence for Müller’s method. Indeed, any quadratic interpolation
method converges to a simple root with this order.
Reference [29]
x = q(0)
  = B²q0 − Bq1 + q2
  = B² [(C − B)(a − b) − (A − B)(c − b)] / [(A − B)(C − B)(A − C)] − B [(A − B)²(c − b) − (C − B)²(a − b)] / [(A − B)(C − B)(A − C)] + b
  = [ (B²(C − B) + B(C − B)²)(a − b) − (B²(A − B) + B(A − B)²)(c − b) ] / [(A − B)(C − B)(A − C)] + b
  = [ (BC² − B²C)(a − b) − (BA² − B²A)(c − b) ] / [(A − B)(C − B)(A − C)] + b
  = b + [ BC(C − B)(a − b) − AB(A − B)(c − b) ] / [(A − B)(C − B)(A − C)]
  = b + [ (B/A)(C/B − 1)(a − b) − (A/C)(1 − B/A)(c − b) ] / [ (1 − B/A)(C/B − 1)(A/C − 1) ].
To make the computation of x a little more programmer-friendly, some new variables are introduced. Let

r = B/A − 1,    s = C/B − 1,    t = A/C − 1

so

x = b − [r(t + 1)(c − b) + s(r + 1)(a − b)] / (rst).    (2.7.1)
Inverse quadratic interpolation can be bracketed just like any other high order method. But it does present an
interesting question that not all high order methods do. Three points are necessary for a quadratic interpolation,
so when they are used to produce the next iteration, a fourth point is generated. Of the four points, the computer
needs to decide which two will become the next bracket, and which point should be the third needed for the next
interpolation. But we are getting ahead of ourselves.
Each iteration begins with three points, (a, g(a)), (b, g(b)), and (c, g(c)) where a and b bracket a root and c is a
third point. For the first iteration, only the bracket is given. c is set equal to a. For every iteration, the signs of
g(a) and g(b) are checked to ensure that a and b bracket a root. If they are opposite, the method proceeds. If they
are the same, that means g(b) and g(c) must have opposite signs, so a is set equal to c. Next, the absolute values of
g(a) and g(b) are checked. If |g(a)| < |g(b)|, the labels of a and b are switched and c is set equal to the new value of
a. After these initial checks, the computation of the next iteration begins with assurance that a root lies between
a and b; b is likely the best estimate of the root to date; and c is likely the worst estimate of the root to date.
If c = a after the initial checks and possible relabeling, then quadratic interpolation is impossible. The next
iteration is generated by the secant method (linear interpolation) instead. If c ≠ a after the initial checks and
possible relabeling, a candidate for the next iteration, x, is calculated according to inverse quadratic interpolation.
If the candidate lies within the bracket, it is accepted as the next iteration. If it lies outside the bracket, a step
of the bisection method is used instead. In either case, c is set equal to b and b is set equal to x. For bracketed
inverse quadratic interpolation, this completes one iteration. The method is then repeated until a sufficiently good
approximation is found.
In the best-case scenario, inverse quadratic interpolation is used at every step and convergence is superlinear
with order about 1.84. In the worst-case scenario, one of the high order methods is used at every step, but the
function is pathological and convergence is slow, possibly even slower than bisection. Slow convergence is rare,
though, and the actual order of convergence can not be pinned down in general. The method switches between
methods of different orders. The best we can say is it is usually fast.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Purpose: Implementation of bracketed inverse  %
%          quadratic interpolation.             %
% INPUT: function g; initial values a and b;    %
%        tolerance TOL; maximum iterations N    %
% OUTPUT: approximation x and number of         %
%         iterations i; or message of failure   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedInverseQuadratic(g,a,b,TOL,N)
i=1;
A=g(a);
B=g(b);
c=a; C=A;
while (i<N)
b                              % display the latest iterate at the start of each pass
if (B*A>0)
a=c; A=C;
end%if
if (abs(A) < abs(B))
c=b; C=B;
b=a; B=A;
a=c; A=C;
end%if
if (a==c)
x=(b*A-a*B)/(A-B);
else
r=B/A-1; s=C/B-1; t=A/C-1;
p=(t+1)*r*(c-b)+(r+1)*s*(a-b);
q=t*s*r;
x=b-p/q;
end%if
if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if
if (abs(x-b)<TOL)
disp(" ");
return
end%if
c=b; C=B;
b=x; B=g(b);
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function
Applying the bracketed inverse quadratic interpolation method to the problematic function g(x) = (1 + 10x)/(1 − 10x) over the interval [−4, .05] yields the result within 10⁻⁵ accuracy in only 11 iterations. The method took only 1 iteration more than bracketed Newton’s without requiring knowledge of the derivative of g! bracketedInverseQuadratic.m
may be downloaded at the companion website.
Stopping
In all of our root-finding methods, the algorithm stops when the difference between consecutive iterations is less
than some tolerance. This criterion is based on the assumption that the error will be no more than this difference.
And that is a safe assumption for any method that is converging superlinearly when it quits. Indeed, it is even
safe for the linearly converging bisection method where the difference between consecutive iterations is exactly the
theoretical bound on the error.
The criterion is not safe when a superlinear method is used far enough from a root that superlinear convergence
is not observed. This is exactly what happens in the figure on page 94. The difference between consecutive iterations
is actually larger than the absolute error at every step. This is an unusual situation, but it can happen.
The criterion is also not safe when a method is linearly convergent with a limiting convergence constant λ > 1/2.
However, linearly convergent methods should never be used on their own as there is always a faster alternative.
There is one more important consideration regarding stopping. Stopping when the difference between consecutive
iterations is less than some tolerance is dependent on the absolute error. When roots could be very small or very
large, it is perhaps better to use a criterion based on relative error. Instead of stopping when |xn+1 − xn | < tol, for
example, we would instead stop when |xn+1 − xn | < tol · |xn+1 |.
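In code this is a one-line change. For example (a sketch of the idea, not a listing from the text), the stopping test in falsePosition could be replaced by

if (abs(x-b)<TOL*abs(x))       % relative-error stopping criterion
  return
end%if

which is essentially what exercise 12 below asks you to do for the inverse quadratic interpolation function.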
Key Concepts
Bracketing: Iteratively refining an interval, also known as the bracket, in which a root is known to lie until it is smaller than some tolerance.
Inverse quadratic interpolation: A quadratic in y is fit to three consecutive approximations of a root. The
intersection of the quadratic with the x-axis becomes the next iteration.
Bracketed secant method: A combination of the secant method and bisection method employing bracketing. At
each iteration, if the secant method produces a value inside the current bracket, it becomes the next iteration.
Otherwise bisection is used to produce the next iteration.
False position: Another name for the bracketed secant method.
Regula falsi: Another name for the bracketed secant method.
Bracketed Newton’s method: A combination of Newton’s method and the bisection method employing brack-
eting. At each iteration, if Newton’s method produces a value inside the current bracket, it becomes the next
iteration. Otherwise bisection is used to produce the next iteration.
Bracketed inverse quadratic interpolation: A combination of inverse quadratic interpolation, the secant method,
and bisection employing bracketing. At each iteration, if inverse quadratic interpolation produces a value in-
side the current bracket, it becomes the next iteration. Otherwise either the secant method or bisection is
used to produce the next iteration.
Exercises

(a) f(x) = 3 − x − sin x; [2, 3] [A]
(b) g(x) = 3x⁴ − 2x³ − 3x + 2; [0, 1]
(c) g(x) = 3x⁴ − 2x³ − 3x + 2; [0, 0.9] [S]
(d) h(x) = 10 − cosh(x); [−3, −2]
(e) f(t) = √(4 + 5 sin(t/2)) − 2.5; [−600, −500] [A]
(f) g(t) = 3t tan(t)/(1 − t²); [3490, 3491]
(g) h(t) = ln(3 sin t) − 3t/5; [1, 2]
(h) f(r) = e^(sin r) − r; [−20, 20]
(i) g(r) = sin(e^r) + r; [−3, 3]
(j) h(r) = 2^(sin r) − 3^(cos r); [1, 3] [A]

2. Repeat question 1 using bracketed Newton’s method. [S][A]

3. Repeat question 1 using the secant method. Compare your answer with that of false position. [S][A]

4. Repeat question 1 using Newton’s method. Compare your answer with that of bracketed Newton’s method. [S][A]

5. Repeat question 1 using Octave and a tolerance of 10⁻⁶. [S][A]

6. Repeat question 2 using Octave and a tolerance of 10⁻⁶. [S][A]

9. Write a bracketed Steffensen’s method Octave function. REMARK: Steffensen’s method is a fixed point finding method. It solves the equation f(x) = x, not f(x) = 0. So a proper bracket [a, b] is one for which (f(a) > a and f(b) < b) or (f(a) < a and f(b) > b). Geometrically, this means the points (a, f(a)) and (b, f(b)) are on opposite sides of the line f(x) = x, analogous to a root-finding bracket where the two points are on opposite sides of the line f(x) = 0. [S]

10. Use your code from question 9 to repeat question 1 using Octave, bracketed Steffensen’s method, and a tolerance of 10⁻⁶. Given that you are looking for a root of g(x), use f(x) = g(x) + x in your call to Steffensen’s method. [S][A]

11. Compare the results of questions 7 and 10. [A]

12. Rewrite the inverseQuadraticInterpolation Octave function so that it stops when the (approximated) relative error is less than the tolerance.

13. Use your code from question 12 to repeat question 1 with a tolerance of 10⁻⁶. [S][A]

14. Compare the results of questions 7 and 13. [A]
Answers
x4 : In both methods, the candidate x4 is accepted since in each case, x4 is within the bracket formed by x2 and
x3 . So, for bracketed secant, x4 = 2.1854, and for bracketed Newton’s, x4 = 2.1565. x1 is upgraded to x2 in
bracketed secant because g(x3 ) is negative. g(x2 ) and g(x3 ) must have opposite signs in order to maintain
the bracket. x1 is not upgraded in bracketed Newton’s because g(x3 ) is positive.
Chapter 3
Interpolation
3.1 A Root-Finding Challenge

(figure: the graph of F over [0, 1])

The function graphed here, which we will call F, could easily be mistaken for a cubic or higher degree polynomial, but it is far from so nice.
First, its domain is the interval [0, 1], so the graph shown is the entire graph. Second, it has but two derivatives.
Third, its definition is a touch unusual. More on that soon.
What we have here is the antiderivative of a fractal interpolating function. An interpolating function is a function
that contains a set of prescribed points. This one happens to be fractal in nature, thus a fractal interpolating
function. The fractal interpolating function, f, passes through

(0, .123),   (.33, −.123),   (1, .5)                    (3.1.1)

in such a way that the graph shown is that of its antiderivative. The unusual nature of the definition of F is derived
from the unusual nature of the definition of f :
f(x) = f1 + c1·(x/α) + d1·f(x/α)                                   for 0 ≤ x ≤ α
f(x) = f2 + c2·((x − α)/(1 − α)) + d2·f((x − α)/(1 − α))           for α ≤ x ≤ 1

where

f1 = 8979/100000,    c1 = −34779/100000,    d1 = 27/100
f2 = −75891/550000,  c2 = 317391/550000,    d2 = 67/550
α = 33/100.
Fractal interpolating functions are not restricted to passing through three points. Actually, three is the minimum.
In general, for n ≥ 3, suppose x1 < x2 < · · · < xn. The linear fractal interpolating function (there are other types of fractal interpolating functions) passing through each of the points (x1, y1), (x2, y2), . . . , (xn, yn) is defined in terms of maps L1, L2, . . . , Ln−1.
The ai , ci , ei , and fi are calculated based on the requirement that the function interpolate the given points. In
particular, we require
Li(x1, y1) = (xi, yi)   and   Li(xn, yn) = (xi+1, yi+1).
The di are free parameters with the restriction |di | < 1. It is a straightforward algebraic exercise to show
ai = (xi+1 − xi)/(xn − x1)
ci = (yi+1 − yi − di(yn − y1))/(xn − x1)
ei = xi − ai x1
fi = yi − ci x1 − di y1.
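These formulas translate directly into Octave. The following is a minimal sketch (the function name and calling convention are hypothetical, not from the text):

% Coefficients of the maps L_i for the linear fractal interpolating function
% through (x(1),y(1)),...,(x(n),y(n)), given free parameters d(i) with |d(i)|<1.
function [a,c,e,f] = fifCoefficients(x,y,d)
  n=length(x);
  for i=1:n-1
    a(i)=(x(i+1)-x(i))/(x(n)-x(1));
    c(i)=(y(i+1)-y(i)-d(i)*(y(n)-y(1)))/(x(n)-x(1));
    e(i)=x(i)-a(i)*x(1);
    f(i)=y(i)-c(i)*x(1)-d(i)*y(1);
  end%for
end%function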
In concert, the Li define the function f, each Li responsible for the subset [xi, xi+1] of the domain. Here

Li(x, y) = (ai x + ei, ci x + di y + fi),

so as Li takes x to ai x + ei, it simultaneously takes y to ci x + di y + fi.
Noting that Li takes this action on the function f , we must have that f (ai x + ei ) = ci x + di f (x) + fi on [x1 , xn ],
or equivalently,
f(x) = fi + ci (x − ei)/ai + di f((x − ei)/ai)   on [xi, xi+1].
Putting all the pieces together, f is defined by
f(x) = h1(x)      for x1 ≤ x ≤ x2
f(x) = h2(x)      for x2 ≤ x ≤ x3
. . .
f(x) = hn−1(x)    for xn−1 ≤ x ≤ xn

where

hi(x) = fi + ci (x − ei)/ai + di f((x − ei)/ai).
Consequently, F(x) = ∫_{x1}^{x} f(t) dt is defined by

F(x) = ∫_{x1}^{x} h1(t) dt                       for x1 ≤ x ≤ x2
F(x) = F(x2) + ∫_{x2}^{x} h2(t) dt               for x2 ≤ x ≤ x3
. . .
F(x) = F(xn−1) + ∫_{xn−1}^{x} hn−1(t) dt         for xn−1 ≤ x ≤ xn
and

f′(x) = h1′(x)      for x1 ≤ x ≤ x2
f′(x) = h2′(x)      for x2 < x ≤ x3
. . .
f′(x) = hn−1′(x)    for xn−1 < x ≤ xn
as long as f′ exists! If di/ai < 1 for all i, then the derivative will exist almost everywhere, but will generally be
discontinuous. If we also have h0i (xi+1 ) = h0i+1 (xi+1 ) for all i = 1, 2, . . . , n − 2, then the derivative will exist and
will be continuous.
The definition of f is self-referential. Its values are defined by, among other terms, values of itself! This makes
evaluating the function a bit different from evaluating a typical function. For example, by virtue of the fact that f
passes through the points 3.1.1, we must have f (0) = .123, f (.33) = −.123, and f (1) = .5, facts we can check easily
enough. According to the definition,
f (0) = f1 + d1 f (0) = .08979 + .27f (0)
so f(0) is defined in part by itself. We need to solve the equation f(0) = .08979 + .27f(0) to find f(0). Thus we have f(0) = .08979/.73 = .123, as promised. Again according to the definition,
f(1) = f2 + c2 + d2 f(1) = −75891/550000 + 317391/550000 + (67/550) f(1).
Solving for f(1), we have f(1) = (−75891/550000 + 317391/550000)/(1 − 67/550) = 1/2, as promised. Since α = .33, the definition actually gives two ways to calculate f(.33). According to the first part of f,
f(.33) = f(α) = f1 + c1 + d1 f(1)
       = 8979/100000 − 34779/100000 + (27/100)(1/2)
       = −.123.
Now is a good time to verify that f (α) = −.123 according to the second part of f as well. Try it! Calculating other
values of f can be a bit more challenging, but there are still a few that are not so bad. α² < α and α + (1 − α)α > α, so

f(α²) = f1 + c1 α + d1 f(α)
      = 8979/100000 − (34779/100000)(33/100) + (27/100)(−123/1000)
      = −.0581907

f(α + (1 − α)α) = f2 + c2 α + d2 f(α)
      = −75891/550000 + (317391/550000)(33/100) + (67/550)(−123/1000)
      = 2060703/55000000
      = .037467327
Answers on page 105. More generally, once you have calculated f (x) for some value x, you can then calculate f (αx)
and f(α + (1 − α)x) from it.
Now that we have a handle on f, we define F by F(x) = ∫_{0}^{x} f(t) dt for all x ∈ [0, 1]. Integrating f(x) we have
F(x) = f1 x + c1 x²/(2α) + α d1 F(x/α)                                              for 0 ≤ x ≤ α
F(x) = F(α) + f2(x − α) + c2 (x − α)²/(2(1 − α)) + (1 − α) d2 F((x − α)/(1 − α))    for α ≤ x ≤ 1
where again both formulas are applicable when x = α. Just like f , F is self-referential. We must go through the same
process in finding values of F as we did finding values of f . To get started, F (0) = αd1 F (0) ⇒ (1 − αd1 ) · F (0) = 0,
but α and d1 are both less than 1, so 1 − αd1 ≠ 0. Therefore,

F(0) = 0/(1 − αd1) = 0.
We could have computed this value by integration just as well: F(0) = ∫_{0}^{0} f(t) dt = 0. Now, according to the formula,

F(1) = F(α) + (1 − α)(f2 + c2/2 + d2 F(1))

and

F(α) = α(f1 + c1/2 + d1 F(1)),
a system of two equations in the two unknowns, F (α) and F (1). Its solution is
F(α) = −121012947/6081400000 ≈ −.01989886325517151
F(1) = 5361861/60814000 ≈ .0881682014009932.
Now that we have the few values, F (0), F (α), and F (1), we can calculate others as before. The values F (αx) and
F (α + (1 − α)x) will both depend on the value of F (x). So we can compute F (α2 ) and F (α + (1 − α)α):
F(α²) = f1 α² + c1 α³/2 + α d1 F(α)
      = 10678194456039/6081400000000000
      ≈ .001755877668964219

F(α + (1 − α)α) = F(α) + f2(1 − α)α + c2(1 − α)α²/2 + (1 − α) d2 F(α)
      = −94196657189979/3040700000000000
      ≈ −.03097860926430723.
Now you can calculate F (α3 ), F (α(α + (1 − α)α)), F (α + (1 − α)α2 ), and F (α + (1 − α)(α + (1 − α)α)) yourself.
Answers on page 105. You shouldn’t worry about calculating these values exactly. That would require a computer
algebra system with arbitrary precision and is not really the point. The point is to make sure you understand how
to do the calculations. Use a calculator or Octave and the approximate values already calculated.
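Calculations like these are natural to automate. The following is a minimal sketch (not the more general routine whose header appears in the Octave section below) that approximates f(x) for the three-point example by recursing on the self-referential definition; truncating the recursion at a fixed depth is safe because |d1| and |d2| are less than 1, so the neglected term is geometrically small.

% Approximate f(x) for the three-point example by recursion on the
% self-referential definition, truncated at a fixed depth.
function y = fhat(x,depth)
  f1=8979/100000;   c1=-34779/100000;  d1=27/100;
  f2=-75891/550000; c2=317391/550000;  d2=67/550;
  alpha=33/100;
  if (depth==0)
    y=0;                              % truncation error shrinks like max(|d1|,|d2|)^depth
  elseif (x<=alpha)
    y=f1+c1*(x/alpha)+d1*fhat(x/alpha,depth-1);
  else
    y=f2+c2*(x-alpha)/(1-alpha)+d2*fhat((x-alpha)/(1-alpha),depth-1);
  end%if
end%function

For example, fhat(0,40), fhat(33/100,40), and fhat(1,40) should return values agreeing with .123, −.123, and .5 to well within machine precision.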
Differentiating the definition of f (where the derivative exists) gives

f′(x) = c1/α + (d1/α) f′(x/α)                          for 0 ≤ x ≤ α
f′(x) = c2/(1 − α) + (d2/(1 − α)) f′((x − α)/(1 − α))  for α ≤ x ≤ 1.
Octave
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 19 February 2014 %
% Purpose: Calculate values of the fractal interpolating %
% function, f, passing through %
% (0,f_0), (alpha,f_alpha), and (1,f_1), %
% its derivative and its integral. %
% INPUT: value at which to evaluate, x; array of values, %
Answers
Evaluating f : The following are a few values of f :
f (α3 ) ≈ .03620418000000000
f (α(α + (1 − α)α)) ≈ −.09176089063636364
f (α + (1 − α)α2 ) ≈ −.08222890363636364
f (α + (1 − α)(α + (1 − α)α)) ≈ .1846063473223140.
Evaluating F : The following are a few values of F :
F(α³) ≈ .002702687013731212
F(α(α + (1 − α)α)) ≈ −.003859289400223274
F(α + (1 − α)α²) ≈ −.02753062961856850
F(α + (1 − α)(α + (1 − α)α)) ≈ −.01466250212441314.
3.2 Lagrange Polynomials
As written, Ln is called the Lagrange form of Pn . For sake of brevity, it is often called the Lagrange interpolating
polynomial, or even Lagrange polynomial. However, the interpolating polynomial of least degree by any other name
would be but Pn . We will adhere to the practice of calling it the interpolating polynomial of least degree, or use
the notation Pn , when the form is unimportant and will add the phrase Lagrange form, or use the notation Ln ,
when it is.
The main use for interpolating polynomials in numerical analysis is to approximate non-polynomial functions in
the following way. Suppose we know the value of f at a selection of points. That is, we know f(x0) = y0, f(x1) = y1, . . . , f(xn) = yn and perhaps not much more. The interpolating polynomial of least degree passing through the
n + 1 points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn )
will, by construction, agree with f at x0 , x1 , . . . , xn and we can say with some precision how closely this interpolating
polynomial agrees with f at other points as well. The values of the interpolating polynomial at these “other points”
are what we refer to as approximations of the non-polynomial function.
Setting a = min(x0 , . . . , xn , x) and b = max(x0 , . . . , xn , x), we have the following result. If f has n+1 derivatives
on (a, b) and f, f 0 , f 00 , . . . , f (n) are all continuous on [a, b], then there is a value ξx ∈ (a, b) such that
f(x) − Pn(x) = f^(n+1)(ξx)/(n + 1)! · (x − x0)(x − x1) · · · (x − xn).    (3.2.3)
Ironically, this result is proven by considering the Lagrange form of an interpolating polynomial in t that is equal
to the error at x and equal to zero at each xi . That polynomial is
Λ(t) = [Pn(x) − f(x)] · (t − x0)(t − x1) · · · (t − xn) / [(x − x0)(x − x1) · · · (x − xn)].
Crumpet 20: Λ
Λ is the (capital) eleventh letter of the Greek alphabet and is pronounced lam-duh . The lower case version, λ,
appears much more commonly in mathematics and often represents an eigenvalue.
Subtracting this polynomial from the error, e(t) = Pn(t) − f(t), we have a function,

g(t) = e(t) − Λ(t),

that is zero for all t = x0, x1, . . . , xn, x. Since g, g′, . . . , g^(n) are all continuous on [a, b] and g^(n+1) exists on (a, b), by Generalized Rolle’s Theorem, there is a value ξx ∈ (a, b) such that g^(n+1)(ξx) = 0. On the other hand, since Pn has degree at most n,

g^(n+1)(t) = e^(n+1)(t) − Λ^(n+1)(t) = −f^(n+1)(t) − Λ^(n+1)(t).
But, Λ is a polynomial of degree n + 1 in t, so its (n + 1)st derivative with respect to t is constant with respect to
t. We write Λ as
Λ(t) = [Pn(x) − f(x)] / [(x − x0)(x − x1) · · · (x − xn)] · t^(n+1) + bn t^n + · · · + b0
for some constants bn , bn−1 , . . . , b0 , and consequently,
Λ^(n+1)(t) = [Pn(x) − f(x)] / [(x − x0)(x − x1) · · · (x − xn)] · (n + 1)!.

Setting g^(n+1)(ξx) = 0 therefore gives

f^(n+1)(ξx) = [f(x) − Pn(x)] / [(x − x0)(x − x1) · · · (x − xn)] · (n + 1)!
or, equivalently,
f^(n+1)(ξx)/(n + 1)! · (x − x0)(x − x1) · · · (x − xn) = f(x) − Pn(x)
as desired.
Figure 3.2.1 shows interpolating polynomials for three different functions. The x-coordinates of the prescribed
points are the same for each interpolating polynomial; the four x-coordinates between 0 and 1 were selected by a random number generator. The interpolating polynomial
closely resembles the function only in the first case. The sixth derivative of f helps explain why.
Our error term,

f^(6)(ξ)/6! · (x − x0)(x − x1) · · · (x − x5),

implies that the sixth derivative of f and the polynomial h(x) = (x − x0)(x − x1) · · · (x − x5)/6! determine how much f and L6 will differ. By bounding both f^(6) and |h| over the interval [0, 1], we can get a bound on the difference between f and L6. The graphs of f^(6) are shown in Figure 3.2.1. The graph of h is
(figure: the graph of h over [0, 1])
so max_{x∈[0,1]} |h(x)| occurs around 0.75. We can use a root-finding method applied to h′ to find that the maximum of |h| is approximately h(.7409254943919) ≈ 2.506891519629 × 10⁻⁶, a relatively small number. On the other hand, for f(x) = e^(sin((x+1)²)), we find max_{x∈[0,1]} f^(6)(x) ≈ f^(6)(.6777170541644) ≈ 44013.74605321, a relatively large number. The product of these two values gives a bound on the error. The absolute furthest L6 can be from f over the interval [0, 1] is 0.11, a relatively small number. The actual error is considerably smaller, so can barely be noticed in the top left graph of Figure 3.2.1.
Figure 3.2.1: Three interpolating functions. From top to bottom, e^(sin((x+1)²)), sin(e^((x+1)²)), and a fractal function.
For f(x) = sin(e^((x+1)²)), we find max_{x∈[0,1]} f^(6)(x) ≈ f^(6)(1) ≈ 8.552147927657737 × 10¹³, a relatively large number. The resulting error bound is a huge number relative to the values of f. So the theoretical error bound does not predict good results for
this interpolation. In fact, it suggests that the interpolation could have been much, much worse! L6 might have
differed from f by over 2 million, a fact that should be worrisome considering f takes values between −1 and 1. An
approximation that is off by even 1 is completely useless for this particular f . As it is, we should not be surprised
that L6 is not a good approximation of f since the error term can be quite large. Nonetheless, the method is sound.
Failure to approximate f well should not be seen as a flaw in the method, but rather a flaw in its application. If
we really wanted to approximate f well, we would need to find a different set of points over which to interpolate.
For the fractal function in the bottom left of Figure 3.2.1, our error estimate is entirely irrelevant. The sixth
derivative of f does not exist. In fact, even the first derivative of f does not exist. We have no way to estimate
the error except to look at the graphs. And as we see, L6 again does a very poor job of approximating f . Failure,
again, should not be seen as a flaw in the method, but rather in its application. Approximating a function with an
interpolating polynomial presumes that the function has sufficient derivatives.
Suppose f is a continuous function on the interval [0, 1], and define the polynomial
Bn(x) = Σ_{ν=0}^{n} f(ν/n) (n choose ν) x^ν (1 − x)^(n−ν),    n = 1, 2, 3, . . .

Then

lim_{n→∞} Bn(x) = f(x)

uniformly. That is, lim_{n→∞} max{|Bn(x) − f(x)| : x ∈ [0, 1]} = 0. The Bn are Bernstein polynomials. Shown
below are B4 , B20 , B100 , and B500 for the fractal function in figure 3.2.1.
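Bernstein polynomials are also easy to evaluate numerically. Here is a minimal Octave sketch (illustrative only, not from the text; nchoosek makes it practical only for moderate n):

% Evaluate the Bernstein polynomial B_n of the function handle f at x
% (x may be a scalar or an array).
function B = bernstein(f,n,x)
  B=zeros(size(x));
  for nu=0:n
    B=B+f(nu/n)*nchoosek(n,nu)*x.^nu .* (1-x).^(n-nu);
  end%for
end%function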
The line the secant method uses to calculate any given iteration can be viewed as an interpolating line! It passes through two points lying on g. Hence,
it is an approximation of g.
Having taken this point of view, we can now imagine generalizing the method by using the derivative of a higher
degree interpolating polynomial to approximate g 0 at each step. Such a generalized method, which we will call
Sidi’s k th degree method [30], is summarized by the formula
xn+1 = xn − g(xn)/p′n,k(xn)
where pn,k is the interpolating polynomial passing through the points
(xn , g(xn )), (xn−1 , g(xn−1 )), . . . , (xn−k , g(xn−k )).
When k = 1, this is exactly the secant method. When k = 2, this method uses the same parabola as does Müller’s
method, but in a different way. In Müller’s method, the next iteration is found by locating a root of the interpolating
polynomial. In this method, the next iteration is found by locating a root of a tangent line to the interpolating
polynomial.
As k increases, more initial values are needed, but the order of convergence increases as a benefit. Letting αk be the order of convergence of Sidi’s kth degree method, we have α1 = (1 + √5)/2 ≈ 1.618, the order of convergence of the secant method, and

α2 ≈ 1.839,   α3 ≈ 1.928,   α4 ≈ 1.966.
For any k, Sidi’s method has an order of convergence less than 2 (the order of convergence of Newton’s method)
but it approaches 2 as k increases.
At this point, you might wonder just how practical such a method might be. After all, calculating a new
Lagrange interpolating polynomial and evaluating its derivative at each step can be a cumbersome process. We will
take up this issue in the next section.
Neville’s Method
The Lagrange form of an interpolating polynomial is as convenient as it gets for a human. With a little care and
patience, it is possible to write down such a polynomial without even the aid of a calculator. However, adding
points to the interpolation and evaluating the polynomial for non-interpolated points can be cumbersome tasks.
Consider a simple example: the polynomial interpolating f (x) = ex at x = 0, 1, 2:
L2(x) = (x − 1)(x − 2)/[(0 − 1)(0 − 2)] · e⁰ + (x − 0)(x − 2)/[(1 − 0)(1 − 2)] · e¹ + (x − 0)(x − 1)/[(2 − 0)(2 − 1)] · e²
      = (x − 1)(x − 2)/2 + x(x − 2)/(−1) · e + x(x − 1)/2 · e².
Evaluating L2 (1.5), for example, requires either
1. computing the values of the three separate terms, each a quadratic polynomial, and adding:
L2(1.5) = (1.5 − 1)(1.5 − 2)/2 + 1.5(1.5 − 2)/(−1) · e + 1.5(1.5 − 1)/2 · e²
        = −.125 + .75e + .375e²
        ≈ 4.684607408443278
or
2. the unpleasant business of simplifying L2 into a simpler form and then evaluating:
L2(x) = (x − 1)(x − 2)/2 + x(x − 2)/(−1) · e + x(x − 1)/2 · e²
      = (1/2)(x² − 3x + 2) − e(x² − 2x) + (e²/2)(x² − x)
      = (1/2 − e + e²/2) x² + (−3/2 + 2e − e²/2) x + 1
      ≈ 1.47624622100628x² + 0.242035607452765x + 1

so L2(1.5) ≈ 1.47624622100628(1.5)² + 0.242035607452765(1.5) + 1 = 4.684607408443277.
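Either way, the arithmetic is easy to check with a few lines of Octave (a quick sketch, not from the text):

x=[0 1 2]; y=exp(x);                  % the data interpolated by L2
xhat=1.5; L2=0;
for i=1:3
  p=prod(xhat-x([1:i-1,i+1:3]))/prod(x(i)-x([1:i-1,i+1:3]));
  L2=L2+y(i)*p;                       % Lagrange form, term by term
end%for
disp(L2)                              % approximately 4.684607408443278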
Method 2 is better if you have more points at which to evaluate, and method 1 is better if you plan to add points
of interpolation. However, neither method is particularly convenient. Even less convenient than evaluating the
polynomial is the task of adding another point of interpolation. Previous work is of limited use. And we haven’t
even begun to discuss the trouble of writing a computer program to automate the calculations. Neville’s method
can be used to overcome these limitations when the value of the polynomial at a specific point is required.
Neville’s method is based on the observation that interpolating polynomials can be constructed recursively. Suppose Pk,l is the polynomial of degree at most l interpolating the data (xk, f(xk)), (xk+1, f(xk+1)), . . . , (xk+l, f(xk+l)). Then, by definition, P0,n is the polynomial of degree at most n interpolating the data (x0, f(x0)), (x1, f(x1)), . . . , (xn, f(xn)). The recursion

Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1),    Pi,0(x) = f(xi)    (3.2.4)

produces exactly these polynomials:

1. Pi,0 is the degree 0 polynomial interpolating the one datum (xi, f(xi)).
2. Pi,m and Pi+1,m are polynomials of degree at most m, so Pi,m+1 is a polynomial of degree at most m + 1.
3. Pi,m+1(xi) = Pi,m(xi) = f(xi).
4. For any j = i + 1, . . . , i + m, Pi,m+1(xj) = [(xj − xi+m+1)f(xj) − (xj − xi)f(xj)]/(xi − xi+m+1) = f(xj).
5. Pi,m+1(xi+m+1) = Pi+1,m(xi+m+1) = f(xi+m+1).

A rigorous proof by induction on m, requested in the exercises, should follow closely these notes. Points 1 and 2 establish that Pk,l has degree at most l. Points 3 through 5 establish that Pk,l interpolates the points (xk, f(xk)), (xk+1, f(xk+1)), . . . , (xk+l, f(xk+l)). Formula 3.2.4 succinctly summarizes Neville’s method.
While Neville’s method (formula 3.2.4) can be used to find formulas for interpolating polynomials, it is normally used to find the value of an interpolating polynomial at a specific point. We earlier determined that
L2 (1.5) = 4.684607408443277 for the polynomial, L2 (x), interpolating f (x) = ex at x = 0, 1, 2. We now find this
value using Neville’s method. P0,0(1.5) = f(0) = 1, P1,0(1.5) = f(1) ≈ 2.718281828459045, and P2,0(1.5) = f(2) ≈ 7.38905609893065. So

P0,1(1.5) = [(1.5 − x1)P0,0(1.5) − (1.5 − x0)P1,0(1.5)]/(x0 − x1)
          = [(1.5 − 1)(1) − (1.5 − 0)(2.718281828459045)]/(0 − 1)
          ≈ 3.577422742688568

P1,1(1.5) = [(1.5 − x2)P1,0(1.5) − (1.5 − x1)P2,0(1.5)]/(x1 − x2)
          = [(1.5 − 2)(2.718281828459045) − (1.5 − 1)(7.38905609893065)]/(1 − 2)
          ≈ 5.053668963694848

P0,2(1.5) = [(1.5 − x2)P0,1(1.5) − (1.5 − x0)P1,1(1.5)]/(x0 − x2)
          = [(1.5 − 2)(3.577422742688568) − (1.5 − 0)(5.053668963694848)]/(0 − 2)
          ≈ 4.684607408443278.
A tabulation of the computation may make it easier to internalize the recursion and imagine how this process might
be automated. Table 3.2 shows such a tabulation. The use of this recursive formula may be more difficult than
direct computation for a human being, but for a computer, using the recursion is much quicker and simpler as
evidenced by a look at the pseudo-code.
Uniqueness
There are some subtleties we have thus far glossed over. When we introduced the Lagrange form, we casually stated
“Ln is called the Lagrange form of Pn ”, implying that the Lagrange form gives the interpolating polynomial of least
degree (since Pn is defined as such)! This fact is far from obvious. Nonetheless, we went on as if it were obvious that
Ln and Pn were one and the same polynomial. Worse yet, when we came around to discussing Neville’s method,
we calculated P0,2 (1.5) and compared it to L2 (1.5) from earlier with the implication that they should be the same,
again as if it were simply given that P0,2 and L2 should be the same polynomial. The following result shows that
our blind faith that Pn , Ln , and P0,n amount to different names for the same object was not misplaced (by virtue
of the fact that they all interpolate the same data and have degree at most n).
Theorem 7. The polynomial, Pn , of least degree interpolating the data (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) exists and is
unique. Moreover, any interpolating polynomial of degree at most n is equal to Pn .
Proof. By construction, Ln interpolates the data. Moreover, the degree of Ln is at most n since it is the sum of
polynomials pi each with degree exactly n. Thus Pn exists and has degree at most n [at this point, we must
admit that the degree of Pn may be less than that of Ln ]. Now suppose q is any polynomial interpolating
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) with degree n or less. Then the polynomial f = Pn − q also has degree n or less.
Moreover, f(xi) = Pn(xi) − q(xi) = yi − yi = 0 for all i = 0, . . . , n. Thus f has n + 1 roots. Alas, the only way f
can have n + 1 roots and have degree n or less is if f is identically 0. Hence, f (x) = Pn (x) − q(x) = 0, implying
Pn (x) = q(x) for all x.
Octave
The indices presented in the pseudo-code are predicated on indexing starting with 0, as in the mathematical
description. In Octave, however, indices can not be 0. They are always positive integers. A slight modification of
the indices is required to accommodate this discrepancy.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 22 March 2014 %
% Purpose: This function implements Neville’s method for %
% computing the value P(xhat) of the interpolating %
% polynomial P passing through the data (x(1),y(1)), %
% (x(2),y(2)),...,(x(n),y(n)). %
% INPUT: value xhat; array x of abscissas; array y of %
% ordinates. %
% OUTPUT: table of values Q; Q(1,n)=P(xhat). %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function Q = nevilles(xhat,x,y)
n=length(x);
for i=1:n
Q(i,1)=y(i);
end%for
for j=2:n
for i=1:n+1-j
Q(i,j)=((xhat-x(i+j-1))*Q(i,j-1)-(xhat-x(i))*Q(i+1,j-1))/(x(i)-x(i+j-1));
end%for
end%for
end%function
nevilles.m may be downloaded at the companion website.
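For instance, Q = nevilles(1.5, [0 1 2], exp([0 1 2])) should reproduce the computation above, with Q(1,3) ≈ 4.684607408443278.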
Key Concepts
Interpolating function: A function whose graph is required to pass through a set of prescribed points.
Interpolating polynomial: A polynomial whose graph is required to pass through a set of prescribed points.
Interpolating polynomial of least degree: The polynomial of least degree interpolating a given set of n + 1
data points is unique. We denote this polynomial by Pn .
Interpolating polynomial of degree at most n: The polynomial interpolating n + 1 distinct points has degree
at most n and is equal to the polynomial of least degree interpolating the points.
Generalized Rolle’s theorem: Suppose that f has n derivatives on (a, b) and f, f 0 , f 00 , . . . , f (n−1) are all contin-
uous on [x0 , xn ]. If f (x0 ) = f (x1 ) = · · · = f (xn ) for some x0 < x1 < · · · < xn , then there exists ξ ∈ (a, b)
such that f (n) (ξ) = 0.
Lagrange form of an interpolating polynomial: The Lagrange form, Ln , of the polynomial of degree at most
n interpolating the points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) is given by the formula
Ln(x) = Σ_{i=0}^{n} yi pi(x)/pi(xi),

where pi(x) = Π_{j=0, j≠i}^{n} (x − xj) = (x − x0) · · · (x − xi−1)(x − xi+1) · · · (x − xn).
Interpolation error: For Pn , the interpolating polynomial of least degree passing through the n + 1 points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), there is a value ξx ∈ (a, b) such that
f(x) − Pn(x) = f^(n+1)(ξx)/(n + 1)! · (x − x0)(x − x1) · · · (x − xn),
assuming f has n + 1 derivatives on (a, b) and f, f 0 , f 00 , . . . , f (n) are all continuous on [a, b], and where
a = min(x0 , . . . , xn , x) and b = max(x0 , . . . , xn , x).
Sidi’s method: A root-finding method summarized by the formula
xn+1 = xn − f(xn)/p′n,k(xn),

where pn,k is the polynomial interpolating the last k + 1 iterates (xn, f(xn)), . . . , (xn−k, f(xn−k)).
Neville’s method: A method for computing the interpolating polynomial of least degree or values of it based on
the recursive relation
Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1)
Pi,0(x) = f(xi)
(d) f(x) = ln(10x)

10. Use formula 3.2.3 to find theoretical error bounds for the approximations in question 9. Compare the bound to the actual error. [S]

11. A Lagrange interpolating polynomial is constructed for the function f(x) = (√2)^x using x0 = 0, x1 = 1, x2 = 2, x3 = 3. It is used to approximate f(1.5). Find a bound on the error in this approximation.

12. Find the polynomial referred to in question 11. Then
(a) use the polynomial to approximate f(1.5);
(b) calculate the actual error of this approximation, and compare it to the bound you calculated in question 11.

19. The interpolating polynomial on n + 1 points does not always have degree n. It has degree at most n. Plot the data (1, 1), (2, 3), (3, 5), and (4, 7), and make a conjecture as to the degree of the polynomial interpolating these four points. What led you to your conjecture?

20. Use Neville’s method to find the polynomial described in question 19. Does it have the degree you expected?

21. Let

xj = 1 − 1/(j + 1)   for j = 0, 1, 2, . . .
f(x) = 5 + 3x^2018
Pn(x) = the interpolating polynomial passing through (x0, f(x0)), . . . , (xn, f(xn)).
3.3 Newton Polynomials

We now wish to compute the polynomial Nn+1(x) interpolating the data (x0, f(x0)), (x1, f(x1)), . . . , (xn+1, f(xn+1)), and we would like to recycle the work we have already done (much the same way we could add a point of interpolation
in Neville’s method and reuse all previous work)! One way to attack the problem is to find a polynomial q(x) such
that
Nn+1 (x) = Nn (x) + q(x).
If the attack is to be successful, we must have q(x) = Nn+1 (x) − Nn (x) for all x, and, in particular, q(xj ) =
Nn+1 (xj ) − Nn (xj ) for j = 0, 1, . . . , n + 1. But Nn+1 (xj ) − Nn (xj ) = f (xj ) − f (xj ) = 0 for j = 0, 1, . . . , n, and
Nn+1(xn+1) − Nn(xn+1) = f(xn+1) − Nn(xn+1). In other words, we seek the polynomial q interpolating the points (x0, 0), (x1, 0), . . . , (xn, 0), (xn+1, (f − Nn)(xn+1)). Such a polynomial is

q(x) = (f − Nn)(xn+1) · (x − x0) · · · (x − xn) / [(xn+1 − x0) · · · (xn+1 − xn)]
     = (f − Nn)(xn+1) / [(xn+1 − x0) · · · (xn+1 − xn)] · (x − x0) · · · (x − xn).    (3.3.1)
But (f − Nn)(xn+1)/[(xn+1 − x0) · · · (xn+1 − xn)] is just a constant, so we replace it by an+1 so that we have q(x) = an+1(x − x0) · · · (x − xn). Of course we can calculate an+1 using the formula (f − Nn)(xn+1)/[(xn+1 − x0) · · · (xn+1 − xn)], but there is a better way, which we will
, but there is a better way, which we will
see shortly. We can also learn from the upcoming computation the most convenient form for Nn .
When n = 0, q has the form a1 (x − x0 ); when n = 1, q has the form a2 (x − x0 )(x − x1 ); when n = 2, q has
the form a3 (x − x0 )(x − x1 )(x − x2 ); and so on. Of course N0 (x) = a0 is constant since it is the interpolating
polynomial of least degree passing through a single point. So N1 (x) = N0 (x) + a1 (x − x0 ) immediately takes the
form a0 + a1 (x − x0 ); N2 (x) immediately takes the form a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ); N3 (x) immediately
takes the form a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + a3 (x − x0 )(x − x1 )(x − x2 ); and so on. This would suggest
that the most convenient form for Nn+1, the one that requires no simplification, is

Nn+1(x) = a0 + a1(x − x0) + a2(x − x0)(x − x1) + · · · + an+1(x − x0)(x − x1) · · · (x − xn).    (3.3.2)

Given in this form, the unknown quantity, an+1, appears as the coefficient of the x^(n+1) term. Consequently, an+1 is
potentially the leading coefficient of Nn+1 . If an+1 were zero, then we would not call it the leading coefficient. We
will facilitate the rest of this discussion by introducing the following term. For an interpolating polynomial on k + 1
points, the coefficient of its xk term is called its potential leading coefficient (even if it happens to be zero).
Since this potential leading coefficient is the crux of our problem, we focus attention on determining the potential
leading coefficient of any interpolating polynomial.
Here is where the recursive formula
Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1)
Pi,0(x) = f(xi)
used in devising Neville’s method comes in handy. In as much as Pi,m and Pi+1,m both have degree at most m,
their potential leading coefficients are the coefficients of their xm terms. It follows that the coefficient of the xm+1
term of (x − xi+m+1 )Pi,m (x) equals the potential leading coefficient of Pi,m (x), and, similarly, the coefficient of
the xm+1 term of (x − xi )Pi+1,m equals the potential leading coefficient of Pi+1,m . Therefore, the coefficient of the
xm+1 term of (x − xi+m+1 )Pi,m (x) − (x − xi )Pi+1,m (x) is the difference of the potential leading coefficients of Pi,m
and Pi+1,m . To simplify the discussion, we use the notation fi,j for the potential leading coefficient of Pi,j . Now the
coefficient of the xm+1 term of (x − xi+m+1 )Pi,m (x) − (x − xi )Pi+1,m (x) is just fi,m − fi+1,m . Hence, the potential
leading coefficient fi,m+1 of Pi,m+1 (the coefficient of the xm+1 term of Pi,m+1 ) is given by
fi,m+1 = (fi,m − fi+1,m)/(xi − xi+m+1)    (3.3.3)
fi,0 = f(xi).
While we choose to use the notation fi,j for the potential leading coefficient of Pi,j , it is much more customary
to use the expanded notation f [xi , xi+1 , . . . , xi+j ] for this quantity, and to call it a j th divided difference.
Finally, we have a formula for the potential leading coefficient that recycles previous calculations. Since Nn+1
and P0,n+1 interpolate the same set of points and both have degree at most n + 1, they are equal by theorem
7. Therefore, their potential leading coefficients, an+1 and f0,n+1 are equal. By recursion 3.3.3, we then have
an+1 = f0,n+1 = (f0,n − f1,n)/(x0 − xn+1).
It can not be stressed enough that we have not discovered a new polynomial. We have only discovered a new
way to calculate the same old interpolating polynomials. Nn , Ln , and P0,n all interpolate the same data and all
have degree at most n. They are, therefore, equal by theorem 7. Just the forms in which they are written possibly
differ. The polynomial form in equation 3.3.2 is called the Newton form.
Typically, the Newton form and divided differences are presented completely independent of Neville’s recursive
formula, an approach that takes considerably more work to develop. There are reasons to do so, however. Refrain-
ing from the use of Neville’s formula follows more closely the historical development of the subject since Newton
(1643–1727) preceded Neville (1889-1961) by over 200 years! Moreover, following the historical development more
naturally leads to further study of divided differences.
As an example, take the polynomial interpolating f (x) = ex at x = 0, 1, 2, as we did in the discussion of Neville’s
method on page 111. f0,0 = f(0) = 1, f1,0 = f(1) ≈ 2.718281828459045, and f2,0 = f(2) ≈ 7.38905609893065. So

f0,1 = (f0,0 − f1,0)/(x0 − x1) ≈ 1.718281828459045
f1,1 = (f1,0 − f2,0)/(x1 − x2) ≈ 4.670774270471605
f0,2 = (f0,1 − f1,1)/(x0 − x2) ≈ 1.476246221006280

Therefore, N2(x) = 1 + 1.718281828459045(x) + 1.47624622100628(x)(x − 1). The f0,i are the coefficients of Nn. Though
this computation is manageable without a table, it is most convenient to tabulate the values of fi,j as they are
computed (just as is the case for Neville’s method). This is true for both humans and computers! A tabulation
of the computation makes it easier to internalize the recursion and imagine how this process might be automated.
Table 3.3, which is called a table of divided differences (or divided difference table), shows such a tabulation. Adding
a data point to the interpolation is as easy as computing another diagonal of coefficients (just like Neville’s method).
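Building a divided difference table in Octave takes only a few lines. The following sketch (a hypothetical helper, not a listing from the text) mirrors the table-building loops of the sidi function in the next Octave section; the Newton coefficients f0,j appear in the first row of the result.

% Divided difference table for the data x, y. F(i,j) holds the divided
% difference f_{i-1,j-1}; the first row holds the Newton coefficients.
function F = divdiff(x,y)
  n=length(x);
  F(1:n,1)=y(:);
  for j=2:n
    for i=1:n+1-j
      F(i,j)=(F(i+1,j-1)-F(i,j-1))/(x(i+j-1)-x(i));
    end%for
  end%for
end%function

For the running example, the first row of divdiff([0 1 2], exp([0 1 2])) should be approximately [1, 1.7183, 1.4762], the coefficients of N2 found above.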
Sidi’s Method
We now return attention to Sidi’s k th degree root-finding method,
xn+1 = xn − g(xn)/p′n,k(xn),

where pn,k is the polynomial interpolating (xn, g(xn)), (xn−1, g(xn−1)), . . . , (xn−k, g(xn−k)). Writing pn,k in Newton form over the nodes xn, xn−1, . . . , xn−k, with gi,j denoting the divided differences of g, and differentiating at xn,

p′n,k(xn) = gn−1,1 + gn−2,2(xn − xn−1) + · · · + gn−k,k(xn − xn−1) · · · (xn − xn−k+1).    (3.3.4)
In particular,

p′n,2(xn) = gn−1,1 + (xn − xn−1)gn−2,2

and

p′n,3(xn) = gn−1,1 + (xn − xn−1)gn−2,2 + (xn − xn−1)(xn − xn−2)gn−3,3

and so on. As a nested product,

p′n,k(xn) = gn−1,1 + (xn − xn−1)[gn−2,2 + (xn − xn−2)[· · · + (xn − xn−k+1)[gn−k,k] · · · ]].
This new pseudo-code, which utilizes the previous pseudo-code in its first step, is an improvement. Now the input
and output match in type and quantity, meaning the output of this routine may be used as input for the next
iteration. However, this routine still only calculates one step of Sidi’s method. Moreover, we have been ignoring
another issue. Each of the routines spelled out in pseudo-code so far assume we have the diagonal entries of the
corresponding divided difference table. It is not good practice to make the user of the code worry about this detail.
The routine we write should supply these values. After all, the end-user, the person trying to find a root of a
function, will only have immediate access to the function and some number of initial values. The routine must
supply the rest. Finally, we present pseudo-code in the spirit of other root-finding methods.
Assumptions: g has a root at x̂; g is k times differentiable; x0 , x1 , . . . , xk are sufficiently close to x̂.
Input: Initial values x0 , x1 , . . . , xk ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: For i = 0, 1, . . . , k do Step 2:
Step 2: Set gi,0 = g(xi );
Step 3: For j = 1, 2, . . . , k do Steps 4-5:
Step 4: For i = 0, 1, . . . , k − j do Step 5:
Step 5: Set gi,j = (gi+1,j−1 − gi,j−1)/(xi+j − xi);
Step 6: For i = 1 . . . N do Steps 7-11:
Step 7: Compute x = xk+1 according to Sidi’s method applied to
x0 , x1 , . . . , xk and gk,0 , gk−1,1 , . . . , g0,k ;
Step 8: If |x − xk | ≤ tol then return x;
Step 9: Compute gk+1,0 , gk,1 , . . . , g1,k ;
Step 10: Set x0 = x1 ; x1 = x2 ; · · · xk−1 = xk ; xk = x;
Step 11: Set gk,0 = gk+1,0 ; gk−1,1 = gk,1 ; · · · g0,k = g1,k ;
Step 12: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact root, or message of failure.
As complete as this latest pseudo-code is, it leaves one item unaddressed. It requires k initial values to run Sidi’s k th
degree method. When we encountered the secant method, we noted that needing two initial values as opposed to
one was a disadvantage. The disadvantage is only magnified in Sidi’s method where k + 1 initial values are required.
However, just as with the secant method, we can automatically generate initial values if needed. If Sidi’s method is
given one initial value, x0 , and we are trying to find a root of the function g, then we can set x1 = x0 + g(x0 ) just
as we did for the secant method. You may recall, this was not particularly successful, however. The secant method
often failed to converge with this selection of initial condition.
Much less is known about Sidi’s method and how the selection of initial values affects convergence. It might
make an interesting project to analyze good and bad practices for selecting initial values. In any case, if you have
initial values x0 , x1 , . . . , xj with 1 < j < k, the remaining k + 1 − j initial values can be found using Sidi’s method
of degree j (on x0 , x1 , . . . , xj ) to get xj+1 followed by using Sidi’s method of degree j + 1 (on x0 , x1 , . . . , xj+1 ) to
get xj+2 followed by using Sidi’s method of degree j + 2 (on x0 , x1 , . . . , xj+2 ) to get xj+3 , and so on until xk is
computed.
Octave
As is the case with Neville’s method, the Octave code follows identically its corresponding pseudo-code except that
indices have been modified to accommodate indexing beginning with 1, not 0.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 1 April 2014 %
% Purpose: Implementation of Sidi’s Method %
% INPUT: function g; initial values x0,x1,...,xk; %
% tolerance TOL; maximum number of %
% iterations N %
% OUTPUT: approximation X and number of iterations %
% i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [X,i] = sidi(x, TOL, N, g)
n=length(x);
for i=1:n
G(i,1)=g(x(i));
end%for
for j=2:n
for i=1:n+1-j
G(i,j)=(G(i+1,j-1)-G(i,j-1))/(x(i+j-1)-x(i));
end%for
end%for
for i=1:N
s=G(1,n);
for j=2:n-1
s=(x(n)-x(j))*s+G(j,n+1-j);
end%for
X=x(n)-G(n,1)/s;
if (abs(X-x(n))<TOL)
return
end%if
G(n+1,1)=g(X);
for j=n:-1:2
G(j,n+2-j)=(G(j+1,n+1-j)-G(j,n+1-j))/(X-x(j));
end%for
for j=1:n-1
x(j)=x(j+1);
end%for
x(n)=X;
for j=1:n
G(n+1-j,j)=G(n+2-j,j);
end%for
end%for
X = "Method failed. Maximum iterations exceeded.";
end%function
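For instance, [X,i] = sidi([2 2.5 3], 10^-10, 100, @(x) 3-x-sin(x)) runs Sidi's 2nd degree method and should converge to the root near 2.17975706648 found by the bracketed methods of section 2.7.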
A divided difference table on four nodes actually gives us two different at-most-quadratic interpolating polynomials with four representations for each! First,
the table was devised to compute the interpolating polynomial

f0,0 + f0,1(x − x0) + f0,2(x − x0)(x − x1) + f0,3(x − x0)(x − x1)(x − x2).

Notice that if we simply truncate the f0,3(x − x0)(x − x1)(x − x2) term, we still have an interpolating polynomial
with nodes x0 , x1 , x2 . We can support this claim in at least two ways. First, the term f0,3 (x − x0 )(x − x1 )(x − x2 )
is 0 at x0 , x1 , x2 so it does not contribute to the interpolation at the nodes x0 , x1 , x2 . Second, we can “reverse
engineer” the table, simply erasing the bottom-most diagonal. The remaining table is still a legitimate divided
difference table since none of the remaining entries depends on any of the erased entries:

x0   f0,0   f0,1   f0,2
x1   f1,0   f1,1
x2   f2,0

So
P2 (x) = f0,0 + f0,1 (x − x0 ) + f0,2 (x − x0 )(x − x1 )
is one of the degree at most 2 interpolating polynomials. Erasing the top row of the table also leaves a legitimate
divided difference table:
x1 f1,0 f1,1 f1,2
x2 f2,0 f2,1
x3 f3,0
so
Q2 (x) = f1,0 + f1,1 (x − x1 ) + f1,2 (x − x1 )(x − x2 )
is another degree at most 2 interpolating polynomial. Notice that P2 and Q2 are not just different representations
of the same polynomial. They are two different polynomials! P2 interpolates over the nodes x0 , x1 , x2 while Q2
interpolates over the nodes x1 , x2 , x3 .
The bottom diagonals of each truncated table give degree at most 2 interpolating polynomials as well. Remember,
fi,j represents the potential leading coefficient of the interpolating polynomial over the nodes xi , xi+1 , . . . , xi+j .
Hence,
Q̃2 (x) = f3,0 + f2,1 (x − x3 ) + f1,2 (x − x3 )(x − x2 )
interpolates over the nodes x3, x2, x1 and

P̃2(x) = f2,0 + f1,1(x − x2) + f0,2(x − x2)(x − x1)

interpolates over the nodes x2, x1, x0. These are not new polynomials. These are new representations for P2 and
Q2 . Actually, P̃2 = P2 and Q̃2 = Q2 .
The critical feature of each of these interpolating polynomial representations is that each successive coefficient
depends on all the same nodes as its predecessor, plus one new one. For example, f2,0 depends on x2 , f1,1 depends
on x2 and x1 , and f0,2 depends on x2 , x1 , and x0 . Hence, these three coefficients can be used to produce the
interpolating polynomial over the nodes x0 , x1 , x2 in the form of polynomial P̃2 (which, as we have already noted,
equals P2 ). Another representation for the same polynomial can be written by utilizing f1,0 (which depends on x1 ),
f0,1 (which depends on x1 and x0), and f0,2 (which depends on x1, x0, x2):

f1,0 + f0,1(x − x1) + f0,2(x − x1)(x − x0)

to give a representation of the polynomial interpolating over x0, x1, x2 (which, therefore, must equal P2). There is
one more representation of P2 that can be extracted from the original divided difference table. It comes from the
coefficients f1,0 , f1,1 , f0,2 . Can you write it down? Answer on page 126. There are two more representations of Q2
that can be extracted from the original divided difference table. Can you write them down? Answers on page 126.
Key Concepts
Newton form of an interpolating polynomial: The Newton form, Nn , of the polynomial of degree at most n interpo-
lating the points (x0, y0), (x1, y1), . . . , (xn, yn) is

Nn(x) = a0 + a1(x − xi0) + a2(x − xi0)(x − xi1) + · · · + an(x − xi0)(x − xi1) · · · (x − xin−1)

for n distinct indices i0, i1, . . . , in−1 from the set {0, 1, 2, . . . , n}. The Newton form for a particular set of data is not
unique.
Potential leading coefficient: For an interpolating polynomial on k + 1 points, the coefficient of its xk term is called its
potential leading coefficient.
Divided differences: The coefficients of the Newton form of an interpolating polynomial are called divided differences.
Exercises
1. Modify the Neville’s method pseudo-code on page 113 to produce pseudo-code for computing the coefficients of Nn .
2. Modify the Neville’s method Octave code on page 114 to produce octave code for computing the coefficients of Nn .
Test it by computing N2 interpolating f (x) = ex at x = 0, 1, 2 and comparing your result to that on page 118.
3. Let f (0.1) = 0.12, f (0.2) = 0.14, f (0.3) = 0.13, and f (0.4) = 0.15.
(a) Find the leading coefficient of the polynomial of least degree interpolating these data.
(b) Suppose, additionally, that f (0.5) = 0.11. Use your previous work to find the leading coefficient of the polynomial
of least degree interpolating all of the data.
[S]
4. Find a Newton form of the polynomial of degree at most 3 interpolating the points (1, 2), (2, 2), (3, 0) and (4, 0).
5. Use the method of divided differences to find the at-most-second-degree polynomial interpolating the points (0, 10),
(30, 58), (1029, −32). [A]
6. Use divided differences to find an interpolating polynomial for the data f (1) = 0.987, f (2.2) = −0.123, and f (3) =
0.432. [S]
7. Create a divided differences table for the following data using only pencil and paper.
(a) What is the interpolating polynomial of degree at most 2? Does it actually have degree 2?
(b) Write down two distinct linear interpolating polynomials for this data based on your table.
8. Use divided differences to find the at-most-cubic polynomial of exercise 19 of section 3.2. Does it have the expected
degree? [A]
9. Find the degree at most two interpolating polynomial of the form
10. Use the Octave code from question 2 to compute the interpolating polynomial of at most degree four for the data:
x f (x)
0.0 −6.00000
0.1 −5.89483
0.3 −5.65014
0.6 −5.17788
1.0 −4.28172
Then add f (1.1) = −3.9958 to the table, and compute the interpolating polynomial of degree at most 5 using a
calculator. You may use the Octave code to check your work. [S]
11. Use the Octave code from question 2 to find interpolating polynomials of degrees (at most) one, two, and three for
the following data. Approximate f (8.4) using each polynomial.
12. Find a bound on the error in using the interpolating polynomial of question 6 to approximate f (2) assuming that all
derivatives of f are bounded between −2 and 1 over the interval [1, 3]. [S]
(b) assuming f ∈ C 3 , find a theoretical bound on the error of approximating f (x) on the interval [2, 4].
[A]
14.
(a) Find an error bound, in terms of f (4) (ξ8.4 ), for the approximation P3 (8.4) in question 11.
(b) Find an error bound, in terms of f (4) (x), for the approximation P3 (x) in question 11 good for any x ∈ [8.1, 8.7].
(c) Suppose f (4) (x) = x cos x − ex for the function f (x) of question 11. Use this information to find an error bound
for the approximation P3 (x) good for any x ∈ [8.1, 8.7].
15. Buck spilled coffee on his divided differences table, obscuring several numbers. Nevertheless, there is enough legible
information to find the at-most-degree-3 polynomial interpolating the data. Find it. [A]
16. Show that the polynomial interpolating the following data has degree 3.
x −2 −1 0 1 2 3
f (x) 1 4 11 16 13 −4
17. For a function f , Newton’s divided difference formula gives the interpolating polynomial
N3 (x) = 1 + 4x + 4x(x − 0.25) + (16/3) x(x − 0.25)(x − 0.5)
on the nodes x0 = 0, x1 = 0.25, x2 = 0.5, x3 = 0.75. Find f (0.75). [S]
18. Match the function with its Seeded Sidi method convergence diagram. In each case, Sidi’s 6th degree method was used.
The real axis passes through the center of each diagram, and the imaginary axis is represented, but is not necessarily
centered. [S]
f (x) = sin x
g(x) = sin x − e−x
h(x) = ex + 2−x + 2 cos x − 6
l(x) = 56 − 152x + 140x2 − 17x3 − 48x4 + 9x5
[convergence diagrams (a)–(d)]
19. Match the function with its Seeded Sidi method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [A]
f (x) = x4 + 2x2 + 4
g(x) = (x2 )(ln x) + (x − 3)ex
h(x) = 1 + 2x + 3x2 + 4x3 + 5x4 + 6x5
l(x) = (ln x)(x3 + 1)
[convergence diagrams (a)–(d)]
20. You have found the following Octave function with no comments (boo to the author of the function!).

function ans = foo(x, y, x0)
  n = length(x);
  ans = 0;
  for i=1:n
    a=1;
    for j=1:n
      if (j==i)
        a=a*y(i);
      else
        a=a*(x0-x(j))/(x(i)-x(j));
      endif
    endfor
    ans=ans+a;
  endfor
endfunction

What would be returned by the call
foo([1.1,1.2,1.3,1.4],[.78,.81,.79,.75],1.2)
and why?
Answers
P2 from f1,0 , f1,1 , f0,2 : P2 (x) = f1,0 + f1,1 (x − x1 ) + f0,2 (x − x1 )(x − x2 )
Q2 two new ways: Q̂2 (x) = f2,0 + f1,1 (x − x2 ) + f1,2 (x − x2 )(x − x1 ) and Q2 (x) = f2,0 + f2,1 (x − x2 ) + f1,2 (x − x2 )(x − x3 )
Chapter 4
Numerical Calculus
x0 x2
93/70 2.084603181618954
95/70 2.055494116570853
97/70 2.030278824314539
99/70 2.009751835391139
101/70 1.993574976724822
103/70 1.981091507449763
105/70 1.971614474758557
In the Newton’s method experiment, x2 is a function of x0 , and in the fixed point iteration experiment, x10 is
a function of x0 . So you start to think of them completely independently from the original root-finding question.
As they sit in their tabular form, they are just two functions for which you know a handful of values and not much
more. What do these functions look like? Do we have enough information to perhaps find their derivatives, and,
hence, local extrema? Can we find their antiderivatives? This is the stuff of numerical calculus. We can certainly
approximate these things.
In chapter 3 we learned how to approximate functions by interpolation, so we know we can use the tabular data
to approximate the functions themselves. But what about their derivatives and integrals? Well, polynomials are
easy to differentiate and integrate. Perhaps we can use the derivatives and integrals of interpolating polynomials
to approximate the derivatives and integrals of x2 (x0 ) and x10 (x0 ). Indeed we can!
In order to avoid the confusion of using x0 for multiple purposes, we will rename our functions ν(x) for x2 (x0 )
and ϕ(x) for x10 (x0 ). Hence, we have ν(93/70) = 2.0846 . . ., ν(95/70) = 2.0554 . . ., and so on. Similarly, we
have now ϕ(1/7) = 1.9498 . . ., ϕ(2/7) = 1.9510 . . ., and so on. We will also take up the practice of calling the
x-coordinates of the prescribed interpolation points nodes. Hence, the nodes we have for ν are 93/70, 95/70, and
so on. The nodes we have for ϕ are 1/7, 2/7, and so on.
ν is the (lower case) thirteenth letter of the Greek alphabet and is pronounced noo. ϕ is the (lower case) twenty-
first letter of the Greek alphabet and is pronounced fee. The letter fee is also written φ, but in mathematics it
is much more common to see the variant ϕ, perhaps to avoid confusion between fee and the empty set, ∅. The
capital versions of ν and ϕ are N and Φ, respectively.
We begin by considering interpolating polynomials on three nodes. For ν, we use the nodes 93/70, 99/70, and
1.5, and get
P2,ν (x) = 2.498590686342254x^2 − 7.726543017101505x + 7.939599956140455.
For ϕ, we use the nodes 1/7, 4/7, and 1, and get
P2,ϕ (x) = .07673215587088045x^2 − .07445530457646088x + 1.95895140161684.
We have added a second subscript to P2 in order to distinguish the interpolating polynomial for ν from that for ϕ.
Now we can approximate derivatives and integrals for both ν and ϕ using P2,ν and P2,ϕ , respectively:
ν′(x) ≈ P′2,ν (x) = 4.997181372684508x − 7.726543017101505
ϕ′(x) ≈ P′2,ϕ (x) = .1534643117417609x − .07445530457646088
∫ ν dx ≈ ∫ P2,ν dx = .8328635621140847x^3 − 3.863271508550753x^2 + 7.939599956140455x + C
∫ ϕ dx ≈ ∫ P2,ϕ dx = .02557738529029348x^3 − .03722765228823044x^2 + 1.95895140161684x + D.
For example,
ν′(1.4) ≈ P′2,ν (1.4) = 4.997181372684508(1.4) − 7.726543017101505 = −.7304890953431942
ϕ′(0.5) ≈ P′2,ϕ (0.5) = .1534643117417609(0.5) − .07445530457646088 = .002276851294419568
and
∫_{1.4}^{1.5} ν(x)dx ≈ ∫_{1.4}^{1.5} P2,ν (x)dx = [.8328635621140847x^3 − 3.863271508550753x^2 + 7.939599956140455x]_{1.4}^{1.5} = .1991481658283149
∫_{0}^{1} ϕ(x)dx ≈ ∫_{0}^{1} P2,ϕ (x)dx = [.02557738529029348x^3 − .03722765228823044x^2 + 1.95895140161684x]_{0}^{1} = 1.947301134618903.
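The entire computation can be reproduced with Octave's built-in polynomial routines polyfit, polyder, polyint, and polyval. The sketch below (variable names are our own) redoes the ν calculations above, up to floating-point differences:

xs = [93/70, 99/70, 105/70];
ys = [2.084603181618954, 2.009751835391139, 1.971614474758557];
p  = polyfit(xs, ys, 2);                % coefficients of P_{2,nu}
dp = polyder(p);                        % coefficients of P'_{2,nu}
ip = polyint(p);                        % coefficients of an antiderivative
polyval(dp, 1.4)                        % approx. nu'(1.4); about -0.73049
polyval(ip, 1.5) - polyval(ip, 1.4)     % approx. integral of nu over [1.4, 1.5]; about 0.19915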
That’s it! This exercise encapsulates the entire strategy. Given some values of an otherwise unknown function, we
will approximate the unknown function with a polynomial. We will then approximate derivatives and integrals of
the unknown function by differentiating and integrating the polynomial. There is very little more to be said about
the idea. There is, however, a lot more to be said about automation, accuracy, and efficiency, the focus of the rest
of the chapter. But before we tackle those issues, we will have another look at ν and ϕ.
Using all the nodes of ν, and the help of a computer algebra system, we compute the sixth degree interpolating
polynomial P6,ν . Using all the nodes of ϕ (and a computer algebra system) we compute the sixth degree interpolating
polynomial P6,ϕ . These give the estimates
ν′(1.4) ≈ P′6,ν (1.4) ≈ −.7178145479410887
ϕ′(0.5) ≈ P′6,ϕ (0.5) ≈ .1729311759579151
∫_{1.4}^{1.5} ν(x)dx ≈ ∫_{1.4}^{1.5} P6,ν (x)dx ≈ .1991932206801721
∫_{0}^{1} ϕ(x)dx ≈ ∫_{0}^{1} P6,ϕ (x)dx ≈ 1.925578216262883.
Table 4.1 summarizes the eight estimates we have made so far. The first four digits of the estimates of ∫_{1.4}^{1.5} ν(x)dx
agree, and the first two of ∫_{0}^{1} ϕ(x)dx agree. So there is some agreement for the estimates of the integrals. The
estimates for the derivatives don’t agree quite as well, however. The estimates for ν 0 (1.4) only agree in their first
significant digit. They both suggest ν 0 (1.4) ≈ −.7. But there is essentially no agreement between the estimates of
ϕ0 (0.5). One approximation is more than 60 times the other! Based on this simple analysis, we should have a hard
time believing either estimate of ϕ0 (0.5). And we should only trust the first few digits of the others. We will see
later that we can use this type of comparison to have the computer decide whether an approximation is good or
not.
Issues
There are three issues with the method of estimating derivatives and integrals just outlined.
1. Efficiency. For illustrative purposes and understanding the basic concept of numerical calculus, it is a good
idea to calculate some interpolating polynomials as done in the previous subsection. However, it is cumbersome
and time-consuming to do so. We will dedicate significant energy to finding shortcuts to this direct method,
thus making it more efficient and practical.
2. Automation. Numerical methods are meant to be run by a computer, not a human with a calculator. We
need to find ways that a computer can handle interpolating polynomials. This issue has intimate ties with
efficiency. After all, what will make an algorithm efficient is if it can be executed quickly by a computer!
3. Accuracy. So far we have done very little to determine how accurate our approximations are. We need to
get a better handle on the error terms in order to understand how to use the method accurately.
Presently, we make strides toward addressing all three of these issues, but we leave the bulk of it for the upcoming
sections.
In chapter 3, we labeled the nodes of an interpolating function x0 , x1 , . . . , xn . It will be beneficial to begin calling
them x0 + θ0 h, x0 + θ1 h, . . . , x0 + θn h instead. And for most of our analysis, we will use x0 + θh instead of x for
the point at which we desire an estimate. One might call this substitution a change of variables or a recalibration
of the x-axis.
To see how this helps with the analysis, consider the degree at most 2 interpolating polynomial of f with nodes
x0 + θ0 h, x0 + θ1 h, and x0 + θ2 h:
P2 (x0 + θh) = [(θ − θ1 )(θ − θ2 )] / [(θ0 − θ1 )(θ0 − θ2 )] · f (x0 + θ0 h)
 + [(θ − θ0 )(θ − θ2 )] / [(θ1 − θ0 )(θ1 − θ2 )] · f (x0 + θ1 h)
 + [(θ − θ0 )(θ − θ1 )] / [(θ2 − θ0 )(θ2 − θ1 )] · f (x0 + θ2 h).    (4.1.1)
For the most part, we have just swapped x for θ and xi for θi . This benign-looking change is actually a huge step
forward! This formula makes it apparent that the actual values of the xi are not important. It is only their location
relative to some base point, x0 , measured by some characteristic length, h, that matters. θ and the θi are those
measures. Essentially this makes x0 the origin and h the unit of measure on the x-axis. We measure all values by
how many lengths of h they are from x0 .
To illustrate the benefit, let us assume that we have three nodes, equally spaced, so the least and greatest
nodes are the same distance from the third, middle node. Setting the central node as the base point, x0 , and the
characteristic length, h, to the distance from this central node to the others, we can then label them
x0 − h, x0 , and x0 + h.
And we have already arrived at the essential point. It doesn’t matter if the set of nodes is {1, 2, 3} or {80, 90, 100}
or {−4.3, −4.2, −4.1}. In each of these sets, we have three nodes, one of which is the midpoint of the other two.
Each set of nodes is equal to the set {x0 − h, x0 , x0 + h} for some values of x0 and h. Hence, if we can do any
analysis with the set {x0 − h, x0 , x0 + h}, then we get information about working with any of the sets of nodes
{1, 2, 3} or {80, 90, 100} or {−4.3, −4.2, −4.1} and so on.
Back to the set of nodes {x0 − h, x0 , x0 + h}. For this set of nodes, we have θ0 = −1, θ1 = 0, and θ2 = 1.
Substituting into 4.1.1,
P2 (x0 + θh) = [θ(θ − 1)/2] f (x0 − h) + (1 − θ^2 ) f (x0 ) + [θ(θ + 1)/2] f (x0 + h).
Now this formula can be used to get the interpolating parabola over any set of three equally spaced nodes.
In an attempt to apply this formula to ν, consider the nodes 93/70, 99/70, and 105/70. Since 99/70 − 93/70 = 105/70 − 99/70,
we have a set of nodes of the form {x0 − h, x0 , x0 + h} with x0 = 99/70 and h = 6/70 = 3/35. It just so happens that
1.4 = 99/70 − (1/6) · (3/35), so we use θ = −1/6 to calculate P2,ν (1.4):
P2,ν (1.4) = P2,ν (x0 − (1/6)h)
 = [((−1/6)^2 + 1/6)/2] ν(93/70) + [1 − (−1/6)^2 ] ν(99/70) + [((−1/6)^2 − 1/6)/2] ν(105/70)
 = [7ν(93/70) + 70ν(99/70) − 5ν(105/70)] / 72
 = [7(2.084603181618954) + 70(2.009751835391139) − 5(1.971614474758557)] / 72
 = 2.019677477429439.
This seems a pretty good estimate since it is between ν(93/70) ≈ 2.085 and ν(99/70) ≈ 2.009 but significantly closer
to 2.009. After all, 1.4 is between 93/70 ≈ 1.328 and 99/70 ≈ 1.414 but significantly closer to 1.414. Equation 3.2.3
gives us some idea how good we might expect this estimate to be.
But let’s back this calculation up just a couple steps. The constants of the [7ν(93/70) + 70ν(99/70) − 5ν(105/70)]/72 step were
determined purely from the values of θ and the θi . And 93/70, 99/70, and 105/70 are just the three nodes x0 − h, x0 , x0 + h,
so what we really have here is a prescription, or formula, for the value P2 (x0 − (1/6)h) for any degree at most 2
interpolating polynomial over the nodes x0 − h, x0 , and x0 + h:
ν(x0 − (1/6)h) ≈ P2,ν (x0 − (1/6)h) = [7ν(x0 − h) + 70ν(x0 ) − 5ν(x0 + h)] / 72.
And there is nothing special about the particular ν in this formula either. None of the constants −1/6, 7, 70, −5,
nor 72 is dependent on ν, but rather only dependent on the spacing of the nodes. Therefore, given any function f ,
we can extract from this calculation the succinct approximation formula
f (x0 − (1/6)h) ≈ [7f (x0 − h) + 70f (x0 ) − 5f (x0 + h)] / 72.    (4.1.2)
This formula illustrates the real purpose in reframing the values of the xi in terms of x0 , h, and the θi . This way,
we get formulas applicable to a whole class of nodes, not just one particular set of nodes.
As for ϕ, the nodes 1/7, 4/7, and 1 are equally spaced, so the set {1/7, 4/7, 1} has the form {x0 − h, x0 , x0 + h} where
x0 = 4/7 and h = 3/7. Not by accident, it happens that 4/7 − (1/6) · (3/7) = 0.5, so ϕ(0.5) = ϕ(x0 − (1/6)h) where x0 = 4/7 and
h = 3/7. Now we can use formula 4.1.2 to approximate ϕ(0.5)!
ϕ(0.5) ≈ P2,ϕ (0.5) = [7ϕ(x0 − h) + 70ϕ(x0 ) − 5ϕ(x0 + h)] / 72
 = [7(1.9498808918992) + 70(1.941460911122824) − 5(1.96122825291126)] / 72
 = 1.94090678829633.
This time, we have completely circumvented any direct calculation and evaluation of P2,ϕ . Formula 4.1.2 allows us
to calculate P2,ϕ (0.5) directly from the values of ϕ at the three nodes. No need to calculate, refer back to, evaluate,
or simplify P2,ϕ ! All of that has been done in deriving the formula. Very quick. Very efficient.
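Formula 4.1.2 is easily packaged as a one-line Octave function. The helper below (its name and argument order are our own) takes the three sampled values f (x0 − h), f (x0 ), f (x0 + h) and returns the approximation of f (x0 − h/6):

function v = approx_onesixth(fm, f0, fp)
  % Formula 4.1.2: approximate f(x0 - h/6) from samples at x0-h, x0, x0+h.
  v = (7*fm + 70*f0 - 5*fp) / 72;
endfunction

Calling approx_onesixth(1.9498808918992, 1.941460911122824, 1.96122825291126) reproduces the estimate 1.94090678829633 of ϕ(0.5) computed above.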
Stencils
A formula such as 4.1.2 is only applicable to a set of nodes and point of evaluation with the same geometry (relative
positioning) as those used to derive the formula. Therefore, it will be important to keep track of the geometry used
to derive such formulas. To that end, we often refer to a particular set of nodes with its corresponding point of
evaluation as a stencil. For example, the nodes x0 − h, x0 , x0 + h with point of evaluation x0 − 16 h form a stencil—a
relative positioning of points that can be scaled (by changing the value of h) and translated (by changing the value
of x0 ). On a number line, this particular stencil has nodes at x0 − h, x0 , and x0 + h with the point of evaluation at x0 − h/6.
x0 can be located anywhere and h can be any size, even negative. It is this flexibility that makes formulas like 4.1.2
useful.
Now let’s suppose we do not have evenly spaced data, but we are interested in a point midway between two
others. An appropriate three-point stencil would use the nodes x0 − h, the leftmost node, x0 + h, the rightmost
node, x0 + θ1 h for some θ1 between −1 and 1, the middle node, and point of evaluation x0 , the point midway
between the leftmost and rightmost nodes. For θ1 = 1/3, this stencil has nodes at x0 − h, x0 + h/3, and x0 + h with the point of evaluation at x0 .
And we can derive a formula for P2 (x0 ) based on the values of f at the three nodes. Plugging θ = 0, θ0 = −1,
θ1 = 1/3, and θ2 = 1 into equation 4.1.1, we get
P2 (x0 ) = [(−1/3)(−1)] / [(−4/3)(−2)] · f (x0 − h) + [(1)(−1)] / [(4/3)(−2/3)] · f (x0 + (1/3)h) + [(1)(−1/3)] / [(2)(2/3)] · f (x0 + h)
 = [f (x0 − h) + 9f (x0 + (1/3)h) − 2f (x0 + h)] / 8,
again a succinct formula applicable to any function f . No need to calculate the interpolating polynomial or evaluate
it directly for any data that fit this stencil. That part has already been done and simplified.
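The same computation works for any stencil, so it is natural to automate it. The sketch below is our own helper, not from the text; it evaluates the coefficients of equation 4.1.1 for arbitrary θ and θi by multiplying out the Lagrange basis factors:

function c = stencil_weights(theta, thetas)
  % Weights c(i) such that P_n(x0 + theta*h) = sum_i c(i)*f(x0 + thetas(i)*h).
  n = length(thetas);
  c = ones(1, n);
  for i = 1:n
    for j = 1:n
      if j != i
        c(i) *= (theta - thetas(j)) / (thetas(i) - thetas(j));
      endif
    endfor
  endfor
endfunction

For instance, stencil_weights(0, [-1, 1/3, 1]) returns [1/8, 9/8, -1/4], the weights just derived, and stencil_weights(-1/6, [-1, 0, 1]) returns [7/72, 70/72, -5/72], the weights behind formula 4.1.2.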
Derivatives
Derivative formulas can be derived likewise. Once derived for a given stencil, they can be used very easily and
efficiently for other data fitting the same stencil. We now find the formula for the first derivative, P2′ (x0 − (1/6)h), over
the stencil used earlier. We begin by recognizing that in 4.1.1 x is a function of θ. In particular, x(θ) = x0 + hθ, so
(d/dθ)x(θ) = h. By the chain rule, (d/dθ)P2 (θ) = (d/dx)P2 (x) · (d/dθ)x(θ) = h · (d/dx)P2 (x). From equation 4.1.1, we then have
(d/dx)P2 (x) = (1/h) · (d/dθ)P2 (θ)
 = [(θ − θ1 ) + (θ − θ2 )] / [h(θ0 − θ1 )(θ0 − θ2 )] · f (x0 + θ0 h)
 + [(θ − θ0 ) + (θ − θ2 )] / [h(θ1 − θ0 )(θ1 − θ2 )] · f (x0 + θ1 h)
 + [(θ − θ0 ) + (θ − θ1 )] / [h(θ2 − θ0 )(θ2 − θ1 )] · f (x0 + θ2 h).    (4.1.3)
In particular, when θ0 = −1, θ1 = 0, θ2 = 1, and θ = −1/6, we have
P2′ (x0 − (1/6)h) = [(−1/6) + (−7/6)] / [h(−1)(−2)] · f (x0 − h) + [(5/6) + (−7/6)] / [h(1)(−1)] · f (x0 ) + [(5/6) + (−1/6)] / [h(2)(1)] · f (x0 + h)    (4.1.4)
 = [−2f (x0 − h) + f (x0 ) + f (x0 + h)] / (3h).
We now have a formula for P2′ (x0 − (1/6)h) ≈ f ′ (x0 − (1/6)h) for the stencil with nodes x0 − h, x0 , x0 + h and x = x0 − (1/6)h.
We can now apply this formula to approximate ν ′ (1.4) and ϕ′ (0.5).
ν ′ (1.4) ≈ [−2ν(93/70) + ν(99/70) + ν(105/70)] / (3 · (3/35))
 = [−2(2.084603181618954) + 2.009751835391139 + 1.971614474758557] / (9/35)
 = −.7304890953430477.
Notice this is not exactly what we got in table 4.1 for ν ′ (1.4) using P2 . The two estimates differ in the last few
digits. This is due to floating-point error affecting the calculations in different ways. Generally there is more error
in calculating directly from the interpolating polynomial because the data are processed much more heavily. Best
not to trust the last several digits in either calculation, however. Now
ϕ′ (0.5) ≈ [−2ϕ(1/7) + ϕ(4/7) + ϕ(1)] / (3 · (3/7))
 = [−2(1.9498808918992) + 1.941460911122824 + 1.96122825291126] / (9/7)
 = .002276851294420679.
Again, this is close to the approximation in table 4.1, but not exactly the same due to different floating-point errors
for the two calculations. But the point is made. Using a formula based on a stencil is preferable to working directly
from the interpolating polynomial. It is easier, more efficient, and can be automated.
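Indeed, in Octave applying the stencil formula amounts to a single arithmetic expression. A quick sketch using the tabulated values of ν (variable names are ours):

h  = 3/35;
fm = 2.084603181618954;    % nu(93/70) = nu(x0 - h)
f0 = 2.009751835391139;    % nu(99/70) = nu(x0)
fp = 1.971614474758557;    % nu(105/70) = nu(x0 + h)
(-2*fm + f0 + fp) / (3*h)  % approx. nu'(1.4); about -0.730489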
Before moving on to integration, we make one more observation. When trying to approximate f using an
interpolating polynomial, it does not make much sense to consider a stencil
where the point of evaluation is one of the nodes. We know, by definition of Pn , that Pn (xi ) = f (xi ) for each
node xi . Hence, the “formula” would be f (xi ) = P2 (xi ), and it would be exact, not an approximation. And not
particularly informative since this is one of the facts from which we calculated P2 ! On the other hand, it does make
sense to consider such a stencil when trying to approximate derivatives of f . There is no guarantee the derivative
of Pn will agree with the derivative of f anywhere, even at the nodes. Substituting θ0 = −1, θ1 = 0, θ2 = 1, and
θ = 0 into 4.1.3, we find
P2′ (x0 ) = [−1] / [h(−1)(−2)] · f (x0 − h) + [1 + (−1)] / [h(1)(−1)] · f (x0 ) + [1] / [h(2)(1)] · f (x0 + h)
 = [f (x0 + h) − f (x0 − h)] / (2h),    (4.1.5)
for example.
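As a quick sanity check of formula 4.1.5, here is a two-line Octave experiment on a function whose derivative we know (the choices of f, x0 , and h are ours):

f = @(x) x.^3;                 % f'(x) = 3x^2, so f'(2) = 12
h = 0.01;  x0 = 2;
(f(x0+h) - f(x0-h)) / (2*h)    % returns 12.0001, close to 12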
Integrals
For integration formulas, we use a modified stencil. We need the nodes plus the endpoints of integration, which will
be identified by square brackets, [ for the left endpoint and ] for the right endpoint. But the process is analogous.
We find a formula for the interpolating polynomial and, in place of integrating the unknown function, we integrate
the interpolating polynomial.
Following this procedure, we can derive a formula for the integral of f over the stencil
for example. The algebra is straightforward but tedious, so we do not show it here. It is best to use a computer
algebra system to derive such a formula. The result, an approximation of the integral over [x0 + 2.5h, x0 + 6h] using
nodes x0 , x0 + h, x0 + 2h, x0 + 3h, x0 + 4h, x0 + 5h, and x0 + 6h, is
∫_{x0 +2.5h}^{x0 +6h} f (x)dx ≈ (h/138240) [42056f (x0 + 6h) + 201831f (x0 + 5h) + 63357f (x0 + 4h)
 + 195902f (x0 + 3h) − 28518f (x0 + 2h) + 10731f (x0 + h) − 1519f (x0 )] .
This formula can now be used to approximate ∫_{1.4}^{1.5} ν(x)dx instead of integrating the interpolating polynomial
directly as done on page 129. You are invited to plug in the appropriate values of ν and compare your answer to
the one in the table on page 129. Answer on page 136.
The stencil for the approximation of ∫_{0}^{1} ϕ(x)dx using P6,ϕ is
different from the one we used to approximate ∫_{1.4}^{1.5} ν(x)dx. Consequently, the approximation formula is different
too. We need a formula for the integral over [x0 − h, x0 + 6h] with nodes x0 , x0 + h, x0 + 2h, x0 + 3h, x0 + 4h,
x0 + 5h, and x0 + 6h. The nodes are the same as before, but the interval of integration is different. The result is
∫_{x0 −h}^{x0 +6h} f (x)dx ≈ (h/8640) [5257f (x0 + 6h) − 5880f (x0 + 5h) + 59829f (x0 + 4h)
 − 81536f (x0 + 3h) + 102459f (x0 + 2h) − 50568f (x0 + h) + 30919f (x0 )] .    (4.1.6)
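Formula 4.1.6, like any stencil formula, can be wrapped in a short Octave function. In the sketch below (helper name and test integrand are our own), f is a function handle, and the nodes are x0 , x0 + h, . . . , x0 + 6h with the integral taken over [x0 − h, x0 + 6h]:

function I = int_416(f, x0, h)
  % Formula 4.1.6: integral of f over [x0-h, x0+6h] from samples at x0, x0+h, ..., x0+6h.
  w = [30919, -50568, 102459, -81536, 59829, -5880, 5257];
  I = (h/8640) * sum(w .* f(x0 + (0:6)*h));
endfunction

For example, int_416(@cos, 0, 0.1) approximates the integral of cos x over [−0.1, 0.6], whose exact value is sin(0.6) + sin(0.1) ≈ 0.66448.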
Key Concepts
node: the abscissa (first coordinate) of a data point used in interpolation.
polynomial approximation: approximating the value of a function, its derivative or integral based on the cor-
responding value of an interpolating polynomial.
stencil: relative positioning of the abscissas used in a polynomial approximation.
Exercises
1. Derive an approximation formula for the first derivative over the stencil [S] following these steps.
(a) Write down L1 (x), the Lagrange form of the interpolating polynomial passing through the points (x0 , f (x0 )) and (x1 , f (x1 )).
(b) Calculate the derivative L1′ (x).
(c) Substitute x0 + (1/2)h for x and x0 + h for x1 in your formula from (b) and simplify.
2. Derive an approximation formula for the first derivative over the stencil [S] following these steps.
(a) Write down L1 (x(θ)) = L1 (x0 + θh), the Lagrange form of the interpolating polynomial passing through the points (x0 , f (x0 )) and (x0 + h, f (x0 + h)) in terms of θ, h, and x0 .
(b) Calculate the derivative (d/dx) L1 (x(θ)). Remember, x(θ) = x0 + θh, and use the chain rule.
(c) Substitute θ = 1/2 into your formula from (b) and simplify. [A]
3. Derive an approximation formula for the first derivative over the stencil following these steps.
(a) Calculate N2 (x), the Newton form of the interpolating polynomial passing through the points (x0 , f (x0 )), (x1 , f (x1 )), and (x2 , f (x2 )).
(b) Calculate the derivative N2′ (x).
(c) Substitute x0 + (1/2)h for x, x0 + h for x1 , and x0 + 2h for x2 in your formula from (b) and simplify. [A]
4. Derive an approximation formula for the second derivative over the stencil following these steps.
(a) Calculate N2 (x(θ)) = N2 (x0 + θh), the Newton form of the interpolating polynomial passing through the points (x0 , f (x0 )), (x0 + h, f (x0 + h)), and (x0 + 2h, f (x0 + 2h)) in terms of θ, h, and x0 .
(b) Calculate the derivative (d^2 /dx^2 ) N2 (x(θ)). Remember, x(θ) = x0 + θh, and use the chain rule.
(c) Substitute θ = 1/2 into your formula from (b) and simplify.
5. Formula 4.1.5 and the formula you got from question 1 should be different. However, they were derived over essentially the same stencil—two nodes with the point of evaluation centered between them. Only the labels on the stencils were different. In other words, they were derived from the same geometry, so, in some sense, must be the same. In question 1, x0 plays the same role as x0 − h does in 4.1.5. Moreover, in question 1, the distance from the point of evaluation to either node is h/2 while in 4.1.5, that distance is h. Make the substitution x0 for x0 − h in 4.1.5. Then make the substitution h/2 for the h in the denominator of 4.1.5. With these substitutions, formula 4.1.5 should match exactly the formula you got in question 1. In other words, different labelings in a stencil produce different labelings in the associated formula. Nothing more.
6. Use formula 4.1.6 to approximate the integral.
(a) ∫_{−4}^{3} e^x dx [A]
(b) ∫_{−1}^{6} sin x dx
(c) ∫_{10}^{17} 1/(x − 5) dx [S]
(d) ∫_{−3}^{4} (x^5 − 4) dx
(e) ∫_{0}^{1} e^{−x} dx [A]
(f) ∫_{−π/2}^{π/2} cos x dx
(g) ∫_{1}^{2} (1/x) dx [A]
(h) ∫_{4}^{6.1} (9 − x^4 ) dx [A]
(b) for the first derivative.
(c) for the second derivative.
(d) for the third derivative. What can you say about this formula?
11. The polynomial p(x) = 3x^4 − 2x^2 + x − 7 is an interpolating polynomial for f . Use p to approximate
(a) f (1) [A]
(b) f (2)
(c) f ′ (1)
(d) f ′ (2) [S]
(e) ∫_0^1 f (x)dx
(f) ∫_0^2 f (x)dx [A]
12. The polynomial q(x) = −7x^4 + 3x^2 − x + 4 is an interpolating polynomial for g. Use q to approximate
(a) g(1) [A]
(b) g(2)
(c) g ′ (1) [A]
(d) g ′ (2)
(e) ∫_0^1 g(x)dx [S]
(f) ∫_0^2 g(x)dx [A]
13. Use 4.1.3 to find the formula for the first derivative over the stencil
(a)
(b)
(b) Integrate the polynomial over the interval [x0 + θ0 h, x0 + θ1 h].
(c) Simplify. [A]
15. Use the general approximation formula you derived in question 14 to find an approximation formula over the stencil.
(a) [A]
(b)
(c) [S]
(e) [A]
16. A general three point formula for the first derivative using f (x0 ), f (x0 + αh), and f (x0 + 2h), α ≠ 0 and α ≠ 2, is given by
f ′ (x0 ) = (1/(2h)) [ −((2 + α)/α) f (x0 ) + (4/(α(2 − α))) f (x0 + αh) − (α/(2 − α)) f (x0 + 2h) ] + O(h^2 )
Answers
∫_{x0 +2.5h}^{x0 +6h} f (x)dx:
((1/35)/138240) [42056(1.971614474758557) + 201831(1.981091507449763)
 + 63357(1.993574976724822) + 195902(2.009751835391139)
 − 28518(2.030278824314539) + 10731(2.055494116570853)
 − 1519(2.084603181618954)]
∫_{x0 −h}^{x0 +6h} f (x)dx:
((1/7)/8640) [5257(1.96122825291126) − 5880(1.965674866641883)
 + 59829(1.960870620285721) − 81536(1.941460911122824)
 + 102459(1.923339403354019) − 50568(1.951091775564697)
 + 30919(1.9498808918992)]
4.2 Undetermined Coefficients
Each of the approximation formulas of the previous section is a linear combination a0 f (x0 ) + a1 f (x1 ) + · · · + an f (xn ),
where x0 , x1 , . . . , xn are the nodes of the interpolating polynomial, places where the value of f is known, and the
ai are constants resulting from the derivation. The Method of Undetermined Coefficients takes a direct approach
to calculating the constants ai . Knowing that the “approximation” formula must be exact for all polynomials of
degree 0, 1, . . . , n, we can create n + 1 equations in the n + 1 unknowns, a0 , a1 , . . . , an . The solution of the resulting
system of equations gives the values of the coefficients.
Derivatives
We seek an approximation of the k th derivative of f based on knowledge of the values f (x0 + θ0 h), f (x0 +
θ1 h), . . . , f (x0 + θn h). To be precise, we desire an approximation of the form
f (k) (x0 + θh) ≈ Σ_{i=0}^{n} ai f (x0 + θi h).    (4.2.1)
Due to equation 3.2.3, the approximation must be exact for all polynomials of degree n or less. In particular, it
must be exact for the polynomials pj (x) = (x − x0 )j , j = 0, 1, . . . , n. Symbolically, it must be that
pj^(k) (x0 + θh) = Σ_{i=0}^{n} ai pj (x0 + θi h)
for j = 0, 1, . . . , n. Notice the approximation has become an (exact) equality. Noting that pj (x0 + θi h) = ((x0 +
θi h) − x0 )j = (θi h)^j , the system of equations becomes
pj^(k) (x0 + θh) = Σ_{i=0}^{n} (θi h)^j ai ,  j = 0, 1, . . . , n.    (4.2.2)
In general, a system of linear equations may have zero, one, or many solutions. However, system 4.2.2 has a
special form. In each equation, the constants (θi h)j form a geometric progression. Such a matrix of coefficients
is called a Vandermonde matrix, and it is known that as long as the θi are distinct, this system will have one
solution.
Suppose, for example, we have the three nodes x0 − h, x0 , and x0 + h
and are interested in formulas for both the first and second derivatives of f (at x0 ). For this stencil, θ = 0, θ0 = −1,
θ1 = 0, and θ2 = 1, so we are looking for formulas of the forms
f ′ (x0 ) ≈ a0 f (x0 − h) + a1 f (x0 ) + a2 f (x0 + h) and f ′′ (x0 ) ≈ b0 f (x0 − h) + b1 f (x0 ) + b2 f (x0 + h).
Each of these formulas must be exact when f = p0 , when f = p1 , and when f = p2 . These three requirements give
three equations in the three unknowns.
Beginning with the first derivative formula, we detail system 4.2.2 with k = 1 and n = 2:
0 = a0 + a1 + a2
1 = −ha0 + ha2
0 = h2 a0 + h2 a2 .
The system can be solved by substitution, elimination, or computer algebra system. The solution is a0 = −1/(2h),
a1 = 0, and a2 = 1/(2h), giving the approximation formula
f ′ (x0 ) ≈ [f (x0 + h) − f (x0 − h)] / (2h).
For the second derivative formula, we detail system 4.2.2 with k = 2 and n = 2. Notice the right-hand sides are exactly the same as they are for the first
derivative formula, save the name change from ai to bi . Only the left-hand side changes substantively. p0′′ (x) = 0 so
p0′′ (x0 ) = 0; p1′′ (x) = 0 so p1′′ (x0 ) = 0; and p2′′ (x) = 2 so p2′′ (x0 ) = 2. Making these substitutions into the equations
above,
0 = b0 + b1 + b2
0 = −hb0 + hb2
2 = h^2 b0 + h^2 b2 .
Again, the system can be solved by substitution, elimination, or computer algebra system. The solution is b0 =
b2 = 1/h^2 and b1 = −2/h^2 , giving the approximation formula
f ′′ (x0 ) ≈ [f (x0 − h) − 2f (x0 ) + f (x0 + h)] / h^2 .
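For stencils this small the systems can be solved by hand, but they are also a natural job for Octave's linear solver. A minimal sketch (variable names ours), taking h = 1 so the h-dependence can be reattached afterwards:

M = [ 1  1  1;     % row for p0: exactness for constants
     -1  0  1;     % row for p1: coefficients (theta_i*h)^1 with h = 1
      1  0  1 ];   % row for p2: coefficients (theta_i*h)^2 with h = 1
a = M \ [0; 1; 0]  % first derivative weights:  [-1/2; 0; 1/2]   (divide by h)
b = M \ [0; 0; 2]  % second derivative weights: [ 1; -2; 1 ]     (divide by h^2)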
Integrals
The idea for estimating integrals is identical to that of estimating derivatives. The mechanics only change nominally.
Where there were derivatives before, we will have integrals now. We seek an approximation of ∫_a^b f (x)dx based on
knowledge of the values f (x0 + θ0 h), f (x0 + θ1 h), . . . , f (x0 + θn h):
∫_a^b f (x)dx ≈ Σ_{i=0}^{n} ai f (x0 + θi h).    (4.2.3)
The approximation will be exact for all polynomials of degree n or less. In particular, it will be exact for pj (x) =
(x − x0 )j , j = 0, 1, . . . , n. Therefore, the system of equations
∫_a^b pj (x)dx = Σ_{i=0}^{n} (θi h)^j ai ,  j = 0, 1, . . . , n    (4.2.4)
must hold.
For this stencil, a = x0 − h, b = x0 + 6h, and θi = i, i = 0, 1, . . . , 6. Therefore, we will have a system of seven
equations in the seven unknowns. First, the left-hand sides:
∫_a^b p0 (x)dx = ∫_{x0 −h}^{x0 +6h} 1 dx = (x − x0 ) |_{x0 −h}^{x0 +6h} = 7h
∫_a^b p1 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 ) dx = (1/2)(x − x0 )^2 |_{x0 −h}^{x0 +6h} = (35/2) h^2
∫_a^b p2 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 )^2 dx = (1/3)(x − x0 )^3 |_{x0 −h}^{x0 +6h} = (217/3) h^3
∫_a^b p3 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 )^3 dx = (1/4)(x − x0 )^4 |_{x0 −h}^{x0 +6h} = (1295/4) h^4
∫_a^b p4 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 )^4 dx = (1/5)(x − x0 )^5 |_{x0 −h}^{x0 +6h} = (7777/5) h^5
∫_a^b p5 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 )^5 dx = (1/6)(x − x0 )^6 |_{x0 −h}^{x0 +6h} = (46655/6) h^6
∫_a^b p6 (x)dx = ∫_{x0 −h}^{x0 +6h} (x − x0 )^6 dx = (1/7)(x − x0 )^7 |_{x0 −h}^{x0 +6h} = 39991 h^7 .
Now putting them together with the right-hand sides (and swapping sides):
Σ_{i=0}^{6} (θi h)^0 ai = a0 + a1 + a2 + a3 + a4 + a5 + a6 = 7h
Σ_{i=0}^{6} (θi h)^1 ai = ha1 + 2ha2 + 3ha3 + 4ha4 + 5ha5 + 6ha6 = (35/2) h^2
Σ_{i=0}^{6} (θi h)^2 ai = h^2 a1 + 4h^2 a2 + 9h^2 a3 + 16h^2 a4 + 25h^2 a5 + 36h^2 a6 = (217/3) h^3
Σ_{i=0}^{6} (θi h)^3 ai = h^3 a1 + 8h^3 a2 + 27h^3 a3 + 64h^3 a4 + 125h^3 a5 + 216h^3 a6 = (1295/4) h^4
Σ_{i=0}^{6} (θi h)^4 ai = h^4 a1 + 16h^4 a2 + 81h^4 a3 + 256h^4 a4 + 625h^4 a5 + 1296h^4 a6 = (7777/5) h^5
Σ_{i=0}^{6} (θi h)^5 ai = h^5 a1 + 32h^5 a2 + 243h^5 a3 + 1024h^5 a4 + 3125h^5 a5 + 7776h^5 a6 = (46655/6) h^6
Σ_{i=0}^{6} (θi h)^6 ai = h^6 a1 + 64h^6 a2 + 729h^6 a3 + 4096h^6 a4 + 15625h^6 a5 + 46656h^6 a6 = 39991 h^7
The system again may be solved by substitution, elimination, or computer algebra, at least in principle. Not many
humans have sufficient patience and precision to solve such a system with paper and pencil, though. Trusting a
computer algebra system, the solution is a0 = (30919/8640) h, a1 = −(2107/360) h, a2 = (34153/2880) h, a3 = −(1274/135) h,
a4 = (19943/2880) h, a5 = −(49/72) h, and a6 = (5257/8640) h, giving the approximation formula
∫_{x0 −h}^{x0 +6h} f (x)dx ≈ (h/8640) [5257f (x0 + 6h) − 5880f (x0 + 5h) + 59829f (x0 + 4h) − 81536f (x0 + 3h)
 + 102459f (x0 + 2h) − 50568f (x0 + h) + 30919f (x0 )]    (4.2.5)
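Although the text uses Maxima for the seven-node system, Octave's backslash operator handles it just as well numerically. A sketch (with h = 1; rats is used only to display the solution as fractions):

theta = 0:6;                     % the nodes are x0 + theta*h
V = zeros(7,7);
for j = 0:6
  V(j+1,:) = theta.^j;           % row j enforces exactness for (x - x0)^j
endfor
rhs = (6.^(1:7) - (-1).^(1:7))' ./ (1:7)';   % integrals of (x - x0)^j over [x0-h, x0+6h], h = 1
a = V \ rhs;
rats(a')                         % approximately [30919/8640, -2107/360, ..., 5257/8640]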
Practical considerations
We have used the stencils above
not because the results are particularly helpful, but rather to (a) illustrate the methods and (b) emphasize that these
methods work in general for any stencil you may dream up. Most of the differentiation and integration formulas
presented in numerical analysis sources stick to a small host of regularly spaced stencils where, for derivatives the
point of evaluation is a node, and for integrals, all the nodes lie between the endpoints or there are nodes at both
endpoints. It is possible the regularly-spaced stencils are all you will ever need, but it is good to know that you can
derive appropriate formulas for more unusual stencils should the need arise.
As for their derivation, the main advantage of the method of undetermined coefficients over working directly
with interpolating polynomials is the ease of automation and lessening of the necessary and often laborious algebra
needed. In the method of undetermined coefficients, the only polynomials that need to be differentiated or integrated
are the polynomials pj = (x−x0 )j , a much simpler task than integrating or differentiating interpolating polynomials.
Formulas with up to three or four nodes can be handled this way with pencil and paper. The trade-off is the necessity
of solving a system of equations, again a simpler task than differentiating and simplifying interpolating polynomials
of degree 3 or 4. As a final benefit to the method of undetermined coefficients, it is a general solution technique
used not only in numerical analysis for deriving calculus approximations, but in other studies as well, particularly
differential equations. The method is applicable whenever the form of a solution or formula is known, but the
constants (coefficients) remain a mystery.
For example, a particular solution of the differential equation
y − 2y′ + 3y′′ = 5 sin x    (4.2.6)
has the form y = A sin x + B cos x, but we do not immediately know the values of A and B. They are undetermined
coefficients (at this point). They are determined by substituting the known form into the equation being solved. With
y′ = A cos x − B sin x
y′′ = −A sin x − B cos x,
the left side of equation 4.2.6 becomes (−2A + 2B) sin x + (−2A − 2B) cos x.
We now match coefficients on left and right sides to get the system of equations
−2A + 2B = 5
−2A − 2B = 0,
whose solution is A = −5/4 and B = 5/4.
When we get involved with stencils with more than 3 or 4 nodes, solving the resulting (relatively large) system of
linear equations by hand is not a task to which most of us would look forward. However, it is a standard calculation
any computer algebra system can do easily and efficiently. Yes, it is advisable to use a computer algebra system to
derive formulas as complicated as 4.1.6. We have used Maxima1 to handle or double check a number of the more
tedious calculations presented in this text.
The best way to solve a large system of linear equations is with the aid of a computer algebra system. Figure
4.2.1 shows how wxMaxima may be used to derive formula 4.2.5.
Notice the similarities between Maxima code and Octave code. Maxima allows for statements, print state-
ments, variable assignments, arrays, and suppression of output. The syntax for these things is not the same, but
1 See http://maxima.sourceforge.net/
the principles behind them are. Once you have learned how to do these things in one language, learning how to
do them in another is usually straightforward.
Also notice the main difference between Maxima and Octave. Maxima was designed for symbolic manipulation
while Octave was designed for numerical computation. Octave can be made to do symbolic calculation and
Maxima can be made to do numerical computation, but the old carpenter’s adage “use the right tool for the
job” is worth consideration. Maxima is much more adept at symbolic manipulation than is Octave, and Octave
is much more adept at number crunching than is Maxima.
Reference
http://andrejv.github.io/wxmaxima/
It is unusual to use stencils with more than five nodes anyway. It is not because the formulas for more nodes
are significantly more complicated or difficult to use, however. As evidenced by formula 3.2.3, the error term for
an interpolating polynomial involves higher and higher derivatives of f as more nodes are added. This is generally
fine as long as f has sufficiently many derivatives and the values of the high derivatives are not prohibitively
large. However, numerical methods are often employed when the smoothness of f is known to be limited, the high
derivatives are known to be large, or the properties of its derivatives are unknown completely. For these functions,
stencils with fewer nodes, which give rise to formulas with lower order error terms, are often more accurate, not
less. And in the case of unknown smoothness, the lower order methods have a better chance of being accurate.
As a final note, some care must be taken not to ask too much of a derivative formula. With n+1 nodes, the error
term for the interpolating polynomial involves f (n+1) , so there is no hope of using these nodes to estimate f (n+1)
or any higher derivatives at any point. If you, however, forget this fact, it shows up in a direct way in the method
of undetermined coefficients. If k > n, then the system of equations with undetermined coefficients becomes
Σ_{i=0}^{n} (θi h)^j ai = 0,  j = 0, 1, . . . , n
because the k th derivative of pj is identically 0 for all j ≤ n < k. The only solution to this system is a0 = a1 =
· · · = an = 0 giving the “approximation” formula
f (k) (x0 + θh) = 0.
Indeed, this is exact for all polynomials of degree n or less. However, the error in using this formula is exactly
f (k) (x0 + θh), a relative error of exactly 1, making it completely useless.
Stability
In Experiment 2 on page 3, section 1.1, we took a brief look at approximating the first derivative of f (x) = sin x
using the fact that
f ′ (1) = lim_{h→0} [sin(1 + h) − sin(1 − h)] / (2h).
The conclusion we drew was that this computation was highly susceptible to floating-point error. If calculations
are done exactly, then we expect [sin(1 + h) − sin(1 − h)]/(2h) to approximate f ′ (1) better and better as h becomes smaller and
smaller. Not so for floating-point calculations, as the experiment revealed. There was a point at which making
h smaller made the approximation worse! And this example is not unique. This problem always arises when
approximating f 0 using the centered difference formula
f ′ (x) ≈ [f (x + h) − f (x − h)] / (2h).    (4.2.7)
But how can we predict at what value of h that might happen without comparing our results to the exact value of
the derivative? After all, numerical differentiation is employed most often when the exact formula for the derivative
is unknown or prohibitively difficult to compute.
Suppose f can be computed to near machine precision. In typical floating point calculations, including Octave,
that means a relative floating-point error of approximately 10−15 or absolute floating-point error εf ≈ 10−15 |f (x)|.
Since we assume h is small, we can approximate both |f˜(x + h) − f (x + h)| and |f˜(x − h) − f (x − h)| by εf giving
an absolute error of approximately 2εf in calculating the numerator f (x + h) − f (x − h). Assuming h is calculated
exactly, we have the absolute error
εr = |f˜′ (x) − f ′ (x)| ≈ 2εf / (2h) = εf / h = (|f (x)| / 10^15 ) · (1/h).    (4.2.8)
As we will see shortly, the algorithmic error, εa , is caused by truncation and equals (f ′′′ (ξ)/6) h^2 for some value of ξ
near x. Since ξ is near x, we approximate f ′′′ (ξ) by f ′′′ (x) and conclude that
εa ≈ (|f ′′′ (x)| / 6) h^2 .    (4.2.9)
We now minimize the value of εr + εa by setting its derivative (with respect to h) equal to zero and solving the
resulting equation:
0 = (d/dh)(εr + εa ) ≈ (d/dh) [ (|f (x)|/10^15 ) · (1/h) + (|f ′′′ (x)|/6) · h^2 ]
 = −(|f (x)|/10^15 ) · (1/h^2 ) + (|f ′′′ (x)|/3) · h
⇒
(|f ′′′ (x)|/3) · h ≈ (|f (x)|/10^15 ) · (1/h^2 )
h^3 ≈ (|f (x)| / |f ′′′ (x)|) · (3/10^15 )
h ≈ ∛(3|f (x)| / |f ′′′ (x)|) · 10^−5 .
For Experiment 2 on page 3, this means we should expect the optimal value of h to be around ∛(3 sin(1)/ sin(1)) · 10^−5 ≈
1.44(10)^−5 . We reproduce the table from Experiment 2 here with the addition of a third column, the actual absolute
error:
h    p̃∗ (h)    |p̃∗ (h) − f ′ (1)|
10−2 0.5402933008747335 9.00(10)−6
10−3 0.5403022158176896 9.00(10)−8
10−4 0.5403023049677103 9.00(10)−10
10−5 0.5403023058569989 1.11(10)−11
10−6 0.5403023058958567 2.77(10)−11
10−7 0.5403023056738121 1.94(10)−10
Indeed, when h = 10^−5 , we get our best results! However, the prediction of the optimal value of h was based on
knowledge of f ′′′ , something we generally will not have. Unless we happen to know that |f (x)|/|f ′′′ (x)| is far from
1, we assume it is reasonably close to 1, in which case the optimal value of h is around 10^−5 . Similar estimates can
be made for other derivative formulas.
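The table above is easy to regenerate, and the same experiment can be run on any function you can evaluate. A sketch in Octave (our loop bounds):

f = @sin;  x = 1;  exact = cos(1);
for h = 10.^(-2:-1:-7)
  approx = (f(x+h) - f(x-h)) / (2*h);      % centered difference, formula 4.2.7
  printf("h = %g  approx = %.16f  error = %.2e\n", h, approx, abs(approx - exact));
endfor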
Because numerical differentiation is so sensitive to floating-point error, we say that it is unstable. The root
finding methods and numerical integration we have discussed are all stable methods. Their sensitivity to floating-
point error is commensurate with that of calculating f .
Key Concepts
undetermined coefficients: A method for solving problems in which the solution is known save for a set of
(undetermined) coefficients.
Exercises
1. Using the method of undetermined coefficients, derive an approximation formula for the first derivative over the stencil.
(a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l)
2. Using the method of undetermined coefficients, derive an approximation formula for the second derivative over the stencil.
(a) (b) (c) (d) (e) (f) (g) (h)

4.3 Error Analysis
Recall from equation 3.2.3 the interpolation error formula
f (x) − Pn (x) = [f (n+1) (ξx ) / (n + 1)!] (x − x0 )(x − x1 ) · · · (x − xn ).
We can use this formula to derive a concise formula for the error in approximating f ′ (x) by Pn′ (x).
As done in section 3.2, suppose n ≥ 1 and x0 , x1 , . . . , xn are n + 1 distinct real numbers. Set w(x) = (x − x0 )(x −
x1 ) · · · (x − xn ), a = min(x0 , . . . , xn , x), and b = max(x0 , . . . , xn , x). We know from equation 3.2.3 that, assuming
f has n + 1 derivatives on (a, b) and f ′ , f ′′ , . . . , f (n) are all continuous on [a, b], for each x ∈ [a, b],
f (x) − Pn (x) = [f (n+1) (ξx ) / (n + 1)!] w(x)
for some ξx ∈ (a, b). Hence, differentiating both sides,
f ′ (x) − Pn′ (x) = (d/dx)[f (n+1) (ξx ) / (n + 1)!] · w(x) + [f (n+1) (ξx ) / (n + 1)!] · w′ (x).
Since w vanishes at each node, the first term drops out when x is a node, and the formula simplifies nicely. Without loss of generality, we
evaluate for x = x0 and get
f ′ (x0 ) − Pn′ (x0 ) = [f (n+1) (ξx0 ) / (n + 1)!] w′ (x0 ).
From here on, the error formula is only valid at a node! This last expression can be simplified further by noting
that
w′ (x) = Σ_{i=0}^{n} Π_{j=0, j≠i}^{n} (x − xj ) = Σ_{i=0}^{n} pi (x),
where pi is as defined for equation 3.2.2 on page 106. But pi (x0 ) = 0 for all i except i = 0, so
w′ (x0 ) = p0 (x0 ) = (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ).
Substituting this expression for w′ , we have the first derivative error formula
f ′ (x0 ) − Pn′ (x0 ) = [f (n+1) (ξx0 ) / (n + 1)!] (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ).
In terms of the θi (recall xi = x0 + θi h), this is
f ′ (x0 ) − Pn′ (x0 ) = [f (n+1) (ξx0 ) / (n + 1)!] (−θ1 h)(−θ2 h) · · · (−θn h),
or
f ′ (x0 ) − Pn′ (x0 ) = [f (n+1) (ξ) / (n + 1)!] θ1 θ2 · · · θn (−h)^n .    (4.3.1)
Error terms for the first derivative over other stencils are computed similarly as long as the derivative is evaluated
at a node. Table 4.2 summarizes some common first derivative formulas, including error terms.
Notice that the error term contains (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ), the product of the differences between the
point of evaluation and all other nodes, as a factor. When the differences between the point of evaluation and
the other nodes are small, the product is small. Consequently, first derivative approximation formulas are generally
more accurate when the point of evaluation is centrally located among the nodes. Hence, we might expect a first
derivative formula involving nodes x0 < x1 < x2 to be more accurate when the point of evaluation is x1 rather
than when the point of evaluation is x0 or x2 . The same can be said about higher derivative formulas. The more
centrally located the point of evaluation, the more accurate the approximation.
Again, we choose this stencil not because the stencil is generally useful, but rather to emphasize that the method is
generally useful.
In subsection 4.1 on page 132, we derived the approximation
f ′ (x0 − (1/6)h) ≈ [−2f (x0 − h) + f (x0 ) + f (x0 + h)] / (3h).    (4.3.2)
The left hand side, the quantity being approximated, as a Taylor series looks like
f ′ (x0 − (1/6)h) = f ′ (x0 ) − (1/6)h f ′′ (x0 ) + (1/72)h^2 f ′′′ (x0 ) − (1/1296)h^3 f (4) (x0 ) + · · · .
The terms of the right hand side, the approximation, as Taylor series look like
f (x0 − h) = f (x0 ) − hf ′ (x0 ) + (1/2)h^2 f ′′ (x0 ) − (1/6)h^3 f ′′′ (x0 ) + (1/24)h^4 f (4) (x0 ) − · · ·
f (x0 ) = f (x0 )
f (x0 + h) = f (x0 ) + hf ′ (x0 ) + (1/2)h^2 f ′′ (x0 ) + (1/6)h^3 f ′′′ (x0 ) + (1/24)h^4 f (4) (x0 ) + · · · .
We now substitute these Taylor series into the right hand side of 4.3.2 and simplify. To facilitate the algebra, we
begin by summing −2f (x0 − h) + f (x0 ) + f (x0 + h):
−2f (x0 − h) = −2f (x0 ) + 2hf ′ (x0 ) − h^2 f ′′ (x0 ) + (1/3)h^3 f ′′′ (x0 ) − (1/12)h^4 f (4) (x0 ) − · · ·
f (x0 ) = f (x0 )
f (x0 + h) = f (x0 ) + hf ′ (x0 ) + (1/2)h^2 f ′′ (x0 ) + (1/6)h^3 f ′′′ (x0 ) + (1/24)h^4 f (4) (x0 ) + · · ·
−2f (x0 − h) + f (x0 ) + f (x0 + h) = 3hf ′ (x0 ) − (1/2)h^2 f ′′ (x0 ) + (1/2)h^3 f ′′′ (x0 ) − (1/24)h^4 f (4) (x0 ) + · · · .
Hence, we have
f ′ (x0 − (1/6)h) − [−2f (x0 − h) + f (x0 ) + f (x0 + h)] / (3h)
 = [f ′ (x0 ) − (1/6)hf ′′ (x0 ) + (1/72)h^2 f ′′′ (x0 ) − (1/1296)h^3 f (4) (x0 ) + · · ·]
 − [f ′ (x0 ) − (1/6)hf ′′ (x0 ) + (1/6)h^2 f ′′′ (x0 ) − (1/72)h^3 f (4) (x0 ) + · · ·]
 = −(11/72)h^2 f ′′′ (x0 ) + (17/1296)h^3 f (4) (x0 ) + · · · .
We now know that we have an error of the form O(h^2 f ′′′ (ξh )), the form of the remaining term with least degree,
but we do not have rigorous proof of that fact. Think of what has been done so far as discovery. Now that we know
the f ′′′ terms do not cancel, we go back and truncate all the Taylor series after the f ′′ terms, replacing higher order
derivatives with an error term, and “redo” the algebra. We thus have
f ′ (x0 − (1/6)h) = f ′ (x0 ) − (1/6)hf ′′ (x0 ) + (1/72)h^2 f ′′′ (ξ1 )
f (x0 − h) = f (x0 ) − hf ′ (x0 ) + (1/2)h^2 f ′′ (x0 ) − (1/6)h^3 f ′′′ (ξ2 )
f (x0 ) = f (x0 )
f (x0 + h) = f (x0 ) + hf ′ (x0 ) + (1/2)h^2 f ′′ (x0 ) + (1/6)h^3 f ′′′ (ξ3 )
where ξ1 ∈ (x0 − (1/6)h, x0 ), ξ2 ∈ (x0 − h, x0 ), and ξ3 ∈ (x0 , x0 + h). And now when we compute e(h) = f ′ (x0 − (1/6)h) −
[−2f (x0 − h) + f (x0 ) + f (x0 + h)] / (3h), we know all the terms involving f , f ′ , and f ′′ vanish. The only terms left are those
involving f ′′′ :
e(h) = (1/72)h^2 f ′′′ (ξ1 ) − (1/9)h^2 f ′′′ (ξ2 ) − (1/18)h^2 f ′′′ (ξ3 ).
The final formality is that of converting this expression into big-oh notation:
|e(h)| = (h^2 /9) |(1/8)f ′′′ (ξ1 ) − f ′′′ (ξ2 ) − (1/2)f ′′′ (ξ3 )|
 ≤ (h^2 /9) [(1/8)|f ′′′ (ξ1 )| + |f ′′′ (ξ2 )| + (1/2)|f ′′′ (ξ3 )|]
 ≤ (h^2 /9) · (13/8) · max {|f ′′′ (ξ1 )| , |f ′′′ (ξ2 )| , |f ′′′ (ξ3 )|}
 = h^2 · M |f ′′′ (ξh )|
for some ξh ∈ (x0 − h, x0 + h) and M = 13/72 (the value of ξh is ξ1 , ξ2 , or ξ3 ). We conclude that e(h) = O(h^2 f ′′′ (ξh )).
In general, ξh is guaranteed to be between the least node and the greatest node. In the case of an integral
approximation, the endpoints of integration are treated as nodes for the purpose of locating ξh .
Gaussian quadrature
Ultimately, the accuracy of a numerical calculus formula is measured by its error term, a quantity having the form
O(hn f (k) (ξh )). If we are interested in the rate of convergence, we consider n, the power of h appearing in the error
term. The greater the power, the speedier the convergence. However, if we are interested in the largest class of
polynomials for which the formula is exact, we need to consider the value k, the order of the derivative appearing
in the error term. The greater k is, the larger the class of polynomials for which the formula is exact. In fact, if the
error term contains a factor of f (k) (ξh ), then the formula is exact for all polynomials up to (and including) degree
k − 1. The further implication is that there are degree k polynomials for which the formula is not exact, for if this
were not the case, then the error term would involve a higher derivative. We call the value k − 1 the degree of
precision. Formally, the degree of precision of a numerical calculus formula is the integer m such that the formula
is exact for all polynomials of degree up to and including m but is not exact for all polynomials of degree m + 1.
Gaussian quadrature formulas aim to maximize the degree of precision for integral formulas.
The numerical derivatives and integrals over a stencil with n + 1 points that we have derived so far are exact
for all polynomials up to degree n as they must be. They have degree of precision at least n. As it turns out, a
select few have degree of precision greater than n. Consider the second derivative approximation over the stencil
The stencil has three points, so we expect it to be exact for all polynomials up to degree 2 (and it is). However, its
error term is O(h2 f (4) (ξh )), indicating that the formula is exact for all polynomials up to degree 3. The degree of
precision is actually 3, not 2. The first derivative formula over the same stencil is similar. Though it has an error
term of (h^2 /6) f ′′′ (ξh ), indicating that the formula has degree of precision 2 as expected, the formula itself only involves
two of the three points available! The coefficient of f (x0 ) turns out to be zero. It follows that we can derive the
same formula using the stencil
having only two points yet having degree of precision 2. Several other centered differences have this attribute. The
Newton-Cotes formulas with an odd number of nodes also have this property. Their error terms exceed degree of
precision expectations by one degree. We noted earlier that a centrally located point of evaluation tends to increase
accuracy, and now we see that the increase can be dramatic.
What we might gather from these observations is that it is not only the number of nodes that determines the
error term of a numerical calculus formula. The location of the nodes is also important. Up to now, we have only
seen how node location affects derivative approximation. We know that centrally locating the point of evaluation
generally increases accuracy. We now take up the question of how to locate nodes in order to increase the accuracy
of integral formulas. The idea of a centralized point of evaluation has no meaning in this context, however. Integrals
do not have a single point of evaluation. They are taken over an interval. It is the locations of the nodes relative
to the endpoints of evaluation that are important. We now find out where to put the nodes to attain the greatest
degree of precision for any given number of nodes.
Let Gn be the nth Legendre polynomial, defined recursively by
G0 (x) = 1,  G1 (x) = x,  Gn+1 (x) = [(2n + 1)x Gn (x) − n Gn−1 (x)] / (n + 1).
We set the θi equal to the roots of Gn to derive the n-point quadrature formula over the interval [x0 − h, x0 + h]
with greatest degree of precision possible. With placement of the nodes chosen, we force the formula to be exact
for polynomials up to degree n − 1 as we did earlier. The difference this time is, due to the particular values of θi ,
the resulting formula will be exact for all polynomials up to degree 2n − 1. When the nodes are placed at the roots
of the nth Legendre polynomial, we get a quadrature formula for ∫_{x0 −h}^{x0 +h} f (x)dx that exceeds the expected degree of
precision by n, the number of nodes!
We demonstrate for n = 1 and n = 3.
G1 (x) = x
has for its only root, 0. Hence, we seek a formula of the form
∫_{x0 −h}^{x0 +h} f (x)dx ≈ a0 f (x0 )
which is exact for polynomials up to degree 0. The one equation for the one unknown, a0 , is
∫_{x0 −h}^{x0 +h} (1)dx = a0 (1)
or 2h = a0 . Hence, we have
∫_{x0 −h}^{x0 +h} f (x)dx ≈ 2hf (x0 ).
For f (x) = x − x0 , ∫_{x0 −h}^{x0 +h} f (x)dx = 0 and
2hf (x0 ) = 2h(x0 − x0 ) = 0,
so the formula is exact for all degree one polynomials. For f (x) = (x − x0 )^2 , ∫_{x0 −h}^{x0 +h} f (x)dx = (2/3)h^3 while
2hf (x0 ) = 2h(x0 − x0 )^2 = 0,
so it is not exact for all degree two polynomials. Therefore, its degree of precision is 1. Note the formula
∫_{x0 −h}^{x0 +h} f (x)dx ≈ 2hf (x0 ) is equivalent to the Midpoint Rule as found in Table 4.5.
Now
G2 (x) = [3x G1 (x) − G0 (x)] / 2 = (1/2)(3x^2 − 1)
so
G3 (x) = [5x G2 (x) − 2 G1 (x)] / 3
 = [(5/2)(3x^3 − x) − 2x] / 3
 = [5(3x^3 − x) − 4x] / 6
 = (15x^3 − 9x) / 6
 = (1/2)(5x^3 − 3x),
which has roots −√(3/5), 0, √(3/5). Hence, we seek a formula of the form
∫_{x0 −h}^{x0 +h} f (x)dx ≈ a0 f (x0 − √(3/5) h) + a1 f (x0 ) + a2 f (x0 + √(3/5) h)
which is exact for polynomials up to degree 2. The three equations for the three unknowns are
∫_{x0 −h}^{x0 +h} (1)dx = 2h = a0 + a1 + a2
∫_{x0 −h}^{x0 +h} (x − x0 )dx = 0 = −√(3/5) h a0 + √(3/5) h a2
∫_{x0 −h}^{x0 +h} (x − x0 )^2 dx = (2/3)h^3 = (3/5)h^2 a0 + (3/5)h^2 a2 .
The solution is
a0 = a2 = (5/9)h and a1 = (8/9)h,
so the quadrature formula is
∫_{x0 −h}^{x0 +h} f (x)dx ≈ (h/9) [5f (x0 − √(3/5) h) + 8f (x0 ) + 5f (x0 + √(3/5) h)] .
The formula was derived to be exact for polynomials up to degree 2, so its degree of precision is at least 2. We
claim the degree of precision is actually 5. For f (x) = (x − x0 )3 ,
∫_{x0 −h}^{x0 +h} f (x)dx = [(1/4)(x − x0 )^4 ]_{x0 −h}^{x0 +h} = 0
and
(h/9) [5f (x0 − √(3/5) h) + 8f (x0 ) + 5f (x0 + √(3/5) h)] = (h/9) [5(−√(3/5) h)^3 + 0 + 5(√(3/5) h)^3 ] = 0,
so it is exact for degree three polynomials. For f (x) = (x − x0 )^4 ,
∫_{x0 −h}^{x0 +h} f (x)dx = [(1/5)(x − x0 )^5 ]_{x0 −h}^{x0 +h} = (2/5)h^5
and
(h/9) [5f (x0 − √(3/5) h) + 8f (x0 ) + 5f (x0 + √(3/5) h)] = (h/9) [5(−√(3/5) h)^4 + 0 + 5(√(3/5) h)^4 ]
 = (5h/9) [(9/25)h^4 + (9/25)h^4 ]
 = (2/5)h^5 ,
so it is exact for degree four polynomials. For f (x) = (x − x0 )^5 ,
∫_{x0 −h}^{x0 +h} f (x)dx = [(1/6)(x − x0 )^6 ]_{x0 −h}^{x0 +h} = 0
and
(h/9) [5f (x0 − √(3/5) h) + 8f (x0 ) + 5f (x0 + √(3/5) h)] = (h/9) [5(−√(3/5) h)^5 + 0 + 5(√(3/5) h)^5 ] = 0,
so it is exact for degree five polynomials. For f (x) = (x − x0 )^6 ,
∫_{x0 −h}^{x0 +h} f (x)dx = [(1/7)(x − x0 )^7 ]_{x0 −h}^{x0 +h} = (2/7)h^7
and
(h/9) [5f (x0 − √(3/5) h) + 8f (x0 ) + 5f (x0 + √(3/5) h)] = (h/9) [5(−√(3/5) h)^6 + 0 + 5(√(3/5) h)^6 ]
 = (5h/9) [(27/125)h^6 + (27/125)h^6 ]
 = (6/25)h^7 ,
so it is not exact for all degree six polynomials. Its degree of precision is 5. The formula is listed as the second
Gaussian quadrature formula in table 4.5.
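The three-point Gaussian rule just derived translates directly into Octave. In the sketch below (function name is ours), an interval [a, b] is mapped to [x0 − h, x0 + h] by taking x0 = (a + b)/2 and h = (b − a)/2:

function I = gauss3(f, a, b)
  % Three-point Gaussian quadrature for the integral of f over [a, b].
  x0 = (a + b)/2;  h = (b - a)/2;
  r  = sqrt(3/5);
  I  = (h/9) * (5*f(x0 - r*h) + 8*f(x0) + 5*f(x0 + r*h));
endfunction

For example, gauss3(@exp, 0, 1) agrees with e − 1 ≈ 1.7182818 to roughly six decimal places, reflecting the high degree of precision of the rule.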
We can also find the degree of precision of any numerical calculus formula by observing the form of its error
term. If the error term has the form O(hn f (k) (ξh )), then its degree of precision is k − 1.
Key Concepts
Degree of precision: The integer m such that a numerical calculus formula is exact for all polynomials of degree
up to and including m but is not exact for all polynomials of degree m + 1.
Error terms: Error terms for numerical calculus approximations can be found by replacing all occurrences of f
in an approximation formula by Taylor series expansions about x0 and reducing.
Gaussian quadrature: A quadrature method which maximizes the degree of precision relative to the number of
nodes used.
Table 4.2: Some standard first derivative formulas.
2-point formulas
f ′ (x0 ) = [−f (x0 ) + f (x0 + h)] / h − (h/2) f ′′ (ξh )    Forward Difference
f ′ (x0 ) = [−f (x0 − h) + f (x0 )] / h + (h/2) f ′′ (ξh )    Backward Difference
3-point formulas
5-point formulas
f ′ (x0 ) = [−25f (x0 ) + 48f (x0 + h) − 36f (x0 + 2h) + 16f (x0 + 3h) − 3f (x0 + 4h)] / (12h) + (h^4 /5) f (5) (ξh )    Forward Difference
f ′ (x0 ) = [−3f (x0 − h) − 10f (x0 ) + 18f (x0 + h) − 6f (x0 + 2h) + f (x0 + 3h)] / (12h) + (h^4 /20) f (5) (ξh )
f ′ (x0 ) = [f (x0 − 2h) − 8f (x0 − h) + 8f (x0 + h) − f (x0 + 2h)] / (12h) + (h^4 /30) f (5) (ξh )    Centered Difference
f ′ (x0 ) = [−f (x0 − 3h) + 6f (x0 − 2h) − 18f (x0 − h) + 10f (x0 ) + 3f (x0 + h)] / (12h) + (h^4 /20) f (5) (ξh )
f ′ (x0 ) = [3f (x0 − 4h) − 16f (x0 − 3h) + 36f (x0 − 2h) − 48f (x0 − h) + 25f (x0 )] / (12h) + (h^4 /5) f (5) (ξh )    Backward Difference

Table 4.3: Some second derivative formulas.
3-point formulas
4-point formulas
5-point formulas
f ′′ (x0 ) = [35f (x0 ) − 104f (x0 + h) + 114f (x0 + 2h) − 56f (x0 + 3h) + 11f (x0 + 4h)] / (12h^2 ) + O(h^3 f (5) (ξh ))    Forward Difference

Table 4.4: Some third derivative formulas.
4-point formulas
5-point formulas
f ′′′ (x0 ) = [−5f (x0 ) + 18f (x0 + h) − 24f (x0 + 2h) + 14f (x0 + 3h) − 3f (x0 + 4h)] / (2h^3 ) + O(h^2 f (5) (ξh ))    Forward Difference
f ′′′ (x0 ) = [−3f (x0 − h) + 10f (x0 ) − 12f (x0 + h) + 6f (x0 + 2h) − f (x0 + 3h)] / (2h^3 ) + O(h^2 f (5) (ξh ))

Table 4.5: Some quadrature formulas.
∫_{x0 }^{x0 +3h} f (x)dx = (3h/8) [f (x0 ) + 3f (x0 + h) + 3f (x0 + 2h) + f (x0 + 3h)] + O(h^5 f (4) (ξh ))    Simpson’s 3/8 Rule
∫_{x0 }^{x0 +4h} f (x)dx = (2h/45) [7f (x0 ) + 32f (x0 + h) + 12f (x0 + 2h) + 32f (x0 + 3h) + 7f (x0 + 4h)] + O(h^7 f (6) (ξh ))    Bode’s Rule
Exercises
1. Let f (x) = ex − sin x. Complete the following table using the approximation formula
x    f (x)    f ′ (x)
−2.7 0.054797
−2.5 0.11342
−2.3 0.65536
−2.1 0.98472
−0.25
(d) ∫_1^3 e^{sin x} dx
(e) ∫_1^2 x^4 dx [A]
4. Do question 3 using the Trapezoidal rule. [S][A]
5. Do question 3 using the Midpoint rule. [S][A]
6. Find the error of the approximation in question 3. [S][A]
7. Find the error of the approximation in question 4. [S][A]
8. Find the error of the approximation in question 5. [S][A]
9. Find the error in approximating $\int_{-7}^{11}(32x^2 + 7x - 2)\,dx$ using Simpson's $\frac{3}{8}$ Rule.
10. Find the error in approximating $\int_{-17}^{36}(32x^5 + 7x^3 - 2)\,dx$ using Bode's Rule. [A]
11. For the following values of f, $x_0$, and h, use the formula
12. Compute both a lower bound and an upper bound on the error for the approximation in question 11. Verify that the actual error is between these bounds. [S][A]
13. For each part of question 11, find the value of ξ guaranteed by the formula. [S][A]
14. State the degree of precision of the closed Newton-Cotes formula on 5 nodes, Bode's Rule.
15. State the degree of precision of the five point formula [S]
\[ f'(x_0) = \frac{1}{12h}\left[-25f(x_0) + 48f(x_0+h) - 36f(x_0+2h) + 16f(x_0+3h) - 3f(x_0+4h)\right] + \frac{h^4}{5}f^{(5)}(\xi). \]
17. Find the error term for the quadrature method, and state its degree of precision.
(a) $\int_{x_0}^{x_0+h} f(x)\,dx \approx h f(x_0)$ [A]
(b) $\int_{x_0}^{x_0+h} f(x)\,dx \approx h f\!\left(x_0 + \frac{h}{4}\right)$
(c) $\int_{x_0}^{x_0+h} f(x)\,dx \approx \frac{h}{4}\left[3f\!\left(x_0 + \frac{2}{3}h\right) + f(x_0)\right]$ [S]
(d) $\int_{x_0}^{x_0+2h} f(x)\,dx \approx \frac{h}{2}\left[3f\!\left(x_0 + \frac{4}{3}h\right) + f(x_0)\right]$
(e) $\int_{x_0}^{x_0+3h} f(x)\,dx \approx \frac{3h}{4}\left[f(x_0) + 3f(x_0+2h)\right]$ [A]
(f) $\int_{x_0}^{x_0+2h} f(x)\,dx \approx \frac{h}{2}\left[f\!\left(x_0 - \frac{h}{2}\right) + 3f\!\left(x_0 + \frac{3}{2}h\right)\right]$
(g) $\int_{x_0}^{x_0+2h} f(x)\,dx \approx \frac{h}{3}\left[f(x_0-h) - 2f(x_0) + 7f(x_0+h)\right]$ [A]
(h) $\int_{x_0}^{x_0+3h} f(x)\,dx \approx 3h\left[3f\!\left(x_0 + \frac{3}{2}h\right) - 6f(x_0+h) + 4f\!\left(x_0 + \frac{3}{4}h\right)\right]$
(i) $\int_{x_0}^{x_0+3h} f(x)\,dx \approx \frac{h}{12}\left[208f\!\left(x_0 + \frac{3}{2}h\right) - 891f(x_0+h) + 1344f\!\left(x_0 + \frac{3}{4}h\right) - 625f\!\left(x_0 + \frac{3}{5}h\right)\right]$ [A]
21. What can you say about the error in approximating the first derivative of
(a) Compute the error (not a bound on the error) in estimating f'(2) using the forward difference
\[ \frac{f(x_0+h) - f(x_0)}{h} \]
with h = 0.1.
(b) Find $\xi_{0.1}$ as guaranteed by the error term.
23. Let $f(x) = \sin x$. Find a bound on the error of the approximation.
(a) $f''(3.0) \approx 25[\sin(2.8) - 2\sin(3.0) + \sin(3.2)]$ [A]
(b) $f''(3.0) \approx 1600\,[2\sin(3.0) - 5\sin(3.025) + 4\sin(3.05) - \sin(3.075)]$
(c) $f'''(3.0) \approx 500000\,[-5\sin(3.0) + 18\sin(3.01) - 24\sin(3.02) + 14\sin(3.03) - 3\sin(3.04)]$ [S]
(d) $f'''(3.0) \approx 1000\,[-\sin(2.8) + 3\sin(2.9) - 3\sin(3.0) + \sin(3.1)]$
(e) $\int_3^4 f(x)\,dx \approx \frac{1}{6}[\sin(3) + 4\sin(3.5) + \sin(4)]$
(f) $\int_3^4 f(x)\,dx \approx \frac{1}{2}\left[\sin\!\left(\frac{7}{2} - \frac{1}{2\sqrt{3}}\right) + \sin\!\left(\frac{7}{2} + \frac{1}{2\sqrt{3}}\right)\right]$ [S]
24. Suppose you have the following data on a function f. [S]
x     0        1        2        3        4
f(x)  −0.2381  −0.3125  −0.4545  −0.8333  −5
31. Repeat 30 supposing Chuck was using the Trapezoidal Rule. [A]
32. Sketch the graph of a function f(x), and indicate on it values for $x_0$ and h so that the backward difference $\frac{f(x_0)-f(x_0-h)}{h}$ gives a better approximation of $f'(x_0)$ than does the central difference $\frac{f(x_0+h)-f(x_0-h)}{2h}$.
33. Sketch the graph of a function f(x) for which the Trapezoidal Rule gives a better approximation of $\int_0^1 f(x)\,dx$ than does Simpson's Rule, and explain how you know. [S]
34. Suppose a 5 point formula is used to approximate $f''(x_0)$ for stepsizes h = 0.1 and h = 0.02. If $E_{0.1}$ represents the error in the approximation for h = 0.1 and $E_{0.02}$ represents the error in the approximation for h = 0.02, what would you expect $\frac{E_{0.02}}{E_{0.1}}$ to be, approximately? [S]
35. A general three point formula using nodes $x_0$, $x_0 + \alpha h$, and $x_0 + 2h$, ($\alpha \neq 0, 2$) is given by
\[ f'(x_0) \approx \frac{1}{2h}\left[-\frac{2+\alpha}{\alpha} f(x_0) + \frac{4}{\alpha(2-\alpha)} f(x_0+\alpha h) - \frac{\alpha}{2-\alpha} f(x_0+2h)\right]. \]
(a) Show that this formula reduces to one of the standard formulas when α = 1.
(b) Find the error term for this formula.
36. Find three different approximations for f'(0.2) using three-point formulas. [A]
x     f(x)
0     1
0.1   1.10517
0.2   1.22140
0.3   1.34986
0.4   1.49182
The graph of f'''(x) is shown below. Use it to rank your three approximations in order from least expected error to greatest expected error, and explain why you ranked them the way you did.
[Figure: the graph of f'''(x) for 0 ≤ x ≤ 0.4.]
37. Verify numerically that the error in using the formula
\[ f'(x_0) = \frac{-2f(x_0-h) - 3f(x_0) + 6f(x_0+h) - f(x_0+2h)}{6h} \]
to approximate f'(3) using the function $f(x) = (\cos 3x)^2 + \ln x$ is really $O(h^3)$.
38. Numerically approximate the best estimate that can be obtained from the formula
\[ f'(3) = \frac{-2f(3-h) - 3f(3) + 6f(3+h) - f(3+2h)}{6h} \]
with double precision (standard Octave) computation and $f(x) = (\cos 3x)^2 + \ln x$. What value of h gives this optimal approximation? [A]
39. Find the degree of precision of the quadrature formula
\[ \int_{-1}^{1} f(x)\,dx = f\!\left(-\frac{\sqrt{3}}{3}\right) + f\!\left(\frac{\sqrt{3}}{3}\right). \]
40. The quadrature formula $\int_0^2 f(x)\,dx = c_0 f(0) + c_1 f(1) + c_2 f(2)$ is exact for all polynomials of degree less than or equal to 2. Determine $c_0$, $c_1$, and $c_2$.
approximations are not as straightforward because, in each case, the quantity being approximated depends on h.
Changing h in the integration formula also changes the quantity being approximated. This is true of each formula in
table 4.5. The trapezoidal rule is as good an example as any. The left hand side, the quantity being approximated,
is $\int_{x_0}^{x_0+h} f(x)\,dx$, so smaller h means approximating the integral over a smaller interval. So how does having a
smaller error in approximating a different number tell us anything about the potential benefit of computing with
smaller values of h? Careful study of the trapezoidal rule will reveal the answer.
According to the trapezoidal rule, $\frac{h}{2}[f(x_0) + f(x_0+h)]$ approximates the integral of f over the interval $[x_0, x_0+h]$.
If h is replaced by h/2, the resulting approximation, $\frac{h}{4}\left[f(x_0) + f\!\left(x_0+\frac{h}{2}\right)\right]$, is an approximation of the integral
of f over the interval $[x_0, x_0+\frac{h}{2}]$. It is no longer an approximation of the integral over $[x_0, x_0+h]$! To use
the trapezoidal rule to approximate the original quantity, the integral of f over $[x_0, x_0+h]$, using h/2 instead of
h requires two applications of the trapezoidal rule—one over the interval $[x_0, x_0+\frac{h}{2}]$ and one over the interval
$[x_0+\frac{h}{2}, x_0+h]$. The sum of these two approximations is an approximation for the integral of f over $[x_0, x_0+h]$.
Reducing h further requires more applications of the trapezoidal rule over more intervals. In general, reducing h to
$\frac{h}{n}$ for any whole number n requires n applications of the trapezoidal rule:
\begin{align*}
\int_{x_0}^{x_0+h} f(x)\,dx &= \int_{x_0}^{x_0+\frac{h}{n}} f(x)\,dx + \int_{x_0+\frac{h}{n}}^{x_0+2\frac{h}{n}} f(x)\,dx + \cdots + \int_{x_0+(n-1)\frac{h}{n}}^{x_0+h} f(x)\,dx \\
&\approx \frac{h}{2n}\left[f(x_0) + f\!\left(x_0+\frac{h}{n}\right)\right] + \frac{h}{2n}\left[f\!\left(x_0+\frac{h}{n}\right) + f\!\left(x_0+2\frac{h}{n}\right)\right] + \\
&\qquad \cdots + \frac{h}{2n}\left[f\!\left(x_0+(n-1)\frac{h}{n}\right) + f(x_0+h)\right]. \tag{4.4.1}
\end{align*}
Decomposing $\int_{x_0}^{x_0+h} f(x)\,dx$ into the sum $\int_{x_0}^{x_1} f(x)\,dx + \int_{x_1}^{x_2} f(x)\,dx + \cdots + \int_{x_{n-1}}^{x_n} f(x)\,dx$ and summing approximations
of these integrals is called composite integration.
As for using the trapezoidal rule to do the approximating, the error in a single application of the trapezoidal rule is
$O(h^3 f''(\xi_h))$. The error in the above sum is, therefore, bounded by $\sum_{i=1}^{n} M\left(\frac{h}{n}\right)^3 f''(\mu_i) = M\frac{h^3}{n^2}\cdot\frac{1}{n}\sum_{i=1}^{n} f''(\mu_i)$
for some $\mu_i$ with $x_0 + (i-1)\frac{h}{n} < \mu_i < x_0 + i\frac{h}{n}$. Assuming $f''$ is continuous on $[x_0, x_0+h]$, the intermediate value
theorem allows us to replace $\frac{1}{n}\sum_{i=1}^{n} f''(\mu_i)$ with $f''(\xi_n)$ for some $\xi_n \in (x_0, x_0+h)$ because $\frac{1}{n}\sum_{i=1}^{n} f''(\mu_i)$ is the
average of the $f''(\mu_i)$, which is no more than the maximum of the $f''(\mu_i)$ and no less than the minimum of the $f''(\mu_i)$.
Making this replacement gives us the error bound $M\frac{h^3}{n^2} f''(\xi_n)$. In conclusion, the trapezoidal rule used multiple
times when necessary to approximate $\int_{x_0}^{x_0+h} f(x)\,dx$ actually has error $O\!\left(\left(\frac{1}{n}\right)^2 f''(\xi_n)\right)$, where n is the number of
subintervals used in the calculation and $\xi_n$ depends on n. Now the nature of the error is clearer. It is measured
by how many subintervals are used in the calculation. More subintervals (greater n) means less error (assuming
the benefit of more subintervals is not counteracted by the $f''$ factor). Other composite integration formulas are
similar. If a single-interval quadrature formula has error $O(h^k f^{(l)}(\xi_h))$, then the corresponding composite version
has error $O\!\left(\left(\frac{1}{n}\right)^{k-1} f^{(l)}(\xi_n)\right)$. More intervals generally means smaller error.
This leads to the following pseudo-code where we make the substitutions a = x0 and b = x0 + h.
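In Octave, one way to arrange the computation so that each node is evaluated only once is the following sketch (the name comptrap and its argument list are illustrative, not taken from the text):

% Composite trapezoidal rule on [a,b] with n subintervals.
% Each interior node is evaluated exactly once.
function I = comptrap(f, a, b, n)
  h = (b - a)/n;             % width of each subinterval
  x = a + h*(1:n-1);         % interior nodes
  I = h/2*(f(a) + f(b)) + h*sum(f(x));
end
% example: comptrap(@(x) sin(x), 0, 1, 10) is close to 1 - cos(1)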
Other composite integration formulas should be simplified likewise to minimize the number of times f is evaluated.
Adaptive quadrature
\[ \int_0^3 e^{-x^2}\,dx \approx 4.57837939409486 \]
and it is simple enough to approximate this value with the composite trapezoidal rule. Table 4.6 shows the
minimum number of subintervals needed to achieve various accuracies, assuming the calculations are done with
enough significant digits that floating point error does not overwhelm the calculation. It should be apparent that
achieving high accuracy results using the trapezoidal rule is not practical. It requires too many computations. We will take
up this deficiency in the next section. For now, let's analyze the usefulness of the error bound $O\!\left(\left(\frac{1}{n}\right)^2 f''(\xi_n)\right)$. Assuming $f''(\xi_n)$ is roughly
constant, we should expect to improve our estimate from an accuracy of $2.2(10)^{-2}$ to an accuracy of $5(10)^{-5}$,
an increase in accuracy of $\frac{2.2(10)^{-2}}{5(10)^{-5}} \approx 440$ times, by increasing the number of subintervals by a factor of about
$\sqrt{440} \approx 21$. In other words, we should expect it to take approximately 42 subintervals to achieve $5(10)^{-5}$ accuracy
based on accuracy of $2.2(10)^{-2}$ with 2 intervals. Since it only takes 3, we conclude that the assumption that
$f''(\xi_2) \approx f''(\xi_3)$ is bad! Luckily, the badness of this assumption actually works in our favor. It takes less, not more,
than the expected number of intervals to achieve $5(10)^{-5}$ accuracy. On the other hand, increasing the accuracy from
$5(10)^{-5}$ to $10^{-5}$, an increase by a factor of 5, we should expect to need about $\sqrt{5} \approx 2.2$ times as many subintervals.
$3 \times 2.2 = 6.6$, so the 8 needed is just about what we would expect. Similarly, to increase the accuracy from $10^{-5}$
to $10^{-7}$, an increase in accuracy by a factor of 100, we should expect to need about 10 times as many subintervals.
Indeed, 75 is about 10 times as many as 8. Likewise, to increase accuracy by a factor of 10,000 (as in going from
$10^{-7}$ to $10^{-11}$ or from $10^{-11}$ to $10^{-15}$), we should expect to need to increase the number of subintervals by a factor
of 100. Indeed, the table bears this estimate out as well.
Just remember, if $f''$ does not exist or is wildly discontinuous, or just wildly varying, the assumption that
$f''(\xi_n)$ is constant could be a bad one, no matter how many subintervals are used. The more common case is when
$f''$ is continuous and reasonably tame, though. Even in this case, when the number of subintervals is small, the
assumption is often not a good one, but when the number of subintervals is large, it is a pretty reliable assumption.
The exact number of subintervals needed before this assumption is reasonable changes from one function to another,
however.
Taking this lesson to heart, we approximate
\[ \int_0^3 \left(\sqrt{x} - e^x\cos\left(e^{2x} - x^2\right)\right)dx \]
using the trapezoidal rule with 50 subintervals and find that it is accurate to within about $10^{-1}$ of the exact value.
How many subintervals should we expect to need to achieve $10^{-3}$ accuracy? About 10 times as many, or about
500. With 500 subintervals, we actually attain accuracy of about $.997(10)^{-3}$, spot on! The assumption that $f''(\xi_n)$
is constant seems to be valid for this integral with $n \geq 50$ (and maybe for some $n < 50$ too). Alas, this is the type
of analysis that can not be done in practice. In practice, we calculate integrals numerically because we don't know
how to compute their values exactly! In "real life" situations, we have no way of knowing how accurate an integral
estimate is with 3 or 50 or 500 or 3000 subintervals. We need the computer to estimate errors as it calculates, just
as we had it do for root-finding algorithms.
Even though we know the assumption is not perfect, especially for small n, we assume $f''(\xi_n)$ is constant, so the
error of the trapezoidal rule becomes $O\!\left(\left(\frac{1}{n}\right)^2\right)$. The $f''$ factor is subsumed by the implied constant of the big-oh
notation. Accordingly, halving the number of intervals can be expected to increase the error by a factor of about 4.
Introducing the notation $T_k(a,b)$ for the composite trapezoidal rule approximation of $\int_a^b f(x)\,dx$ with k subintervals
and $e_k = \int_a^b f(x)\,dx - T_k(a,b)$ for its error,
\[ e_n \approx M\left(\frac{1}{n}\right)^2 \quad\text{and}\quad e_{2n} \approx M\left(\frac{1}{2n}\right)^2 \]
so
\[ \frac{e_n}{e_{2n}} \approx \frac{M\left(\frac{1}{n}\right)^2}{M\left(\frac{1}{2n}\right)^2} = 4, \quad\text{which implies}\quad e_n \approx 4e_{2n}. \]
Because $\int_a^b f(x)\,dx = T_2(a,b) + e_2 = T_1(a,b) + e_1$,
\begin{align*}
T_2(a,b) - T_1(a,b) &= e_1 - e_2 \\
&\approx 4e_2 - e_2 \\
&= 3e_2,
\end{align*}
so $e_2 \approx \frac{1}{3}\left[T_2(a,b) - T_1(a,b)\right]$.
To harness this knowledge, we need to incorporate this estimate into our calculation. Suppose we wish to estimate
$\int_a^b f(x)\,dx$ to within an accuracy of tol. We begin by calculating $T_2(a,b)$ and $T_1(a,b)$. If $\frac{1}{3}|T_2(a,b) - T_1(a,b)| < tol$,
we are done. $T_2(a,b)$ is our approximation. In the more likely case that $\frac{1}{3}|T_2(a,b) - T_1(a,b)| \geq tol$, we divide
the interval $[a,b]$ into two subintervals, $[a, \frac{a+b}{2}]$ and $[\frac{a+b}{2}, b]$, and compare our error estimates on these subintervals
to $\frac{tol}{2}$. If $\frac{1}{3}\left|T_2(a, \frac{a+b}{2}) - T_1(a, \frac{a+b}{2})\right| < \frac{tol}{2}$, we are done with the subinterval $[a, \frac{a+b}{2}]$. $T_2(a, \frac{a+b}{2})$ is a satisfactory
approximation of $\int_a^{\frac{a+b}{2}} f(x)\,dx$. If not, we bisect the interval again and compare error estimates to $\frac{tol}{4}$. On the other
half of $[a,b]$, if $\frac{1}{3}\left|T_2(\frac{a+b}{2}, b) - T_1(\frac{a+b}{2}, b)\right| < \frac{tol}{2}$, we are done with the subinterval $[\frac{a+b}{2}, b]$. $T_2(\frac{a+b}{2}, b)$ is a satisfactory
approximation of $\int_{\frac{a+b}{2}}^{b} f(x)\,dx$. If not, we bisect the interval again and compare error estimates to $\frac{tol}{4}$. Each time
a subinterval fails to meet the error tolerance, we divide it in half and try again. The process will normally end
successfully because, with each subinterval division, we will generally have the error decreasing by a factor of 4
while the error requirement is decreasing by a factor of only 2. In the end, the sum of the $T_2$ estimates where the
error tolerance is met will be our approximation for $\int_a^b f(x)\,dx$.
The simplest way to code this algorithm is to use a recursive function. It is possible to do without, but the record
keeping is burdensome. Depending on the programming language you are using, the trade-off may be simplicity for
speed. Some languages do not handle recursive functions quickly.
Assumptions: f has a continuous second derivative on [a, b].
Input: Function f; interval over which to integrate [a, b]; tolerance tol.
Step 1: Set $m = \frac{b+a}{2}$; $I_1 = T_1(a,b)$; $I_2 = T_2(a,b)$;
Step 2: If $|I_2 - I_1| < 3\,tol$ then return $I_2$;
Step 3: Do Steps 1-5 with inputs f; $[a, \frac{a+b}{2}]$; and $\frac{tol}{2}$; and set A equal to the result;
Step 4: Do Steps 1-5 with inputs f; $[\frac{a+b}{2}, b]$; and $\frac{tol}{2}$; and set B equal to the result;
Step 5: Return A + B;
Output: Approximate value of $\int_a^b f(x)\,dx$.
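One possible Octave realization of these steps as a recursive function (the name adaptrap is illustrative; an implementation designed for efficiency would also avoid re-evaluating f at shared endpoints):

% Recursive adaptive trapezoidal rule following the steps above.
function I = adaptrap(f, a, b, tol)
  m  = (a + b)/2;
  I1 = (b - a)/2*(f(a) + f(b));              % T1(a,b)
  I2 = (b - a)/4*(f(a) + 2*f(m) + f(b));     % T2(a,b)
  if abs(I2 - I1) < 3*tol
    I = I2;
  else
    I = adaptrap(f, a, m, tol/2) + adaptrap(f, m, b, tol/2);
  end
end
% example: adaptrap(@(x) log(3+x), 0, 3, 0.006) follows the subdivisions shown in the table below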
A tabulated example of such a computation might help clarify any confusion over how this algorithm works. The
following table approximates the integral $\int_0^3 \ln(3+x)\,dx$ with a tolerance of .006.

a     b     T1(a,b)   T2(a,b)   (1/3)|T2(a,b) − T1(a,b)|   tol
0     3     4.33555   4.42389   .02944                     .00600
0     1.5   1.95201   1.96732   .00510                     .00300
0     0.75  0.90763   0.90997   .00077                     .00150
0.75  1.5   1.05968   1.06124   .00051                     .00150
1.5   3     2.47187   2.47961   .00257                     .00300

\[ \int_0^3 \ln(3+x)\,dx \approx 0.90997 + 1.06124 + 2.47961 = 4.45082 \]
The calculation in the table requires 7 evaluations of f and underestimates the integral by about .00390. In order
of occurrence, the evaluations happen at x = 0, 3, 1.5, .75, .375, 1.125, 2.25. The composite trapezoidal rule with
7 evaluations (6 subintervals each of length .5) underestimates the integral by about .00346. The non-adaptive
composite trapezoidal rule gives a slightly better estimate with essentially the same amount of computation. But
remember, it is not necessarily efficiency we are after. It is automatic error estimates. The adaptive trapezoidal
rule does something the conventional composite trapezoidal rule does not. It monitors itself for accuracy, so when
the routine completes, you not only get an estimate, but you can have some confidence in its accuracy even when
you have no way to calculate the integral exactly for comparison.
Key Concepts
Composite numerical integration: Dividing the interval of integration into a number of subintervals, applying
a simple quadrature formula to each subinterval and summing the results.
Adaptive numerical integration: Leveraging the error term of a simple quadrature formula in order to obtain
automatic calculation of the number and nature of subintervals needed to obtain a definite integral with some
prescribed accuracy.
Exercises
1. Use the composite midpoint rule with 3 subintervals to approximate
(a) $\int_1^3 \ln(\sin(x))\,dx$ [S]
(b) $\int_5^7 \sqrt{x}\cos x\,dx$
\[ \int_{x_0}^{x_0+h} f(x)\,dx \approx \frac{h}{2}\left[f\!\left(x_0 + \frac{h}{3}\right) + f\!\left(x_0 + \frac{2h}{3}\right)\right]. \]
12. Based on our discussion of composite integration, the error term for composite Simpson's rule applied to $\int_a^b f(x)\,dx$ with n subintervals is $O\!\left(\left(\frac{1}{n}\right)^4 f^{(4)}(\xi_n)\right)$.
23. Use the composite trapezoidal rule to estimate $\int_0^1 \ln(x+1)\,dx$ accurate to within $10^{-6}$. How many subintervals are needed? [S]
24. Repeat question 23 using the composite midpoint rule.
27. Write an Octave function that implements adaptive Simpson's rule as a recursive function. Some notes about the structure:
(a) The inputs to the function should be f(x), a, b, and a maximum overall error, tol.
(b) The output of the function should be the estimate and, if you are feeling particularly stirred, the number of function evaluations.
28. Use your code from question 27 to approximate $\int_1^3 \ln(\sin(x))\,dx$ with tolerance 0.002. [A]
29. Use your code from question 27 to approximate $\int_0^1 \ln(x+1)\,dx$ accurate to within $10^{-4}$.
30. (i) Use your code from question 27 to approximate the integral using tol = $10^{-5}$. (ii) Calculate the actual error of the approximation. (iii) Is the approximation accurate to within $10^{-5}$ as requested?
(a) $\int_0^{2\pi} x\sin(x^2)\,dx$ [A]
NOTE: $\int_0^2 x^2\ln(x^2+1)\,dx = \frac{24\ln(5) - 6\tan^{-1}(2) - 4}{9}$.
31. Write an Octave function that implements the general trapezoidal rule of question 1 in such a way that $x_0$ and $x_1$ are chosen at random.
32. Write an Octave function that implements a composite version of the quadrature method in question 31.
33. Do some numerical experiments to compare the (standard) composite trapezoidal rule to the (random) composite trapezoidal rule of question 32. What do you find?
4.5 Extrapolation
In calculus, you undoubtedly encountered Euler’s constant, e, which you were probably told is approximately 2.718,
or maybe just 2.7. And unless you were involved in a digits-of-e memorization contest, you probably never saw
more digits of e than your calculator could show. We’re about to change that. The first 50 digits of e are
2.7182818284590452353602874713526624977572470936999.
How many of them do you remember? Not to worry if it is not very many. No quiz on the digits of e is imminent.
2.7182818284590452353602874713526624977572470936999
59574966967627724076630353547594571382178525166427
42746639193200305992181741359662904357290033429526
05956307381323286279434907632338298807531952510190
11573834187930702154089149934884167509244761460668
08226480016847741185374234544243710753907774499206
95517027618386062613313845830007520449338265602976
06737113200709328709127443747047230696977209310141
69283681902551510865746377211125238978442505695369
67707854499699679468644549059879316368892300987931
27736178215424999229576351482208269895193668033182
52886939849646510582093923982948879332036250944311
73012381970684161403970198376793206832823764648042
95311802328782509819455815301756717361332069811250
99618188159304169035159888851934580727386673858942
28792284998920868058257492796104841984443634632449
68487560233624827041978623209002160990235304369941
84914631409343173814364054625315209618369088870701
67683964243781405927145635490613031072085103837505
10115747704171898610687396965521267154688957035035
Can you prove it? Proof on page 174. Based on this fact, we might use
ẽ(h) = (1 + h)1/h
ẽ(0.01) ≈ 2.704813829421529
ẽ(0.005) ≈ 2.711517122929293
ẽ(0.0025) ≈ 2.714891744381238
ẽ(0.00125) ≈ 2.716584846682473
ẽ(0.000625) ≈ 2.717432851769196.
Sadly, this sequence of approximations is not converging very quickly. We have two digits of accuracy in the first
approximation and still only three digits of accuracy in the fifth. We could, of course, continue to make h smaller to
get more accurate approximations, but based on the slow improvement observed so far, this does not seem like a very
promising route. Instead, we can combine the estimates we already have to get an improved approximation. This
idea should remind you, at least on the surface, of Aitken’s delta-squared method. In that method, we combined
three consecutive approximations to form another that was generally a better approximation than any of the original
three. We will do something similar here, combining inadequate approximations to find better ones. We will name
the various new approximations for continued reuse.
Each of these new approximations is accurate to 5 or 6 significant digits! Already a significant improvement. We
can combine them further to find yet better approximations:
Then
\[ \tilde{p}(\alpha h) = p + c_1\cdot(\alpha h)^{m_1} + c_2\cdot(\alpha h)^{m_2} + c_3\cdot(\alpha h)^{m_3} + \cdots. \]
Now, if we multiply the second equation by $\alpha^{-m_1}$ and subtract the first from it, the $h^{m_1}$ terms vanish, and we get
an approximation with error term beginning with $c_2\cdot h^{m_2}$:
\begin{align*}
\alpha^{-m_1}\tilde{p}(\alpha h) &= \alpha^{-m_1}p + c_1\cdot h^{m_1} + c_2\alpha^{m_2-m_1}\cdot h^{m_2} + c_3\alpha^{m_3-m_1}\cdot h^{m_3} + \cdots \\
-\;[\tilde{p}(h) &= p + c_1\cdot h^{m_1} + c_2\cdot h^{m_2} + c_3\cdot h^{m_3} + \cdots] \\
\alpha^{-m_1}\tilde{p}(\alpha h) - \tilde{p}(h) &= (\alpha^{-m_1}-1)p + c_2(\alpha^{m_2-m_1}-1)\cdot h^{m_2} + c_3(\alpha^{m_3-m_1}-1)\cdot h^{m_3} + \cdots
\end{align*}
for some constants $c_1, c_2, c_3, c_4$. The actual values of the constants are not relevant for this computation. To
understand the computation of $\tilde{e}_1$, we use equation 4.5.4 with $\alpha = \frac{1}{2}$ and $m_1 = 1$ to get
\begin{align*}
\tilde{e}_1(h) &= \frac{2\tilde{e}\!\left(\frac{h}{2}\right) - \tilde{e}(h)}{2 - 1} \\
&= \left[2e + c_1 h + \frac{1}{2}c_2 h^2 + \frac{1}{4}c_3 h^3 + \frac{1}{8}c_4 h^4 + O(h^5)\right] - \left[e + c_1 h + c_2 h^2 + c_3 h^3 + c_4 h^4 + O(h^5)\right] \\
&= e + d_2 h^2 + d_3 h^3 + d_4 h^4 + O(h^5)
\end{align*}
for some constants $d_2, d_3, d_4$. $\tilde{e}_1(h)$ is the formula that gave us the round of approximations accurate to 5 or 6
significant digits. It is not hard to find the constants $d_i$ in terms of the constants $c_i$, but, again, the values of the
constants are immaterial and can only serve to complicate further refinements. What is important is the form of
the error. Now that we know $\tilde{e}_1(h) = e + d_2 h^2 + d_3 h^3 + d_4 h^4 + O(h^5)$, we find $\tilde{e}_2(h)$ using formula 4.5.4 with $\alpha = \frac{1}{2}$
and $m_1 = 2$:
\[ \tilde{e}_2(h) = \frac{4\tilde{e}_1\!\left(\frac{h}{2}\right) - \tilde{e}_1(h)}{3} = e + k_3 h^3 + k_4 h^4 + O(h^5) \]
for some constants $k_3$ and $k_4$. $\tilde{e}_2(h)$ is the formula that gave us the round of approximations accurate to 7 to 9
significant digits. We can again use formula 4.5.4, this time with $\alpha = \frac{1}{2}$ and $m_1 = 3$:
\[ \tilde{e}_3(h) = \frac{8\tilde{e}_2\!\left(\frac{h}{2}\right) - \tilde{e}_2(h)}{7} = e + l_4 h^4 + O(h^5) \]
for some constant $l_4$. $\tilde{e}_3(h)$ is the formula that gave us the approximations accurate to 10 and 11 significant digits.
Now is a good time to see if you can use the expression for $\tilde{e}_3(h)$ and formula 4.5.4 to derive an $O(h^5)$ formula for
$\tilde{e}_4(h)$. Then use your formula to compute $\tilde{e}_4(0.01)$ using the previously given values of $\tilde{e}_3(0.01)$ and $\tilde{e}_3(0.005)$. How
accurate is $\tilde{e}_4(0.01)$? Answers on page 174.
As a special case, Richardson's extrapolation with $\alpha = \frac{1}{2}$ applied to any approximation of the form
\[ \tilde{p}_0(h) = p + c_1 h + c_2 h^2 + c_3 h^3 + \cdots \]
gives the recursively defined refinements
\[ \tilde{p}_k(h) = \frac{2^k\,\tilde{p}_{k-1}\!\left(\frac{h}{2}\right) - \tilde{p}_{k-1}(h)}{2^k - 1}, \qquad k = 1, 2, 3, \ldots \]
which are expected to increase in accuracy as k increases. For other $\alpha$ or other forms of error, the formula for $\tilde{p}_k(h)$
changes according to 4.5.4.
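These refinements are easy to tabulate in Octave. The sketch below builds the triangle of refinements for any approximation with this error form (the name richardson is illustrative, not from the text):

function R = richardson(p0, h, k)
  R = zeros(k+1);
  for i = 0:k
    R(i+1,1) = p0(h/2^i);            % p0(h), p0(h/2), p0(h/4), ...
  end
  for j = 1:k
    for i = j:k
      R(i+1,j+1) = (2^j*R(i+1,j) - R(i,j))/(2^j - 1);
    end
  end
end
% example: richardson(@(h) (1+h)^(1/h), 0.01, 3)
% the bottom-right entry approximates e = 2.71828182845904...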
\[ \tilde{e}(h) = \begin{cases} (1+h)^{1/h} & \text{if } h \neq 0 \\ e & \text{if } h = 0, \end{cases} \]
thus defining $\tilde{e}$ at 0, then $\tilde{e}(h)$ becomes infinitely differentiable at 0, and its fifth Taylor polynomial, for example,
is:
\[ \tilde{e}(h) = e - \frac{e}{2}\cdot h + \frac{11e}{24}\cdot h^2 - \frac{7e}{16}\cdot h^3 + \frac{2447e}{5760}\cdot h^4 + \frac{f^{(5)}(\xi)}{120} h^5 \]
for some $\xi \in (0, h)$.
Differentiation
Using extrapolation, high order differentiation approximation formulas can be derived from low order formulas.
We begin with the lowest order approximation, $f'(x_0) = \frac{-f(x_0)+f(x_0+h)}{h} - \frac{h}{2}f''(\xi_h)$. The standard error term,
$-\frac{h}{2}f''(\xi_h)$, does not give the error in the form $c\cdot h^{m_1} + O(h^{m_2})$ as required by Richardson's extrapolation, so we
expand $f(x_0+h)$ in a Taylor series about $x_0$ instead:
\[ f(x_0+h) = f(x_0) + hf'(x_0) + \frac{1}{2}h^2 f''(x_0) + \frac{1}{6}h^3 f'''(x_0) + \cdots \]
so
\[ \frac{-f(x_0)+f(x_0+h)}{h} = f'(x_0) + \frac{1}{2}hf''(x_0) + \frac{1}{6}h^2 f'''(x_0) + \cdots. \]
Hence,
\begin{align*}
f'(x_0) - \frac{-f(x_0)+f(x_0+h)}{h} &= -\frac{1}{2}hf''(x_0) - \frac{1}{6}h^2 f'''(x_0) - \cdots \\
&= c_1 h + O(h^2)
\end{align*}
Integration
Applying extrapolation to definite integrals is more rewarding. We begin with any composite integration formula
and apply Richardson's extrapolation. We now consider the composite trapezoidal rule and use the notation $T_k(a,b)$
to represent the approximation of $\int_a^b f(x)\,dx$ using the trapezoidal rule with k subintervals.
Before continuing we need to have a good idea what it means for the composite trapezoidal rule to have error
term $O\!\left(\left(\frac{1}{n}\right)^2\right)$. In essence, it means we should expect the error to decrease by a factor of about 4 when the number
of intervals is doubled. We should expect the error to decrease by a factor of about 9 when the number of intervals is
tripled. And generally we should expect the error to decrease by a factor of about $\beta^2$ when the number of intervals
is multiplied by $\beta$. To see this effect in action, consider the definite integral
\[ \int_0^1 \sin x\,dx \]
whose exact value is $1 - \cos(1) \approx .4596976941318602$. The absolute errors of $T_5(0,1)$, $T_{10}(0,1)$, and $T_{15}(0,1)$ are
\begin{align*}
\left|\int_0^1 \sin x\,dx - T_5(0,1)\right| &\approx 1.533(10)^{-3} \\
\left|\int_0^1 \sin x\,dx - T_{10}(0,1)\right| &\approx 3.831(10)^{-4} \\
\left|\int_0^1 \sin x\,dx - T_{15}(0,1)\right| &\approx 1.702(10)^{-4}
\end{align*}
We should expect the error $\left|\int_0^1 \sin x\,dx - T_5(0,1)\right|$ to be about four times the error $\left|\int_0^1 \sin x\,dx - T_{10}(0,1)\right|$ and nine
times the error $\left|\int_0^1 \sin x\,dx - T_{15}(0,1)\right|$. To check, we compute the ratios:
\begin{align*}
\frac{\left|\int_0^1 \sin x\,dx - T_5(0,1)\right|}{\left|\int_0^1 \sin x\,dx - T_{10}(0,1)\right|} &= \frac{1.533(10)^{-3}}{3.831(10)^{-4}} \approx 4.001 \\
\frac{\left|\int_0^1 \sin x\,dx - T_5(0,1)\right|}{\left|\int_0^1 \sin x\,dx - T_{15}(0,1)\right|} &= \frac{1.533(10)^{-3}}{1.702(10)^{-4}} \approx 9.007.
\end{align*}
What should you expect the ratio $\frac{\left|\int_0^1 \sin x\,dx - T_{10}(0,1)\right|}{\left|\int_0^1 \sin x\,dx - T_{15}(0,1)\right|}$ to be, about? Answer on page 174.
Finally, we apply Richardson's extrapolation with $\alpha = \frac{1}{2}$ and $m_1 = 2$ to produce the higher order estimate,
\begin{align*}
T_5(0,1) &\approx .4581643459604436 \\
T_{10}(0,1) &\approx .4593145488579763 \\
T_{20}(0,1) &\approx .4596019197882473 \\
T_{40}(0,1) &\approx .4596737512942187.
\end{align*}
Hence,
\begin{align*}
T_{5,1}(0,1) &= \frac{4T_{10}(0,1) - T_5(0,1)}{3} \approx .4596979498238206 \\
T_{10,1}(0,1) &= \frac{4T_{20}(0,1) - T_{10}(0,1)}{3} \approx .4596977100983375 \\
T_{20,1}(0,1) &= \frac{4T_{40}(0,1) - T_{20}(0,1)}{3} \approx .4596976951295424
\end{align*}
and
\begin{align*}
\frac{\left|\int_0^1 \sin x\,dx - T_{5,1}(0,1)\right|}{\left|\int_0^1 \sin x\,dx - T_{10,1}(0,1)\right|} &\approx 16.01 \\
\frac{\left|\int_0^1 \sin x\,dx - T_{10,1}(0,1)\right|}{\left|\int_0^1 \sin x\,dx - T_{20,1}(0,1)\right|} &\approx 16.00.
\end{align*}
When we double the number of subintervals, the error is decreased by a factor of 16. That's $2^4$, not $2^3$ as we might
have expected! The first refinement takes us from a $O\!\left(\left(\frac{1}{n}\right)^2\right)$ approximation to a $O\!\left(\left(\frac{1}{n}\right)^4\right)$ approximation. In
other words, the error of $T_{n,1}$ is $O\!\left(\left(\frac{1}{n}\right)^4\right)$.
Now that we know the error of $T_{n,1}$ is $O\!\left(\left(\frac{1}{n}\right)^4\right)$ we can extrapolate again. Applying Richardson's extrapolation
with $\alpha = \frac{1}{2}$ and $m_1 = 4$, we have
\[ T_{n,2}(a,b) = \frac{2^4\,T_{2n,1}(a,b) - T_{n,1}(a,b)}{2^4 - 1} = \frac{16\,T_{2n,1}(a,b) - T_{n,1}(a,b)}{15}, \]
so each refinement increases the least degree in the error term by 2, not 1. Skipping the odd degrees makes this
particular choice very efficient. But this method comes with a price. Hidden within c2 is the assumption that f has
a continuous second derivative. Hidden within c4 is the assumption that f has a continuous fourth derivative. And
so on. The accuracy of each refinement depends on f having two more continuous derivatives. The more refinements
we do, the smoother f must be for this method to work. For this reason, it is advisable to use Romberg’s method
only when the integrand is known to have sufficient derivatives.
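For integrands with plenty of derivatives, the repeated refinements can be organized into a small triangular table, one row per doubling of the number of subintervals. A minimal Octave sketch of this arrangement (the name romberg is illustrative; the stopping test requested in exercise 11 is omitted):

function R = romberg(f, a, b, levels)
  R = zeros(levels);
  for i = 1:levels
    n = 2^(i-1);                                % 1, 2, 4, ... subintervals
    h = (b - a)/n;
    x = a + h*(1:n-1);
    R(i,1) = h/2*(f(a) + f(b)) + h*sum(f(x));   % composite trapezoid T_n(a,b)
  end
  for j = 2:levels
    for i = j:levels
      R(i,j) = (4^(j-1)*R(i,j-1) - R(i-1,j-1))/(4^(j-1) - 1);
    end
  end
end
% example: R = romberg(@(x) sin(x), 0, 1, 5); compare R(5,5) with 1 - cos(1)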
Key Concepts
Richardson's extrapolation: If approximation p̃ is known to have the form $\tilde{p}(h) = p + c_1 h^{m_1} + c_2 h^{m_2} + c_3 h^{m_3} + \cdots$

Exercises

...engineer Richardson refinements in order to approximate the $c_i$ of equation 4.5.5 on page 168. For example, $\tilde{e}(h) = e + c_1 h + O(h^2)$, and we assume the $O(h^2)$ term is relatively small, so we can rearrange this equation to find
\[ \frac{\tilde{e}(h) - e}{h} \approx c_1. \]
To take a specific example, $\frac{\tilde{e}(.005)-e}{.005} = \frac{2.711517122929293-e}{.005} \approx -1.35$, so $c_1 \approx -1.35$. If we pay careful attention to how the constants are affected as we refine our initial approximations, we can find $c_2$, $c_3$, and $c_4$ as well.
\begin{align*}
\tilde{e}_1(h) &= 2\tilde{e}\!\left(\frac{h}{2}\right) - \tilde{e}(h) \\
&= 2e + c_1 h + \frac{c_2}{2}h^2 + \frac{c_3}{4}h^3 + \frac{c_4}{8}h^4 + O(h^5) - \left(e + c_1 h + c_2 h^2 + c_3 h^3 + c_4 h^4 + O(h^5)\right) \\
&= e - \frac{c_2}{2}h^2 - \frac{3c_3}{4}h^3 - \frac{7c_4}{8}h^4 + O(h^5).
\end{align*}
Therefore, $\tilde{e}_1(h) - e \approx -\frac{c_2}{2}h^2$, from which we conclude
\[ \frac{-2(\tilde{e}_1(h) - e)}{h^2} \approx c_2. \]
(a) Use this formula and the values in 4.5.1 to verify that $c_2 \approx 1.24$.
(b) Approximate $c_3$ using values in 4.5.2.
(c) Approximate $c_4$ using values in 4.5.3.
(d) Compare these approximations of $c_1, c_2, c_3, c_4$ to the exact values in crumpet 30.

3. Suppose N approximates M according to $N(h) = M + K_1 h^3 + K_2 h^5 + K_3 h^7 + \cdots$. Of what order will $N_3(h)$ (the third generation Richardson's extrapolation) be? [A]

4. Suppose N approximates M according to $N(h) = M + K_1 h^2 + K_2 h^4 + K_3 h^6 + \cdots$. What would you expect the value of
\[ \frac{|M - N(h/3)|}{|M - N(h/4)|} \]
to be for small h, approximately? [A]

5. $N(h) = \frac{1-\cos h}{h^2}$ can be used to approximate $\lim_{h\to 0}\frac{1-\cos h}{h^2}$. [A]
(a) Compute N(1.0) and N(0.5).
(b) Compute $N_1(1.0)$, the first Richardson's extrapolation, assuming
  i. N(h) has an error of the form $K_1 h + K_2 h^2 + K_3 h^3 + \cdots$
  ii. N(h) has an error of the form $K_2 h^2 + K_4 h^4 + K_6 h^6 + \cdots$
(c) Which of the assumptions in part 5b do you think gives the correct error and why?

6. The backward difference formula can be expressed as
\[ f'(x_0) = \frac{1}{h}\left[f(x_0) - f(x_0-h)\right] + \frac{h}{2}f''(x_0) - \frac{h^2}{6}f'''(x_0) + O(h^3) \]
(a) Use Richardson's extrapolation to derive an $O(h^2)$ formula for $f'(x_0)$.
(b) The formula you derived should look familiar. What formula does it look like? Is it exactly the same? Why or why not?

7. Derive an $O(h^3)$ formula for approximating M that uses N(h), N(h/2), and N(h/3), and is based on the assumption that $M = N(h) + K_1 h + K_2 h^2 + K_3 h^3 + \cdots$. [S]

8. The following data give estimates of the integral $M = \int_0^{3\pi/2} \cos x\,dx$.
N(h) = 2.356194, N(h/2) = −0.4879837, N(h/4) = −0.8815732, N(h/8) = −0.9709157
Assuming $M - N(h) = K_1 h^2 + K_2 h^4 + K_3 h^6 + \cdots$, find a third Richardson's extrapolation for M. [S]

9. Suppose that N(h) is an approximation of M for every h > 0 and that $M - N(h) = K_1 h + K_2 h^2 + K_3 h^3 + \cdots$ for some constants $K_1, K_2, K_3, \ldots$. Use the values N(h), N(h/3), and N(h/9) to produce an $O(h^3)$ approximation of M. [A]

10. Use Romberg integration to compute the integral with tolerance $10^{-4}$.
(a) $\int_1^3 \ln(\sin(x))\,dx$ [S]
(b) $\int_5^7 \sqrt{x}\cos x\,dx$
(c) $\int_1^4 \frac{e^x \ln(x)}{x}\,dx$ [A]
(d) $\int_{10}^{13} \sqrt{1+\cos^2 x}\,dx$
(e) $\int_{\ln 3}^{\ln 7} \frac{e^x}{1+x}\,dx$ [A]
(f) $\int_0^1 \frac{x^2-1}{x^2+1}\,dx$ [A]
(g) $\int_0^2 x^2\ln(x^2+1)\,dx$ [A]

11. Write a Romberg integration Octave function.

12. (i) Use your code from question 11 to approximate the integral using tol = $10^{-5}$. (ii) Calculate the actual error of the approximation. (iii) Is the approximation accurate to within $10^{-5}$ as requested?
(a) $\int_0^{2\pi} x\sin(x^2)\,dx$ [A]
(b) $\int_{0.1}^2 \frac{1}{x}\,dx$
(c) $\int_0^2 x^2\ln(x^2+1)\,dx$
NOTE: $\int_0^2 x^2\ln(x^2+1)\,dx = \frac{24\ln(5) - 6\tan^{-1}(2) - 4}{9}$.

13. Compare the results of question 12 with those of question 30 on page 166.
Answers
$\lim_{h\to 0}(1+h)^{1/h} = e$: Begin by noting $\ln\left[(1+h)^{1/h}\right] = \frac{\ln(1+h)}{h}$. Set
\begin{align*}
L &= \lim_{h\to 0}\frac{\ln(1+h)}{h} \\
&= \lim_{h\to 0}\frac{\frac{d}{dh}(\ln(1+h))}{\frac{d}{dh}(h)} \\
&= \lim_{h\to 0}\frac{1}{1+h} \\
&= 1.
\end{align*}
Then
\[ e = e^L = e^{\lim_{h\to 0}\frac{\ln(1+h)}{h}} = \lim_{h\to 0} e^{\frac{\ln(1+h)}{h}} = \lim_{h\to 0} e^{\ln\left[(1+h)^{1/h}\right]} = \lim_{h\to 0}(1+h)^{1/h}. \]

\[ \tilde{e}_4(0.01) = \frac{16(2.718281828448482) - 2.718281828281785}{15} = 2.718281828459595, \]

\[ T_{5,3} = \frac{64\,T_{10,2} - T_{5,2}}{63} \approx .4596976941318606 \]
\[ \left|\int_0^1 \sin x\,dx - T_{5,3}\right| \approx 4(10)^{-16} \]
Chapter 5
More Interpolation
with the first mi derivatives specified at (ti , yi ), i = 0, 1, . . . , n. As before, the t0 , t1 , . . . , tn are called nodes.
One useful type of osculating polynomial is the Hermite polynomial in which the value of the polynomial and
its first derivative are both given at each node. Even more specifically, third degree, or cubic, Hermite polynomials
play an important role in approximation theory. Since a third degree polynomial has four parameters, data—the
ordinate and first derivative—at two nodes is sufficient to specify such a polynomial. So suppose we wish to find a
polynomial p of degree at most three that passes through (t0 , y0 ) and (t1 , y1 ) with derivative ẏ0 at t0 and ẏ1 at t1 .
Remembering the lessons of our study of interpolating polynomials, we might begin with the Lagrange form of
the interpolating polynomial passing through (t0 , y0 ) and (t1 , y1 ) and worry about the derivatives later. That gives
us $f(t) = \frac{t-t_1}{t_0-t_1}y_0 + \frac{t-t_0}{t_1-t_0}y_1$ to begin. Of course f passes through the required points, but it is not even potentially
cubic, and its derivative is $f'(t) = \frac{y_0}{t_0-t_1} + \frac{y_1}{t_1-t_0}$, a constant. It would be nice if we could add to it a third degree
polynomial that has zeroes at $t_0$ and $t_1$ and whose derivatives we can control. Well, $g(t) = (t-t_0)(t-t_1)^2$, for
example, is cubic, has zeroes at $t_0$ and $t_1$, and has derivative $(t-t_1)^2 + 2(t-t_0)(t-t_1)$, so we have at least some
control over its derivative. Great, now let us look at it a little more closely:
So $g'(t_1) = 0$ and $g'(t_0) = (t_0-t_1)^2$ is nonzero. That should remind you of how we developed the Lagrange
interpolating polynomial. Only, there, the value of the polynomial was either 0 or 1 at each node before we added
an unknown coefficient. Of course, $\hat{g}(t) = \frac{g(t)}{(t_0-t_1)^2}$ has derivative 1 at $t_0$ and 0 at $t_1$. Putting it all together,
$\hat{g}_a(t) = a\frac{(t-t_0)(t-t_1)^2}{(t_0-t_1)^2}$ has everything we need to control the derivative at $t_0$. Similarly, $\hat{h}_b(t) = b\frac{(t-t_0)^2(t-t_1)}{(t_1-t_0)^2}$ has
zeroes at $t_0$ and $t_1$ and easily specified derivatives at $t_0$ and $t_1$. Finally, a polynomial p of the form
\begin{align*}
p(t) &= \frac{t-t_1}{t_0-t_1}y_0 + \frac{t-t_0}{t_1-t_0}y_1 + \hat{g}_a(t) + \hat{h}_b(t) \\
&= \frac{t-t_1}{t_0-t_1}y_0 + \frac{t-t_0}{t_1-t_0}y_1 + a\frac{(t-t_0)(t-t_1)^2}{(t_0-t_1)^2} + b\frac{(t-t_0)^2(t-t_1)}{(t_1-t_0)^2}
\end{align*}
would be the Hermite polynomial we are after. The first two terms form the interpolating polynomial passing
through the required points. The last two terms are zero at $t_0$ and $t_1$ so do not affect this interpolation. Moreover,
the last two terms are chosen so that their derivatives are convenient at $t_0$ and $t_1$. The derivative of $\frac{(t-t_1)^2(t-t_0)}{(t_0-t_1)^2}$
is 1 at $t_0$ and 0 at $t_1$. The derivative of $\frac{(t-t_0)^2(t-t_1)}{(t_1-t_0)^2}$ is 0 at $t_0$ and 1 at $t_1$, and the derivative of the first two terms
is $m = \frac{y_1-y_0}{t_1-t_0}$ everywhere, so taking $a = \dot{y}_0 - m$ and $b = \dot{y}_1 - m$ gives the prescribed derivatives. The resulting
polynomial is
\[ p(t) = \frac{t-t_1}{t_0-t_1}y_0 + \frac{t-t_0}{t_1-t_0}y_1 + (\dot{y}_0 - m)\frac{(t-t_1)^2(t-t_0)}{(t_0-t_1)^2} + (\dot{y}_1 - m)\frac{(t-t_0)^2(t-t_1)}{(t_1-t_0)^2} \tag{5.1.1} \]
where $m = \frac{y_1-y_0}{t_1-t_0}$.
This form of the Hermite cubic polynomial is convenient for humans. It is formulaic and requires very little
computation to write down. We will call it the Human form of the Hermite cubic polynomial. A more computer-
friendly form, which we will refer to as the Computer form of the Hermite cubic is obtained via divided differences.
In general, for an osculating polynomial where the first k derivatives are specified at ti , ti and yi must be repeated
k +1 times in the divided differences table. Quotients that would otherwise be undefined as a result of the repetition
are replaced by the specified derivatives, first derivatives for first divided differences, second derivatives for second
divided differences, and so on.
For the cubic Hermite polynomial p passing through (t0 , y0 ) and (t1 , y1 ) with derivative ẏ0 at t0 and ẏ1 at t1 ,
the table looks like so:
$t_0$   $y_0$   $\dot{y}_0$
$t_0$   $y_0$
$t_1$   $y_1$   $\dot{y}_1$
$t_1$   $y_1$

The four remaining entries are to be filled in by the usual divided difference method. Can you compute them in
general (in terms of $t_0, t_1, y_0, y_1, \dot{y}_0, \dot{y}_1$)? Answers on page 183. Using the results, we write down the interpolating
polynomial in two ways:
\begin{align*}
p(t) &= y_0 + \left[\dot{y}_0\right](t-t_0) + \left[\frac{y_1-y_0}{(t_1-t_0)^2} - \frac{\dot{y}_0}{t_1-t_0}\right](t-t_0)^2 \\
&\qquad + \left[\frac{\dot{y}_1+\dot{y}_0}{(t_1-t_0)^2} - 2\frac{y_1-y_0}{(t_1-t_0)^3}\right](t-t_0)^2(t-t_1)
\end{align*}
and
\begin{align*}
p(t) &= y_1 + \left[\dot{y}_1\right](t-t_1) + \left[\frac{\dot{y}_1}{t_1-t_0} - \frac{y_1-y_0}{(t_1-t_0)^2}\right](t-t_1)^2 \\
&\qquad + \left[\frac{\dot{y}_1+\dot{y}_0}{(t_1-t_0)^2} - 2\frac{y_1-y_0}{(t_1-t_0)^3}\right](t-t_1)^2(t-t_0).
\end{align*}
Just as we had for interpolating polynomials, we have two ways to find cubic Hermite osculating polynomials. One
way is convenient for humans and the other for computers.
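The computer form lends itself to a few lines of Octave. The sketch below fills in the divided differences exactly as in the table above and returns the Newton coefficients (the name hermitecoef is illustrative, not a function from the text):

% Coefficients c so that p(t) = c(1) + c(2)(t-t0) + c(3)(t-t0)^2 + c(4)(t-t0)^2(t-t1).
function c = hermitecoef(t0, t1, y0, y1, ydot0, ydot1)
  f11 = (y1 - y0)/(t1 - t0);       % ordinary first divided difference
  f02 = (f11 - ydot0)/(t1 - t0);   % second divided differences
  f12 = (ydot1 - f11)/(t1 - t0);
  f03 = (f12 - f02)/(t1 - t0);     % third divided difference
  c = [y0, ydot0, f02, f03];
end
% example: hermitecoef(1, 5, 2, 3, 1, -1) uses the data of exercise 1 below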
Bèzier Curves
Bèzier curves are parametric curves with parameter $t \in [0,1]$ connecting two points. The simplest Bèzier curve
is a straight line passing through the two points. For example, the simplest Bèzier curve from $(-1,2)$ to $(5,-2)$ is
given by the parametric linear functions
\begin{align*}
x(t) &= (1-t)(-1) + t(5) \\
y(t) &= (1-t)(2) + t(-2),
\end{align*}
which we choose to write down in Lagrange form. You can check that $x(0) = -1$, $x(1) = 5$, $y(0) = 2$, and $y(1) = -2$.
In other words, x passes through $(0,-1)$ and $(1,5)$ while y passes through $(0,2)$ and $(1,-2)$. This parametrization
is unique because x and y are interpolating polynomials.
On the other hand, if we allow x and y to be quadratic, there are infinitely many (parametric) pairs of functions
connecting $(-1,2)$ to $(5,-2)$ even if we require x and y to be interpolating polynomials and restrict the parameter
t to the interval $[0,1]$. That is not to say we do not have quadratic Bèzier curves, but rather that we need to specify
more than just the two points to be connected. Allowing the parameter functions to be quadratic, we have, say,
\begin{align*}
x(t) &= a_x + b_x t + c_x t^2 \\
y(t) &= a_y + b_y t + c_y t^2,
\end{align*}
giving six unknowns or undetermined coefficients, if you will. Forcing $(x(0), y(0)) = (-1,2)$, we need
\begin{align*}
x(0) &= a_x = -1 \\
y(0) &= a_y = 2,
\end{align*}
and forcing $(x(1), y(1)) = (5,-2)$, we need
\begin{align*}
x(1) &= a_x + b_x + c_x = -1 + b_x + c_x = 5 \\
y(1) &= a_y + b_y + c_y = 2 + b_y + c_y = -2,
\end{align*}
or
\begin{align*}
b_x + c_x &= 6 \\
b_y + c_y &= -4.
\end{align*}
That leaves two conditions that may yet be imposed on the parameter functions.
Any particular quadratic Bèzier curve is prescribed by specifying a control point distinct from the two endpoints.
The two linear Bèzier curves, one connecting $(-1,2)$ to the control point and the other connecting the control point
to $(5,-2)$, then determine the quadratic Bèzier curve. Suppose $\vec{B}_{1,0}(t)$ is the linear Bèzier curve from $(-1,2)$ to
the control point and $\vec{B}_{1,1}(t)$ is the linear Bèzier curve from the control point to $(5,-2)$. These two curves define
a family of linear Bèzier curves, namely the set of linear Bèzier curves from $\vec{B}_{1,0}(t_0)$ to $\vec{B}_{1,1}(t_0)$, where $t_0 \in [0,1]$.
Letting $\vec{B}_{2,0,t_0}(t)$ be the linear Bèzier curve from $\vec{B}_{1,0}(t_0)$ to $\vec{B}_{1,1}(t_0)$, the point $\vec{B}_{2,0,t_0}(t_0)$ is on the quadratic Bèzier
curve from $(-1,2)$ to $(5,-2)$ via the given control point. The collection of all such points as $t_0$ varies from 0 to 1
is the quadratic Bèzier curve we are after. Different control points determine different quadratics. For example, if
we have $(0,4)$ as our control point, $\vec{B}_{1,0}$ is the linear Bèzier curve connecting $(-1,2)$ to $(0,4)$ and $\vec{B}_{1,1}$ is the linear
Bèzier curve from $(0,4)$ to $(5,-2)$:
\[ \vec{B}_{1,0}(t) = \begin{bmatrix} (1-t)(-1) \\ (1-t)(2) + t(4) \end{bmatrix} \]
and
\[ \vec{B}_{1,1}(t) = \begin{bmatrix} t(5) \\ (1-t)(4) + t(-2) \end{bmatrix}. \]
$\vec{B}_{2,0,t_0}$ is the linear Bèzier curve connecting $\vec{B}_{1,0}(t_0)$ to $\vec{B}_{1,1}(t_0)$. Therefore, $\vec{B}_{2,0,t_0}(t) = (1-t)\vec{B}_{1,0}(t_0) + t\vec{B}_{1,1}(t_0)$,
or
\[ \vec{B}_{2,0,t_0}(t) = (1-t)\begin{bmatrix} (1-t_0)(-1) \\ (1-t_0)(2) + t_0(4) \end{bmatrix} + t\begin{bmatrix} t_0(5) \\ (1-t_0)(4) + t_0(-2) \end{bmatrix}. \]
Then
\[ \vec{B}_{2,0,t_0}(t_0) = (1-t_0)\begin{bmatrix} (1-t_0)(-1) \\ (1-t_0)(2) + t_0(4) \end{bmatrix} + t_0\begin{bmatrix} t_0(5) \\ (1-t_0)(4) + t_0(-2) \end{bmatrix}. \]
Observe that $\vec{B}_{2,0,t_0}(t_0)$ is quadratic as a function of $t_0$ and that $\vec{B}_{2,0,0}(0) = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$ and $\vec{B}_{2,0,1}(1) = \begin{bmatrix} 5 \\ -2 \end{bmatrix}$.
But the notation $\vec{B}_{2,0,t_0}(t_0)$ is cumbersome and we are really interested in a parametrization of the quadratic
anyway. Letting $\vec{B}_{2,0}(t) = \vec{B}_{2,0,t}(t)$, we get the quadratic Bèzier curve from $(-1,2)$ to $(5,-2)$ via control point
$(0,4)$:
\[ \vec{B}_{2,0}(t) = (1-t)\begin{bmatrix} (1-t)(-1) \\ (1-t)(2) + t(4) \end{bmatrix} + t\begin{bmatrix} t(5) \\ (1-t)(4) + t(-2) \end{bmatrix} \]
and we have cleaner notation.
With some algebra, the expression for $\vec{B}_{2,0}$ can be simplified, but leaving it unsimplified emphasizes whence it
came. It is the result of nested linear interpolations. Higher order Bèzier curves are constructed by continued nesting.
We now use this idea to define the Bèzier curve from $\vec{P}_0$ to $\vec{P}_n$ via control points $\vec{P}_1, \vec{P}_2, \ldots, \vec{P}_{n-1}$. Commonly, $\vec{P}_0$
and $\vec{P}_n$ are also considered control points and so this Bèzier curve is also referred to as the Bèzier curve with control
points $\vec{P}_0, \vec{P}_1, \ldots, \vec{P}_n$. Such a Bèzier curve will have degree at most n.
We begin by defining the linear Bèzier curves
\[ \vec{B}_{1,i}(t) = (1-t)\vec{P}_i + (t)\vec{P}_{i+1}, \qquad i = 0, 1, \ldots, n-1. \tag{5.1.2} \]
Note that $\vec{B}_{1,i}$ is the linear Bèzier curve from $\vec{P}_i$ to $\vec{P}_{i+1}$. Then
\[ \vec{B}_{j,i}(t) = (1-t)\cdot\vec{B}_{j-1,i}(t) + (t)\cdot\vec{B}_{j-1,i+1}(t), \qquad j = 2, 3, \ldots, n;\; i = 0, 1, \ldots, n-j. \tag{5.1.3} \]
Note that $\vec{B}_{2,i}(t)$ is the quadratic Bèzier curve connecting $\vec{P}_i$ to $\vec{P}_{i+2}$ via control point $\vec{P}_{i+1}$. With a little algebra,
you can confirm that $\vec{B}_{3,i}(t)$ is at-most-cubic and connects $\vec{P}_i$ to $\vec{P}_{i+3}$. An inductive proof will show that $\vec{B}_{j,i}(t)$ is
an at-most-degree-j polynomial parametrization connecting $\vec{P}_i$ to $\vec{P}_{i+j}$. Can you provide it? Answer on page 183.
It follows that $\vec{B}_{n,0}(t)$ is the degree at most n Bèzier curve connecting $\vec{P}_0$ to $\vec{P}_n$.
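The recursion 5.1.2/5.1.3 is also a convenient way to evaluate a Bèzier curve numerically. A short Octave sketch (the name bezier is illustrative; control points are supplied one per row):

% Evaluate the Bezier curve with control points P (one point per row)
% at the parameter values t, by nesting the linear interpolations.
function XY = bezier(P, t)
  XY = zeros(numel(t), columns(P));
  for k = 1:numel(t)
    B = P;                            % start from the control points
    for j = 2:rows(P)                 % one nesting level per extra control point
      B = (1 - t(k))*B(1:end-1,:) + t(k)*B(2:end,:);
    end
    XY(k,:) = B;
  end
end
% example: bezier([-1 2; 0 4; 5 1; 5 -2], [0 0.5 1])
% returns (-1,2), (2.375,1.875), and (5,-2), points on the curve of equation 5.1.4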
Returning to our previous example, we add the control point $(5,1)$ so we have now four control points:
\[ \vec{P}_0 = \begin{bmatrix} -1 \\ 2 \end{bmatrix}, \quad \vec{P}_1 = \begin{bmatrix} 0 \\ 4 \end{bmatrix}, \quad \vec{P}_2 = \begin{bmatrix} 5 \\ 1 \end{bmatrix}, \quad \vec{P}_3 = \begin{bmatrix} 5 \\ -2 \end{bmatrix}. \]
By equation 5.1.2,
\begin{align*}
\vec{B}_{1,0}(t) &= (1-t)\vec{P}_0 + (t)\vec{P}_1 = (1-t)\begin{bmatrix} -1 \\ 2 \end{bmatrix} + t\begin{bmatrix} 0 \\ 4 \end{bmatrix} = \begin{bmatrix} -1+t \\ 2+2t \end{bmatrix} \\
\vec{B}_{1,1}(t) &= (1-t)\vec{P}_1 + (t)\vec{P}_2 = (1-t)\begin{bmatrix} 0 \\ 4 \end{bmatrix} + t\begin{bmatrix} 5 \\ 1 \end{bmatrix} = \begin{bmatrix} 5t \\ 4-3t \end{bmatrix} \\
\vec{B}_{1,2}(t) &= (1-t)\vec{P}_2 + (t)\vec{P}_3 = (1-t)\begin{bmatrix} 5 \\ 1 \end{bmatrix} + t\begin{bmatrix} 5 \\ -2 \end{bmatrix} = \begin{bmatrix} 5 \\ 1-3t \end{bmatrix}.
\end{align*}
\begin{align*}
\vec{B}_{2,0}(t) &= (1-t)\vec{B}_{1,0}(t) + (t)\vec{B}_{1,1}(t) = (1-t)\begin{bmatrix} -1+t \\ 2+2t \end{bmatrix} + t\begin{bmatrix} 5t \\ 4-3t \end{bmatrix} = \begin{bmatrix} -1+2t+4t^2 \\ 2+4t-5t^2 \end{bmatrix} \\
\vec{B}_{2,1}(t) &= (1-t)\vec{B}_{1,1}(t) + (t)\vec{B}_{1,2}(t) = (1-t)\begin{bmatrix} 5t \\ 4-3t \end{bmatrix} + t\begin{bmatrix} 5 \\ 1-3t \end{bmatrix} = \begin{bmatrix} 10t-5t^2 \\ 4-6t \end{bmatrix},
\end{align*}
and
\begin{align*}
\vec{B}_{3,0}(t) &= (1-t)\vec{B}_{2,0}(t) + (t)\vec{B}_{2,1}(t) \\
&= (1-t)\begin{bmatrix} -1+2t+4t^2 \\ 2+4t-5t^2 \end{bmatrix} + t\begin{bmatrix} 10t-5t^2 \\ 4-6t \end{bmatrix} \\
&= \begin{bmatrix} -1+3t+12t^2-9t^3 \\ 2+6t-15t^2+5t^3 \end{bmatrix}. \tag{5.1.4}
\end{align*}
Figure 5.1.1: Three points on a cubic Bèzier curve constructed by recursive linear interpolation.

$\vec{B}_{3,0}(t)$ is the cubic Bèzier curve from $(-1,2)$ to $(5,-2)$ via control points $(0,4)$ and $(5,1)$. Figure 5.1.1 shows this
Bèzier curve and the construction of three of its points via recursive linear interpolation. The blue points lie along
the linear Bèzier curves $\vec{B}_{1,0}$, $\vec{B}_{1,1}$, $\vec{B}_{1,2}$. The orange points lie along the quadratic Bèzier curves $\vec{B}_{2,0}$ and $\vec{B}_{2,1}$.
The black points lie along the cubic Bèzier curve. The graphs of the quadratics have been suppressed to avoid
overcomplicating the figure.
Figure 5.1.1 may help you grasp the recursion, but maybe more importantly, may help you understand the
relationship between the control points and the Bèzier curve. For example, upon close examination, you may be
led to believe the line segments $\vec{B}_{1,0}$ and $\vec{B}_{1,2}$ are tangent to the cubic Bèzier curve $\vec{B}_{3,0}$ at $\vec{P}_0$ and $\vec{P}_3$, respectively.
Close examination of the formulas will confirm it.
According to formulas 5.1.2 and 5.1.3, the (at most) cubic Bèzier curve with control points $\vec{P}_0, \vec{P}_1, \vec{P}_2, \vec{P}_3$ is
computed thus:
\begin{align*}
\vec{B}_{1,0}(t) &= (1-t)\vec{P}_0 + (t)\vec{P}_1 \\
\vec{B}_{1,1}(t) &= (1-t)\vec{P}_1 + (t)\vec{P}_2 \\
\vec{B}_{1,2}(t) &= (1-t)\vec{P}_2 + (t)\vec{P}_3
\end{align*}
so
\[ \vec{B}_{2,0}(t) = (1-t)\vec{B}_{1,0}(t) + (t)\vec{B}_{1,1}(t) = (1-t)\left[(1-t)\vec{P}_0 + (t)\vec{P}_1\right] + t\left[(1-t)\vec{P}_1 + (t)\vec{P}_2\right], \]
and, expanding $\vec{B}_{2,1}$ and $\vec{B}_{3,0}$ the same way,
\[ \vec{B}_{3,0}(t) = (1-t)^3\vec{P}_0 + 3t(1-t)^2\vec{P}_1 + 3t^2(1-t)\vec{P}_2 + t^3\vec{P}_3. \tag{5.1.5} \]
Hence, $\frac{d}{dt}\vec{B}_{3,0}(t) = -3(1-t)^2\vec{P}_0 + 3\left[(1-t)^2 - 2t(1-t)\right]\vec{P}_1 + 3\left[2t(1-t) - t^2\right]\vec{P}_2 + 3t^2\vec{P}_3$, from which it follows
\begin{align*}
\left.\frac{d}{dt}\vec{B}_{3,0}(t)\right|_{t=0} &= -3\vec{P}_0 + 3\vec{P}_1 = 3(\vec{P}_1 - \vec{P}_0) \\
\left.\frac{d}{dt}\vec{B}_{3,0}(t)\right|_{t=1} &= -3\vec{P}_2 + 3\vec{P}_3 = 3(\vec{P}_3 - \vec{P}_2).
\end{align*}
Indeed, the derivative of $\vec{B}_{3,0}$ at 0 is in the direction of the line segment from $\vec{P}_0$ to $\vec{P}_1$, and the derivative of $\vec{B}_{3,0}$
at 1 is in the direction of the line segment from $\vec{P}_2$ to $\vec{P}_3$. Moreover, these derivatives have magnitude exactly three
times the magnitudes of the line segments.
Though we took a somewhat circuitous route, we now see another way to compute cubic Bèzier curves besides
using recursion 5.1.2/5.1.3 or formula 5.1.5. Control points P~0 and P~3 give us two points x and y must pass through.
Control points P~1 and P~2 give us ẋ and ẏ at those two points. Thus specified, x and y are cubic Hermite polynomials!
To be precise, let P~i = (xi , yi ) for i = 0, 1, 2, 3. Then x(t) is the cubic Hermite polynomial with x(0) = x0 ,
ẋ(0) = 3(x1 − x0 ), x(1) = x3 , and ẋ(1) = 3(x3 − x2 ); and y(t) is the cubic Hermite polynomial with y(0) = y0 ,
ẏ(0) = 3(y1 − y0 ), y(1) = y3 , and ẏ(1) = 3(y3 − y2 ).
We close this section by computing the Bèzier curve from $(-1,2)$ to $(5,-2)$ via control points $(0,4)$ and $(5,1)$
using equation 5.1.1 and comparing our results to 5.1.4. With $x(0) = -1$, $\dot{x}(0) = 3$, $x(1) = 5$, and $\dot{x}(1) = 0$ (and
the understood substitution of x for y), equation 5.1.1 gives $m = \frac{5+1}{1-0} = 6$ and
\[ x(t) = \frac{t-1}{-1}(-1) + \frac{t}{1}(5) + (3-6)\frac{(t-1)^2 t}{1} + (-6)\frac{t^2(t-1)}{1}. \]
Using equation 5.1.1 with $y(0) = 2$, $\dot{y}(0) = 6$, $y(1) = -2$, and $\dot{y}(1) = -9$ gives $m = \frac{-2-2}{1-0} = -4$ and
\[ y(t) = \frac{t-1}{-1}(2) + \frac{t}{1}(-2) + (6+4)\frac{(t-1)^2 t}{1} + (-9+4)\frac{t^2(t-1)}{1}. \]
While these equations are complete and correct, it is difficult to compare them to 5.1.4 without some simplification.
Can you show
\begin{align*}
x(t) &= -1 + 3t + 12t^2 - 9t^3 \\
y(t) &= 2 + 6t - 15t^2 + 5t^3
\end{align*}
as required? Answer on page 183.
Bézier curves were originally developed around 1960 by employees at French automobile manufacturing companies.
Paul de Casteljau of Citroën was first, but Pierre Bèzier of Renault popularized the method and so has his name
associated with the polynomials.
Nowadays, almost all computer aided graphic design, or CAGD, software uses Bèzier curves, particularly
cubic, for drawing smooth objects. CAGD software with cubic Bèzier tools will display the four control points
and allow the user to move them about. In fact, the software will draw the two linear Bézier curves at the
endpoints as well. This gives the user “handles” to manipulate the curve. Some software will include the third
linear Bèzier curve as well. The three linear Bèzier curves together form the so-called control polygon. Since the
relationship between the control points and the curve is intuitive, manipulation of the control points, whether it
be by handles or control polygons, provides a means for swift modeling of smooth shapes.
Some shapes are too intricate to model with a single cubic Bèzier curve, however. To handle such shapes,
CAGD software allows a user to string cubic Bèzier curves together end to end, forming a composite, or piecewise,
Bèzier curve, such as that shown here.
This particular curve is made of two cubic Bèzier curves, one with control points P ~0 , P
~1 , P ~3 and the other with
~2 , P
control points P
~3 , P
~4 , P ~6 . Since Bèzier curves are intended to model smooth objects, software will provide the
~5 , P
option of forcing derivative matching at a common point such as P ~3 . This is done by making sure the common
point is on the line segment between its two adjacent control points (P ~2 and P~4 in this diagram). You may view
an interactive version of this diagram at the companion website.
Free open source software such as Inkscape, LibreOffice Drawing, and Dia provide Bezier curve drawing
tools, but not all of them use the technical term. Inkscape has a Bezier curve tool by that name, but LibreOffice
Drawing’s Bezier curve tool is simply called “curve”, and Dia’s tool for single Bezier curves only, not composite,
goes by the name of “Bezierline”.
Key Concepts
osculating polynomial: A polynomial whose graph is required to pass through a set of prescribed points with prescribed derivatives at those points.
Bèzier curve: A curve connecting two points via parametric osculating polynomials.
Exercises
1. Find the cubic Hermite polynomial interpolating the data.
x   f(x)   f'(x)
1   2      1
5   3      −1

x     f(x)   f'(x)
0     2      1
0.5   2      0
1     2      1

3. Let $g(x) = (\sqrt{2})^x$.
(a) Using $x_0 = 1$ and $x_1 = 2$, find a Hermite interpolating polynomial for g.
(b) Use the Hermite polynomial to approximate g(1.5).
(c) Calculate the actual error of this approximation, and compare it to the error you got in question 15 of section 3.2 on page 116.
(d) Which polynomial approximated g(1.5) with smaller absolute error, the Hermite or the Lagrange interpolating polynomial?

x     f(x)           f'(x)
0.1   −0.29004996    −2.8019975
0.2   −0.56079734    −2.6159201
0.3   −0.81401972    −2.4533949

6. Find parametric equations for the cubic Bèzier curve. The ends of the "handles" are the four control points.
7. Write down the parametric equations of the Bèzier curve with control points (−1,2), (−3,2), (3,1), and (3,0). It is not necessary to simplify your answer.
8. Construct the parametric equations for the Bèzier curve with control points (1,1), (2,1.5), (7,1.5), (6,2).
9. Find equations for the cubic polynomials that make up the composite Bézier curve.
10. The data in question 5 were generated using $f(x) = x^2\cos(x) - 3x$.
x   f(x)   f'(x)
0   2      −6
1   −4
2   −10    2

14. Construct the divided differences table that led to the Hermite polynomial
\[ p(x) = 2 - (x-1) + \frac{1}{4}(x-1)^2 + \frac{1}{4}(x-1)^2(x-3). \]
15. The Bèzier Curve
\begin{align*}
x(t) &= 11t^3 - 18t^2 + 3t + 5 \\
y(t) &= t^3 + 1
\end{align*}
has control points (5,1), (6,1), and (1,2). Find the fourth control point.
16. What is the minimum number of cubic Bèzier curves in the diagram, and why?

(a) A standard cubic Bèzier curve is given by the control points (0,0), (2,0), (0,1), and (0,3) (in that order). Convert this data into polar coordinate data. Recall that the conversion from Cartesian coordinates to polar coordinates involves the formulas
\[ r = \sqrt{x^2 + y^2} \quad\text{and}\quad \tan\theta = \frac{y}{x}. \]
(b) Find the cubic polar Bèzier curve based on your results from (a).

21. Write an Octave function to compute Hermite polynomials.
22. A car traveling along a straight road is clocked at a number of points. The data from the observations are given in the following table, where the time is in seconds, the distance is in feet, and the speed is in feet per second.
Time      0    3     5     8     13
Distance  0    225   383   623   993
Speed     75   77    80    74    72
(b) Use your polynomial from part (a) to predict the position (distance) of the car and its speed when t = 10 seconds.
(c) Determine whether the car ever exceeds the 55 mph speed limit on the road. If so, what is the first time the car exceeds this speed?
(d) What is the predicted maximum speed for the car?
NOTES: Speed is the derivative of distance.
\[ 55\ \frac{\text{miles}}{\text{hour}} = 55\ \frac{\text{miles}}{\text{hour}} \times \frac{5280\ \text{feet}}{\text{mile}} \times \frac{1\ \text{hour}}{3600\ \text{seconds}} \approx 80.67\ \frac{\text{feet}}{\text{second}} \]
23. Complete the following code.

#######################################
# Written by Dr. Len Brin             #
# 13 March 2012                       #
# Purpose: Evaluate an interpolating  #
#   polynomial at the value z.        #
# INPUT: number z                     #
#   Data x0,x1,...,xn used to         #
#   calculate the polynomial: x       #
#   Entries a0;0, a1;0,1, ...         #
#   an;0,1,...,n as an array: c       #
# OUTPUT: P(z), the value of the      #
#   interpolating polynomial at z.    #
#######################################
function ans = divDiffEval(z,x,c)
  n = length(x);
  ans = c(n);
  for i=1:n-1
    ans=(z-x(???))*ans+c(???);
  end#for
end#function
Answers
Hermite polynomial computer form: The four remaining entries are
\begin{align*}
f_{1,1} &= \frac{y_1-y_0}{t_1-t_0} \\
f_{0,2} &= \frac{f_{1,1} - \dot{y}_0}{t_1-t_0} = \frac{y_1-y_0}{(t_1-t_0)^2} - \frac{\dot{y}_0}{t_1-t_0} \\
f_{1,2} &= \frac{\dot{y}_1 - f_{1,1}}{t_1-t_0} = \frac{\dot{y}_1}{t_1-t_0} - \frac{y_1-y_0}{(t_1-t_0)^2} \\
f_{0,3} &= \frac{f_{1,2} - f_{0,2}}{t_1-t_0} = \frac{\dot{y}_1+\dot{y}_0}{(t_1-t_0)^2} - 2\frac{y_1-y_0}{(t_1-t_0)^3}
\end{align*}

Bèzier curve induction: $\vec{B}_{1,i}(0) = \vec{P}_i$ and $\vec{B}_{1,i}(1) = \vec{P}_{i+1}$, so $\vec{B}_{1,i}$ connects $\vec{P}_i$ to $\vec{P}_{i+1}$. Furthermore, $\vec{B}_{1,i}(t) = \vec{P}_i + t(\vec{P}_{i+1} - \vec{P}_i)$, so $\vec{B}_{1,i}$
is an at-most-degree-1 polynomial. Now assume $\vec{B}_{j,i}(t)$ is an at-most-degree-j polynomial connecting $\vec{P}_i$ to
$\vec{P}_{i+j}$ for some $j \geq 1$ (and all applicable i). By definition, $\vec{B}_{j+1,i}(0) = \vec{B}_{j,i}(0)$ and $\vec{B}_{j+1,i}(1) = \vec{B}_{j,i+1}(1)$. By
the inductive hypothesis, $\vec{B}_{j,i}(0) = \vec{P}_i$ and $\vec{B}_{j,i+1}(1) = \vec{P}_{i+j+1}$, so $\vec{B}_{j+1,i}$ connects $\vec{P}_i$ to $\vec{P}_{i+j+1}$. Furthermore,
$\vec{B}_{j+1,i}(t) = (1-t)\cdot\vec{B}_{j,i}(t) + (t)\cdot\vec{B}_{j,i+1}(t)$, a polynomial of degree at most $j+1$.

Bézier curve via Hermite cubics: The simplification may be done as follows.
\begin{align*}
x(t) &= \frac{t-1}{-1}(-1) + \frac{t}{1}(5) + (3-6)\frac{(t-1)^2 t}{1} + (-6)\frac{t^2(t-1)}{1} \\
&= (t-1) + 5t - 3t(t-1)^2 - 6t^2(t-1) \\
&= 6t - 1 - 3t(t^2 - 2t + 1) - 6t^3 + 6t^2 \\
&= 6t - 1 - 3t^3 + 6t^2 - 3t - 6t^3 + 6t^2 \\
&= -9t^3 + 12t^2 + 3t - 1
\end{align*}
and
\begin{align*}
y(t) &= \frac{t-1}{-1}(2) + \frac{t}{1}(-2) + (6+4)\frac{(t-1)^2 t}{1} + (-9+4)\frac{t^2(t-1)}{1} \\
&= -2(t-1) - 2t + 10t(t-1)^2 - 5t^2(t-1) \\
&= -2t + 2 - 2t + 10t(t^2 - 2t + 1) - 5t^3 + 5t^2 \\
&= 2 - 4t + 10t^3 - 20t^2 + 10t - 5t^3 + 5t^2 \\
&= 5t^3 - 15t^2 + 6t + 2.
\end{align*}
5.2 Splines
Osculating polynomials have limited use in applications where a curve is required to pass through a large number
of points. And large may mean only a half dozen or so. Take the following innocuous-looking set of points.
[Figure: the eight data points, plotted for 0 ≤ x ≤ 7, −1 ≤ y ≤ 1.]
It is easy to imagine an equally innocuous function passing through these eight points, but actually finding such a
function poses a slight challenge. The interpolating polynomial of least degree oscillates too widely.
[Figure: the interpolating polynomial of least degree through the eight points.]
This is a common problem with high-degree interpolating polynomials. There is no control over their oscillations,
and the oscillations are most often undesirable. The oscillations can be tamed to some degree by finding the
osculating polynomial through these points with, say, a first derivative of 0 at 0 and of $-\frac{1}{2}$ at the seventh point
from the left (the one whose x-coordinate is between 5 and 6).
[Figure: the osculating polynomial through the eight points.]
That’s better, but still leaves something to be desired. And the business of setting the first derivatives at two of the
points strictly for the purpose of reducing the oscillations is a bit arbitrary—better to let the nature of the problem
dictate. The oscillations of the previous attempts make them far too distinctive and interesting for the vapid set of
points with which we began. A rightfully trite way to interpolate the data is by connecting consecutive points by
line segments.
[Figure: the piecewise linear interpolation of the eight points.]
This forms what is known as the piecewise linear interpolation of the data set. This type of graph is often seen in
public media. Many applications, especially those from engineering, require some smoothness, however. Connecting
sets of three consecutive points by quadratic functions helps.
[Figure: consecutive sets of three points connected by quadratics.]
That takes care of smoothness at three of the points, but still lacks differentiability at the points common to
consecutive quadratics. Moreover, using the first three points for the first quadratic (which looks linear to the
naked eye), the third through fifth points for the second quadratic, and the fifth through seventh points for the
third quadratic (which also looks linear to the naked eye) leaves only the seventh and eighth points for what would
presumably be a fourth quadratic. With only two points, however, a line segment is used instead. A smoother
solution to the problem is to make sure the first derivatives of consecutive quadratics match at their common point.
With that in mind, it makes sense to fit only two points per parabola, leaving one coefficient (of the three in any
quadratic) for matching the derivative of the neighboring quadratic.
[Figure: a piecewise quadratic with matching first derivatives at the joints.]
That’s better! This piecewise parabolic function has continuous first derivative, but there is still something arbitrary
about it. The seven parabolas have, all together, 21 coefficients. Making each parabola pass through two points
gives 14 conditions on those coefficients. Having adjacent parabolas match first derivatives at their common points
gives 6 more conditions, one at each of the 6 interior points. That leaves one “free” coefficient. Specifying one last
condition seems a bit arbitrary, and is. The graph shows the result when the derivative at 0 is set to 1. Notice
there is no control over the derivative at the right end. Besides the arbitrariness, this asymmetry is bothersome. If
only we had one more degree of freedom...
Piecewise polynomials
A piecewise-defined function whose pieces are all polynomials is called a piecewise polynomial. It takes the form
\[ p(x) = \begin{cases} p_1(x), & x \in [x_0, x_1] \\ p_2(x), & x \in (x_1, x_2] \\ \quad\vdots & \\ p_n(x), & x \in (x_{n-1}, x_n] \end{cases} \]
where $p_i(x)$ is a polynomial for each $i = 1, 2, \ldots, n$ and $x_0 < x_1 < \cdots < x_n$; or some variant where $p(x_j)$ is defined
by exactly one of the $p_i$. If each $p_i$ is a linear function, p is called piecewise linear. If each $p_i$ is a quadratic function,
p is called piecewise quadratic. If each $p_i$ is a cubic function, p is called piecewise cubic. And so on. Examples of
piecewise linear and piecewise quadratic functions appear in the introduction to this section.
Splines
Nothing about the definition of piecewise polynomials requires one to be differentiable or even continuous. The
following function is a piecewise polynomial.
[Figure: a piecewise polynomial that is neither continuous nor differentiable at its breakpoints.]
Most applications of piecewise polynomials require continuity or differentiability, however. Any piecewise polynomial
with at least one continuous derivative is called a spline. The points separating adjacent pieces, the xj , j =
1, 2, . . . , n − 1, are called knots or joints.
The last graph in the introduction to this section shows a quadratic spline. Each piece of the piecewise function
is a quadratic, and the quadratics are chosen so that their derivatives match at the joints. As pointed out there,
though, we needed to supply one unnatural condition—the derivative at the left endpoint. It could have been the
derivative at any of the points, or even the second derivative at one of the points. In a very real sense, the choice
was arbitrary. It was not governed naturally by the question at hand. Consequently, there is a family of solutions
to the problem of connecting those eight points with a continuously differentiable piecewise quadratic.
Cubic splines
The most common spline in use is the cubic spline. As with the quadratic spline, a cubic spline is computed by
matching derivatives at the joints. In fact, there are enough coefficients in the set of cubics that both first and
second derivatives are matched. Note that, according to our definition of spline, matching both first and second
derivatives at the joints is not strictly necessary, however. Other sources will give a more restrictive definition of
spline where matching both derivatives is required. As a matter of convention, we focus on such splines.
A cubic spline required to interpolate n + 1 points has n − 1 joints and n pieces. It follows that the set of cubics
has 4n coefficients. Requiring each cubic to pass through 2 points gives 2n conditions on the coefficients. Requiring
first derivative matching at the joints gives n − 1 more conditions. Requiring second derivative matching at the
joints gives an additional n − 1 conditions for a grand total of 4n − 2 conditions. That leaves 2 “free” coefficients.
Mathematically speaking, we have a family of splines with two degrees of freedom. To find any specific spline, we
need to enforce two more conditions on the coefficients. These conditions may include the first, second, or third
derivative at two of the nodes, both the first and second derivative at a single node, or some other combination of
two derivative requirements.
Guided perhaps by knowledge of draftsman’s splines, convention leads us to supply endpoint conditions. That
is, we require something of some derivative at x0 and at xn . Supplying the first derivative is akin to pointing
the draftsman’s spline in a particular direction at its ends. Setting the second derivative equal to 0 is akin to
allowing the ends of a draftsman’s spline to freely point in whatever direction physics takes them. These models of
draftsman’s splines are not particularly accurate, but they are motivational.
A cubic spline with its first derivative specified at both endpoints is called a clamped spline. A cubic spline with
its second derivative set equal to zero at both endpoints is called a natural or free spline. A hybrid where the first
derivative is specified at one end and the second derivative is set to zero at the other has no special name. To be
precise, we have the following definitions.
Let (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) be n + 1 points where x0 < x1 < · · · < xn and let Si (x) = ai + bi (x − xi ) +
ci (x − xi )2 + di (x − xi )3 for i = 1, 2, . . . , n. Then S, defined by
\[
S(x) = \begin{cases}
S_1(x), & x \in [x_0, x_1] \\
S_2(x), & x \in [x_1, x_2] \\
\quad\vdots \\
S_n(x), & x \in [x_{n-1}, x_n],
\end{cases}
\]
[Figure: the natural cubic spline through the data, plotted on [0, 7].]
Finally, a function that is as unspectacular as the data set itself! How was it calculated, you ask? The short answer
is, the 28 simultaneous equations resulting from the definition of natural cubic spline were solved. The solution
provided the coefficients ai , bi , ci , di , i = 1, 2, . . . , 7.
Si (xi ) = ai = yi .
Without much ado, we have the values of the ai and of cn . The remaining 3n − 1 coefficients are found by solving
the remaining 3n − 1 simultaneous equations. Though a computer can certainly handle the solution from here,
finding a bit of the general solution by hand gives a much more efficient algorithm.
\[ b_1 = \frac{y_0 - y_1}{h_1} - \frac{2}{3}c_1 h_1. \]
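As an illustration only (this is not the text's algorithm, and the indexing below differs from the a_i, b_i, c_i, d_i convention used above), a natural cubic spline can also be computed by solving for the second derivatives M_0, . . . , M_n at the nodes in the standard formulation and then evaluating piece by piece. The function name and details are a sketch under those assumptions.

    function yy = naturalspline(x, y, xx)
      % Sketch: natural cubic spline via nodal second derivatives M.
      % Assumes x is increasing and min(x) <= xx <= max(x).
      n = length(x) - 1;                 % number of pieces
      h = diff(x);                       % h(i) = x(i+1) - x(i)
      A = zeros(n+1); r = zeros(n+1,1);
      A(1,1) = 1; A(n+1,n+1) = 1;        % natural conditions: M(1) = M(n+1) = 0
      for i = 2:n
        A(i,i-1) = h(i-1);
        A(i,i)   = 2*(h(i-1) + h(i));
        A(i,i+1) = h(i);
        r(i) = 6*((y(i+1)-y(i))/h(i) - (y(i)-y(i-1))/h(i-1));
      end%for
      M = A \ r;                         % second derivatives at the nodes
      yy = zeros(size(xx));
      for k = 1:length(xx)
        i = min(max(max(find(x <= xx(k))), 1), n);   % piece containing xx(k)
        t = xx(k) - x(i);
        b = (y(i+1)-y(i))/h(i) - h(i)*(2*M(i)+M(i+1))/6;
        yy(k) = y(i) + b*t + M(i)/2*t^2 + (M(i+1)-M(i))/(6*h(i))*t^3;
      end%for
    end%function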
certain graph.

5. Write down a system of equations that could be solved in order to find the free cubic spline through the following data points. Do not solve the system.

       x     f(x)
      0.1   −0.62
      0.2   −0.28
      0.3    0.0066
      0.4    0.24

6. Write down the system of equations that would need to be solved in order to find the cubic spline through (0, −9), (1, −13), and (2, −29) with clamped boundary conditions S′(0) = 1 and S′(2) = −1. Do not attempt to solve the system.

7. Set up but do not solve the equations which could be solved to find the clamped cubic spline through the points (1, 1), (2, 3), and (4, 2) with S′(1) = S′(4) = 0. [S]

8. Write down a system of equations that could be solved in order to find the clamped cubic spline through the following data points with S′(0.1) = 0.5 and S′(0.4) = 0.1. Do not solve the system.

       x     f(x)
      0.1   −0.62
      0.2   −0.28
      0.3    0.0066
      0.4    0.24

   (b) 9b
   (c) 9c [A]

11. Modify the Octave code presented in this section so that it computes the coefficients for a clamped cubic spline. [S]

12. Use your code from question 11 to check your answer to question
    (a) 9d [S]
    (b) 9e [A]
    (c) 9f

13. Modify the Octave code presented in this section so that it computes the coefficients for a cubic spline with mixed endpoint conditions 3c (page 187).

14. Use your code from question 13 to find the cubic spline through (0, −9), (1, −13), and (2, −29) with mixed boundary conditions S′(0) = 1 and S″(2) = 0.

15. Use your code from question 13 to find the cubic spline through the points (1, 1), (2, 3), and (4, 2) with S′(1) = S″(4) = 0.

16. Suppose n + 1 points are given (n > 1). How many endpoint conditions are needed to fit the points with a
    (a) quadratic spline with first derivative matching at each joint?
    (b) cubic spline with first and second derivative matching at each joint?
    (c) quartic spline with first, second, and third derivative matching at each joint?
    (d) a degree k spline (k > 1) with derivative matching up to degree k − 1 at each joint?

17. Suppose a spline S is to be fit to the four points (xi, yi), i = 0, 1, 2, 3 where x0 < x1 < x2 < x3. Further suppose S is to be linear on [x0, x1], quadratic on [x1, x2], and cubic on [x2, x3]. Finally suppose S is to have one continuous derivative. How many endpoint conditions are needed to specify the spline uniquely? Argue that any such endpoint conditions must be specified at x3 and not x0.

18. Let f(x) = sin x and x0 = 0, x1 = π/4, x2 = π/2, x3 = 3π/4, and x4 = π.
    (a) Find the cubic (clamped) spline through (x0, f(x0)), (x1, f(x1)), . . . , (x4, f(x4)) with S′(0) = f′(0) and S′(π) = f′(π).
    (b) Approximate f(π/3) by computing S(π/3).
    (c) Approximate f(7π/8) by computing S(7π/8).
    (d) Calculate the absolute errors in the approximations.
Chapter 6
Ordinary Differential Equations
Galileo never implemented the pendulum as a timekeeping mechanism. It was around 15 years after Galileo’s
death that the pendulum clock became a reality. Even though his first pendulum clock (1656) was more accurate
than any other clock at the time, Huygens strived to improve upon its design. During his quest, he built a clock
with a modified pendulum and, in 1673, published the classic work, Horologium Oscillatorium, where mathematical details of the isochronism of the cycloid were laid out for the first time.[33, 21]
Today, we take for granted that the cycloid is the path a falling object must follow in order for its travel to a
given point to happen in the same time regardless of its starting position. And we also take for granted that the
period of a simple pendulum varies with its amplitude. We have over 400 years of physical and mathematical
hindsight that tell us so!
The damping or drag force (air resistance) is taken as a force proportional to the speed of the bob, ℓθ̇, so has magnitude cℓθ̇. Damping forces are always taken to directly oppose the motion, so the magnitude of damping in the direction of T⃗ is its entirety. It only remains to choose the right sign. Since θ̇ indicates the direction of motion, the damping force must have the opposite sign. The damping constant c is taken to be positive, and of course ℓ is positive, so the damping force must be −cℓθ̇.
The tension acting on the bob is irrelevant because it is always perpendicular to the motion. The component of
tension in the tangential direction is always zero.
Substituting the sum of these tangential forces for F , Newton’s second law applied to the pendulum becomes
−mg sin θ − cℓθ̇ − 0 = mℓθ̈, or
\[ \ddot{\theta} + \frac{c}{m}\dot{\theta} + \frac{g}{\ell}\sin\theta = 0. \tag{6.1.1} \]
Equation 6.1.1 is known as a differential equation because it is an equation that involves derivatives (or differentials).
To be more precise, it is a second degree ordinary differential equation (o.d.e.). Second degree because the highest
degree derivative is the second and ordinary because it involves only one independent variable (time t).
The simplest differential equations are considered in calculus, though the term “differential equation” is rarely
used. When first discussing the idea of antidifferentiation, the question of “What function has a derivative equal to
... ?” inevitably comes up. For example, one might be faced with the question of what function’s derivative equals
x? This question can also be asked, what function y satisfies the (differential) equation y 0 = x? The answer can be
arrived at by integrating the equation:
\[ \int y'\,dx = \int x\,dx \]
\[ y = \frac{1}{2}x^2 + C \]
(don’t forget the constant of integration!).
Friction: when a body lies in contact with a surface, friction opposes motion with a magnitude proportional to the
normal force. The constant of proportionality is called the coefficient of friction and is denoted by µ. For any
body/surface combination, there are two types of friction to consider—static friction and kinetic friction. A
body at rest on a surface is capable of resisting a greater force than is the same body sliding across the same
surface (with the same normal force). You may be familiar with this phenomenon if you’ve ever tried to slide
an oven into or out of its usual position in a kitchen. It’s much harder to get it started moving than it is to
keep it moving. Whether the friction is static or kinetic, it always resists motion tangential to the surface.
Applied: a force that is applied to a body by another body, such as a person pushing a sofa or an engine accelerating
a vehicle.
The anti-lock braking system (ABS) of an automobile is designed to take advantage of the fact that the static
friction between a tire and the road can stop a car more quickly than the kinetic friction between the same tire
and the same road. A tire that is not skidding is capable of applying a greater braking (frictional) force than the
same tire skidding. When the ABS senses that a wheel has locked (ceased rotation) while the car is still moving,
it forces the driver to let up on the brake enough so the wheel will start spinning again, though very briefly. If
the driver continues to hold down the brake hard enough to skid, the ABS will force the driver to let up again.
The ABS rapidly alternates between forcing the driver to let up and allowing the driver to do as (s)he will. The
quick alternation between making the driver let up and allowing the driver to brake hard is what causes the
vibration or pulsing you feel when the ABS kicks in. If the ABS is working properly, a vehicle will come to a
halt more quickly than it would have if it were allowed to skid to a stop. Also, it’s much easier to steer a car
when it is not skidding than when it is skidding!
3(2)2 − 8(2) + 4 = 0,
a true statement. Analogously, we would say that θ = e2t is a solution of the differential equation 3θ̈ − 8θ̇ + 4θ = 0
since, substituting e2t for θ gives
3(4e2t ) − 8(2e2t ) + 4(e2t ) = 0,
again a true statement. Notice that the derivatives θ̇ and θ̈ need to be calculated in order to complete the substi-
tution.
Approximate solutions of differential equations, then, must be approximations of functions. In fact, for any
given ode, we settle for the crudest approximation, a set of points that, if our approximation is good, lie near the
graph of an exact solution. Hence the set {(0, 1), (.25, 1.5), (.5, 2.25), (.75, 3.375), (1, 5.0625)} might qualify as an
approximate solution of the equation 3θ̈ − 8θ̇ + 4θ = 0 for t ∈ [0, 1]. See figure 6.1.2. The approximation is good
for values of t near zero but not as good for values of t near 1.
any constant c. The ode 3θ̈ − 8θ̇ + 4θ = 0 has infinitely many solutions! It is a straightforward exercise to check.
For θ = ce2t , θ̇ = 2ce2t and θ̈ = 4ce2t , so
3θ̈ − 8θ̇ + 4θ = 3(4ce2t ) − 8(2ce2t ) + 4(ce2t )
= 12c(e2t ) − 16c(e2t ) + 4c(e2t )
= (12c − 16c + 4c)e2t
= 0.
Even more, θ = ae^{2t/3} is a solution for any constant a. This solution can be verified just as the solution θ = ce^{2t} was verified. Can you do it? Answer on page 199. Finally, θ = ce^{2t} + ae^{2t/3} is also a solution for any pair of
constants c and a! Can you show it? Answer on page 200. It is not uncommon for a differential equation to have
infinitely many solutions.
Another differential equation with infinitely many solutions is
\[ \dot y = \frac{t}{y}. \]
The solutions are y = √(t² + c) and y = −√(t² + a), valid for any constants c and a as long as y ≠ 0. Complex solutions are valid! However, if we also require y(0) = 1, there is only one solution! y = −√(t² + c) is no longer a solution because it gives negative values of y for all values of t. And y = √(t² + c) is only a solution if c = 1. The one and only solution is y = √(t² + 1).
The requirement y(0) = 1 is called an initial value, or initial condition, and the pair of equations
\[ \dot y = \frac{t}{y}, \qquad y(0) = 1 \]
is called an initial value problem. More generally, the pair of equations
\[ \dot y = f(y, t), \qquad y(t_0) = y_0 \]
forms what is known as a first order initial value problem.
Setting y = √(t² + 1), ẏ = (1/(2√(t² + 1)))·(2t) = t/√(t² + 1). Hence the equation ẏ = t/y becomes
\[ \frac{t}{\sqrt{t^2+1}} = \frac{t}{\sqrt{t^2+1}}, \]
an undeniably true statement. Hence y = √(t² + 1) is a solution of ẏ = t/y. Moreover y(0) = √(0² + 1) = 1, so the particular solution y = √(t² + 1) satisfies the requirement that y(0) = 1 also. Hence y = √(t² + 1) is one solution—and the only solution of the form y = √(t² + c) or y = −√(t² + a). But is it the only solution of any form? Perhaps there are other functions that satisfy the differential equation. A little bit of calculus should help settle the issue. The demonstration hinges on showing that y = √(t² + c) and y = −√(t² + a) are the only solutions of ẏ = t/y. The following sequence of equations shows it. Each line implies the next.
\[
\begin{aligned}
\frac{dy}{dt} &= \frac{t}{y}, & y &\ne 0 \\
y\,dy &= t\,dt, & y &\ne 0 \\
\int y\,dy &= C + \int t\,dt, & y &\ne 0 \\
\tfrac{1}{2}y^2 &= C + \tfrac{1}{2}t^2, & y &\ne 0 \\
y^2 &= 2C + t^2, & y &\ne 0 \\
y &= \pm\sqrt{t^2 + 2C}, & y &\ne 0.
\end{aligned}
\]
Replacing the constant 2C with c or a does not change the fact that the term is an arbitrary constant, so y = √(t² + c) and y = −√(t² + a) are the only solutions of ẏ = t/y. This method of solving the differential equation
is called separation of variables.
Key Concepts
Approximate solution of a differential equation: a set of points that, ideally, lie near the graph of an exact
solution.
Degree of a differential equation: equal to the order of the highest derivative appearing in the equation.
Differential equation: an equation with derivatives (or differentials) in it.
Free body diagram: An engineering diagram consisting of only a body and the forces acting on it.
Initial value problem: a differential equation coupled with a required value of the solution.
Newton’s second law of motion: the acceleration of an object is directly proportional to the magnitude of the
net force applied to the object, in the same direction as the net force, and inversely proportional to the mass
of the object—often summarized by the equation F = ma. This equation assumes the mass of the object is
constant.
Ordinary differential equation (o.d.e.): a differential equation with only one independent variable.
Solution of a differential equation: a function that, when substituted for the dependent variable, makes the
equation a true statement.
Exercises

1. State the degree of the differential equation.
   (a) ẏ = y [A]
   (b) y″ = 6x + sin x
   (c) s̈ + ṡ + s = 0
   (d) f′ + f/x = x² [S]
   (e) (2h + x)h′ + h = 4x
   (f) r̈ṙt² = −1/8 [A]

2. Verify that the function is a solution of the differential equation.
   (a) y(t) = e^t; ẏ = y [A]
   (b) y(x) = x³ − 26.83x − sin x; y″ = 6x + sin x
   (c) s(t) = e^(−t/2) sin(√3 t/2); s̈ + ṡ + s = 0 [A]
   (d) f(x) = x³/4 + 4/x, x > 0; f′ + f/x = x² [S]
   (e) h(x) = −2x; (2h + x)h′ + h = 4x
   (f) r(t) = √t, t > 0; r̈ṙt² = −1/8 [A]

3. Verify that the function is a solution of the initial value problem.
   (a) y(t) = 4e^t; ẏ = y, y(0) = 4 [A]
   (b) y(x) = x³ − sin x − π³; y′ = 3x² − cos x, y(π) = 0
   (c) s(t) = ½(1 + e^(−t²)); ṡ = (1 − 2s)t, s(0) = 1 [A]
   (d) f(x) = x³/4 + 16/x, x > 0; f′ = −f/x + x², f(4) = 20 [S]

4. Solve the differential equation.
   (a) y′ = 5x⁴ [A]
   (b) y′ = 3xe^(x²)
   (c) ẏ = t − sin t [S]
   (d) ẏ = 1/t, t < 0 [A]
   (e) s′ = 1 − ln x [A]
   (f) ṡ = 3te^t [A]

5. Given are an initial value problem, its exact solution, and an approximate solution. Comment on how well the approximate solution approximates the exact solution.
   (a) ẏ = y, y(0) = 4; y(t) = 4e^t; {(0, 4), (.25, 5), (.5, 6.3), (.75, 7.8), (1, 9.8)} [A]
   (b) y′ = 3x² − cos x, y(π) = 0; y(x) = x³ − sin x − π³; {(π, 0), (5π/4, 30), (3π/2, 74), (7π/4, 135), (2π, 216)}
   (c) ṡ = (1 − 2s)t, s(0) = 1; s(t) = ½(1 + e^(−t²)); {(0, 1), (.5, 1), (1, .75), (1.5, .5), (2, .5)} [A]
   (d) f′ = −f/x + x², f(4) = 20; f(x) = x³/4 + 16/x; {(4, 20), (4.25, 23), (4.5, 26), (4.75, 30), (5, 34)} [S]
   (e) h′ = (1 + 4x − h)/(2h + x + 1), h(0) = −1; h(x) = −2x − 1; {(0, −1), (.25, −1.5), (.5, −2), (.75, −2.5), (1, −3)}
   (f) r̈ṙt² = −1/8, r(9) = 0, ṙ(9) = −1/6; r(t) = √t − 3; {(9, 0), (10, .16), (11, .31), (12, .46), (13, .61)} [A]

6. Draw a free body diagram for the situation.
   (a) Pendular motion ignoring air resistance (no damping). [A]
   (b) A block sliding down an inclined plane. [A]
   (c) A block sitting on an inclined plane (not moving).
   (g) A sofa being pushed up an old, slanted hardwood floor. The applied force may or may not be parallel to the floor. [A]
   (h) A sledder has reached the bottom of a hill (and is now traveling on level snow) and is coasting to a stop. [A]
   (i) A sledder sledding down a hill. [A]
   (j) A hockey puck sliding across an ice rink. [A]
   (k) A hockey puck sliding across ice at constant speed (ignoring friction).
   (l) A sky diver falling. [A]
   (m) A sky diver whose parachute just opened. [S]
   (n) A sky diver whose parachute just opened while a constant breeze is blowing sideways. [A]
   (o) A football originally kicked at a 40 degree angle just as it reaches its peak, ignoring drag. [A]
   (p) A football moving up and to the right approaching its peak, ignoring drag. [A]

7. Use the free body diagram from question 6 to find the equation of motion in the tangential direction for (6a)-(6k), and in the vertical direction for (6l)-(6p). [S][A]

8. How much easier is it to slide a sofa by pushing parallel to the floor as opposed to slightly toward the floor? Compare the kinetic friction for a sofa being pushed parallel to the floor to one being pushed at an angle of 20 degrees from parallel. Then calculate the necessary applied force to overcome kinetic friction in each case. Assume the floor is level. [A]
Answers
\[
\begin{aligned}
3\ddot\theta - 8\dot\theta + 4\theta &= 3\left(\tfrac{4}{9}ae^{2t/3}\right) - 8\left(\tfrac{2}{3}ae^{2t/3}\right) + 4\left(ae^{2t/3}\right) \\
&= \tfrac{4}{3}a(e^{2t/3}) - \tfrac{16}{3}a(e^{2t/3}) + \tfrac{12}{3}a(e^{2t/3}) \\
&= \left(\tfrac{4}{3}a - \tfrac{16}{3}a + \tfrac{12}{3}a\right)e^{2t/3} \\
&= 0.
\end{aligned}
\]
θ = ce^{2t} + ae^{2t/3} is a solution of 3θ̈ − 8θ̇ + 4θ = 0: θ̇ = 2ce^{2t} + (2/3)ae^{2t/3} and θ̈ = 4ce^{2t} + (4/9)ae^{2t/3}, so
\[
\begin{aligned}
3\ddot\theta - 8\dot\theta + 4\theta &= 3\left(4ce^{2t} + \tfrac{4}{9}ae^{2t/3}\right) - 8\left(2ce^{2t} + \tfrac{2}{3}ae^{2t/3}\right) + 4\left(ce^{2t} + ae^{2t/3}\right) \\
&= 12c(e^{2t}) + \tfrac{4}{3}a(e^{2t/3}) - 16c(e^{2t}) - \tfrac{16}{3}a(e^{2t/3}) + 4c(e^{2t}) + \tfrac{12}{3}a(e^{2t/3}) \\
&= (12c - 16c + 4c)e^{2t} + \left(\tfrac{4}{3}a - \tfrac{16}{3}a + \tfrac{12}{3}a\right)e^{2t/3} \\
&= 0.
\end{aligned}
\]
is y(t) = t³/4 + 16/t, t > 0, as verified in exercise 3d on page 199. For the time being, let us try to forget that we know
the exact solution, and study a method for approximating it. We will recall that we have the exact solution when
we are ready to check how the approximation is going. The initial condition, y(4) = 20, means that the graph of
the exact solution passes through (4, 20). What a great place to start an approximate solution—at a point that is
on the graph of the exact solution! Thus the approximation is seeded by the initial condition. There are numerous
ways to proceed from there. Perhaps the simplest way is to use the differential equation to compute the exact slope
(derivative) of y at (4, 20):
\[ \dot y(4) = -\frac{y(4)}{4} + 4^2 = -\frac{20}{4} + 4^2 = 11. \]
You might imagine a graph like that in figure 6.2.1. The graph is that of the first order Taylor polynomial expanded
about t0 = 4. According to Taylor's theorem, y(t) = 20 + 11(t − 4) + (ÿ(ξ)/2)(t − 4)² for t near 4 and some ξ, depending
on t. So, y(2) ≈ T1 (2) = 20 + 11(2 − 4) = −2 and y(5) ≈ T1 (5) = 20 + 11(5 − 4) = 31 (as long as y has two
derivatives on an open interval containing [2, 5]), and so on. As always, there is the concern of how good these
approximations are.
In section 4.4, two different approximations for the same number were used to estimate error in the adaptive
methods. A similar tack may be used here. We will compare approximations given by T1 and T2 . The differential
equation can be used to compute ÿ, in terms of y and t. Implicitly differentiating the differential equation gives
\[ \ddot y = -\frac{\dot y\,t - y}{t^2} + 2t. \]
But ẏ = −y/t + t², so we may substitute into and simplify the expression for ÿ:
\[ \ddot y = -\frac{\left(-\frac{y}{t} + t^2\right)t - y}{t^2} + 2t = -\frac{-y + t^3 - y}{t^2} + 2t = \frac{2y}{t^2} - \frac{t^3}{t^2} + 2t = \frac{2y}{t^2} + t. \]
t0 y(t0 )
4 20
3.75 17.25
3.5 14.88437
3.25 12.88504
3 11.23557
2.75 9.92187
2.5 8.93323
2.25 8.26406
2 7.91666
The method of using formula (6.2.2) repeatedly to compute a sequence of points approximately on the solution of
an ordinary differential equation is most often called Euler’s method.[7] It may also be referred to as the Taylor
method of degree 1 since it uses Taylor polynomials of degree 1 at each step. The value ti+1 − ti is called the step
size and is often held constant, so you are likely to see Euler's method written as y_{i+1} = y_i + h·f(t_i, y_i).
Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1.
Input: Differential equation ẏ = f(t, y); initial condition y(t0) = y0; numbers t0 and t1; number of steps N.
Step 1: Set t = t0; y = y0; h = (t1 − t0)/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set y = y + h·f(t, y)
Step 4: Set t = t0 + (j/N)(t1 − t0)
Output: Approximation y of the solution at t = t1.
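One possible Octave translation of this pseudo-code is sketched below (the function name and interface are illustrative; the code accompanying the text may differ). Called as euler(@(t,y) -y/t+t^2, 4, 20, 2, 8), its result should match the last entry of the table above (about 7.91666).

    function y = euler(f, t0, y0, t1, N)
      % Sketch of Euler's method: N equal steps from t0 to t1.
      h = (t1 - t0)/N;
      t = t0; y = y0;
      for j = 1:N
        y = y + h*f(t, y);          % Step 3
        t = t0 + j*(t1 - t0)/N;     % Step 4
      end%for
    end%function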
Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1.
Input: Differential equation ẏ = f(t, y); formulas ÿ(t, y) and \(\dddot y\)(t, y); initial condition y(t0) = y0; numbers t0 and t1; number of steps N.
Step 1: Set t = t0; y = y0; h = (t1 − t0)/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set \(y = y + h f(t,y) + \tfrac{1}{2}h^2\,\ddot y(t,y) + \tfrac{1}{6}h^3\,\dddot y(t,y)\)
Step 4: Set t = t0 + (j/N)(t1 − t0)
Output: Approximation y of the solution at t = t1.
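For o.d.e. 6.2.1 specifically, the formulas ÿ = 2y/t² + t derived above and the third derivative −6y/t³ + 3 (computed in the answers to this section) make the degree 3 method easy to script. The following is a sketch only, not the text's code.

    f    = @(t,y) -y/t + t^2;          % ydot, from o.d.e. 6.2.1
    ydd  = @(t,y) 2*y/t^2 + t;         % yddot, derived in this section
    yddd = @(t,y) -6*y/t^3 + 3;        % third derivative (see the answers)
    t0 = 4; t1 = 2; N = 8;
    h = (t1 - t0)/N;                   % negative step: from t = 4 down to t = 2
    t = t0; y = 20;                    % initial condition y(4) = 20
    for j = 1:N
      y = y + h*f(t,y) + h^2/2*ydd(t,y) + h^3/6*yddd(t,y);   % Step 3
      t = t0 + j*(t1 - t0)/N;                                % Step 4
    end%for
    disp(y)                            % approximation of y(2)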
Using Octave code based on the pseudo-code presented in this section, Table 6.2 summarizes the approximate
solution of 6.2.1 using Euler’s method and Taylor’s method of degree 3 to approximate y(2).
Now is a good time to say something about the error of Taylor methods. Remember a Taylor polynomial of
degree n has an error of order n + 1, so Euler’s method uses a Taylor polynomial with error of order 2 and Taylor’s
degree 3 method uses a Taylor polynomial with error of order 4. But how does that translate into an error term
for the Taylor method?
Though we will not answer this question completely here, we can get some idea what to expect from Table 6.2.
From the Euler’s method row, we see the error decrease from (roughly) 3.9 to 2.08 to 1.08 as the step size is reduced
by a factor of one half. Since
\[ \frac{2.08}{3.9} \approx \frac{1.08}{2.08} \approx \left(\frac{1}{2}\right)^{1}, \]
we conclude that Euler’s method is of first order. Considering the row on Taylor’s degree 3 method, we see the
error decrease from about .024 to .0037 to .00051 as the step size is reduced by a factor of one half. Since
\[ \frac{.0037}{.024} \approx \frac{.00051}{.0037} \approx \frac{1}{8} = \left(\frac{1}{2}\right)^{3}, \]
we conclude that Taylor’s degree 3 method is of order 3.
Notice the similarity between this observation and the observation we made about composite integration. In
section 4.4, we argued that the error term for a composite integration formula had order one less than that of a
single application of the underlying integration formula. The same thing happens here. When the truncation error
for the underlying Taylor polynomial has order n, the corresponding o.d.e. solver has order n − 1, an order equal
to the degree of the Taylor polynomial itself.
\[ u' = f(u, y, x) \]
\[ y' = u, \]
which can be solved using a numerical method for first order differential equations.
For example, the equation of a pendulum (6.1.1) can be rearranged as θ̈ = −(c/m)θ̇ − (g/ℓ) sin θ. If we substitute the auxiliary variable u = θ̇ into the equation, it becomes u̇ = −(c/m)u − (g/ℓ) sin θ, and the system
\[ \dot u = -\frac{c}{m}u - \frac{g}{\ell}\sin\theta \]
\[ \dot\theta = u \]
is equivalent to (6.1.1). Euler’s method, for example, can be applied to this system in the following way:
\[ u_{n+1} = u_n + h\left(-\frac{c}{m}u_n - \frac{g}{\ell}\sin\theta_n\right) \]
\[ \theta_{n+1} = \theta_n + h\,u_n \]
\[ t_{n+1} = t_n + h \]
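A short Octave sketch of these updates follows; the damping, mass, length, step size, and initial conditions are placeholder values chosen only for illustration (they are not taken from the text).

    c = 0.5; m = 1; g = 9.81; ell = 1;    % hypothetical parameters
    h = 0.01; N = 1000;                   % step size and number of steps
    theta = pi/6; u = 0; t = 0;           % initial angle, angular velocity, time
    for n = 1:N
      unew  = u + h*(-c/m*u - g/ell*sin(theta));   % u update uses old values
      theta = theta + h*u;                         % theta update uses old u
      u = unew;
      t = t + h;
    end%for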
Key Concepts
Taylor method: A method for approximating the solution of a first order o.d.e. in which a Taylor polynomial of
some predetermined order is used at each step to compute the next.
Euler’s method: Another name for the first order Taylor method, having formula yi+1 = yi + h · ẏ(ti , yi ).
Exercises

1. Use Euler's method with step size h = 0.5 to approximate y(2). [S]
   (a) dy/dx = 3x − 2y, y(1) = 1
   (b) dy/dx = 3x³ − y, y(1) = 3 [A]
   (c) ẏ = ty, y(1) = 0.5
   (d) cos(x)y′ + sin(x)y = 2cos³(x) sin(x) − 1, y(1) = 0 [S]
   (e) 7ẏ + 3y = 5, y(1) = 2

2. Repeat exercise 1 using Taylor's method of order 2. [S][A]

3. Repeat exercise 1 using Taylor's method of order 3. [S][A]

4. Execute two steps of Euler's method for solving ẏ = ty with y(1) = −0.5 and h = 0.25, thus approximating y(1.5). [A]

5. Write pseudo-code for Taylor's method of order 2. [A]

6. Write pseudo-code for Taylor's method of order 4.

7. Write an Octave function that implements Euler's method. [S]

8. Write an Octave function that implements Taylor's method of degree 2. [A]

9. Write an Octave function that implements Taylor's method of degree 3.

10. Write an Octave function that implements Taylor's method of degree 4.

11. Use your code from exercise 8 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e⁻²)/4 to verify that Taylor's method of degree 2 is an order 2 numerical method. [A]

12. Use your code from exercise 9 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e⁻²)/4 to verify that Taylor's method of degree 3 is an order 3 numerical method.

13. Use your code from exercise 10 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e⁻²)/4 to verify that Taylor's method of degree 4 is an order 4 numerical method.

14. Write the equation of motion you derived in exercise 7 on page 199 as a first order system. [S][A]

15. Given the following parameter values and initial conditions for the referenced system, use Euler's method with a step size h = 0.25 to compute s(0.5) or θ(0.5) as appropriate.
    14a: g = 9.81 m/s²; ℓ = .31 m; θ(0) = π/3; θ̇(0) = 0 [A]
    14b: g = 32.2 ft/s²; µ = .21; α = .25 rad; s(0) = 0; ṡ(0) = .3 ft/s [A]
    14c: g = 32.2 ft/s²; µ = .21; α = .25 rad; s(0) = 0; ṡ(0) = 0 [S]
    14d: g = 32.2 ft/s²; µ = .21; α = .25 rad; m = .19 lbm; Fapplied = 15 lb; s(0) = 0; ṡ(0) = 1 ft/s
    14e: g = 9.81 m/s²; µ = .15; m = 35 kg; Fapplied = 75 N; s(0) = 0; ṡ(0) = .03 m/s [A]
    14f: g = 9.81 m/s²; µ = .15; β = π/10 rad; m = 35 kg; Fapplied = 75 N; s(0) = 0; ṡ(0) = .03 m/s [S]
    14g: g = 9.81 m/s²; µ = .15; α = .05 rad; β = π/10 rad; m = 35 kg; Fapplied = 90 N; s(0) = 0; ṡ(0) = .03 m/s [A]
    14h: g = 32.2 ft/s²; µ = .01; s(0) = 0; ṡ(0) = 30 ft/s [A]
    14i: g = 32.2 ft/s²; µ = .01; α = π/6 rad; s(0) = 0; ṡ(0) = 10 ft/s [A]
    14j: g = 32.2 ft/s²; µ = .003; s(0) = 0; ṡ(0) = 88 ft/s [A]
    14k: g = 32.2 ft/s²; µ = 0; s(0) = 0; ṡ(0) = 88 ft/s
    14l: g = 9.81 m/s²; c = 4.5; m = 70 kg; s(0) = 10000; ṡ(0) = −10 m/s [A]
    14m: g = 9.81 m/s²; c = 26; m = 70 kg; s(0) = 2000; ṡ(0) = −55 m/s [S]

16. Find a formula for the angle at which a stationary block on an inclined plane (whose angle of inclination is increasing) will start moving.

17. Find a formula for the angle at which a block moving down an inclined plane (whose angle of inclination is decreasing) will stop moving.
18. Undetermined Coefficients. For each differential equation, a solution with undetermined coefficients is suggested. Find values for the coefficients that make the suggested solution an actual solution.
    (a) y″ + 5y′ − 8y = 3x²; y(x) = Ax² + Bx + C [S]
    (b) 2y‴ − 5y″ + 3y′ + 5y = x + 1; y(x) = Ax + B [S]
    (c) 3y′ + 2y = 3x + 2; y(x) = Ax + B [A]
    (d) y″ − 14y′ + 7y = 2x² + 3x − 1; y(x) = Ax² + Bx + C [A]
    (e) 2ẏ + y = t⁴ + 1; y(t) = A + Bt + Ct² + Dt³ + Et⁴ [A]
    (f) ẍ + 2ẋ − x = 1 + te^t; x(t) = Ate^t + Be^t + C
    (g) θ̇ − θ = e^(−t) sin t; θ(t) = Ae^(−t) sin t + Be^(−t) cos t [A]
    (h) θ̈ + (1/10)θ̇ + θ = t cos t; θ(t) = At cos t + Bt sin t + C cos t + D sin t
    (i) ẍ − 2ẋ − 35x = e^(7t) + 1; x(t) = Ate^(7t) + Be^(7t) + C [A]
Answers
T3(2): Begin by calculating \(\dddot y = \frac{d}{dt}\ddot y\).
\[ \dddot y = \frac{d}{dt}\left(\frac{2y}{t^2} + t\right) = \frac{2\dot y t^2 - 4ty}{t^4} + 1 = \frac{2\left(-\frac{y}{t} + t^2\right)t^2 - 4ty}{t^4} + 1 = \frac{-2ty + 2t^4 - 4ty}{t^4} + 1 = \frac{-6y}{t^3} + 3, \]
so \(\dddot y(4) = \frac{-6(20)}{4^3} + 3 = 3 - \frac{120}{64} = \frac{9}{8}\). Therefore, \(T_3(t) = 20 + 11(t-4) + \frac{13}{4}(t-4)^2 + \frac{3}{16}(t-4)^3\), and \(T_3(2) = 9.5\),
so it is close to T2 (2) = 11. We can start to believe that y(2) is somewhere around 9.5 or 11.
Details:
ẏ = f (t, y)
y(t0 ) = y0
has an exact solution that can be written in terms of an integral. For any value t̃, and assuming existence of a
solution over the interval from t0 to t̃, we can find a value for y(t̃) by integrating both sides of ẏ = f (t, y) with
respect to t:
\[ \int_{t_0}^{\tilde t} \dot y\,dt = \int_{t_0}^{\tilde t} f(t, y)\,dt \]
\[ y(\tilde t) - y(t_0) = \int_{t_0}^{\tilde t} f(t, y)\,dt \]
\[ y(\tilde t) = y(t_0) + \int_{t_0}^{\tilde t} f(t, y)\,dt. \tag{6.3.1} \]
When t0 and t̃ are not close to one another, which is what we normally assume, we need to proceed in small steps
as done in section 6.2.
Substituting t1 for t̃ in equation 6.3.1, \(y(t_1) = y(t_0) + \int_{t_0}^{t_1} f(t,y)\,dt\), so we can add \(\int_{t_0}^{t_1} f(t,y)\,dt\) to the known value y(t0) to get y(t1), our first small step on the way to approximating y(t̃). Now substituting t1 for t0 and t2 for t̃ in equation 6.3.1, \(y(t_2) = y(t_1) + \int_{t_1}^{t_2} f(t,y)\,dt\). So, we can compute y(t2) from knowledge of y(t1). Similarly
we can compute y(t3 ) from knowledge of y(t2 ), y(t4 ) from knowledge of y(t3 ), and so on, eventually computing
y(tn ) = y(t̃). With this in mind, we rewrite the integral representation in terms of ti and ti+1 instead of t0 and t̃:
\[ y(t_{i+1}) = y(t_i) + \int_{t_i}^{t_{i+1}} f(t, y)\,dt. \tag{6.3.2} \]
This formula suggests that finding one approximation, y(ti+1 ), from the previous, y(ti ), boils down to approximating
\(\int_{t_i}^{t_{i+1}} f(t,y)\,dt\). That should not be too challenging at this point. About half of chapter 4 is dedicated to exactly
this task! Every numerical integration formula is a candidate for use here, but let’s start simple. We know y(ti ),
the value of the function at the left endpoint of integration, at least approximately, so it makes sense to use a stencil
that includes the left endpoint of integration as one of the nodes. And to make our first stab as easy as possible,
let’s let that node be the only one! That is, let’s find an integration formula for the stencil
Using the method of undetermined coefficients, we calculate the left hand side of system 4.2.4 (which for us will
only be one equation since we only have one node):
\[ \int_a^b p_0(x)\,dx = \int_{x_0}^{x_0+h} p_0(x)\,dx = \int_{x_0}^{x_0+h} 1\,dx = (x - x_0)\Big|_{x_0}^{x_0+h} = h. \]
Adopting the notation yi = y(ti) and f = ẏ from section 6.2, this formula becomes y_{i+1} = y_i + h·f(t_i, y_i).
Wait a minute! We’ve seen this before. This is exactly equation 6.2.2.
The search for new methods of approximating solutions of o.d.e.s by integrating has not yielded anything new
yet. It has to be different, however. Integration formulas include evaluation of the integrand at various points
while Taylor methods involve evaluation of derivatives at a single point. Let’s push on. Perhaps the next simplest
integration formula that includes the left endpoint of integration is the trapezoidal rule (see section 4.3),
\[ \int_{x_0}^{x_0+h} f(x)\,dx = \frac{h}{2}\left[f(x_0) + f(x_0 + h)\right] + O(h^3 f''(\xi_h)) \]
\[ y_{i+1} = y_i + \frac{t_{i+1} - t_i}{2}\left[f(t_i, y_i) + f(t_{i+1}, y_{i+1})\right]. \]
This equation is great except the right hand side includes yi+1 , the quantity we are trying to approximate! One
theory is to leave it at that. The equation for yi+1 is implicit in nature and that’s alright. Some root finding
method could be used to determine yi+1 for each step of the method. While this path is not impossible, it is also
not the simplest solution. Since the step size (ti+1 − ti ) is likely to be small, perhaps using Euler’s method to
approximate yi+1 on the right side will not cause irreparable harm to the overall approximation. Giving it a shot,
we let yi+1 = yi + (ti+1 − ti ) · f (ti , yi ) on the right hand side to get the new formula
\[ y_{i+1} = y_i + \frac{t_{i+1} - t_i}{2}\left[f(t_i, y_i) + f(t_{i+1},\, y_i + (t_{i+1} - t_i)\cdot f(t_i, y_i))\right]. \]
Pausing for a moment to consider what we have, we might conclude the formula is getting a little unwieldy. Let’s
see if we can tidy it up a bit. First, substituting h for ti+1 − ti makes it a little nicer:
\[ y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f(t_{i+1},\, y_i + h\cdot f(t_i, y_i))\right]. \]
Second, letting k1 = f (ti , yi ) and k2 = f (ti+1 , yi + h · f (ti , yi )) = f (ti+1 , yi + h · k1 ), we get a nice, neat, three-step
computation:
k1 = f(ti, yi)
k2 = f(ti+1, yi + h·k1)
\[ y_{i+1} = y_i + \frac{h}{2}(k_1 + k_2). \tag{6.3.3} \]
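A minimal Octave sketch of formula (6.3.3) is given below; the name trapode and the interface are illustrative, not taken from the text.

    function y = trapode(f, t0, y0, t1, N)
      % Sketch of the trapezoidal-rule-based o.d.e. solver, formula (6.3.3).
      h = (t1 - t0)/N;
      t = t0; y = y0;
      for j = 1:N
        k1 = f(t, y);
        k2 = f(t + h, y + h*k1);
        y = y + h/2*(k1 + k2);
        t = t + h;
      end%for
    end%function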
But before getting too carried away with the clean formulation, it would be nice to have some evidence that this
“advanced” method gives a reasonable approximation of the solution to an o.d.e. as expected. Let’s have Octave
compute approximate solutions of o.d.e. 6.2.1 using both Euler’s method and this method based on the trapezoidal
rule, and compare them to the exact solution, y(t) = t³/4 + 16/t.
This test code may be downloaded at the companion website (rungeKuttaDemo.m). The only part of this code that
may appear unfamiliar to you at this point is the sprintf() command. The first argument,
’%12.5g%12.5g%12.5g%12.5g%12.5g’,
is the formatting string. This particular string means to string together 5 floating point numbers using 12 spaces
each and displaying 5 significant digits. In the sprintf command, %12.5g means “general” formatting of a floating
point number with 12 spaces and 5 significant figures. The computer will decide whether to use scientific notation
in the output. Since it is repeated 5 times, this particular command will format five such floating point values.
The rest of the arguments are the five numbers to print. The command sprintf should not be read as “sprint-eff”
but rather “ess-print-eff” or “string print formatted”. The s is for string and the f is for formatted. If you’re
thinking this command seems a bit arcane, you’re right. This type of print formatting command originated in the
C programming language during the 1970s!1 The output of running this Octave code is
Our method based on the trapezoidal rule, which we will call trapezoidal-ode for now, seems to do a better job
of approximating the solution of this o.d.e. than does Euler’s method. The last two columns contain the absolute
errors for each approximation. The errors in trapezoidal-ode are roughly 0.01 to 0.1 while the errors for Euler's
method are roughly 0.2 to 2. All of the errors in trapezoidal-ode are smaller than all the errors in Euler’s method.
Of course trapezoidal-ode requires two evaluations of f per step, so it better deliver better results for the extra
work if it is to be useful at all.
Buoyed by this success, perhaps it is worth investing some time in other integration formulas, like Simpson’s
rule, for example. Recall from section 4.3, Simpson’s rule states
\[ \int_{x_0}^{x_0+2h} f(x)\,dx = \frac{h}{3}\left[f(x_0) + 4f(x_0 + h) + f(x_0 + 2h)\right] + O(h^5 f^{(4)}(\xi_h)), \]
1 See https://en.wikipedia.org/wiki/Printf_format_string for some details.
ignoring the error term, and using the notation ti+1/2 to mean ti + h/2 and yi+1/2 to mean y(ti + h/2). So an o.d.e.
solver based on Simpson’s rule might look like
\[ y_{i+1} = y_i + \frac{h}{6}\left[f(t_i, y_i) + 4f(t_{i+1/2}, y_{i+1/2}) + f(t_{i+1}, y_{i+1})\right]. \]
Again, this is an implicit formula. Again, we can use Euler’s method to estimate yi+1 , and, in fact, we can use
Euler’s method to estimate yi+1/2 too! Since ti+1/2 is closer to ti than is ti+1 , we estimate yi+1/2 first. That is, we
replace yi+1/2 by yi + (h/2)·f(ti, yi). Using a multiple-step calculation as before, that gives us
k1 = f(ti, yi)
k2 = f(ti + h/2, yi + (h/2)k1)
so far. This takes care of the first two terms in brackets. Now we estimate yi+1 by approximating f (ti+1 , yi+1 ).
But we now have an estimate of f at ti + h/2, and ti + h/2 is closer to ti+1 than is ti. So, even though we could use
yi + hf (ti , yi ) = yi + hk1 to approximate yi+1 (as done before), we might expect yi + hk2 to be a better estimate.
With this hope in hand, we complete the method by calculating as follows:
k1 = f(ti, yi)
k2 = f(ti + h/2, yi + (h/2)k1)
k3 = f(ti+1, yi + h·k2)
yi+1 = yi + (h/6)[k1 + 4k2 + k3].
For now, we will refer to this method as Simpson’s-ode.
Before trying to assess whether this new method is better than the previous ones, let’s derive a couple more,
and compare them all together. The formula
\[ \int_{x_0}^{x_0+3h} f(x)\,dx = \frac{3h}{2}\left[f(x_0 + h) + f(x_0 + 2h)\right] + O(h^3 f''(\xi_h)) \]
(an open Newton-Cotes formula from section 4.3) leads to the method
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + 2h/3, yi + (2h/3)k2)
yi+1 = yi + (h/2)[k2 + k3].
Can you fill in the steps to derive this method? Answer on page 213. We will call this method open-ode. Finally,
we use the stencil
to derive yet another integration formula. This is not an open Newton-Cotes formula nor is it a closed Newton-Cotes
formula. It is not one that was covered in section 4.3. Perhaps it might be called a “clopen” (half closed and half
open) Newton-Cotes formula. Can you derive the corresponding integration method? Details on page 214. The
result is
\[ \int_{x_0}^{x_0+3h} f(x)\,dx \approx \frac{3h}{4}\left[f(x_0) + 3f(x_0 + 2h)\right], \]
which leads to the method
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + 2h/3, yi + (2h/3)k2)
yi+1 = yi + (h/4)[k1 + 3k3].
We will call this method clopen-ode. Notice two things. First, even though k2 is not used in the final line, it is still
computed since it is used to compute k3 . Second, the calculations of k1 , k2 , and k3 are identical to those in the
open-ode method. The only difference is how the kj are combined. The integration methods combine the values of
the function at the nodes differently. This idea of using the same kj for different purposes will come up again!
So now we have three new methods to test out—one based on Simpson’s rule (Simpson’s-ode), one based on an
open Newton-Cotes formula (open-ode), and a third based on a “clopen” Newton-Cotes formula (clopen-ode). Can
you write test code for comparing the three new formulas (similar to the code used to compare Euler’s method with
trapezoidal-ode)? Answer on page 215. Results are summarized in the following Octave output:
Simpson’s-ode does the poorest job of finding an approximate solution and clopen-ode does the best. But why?
We’ve done a pretty thorough job of sweeping error analysis under the rug up until now. The bulk of that
investigation will happen in the next section, but we can do a quick analysis here. From section 4.3, we know
that the trapezoidal rule and the open Newton-Cotes formula we used here both have error terms of O(h3 ), while
Simpson’s rule has error term O(h5 ). The integration methods based on the stencils
(which led to Euler’s method and the clopen method) have yet undetermined error terms. Can you show that
their error terms are O(h2 ) and O(h4 ), respectively? Answer on page 215. Based on the error terms of the
underlying integration methods, we should expect these o.d.e. solvers to be, in order from least accurate to most
accurate, Euler’s method (based on a O(h2 ) integration formula), open-ode (based on a O(h3 ) integration formula),
clopen-ode (based on a O(h4 ) integration formula), and Simpson’s-ode (based on a O(h5 ) integration formula); with
trapezoidal-ode to be on par with open-ode. Table 6.3 shows the errors in calculating y(2) for 6.2.1 for the five
methods of this section using various values of h. Since the value of h in each row is half that of the previous row,
we would expect the ratio of the errors in consecutive rows to be approximately (1/2)^ℓ where the rate of convergence for the method is O(h^ℓ). For Euler's method, dividing the error in row 3 by that of row 2, we get (1/2)^ℓ ≈ .55114/1.0809 ≈ 1/2, and dividing the error in row 6 by that in row 5, we get (1/2)^ℓ ≈ .07013/.1399 ≈ 1/2, for example. This evidence suggests that ℓ = 1 for Euler's method, and therefore, Euler's method has an O(h) convergence. Repeating the same calculation
for the other methods yields Table 6.4.
With the exception of Simpson’s-ode, Table 6.4 suggests that o.d.e. solvers have an error term of one less degree
than their underlying (single step) integration formula. In section 4.4 we noted that composite integration formulas
also have error terms of one less degree than their corresponding single-step integration formulas (and we made a
similar observation about Taylor methods in section 6.2). There is reason to believe in this parallel as the methods proposed in this section are essentially composite integration techniques. So, it should be a little troubling that Simpson's-ode does not fit the pattern. A deeper exploration of the error term is needed to explain this anomaly.

Table 6.4: The error terms of five o.d.e. solvers and their underlying integration methods

                        Euler's   Trap-ode   Open-ode   Clopen-ode   Simpson's-ode
    Integration method  O(h^2)    O(h^3)     O(h^3)     O(h^4)       O(h^5)
    O.D.E. solver       O(h)      O(h^2)     O(h^2)     O(h^3)       O(h^2)
Exercises
1. Derive an o.d.e. solver based on the stencil and corresponding integration formula.
   (a) (h/4)[f(x0) + 3f(x0 + (2/3)h)] + O(h^4) [S]
   (b) h·f(x0 + (1/2)h) + O(h^3) [A]
   (c) (h/2)[3f(x0 + (1/3)h) − f(x0)] + O(h^3) [A]
   (d) h·f(x0 + (1/3)h) + O(h^2)
   (e) (h/4)[3f(x0 + (1/3)h) + f(x0 + h)] + O(h^4) [S]
   (f) h·f(x0 + (2/3)h) + O(h^2)
   (g) (h/2)[3f(x0 + (1/3)h) − 4f(x0 + (1/2)h) + 3f(x0 + (2/3)h)] + O(h^5) [A]
   (h) (h/4)[3f(x0 + (1/3)h) + f(x0 + h)] + O(h^4)
   (i) (h/2)[f(x0 + ((√3 − 1)/(2√3))h) + f(x0 + ((√3 + 1)/(2√3))h)] + O(h^5)
   (j) (h/18)[5f(x0 + ((√5 − √3)/(2√5))h) + 8f(x0 + (1/2)h) + 5f(x0 + ((√5 + √3)/(2√5))h)] + O(h^7) [A]
2. Conduct a numerical experiment on test o.d.e. 6.2.1 to determine the rate of convergence of the method derived in
question 1. Based on the error term of the integration formula, is the rate of convergence of the o.d.e. solver as
expected?
[A]
3. Write an Octave function that implements Euler’s method.
6. Write an Octave function that implements the solver you derived in exercise 1b. This is called the midpoint method
or the modified Euler method. It is based on the midpoint rule for integration. [A]
7. Write an Octave function that implements the solver you derived in exercise 1a. This is called Ralston’s method.
[A]
8. Use your code from exercise 3 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]
9. Use your code from exercise 4 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]
10. Use your code from exercise 5 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]
11. Use your code from exercise 6 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]
12. Use your code from exercise 7 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]
Answers
Filling in the gaps: Beginning with the integration formula
\[ \int_{x_0}^{x_0+3h} f(x)\,dx = \frac{3h}{2}\left[f(x_0 + h) + f(x_0 + 2h)\right] + O(h^3 f''(\xi_h)), \]
we “shrink” the interval of integration to [x0, x0 + s] by making the substitution s = 3h:
\[ \int_{x_0}^{x_0+s} f(x)\,dx = \frac{s}{2}\left[f\!\left(x_0 + \tfrac{1}{3}s\right) + f\!\left(x_0 + \tfrac{2}{3}s\right)\right] + O(s^3 f''(\xi_s)). \]
With the integration formula rephrased in terms of step size s, the o.d.e. solving method is
\[ y_{i+1} = y_i + \frac{h}{2}\left[f(t_{i+1/3}, y_{i+1/3}) + f(t_{i+2/3}, y_{i+2/3})\right], \]
where we revert to using h for step size. We then use Euler’s method to estimate yi+1/3 and yi+2/3 , starting
with yi+1/3. That is, we replace yi+1/3 by yi + (h/3)f(ti, yi). Then we estimate yi+2/3. Using a multiple-step calculation as before, that gives us
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1),
taking care of the first term in brackets. It remains to estimate f(ti+2/3, yi+2/3). But we now have an estimate of f (the derivative of y) at ti + h/3, and ti + h/3 is closer to ti+2/3 than is ti. So, we approximate yi+2/3 by yi + (2/3)h·k2:
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + 2h/3, yi + (2h/3)k2)
yi+1 = yi + (h/2)[k2 + k3].
Clopen Newton-Cotes:
For this stencil, a = x0 , b = x0 + 3h, and θi = ih, i = 0, 1, 2. Therefore, we will have a system of three
equations in the three unknowns. First, the left-hand sides:
\[ \int_a^b p_0(x)\,dx = \int_{x_0}^{x_0+3h} p_0(x)\,dx = \int_{x_0}^{x_0+3h} 1\,dx = (x - x_0)\Big|_{x_0}^{x_0+3h} = 3h \]
\[ \int_a^b p_1(x)\,dx = \int_{x_0}^{x_0+3h} p_1(x)\,dx = \int_{x_0}^{x_0+3h} (x - x_0)\,dx = \frac{1}{2}(x - x_0)^2\Big|_{x_0}^{x_0+3h} = \frac{9}{2}h^2 \]
\[ \int_a^b p_2(x)\,dx = \int_{x_0}^{x_0+3h} p_2(x)\,dx = \int_{x_0}^{x_0+3h} (x - x_0)^2\,dx = \frac{1}{3}(x - x_0)^3\Big|_{x_0}^{x_0+3h} = 9h^3 \]
Now putting them together with the right-hand sides (and swapping sides):
\[ \sum_{i=0}^{2} \theta_i^0 a_i = a_0 + a_1 + a_2 = 3h \]
\[ \sum_{i=0}^{2} \theta_i^1 a_i = h a_1 + 2h a_2 = \frac{9}{2}h^2 \]
\[ \sum_{i=0}^{2} \theta_i^2 a_i = h^2 a_1 + 4h^2 a_2 = 9h^3 \]
This system is small enough to solve by hand (without the use of a computer algebra system):
Subtracting the second equation (multiplied by h) from the third,
\[ \bigl(h^2 a_1 + 4h^2 a_2\bigr) - \bigl(h^2 a_1 + 2h^2 a_2\bigr) = 9h^3 - \tfrac{9}{2}h^3 \;\Longrightarrow\; 2h^2 a_2 = \tfrac{9}{2}h^3 \;\Longrightarrow\; a_2 = \tfrac{9}{4}h. \]
t=4;                        % start at t=4 (initial condition y(4)=20)
h=-1/4;                     % step size (negative: we step from t=4 down to t=2)
f=inline("-y/t+t^2");       % right-hand side of o.d.e. 6.2.1
exact=inline("t^3/4+16/t"); % exact solution, used to compute errors
simp=20;                    % Simpson's-ode approximation
open=20;                    % open-ode approximation
clop=20;                    % clopen-ode approximation
disp(’ Simpsons Open Clopen Simp err Open err Clop err’)
disp(’ ------------------------------------------------------------’)
for i=1:8                   % 8 steps of size 1/4 take t from 4 down to 2
k1simp=f(t,simp);
k1open=f(t,open);
k1clop=f(t,clop);
k2simp=f(t+h/2,simp+h/2*k1simp);
k2open=f(t+h/3,open+h/3*k1open);
k2clop=f(t+h/3,clop+h/3*k1clop);
k3simp=f(t+h,simp+h*k2simp);
k3open=f(t+2*h/3,open+2*h/3*k2open);
k3clop=f(t+2*h/3,clop+2*h/3*k2clop);
simp=simp+h/6*(k1simp+4*k2simp+k3simp);
open=open+h/2*(k2open+k3open);
clop=clop+h/4*(k1clop+3*k3clop);
t=t+h;
x=exact(t);
sierr=abs(simp-x);
operr=abs(open-x);
clerr=abs(clop-x);
sprintf(’%12.5g%12.5g%12.5g%12.5g%12.5g%12.5g’,simp,open,clop,sierr,operr,clerr)
end%for
is derived in the section 4.3 solutions. See page 273. The error term for
\[ \int_{x_0}^{x_0+h} f(x)\,dx \approx h f(x_0) \]
is derived similarly. We are given that the error is O(h2 ), so we can skip the discovery. Expanding f (x) in a
Taylor polynomial with error term, f(x) = f(x0) + (x − x0)f′(ξx). So
\[
\begin{aligned}
\int_{x_0}^{x_0+h} f(x)\,dx - hf(x_0) &= \int_{x_0}^{x_0+h} \bigl(f(x_0) + (x - x_0)f'(\xi_x)\bigr)\,dx - hf(x_0) \\
&= x f(x_0)\Big|_{x_0}^{x_0+h} + \int_{x_0}^{x_0+h} (x - x_0)f'(\xi_x)\,dx - hf(x_0) \\
&= hf(x_0) + \int_{x_0}^{x_0+h} (x - x_0)f'(\xi_x)\,dx - hf(x_0) \\
&= \int_{x_0}^{x_0+h} (x - x_0)f'(\xi_x)\,dx.
\end{aligned}
\]
By the weighted mean value theorem, there exists c ∈ (x0, x0 + h) such that \(\int_{x_0}^{x_0+h}(x - x_0)f'(\xi_x)\,dx = f'(c)\int_{x_0}^{x_0+h}(x - x_0)\,dx = \frac{1}{2}f'(c)h^2\). Hence
\[ \int_{x_0}^{x_0+h} f(x)\,dx - hf(x_0) = \frac{1}{2}f'(c)h^2 = O\bigl(h^2 f'(\xi_h)\bigr). \]
\[ y(t_0 + h) = y(t_0) + h\dot y(t_0) + \frac{1}{2}h^2\ddot y(t_0) + \frac{1}{6}h^3\dddot y(t_0) + \cdots. \]
Each derivative of y can be replaced by some function of f and its partial derivatives, starting with ẏ, which is
given by the o.d.e. we are trying to solve.
\[ \dot y = f(t, y) \]
\[ \ddot y = \frac{d}{dt}\dot y = \frac{d}{dt}f(t, y) = f_t(t, y) + f_y(t, y)\dot y = f_t(t, y) + f_y(t, y)\cdot f(t, y) \]
\[ \vdots \]
In more compact notation,
\[ \dot y = f \]
\[ \ddot y = f_t + f_y f \]
\[ \dddot y = f_{tt} + f_{ty}f + (f_{yt} + f_{yy}f)f + f_y(f_t + f_y f) = f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f \]
\[ \vdots \]
so \(y(t_0 + h) = y(t_0) + h\dot y(t_0) + \frac{1}{2}h^2\ddot y(t_0) + \frac{1}{6}h^3\dddot y(t_0) + \cdots\) in terms of f is
\[ y(t_0 + h) = y(t_0) + hf + \frac{1}{2}h^2(f_t + f_y f) + \frac{1}{6}h^3(f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f) + \cdots, \]
and as an o.d.e. solver (replacing y(t0 ) by yi and y(t0 + h) by yi+1 ),
\[ y_{i+1} = y_i + hf + \frac{1}{2}h^2(f_t + f_y f) + \frac{1}{6}h^3(f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f) + \cdots. \tag{6.4.1} \]
Rewriting high degree Taylor polynomials in terms of f quickly becomes complicated. We will focus on analysis requiring only ẏ, ÿ, and \(\dddot y\).
The o.d.e. solvers of section 6.3 have the form
k1 = f (ti , yi )
k2 = f (ti + β2 h, yi + β2 hk1 )
k3 = f (ti + β3 h, yi + β3 hk2 )
⋮
ks = f (ti + βs h, yi + βs hks−1 )
yi+1 = yi + h [α1 k1 + α2 k2 + α3 k3 + · · · + αs ks ] . (6.4.2)
We did not actually see any o.d.e. solvers with s > 3 in section 6.3, but the process we followed would clearly
require it should there be more than three nodes in the underlying integration formula.
The difference between y(t0 + h) from (6.4.1) and yi+1 from (6.4.2) is the local truncation error of the o.d.e.
solver (the error in taking a single step). In order to write this truncation error in the form O(h` ), though, we need
to expand each kj in its Taylor polynomial. Taylor’s theorem in two variables is needed.
Theorem 8. Suppose f (t, y) and all its partial derivatives of order n + 1 and lower are continuous on the rectangle
D = {(t, y) : a ≤ t ≤ b, c ≤ y ≤ d}, and let (t0 , y0 ) ∈ D. Then for every (t, y) ∈ D, there exist ξ ∈ (a, b) and
µ ∈ (c, d) such that
\[
\begin{aligned}
f(t, y) ={}& f(t_0, y_0) + \left[(t - t_0)f_t(t_0, y_0) + (y - y_0)f_y(t_0, y_0)\right] \\
&+ \frac{1}{2}\left[(t - t_0)^2 f_{tt}(t_0, y_0) + 2(t - t_0)(y - y_0)f_{ty}(t_0, y_0) + (y - y_0)^2 f_{yy}(t_0, y_0)\right] \\
&+ \cdots + \frac{1}{n!}\sum_{j=0}^{n}\binom{n}{j}(t - t_0)^{n-j}(y - y_0)^{j}\,\frac{\partial^n f}{\partial t^{n-j}\partial y^{j}}(t_0, y_0) \\
&+ \frac{1}{(n+1)!}\sum_{j=0}^{n+1}\binom{n+1}{j}(t - t_0)^{n+1-j}(y - y_0)^{j}\,\frac{\partial^{n+1} f}{\partial t^{n+1-j}\partial y^{j}}(\xi, \mu).
\end{aligned}
\]
As with Taylor’s theorem (of one variable), the first n + 1 terms form the Taylor polynomial and the last term is
the remainder term.
To illustrate, we let f(t, y) = −y/t + t² and compute its second Taylor polynomial with remainder term expanded
about (t0 , y0 ) = (1, 1). For this, we will need all partial derivatives of f up to and including order 3.
\[
\begin{aligned}
f_t &= \frac{y}{t^2} + 2t & f_y &= -\frac{1}{t} \\
f_{tt} &= -\frac{2y}{t^3} + 2 & f_{ty} = f_{yt} &= \frac{1}{t^2} \\
f_{yy} &= 0 & f_{ttt} &= \frac{6y}{t^4} \\
f_{tty} = f_{tyt} = f_{ytt} &= -\frac{2}{t^3} & f_{tyy} = f_{yty} = f_{yyt} &= 0 \\
f_{yyy} &= 0.
\end{aligned}
\]
It follows that
f (1, 1) = 0
ft (1, 1) = 3
fy (1, 1) = −1
ftt (1, 1) = 0
fty (1, 1) = 1
fyy (1, 1) = 0
fttt (ξ, µ) = 6µ/ξ⁴
ftty (ξ, µ) = −2/ξ³
ftyy (ξ, µ) = 0
fyyy (ξ, µ) = 0.
Therefore, the second Taylor polynomial for f (t, y) is
\[
\begin{aligned}
T_2(t, y) &= f(1,1) + \left[(t-1)f_t(1,1) + (y-1)f_y(1,1)\right] + \frac{1}{2}\left[(t-1)^2 f_{tt}(1,1) + 2(t-1)(y-1)f_{ty}(1,1) + (y-1)^2 f_{yy}(1,1)\right] \\
&= 0 + 3(t-1) - (y-1) + 0\,(t-1)^2 + (t-1)(y-1) + 0\,(y-1)^2 \\
&= 3(t-1) - (y-1) + (t-1)(y-1).
\end{aligned}
\]
More generally, suppose we are interested in Taylor polynomial expansions of expressions like f (ti + βj h, yi +
βj hkj−1 ), as we have in our o.d.e. solvers. Expanding about (ti , yi ), we let t0 = ti , y0 = yi , t = ti + βj h, and
y = yi + βj hkj−1 . Thus t − t0 = βj h and y − y0 = βj hkj−1 , and the second Taylor polynomial without explicit
listing of the arguments ti and yi on the right-hand side is
1
\[ f(t_i + \beta_j h,\, y_i + \beta_j h k_{j-1}) = f + h\beta_j\left[f_t + k_{j-1}f_y\right] + \frac{1}{2}h^2\beta_j^2\left[f_{tt} + 2k_{j-1}f_{ty} + k_{j-1}^2 f_{yy}\right] \]
with remainder term O(h³).
In particular, when we set j = 1, βj = β1 = 0, we get
k1 = f (ti , yi ) = f.
When we set j = 2,
\[ k_2 = f(t_i + \beta_2 h,\, y_i + \beta_2 h k_1) = f + h\beta_2[f_t + f f_y] + \frac{1}{2}h^2\beta_2^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right] + O(h^3). \]
The calculation of k3 is a little bit messier since it involves k2². Before diving in headlong, though, consider what we will do with k3 first. After computing k1, k2, and k3, we will substitute each into the formula
\[ y_{i+1} = y_i + h\left[\alpha_1 k_1 + \alpha_2 k_2 + \alpha_3 k_3\right] \tag{6.4.3} \]
and subtract the result from (6.4.1).
O(h4 ). Therefore, we need only retain constant terms and terms containing a factor of h3 , h2 , or h in equation
(6.4.3). Terms with higher powers of h are irrelevant. They will be assumed (or should I say consumed?) by the
O(h4 ). Since the sum α1 k1 + α2 k2 + α3 k3 is multiplied by h, we need only retain terms with factors of up to h2 in
k1 , k2 , and k3 . Taking a look at the expansion of k3 :
\[ k_3 = f(t_i + \beta_3 h,\, y_i + \beta_3 h k_2) = f + h\beta_3[f_t + k_2 f_y] + \frac{1}{2}h^2\beta_3^2\left[f_{tt} + 2k_2 f_{ty} + k_2^2 f_{yy}\right], \]
we see only the term \(\frac{1}{2}h^2\beta_3^2\,k_2^2 f_{yy}\) contains k2², and it already has a factor of h². Consequently, we only need to include the constant term of k2². The rest of the terms of k2² become part of the O(h⁴). That's not so bad!
\[ k_2^2 = f^2 + O(h). \]
Similarly, when we substitute expressions for k2 into k3 , we will be careful to avoid any terms that would give a
factor of h to any power greater than 2:
of the o.d.e. in (6.4.1). The difference of the two is the local truncation error, so we will be interested in the least
power of h that remains after subtraction. Copying the two equations here for convenience, we are subtracting
\[ y_{i+1} = y_i + hf + \frac{1}{2}h^2(f_t + f_y f) + \frac{1}{6}h^3(f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f) + O(h^4) \]
from
\[
\begin{aligned}
y_{i+1} &= y_i + h\left[\alpha_1 k_1 + \alpha_2 k_2 + \alpha_3 k_3\right] = y_i + h\alpha_1 k_1 + h\alpha_2 k_2 + h\alpha_3 k_3 \\
&= y_i + h\alpha_1 f \\
&\quad + h\alpha_2\left(f + h\beta_2[f_t + f f_y] + \tfrac{1}{2}h^2\beta_2^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right] + O(h^3)\right) \\
&\quad + h\alpha_3\left(f + h\beta_3 f_t + h\beta_3 f f_y + h^2\beta_2\beta_3(f_t f_y + f f_y^2) + \tfrac{1}{2}h^2\beta_3^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right] + O(h^3)\right).
\end{aligned}
\]
The constant term (term containing no factor of h) for each equation is simply yi, so no constant will remain after subtraction. The difference of the terms involving h is hf − (hα1f + hα2f + hα3f) = hf(1 − (α1 + α2 + α3)), so if there is to be no h left in the difference, we must have
\[ \alpha_1 + \alpha_2 + \alpha_3 = 1. \]
Likewise, matching the h² terms requires
\[ \alpha_2\beta_2 + \alpha_3\beta_3 = \frac{1}{2}. \]
Similarly, we consider the differences of the rest of the terms to get the following conditions on the αj and βj .
We have considered all 8 different terms, but have only arrived at 4 distinct conditions:
\[
\begin{aligned}
\alpha_1 + \alpha_2 + \alpha_3 &= 1 \\
\alpha_2\beta_2 + \alpha_3\beta_3 &= \tfrac{1}{2} \\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 &= \tfrac{1}{3} \\
\alpha_3\beta_2\beta_3 &= \tfrac{1}{6}.
\end{aligned} \tag{6.4.4}
\]
Since we have 5 variables and only 4 conditions, we should think that there are multiple o.d.e. solvers of the form
(6.4.2) with s = 3 and local truncation error O(h4 ).
Evidence from section 6.3 suggests that clopen-ode should have local truncation error O(h4 ). Let’s check. For
that method, we have
\[ \alpha_1 = \frac{1}{4}, \quad \alpha_2 = 0, \quad \alpha_3 = \frac{3}{4}, \qquad \beta_2 = \frac{1}{3}, \quad \beta_3 = \frac{2}{3}, \]
so
\[
\begin{aligned}
\alpha_1 + \alpha_2 + \alpha_3 &= \tfrac{1}{4} + 0 + \tfrac{3}{4} = 1 \\
\alpha_2\beta_2 + \alpha_3\beta_3 &= 0\cdot\tfrac{1}{3} + \tfrac{3}{4}\cdot\tfrac{2}{3} = \tfrac{1}{2} \\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 &= 0 + \tfrac{3}{4}\left(\tfrac{2}{3}\right)^2 = \tfrac{1}{3} \\
\alpha_3\beta_2\beta_3 &= \tfrac{3}{4}\cdot\tfrac{1}{3}\cdot\tfrac{2}{3} = \tfrac{1}{6}.
\end{aligned}
\]
Indeed, clopen-ode satisfies all the conditions of an o.d.e. solver with local truncation error (at least) O(h4 ). We
would actually have to show that at least one term containing an h4 remains in the difference to prove that the
local truncation error is not of greater degree.
Before finally answering the question of what happened to Simpson’s-ode, our hard work so far is sufficient
to check that trapezoidal-ode and open-ode have local truncation error O(h3 ) and that Euler’s method has local
truncation error O(h²). For trapezoidal-ode, we have α1 = 1/2, α2 = 1/2, α3 = 0, β2 = 1, and β3 undefined (we may assign any particular number we choose since having α3 = 0 makes β3 irrelevant to the method), which gives us
\[
\begin{aligned}
\alpha_1 + \alpha_2 + \alpha_3 &= \tfrac{1}{2} + \tfrac{1}{2} + 0 = 1 \\
\alpha_2\beta_2 + \alpha_3\beta_3 &= \tfrac{1}{2}\cdot 1 + 0 = \tfrac{1}{2} \\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 &= \tfrac{1}{2}\cdot 1^2 + 0 = \tfrac{1}{2} \ne \tfrac{1}{3} \\
\alpha_3\beta_2\beta_3 &= 0 \ne \tfrac{1}{6}.
\end{aligned}
\]
The first two conditions are satisfied, but the last two are not. Recall, though, that the first two conditions were
derived from the h and h2 terms while the last two conditions were derived from the h3 terms. So, for trapezoidal-
ode, the local truncation error is O(h3 ).
For Euler’s method, we have α1 = 1, α2 = α3 = 0, and β2 and β3 undefined (or whatever we choose), which
gives us
\[
\begin{aligned}
\alpha_1 + \alpha_2 + \alpha_3 &= 1 + 0 + 0 = 1 \\
\alpha_2\beta_2 + \alpha_3\beta_3 &= 0 + 0 = 0 \ne \tfrac{1}{2} \\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 &= 0 + 0 = 0 \ne \tfrac{1}{3} \\
\alpha_3\beta_2\beta_3 &= 0 \ne \tfrac{1}{6}.
\end{aligned}
\]
The second equation, which was derived from terms involving h2 , is not satisfied but the first equation, which was
derived from terms involving h, is, so the local truncation error for Euler’s method is O(h2 ).
Finally, for Simpson's-ode, we have $\alpha_1 = \frac{1}{6}$, $\alpha_2 = \frac{2}{3}$, $\alpha_3 = \frac{1}{6}$, $\beta_2 = \frac{1}{2}$, and $\beta_3 = 1$, which gives us
\begin{align*}
\alpha_1 + \alpha_2 + \alpha_3 &= \frac{1}{6} + \frac{2}{3} + \frac{1}{6} = 1\\
\alpha_2\beta_2 + \alpha_3\beta_3 &= \frac{2}{3}\cdot\frac{1}{2} + \frac{1}{6}\cdot 1 = \frac{1}{2}\\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 &= \frac{2}{3}\left(\frac{1}{2}\right)^2 + \frac{1}{6}(1)^2 = \frac{1}{3}\\
\alpha_3\beta_2\beta_3 &= \frac{1}{6}\cdot\frac{1}{2}\cdot 1 = \frac{1}{12} \ne \frac{1}{6}.
\end{align*}
The first two equations are satisfied, so the local truncation error is (at least) $O(h^3)$, but the last equation is not satisfied, so the local truncation error is no more than $O(h^3)$. No terms containing factors of $h$ or $h^2$ (that don't also contain higher powers of $h$) appear in the local truncation error, but the term involving $f_t f_y + f f_y^2$ does not cancel: $\frac{1}{6}h^3(f_t f_y + f f_y^2) - h^3\alpha_3\beta_2\beta_3(f_t f_y + f f_y^2) = \frac{1}{12}h^3(f_t f_y + f f_y^2)$ remains, so the local truncation error is $O(h^3)$.
To derive any Runge-Kutta method of order 4, the stages of the computation must be expanded in a third degree Taylor polynomial,
\begin{align*}
f(t_i + \beta_j h,\, y_i + \beta_j h k_{j-1}) &= f + h\beta_j\left[f_t + k_{j-1}f_y\right] + \frac{1}{2}h^2\beta_j^2\left[f_{tt} + 2k_{j-1}f_{ty} + k_{j-1}^2 f_{yy}\right]\\
&\quad + \frac{1}{6}h^3\beta_j^3\left[f_{ttt} + 3k_{j-1}f_{tty} + 3k_{j-1}^2 f_{tyy} + k_{j-1}^3 f_{yyy}\right] + O(h^4),
\end{align*}
and the expansion of the solution of the o.d.e. must be carried one term further,
\[ y(t_0 + h) = y(t_0) + h\dot y(t_0) + \frac{1}{2}h^2\ddot y(t_0) + \frac{1}{6}h^3\dddot y(t_0) + \frac{1}{24}h^4\ddddot y(t_0) + O(h^5). \]
But $\ddddot y$, in terms of $f$, is
\begin{align*}
\ddddot y = \frac{d}{dt}\dddot y &= \frac{d}{dt}\left(f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f\right)\\
&= f_{yyy}f^3 + 3f_{tyy}f^2 + 4f_y f_{yy}f^2 + 3f_{tty}f + 5f_{ty}f_y f + f_y^3 f\\
&\quad + 3f_t f_{yy}f + f_t f_y^2 + f_{tt}f_y + f_{ttt} + 3f_t f_{ty},
\end{align*}
so
\begin{align*}
y_{i+1} &= y_i + hf + \frac{1}{2}h^2(f_t + f_y f) + \frac{1}{6}h^3\left(f_{tt} + 2f_{ty}f + f_{yy}f^2 + f_t f_y + f_y^2 f\right)\\
&\quad + \frac{1}{24}h^4\left(f_{yyy}f^3 + 3f_{tyy}f^2 + 4f_y f_{yy}f^2 + 3f_{tty}f + 5f_{ty}f_y f + f_y^3 f + 3f_t f_{yy}f + f_t f_y^2 + f_{tt}f_y + f_{ttt} + 3f_t f_{ty}\right) + O(h^5).
\end{align*}
Furthermore,
\[ k_1 = f(t_i, y_i) = f \]
and
\begin{align*}
k_2 &= f(t_i + \beta_2 h,\, y_i + \beta_2 h k_1)\\
&= f + h\beta_2\left[f_t + f f_y\right] + \frac{1}{2}h^2\beta_2^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right] + \frac{1}{6}h^3\beta_2^3\left[f_{ttt} + 3f f_{tty} + 3f^2 f_{tyy} + f^3 f_{yyy}\right] + O(h^4).
\end{align*}
Substituting this expression for $k_2$ into $k_3$ and keeping only terms up to $h^3$,
\begin{align*}
k_3 &= f + h\beta_3\left[f_t + k_2 f_y\right] + \frac{1}{2}h^2\beta_3^2\left[f_{tt} + 2k_2 f_{ty} + k_2^2 f_{yy}\right] + \frac{1}{6}h^3\beta_3^3\left[f_{ttt} + 3k_2 f_{tty} + 3k_2^2 f_{tyy} + k_2^3 f_{yyy}\right]\\
&= f + h\beta_3\left[f_t + \left(f + h\beta_2\left[f_t + f f_y\right] + \tfrac{1}{2}h^2\beta_2^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right]\right)f_y\right]\\
&\quad + \frac{1}{2}h^2\beta_3^2\left[f_{tt} + 2\left(f + h\beta_2\left[f_t + f f_y\right]\right)f_{ty} + \left(f^2 + 2h\beta_2\left[f_t + f f_y\right]f\right)f_{yy}\right]\\
&\quad + \frac{1}{6}h^3\beta_3^3\left[f_{ttt} + 3f f_{tty} + 3f^2 f_{tyy} + f^3 f_{yyy}\right] + O(h^4)\\
&= f + h\beta_3\left[f_t + f f_y\right] + h^2\beta_2\beta_3\left[f_t + f f_y\right]f_y + \frac{1}{2}h^2\beta_3^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right]\\
&\quad + \frac{1}{2}h^3\beta_3\beta_2^2\left[f_{tt} + 2f f_{ty} + f^2 f_{yy}\right]f_y + h^3\beta_3^2\beta_2\left[f_t + f f_y\right]\left[f_{ty} + f f_{yy}\right]\\
&\quad + \frac{1}{6}h^3\beta_3^3\left[f_{ttt} + 3f f_{tty} + 3f^2 f_{tyy} + f^3 f_{yyy}\right] + O(h^4).
\end{align*}
Expanding $k_4$ in the same manner, substituting, and comparing the coefficients in
\[ y_{i+1} = y_i + h\left[\alpha_1 k_1 + \alpha_2 k_2 + \alpha_3 k_3 + \alpha_4 k_4\right] \]
with those of the Taylor expansion of the solution, term by term up to order 4, yields the conditions
\begin{align}
\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 &= 1 \tag{6.4.5}\\
\alpha_2\beta_2 + \alpha_3\beta_3 + \alpha_4\beta_4 &= \frac{1}{2} \tag{6.4.6}\\
\alpha_2\beta_2^2 + \alpha_3\beta_3^2 + \alpha_4\beta_4^2 &= \frac{1}{3} \tag{6.4.7}\\
\alpha_3\beta_2\beta_3 + \alpha_4\beta_3\beta_4 &= \frac{1}{6} \tag{6.4.8}\\
\alpha_2\beta_2^3 + \alpha_3\beta_3^3 + \alpha_4\beta_4^3 &= \frac{1}{4} \tag{6.4.9}\\
\alpha_3\beta_3^2\beta_2 + \alpha_4\beta_4^2\beta_3 &= \frac{1}{8} \tag{6.4.10}\\
2\alpha_3\beta_3^2\beta_2 + 2\alpha_4\beta_4^2\beta_3 + \alpha_3\beta_3\beta_2^2 + \alpha_4\beta_4\beta_3^2 &= \frac{1}{3} \tag{6.4.11}\\
\alpha_3\beta_3^2\beta_2 + \alpha_4\beta_4^2\beta_3 + \alpha_3\beta_3\beta_2^2 + \alpha_4\beta_4\beta_3^2 &= \frac{5}{24} \tag{6.4.12}\\
\alpha_3\beta_3\beta_2^2 + \alpha_4\beta_4\beta_3^2 &= \frac{1}{12} \tag{6.4.13}\\
\alpha_4\beta_2\beta_3\beta_4 &= \frac{1}{24}. \tag{6.4.14}
\end{align}
Any four-stage (s = 4) fourth order Runge-Kutta method of the form (6.4.2) will have to satisfy these 10 equations
with only 7 degrees of freedom (7 variables). Either the equations form a dependent set or solutions will be rare.
In an attempt to solve the system, we solve (6.4.14) for $\alpha_4$:
\[ \alpha_4 = \frac{1}{24\beta_2\beta_3\beta_4}. \]
Substituting our formula for $\alpha_4$ into (6.4.8) and solving for $\alpha_3$:
\[ \alpha_3 = \frac{4\beta_2 - 1}{24\beta_2^2\beta_3}. \]
Substituting our formulas for $\alpha_3$ and $\alpha_4$ into (6.4.13) and solving for $\beta_3$:
\[ \beta_3 = -4\beta_2^2 + 3\beta_2. \]
Substituting our formulas for $\alpha_3$, $\alpha_4$ and $\beta_3$ into (6.4.10) and solving for $\beta_4$ gives $\beta_4$ in terms of $\beta_2$ alone, and substituting our formulas for $\alpha_3$, $\alpha_4$, $\beta_3$ and $\beta_4$ into (6.4.6) and solving for $\alpha_2$ likewise gives $\alpha_2$ in terms of $\beta_2$. Carrying all of these expressions into (6.4.7) then determines $\beta_2$, and we find
\[ \beta_2 = \frac{1}{2},\quad \alpha_2 = \frac{1}{3},\quad \beta_4 = 1,\quad \beta_3 = \frac{1}{2},\quad \alpha_3 = \frac{1}{3},\quad \alpha_4 = \frac{1}{6}. \]
Substituting these values of $\alpha_2$, $\alpha_3$, and $\alpha_4$ into (6.4.5), we find
\[ \alpha_1 = \frac{1}{6}. \]
These seven values are the unique simultaneous real solution of the equations (6.4.14), (6.4.8), (6.4.13), (6.4.10),
(6.4.6), (6.4.7), and (6.4.5). So the seven parameters are determined by 7 of the ten conditions. It remains to
show that these seven values also satisfy (6.4.9), (6.4.11), and (6.4.12), which they do. Finally, note that these
are the values of the parameters for the (classic) Runge-Kutta method of order 4.
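As a numerical sanity check, the following Octave script (an illustrative addition, not from the text) confirms that the classic RK4 coefficients satisfy all ten conditions (6.4.5)-(6.4.14); every entry of res should be zero up to round-off.

% classic RK4 coefficients: a = [alpha1..alpha4], b = [beta2 beta3 beta4]
a = [1/6 1/3 1/3 1/6];
b = [1/2 1/2 1];
% residuals of conditions (6.4.5) through (6.4.14), in order
res = [sum(a) - 1;
       a(2)*b(1) + a(3)*b(2) + a(4)*b(3) - 1/2;
       a(2)*b(1)^2 + a(3)*b(2)^2 + a(4)*b(3)^2 - 1/3;
       a(3)*b(1)*b(2) + a(4)*b(2)*b(3) - 1/6;
       a(2)*b(1)^3 + a(3)*b(2)^3 + a(4)*b(3)^3 - 1/4;
       a(3)*b(2)^2*b(1) + a(4)*b(3)^2*b(2) - 1/8;
       2*a(3)*b(2)^2*b(1) + 2*a(4)*b(3)^2*b(2) + a(3)*b(2)*b(1)^2 + a(4)*b(3)*b(2)^2 - 1/3;
       a(3)*b(2)^2*b(1) + a(4)*b(3)^2*b(2) + a(3)*b(2)*b(1)^2 + a(4)*b(3)*b(2)^2 - 5/24;
       a(3)*b(2)*b(1)^2 + a(4)*b(3)*b(2)^2 - 1/12;
       a(4)*b(1)*b(2)*b(3) - 1/24]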
Key Concepts
Taylor's theorem in two variables: Suppose f(t, y) and all its partial derivatives of order n + 1 and lower are continuous on the rectangle D = {(t, y) : a ≤ t ≤ b, c ≤ y ≤ d}, and let (t0, y0) ∈ D. Then for every (t, y) ∈ D, there exist ξ ∈ (a, b) and µ ∈ (c, d) such that f(t, y) equals its nth Taylor polynomial in t and y about (t0, y0) plus a remainder term whose partial derivatives of order n + 1 are evaluated at (ξ, µ).
[Figure: the first page of Heun's 1900 article [16], in which Karl Heun puts forth the third order method that bears his name. Even if you can not read the German, his formula VI) is clear!]
Due to its inefficiency, open-ode should never be used in practice by itself, but combined with Heun's third order method it provides a useful estimate of error. Heun's third order method is calculated as
\begin{align*}
k_1 &= f(t_i, y_i)\\
k_2 &= f\!\left(t_i + \frac{h}{3},\; y_i + \frac{h}{3}k_1\right)\\
k_3 &= f\!\left(t_i + \frac{2h}{3},\; y_i + \frac{2h}{3}k_2\right)\\
y_{i+1} &= y_i + \frac{h}{4}\left[k_1 + 3k_3\right] + O(h^4).
\end{align*}
Using the same $k_1$, $k_2$, and $k_3$, the open-ode method is calculated as
\[ y_{i+1} = y_i + \frac{h}{2}\left[k_2 + k_3\right] + O(h^3). \]
The difference between these estimates is
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] = Mh^3 + O(h^4) \tag{6.5.1} \]
for some constant $M$, and represents the local truncation error of the lower order method, open-ode. This error estimate can be used to adapt the size of $h$ from one step to the next, decreasing the step size when the local truncation error is bigger than some tolerance and increasing the step size when the local truncation error is smaller than some tolerance.
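The embedded pair translates directly into a few lines of Octave. The sketch below is an illustrative addition (not from the text; the function name is an arbitrary choice). It returns Heun's third order estimate together with the error estimate (6.5.1) for a single step.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% heun3_openode_step() performs one step of the embedded %
% Heun(3)/open-ode pair.  f is a function handle f(t,y), %
% (t,y) is the current point, and h is the step size.    %
% OUTPUT: ynew, the third order estimate, and err, the   %
% estimate of the open-ode local truncation error (6.5.1)%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [ynew, err] = heun3_openode_step(f, t, y, h)
  k1 = f(t, y);
  k2 = f(t + h/3, y + h/3*k1);
  k3 = f(t + 2*h/3, y + 2*h/3*k2);
  ynew = y + h/4*(k1 + 3*k3);           % Heun's third order estimate
  err  = abs(h/4*(k1 - 2*k2 + k3));     % error estimate for open-ode
end%function

For the worked example that follows, [y1, err] = heun3_openode_step(@(t,y) -y./t + t.^2, 4, 20, -1) reproduces the hand-computed values y1 ≈ 12.068 and error estimate ≈ 0.017.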
To illustrate the algorithm and the benefits of adaptive routines, let's return to o.d.e. 6.2.1, $\dot y = -\frac{y}{t} + t^2$, which we have generously leaned upon already. As before, we will estimate y(2) given initial condition y(4) = 20. This time the number of steps to compute will be determined by the algorithm, not by us, at least after the first step. Unfortunately, there is no standard or fool-proof way to choose the size of the first step. Because we are looking for a computation that can be done by hand, let's try $h = -1$ to begin, one half the width of the interval [2, 4] over which we will integrate.
As was needed for adaptive quadrature, a desired level of accuracy, or tolerance, is needed here too. Again, because we are looking for a computation that can be done by hand, let's try 0.1, a pretty modest accuracy.
Finally, we are ready to compute:
\begin{align*}
k_1 &= f(4, 20) = 11\\
k_2 &= f\!\left(4 - \tfrac{1}{3},\; 20 - \tfrac{1}{3}\cdot 11\right) \approx 8.98989898989899\\
k_3 &= f\!\left(4 - \tfrac{2}{3},\; 20 - \tfrac{2}{3}\cdot 8.9898\ldots\right) \approx 6.90909090909091.
\end{align*}
Before computing $y_1$ from these values, we need to check that the expected accuracy of the calculation would not violate the 0.1 requirement:
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx 0.017. \]
The approximate error in stepping to $t_1 = 3$ is about 0.02, well below the desired threshold. We are clear to proceed:
\begin{align*}
y_1 &= y_0 + \frac{h}{4}\left[k_1 + 3k_3\right] \approx 12.06818181818182\\
t_1 &= t_0 + h = 3.
\end{align*}
Before computing $y_2$, we recompute $k_1$, $k_2$, and $k_3$ at $(t_1, y_1)$ and check that the expected accuracy of the calculation would not violate the 0.1 requirement:
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx 0.062. \]
The approximate error in stepping to $t_2 = 2$ is about 0.06, well below the desired threshold. We are clear to proceed:
\begin{align*}
y_2 &= y_1 + \frac{h}{4}\left[k_1 + 3k_3\right] \approx 9.932224025974026\\
t_2 &= t_1 + h = 2.
\end{align*}
Hence we have y(2) ≈ 9.932. After two steps, the actual error is about |10 − 9.932| = 0.068. Of course, we could have simply executed Heun's third order method with step size $h = -1$ (and no error checking) and gotten the same answer. The difference is we would not have had any idea what to expect for an error! With the adaptive method, you can be reasonably sure each step incurs only the error you request. At the risk of belaboring the point, consider redoing the calculation with step size $h = -2$:
\begin{align*}
k_1 &= f(4, 20) = 11\\
k_2 &= f\!\left(4 - \tfrac{2}{3},\; 20 - \tfrac{2}{3}\cdot 11\right) \approx 7.311111111111111\\
k_3 &= f\!\left(4 - \tfrac{4}{3},\; 20 - \tfrac{4}{3}\cdot 7.3111\ldots\right) \approx 3.266666666666667.
\end{align*}
If we proceed with Heun's third order method (and no error checking), we get
\begin{align*}
y_1 &= y_0 + \frac{h}{4}\left[k_1 + 3k_3\right] \approx 9.6\\
t_1 &= t_0 + h = 2.
\end{align*}
However, without the exact answer, which will be the usual situation when using a numerical method, we have no way to know how accurate this estimate is! In that regard, the value 9.6 is a somewhat useless estimate.
On the other hand, since we know the exact value of y(2) is 10, we know the error is 0.4, larger than the desired 0.1. The adaptive Heun should catch this and arrive at a more accurate estimate:
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx 0.177. \]
The adaptive method would reject this step because the approximate error is greater than the desired accuracy, without calculating $y_1$! So what should it do instead? The adaptive method will try again with a smaller step size.
Since
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx Mh^3, \]
we have $Mh^3 \approx 0.177$ for the step size just attempted and any step size close to it. If we scale the step size by a factor of $q$, say, we should expect the new error to be approximately $M(qh)^3$, or $q^3Mh^3 \approx 0.177q^3$. Since we would like that error to be no more than 0.1, we should choose $q$ so that $0.177q^3 < 0.1$, or $q^3 < \frac{0.1}{0.177}$, which implies $q < \sqrt[3]{\frac{0.1}{0.177}} \approx 0.8254$. But it would slow down the algorithm immensely if the step size were too large very often, so instead, we will take a somewhat conservative next step of $0.9qh \approx 0.9(0.8254)(-2) \approx -1.485$. Recalculating with the new step size:
\begin{align*}
k_1 &= f(4, 20) = 11\\
k_2 &= f\!\left(4 - \tfrac{1.485}{3},\; 20 - \tfrac{1.485}{3}\cdot 11\right) \approx 8.130924301356263\\
k_3 &= f\!\left(4 - \tfrac{2(1.485)}{3},\; 20 - \tfrac{2(1.485)}{3}\cdot 8.1309\ldots\right) \approx 5.087191526760124
\end{align*}
and
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx 0.06487930780869297, \]
so this step is accepted, leaving $t_1 = 4 + h \approx 2.514$.
Now we keep the new step size until it proves to be inappropriate. In this case, that happens right away. Another step of −1.485 would take the solution to $t_2 \approx 1.028$, well past the desired t = 2. So, we shorten the step size to $2 - t_1 = -0.514132737997418$. There is no worry about shortening the step size as that is expected to reduce the error! Finally, with $h = -0.514132737997418$, we recompute $k_1$, $k_2$, and $k_3$ at $(t_1, y_1)$. Since
\[ \frac{h}{4}\left[k_1 - 2k_2 + k_3\right] \approx 0.01476646399275057, \]
this step is accepted:
\begin{align*}
y_2 &= y_1 + \frac{h}{4}\left[k_1 + 3k_3\right] \approx 9.879332752200975\\
t_2 &= t_1 + h = 2.
\end{align*}
We have y(2) ≈ 9.879332752200975 with some confidence that the error will not be terribly much more than about
0.2, since we took two steps each of which may have incurred an error of about 0.1. There is no guarantee the error
will be less than 0.2, but at least we have some confidence that it’s not drastically greater. And because we used
a conservative estimate for step size, the actual error is probably a bit smaller (as it turns out, the error is about
0.12).
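In Octave, the step-size rescaling just carried out amounts to a single line. The snippet below is a hedged sketch using the rounded values quoted above, not a computation from the text.

tol = 0.1;  err = 0.177;  h = -2;   % rounded values from the rejected h = -2 step
q = 0.9*(tol/err)^(1/3)             % about 0.74
h = q*h                             % about -1.49, close to the -1.485 used above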
Step 9: Set i = i + 1;
Step 10: If err < tol/5 or err > tol then do steps 11-14:
Step 11: Set $q = 0.9\left(\frac{tol}{err}\right)^{1/3}$
Step 12: If $q < \frac{1}{10}$ then set $q = \frac{1}{10}$
Step 13: If q > 5 then set q = 5
Step 14: Set h = qh
Step 15: If not done then Print "Method failed. Maximum iterations exceeded."
Output: Approximation y(b) or message of failure.
The formulas for ki and err will need to be changed for different adaptive Runge-Kutta schemes, as will the
recalculation of h in Steps 11-14, but the basic algorithm does not require modification for other embedded methods.
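Putting the pieces together, here is a minimal Octave sketch of the adaptive RK2(3) algorithm just outlined. It is an illustration, not the text's own rk23 routine from exercise 1; the argument order mimics the call [y,x]=rk23(f,a,ya,b,tol,N) that appears in the solution of exercise 18, and the initial step size is an arbitrary choice.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% rk23_sketch() -- illustrative adaptive RK2(3) solver   %
% using the embedded Heun(3)/open-ode pair.              %
% INPUT: function handle f(t,y); endpoints a, b; initial %
%        value ya; tolerance tol; maximum iterations N.  %
% OUTPUT: arrays y and x of accepted approximations.     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y, x] = rk23_sketch(f, a, ya, b, tol, N)
  x(1) = a;  y(1) = ya;
  h = (b - a)/10;                 % initial step size (an arbitrary choice)
  i = 1;
  for count = 1:N
    laststep = false;
    if (abs(h) >= abs(b - x(i)))  % do not step past b
      h = b - x(i);
      laststep = true;
    end%if
    k1 = f(x(i), y(i));
    k2 = f(x(i) + h/3, y(i) + h/3*k1);
    k3 = f(x(i) + 2*h/3, y(i) + 2*h/3*k2);
    err = abs(h/4*(k1 - 2*k2 + k3));
    if (err <= tol)               % accept the Heun third order estimate
      y(i+1) = y(i) + h/4*(k1 + 3*k3);
      x(i+1) = x(i) + h;
      i = i + 1;
      if (laststep)
        return                    % reached b; done
      end%if
    end%if
    if (err < tol/5 || err > tol) % rescale the step size (Steps 10-14)
      q = 0.9*(tol/max(err, eps))^(1/3);
      q = max(min(q, 5), 1/10);
      h = q*h;
    end%if
  end%for
  disp('Method failed. Maximum iterations exceeded.');
end%function

For example, [y,x] = rk23_sketch(@(t,y) -y./t + t.^2, 4, 20, 2, 0.1, 1000) produces an approximation of y(2) for the worked example above; the last entry of y is the estimate.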
All of the Runge-Kutta methods we have considered to this point take the form
\begin{align*}
k_1 &= f(t_i, y_i)\\
k_2 &= f(t_i + \beta_2 h,\, y_i + \beta_2 h k_1)\\
k_3 &= f(t_i + \beta_3 h,\, y_i + \beta_3 h k_2)\\
&\;\;\vdots\\
k_s &= f(t_i + \beta_s h,\, y_i + \beta_s h k_{s-1})\\
y_{i+1} &= y_i + h\left[\alpha_1 k_1 + \alpha_2 k_2 + \alpha_3 k_3 + \cdots + \alpha_s k_s\right].
\end{align*}
In methods of this type, k1 is used in the computation of k2 ; k2 is used in the computation of k3 ; k3 is used in the
computation of k4 ; and so on. However, there is nothing preventing one from deriving a method where both k1
and k2 are used in the computation of k3 ; all of k1 , k2 , and k3 are used in the computation of k4 ; and in general
allowing all of k1 , k2 , . . . , kj−1 to be used in computing kj . Doing so gives more degrees of freedom for satisfying
the error analysis equations, lending hope that there are many more Runge-Kutta methods possible. Any method
of this more general form is called an explicit Runge-Kutta method and can be formulated as
\begin{align*}
k_1 &= f(t_i, y_i)\\
k_2 &= f(t_i + \delta_2 h,\, y_i + \beta_{21} h k_1)\\
k_3 &= f(t_i + \delta_3 h,\, y_i + \beta_{31} h k_1 + \beta_{32} h k_2)\\
&\;\;\vdots\\
k_s &= f\!\left(t_i + \delta_s h,\; y_i + \sum_{j=1}^{s-1}\beta_{sj} h k_j\right)\\
y_{i+1} &= y_i + h\left[\alpha_1 k_1 + \alpha_2 k_2 + \alpha_3 k_3 + \cdots + \alpha_s k_s\right]. \tag{6.5.2}
\end{align*}
The coefficients of a method of the form (6.5.2) are commonly summarized in a Butcher tableau,
\[
\begin{array}{c|ccccc}
0 & & & & & \\
\delta_2 & \beta_{21} & & & & \\
\delta_3 & \beta_{31} & \beta_{32} & & & \\
\vdots & \vdots & & \ddots & & \\
\delta_s & \beta_{s1} & \beta_{s2} & \cdots & \beta_{s(s-1)} & \\
\hline
 & \alpha_1 & \alpha_2 & \cdots & \alpha_{s-1} & \alpha_s
\end{array}
\]
much like the coefficients of a system of linear equations might be summarized in a matrix. The Butcher tableau for any of the Runge-Kutta methods we have considered so far will take the form
\[
\begin{array}{c|cccccc}
0 & & & & & & \\
\delta_2 & \beta_{21} & & & & & \\
\delta_3 & 0 & \beta_{32} & & & & \\
\delta_4 & 0 & 0 & \beta_{43} & & & \\
\vdots & \vdots & \vdots & & \ddots & & \\
\delta_s & 0 & 0 & \cdots & 0 & \beta_{s(s-1)} & \\
\hline
 & \alpha_1 & \alpha_2 & \alpha_3 & \cdots & \alpha_{s-1} & \alpha_s
\end{array}
\]
For example, Heun’s third order method would be summarized in a Butcher tableau as
\[
\begin{array}{c|ccc}
0 & & & \\
\frac{1}{3} & \frac{1}{3} & & \\
\frac{2}{3} & 0 & \frac{2}{3} & \\
\hline
 & \frac{1}{4} & 0 & \frac{3}{4}
\end{array}
\]
For our purposes, adaptive Runge-Kutta schemes, also called embedded methods, will be coded in a Butcher tableau
by adding one more line for the coefficients αj of the lower order method. For example the Butcher tableau for
RK2(3) as presented above would be
\[
\begin{array}{c|ccc}
0 & & & \\
\frac{1}{3} & \frac{1}{3} & & \\
\frac{2}{3} & 0 & \frac{2}{3} & \\
\hline
 & \frac{1}{4} & 0 & \frac{3}{4} \\
 & 0 & \frac{1}{2} & \frac{1}{2}
\end{array}
\]
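Reading a Butcher tableau into code is straightforward. The sketch below is an illustrative addition (not from the text), carrying out one step of a general explicit Runge-Kutta method (6.5.2) from the tableau entries; it is written for scalar problems only.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% rk_explicit_step() performs one step of an explicit    %
% Runge-Kutta method of the form (6.5.2).                %
% INPUT: handle f(t,y); current point (t,y); step h;     %
%        delta, column of the delta_j (delta(1) = 0);    %
%        B, strictly lower triangular matrix of beta_jk; %
%        alpha, row vector of weights.                   %
% OUTPUT: the next approximation ynew.                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ynew = rk_explicit_step(f, t, y, h, delta, B, alpha)
  s = length(alpha);
  k = zeros(s, 1);
  k(1) = f(t, y);
  for j = 2:s
    k(j) = f(t + delta(j)*h, y + h*(B(j,1:j-1)*k(1:j-1)));
  end%for
  ynew = y + h*(alpha*k);
end%function

For Heun's third order method, delta = [0;1/3;2/3], B = [0 0 0;1/3 0 0;0 2/3 0], and alpha = [1/4 0 3/4]; then rk_explicit_step(@(t,y) -y./t+t.^2, 4, 20, -1, delta, B, alpha) reproduces y1 ≈ 12.068 from the worked example earlier in this section.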
The most general Butcher tableau for a non-embedded method allows every entry $\beta_{jk}$, $1 \le j, k \le s$, to be nonzero. If any of the $\beta_{jk}$ with $k \ge j$ are nonzero, the associated Runge-Kutta scheme is an implicit method. Each step of the method will require solving a system of equations. Implicit Runge-Kutta methods can be considered for approximating the solutions of stiff o.d.e.s since explicit methods are often exceedingly bad at it.
For example, the initial value problem
\begin{align}
\dot x &= x^2 - x^3\nonumber\\
x(0) &= \delta \tag{6.5.3}
\end{align}
has no closed form solution. The best one can do is derive an implicit solution, so a numerical solution is necessary to approximate values of the function. Some basic analysis can give an idea what the solution is like, however. It has an equilibrium at x = 0, which means that if $x(t_0) = 0$ for some $t_0$, then x(t) = 0 for all t. The function remains constant for all time. It is in equilibrium. It does not change. This follows from the fact that when x = 0, $\dot x = 0^2 - 0^3 = 0$. Similarly, the o.d.e. has an equilibrium at x = 1 (because 1 is another root of the polynomial $x^2 - x^3$), and it has no others. However, the two equilibria are very different from one another. The equilibrium at x = 0 is unstable while the equilibrium at x = 1 is stable. If $x(t_0)$ is near enough to 1 ($|x(t_0) - 1| < 1$ will do), then x will tend toward 1 as t → ∞. However, there is no such condition near x = 0. No matter how close $x(t_0)$ is to zero, if it is positive, x will still tend to the other equilibrium, 1, as t → ∞. More to the point, though, is how the values of x approach 1 as t → ∞.
The hope for an adaptive o.d.e. solver is that it will take large steps where the function is not varying quickly (has a small first derivative) and will be more careful, taking small steps, where the function is varying quickly (has a large first derivative). More often than not, this is exactly what happens. Stiff o.d.e.s are an exception to the rule, where an adaptive method takes many small steps even in a region where the function has a small first derivative. The following figures show the solution of (6.5.3) using RK2(3) with tolerance $10^{-6}$, $\delta = 10^{-3}$, and initial step size 3 over the interval $[0, 2/\delta]$. First, the solution over [0, 980] acts as we would hope. The solver takes large steps, including one step from t ≈ 93 to t ≈ 210, a step size h > 117, at the beginning where the function changes very slowly.
[Figure: the computed solution x(t) over roughly [0, 900]; x grows slowly, remaining below about 0.045.]
In the middle, the solution over [980, 1020] continues to act as we would hope. The solution begins to vary more
quickly here and, consequently, the solver takes a number of smaller steps.
[Figure: the computed solution x(t) over [980, 1020]; x rises quickly from near 0 to near 1.]
Toward the end, the solution over [1020, 2000] demonstrates the consequence of stiffness. The exact solution is
very nearly constant over this region, gradually approaching 1 from below. A good solver would again take large
steps across this region, but adaptive explicit Runge-Kutta schemes do not. The numerical solution oscillates
within tolerance about 1, so it does what it is supposed to do, but it takes many short steps to do so.
[Figure: the computed solution x(t) over roughly [1100, 2000]; the many computed points oscillate between about 0.999998 and 1.000002.]
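The figures above can be reproduced with any adaptive RK2(3) solver. The snippet below is a hedged sketch using the illustrative rk23_sketch function defined earlier (a solver written for exercise 1 works the same way); the maximum iteration count is a guess and may need to be raised since the stiff tail forces many small steps.

delta = 1e-3;
f = @(t,x) x.^2 - x.^3;
[x, t] = rk23_sketch(f, 0, delta, 2/delta, 1e-6, 200000);  % maximum count is a guess
plot(t, x)     % compare with the three figures above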
Key Concepts
Embedded Runge-Kutta method: A Runge-Kutta method in which there are two schemes of different orders
derived from the same set of function evaluations.
Adaptive Runge-Kutta method: A Runge-Kutta method that takes advantage of an embedded Runge-Kutta
scheme to automatically adapt the step size as it estimates the solution of an o.d.e.
Butcher tableau: A tabular representation of a Runge-Kutta method.
RKm(n): Shorthand for an embedded Runge-Kutta method containing schemes with rates of convergence (com-
monly called orders) m and n.
Exercises
2. Which are the Butcher tableaux of implicit methods? 5. Show that the method given by the Butcher tableau
[A]
has order 2 for any $\delta \in \left[\frac{1}{2}, 1\right]$.
0
0
1 1 1
4 8 8 δ δ
1 1
0 1 1
(a) 2 2 1− 2δ 2δ
3 3 9
4 16
0 16
1 − 73 2 − 12
7
8
7
6. Demonstrate numerically that the method sug-
7 32 12 32 7 gested by the Butcher tableau has rate of convergence
90 90 90 90 90 O(h3 ).
0
0
1 1
4 4 1 1
3 3 3
− 49 3 2 2
(b) 4 (a) 3
0 3
1 1 5 1
2 18 12 36 1 0 0 1
7
1 9
− 35 − 19 2 3 1
0 4
0 4
1 2 1
6
0 0 3 6
0
0 2 2
7 7
1 1
4 8 4 [S]
2 2 (b) 7
− 35 5
1 1
(c) 0 6 29
2 2
7 42
− 23 5
6
1 0 0 1 1 1 5 1
6 6 12 4
1 1 1 1
6 3 3 6
√ √
0
1
0 12
− 125 12
5 1
12
1 1
2 2
√
5− 5 1 1
√
10−7 5
√
5 (c)
3 3
10
√
12 4
√
60 60
√ 4
0 4
5+ 5 1 10+7 5 1
(d) 10 12 60 4
− 605 2 1 4
9 3 9
1 5 5 1
1 12 12 12 12
1 5 5 1
7. Euler’s method and the improved Euler method use the
12 12 12 12 same function evaluations. Thus, they can be combined
into an embedded, and therefore adaptive, method.
3. Show that this is the Butcher tableau for Euler’s Write the Butcher tableau for the Euler/improved Eu-
method. ler embedded method.
0 0
1 1
1 1 3 3
5 5
1 1 1
3 3 9 3 6 6
10 40 40
1 1 3
3 3 9
− 10 6 2 8
0 8
5 10 5
1
1 − 11 5
− 70 35 1 2
0 − 32 2
54 2 27 27
1 2 1
7 1631 175 575 44275 253 6
0 0 3 6
8 55296 512 13824 110592 4096
1 3 2 1
37
0 250 125
0 512 10
0 10 5 5
378 621 594 1771
2825 18575 13525 277 1
27648
0 48384 55296 14336 4 14. Merson (1957). Write an Octave function that
12. The following pairs of Runge-Kutta methods use the implements the adaptive method of exercise 13. [A]
same function evaluations, but have different rates of 15. The initial value problem
convergence. They can each therefore be paired to form
an embedded Runge-Kutta scheme. Write the Butcher x + 2ey cos(ex )
y0 =
tableau for the embedded method. 1 + ey
y(0) = 2 (6.5.4)
(a) The method of exercise 6a and open-ode.
3 [A] can not be solved analytically. The solution must be
(b) The -rule (exercise 9) and the following.
8 approximated. Use your code from the given exercise to
approximate y(4) with an error of no more than 10−4 .
0
[S]
1 1 (a) 1
3 3
2 (b) 8
3
− 13 1
1 1 (c) 10
0 2 2 [A]
(d) 11
3
(c) The 8
-rule (exercise 9) and the following. (e) 12a
[A]
(f) 12b
0
(g) 12c
1 1
3 3
(h) 12a
2
− 13 1
3 (i) 12b
1 1 −1 1 (j) 13
3
2
− 32 0 1 (k) 14
√
16. The initial value problem due to their failure to proceed beyond x = e.
They get “stuck” taking tinier and tinier steps
x2 + y √
y0 = near x = e, as they should since the solution
x − y2 does not exist beyond that point.
y(0) = 5 (6.5.5)
18. Attempt to approximate y(4) for the initial value
can not be solved analytically. The solution must be
problem in exercise 16. Use a variety of adaptive and
approximated. Use your code from the given exercise to
non-adaptive methods with a variety of tolerances. You
approximate y(3) with an error of no more than 10−4 .
should find that you can not obtain dependable results.
(a) 1 [S] Can you explain why not? HINT: You may wish to plot
(b) 8 the approximate solutions. If your solvers are written
so as to store the points in arrays, it is a simple mat-
(c) 10 ter to plot the solutions, as demonstrated for RK2(3),
[A]
(d) 11 using the code from the solution of exercise 1.
(e) 12a
[y,x]=rk23(f,0,5,4,.0001,1000);
[A]
(f) 12b plot(x,y)
(g) 12c
(h) 12a 19. The initial value problem
(i) 12b y0 = ln(x + y)
(j) 13 1
y(0) =
(k) 14 2
can not be solved analytically. The solution must be
17. Consider the initial value problem approximated. Apply the indicated method to com-
pute y(5) using tolerance 10−4 and an initial step
2
+ y2 1
y0 = −x size 10 . Is the global error (the error in approximat-
2xy ing y(5)) around 10−4 ? significantly smaller? sig-
y(1) = 1. nificantly larger? Accurate to 10 significant digits,
(a) Use your code from exercise 5 on page 226 (Heun’s y(5) = 6.409445034. [A]
third order method) to estimate y(2) with step (a) Cash-Karp (exercise 11)
size 0.01.
(b) Bogacki-Shampine (exercise 12b)
(b) Use your code from exercise 6 on page 226 (RK4)
to estimate y(2) with step size 0.01. (c) Merson (exercise 14)
(c) Compare the results of parts (a) and (b). You (d) RK2(3) (exercise 1)
should notice that they are rather different. The
rest of this exercise explores the reason for the 20. Modify the code you used in exercise 19 to count
discrepancy. the number of function evaluations performed. Which
method was most efficient? The method with the
(d) Use your code from exercise 1 (rk2(3)) to estimate
fewest evaluations was the most efficient. [A]
y(2) with tolerance 0.001 and maximum number
of steps 1000. 21. There are many embedded methods not mentioned
(e) Use your code from any of the parts of exercise 12 in this text, mostly of high order. Look some of
to estimate y(2) with tolerance 0.001 and maxi- them up, write code to implement them, and test your
mum number of steps 1000. code. In particular, you may look for the methods of
Fehlberg, Verner, or Dormand & Prince.
(f) You should have found that the method fails in
both parts (d) and (e). However, if you look at the 22. The Cash-Karp RK4(5) method [8] was designed to
last calculated values of x and y anyway (x(1001) contain embedded methods of all orders from 1 through
and y(1001)), you should find that in both cases, 5, not just orders 4 and 5. Show that the three em-
x ≈ 1.648 and y ≈ 0. The failure to approxi- bedded methods given in the Butcher tableau have the
mate y(2) is not a shortcoming of the numerical indicated orders.
method. The solution of the initial value problem
√ 0
only exists over the interval [1, e) ≈ [1, 1.648).
1 1
For dependable results, care must be taken that 5 5
the solution of the o.d.e. exists and is unique over 3 3 9
10 40 40
the entire interval from a to b. That said, the ba- 3 3 9 6
sic (non-adaptive) solvers plow right along and 5 10
− 10 5
give an approximation for y(2) that is entirely in- 19
54
0 − 10
27
55
54
Order 3
correct. Without some further analysis, you may
− 32 5
0 0 Order 2
not notice that the basic solvers are producing 2
Section 1.1
3a: $|\tilde p - p| = \left|\frac{1106}{9} - 123\right| = \frac{1}{9} \approx 0.111$
3e: $|\tilde p - p| = \left|10^{-4} - \pi^{-7}\right| \approx 2.311(10)^{-4}$, using the Octave command abs(10^-4-pi^-7).
4a: $\frac{|\tilde p - p|}{|p|} = \frac{\left|\frac{1106}{9} - 123\right|}{123} = \frac{1}{1107} \approx 9.03(10)^{-4}$
4c: $\frac{|\tilde p - p|}{|p|} = \frac{\left|1000 - 2^{10}\right|}{2^{10}} = \frac{3}{128} \approx 0.0234$
4e: $\frac{|\tilde p - p|}{|p|} = \frac{\left|10^{-4} - \pi^{-7}\right|}{\pi^{-7}} = 1 - \frac{\pi^7}{10000} \approx 0.69797$, using the Octave command abs(10^-4-pi^-7)/pi^-7.
5a: $\log\left|\frac{p}{\tilde p - p}\right| = \log\frac{123}{\left|\frac{1106}{9} - 123\right|} \approx 3.0$
5c: $\log\left|\frac{p}{\tilde p - p}\right| = \log\frac{2^{10}}{\left|1000 - 2^{10}\right|} \approx 1.6$
5e: $\log\left|\frac{p}{\tilde p - p}\right| = \log\frac{\pi^{-7}}{\left|10^{-4} - \pi^{-7}\right|} \approx 0.15616$, using the Octave command log(pi^-7/abs(10^-4-pi^-7))/log(10).
Section 1.2
1a: From Taylor's theorem, $T_3(x) = \sum_{k=0}^{3}\frac{f^{(k)}(x_0)}{k!}(x-x_0)^k = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \frac{f'''(x_0)}{3!}(x-x_0)^3$ for any function f with enough derivatives. So to find $T_3(x)$, we need to evaluate f, f′, f″, f‴ at $x_0 = 0$. To that end, f(x) = sin(x), so f′(x) = cos(x), f″(x) = −sin(x), and f‴(x) = −cos(x). Therefore, $f(x_0) = \sin(0) = 0$, $f'(x_0) = \cos(0) = 1$, $f''(x_0) = -\sin(0) = 0$, and $f'''(x_0) = -\cos(0) = -1$. Substituting this information into the formula for $T_3(x)$, we have
\[ T_3(x) = 0 + 1\cdot(x-0) + \frac{0}{2!}(x-0)^2 + \frac{-1}{3!}(x-0)^3 = x - \frac{1}{6}x^3. \]
Also from Taylor's Theorem, we know $R_3(x) = \frac{f^{(4)}(\xi)}{4!}(x-x_0)^4$ for any function f with enough derivatives. So we need to evaluate $f^{(4)}(x)$ at $x = \xi$. To that end, $f^{(4)}(x) = \sin(x)$ so $f^{(4)}(\xi) = \sin(\xi)$. Hence,
\[ R_3(x) = \frac{\sin(\xi)}{24}x^4. \]
1c: From Taylor's theorem, $T_3(x) = \sum_{k=0}^{3}\frac{f^{(k)}(x_0)}{k!}(x-x_0)^k = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \frac{f'''(x_0)}{3!}(x-x_0)^3$ for any function f with enough derivatives. So to find $T_3(x)$, we need to evaluate f, f′, f″, f‴ at $x_0 = \pi$. To that end, f(x) = sin(x), so f′(x) = cos(x), f″(x) = −sin(x), and f‴(x) = −cos(x). Therefore, $f(x_0) = \sin(\pi) = 0$, $f'(x_0) = \cos(\pi) = -1$, $f''(x_0) = -\sin(\pi) = 0$, and $f'''(x_0) = -\cos(\pi) = 1$. Substituting this information into the formula for $T_3(x)$, we have
\[ T_3(x) = 0 + (-1)\cdot(x-\pi) + \frac{0}{2!}(x-\pi)^2 + \frac{1}{3!}(x-\pi)^3 = \pi - x + \frac{1}{6}(x-\pi)^3. \]
Also from Taylor's Theorem, we know $R_3(x) = \frac{f^{(4)}(\xi)}{4!}(x-x_0)^4$ for any function f with enough derivatives. So we need to evaluate $f^{(4)}(x)$ at $x = \xi$. To that end, $f^{(4)}(x) = \sin(x)$ so $f^{(4)}(\xi) = \sin(\xi)$. Hence,
\[ R_3(x) = \frac{\sin(\xi)}{24}(x-\pi)^4. \]
octave:1> f=inline(’1-x^2/2+x^4/24’)
f = f(x) = 1-x^2/2+x^4/24
octave:2> f(0)
ans = 1
octave:3> f(1/2)
ans = 0.87760
octave:4> f(1)
ans = 0.54167
octave:5> f(pi)
ans = 0.12391
10: taylorExercise.m:
f=inline(’1-x^2/2+x^4/24’);
f(0)
f(1/2)
f(1)
f(pi)
Running taylorExercise.m:
octave:1> taylorExercise
ans = 1
ans = 0.87760
ans = 0.54167
ans = 0.12391
26: (a) From Taylor's theorem, $T_2(x) = \sum_{k=0}^{2}\frac{f^{(k)}(x_0)}{k!}(x-x_0)^k = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2$ for any function f with enough derivatives. So to find $T_2(x)$, we need to evaluate f, f′, and f″ at $x_0 = 5$. To that end, $f(x) = \frac{1}{x}$, so $f'(x) = -\frac{1}{x^2}$ and $f''(x) = \frac{2}{x^3}$. Therefore, $f(x_0) = \frac{1}{5}$, $f'(x_0) = -\frac{1}{25}$, and $f''(x_0) = \frac{2}{125}$. Substituting this information into the formula for $T_2(x)$, we have
\[ T_2(x) = \frac{1}{5} + \left(-\frac{1}{25}\right)(x-5) + \frac{2/125}{2!}(x-5)^2 = \frac{1}{5} - \frac{x-5}{25} + \frac{(x-5)^2}{125}. \]
(b) From Taylor's Theorem, $R_2(x) = \frac{f^{(3)}(\xi)}{3!}(x-x_0)^3$ for any function f with enough derivatives. So we need to evaluate $f^{(3)}(x)$ at $x = \xi$. To that end, $f^{(3)}(x) = -\frac{6}{x^4}$, so
\[ R_2(x) = \frac{-6/\xi^4}{6}(x-5)^3 = -\frac{(x-5)^3}{\xi^4}. \]
(c) $f(1) \approx T_2(1) = \frac{1}{5} - \frac{1-5}{25} + \frac{(1-5)^2}{125} = \frac{1}{5} + \frac{4}{25} + \frac{16}{125} = \frac{61}{125}$ and $f(9) \approx T_2(9) = \frac{1}{5} - \frac{9-5}{25} + \frac{(9-5)^2}{125} = \frac{1}{5} - \frac{4}{25} + \frac{16}{125} = \frac{21}{125}$.
(d) The bounds are 64 and $\frac{64}{625}$ respectively. According to Taylor's Theorem, the absolute error $|f(x) - T_2(x)| = |R_2(\xi)|$ for some ξ strictly between x and $x_0$. So we can obtain a theoretical bound by bounding $|R_2(x)|$ over all values of ξ between x and $x_0$. For x = 1, $R_2(x) = -\frac{(1-5)^3}{\xi^4} = \frac{64}{\xi^4}$ for some $\xi \in (1, 5)$, which is at most 64; for x = 9, $|R_2(x)| = \frac{64}{\xi^4}$ for some $\xi \in (5, 9)$, which is at most $\frac{64}{625}$.
[Figure: f(x) = 1/x and its second Taylor polynomial T2(x) plotted together for 1 ≤ x ≤ 9.]
30b: Perhaps it may initially come as a surprise, but we do not need to find $T_4(x)$ in order to answer this question. The matter of error is entirely taken up by the remainder term. So we need only calculate $R_4(x)$. This does, however, require us to find the first 5 derivatives of f(x):
\begin{align*}
f(x) &= e^{-x^2}\\
f'(x) &= -2xe^{-x^2}\\
f''(x) &= -2e^{-x^2} + (-2x)(-2xe^{-x^2}) = 2(2x^2-1)e^{-x^2}\\
f'''(x) &= 2\left[4xe^{-x^2} + (2x^2-1)(-2xe^{-x^2})\right] = -4(2x^3-3x)e^{-x^2}\\
f^{(4)}(x) &= -4\left[(6x^2-3)e^{-x^2} + (2x^3-3x)(-2xe^{-x^2})\right] = -4(-4x^4+12x^2-3)e^{-x^2}\\
f^{(5)}(x) &= -4\left[(-16x^3+24x)e^{-x^2} + (-4x^4+12x^2-3)(-2xe^{-x^2})\right] = -8(4x^5-20x^3+15x)e^{-x^2}.
\end{align*}
Now, $R_4(x) = \frac{f^{(5)}(\xi)}{5!}x^5 = \frac{-8\left(4\xi^5-20\xi^3+15\xi\right)e^{-\xi^2}}{120}x^5 = -\frac{x^5}{15}\left(4\xi^5-20\xi^3+15\xi\right)e^{-\xi^2}$. For any given value of x, we are faced with maximizing the absolute value of this expression over all ξ between 0 and x. We may ignore the $\frac{x^5}{15}$ factor, which is independent of ξ, and focus on finding extrema of $\left(4\xi^5-20\xi^3+15\xi\right)e^{-\xi^2}$. Sometimes, at this point, the expression requiring optimization is easy enough to handle using standard calculus techniques—finding critical points and evaluating. However, in this case, that would involve finding the roots of a sixth degree polynomial. Ironically, techniques we will learn later in this course would be helpful right now, but as it is, we have no way to do that in general. The best we can do is have a look at a graph and hope it helps. Letting $g(\xi) = \left(4\xi^5-20\xi^3+15\xi\right)e^{-\xi^2}$, we proceed by graphing g(ξ):
[Figure: the graph of g(ξ) for −4 ≤ ξ ≤ 4, showing six relative extrema and decay toward 0 as ξ → ±∞.]
With the goal of maximization in mind, it makes sense to take note of the relative extrema. The function appears to have 6 relative extrema and seems to approach zero as ξ approaches ±∞. To confirm that these observations are facts, we start by calculating $g'(\xi) = -(8\xi^6 - 60\xi^4 + 90\xi^2 - 15)e^{-\xi^2}$. Since a sixth degree polynomial has at most 6 distinct roots, g has at most 6 relative extrema. Since we can see 6 relative extrema on the graph, there are no others. Also, $\lim_{\xi\to\pm\infty} g(\xi) = 0$ since the exponential factor dominates the polynomial factor. We would possibly not have thought to consider these two facts if it were not for the graph. But there's more. The graph appears to be odd. Again, we can verify that this is indeed the case: $g(-\xi) = \left(-4\xi^5+20\xi^3-15\xi\right)e^{-\xi^2} = -g(\xi)$.
Due to this symmetry, we may focus on finding extrema for positive values of ξ. And since we are ultimately interested in maximizing |g|, it is a good time to consider the graph of |g(ξ)| over ξ ∈ [0, 4]:
[Figure: |g(ξ)| for 0 ≤ ξ ≤ 4, with its first relative maximum marked by a red plus.]
Finally, we can tackle the maximization. The relative maximum, marked with a red plus, will be the key to the answer. Let the coordinates of this point be $(\hat\xi, g(\hat\xi))$. Then, since |g(ξ)| is increasing on the interval from 0 to $\hat\xi$, we can conclude that
\[ \max_{\xi\in[0,x]} |g(\xi)| = |g(x)| = g(x) \]
for all x between 0 and $\hat\xi$. Moreover,
\[ \max_{\xi\in[0,x]} |g(\xi)| = |g(\hat\xi)| = g(\hat\xi) \]
for all $x \ge \hat\xi$. By symmetry, we can conclude that $\max_{\xi\in[x,0]} |g(\xi)| = |g(x)|$ for x between $-\hat\xi$ and 0, and $\max_{\xi\in[x,0]} |g(\xi)| = g(\hat\xi)$ for all $x \le -\hat\xi$. Putting it all together,
\[ |T_4(x) - f(x)| = |R_4(x)| \le \begin{cases} \dfrac{|x|^5}{15}\,|g(x)| & \text{if } |x| < \hat\xi\\[4pt] \dfrac{|x|^5}{15}\,g(\hat\xi) & \text{if } |x| \ge \hat\xi. \end{cases} \]
Section 1.3
1b: We need to find α such that $\lim_{n\to\infty} \frac{1/3^{e^{n+1}}}{\left(1/3^{e^n}\right)^\alpha} = \lambda$ for some λ ≠ 0. So, taking a close look at $\frac{1/3^{e^{n+1}}}{\left(1/3^{e^n}\right)^\alpha}$ should help:
\[ \frac{1/3^{e^{n+1}}}{\left(1/3^{e^n}\right)^\alpha} = \frac{\left(3^{e^n}\right)^\alpha}{3^{e^{n+1}}} = \frac{3^{\alpha e^n}}{3^{e^{n+1}}} = \frac{3^{\alpha e^n}}{3^{e\cdot e^n}}. \]
Consequently, if α = e, then $\frac{1/3^{e^{n+1}}}{\left(1/3^{e^n}\right)^\alpha} = 1$, from which it follows that $\lim_{n\to\infty} \frac{1/3^{e^{n+1}}}{\left(1/3^{e^n}\right)^\alpha} = 1$. Therefore, the order of convergence is α = e.
1c: We need to find α such that $\lim_{n\to\infty} \frac{\left|\frac{2^{2^{n+1}}-2}{2^{2^{n+1}}+3} - 1\right|}{\left|\frac{2^{2^n}-2}{2^{2^n}+3} - 1\right|^\alpha} = \lambda$ for some λ ≠ 0. So, taking a close look at that ratio should help. Since $\frac{2^{2^n}-2}{2^{2^n}+3} - 1 = \frac{-5}{2^{2^n}+3}$,
\[ \frac{\left|\frac{2^{2^{n+1}}-2}{2^{2^{n+1}}+3} - 1\right|}{\left|\frac{2^{2^n}-2}{2^{2^n}+3} - 1\right|^\alpha} = \frac{\frac{5}{2^{2^{n+1}}+3}}{\left(\frac{5}{2^{2^n}+3}\right)^\alpha} = 5^{1-\alpha}\,\frac{\left(2^{2^n}+3\right)^\alpha}{2^{2^{n+1}}+3}. \]
If α = 2, the leading terms in both numerator and denominator of the resulting fraction will match. This is strong evidence that α = 2 is the right choice. Let's try it:
\[ 5^{1-2}\,\frac{\left(2^{2^n}+3\right)^2}{2^{2^{n+1}}+3} = \frac{1}{5}\cdot\frac{2^{2\cdot 2^n} + 6\cdot 2^{2^n} + 9}{2^{2^{n+1}}+3} = \frac{1}{5}\cdot\frac{2^{2^{n+1}} + 6\cdot 2^{2^n} + 9}{2^{2^{n+1}}+3} = \frac{1}{5}\cdot\frac{1 + 6\cdot 2^{-2^n} + 9\cdot 2^{-2^{n+1}}}{1 + 3\cdot 2^{-2^{n+1}}}. \]
In the last step, we have divided both numerator and denominator by $2^{2^{n+1}}$ to make taking the limit as n approaches ∞ simple:
\[ \lim_{n\to\infty} \frac{\left|\frac{2^{2^{n+1}}-2}{2^{2^{n+1}}+3} - 1\right|}{\left|\frac{2^{2^n}-2}{2^{2^n}+3} - 1\right|^2} = \lim_{n\to\infty} \frac{1}{5}\cdot\frac{1 + 6\cdot 2^{-2^n} + 9\cdot 2^{-2^{n+1}}}{1 + 3\cdot 2^{-2^{n+1}}} = \frac{1}{5}. \]
So, the order of convergence is α = 2.
6c: To begin, we are looking for a function of the form $\frac{C}{n^p}$ or the form $\frac{K}{a^n}$ that will be at least as great as $\frac{\sin n}{\sqrt n}$ for large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution is to note that |sin n| ≤ 1 for all n:
\[ \left|\frac{\sin n}{\sqrt n}\right| = \frac{|\sin n|}{\sqrt n} \le \frac{1}{\sqrt n} = \frac{1}{n^{1/2}}. \]
Since this inequality will not hold for any higher power of n, the rate of convergence is $O\!\left(\frac{1}{n^{1/2}}\right)$.
6d: To begin, we are looking for a function of the form $\frac{C}{n^p}$ or the form $\frac{K}{a^n}$ that will be at least as great as $\frac{4}{10^n+35n+9}$ for large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution is to note that $10^n + 35n + 9 > 10^n$ for all n:
\[ \left|\frac{4}{10^n+35n+9}\right| = \frac{4}{10^n+35n+9} \le \frac{4}{10^n}. \]
Since this inequality will not hold for any base greater than 10, the rate of convergence is $O\!\left(\frac{1}{10^n}\right)$.
6e: To begin, we are looking for a function of the form $\frac{C}{n^p}$ or the form $\frac{K}{a^n}$ that will be at least as great as $\frac{4}{10^n-35n-9}$ for large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution is dealing with the fact that $10^n - 35n - 9 < 10^n$ for all n:
\[ \frac{4}{10^n-35n-9} = \frac{8}{2\cdot 10^n - 70n - 18} = \frac{8}{10^n + \left(10^n - 70n - 18\right)} \le \frac{8}{10^n} \]
for sufficiently large n since $10^n - 70n - 18 \ge 0$ for all large n. Since no similar inequality will hold for any base greater than 10, the rate of convergence is $O\!\left(\frac{1}{10^n}\right)$. Notice we have the same rate of convergence as in question 6d even though we ended up with a larger constant. The rate of convergence is not dependent on the constant needed in the inequality.
6k: To begin, we are looking for a function of the form $\frac{C}{n^p}$ or the form $\frac{K}{a^n}$ that will be at least as great as $\frac{n^2}{2^n}$ for large n. In the end, though, we want the smallest such function (up to a constant). Let 2 > ε > 0 be arbitrary. Notice that $\frac{n^2}{2^n} \le \frac{1}{(2-\varepsilon)^n}$ for large n, by rearranging the inequality like so: $\frac{n^2}{2^n} \le \frac{1}{(2-\varepsilon)^n}$ if and only if $n^2 \le \frac{2^n}{(2-\varepsilon)^n}$ if and only if $n^2 \le \left(\frac{2}{2-\varepsilon}\right)^n$. We know this last inequality to be true for sufficiently large n because $\frac{2}{2-\varepsilon} > 1$, and exponential functions dominate polynomial functions. Hence, we can use any rate of convergence of the form $O\!\left(\frac{1}{(2-\varepsilon)^n}\right)$, but there is no smallest such function. Hence, we are left simply using $O\!\left(\frac{n^2}{2^n}\right)$ as the rate of convergence.
Section 1.4
8: (a) trominos.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% trominos() written by Leon Q. Brin 14 February 2013 %
% is a recursively defined function for %
% calculating the number of trominos needed to %
% cover an n X n grid of squares, save one corner %
% INPUT: nonnegative integer n. %
% OUTPUT: T(n) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = trominos(n)
if (n==0)
ans = 0;
else
ans = 1+4*trominos(n-1);
end%if
end%function
(b)
octave:1> trominos(10)
ans = 349525
This demonstrates that the 4-disk game can be completed by completing the 3-disk game twice (the first and
last moves) plus one extra move (moving the bottom disk). There is no quicker way to do it because the top 3
disks must be moved off the bottom one before the bottom one can move. Then the bottom one must move,
and must take at least one move. Then the three top disks must be put back on top of the bottom disk. Since
we already know the minimum number of moves to move a stack of 3 disks, this diagram shows a minimum
number of moves to complete the 4-disk game.
ii. It takes a minimum of 2 · 7 + 1, or 15, moves to complete the 4-disk game.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% hanoi() written by Leon Q. Brin 14 February 2013 %
% is a recursively defined function for %
% calculating the number of moves needed to %
% complete the Tower of Hanoi with n disks. %
% INPUT: positive integer n. %
% OUTPUT: H(n) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = hanoi(n)
if (n==1)
ans = 1;
else
ans = 1+2*hanoi(n-1);
end%if
end%function
(c)
octave:1> hanoi(10)
ans = 1023
12a: This is asking for the number of ways to partition a set of 10 elements into a single nonempty subset. There
is only one way since there is only one subset allowed. That is, the “partition” contains just the set itself. So,
S(10, 1) = 1.
12d: This question is asking for the number of ways to partition a set of 4 elements into two nonempty subsets.
As implied by the question, the actual elements of the set are immaterial, so we can work with any set of
four elements and arrive at the correct answer. Consider the set {α, β, γ, δ}. The list of all partitions can be
categorized into those where one of the subsets has 1 element, one of the sets has 2 elements, or one of the
sets has 3 elements. One does not have a partition of nonempty subsets if one of the sets contains 0 or 4
elements. Here is the list of partitions where one of the sets has exactly one element:
{{α}, {β, γ, δ}}, {{β}, {α, γ, δ}}, {{γ}, {α, β, δ}}, {{δ}, {α, β, γ}}
Note that this is also the list of all partitions where one of the sets has exactly three elements. Here is the
list of partitions where one of the sets has exactly two elements (and, therefore, the other set also has two
elements):
{{α, β}, {γ, δ}}, {{α, γ}, {β, δ}}, {{α, δ}, {β, γ}}
There are no other partitions. Since we have listed 7 partitions, S(4, 2) = 7.
13: (a) S(n, 1) is the number of ways to partition a set of n elements into 1 nonempty subset. Of course, this is 1.
The only such partition contains the set itself.
(b) S(n, n) is the number of ways to partition a set of n elements into n nonempty subsets. Since the set
contains only n elements and we need to divide them among n subsets, each subset of the partition must
contain exactly one element, thus forming a partition of singleton sets. Since order does not matter in a
partition, there is only one way to do this. Thus, S(n, n) = 1.
16: 987. If we take a stack that is n − 1 inches high and add a block that is 1 inch high, we have a stack that is
n inches high with the top block being 1 inch tall. If we take a stack that is n − 2 inches high and add a
block that is 2 inches high, we have a stack that is n inches high with the top block being 2 inches tall. Any
stack created by adding a 1-inch block to a stack that is n − 1 inches tall is necessarily different from a stack
created by adding a 2-inch block to a stack that is n − 2 inches tall since the top blocks are different. Now, if
we take all the stacks that are n − 1 inches high and add 1-inch blocks to them, we have all the stacks that
are n inches high and have a 1-inch block on top. And if we take all the stacks that are n − 2 inches high
and add 2-inch blocks to them, we have all the stacks that are n inches high and have a 2-inch block on top.
There are no other n-inch high stacks since any such stack will either have a 1-inch block or a 2-inch block on
top. Therefore, the number of n-inch high stacks is just the number of (n − 1)-inch stacks plus the number of
(n − 2)-inch stacks. Of course, this doesn’t make sense for n = 1 or n = 2, so we need to specify that there
is exactly 1 way to create a stack of blocks 1 inch high (one 1-inch block), and there are exactly two ways to
create a stack of blocks 2 inches high (two 1-inch blocks or one 2-inch block). Now we can use the recursive
answer to find out how many ways of building taller stacks. The number of 3-inch stacks is the number of
2-inch stacks plus the number of 1-inch stacks, or 2 + 1 = 3. The number of 4-inch stacks is the number of
3-inch stacks plus the number of 2-inch stacks, or 3 + 2 = 5. The number of 5-inch stacks is the number of
4-inch stacks plus the number of 3-inch stacks, or 5 + 3 = 8. Continuing this way reveals the following table:
n                          6    7    8    9   10   11   12   13   14   15
number of n-inch stacks   13   21   34   55   89  144  233  377  610  987
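The recursion just described is easy to express in Octave; the function name stacks below is an illustrative choice, not code from the text.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% stacks() counts the ways to build an n-inch stack from %
% 1-inch and 2-inch blocks using the recursion above:    %
% stacks(n) = stacks(n-1) + stacks(n-2), with 1 one-inch %
% stack and 2 two-inch stacks to start.                  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function s = stacks(n)
  if (n == 1)
    s = 1;
  elseif (n == 2)
    s = 2;
  else
    s = stacks(n-1) + stacks(n-2);
  end%if
end%function

stacks(15) returns 987, in agreement with the table.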
Section 2.1
2c: Since g is a polynomial, it is continuous on [0, 0.9]. g(0) = 2 and g(0.9) = −.1897 so g has opposite signs on
the endpoints of [0, 0.9]. Therefore, the Intermediate Value Theorem guarantees a root on the interval [0, 0.9].
2f: The discontinuities of g are at ±1 due to the (1 − t2 ) factor in the denominator and at odd multiples of π2
due to the (tan t) factor in the numerator. None of these discontinuities occurs in the interval [21.5, 22.5], so
g is continuous on it. g(21.5) ≈ 1.6 > 0 and g(22.5) ≈ −1.6 < 0 so g has opposite signs on the endpoints
of [21.5, 22.5]. Therefore, the Intermediate Value Theorem guarantees a root on the interval [21.5, 22.5].
Incidentally, the discontinuities closest to [21.5, 22.5] are $\frac{13\pi}{2} \approx 20.42$ and $\frac{15\pi}{2} \approx 23.56$.
3: There is no single correct table for executing the bisection method. Anything that shows successive choices of
interval and accompanying computations will do.
For g(x) = 3x4 − 2x3 − 3x + 2 on [0, 0.9]:
\begin{align*}
\ln\frac{3}{2^j} &\le \ln\left(10^{-3}\right)\\
\ln 3 - \ln\left(2^j\right) &\le -3\ln 10\\
\ln 3 + 3\ln 10 &\le j\ln 2\\
\frac{\ln 3 + 3\ln 10}{\ln 2} &\le j
\end{align*}
So we need $j \ge \frac{\ln 3 + 3\ln 10}{\ln 2} \approx 11.55$. The least integer satisfying this inequality is 12. We need 12 iterations.
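The same count can be produced in Octave (an illustrative one-liner, not part of the text's solution):

% least j with 3/2^j <= 10^-3
j = ceil((log(3) + 3*log(10))/log(2))    % j = 12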
21: sin(42 ) = sin(16) < 0 and sin(52 ) = sin(25) < 0 so the assumptions of the bisection are not met on [4, 5] as
stated. However, if the bisection method is run anyway, the first iteration will be 4.5 and sin(4.52 ) > 0. No
matter which endpoint (left or right) becomes 4.5, the assumptions of the bisection method will be met from
here on. It will work as prescribed starting with the second iteration, and, therefore, will return a root.
Section 2.2
2c: (i) g does satisfy the hypotheses of the Mean Value Theorem on [0, 0.9]. The hypotheses of the Mean Value
Theorem require a function to be continuous on the closed interval [a, b] and have a derivative on the open
interval (a, b). In this question, a = 0 and b = 0.9. Since g is a polynomial, it is continuous over all real
numbers. Therefore, g is continuous over [0, 0.9] = [a, b]. Furthermore, g 0 is a polynomial and exists over all
real numbers, so g has a derivative on (0, 0.9) = (a, b). Remark: g actually satisfies the hypotheses of the
Mean Value Theorem on any closed interval, as do all polynomials.
(ii) We need to find c such that $g'(c) = \frac{g(b)-g(a)}{b-a}$. To begin, $g'(x) = 12x^3 - 6x^2 - 3$, g(0) = 2, and $g(0.9) = 3(.9)^4 - 2(.9)^3 - 3(.9) + 2 = -.1897$. So we need to solve $12c^3 - 6c^2 - 3 = \frac{-.1897-2}{.9-0}$ for c:
\begin{align*}
12c^3 - 6c^2 - 3 &= \frac{-2433}{1000}\\
12c^3 - 6c^2 - \frac{567}{1000} &= 0.
\end{align*}
We can not solve this equation using basic techniques of algebra since the cubic does not factor. However, we know the solution is between 0 and 0.9, so we can apply the bisection method to get an answer! Using Octave with a tolerance of $10^{-10}$, we get
ans = 0.622093084518565.
2f: g does not satisfy the hypotheses of the Mean Value Theorem on [20, 23]. The discontinuities of g are at ±1 due to the $(1-t^2)$ factor in the denominator and at odd multiples of $\frac{\pi}{2}$ due to the (tan t) factor in the numerator. The discontinuity at $\frac{13\pi}{2} \approx 20.42$ is in the interval [20, 23], so g is not continuous over the given interval.
3h: We are asked to find the fixed points of h. By definition, a fixed point of h satisfies the equation h(x) = x, so we are looking for all such values. $h(x) = x - 10 + 3^x + 25\cdot 3^{-x}$ so we need to solve $x - 10 + 3^x + 25\cdot 3^{-x} = x$:
\begin{align*}
x - 10 + 3^x + 25\cdot 3^{-x} &= x\\
-10 + 3^x + 25\cdot 3^{-x} &= 0\\
3^x - 10 + 25\cdot 3^{-x} &= 0\\
3^x\cdot 3^x - 10\cdot 3^x + 25\cdot 3^x\cdot 3^{-x} &= 0\\
\left(3^x\right)^2 - 10\cdot 3^x + 25 &= 0\\
\left(3^x - 5\right)^2 &= 0\\
3^x - 5 &= 0\\
3^x &= 5\\
\log_3 3^x &= \log_3 5\\
x &= \log_3 5.
\end{align*}
\begin{align*}
x^2 - e^{3x+4} &= 0\\
x^2 &= e^{3x+4}\\
\ln x^2 &= \ln\left(e^{3x+4}\right)\\
2\ln x &= 3x + 4\\
2\ln x - 4 &= 3x\\
\frac{2\ln x - 4}{3} &= x.
\end{align*}
This gives another candidate function, $f_3(x) = \frac{2\ln x - 4}{3}$.
Remark: There are always infinitely many ways to turn the equation g(x) = 0 into an equation of the form
f (x) = x. We can multiply both sides by any nonzero real number, c, and then add x to both sides.
This gives the infinitely many candidates fc (x) = x + cg(x).
Remark: See question 20 for another infinite set of candidates.
5b: We are asked to calculate the first 5 iterations of the fixed point iteration method applied to g(x) = 10 + x −
cosh(x) beginning with (initial value) x0 = −3. We have to apply g to x0 , then apply g to the result to get a
new result, then apply g to the new result to get a newer result, then apply g to the newer result to get yet
another result, and so on, until we have 5 results:
x0 = −3
x1 = g(x0 ) = 10 − 3 − cosh(−3) ≈ −3.067661995777765
x2 = g(x1 ) = 10 + x1 − cosh(x1 ) ≈ −3.836725126419593
x3 = g(x2 ) = 10 + x2 − cosh(x2 ) ≈ −17.03418648356706
x4 = g(x3 ) = 10 + x3 − cosh(x3 ) ≈ −12497508.54310043
x5 = g(x4 ) = 10 + x4 − cosh(x4 ) ≈ ’floating point overflow’
So the first 5 iterations are (approximately) −3.067, −3.836, −17.03, −1.249(10)7 , and a floating point error.
It does not look like fixed point iteration is converging on a fixed point. The numbers are getting larger in
magnitude with each iteration.
Remark: Calculators and computers using standard floating point arithmetic will not be able to calculate
cosh(−12497508.54310043) because it is too big! Thus the overflow. It does not mean it can not be
calculated. It’s just too large for a floating point calculator. Using a computer algebra system with
capability to handle such numbers, we find that
x5 ≈ −4.97(10)5427598 .
x5 has over 5 million digits to the left of the decimal point! Indeed, the magnitude of each iteration is
greater than the last.
6b: Using Octave with a properly programmed fixed point iteration function, we get the following:
fixedPointIteration(inline(’10+x-cosh(x)’),-3,1e-10,100)
ans = Method failed---maximum number of iterations reached
Remark: As we find out in question 5b, this iteration causes an overflow in just 5 iterations.
Remark: The line y = x is not set at a 45◦ angle because the aspect ratio of the graph is not 1 : 1. The
y-axis covers a length of 20, from −20 to 0 while the x-axis covers a length of only 3, from −5 to −2.
10: (a) To establish that f has a unique fixed point on [−4, −.9], we will show that f is continuous on [−4, −.9], that f([−4, −.9]) ⊆ [−4, −.9], and that |f′(x)| < 1 for all x ∈ (−4, −.9). Proposition 3 gives us the result.
(i) f is continuous on [−4, −.9] because its only discontinuity is at $x = -\frac{2}{3}$, where the denominator, 6x + 4, is zero, and $-\frac{2}{3} \approx -.6667$ is not in [−4, −.9].
(ii) We find the absolute extrema of f over [−4, −.9]. $f'(x) = \frac{18x^2+24x+6}{36x^2+48x+16} = \frac{3(x+1)(3x+1)}{2(3x+2)^2}$ has zeroes at x = −1 and $x = -\frac{1}{3}$ and is undefined at $x = -\frac{2}{3}$. The only relevant critical value is −1, so we check $f(-4) = -\frac{47}{20} = -2.35$, f(−1) = −1, and $f(-.9) = -\frac{143}{140} \approx -1.021$. Hence, f([−4, −.9]) ⊆ [−2.35, −1] ⊆ [−4, −0.9]. Remark: For many functions, we can be happy enough with visual evidence or at least use the graph to verify our conclusions. In this question, the graph of f for both x and y values from −4 to −.9 looks like
[Figure: the graph of f over [−4, −.9].]
(iii) We find the absolute extrema of f′ over [−4, −.9]. $f''(x) = \frac{3}{27x^3+54x^2+36x+8} = \frac{3}{(3x+2)^3}$ has no zeroes and is undefined only at $x = -\frac{2}{3}$. There are no relevant critical values, so we check $f'(-4) = \frac{99}{200} = 0.495$ and $f'(-.9) = -\frac{51}{98} \approx -.5204$. Hence, $-\frac{51}{98} \le f'(x) \le \frac{99}{200}$ for all x ∈ (−4, −.9), which means $|f'(x)| \le \frac{51}{98} < 1$ for all x ∈ (−4, −.9). Remark: As with check (ii), we can be happy enough with visual evidence or at least use the graph to verify our conclusions. In this question, the graph of f′ for x ∈ [−4, −.9] and y ∈ [−1, 1] looks like
[Figure: the graph of f′ over [−4, −.9], staying between −1 and 1.]
The graph of the function does not leave the view through the top (no values greater than 1) or the
bottom (no values less than −1), so |f 0 (x)| < 1 for all x ∈ (−4, −.9).
(b) Using the fixed point iteration method as described in the text with tolerance 10−2 and x0 = −4, we get
x6 = −1.00000176319, and we presume this is accurate to within 10−2 of the actual fixed point. Remark:
Since we don’t have a dependable way to calculate the error, it is possible that the final answer will not be
within tolerance of the actual root. In this case, though, the actual fixed point is −1, so we are well within
bounds.
12: First, $f(x) = \sqrt[3]{8-4x} = x \implies 8 - 4x = x^3 \implies x^3 + 4x - 8 = 0$, so any fixed point of f is a root of g. It remains to show that the fixed point iteration method will converge to a fixed point of f for any initial value $x_0 \in [1.2, 1.5]$. According to the Fixed Point Convergence Theorem, we need to establish that [1.2, 1.5] is a neighborhood of a fixed point in which the magnitude of the derivative is less than 1.
(i) To establish that there is a fixed point in [1.2, 1.5], note that f is continuous and that $f(1.2) - 1.2 = \sqrt[3]{\frac{16}{5}} - 1.2 \approx .27 > 0$ and $f(1.5) - 1.5 = \sqrt[3]{2} - 1.5 \approx -.24 < 0$. The Intermediate Value Theorem guarantees there will be a value c ∈ (1.2, 1.5) such that f(c) − c = 0, or f(c) = c.
(ii) We need to establish that the magnitude of the derivative of f is less than 1 for all x ∈ (1.2, 1.5). $f'(x) = -\frac{4}{3(8-4x)^{2/3}}$ and $f''(x) = -\frac{32}{9(8-4x)^{5/3}}$. Since f″(x) < 0 for all x ∈ (1.2, 1.5), we know f′ is decreasing over this interval. For this reason and the fact that f′(x) < 0 for all x ∈ (1.2, 1.5), we know |f′(x)| is bounded by $|f'(1.5)| = \left|-\frac{2\sqrt[3]{2}}{3}\right| \approx .84 < 1$.
Section 2.3
5: Because there is no particular pattern to the values n is to take, we will store the six values in an array. Then
we will loop over the array to get the values of f .
n=[0,1,2,4,6,10];
f=inline('(2^(2^x)-2)/(2^(2^x)+3)');
i=1;
while (i<7)
disp(f(n(i)));
i=i+1;
end%while
0
0.285714285714286
0.736842105263158
0.999923709546987
1
NaN
Remark: We can avoid the NaN, read "Not a Number", on the sixth value by rewriting the function as the algebraically equivalent f=inline('(1-2*2^-(2^x))/(1+3*2^-(2^x))');. With this one change to the above program, the following output is produced:
0
0.285714285714286
0.736842105263158
0.999923709546987
1
1
This works because 2^(2^10), which equals $2^{1024}$, produces an overflow while 2^-(2^10), which equals $2^{-1024}$, is vanishingly small (effectively 0). $2^{1024} \approx 1.8(10)^{308}$ is too big to be represented as a standard floating point value.
11:
(a) Proceeding according to proposition 5, we will need an initial error and a bound on the magnitude of the derivative of f.
(i) All we know about the initial value, $x_0$, and the fixed point, $\hat x$, is that they both lie in [−4, −.9], so the best we can do for an initial error is the width of the interval. Thus we take $|x_0 - \hat x| = 3.1$.
(ii) In 10 of section 2.2, we established the fact that $|f'(x)| \le \frac{51}{98} < 1$. Hence, we have $M = \frac{51}{98}$.
Therefore, we know $|x_k - \hat x| \le 3.1\left(\frac{51}{98}\right)^k$, and we need this quantity to be less than $10^{-11}$:
\begin{align*}
3.1\left(\frac{51}{98}\right)^k &< 10^{-11}\\
\left(\frac{51}{98}\right)^k &< \frac{1}{3.1(10)^{11}}\\
k\ln\frac{51}{98} &< \ln\frac{1}{3.1(10)^{11}}\\
k &> \frac{-\ln\left(3.1(10)^{11}\right)}{\ln\frac{51}{98}} \approx 40.51.
\end{align*}
Hence, 41 iterations will suffice for any initial value in [−4, −.9].
Remark: The inequality must switch from < to > in the last step because we are dividing by $\ln\frac{51}{98}$, which is negative.
(b) x0 = −4,
x1 = f (x0 ) = −2.35,
x2 = f (x1 ) ≈ −1.541336633663366,
x3 = f (x2 ) ≈ −1.167517670666227,
x4 = f (x3 ) ≈ −1.028014489100897,
x5 = f (x4 ) ≈ −1.001085950365354,
x6 = f (x5 ) ≈ −1.00000176318809, and
x7 = f (x6 ) ≈ −1.000000000004663.
It takes 7 iterations to come up with an estimate within 10−11 of the actual fixed point, −1.
(c) The theoretical bound is 41 while the actual number of iterations is 7. The bound is nearly six times the
actual! This is not a very tight bound.
Remark: The reason the bound is so loose is because the derivative at the fixed point is zero. The
estimate of proposition 5 does not account for this case where we know the convergence is quadratic
or better.
16a:
n pn an
0 0.5 0.2586844276
1 0.2004262431 0.2576132107
2 0.2727490651 0.2575358323
3 0.2536071566 0.2575306600
4 0.2585503763 0.2575303107
5 0.2572656363
6 0.2575989852
20: The tenth iteration of Steffensen’s method is 0.01462973293 while the eleventh is 0.009752946539, so it takes
but 11 iterations to reach a number below 0.01. This is an incredible acceleration of convergence—from 29, 992
iterations to 11.
Section 2.4
8: Newton's (fixed point iteration) method requires iteration of the function $f(x) = x - \frac{g(x)}{g'(x)}$, so we need to know g and g′ for the equation at hand:
− x31
x1
− x4
x1
18: Since Newton’s method is a fixed point iteration method, we may use the fixed point convergence theorem to
find such an interval. As indicated in exercise 26 on page 55, though, we are guaranteed convergence over any
neighborhood of the root where the iterated function f has a derivative with magnitude less than 1. To that
x4 +2x3 −x−3
end, f (x) = x − gg(x)
0 (x) = x − 4x3 +6x2 −1 . Hence,
seems to indicate that |f 0 (x)| < 1 for all x from just about 0.9 to ∞. This is an acceptable answer, but if we
would like to be more precise about the lower bound and prove our assertion, there is considerable work to
do. First, the roots of 4x3 + 6x2 − 1 are around −1.4, −0.5, and 0.4, so there are no asymptotes in the interval
under consideration. f 0 is continuous there. To locate the lower end of this interval, we solve the equation
f′(x) = −1:
\begin{align*}
\frac{(12x^2+12x)(x^4+2x^3-x-3)}{(4x^3+6x^2-1)^2} &= -1\\
(12x^2+12x)(x^4+2x^3-x-3) &= -(4x^3+6x^2-1)^2\\
12x^6+36x^5+24x^4-12x^3-48x^2-36x &= -16x^6-48x^5-36x^4+8x^3+12x^2-1\\
28x^6+84x^5+60x^4-20x^3-60x^2-36x+1 &= 0.
\end{align*}
The real solutions of this equation are, in decreasing order, approximately 0.871748, 0.026590, −1.026590,
and −1.871748. A graph of 28x6 + 84x5 + 60x4 − 20x3 − 60x2 − 36x + 1 will point you in the right direction,
and Newton’s method can be used to find these roots. The one we seek is 0.871748. This value marks the
lower end of the desired interval. To verify that the interval is unbounded above, we solve f′(x) = 1:
\begin{align*}
\frac{(12x^2+12x)(x^4+2x^3-x-3)}{(4x^3+6x^2-1)^2} &= 1\\
(12x^2+12x)(x^4+2x^3-x-3) &= (4x^3+6x^2-1)^2\\
12x^6+36x^5+24x^4-12x^3-48x^2-36x &= 16x^6+48x^5+36x^4-8x^3-12x^2+1\\
0 &= 4x^6+12x^5+12x^4+4x^3+36x^2+36x+1.
\end{align*}
The real solutions of this equation are, in decreasing order, approximately −0.028593 and −0.971407. Again,
a graph will point you in the right direction, and Newton’s method can be used to find these roots. There
are no solutions of f 0 (x) = ±1 greater than the root 1.097740792. We conclude that |f 0 (x)| < 1 for all
x ∈ (0.87175, ∞), so Newton’s method will converge to x̂ ≈ 1.097740792 for any initial value in (0.87175, ∞).
Finally, by looking at the graph of f (x),
we see that the interval from the asymptote around 0.4 to the root maps into the interval from the root to
infinity. Therefore, Newton’s method converges to 1.097740792 for all initial values between the asymptote
near 0.4 to 0.87175 as well. Finally, we use Newton’s method to get a more accurate value for the asymptote
near 0.4. It turns out to be 0.366025403784439, so we conclude that Newton’s method will converge to the
root x̂ ≈ 1.097740792 for any initial value in (0.36602540378444, ∞).
Remark: Depending on how rigorously you want your answer shown, you may start with the graph of f as
above, approximate the asymptote near 0.4, and proceed straight to the final answer. This conclusion
can be justified (graphically) by assuming that the graph of f is more or less linear to the right of the
part shown and imagining the web diagram for any value in this interval. To make this argument slightly
more rigorous, note that f has a slant asymptote, y = 34 x, as x approaches ∞, so the assumption that
the graph of f is more or less a straight line to the right of the part shown is valid.
21:
26: The sum of two numbers, call them x and y, is 20, so x + y = 20. If each number is added to its square root, the product of the two sums is 172.2, so $(x+\sqrt x)(y+\sqrt y) = 172.2$. Hence, we need to solve the system
\begin{align*}
x + y &= 20\\
(x+\sqrt x)(y+\sqrt y) &= 172.2
\end{align*}
of two equations with two unknowns. Since this system is not linear, our best hope is to use substitution. The first equation gives us y = 20 − x. Substituting this value of y in the second equation gives us
\[ (x+\sqrt x)\left(20 - x + \sqrt{20-x}\right) = 172.2 \]
or $(x+\sqrt x)\left(20-x+\sqrt{20-x}\right) - 172.2 = 0$. It is a solution of this last equation we seek. Without having
any idea what the roots might be besides the reasonable assumption that they are between 0 and 20, it is
not clear what initial values to use. With a few different √
attempts, you √are likely to find some that work.
For example, applying the secant method to g(x) = (x + x)(20 − x + 20 − x) − 172.2 with x0 = 9 and
x1 = 10 gives 9.149620618, which is accurate to all digits shown, in just 9 iterations. The other number is
20 − 9.149620618 = 10.850379382. We can verify this is a solution by calculating
\[ \left(9.149620618 + \sqrt{9.149620618}\right)\left(10.850379382 + \sqrt{10.850379382}\right) \approx 172.2. \]
27: Newton's method will fail to find a root of g on the second iteration if $g'(x_1) = 0$. For example, let $g(x) = x^3 - 3x + 3$. Then $g'(x) = 3x^2 - 3$ has zeroes when x = ±1. So we need a value $x_0$ such that $x_1 = 1$ or $x_1 = -1$. We need to find any solution of $x_1 = x_0 - \frac{g(x_0)}{g'(x_0)} = x - \frac{x^3-3x+3}{3x^2-3} = \pm 1$. One such solution follows.
\begin{align*}
x - \frac{x^3-3x+3}{3x^2-3} &= 1\\
\frac{2x^3-3}{3x^2-3} &= 1\\
2x^3 - 3 &= 3x^2 - 3\\
2x^3 - 3x^2 &= 0\\
x^2(2x-3) &= 0,
\end{align*}
so either of the initial values $x_0 = 0$ or $x_0 = \frac{3}{2}$ will produce the desired result.
Remark: The equation $x - \frac{x^3-3x+3}{3x^2-3} = -1$ has only one real solution, but it is irrational. It is, accurate to 20 significant digits, 1.0786168885087585968. Setting x0 = 1.078616888508759 as in the following Octave code does not fail, however! There is enough round-off error that x1 is not exactly −1 and g′(x1) is not exactly zero, so the method proceeds to find the result. It takes 99 iterations to settle in on the solution, but it gets there. x1 displays as -0.999999999999999 and x2 displays as 7.50599937895082e+14.
format(’long’)
f=inline(’x^3-3*x+3’)
fp=inline(’3*x^2-3’)
x0=1.0786168885087585968
c=1;
for i=1:120
x=x0-f(x0)/fp(x0)
if (abs(x-x0)<1e-15)
c
return
end%if
x0=x;
c=c+1;
end%for
Section 2.5
1: Before trying to match any functions with their diagrams, we take stock of the functions available. f and h
are polynomials of degree 5 and, therefore, have at most 5 distinct roots. l is the product of the natural
logarithm with a third degree polynomial. The polynomial has three roots and the logarithm has one distinct
from those of the polynomial, so l has four roots. Now looking at the diagrams, we can match two functions
with their diagrams. Diagram (d) has patches of nine different colors, indicating nine roots within the area
shown. Since functions f , h, and l have fewer than 9 roots, function g must match with diagram (d). Along
the same lines, diagrams (a) and (b) both show 5 roots, so l can not match either of those. l has only four
roots. By process of elimination, function l matches with diagram (c). That leaves (a) and (b) to match with
f and h. Both diagrams show 5 roots, but there is a fundamental difference between the two. The real axis
passes horizontally through the center of each diagram. Diagram (a) has one patch covering the entire real
axis, indicating only one real root while diagram (b) has three patches covering the real axis, indicating three
real roots. The graph of f clearly shows that f has three roots, so f matches with (b) and h matches with (a). To recap,
f ↔ (b)
g ↔ (d)
h ↔ (a)
l ↔ (c).
3c: For each root r, the polynomial must have a factor of (x − r) and no other factors. This polynomial must have
factors of (x − (−4)), (x − (−1)), (x − 2), (x − 2i), and (x − (−2i)), making p(x) = (x + 4)(x + 1)(x − 2)(x − 2i)(x + 2i)
one solution.
Remark: q(x) = a(x + 4)(x + 1)(x − 2)(x − 2i)(x + 2i) where a is any nonzero complex number is another
solution.
Remark: Though it is not necessary to multiply the factors, p(x) = x⁵ + 3x⁴ − 2x³ + 4x² − 24x − 32.
3d: For each root r, the polynomial must have a factor of (x − r) and no other factors. This polynomial must have
factors of (x − (−4)), (x − (−1)), (x − 2), and (x − 2i), making p(x) = (x + 4)(x + 1)(x − 2)(x − 2i) one solution.
Remark: q(x) = a(x + 4)(x + 1)(x − 2)(x − 2i) where a is any nonzero complex number is another solution.
Remark: Though it is not necessary to multiply the factors, p(x) = x⁴ + (3 − 2i)x³ − (6 + 6i)x² − (8 − 12i)x + 16i.
Notice that not all the coefficients are real numbers. This is consistent with the conjugate roots theorem
stating that if a polynomial with real coefficients has complex roots, they must come in conjugate pairs.
7: f is periodic and has infinitely many roots regularly spread across the real axis. The only diagram showing roots
of this nature is (a) so f matches with (a). g and f differ only by a small amount for large real values so we
should expect to see infinitely many more or less regularly spaced roots on the positive real axis. The only
diagram with roots of this nature is (d) so g matches with (d). l is a fifth degree polynomial so has at most
5 roots. Diagram (b) shows 8 colors so 8 roots. Therefore, h matches with (b) and l matches with (c). To
recap,
f ↔ (a)
g ↔ (d)
h ↔ (b)
l ↔ (c).
Section 2.6
6a: g(2) = 38 and g′(2) = 71:

    2 |  3    12    −13    −8
      |        6     36    46
      |  3    18     23    38
    2 |        6     48
      |  3    24     71
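The same two rows of synthetic division can be carried out in Octave. The sketch below assumes the coefficients of g are stored from the leading term down; v accumulates g(2) and d accumulates g′(2).
% Sketch: Horner's scheme for the value and derivative of g at x = 2
c = [3 12 -13 -8];    % g(x) = 3x^3 + 12x^2 - 13x - 8
x = 2;
v = c(1); d = 0;
for k = 2:length(c)
  d = d*x + v;        % second synthetic division row (derivative)
  v = v*x + c(k);     % first synthetic division row (value)
end%for
v, d                  % should display 38 and 71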
8a: From 6a, g(2) = 38 and g′(2) = 71, so x1 = 2 − 38/71 = 104/71. g(104/71) = 2911104/357911 and g′(104/71) = 209027/5041:

    104/71 |  3    12          −13             −8
           |       312/71      121056/5041     5774392/357911
           |  3    1164/71     55523/5041      2911104/357911
    104/71 |       312/71      153504/5041
           |  3    1476/71     209027/5041
    −12 |  1     9    −32    48
        |       −12    36   −48
        |  1    −3      4     0

the deflated polynomial is x² − 3x + 4, which is quadratic. The quadratic formula gives the remaining roots,
(3 ± √(9 − 4(4)))/2 = (3 + i√7)/2 and (3 − i√7)/2. To recap, the four roots are 3, −12, (3 + i√7)/2, and (3 − i√7)/2.
15a: format(’long’); c=[-40,16,-12,-2,1]; newtonhorner(c,1,1e-5,100) returns
ans = -3.54823289798023
c =
so the deflated polynomial is approximately x3 − 5.5482x2 + 7.6864x − 11.2732 and the coefficients of this poly-
nomial are now contained in array c. newtonhorner(c,-3.5,1e-5,100) returns ans = 4.38111344099655
so 4.38111344099655 is another root. c=deflate(c,ans) returns
c =
so the deflated polynomial is approximately x2 − 1.1671x + 2.5731 and the coefficients of this polynomial are
now contained in array c. Since we have deflated the polynomial to a quadratic, we find the last two roots
using the quadratic formula. [s,t]=quadraticRoots(c(3),c(2),c(1)) returns
s = 0.583559728491838 + 1.494188006012761i
t = 0.583559728491838 - 1.494188006012761i.
returns
ans = -3.54823289797970.
c=[-40,16,-12,-2,1]; newtonhorner(c,4.38111344099655,1e-5,100)
returns
ans = 4.38111344099594.
c=[-40,16,-12,-2,1]; newtonhorner(c,0.583559728491838+1.494188006012761i,1e-5,100)
returns
Section 3.2
3c: We begin by constructing three polynomials—the first with roots at the second two data points and a value
of 1 at the first, the second polynomial with roots at the first and third data points and a value of 1 at the
second, the third polynomial with roots at the first and second data points and a value of 1 at the third.
Those polynomials are
l1(x) = (x − 20)(x − 1019) / [(−10 − 20)(−10 − 1019)]
l2(x) = (x + 10)(x − 1019) / [(20 + 10)(20 − 1019)]
l3(x) = (x + 10)(x − 20) / [(1019 + 10)(1019 − 20)].
We then multiply li by yi and sum the products:
P2(x) = 10 · (x − 20)(x − 1019) / [(−10 − 20)(−10 − 1019)] + 58 · (x + 10)(x − 1019) / [(20 + 10)(20 − 1019)]
        − 32 · (x + 10)(x − 20) / [(1019 + 10)(1019 − 20)].
4c: Estimating (or approximating) the value of a function f using an interpolating polynomial means to evaluate
the polynomial there instead.
5c: Neville’s method is best executed on a computer or in a tabular format. f (1.3) ≈ P0,2 (1.3). The tabular format
is shown here:
7: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1
polynomial, we will use the data at 2 and 3.5 since these are the two abscissas nearest 3. For the degree at
most 2 polynomial, we will use the data at 2, 3.5, and 4 since these are the three abscissas nearest 3. For the
degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s method
shines! The first estimate uses the first two data points. The second estimate uses these same two plus a
third. The last estimate uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data in the table in the order in which we would
like to use them, we get
P0,1 gives the at most degree 1 estimate. P0,2 gives the at most degree 2 estimate, and P0,3 gives the at most
degree 3 estimate.
(a) P0,1(3) = [(3 − 3.5)(.8) − (3 − 2)(.7)]/(2 − 3.5) ≈ .7333.
(b) P1,1(3) = [(3 − 4)(.7) − (3 − 3.5)(.75)]/(3.5 − 4) = .65; P0,2(3) = [(3 − 4)(.7333) − (3 − 2)(.65)]/(2 − 4) ≈ .6917.
(c) P2,1(3) = [(3 − 5)(.75) − (3 − 4)(.5)]/(4 − 5) = 1; P1,2(3) = [(3 − 5)(.65) − (3 − 3.5)(1)]/(3.5 − 5) ≈ .5333;
    P0,3(3) = [(3 − 5)(.6917) − (3 − 2)(.5333)]/(2 − 5) ≈ .6389.
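The table can also be checked with the text's nevilles function. The data values below are read off the hand computation above (f(2) = .8, f(3.5) = .7, f(4) = .75, f(5) = .5), so treat the call as a sketch.
>> nevilles(3,[2,3.5,4,5],[.8,.7,.75,.5])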
8b: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1
polynomial, we will use the data at .1 and .2 since these are the two abscissas nearest .18. For the degree at
most 2 polynomial, we will use the data at .1, .2, and .3 since these are the three abscissas nearest .18. For
the degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s method
shines! The first estimate uses the first two data points. The second estimate uses these same two plus a
third. The last estimate uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data listed in the Octave function in the order in
which we would like to use them, we get
>> nevilles(.18,[.1,.2,.3,.4],[-.29004986,-.56079734,-.81401972,-1.0526302])
ans =
For the interpolating polynomial of degree at most one, f (.18) ≈ P0,1 (.18) = −.506647844. For the interpo-
lating polynomial of degree at most two, f (.18) ≈ P0,2 (.18) = −.508049852. For the interpolating polynomial
of degree at most three, f (.18) ≈ P0,3 (.18) = −.5081430744.
8c: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1
polynomial, we will use the data at 2 and 2.5 since these are the two abscissas nearest 2.26. For the degree
at most 2 polynomial, we will use the data at 2, 2.5, and 1.5 since these are the three abscissas nearest 2.26.
For the degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s
method shines! The first estimate uses the last two data points. The second estimate uses these same two
plus a third. The final uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data listed in the Octave function in the order in
which we would like to use them, we get
>> nevilles(2.26,[2,2.5,1.5,1],[-1.329,1.776,-2.569,1.654])
ans =
For the interpolating polynomial of degree at most one, f (2.26) ≈ P0,1 (2.26) = −.28560. For the interpolating
polynomial of degree at most two, f (2.26) ≈ P0,2 (2.26) = .05285. For the interpolating polynomial of degree
at most three, f (2.26) ≈ P0,3 (2.26) = .28036.
9a: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is likely
to be the best approximation possible with the given data. It does not always work this way (perhaps it would
make a good exercise to find an example where using the data nearest the point of estimation does not give
the best estimate) but we have the best chance of good results this way. For the degree at most 1 polynomial,
we will use the data at 1.25 and 1.6 since these are the two abscissas nearest 1.4. For the degree at most
2 polynomial, we have no choice but to use all of the data. We can use Neville’s method or the Langrange
form in this case. Neither method provides obvious advantage over the other. To begin, f (1) = sin π = 0;
f (1.25) = sin 1.25π ≈ −.70711; f (1.6) = sin(1.6π) ≈ −.95106.
Lagrange form: (degree at most 1) L1(x) = [(x − 1.6)/(1.25 − 1.6)](−.70711) + [(x − 1.25)/(1.6 − 1.25)](−.95106) so f(1.4) ≈ L1(1.4) =
[(1.4 − 1.6)/(1.25 − 1.6)](−.70711) + [(1.4 − 1.25)/(1.6 − 1.25)](−.95106) = −.81166.
(degree at most 2) L2(x) = [(x − 1.25)(x − 1.6)/((1 − 1.25)(1 − 1.6))](0) + [(x − 1)(x − 1.6)/((1.25 − 1)(1.25 − 1.6))](−.70711)
+ [(x − 1)(x − 1.25)/((1.6 − 1)(1.6 − 1.25))](−.95106) so
f(1.4) ≈ L2(1.4) = [(1.4 − 1)(1.4 − 1.6)/((1.25 − 1)(1.25 − 1.6))](−.70711) + [(1.4 − 1)(1.4 − 1.25)/((1.6 − 1)(1.6 − 1.25))](−.95106) = −.918232.
Neville’s Method: We use the same table for both the degree at most 1 and degree at most 2 polynomials:
P0,1(x) = [(x − 1.6)(−.70711) − (x − 1.25)(−.95106)]/(1.25 − 1.6) = .16414 − .697x
P1,1(x) = [(x − 1)(−.95106)]/(1.6 − 1) = 1.5851 − 1.5851x
P0,2(x) = [(x − 1)P0,1(x) − (x − 1.25)P1,1(x)]/(1.25 − 1) = 3.5524x² − 10.82134x + 7.26894
(degree at most 1) P0,1(1.4) = .16414 − .697(1.4) = −.8166
(degree at most 2) P0,2(1.4) = 3.5524(1.4)² − 10.82134(1.4) + 7.26894 = −.918232
10a: (degree at most 1) f(1.4) − P1(1.4) = [f″(ξ1.4)/2!](1.4 − 1.25)(1.4 − 1.6) so our bound is
|f(1.4) − P1(1.4)| ≤ [|(1.4 − 1.25)(1.4 − 1.6)|/2] max_{ξ∈[1.25,1.6]} |f″(ξ)| = .015π²|sin(1.5π)| < .149.
The actual absolute error is |f(1.4) − P1(1.4)| = |sin(1.4π) + .8166| ≈ .134, which is rather near the bound.
(degree at most 2) f(1.4) − P2(1.4) = [f‴(ξ1.4)/3!](1.4 − 1.25)(1.4 − 1.6)(1.4 − 1) so our bound is
|f(1.4) − P2(1.4)| ≤ [|(1.4 − 1.25)(1.4 − 1.6)(1.4 − 1)|/6] max_{ξ∈[1,1.6]} |f‴(ξ)| = .002π³ < .0620.
The actual absolute error is |f(1.4) − P2(1.4)| = |sin(1.4π) + .918232| ≈ .0328, which is of the same order of
magnitude as the bound.
Section 3.3
4: The Newton form of an interpolating polynomial follows from a table of divided differences. Recursion 3.3.3 is
used to compute the entries in the table, as in Table 3.3. Answers will depend on the order in which the data
are listed in the table and on how the data are read from the table. Placing the data in the table in the order
given in the question, we have:
Reading the coefficients across the first row, we use f0,0 , f0,1 , f0,2 , and f0,3 . This is a valid sequence to read
from the table since each coefficient depends on the same data as the previous plus one point. f0,0 depends
on x0 ; f0,1 depends on x0 and x1 ; f0,2 depends on x0 , x1 , and x2 ; and f0,3 depends on x0 , x1 , x2 , and x3 .
Therefore, one answer is
P0,3(x) = 2 + 0(x − 1) − 1(x − 1)(x − 2) + (2/3)(x − 1)(x − 2)(x − 3)
        = 2 − (x − 1)(x − 2) + (2/3)(x − 1)(x − 2)(x − 3).
The sequence of coefficients f1,0 , f2,1 , f1,2 , f0,3 is not a valid sequence to choose. f1,0 depends on x1 but
f2,1 depends on x2 and x3 , two completely different data values from the first. With some study, you might
be able to draw the conclusion, and maybe even prove, that any sequence of coefficients starting in the first
column and progressing to the right one column at a time and either jumping up one row or remaining in
the same row with each change of column forms a valid sequence. For example, we can use coefficients f2,0 ,
f1,1 , f1,2 , f0,3 because f2,0 depends on x2 ; f1,1 depends on x2 and x1 ; f1,2 depends on x2 , x1 , and x3 ; and
f0,3 depends on x2 , x1 , x3 , and x0 . And the order in which new dependencies are encountered matters. The
(x − xi ) monomials must appear in the same order. Therefore, another answer is
P0,3(x) = 0 − 2(x − 3) + 1(x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 4)
        = −2(x − 3) + (x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 4).
Other possible answers garnered from this same divided difference table are
P0,3(x) = (x − 4)(x − 3) + (2/3)(x − 4)(x − 3)(x − 2)
P0,3(x) = −2(x − 3) − (x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 1).
With some algebra and a bit of patience, each of the four forms above can be reduced to
P0,3(x) = (2/3)x³ − 5x² + (31/3)x − 4.
6: Recursion 3.3.3 is used to compute the entries in the table, as in Table 3.3. Answers will depend on the order in
which the data are listed in the table and on how the data are read from the table. Placing the data in the
table in the order given in the question, we have
Reading the coefficients across the first row, we use f0,0 , f0,1 , and f0,2 . This is a valid sequence to read from
the table since each coefficient depends on the same data as the previous, plus one point. f0,0 depends on x0 ;
f0,1 depends on x0 and x1 ; and f0,2 depends on x0 , x1 , and x2 . Therefore, one answer is
The sequence of coefficients f0,0 , f1,1 , f1,2 is not a valid sequence to choose. f0,0 depends on x0 but f1,1
depends on x1 and x2 , two completely different data values from the first. Not to mention f1,2 , which is
not even part of the table. With some study, you might be able to draw the conclusion, and maybe even
prove, that any sequence of coefficients starting in the first column and progressing to the right one column
at a time and either jumping up one row or remaining in the same row with each change of column forms a
valid sequence. For example, we can use coefficients f1,0 , f1,1 , f0,2 because f1,0 depends on x1 ; f1,1 depends
on x1 and x2 ; and f0,2 depends on x1 , x2 , and x0 . And the order in which new dependencies are encountered
matters. The (x − xi ) monomials must appear in the same order. Therefore, another answer is
The other two possible answers garnered from this same divided difference table are
With some algebra and a bit of patience, each of the four forms above can be reduced to
10: Answers will depend on the order in which the data are listed in the Octave call and on how the data are read
from the table. Placing the data in the Octave command in the same order they are listed in the question,
your Octave code should produce something like
dividedDiffs([0,.1,.3,.6,1],[-6,-5.89483,-5.65014,-5.17788,-4.28172])
ans =
One possibility for the interpolating polynomial of degree (at most) four is
P0,4 (x) = −6 + 1.05170x + .5725x(x − .1) + .215x(x − .1)(x − .3)
+.06302x(x − .1)(x − .3)(x − .6).
See discussion of question 4 above for other possibilities. Adding the point (1.1, −3.9958) to the table, we get
(accurate to 5 decimal places)
f5,0 = −3.9958
f4,1 = (−4.28172 + 3.9958)/(1 − 1.1) = 2.8592
f3,2 = (2.2404 − 2.8592)/(.6 − 1.1) = 1.2376
f2,3 = (.95171 − 1.2376)/(.3 − 1.1) = .35736
f1,4 = (.27802 − .35736)/(.1 − 1.1) = .07934
f0,5 = (.06302 − .07934)/(0 − 1.1) = .01484.
Now we can add one more term to P0,4 to get (one possible representation of) P0,5 :
P0,5 (x) = −6 + 1.05170x + .5725x(x − .1) + .215x(x − .1)(x − .3)
+.06302x(x − .1)(x − .3)(x − .6) + .01484x(x − .1)(x − .3)(x − .6)(x − 1).
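One way to check the added term is to call the text's dividedDiffs function with all six data points at once; the last column of the result should reproduce f5,0 through f0,5 above.
dividedDiffs([0,.1,.3,.6,1,1.1],[-6,-5.89483,-5.65014,-5.17788,-4.28172,-3.9958])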
12: Since Nn, Ln, P0,n, and Pn are all the same polynomial except possibly the form in which they are written,
the error term for a Newton polynomial is the same as that for a Lagrange polynomial:
f(x) − Pn(x) = [f⁽ⁿ⁺¹⁾(ξx)/(n + 1)!] (x − x0)(x − x1)···(x − xn).
In this particular case, we have
f(2) − P2(2) = [f‴(ξ2)/3!](2 − 1)(2 − 2.2)(2 − 3) = (1/30) f‴(ξ2).
Since all derivatives are bounded between −2 and 1 over the interval [1, 3], |f‴(ξ2)| ≤ 2 and, therefore, the
error has bound
|f(2) − P2(2)| ≤ 2/30 = 1/15 ≈ .067.
17: Since 0.75 is one of the nodes (it is x3 ), N3 and f agree there. That is what it means for N3 to interpolate the
data at x0 , x1 , x2 , x3 . Hence,
f(.75) = N3(.75) = 1 + 4(.75) + 4(.75)(.75 − .25) + (16/3)(.75)(.75 − .25)(.75 − .5) = 6.
18: f is periodic and has infinitely many roots regularly spread across the real axis. The only diagram showing
roots of this nature is (d) so f matches with (d). g and f differ only by a small amount for large real values
so we should expect to see infinitely many more or less regularly spaced roots on the positive real axis. The
only diagram with roots of this nature is (a) so g matches with (a). l is a fifth degree polynomial so has at
most 5 roots. Diagram (b) shows 8 colors so 8 roots. Therefore, h matches with (b) and l matches with (c).
To recap,
f ↔ (d)
g ↔ (a)
h ↔ (b)
l ↔ (c).
Section 4.1
1: (a) L1(x) = [(x − x1)/(x0 − x1)] f(x0) + [(x − x0)/(x1 − x0)] f(x1)
(b) L1′(x) = f(x0)/(x0 − x1) + f(x1)/(x1 − x0) = [f(x1) − f(x0)]/(x1 − x0)
(c) L1′(x0 + h/2) = [f(x0 + h) − f(x0)]/(x0 + h − x0) = [f(x0 + h) − f(x0)]/h, so
f′(x0 + h/2) ≈ [f(x0 + h) − f(x0)]/h.
2 h
4: (a) The Newton form of an interpolating polynomial derives from a table of divided differences whether it is a
single value or a formula for a general case. The divided differences table for this case is

    x0         f(x0)        [f(x0 + h) − f(x0)]/h         [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)
    x0 + h     f(x0 + h)    [f(x0 + 2h) − f(x0 + h)]/h
    x0 + 2h    f(x0 + 2h)

since
f0,1 = [f(x0 + h) − f(x0)]/[(x0 + h) − x0] = [f(x0 + h) − f(x0)]/h
f1,1 = [f(x0 + 2h) − f(x0 + h)]/[(x0 + 2h) − (x0 + h)] = [f(x0 + 2h) − f(x0 + h)]/h
f0,2 = (f1,1 − f0,1)/[(x0 + 2h) − x0] = {[f(x0 + 2h) − f(x0 + h)]/h − [f(x0 + h) − f(x0)]/h}/(2h)
     = [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²).

(b) dx/dθ = h and (d/dθ)N2(x(θ)) = (d/dx)N2(x) · (dx/dθ), so (d/dx)N2(x) = [(d/dθ)N2(x(θ))] ÷ (dx/dθ) = [(d/dθ)N2(x(θ))]/h. Similarly, we
get (d²/dx²)N2(x) = [(d²/dθ²)N2(x(θ))]/h²:

(d/dx)N2(x) = ( [f(x0 + h) − f(x0)] + {[f(x0 + 2h) − 2f(x0 + h) + f(x0)]/2}(2θ − 1) ) / h
(d²/dx²)N2(x) = ( {[f(x0 + 2h) − 2f(x0 + h) + f(x0)]/2}(2) ) / (h·h)
              = [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/h².
6c: To use this formula, we need x0 − h = 10 and x0 + 6h = 17, a system of two equations with two unknowns
whose solution is x0 = 11 and h = 1. Plugging these values into formula 4.1.6:
∫_{10}^{17} 1/(x − 5) dx ≈ (1/8640)[5257f(17) − 5880f(16) + 59829f(15) − 81536f(14) + 102459f(13) − 50568f(12) + 30919f(11)]
= (1/8640)[5257 · (1/12) − 5880 · (1/11) + 59829 · (1/10) − 81536 · (1/9) + 102459 · (1/8) − 50568 · (1/7) + 30919 · (1/6)]
≈ 0.8753962951271979.
7c: (i) ∫_{10}^{17} 1/(x − 5) dx = ln|x − 5| evaluated from 10 to 17 = ln(12) − ln(5) = ln(12/5) ≈ 0.8754687373539001. (ii) The absolute error is the
absolute value of the difference between the approximation and the exact value: |ln(12/5) − 0.8753962951271979| ≈
7.24(10)⁻⁵.
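A quick Octave check of the arithmetic in 6c, using an anonymous function for the integrand; this is only a sketch of the hand computation above.
f = @(x) 1./(x - 5);
(1/8640)*(5257*f(17) - 5880*f(16) + 59829*f(15) - 81536*f(14) ...
          + 102459*f(13) - 50568*f(12) + 30919*f(11))
% should display approximately 0.875396295127...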
11d: To approximate some quantity in regard to a non-polynomial function, we simply evaluate the corresponding
quantity for the interpolating polynomial. That means in this case, f′(2) ≈ p′(2). But p′(x) = 12x³ − 4x + 1
so f′(2) ≈ 12 · 2³ − 4 · 2 + 1 = 89.
13d: To use this formula, we need only to substitute proper values for θ and the θi . θ must be 0 since the point of
evaluation is at x0 (which equals x0 + 0h). It does not matter which stencil point gives which θi , but the θi
come from the fact that the nodes are x0 − h, x0 + 2h, and x0 + 3h. That gives us −1, 2, and 3 for the θi .
Setting θ0 = −1, θ1 = 2, and θ2 = 3:
15c: The integral over this stencil is from x0 to x0 + 2h so θ0 = 0 and θ1 = 2. The nodes are x0 + (1/3)h and x0 + (4/3)h
so θ2 and θ3 are 1/3 and 4/3. It does not matter which is which. Setting θ2 = 1/3 and θ3 = 4/3, the formula from
question 14c becomes
−(h/2) · (2 − 0)/(4/3 − 1/3) · [(2 · (1/3) − 2 − 0)f(x0 + (4/3)h) − (2 · (4/3) − 2 − 0)f(x0 + (1/3)h)],
which simplifies to
∫_{x0}^{x0+2h} f(x)dx ≈ (2h/3)[f(x0 + (1/3)h) + 2f(x0 + (4/3)h)].
Section 4.2
1d: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The
stencil of this question has 2 nodes, x0 and x0 + h, and point of evaluation x0 + (3/4)h, so in system 4.2.2 we
have n = 1, θ0 = 0 and θ1 = 1, and θ = 3/4. Because we are deriving a first derivative formula, we also have
k = 1. Therefore, the system we need to solve is
p0′(x0 + (3/4)h) = a0 p0(x0) + a1 p0(x0 + h)
p1′(x0 + (3/4)h) = a0 p1(x0) + a1 p1(x0 + h).
Now, p0(x) = 1 so p0′(x0 + (3/4)h) = 0; and p1(x) = x − x0 so p1′(x0 + (3/4)h) = 1. Substituting this information
into the system,
0 = a0 + a1
1 = a1 h.
From the second equation, a1 = 1/h. Substituting into the first equation, 0 = a0 + 1/h so a0 = −1/h. Our
approximation, formula 4.2.1, becomes
f′(x0 + (3/4)h) ≈ −(1/h)f(x0) + (1/h)f(x0 + h) = [f(x0 + h) − f(x0)]/h.
That formula should look familiar!
1j: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The
stencil of this question has 4 nodes, x0, x0 + h, x0 + (3/2)h, and x0 + 2h with point of evaluation x0 + (1/2)h, so
in system 4.2.2 we have n = 3, θ0 = 0, θ1 = 1, θ2 = 3/2, θ3 = 2, and θ = 1/2. Because we are deriving a first
derivative formula, we also have k = 1. Therefore, the system we need to solve is
p0′(x0 + (1/2)h) = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + (3/2)h) + a3 p0(x0 + 2h)
p1′(x0 + (1/2)h) = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + (3/2)h) + a3 p1(x0 + 2h)
p2′(x0 + (1/2)h) = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + (3/2)h) + a3 p2(x0 + 2h)
p3′(x0 + (1/2)h) = a0 p3(x0) + a1 p3(x0 + h) + a2 p3(x0 + (3/2)h) + a3 p3(x0 + 2h).
Now, p0(x) = 1 so p0′(x0 + (1/2)h) = 0; p1(x) = x − x0 so p1′(x0 + (1/2)h) = 1; p2(x) = (x − x0)² so p2′(x0 + (1/2)h) = h;
and p3(x) = (x − x0)³ so p3′(x0 + (1/2)h) = (3/4)h². Substituting this information into the system,
0 = a0 + a1 + a2 + a3
1 = a1 h + a2 · (3/2)h + a3 · 2h
h = a1 h² + a2 · (9/4)h² + a3 · 4h²
(3/4)h² = a1 h³ + a2 · (27/8)h³ + a3 · 8h³.
The first equation is the only one in which a0 appears so we concentrate on solving the last three equations,
which simplify to:
2/h = 2a1 + 3a2 + 4a3
4/h = 4a1 + 9a2 + 16a3
6/h = 8a1 + 27a2 + 64a3.
From the first equation, 2a1 = 2/h − 3a2 − 4a3 so 4a1 = 4/h − 6a2 − 8a3 and 8a1 = 8/h − 12a2 − 16a3. Substituting
into the second and third equations, respectively,
4/h = 4/h − 6a2 − 8a3 + 9a2 + 16a3
6/h = 8/h − 12a2 − 16a3 + 27a2 + 64a3
which simplifies to
0 = 3a2 + 8a3
−2/h = 15a2 + 48a3.
From the first equation, a3 = −(3/8)a2. Substituting into the last equation, −2/h = 15a2 + 48(−(3/8)a2), which
simplifies to −2/h = −3a2 so
a2 = 2/(3h).
Back-substituting, a3 = −(3/8)a2 = −(3/8)(2/(3h)) so
a3 = −1/(4h).
Continuing the back-substitution, 2a1 = 2/h − 3a2 − 4a3 = 2/h − 3(2/(3h)) − 4(−1/(4h)), which simplifies to 2a1 = 1/h so
a1 = 1/(2h).
Finally, a0 = −a1 − a2 − a3 = −1/(2h) − 2/(3h) + 1/(4h) so
a0 = −11/(12h).
Our approximation, formula 4.2.1, thus becomes
f′(x0 + (1/2)h) ≈ −(11/(12h))f(x0) + (1/(2h))f(x0 + h) + (2/(3h))f(x0 + (3/2)h) − (1/(4h))f(x0 + 2h)
= [−11f(x0) + 6f(x0 + h) + 8f(x0 + (3/2)h) − 3f(x0 + 2h)]/(12h).
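As a sanity check, the coefficients can also be found by solving system 4.2.2 numerically. The sketch below assumes x0 = 0 and an arbitrary h = 0.1; scaling the solution by 12h should return the integers −11, 6, 8, −3 found above.
% Sketch: numerical solution of system 4.2.2 for the stencil in 1j
h = 0.1;
nodes = [0, h, 3*h/2, 2*h];
A = [ones(1,4); nodes; nodes.^2; nodes.^3];   % rows: p0, p1, p2, p3 at the nodes
rhs = [0; 1; 2*(h/2); 3*(h/2)^2];             % p_i'(x0 + h/2)
a = A\rhs;
disp(a'*12*h)                                 % expect -11 6 8 -3 (up to rounding)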
2f: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The
stencil of this question has 4 nodes, x0, x0 + h, x0 + (3/2)h, and x0 + 2h with point of evaluation x0 + (1/2)h, so
in system 4.2.2 we have n = 3, θ0 = 0, θ1 = 1, θ2 = 3/2, θ3 = 2, and θ = 1/2. Because we are deriving a second
derivative formula, we also have k = 2. Therefore, the system we need to solve is
p0″(x0 + (1/2)h) = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + (3/2)h) + a3 p0(x0 + 2h)
p1″(x0 + (1/2)h) = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + (3/2)h) + a3 p1(x0 + 2h)
p2″(x0 + (1/2)h) = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + (3/2)h) + a3 p2(x0 + 2h)
p3″(x0 + (1/2)h) = a0 p3(x0) + a1 p3(x0 + h) + a2 p3(x0 + (3/2)h) + a3 p3(x0 + 2h).
Now, p0(x) = 1 so p0″(x0 + (1/2)h) = 0; p1(x) = x − x0 so p1″(x0 + (1/2)h) = 0; p2(x) = (x − x0)² so p2″(x0 + (1/2)h) = 2;
and p3(x) = (x − x0)³ so p3″(x0 + (1/2)h) = 3h. Substituting this information into the system,
0 = a0 + a1 + a2 + a3
0 = a1 h + a2 · (3/2)h + a3 · 2h
2 = a1 h² + a2 · (9/4)h² + a3 · 4h²
3h = a1 h³ + a2 · (27/8)h³ + a3 · 8h³.
The first equation is the only one in which a0 appears so we concentrate on solving the last three equations,
which simplify to:
0 = 2a1 + 3a2 + 4a3
8/h² = 4a1 + 9a2 + 16a3
24/h² = 8a1 + 27a2 + 64a3.
Eliminating a1 just as in question 1j leaves 8/h² = 3a2 + 8a3 and 24/h² = 15a2 + 48a3, so
a3 = −2/h².
Back-substituting, 8/h² = 3a2 + 8a3 = 3a2 + 8(−2/h²) so
a2 = 8/h².
Continuing the back-substitution, 2a1 = −3a2 − 4a3 = −3(8/h²) − 4(−2/h²), which simplifies to 2a1 = −16/h² so
a1 = −8/h².
Finally, a0 = −a1 − a2 − a3 = 8/h² − 8/h² + 2/h² so
a0 = 2/h².
Our approximation, formula 4.2.1, thus becomes
f″(x0 + (1/2)h) ≈ (2/h²)f(x0) − (8/h²)f(x0 + h) + (8/h²)f(x0 + (3/2)h) − (2/h²)f(x0 + 2h)
= [2f(x0) − 8f(x0 + h) + 8f(x0 + (3/2)h) − 2f(x0 + 2h)]/h².
4b: We are trying to find the undetermined coefficients ai of formula 4.2.3. We solve system 4.2.4 to do so. The
stencil of this question has 1 node, x0 + (2/3)h, and endpoints of integration x0 and x0 + 2h, so in system 4.2.4
we have n = 0, a = x0 and b = x0 + 2h. Therefore, the “system” we need to solve is
∫_{x0}^{x0+2h} p0(x)dx = a0 p0(x0 + (2/3)h).
Now, p0(x) = 1 so ∫_{x0}^{x0+2h} p0(x)dx = ∫_{x0}^{x0+2h} dx = 2h. Substituting this information into the system,
2h = a0.
Our approximation, formula 4.2.3, becomes
∫_{x0}^{x0+2h} f(x)dx ≈ 2h f(x0 + (2/3)h).
4l: We are trying to find the undetermined coefficients ai of formula 4.2.3. We solve system 4.2.4 to do so. The
stencil of this question has 3 nodes, x0, x0 + h, and x0 + 2h with endpoints of integration x0 and x0 + 2h, so
in system 4.2.4 we have n = 2, a = x0 and b = x0 + 2h. Therefore, the system we need to solve is
∫_{x0}^{x0+2h} p0(x)dx = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + 2h)
∫_{x0}^{x0+2h} p1(x)dx = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + 2h)
∫_{x0}^{x0+2h} p2(x)dx = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + 2h).
Now, p0(x) = 1 so ∫_{x0}^{x0+2h} p0(x)dx = ∫_{x0}^{x0+2h} dx = 2h; p1(x) = x − x0 so ∫_{x0}^{x0+2h} p1(x)dx = ∫_{x0}^{x0+2h} (x − x0)dx =
(1/2)(x − x0)² evaluated from x0 to x0 + 2h = 2h²; and p2(x) = (x − x0)² so ∫_{x0}^{x0+2h} p2(x)dx = ∫_{x0}^{x0+2h} (x − x0)²dx =
(1/3)(x − x0)³ evaluated from x0 to x0 + 2h = (8/3)h³. Substituting this information into the system,
2h = a0 + a1 + a2
2h² = a1 h + a2 (2h)
(8/3)h³ = a1 h² + a2 (4h²).
The first equation is the only one in which a0 appears so we concentrate on the last two equations, which
simplify to:
2h = a1 + 2a2
(8/3)h = a1 + 4a2.
From the first equation, a1 = 2h − 2a2. Substituting into the second equation, (8/3)h = 2h − 2a2 + 4a2, which
simplifies to (2/3)h = 2a2, so
a2 = (1/3)h.
Back-substituting, a1 = 2h − 2a2 = 2h − 2((1/3)h) so
a1 = (4/3)h.
Finally, a0 = 2h − a1 − a2 = 2h − (4/3)h − (1/3)h so
a0 = (1/3)h.
Our approximation, formula 4.2.3, thus becomes
∫_{x0}^{x0+2h} f(x)dx ≈ (1/3)h f(x0) + (4/3)h f(x0 + h) + (1/3)h f(x0 + 2h)
= (h/3)[f(x0) + 4f(x0 + h) + f(x0 + 2h)].
You may recognize this formula as Simpson’s rule!
Section 4.3
3a: Simpson’s rule for integral approximation is ∫_{x0}^{x0+2h} f(x)dx ≈ (h/3)[f(x0) + 4f(x0 + h) + f(x0 + 2h)]. To apply
it to the integral ∫_{−0.5}^{0} x ln(x + 1)dx we need to identify f, x0, and h. In the formula, x0 is the lower limit of
integration, so we have x0 = −0.5 in this question. In the formula, the length of the interval of integration
is 2h, so we have 2h = 0.5 in this question, or h = 0.25. In the formula, f is the integrand, so we have
f(x) = x ln(x + 1). With the parameters identified, we plug them into the right side of Simpson’s rule and we
have our estimate:
∫_{−0.5}^{0} x ln(x + 1)dx ≈ (.25/3)[−0.5 ln(0.5) + 4(−0.25) ln(.75) + 0 ln(1)] ≈ 0.05285463856097945.
4a: The trapezoidal rule for integral approximation is ∫_{x0}^{x0+h} f(x)dx ≈ (h/2)[f(x0) + f(x0 + h)]. To apply it to the
integral ∫_{−0.5}^{0} x ln(x + 1)dx we need to identify f, x0, and h. In the formula, x0 is the lower limit of integration,
so we have x0 = −0.5 in this question. In the formula, the length of the interval of integration is h, so we
have h = 0.5 in this question. In the formula, f is the integrand, so we have f(x) = x ln(x + 1). With the
parameters identified, we plug them into the right side of the trapezoidal rule and we have our estimate:
∫_{−0.5}^{0} x ln(x + 1)dx ≈ (.5/2)[−0.5 ln(0.5) + 0 ln(1)] ≈ 0.08664339756999316.
5a: The midpoint rule for integral approximation is ∫_{x0}^{x0+2h} f(x)dx ≈ 2h f(x0 + h). To apply it to the integral
∫_{−0.5}^{0} x ln(x + 1)dx,
we need to identify f, x0, and h. In the formula, x0 is the lower limit of integration, so we have x0 = −0.5
in this question. In the formula, the length of the interval of integration is 2h, so we have 2h = 0.5 in this
question, or h = 0.25. In the formula, f is the integrand, so we have f(x) = x ln(x + 1). With the parameters
identified, we plug them into the right side of the midpoint rule and we have our estimate:
∫_{−0.5}^{0} x ln(x + 1)dx ≈ 2(.25)(−0.25 ln(0.75)) ≈ 0.03596025905647261.
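The three estimates above can be reproduced with Octave one-liners; this is only a sketch of the hand computations.
f = @(x) x.*log(x + 1);
simpson   = (0.25/3)*(f(-0.5) + 4*f(-0.25) + f(0))   % ≈ 0.052854638...
trapezoid = (0.5/2)*(f(-0.5) + f(0))                 % ≈ 0.086643397...
midpoint  = 2*0.25*f(-0.25)                          % ≈ 0.035960259...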
7a: See above for the exact evaluation of the integral. The error follows as
|0.08664339756999316 − 0.05256980729002053| ≈ 0.034073.
8a: See above for the exact evaluation of the integral. The error follows as
|0.03596025905647261 − 0.05256980729002053| ≈ 0.016609.
11a: −(h²/6)f‴(ξh) is the error term for this approximation formula. The remainder of the equation is the approximation.
We simply plug the given information into the approximation formula:
f′(x0) ≈ [f(x0 + h) − f(x0 − h)]/(2h) = (e^2.1 − e^1.9)/(2(.1)) ≈ 7.401377351441916.
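The same arithmetic in Octave:
(exp(2.1) - exp(1.9)) / (2*0.1)   % ≈ 7.401377351...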
12a: The error term, −(h²/6)f‴(ξh), dictates the error. As in Taylor’s Theorem, this error term is exact for some
value of ξh. Finding a bound on the error means minimizing or maximizing |(h²/6)f‴(ξh)| over all possible values
of ξh. The possible values of ξh are all values between the least node and the greatest node, a fact that follows
from Taylor’s Theorem. For this question, h = .1 and f‴(ξ) = e^ξ, so a lower bound for the error is
(.1²/6) min_{ξ∈[1.9,2.1]} e^ξ.
But e^ξ is an increasing function, so its minimum value over [1.9, 2.1] occurs at 1.9 and its maximum at 2.1.
Hence, we have the error between (.01/6)e^1.9 and (.01/6)e^2.1, or as floating point approximations, 0.01114315740379878
and 0.01361028318761275. f′(x) = e^x so f′(2) = e² exactly. The actual error is thus |e² − 7.401377351441916| ≈ 0.0123,
squarely between the two bounds.
The approximation in question is ∫_{x0}^{x0+h} f(x)dx ≈ (h/4)[3f(x0 + (2/3)h) + f(x0)]. The left side of this approximation
is ∫_{x0}^{x0+h} f(x)dx, so replace f(x) by f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(x0) + ···:
∫_{x0}^{x0+h} f(x)dx = ∫_{x0}^{x0+h} [f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(x0) + ···] dx
= [xf(x0) + (1/2)(x − x0)²f′(x0) + (1/6)(x − x0)³f″(x0) + (1/24)(x − x0)⁴f‴(x0) + ···] evaluated from x0 to x0 + h
= hf(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···.
The right side of the approximation includes f(x0 + (2/3)h), so this expression is also expanded in a Taylor series:
f(x0 + (2/3)h) = f(x0) + (2/3)hf′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(x0) + ···.
Substitute these expansions into the difference of the two sides and simplify. The error is
hf(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···
− (h/4)[3(f(x0) + (2/3)hf′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(x0) + ···) + f(x0)]
= hf(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···
− [hf(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/27)h⁴f‴(x0) + ···]
= (1/216)h⁴f‴(x0) + ···.
Work done heretofore is informal evidence that the error term is O(h⁴f‴(ξh)). To formalize, we truncate the
Taylor series, making them Taylor polynomials of convenient degree, with error terms! The error terms from
the Taylor polynomials become the error term for the approximation formula. Beginning with the left side of
the formula, the exact value:
∫_{x0}^{x0+h} f(x)dx = ∫_{x0}^{x0+h} [f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(ξx)] dx
= [xf(x0) + (1/2)(x − x0)²f′(x0) + (1/6)(x − x0)³f″(x0)] evaluated from x0 to x0 + h
  + ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx
= hf(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx
for some unknown function ξx of x. Now, the f(x0 + (2/3)h) term from the right side of the formula, the
approximate value:
f(x0 + (2/3)h) = f(x0) + (2/3)hf′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(ξ1)
for some ξ1 ∈ (x0, x0 + h). Subtracting the two sides, we know all terms with derivative lower than the third
will drop out since none of those terms have changed since our discovery. The error is, therefore,
∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx − (h/4) · 3 · (4/81)h³f‴(ξ1).
The Weighted Mean Value Theorem allows us to replace ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx by (1/6)f‴(c) ∫_{x0}^{x0+h} (x − x0)³dx
= (1/24)h⁴f‴(c) for some c ∈ (x0, x0 + h). The error term thus becomes
(1/24)h⁴f‴(c) − (1/27)h⁴f‴(ξ1)
for some c ∈ (x0, x0 + h) and some ξ1 ∈ (x0, x0 + h). The final formality is to replace this term with big-O
notation:
|(1/24)h⁴f‴(c) − (1/27)h⁴f‴(ξ1)| ≤ h⁴[(1/24)|f‴(c)| + (1/27)|f‴(ξ1)|]
≤ h⁴(1/24 + 1/27) max{|f‴(c)|, |f‴(ξ1)|}
= M h⁴|f‴(ξh)|
for some ξh ∈ (x0, x0 + h) and M = 1/24 + 1/27 = 17/216 (the value of ξh is either c or ξ1). Hence, the error is
O(h⁴f‴(ξh)).
18c: The error in any approximation formula is the difference between the two sides. One side holds the exact
quantity and the other holds the approximation. To find the error, we subtract the two sides from one another,
expand each appearance of f in a Taylor series about x0 and simplify. The term of least degree remaining
determines the error term. The formula here is
f′(x0) ≈ [−3f(x0) + 4f(x0 + h/2) − f(x0 + h)]/h.
The left side of this approximation is f′(x0), so its Taylor expansion is itself! The right side of the approximation
includes f(x0 + (1/2)h) and f(x0 + h), so these expressions are expanded in Taylor series:
f(x0 + (1/2)h) = f(x0) + (1/2)hf′(x0) + (1/8)h²f″(x0) + (1/48)h³f‴(x0) + ···
f(x0 + h) = f(x0) + hf′(x0) + (1/2)h²f″(x0) + (1/6)h³f‴(x0) + ···.
To simplify the display of the algebra, we begin by summing −3f(x0) + 4f(x0 + h/2) − f(x0 + h), using the
truncated expansions
f(x0 + (1/2)h) = f(x0) + (1/2)hf′(x0) + (1/8)h²f″(x0) + (1/48)h³f‴(ξ1)
f(x0 + h) = f(x0) + hf′(x0) + (1/2)h²f″(x0) + (1/6)h³f‴(ξ2)
for some ξ1, ξ2 ∈ (x0, x0 + h). Subtracting the two sides, we know all terms with derivative lower than the
third will drop out since none of those terms have changed since our discovery. The remaining terms, those
with the third derivative in them, form the error:
[−4 · (1/48)h³f‴(ξ1) + (1/6)h³f‴(ξ2)]/h = h²[(1/6)f‴(ξ2) − (1/12)f‴(ξ1)]
for some ξ1, ξ2 ∈ (x0, x0 + h). The final formality is to replace this term with big-O notation:
|h²[(1/6)f‴(ξ2) − (1/12)f‴(ξ1)]| ≤ h²[(1/6)|f‴(ξ2)| + (1/12)|f‴(ξ1)|]
≤ h²(1/6 + 1/12) max{|f‴(ξ2)|, |f‴(ξ1)|}
= M h²|f‴(ξh)|
for some ξh ∈ (x0, x0 + h) and M = 1/6 + 1/12 = 1/4 (the value of ξh is either ξ2 or ξ1). Hence, the error is
O(h²f‴(ξh)).
19: Diffy Rence is using a second derivative formula with x0 = 3 since the left side is f 00 (3.0). On the right
side, we see a term with sin(3) in it. This is likely sin(x0 ) from one of the second derivative formulas.
We also see sin(2.8) and sin(3.2) which look likely to play the roles of sin(x0 − h) and sin(x0 + h) in the
approximation formula used. Looking at table 4.3 for a formula with f (x0 − h), f (x0 ), and f (x0 + h) in
it, we find f″(x0) = [f(x0 − h) − 2f(x0) + f(x0 + h)]/h² + O(h²f⁽⁴⁾(ξh)). Continuing with the hypothesis that we have
f (x) = sin(x), x0 = 3, and h = .2, we plug into the formula to find
for some constant k dependent on the method, not the function f or the nodes used. Now,
max f (5) (x) = max |cos(x)|
x∈[3,3.04] x∈[3,3.04]
= |cos(3.04)| .
A bound on the error is, therefore, 0.0001k cos(3.04) or 9.9485(10)−5 k for some k dependent on the method.
23f: First, we need to identify the formula being used. The unusual points of evaluation in the approximation
identify it quickly as
∫_{x0−h}^{x0+h} f(x)dx = h[f(x0 − (1/√3)h) + f(x0 + (1/√3)h)] + O(h⁵f⁽⁴⁾(ξh))
with x0 = 3.5, h = 0.5, and error term O(h⁵f⁽⁴⁾(ξh)). The error is, therefore, bounded by
k(.5)⁵ max_{x∈[3,4]} |f⁽⁴⁾(x)|
for some constant k dependent on the method, not the function f or the nodes used. Now,
max_{x∈[3,4]} |f⁽⁴⁾(x)| = max_{x∈[3,4]} |sin(x)| = |sin(4)|.
A bound on the error is, therefore, 0.03125k|sin(4)| or 0.023651k for some k dependent on the method.
24: (a) We are given only 5 nodes, so we must use them all for each approximation. The nodes are (thankfully)
evenly spaced so we can use one of the formulas in table 4.2. There are two nodes to the left of 2 and two
to the right, so we need to use the five-point formula with nodes x0 − 2h, x0 − h, x0 , x0 + h, and x0 + 2h
to approximate f 0 (2). All four of the nodes other than 4 are to the left of 4 so we need to use the five-point
formula with nodes x0 − 4h, x0 − 3h, x0 − 2h, x0 − h, and x0 to approximate f 0 (4). Hence,
(b) We should expect the approximation of f′(2) to be better because the error term for the formula used is
(h⁴/30)f⁽⁵⁾(ξh) where the error term for the formula used in approximating f′(4) is (h⁴/5)f⁽⁵⁾(ξh), six times greater.
Another reason we should expect the f′(2) approximation to be better is because 2 is centrally located amongst
the nodes where 4 is as far from centrally located as possible!
(c) f′(x) = −1/(x − 4.2)² so f′(2) = −25/121 and f′(4) = −25. The absolute errors are
So, as expected the absolute error in the approximation of f 0 (2) is smaller than that of f 0 (4), but the relative
errors, which are perhaps more important, are exactly the opposite in comparison!
The area of trapezoid CDEF represents the approximation by the trapezoidal rule (which is where it gets
its name). The function f (x) was chosen so that the two brownish areas are (very nearly) equal, one above
line segment CD and one below. This means the trapezoidal rule approximation will be (very nearly) exact.
Moreover, since the point A is not on line segment CD, the approximation by Simpson’s rule will not be (very
nearly) exact. Other examples can be created similarly. To summarize, any example of a smooth function
where the following occur will work.
• The areas above and below the line segment from (0, f (0)) to (1, f (1)) are equal.
• (.5, f (.5)) does not lie on the line segment from (0, f (0)) to (1, f (1)).
REMARK: Non-smooth functions with the two properties above also provide examples. The reason we chose
to give a smooth example is because the errors for non-smooth functions are completely unpredictable
(since they don’t possess the required number of derivatives), and, hence, it is not as surprising in that
case that we can find examples where the trapezoidal rule outdoes Simpson’s rule. The trapezoidal rule
and Simpson’s rule can not be applied reliably to functions without sufficient derivatives.
REMARK: The question did not request a formula, so any hand-sketched graph with the two properties
above would suffice. Since we have a formula, however, we can demonstrate numerically the result. For
the function f pictured above,
∫_0^1 f(x)dx ≈ 3.443097449311693
Trapezoidal Rule = [f(0) + f(1)]/2 ≈ 3.443097449311694
Simpson’s Rule = [f(0) + 4f(.5) + f(1)]/6 ≈ 3.632535470843161.
34: Five-point formulas for the 2nd derivative have error term O(h³f⁽⁵⁾(ξh)) or O(h⁴f⁽⁶⁾(ξh)) so E.1 = k(.1)³f⁽⁵⁾(ξ.1)
or E.1 = k(.1)⁴f⁽⁶⁾(ξ.1) and E.02 = k(.02)³f⁽⁵⁾(ξ.02) or E.02 = k(.02)⁴f⁽⁶⁾(ξ.02). Assuming f⁽⁵⁾(ξ.1) ≈
f⁽⁵⁾(ξ.02) if the error term is O(h³f⁽⁵⁾(ξh)) or that f⁽⁶⁾(ξ.1) ≈ f⁽⁶⁾(ξ.02) if the error term is O(h⁴f⁽⁶⁾(ξh)), we
should expect
E.1/E.02 = [k(.1)³f⁽⁵⁾(ξ.1)]/[k(.02)³f⁽⁵⁾(ξ.02)] ≈ (.1/.02)³ = 125
or
E.1/E.02 = [k(.1)⁴f⁽⁶⁾(ξ.1)]/[k(.02)⁴f⁽⁶⁾(ξ.02)] ≈ (.1/.02)⁴ = 625.
Section 4.4
1a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply the midpoint rule to each
of the subintervals. The sum of the three estimates is the answer.
∫_1^3 ln(sin(x))dx ≈ −0.6040146059410205
2a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply the trapezoidal rule to
each of the subintervals. The sum of the three estimates is the answer.

    interval              trapezoidal rule
    [1, 1 + 2/3]          (1/3)[ln(sin(1)) + ln(sin(1 + 2/3))] ≈ −0.05906878811071457
    [1 + 2/3, 2 + 1/3]    (1/3)[ln(sin(1 + 2/3)) + ln(sin(2 + 1/3))] ≈ −0.1096099655624244
    [2 + 1/3, 3]          (1/3)[ln(sin(2 + 1/3)) + ln(sin(3))] ≈ −0.7607906360781023

∫_1^3 ln(sin(x))dx ≈ −0.9294693897512412
3a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply Simpson’s rule to each of
the subintervals. The sum of the three estimates is the answer. Let f (x) = ln(sin(x)).
ˆ 3
ln(sin(x))dx ≈ −0.7124995338777608
1
4a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply Simpson’s 3/8 rule to each
of the subintervals. The sum of the three estimates is the answer. Let f(x) = ln(sin(x)).
ˆ 3
ln(sin(x))dx ≈ −0.7075487879729477
1
5a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply the quadrature rule to
each of the subintervals. The sum of the three estimates is the answer. Let f (x) = ln(sin(x)).
ˆ 3
ln(sin(x))dx ≈ −0.63357525404685
1
7: The trapezoidal rule applied to ∫_0^π sin⁴x dx gives
(π/2)[sin⁴(0) + sin⁴(π)] = 0,
which has absolute error (3/8)π. Since the trapezoidal rule has error term O(1/n²), dividing the interval of
integration into n subintervals should decrease the error by a factor of about 1/n². Therefore, we need to solve
the equation (3π/8)/n² = 10⁻⁴:
(3π/8)/n² = 10⁻⁴
(3π/8)/10⁻⁴ = n²
n = √((3π/8)/10⁻⁴)
n ≈ 108.5.
Increasing the number of intervals by a factor of 109 should do the trick. Since our initial estimate used but
one interval, we need to use 109 intervals to achieve 10⁻⁴ accuracy.
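This prediction can be tested with the compositeTrapezoidal function listed in the solution to question 23 below, assuming that file is on the path; the resulting error should land near the 10⁻⁴ target.
f = @(x) sin(x).^4;
exact = 3*pi/8;
approx = compositeTrapezoidal(f, 0, pi, 109);
abs(exact - approx)   % should come out close to the 1e-4 target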
15: Let Sk(a, b) mean applying composite Simpson’s rule to the interval [a, b] with k subintervals and ek mean the
error in Sk(a, b). We now repeat the analysis we did in deriving the adaptive trapezoidal rule but applied to
Simpson’s rule:
en ≈ M(1/n)⁴ and e2n ≈ M(1/(2n))⁴
so
en/e2n ≈ M(1/n)⁴ / [M(1/(2n))⁴] = 16, which implies en ≈ 16e2n.
Because ∫_a^b f(x)dx = S2(a, b) + e2 = S1(a, b) + e1,
S2(a, b) − S1(a, b) = e1 − e2 ≈ 16e2 − e2 = 15e2
so e2 ≈ (1/15)(S2(a, b) − S1(a, b)). Explicitly,
∫_a^b f(x)dx − S2(a, b) ≈ (1/15)(S2(a, b) − S1(a, b)).
Now we know what quantity to use in order to estimate the error. We tabulate the necessary computations:

    a     b     S1(a, b)     S2(a, b)     (1/15)|S2(a, b) − S1(a, b)|    tol
    1     3     −0.837026    −0.730741    0.00708                        .002
    1     2     −0.046286    −0.045560    4.8(10)⁻⁵                      .001
    2     3     −0.684454    −0.661383    0.00153                        .001
    2     2.5   −0.134349    −0.134243    7.0(10)⁻⁶                      .0005
    2.5   3     −0.527034    −0.523129    0.00026                        .0005

∫_1^3 ln(sin(x))dx ≈ −0.045560 − 0.134243 − 0.523129 = −0.702932
23: First,
∫_0^1 ln(x + 1)dx = [(x + 1) ln(x + 1) − x − 1] evaluated from 0 to 1 = 2 ln 2 − 2 − (−1) = 2 ln 2 − 1 ≈ 0.3862943611198906.
Now we need to get an estimate using the composite trapezoidal rule with a small number of intervals, say 10
or 20. This part of the computation is mere speculation. Really, any number of intervals that will not give
the desired accuracy will suffice:
T10 (0, 1) = 0.385877936745754.
The error with 10 subintervals is
|0.3862943611198906 − 0.385877936745754| ≈ 4.16424374136581(10)⁻⁴.
Since the error term for the composite trapezoidal rule (assuming f″(ξh) is constant, as we do in deriving the
adaptive method) is O(1/n²), we expect the error to decrease by a factor of n² as the number of intervals is
increased by a factor of n. The needed factor of decrease is
10⁻⁶ / 4.16424374136581(10)⁻⁴ ≈ 0.00240139641699267.
Therefore, the necessary factor of increase is √(1/0.00240139641699267) ≈ 20.406. Our “test” calculation used 10
intervals, so we need to use 10 · 20.406 = 204.06, or rounding up, 205 intervals to achieve 10⁻⁶ accuracy.
REMARK: Another way to find the necessary factor of increase is to solve the equation
4.16424374136581(10)⁻⁴ / n² = 10⁻⁶.
This comes from the fact that increasing the number of intervals by a factor of n decreases the error by
a factor of n2 . Thus we take the known error (of T10 (0, 1)), divide by n2 and set it equal to the desired
accuracy, 10−6 . The solution, of course, is n ≈ 20.406, the factor of increase.
REMARK: We have used the Octave code
####################################################
# Written by Dr. Len Brin 2 April 2012 #
# MAT 322 Numerical Analysis I #
# Purpose: Implementation of composite Trapezoidal #
# rule #
# INPUT: function f, interval endpoints a and b, #
# number of subintervals n #
# OUTPUT: approximate integral of f(x) from a to b #
####################################################
function integral = compositeTrapezoidal(f,a,b,n)
h = (b-a)/n;
s = 0;
for i = 1:n-1
s = s + f(a+i*h);
end#for
integral = h*(f(a)+2*s+f(b))/2;
end#function
to calculate T10 (0, 1):
>> f=inline(’log(x+1)’);
>> compositeTrapezoidal(f,0,1,10)
ans = 0.385877936745754
compositeTrapezoidal.m may be downloaded at the companion website.
REMARK: Using the code above to calculate the approximation with 205 subintervals:
>> compositeTrapezoidal(f,0,1,205)
ans = 0.386293369647938
and it has error
>> 0.3862943611198906-ans
ans = 9.91471952871414e-07
just less than 10−6 .
Section 4.5
7: We need to combine N(h), N(h/2), and N(h/3) so that terms involving h and h² vanish, leaving h³ as the lowest
order term.
N(h) = M − K1h − K2h² − K3h³ − ···
N(h/2) = M − (1/2)K1h − (1/4)K2h² − (1/8)K3h³ − ···
N(h/3) = M − (1/3)K1h − (1/9)K2h² − (1/27)K3h³ − ···
so N(h) + aN(h/2) + bN(h/3) is
(1 + a + b)M − (1 + a/2 + b/3)K1h − (1 + a/4 + b/9)K2h² − (1 + a/8 + b/27)K3h³ − ···.
Both N1 and N̂1 are O(h²) approximations, so we can combine them to get the O(h³) approximation.
Unfortunately, the Richardson’s extrapolation formula does not apply. It assumes the same constants in
each approximation. But the general idea does. Here N1(h) = 2N(h/2) − N(h) and N̂1(h/2) = 3N(h/3) − 2N(h/2),
both of which eliminate the K1 term. We need to combine these approximations:
N1(h) = M + (1/2)K2h² + (3/4)K3h³ + ···
N̂1(h/2) = M + (1/6)K2h² + (5/36)K3h³ + ···
3N̂1(h/2) − N1(h) = 2M − (1/3)K3h³ − ···.
Therefore, the O(h³) approximation for M we are looking for is
N2(h) = [3N̂1(h/2) − N1(h)]/2
      = [3(3N(h/3) − 2N(h/2)) − (2N(h/2) − N(h))]/2
      = [N(h) − 8N(h/2) + 9N(h/3)]/2.
8: For the first extrapolation, we use formula 4.5.4 with α = 1/2 and m1 = 2:
N1(h) = [4N(h/2) − N(h)]/3,
which leaves N1(h) = M + l2h⁴ + l3h⁶ + ···. We get a second round of refinements from formula 4.5.4 with
α = 1/2 and m1 = 4:
N2(h) = [16N1(h/2) − N1(h)]/15,
which leaves N2(h) = M + c3h⁶ + ···. We get a third round of refinements from formula 4.5.4 with α = 1/2
and m1 = 6:
N3(h) = [64N2(h/2) − N2(h)]/63.
Tabulating the computation, it goes something like this:

    N            N1          N2          N3
    2.356194
    −0.4879837   −1.436042
    −0.8815732   −1.012769   −0.9845514
    −0.9709157   −1.000696   −0.9998916  −1.000135

The third Richardson extrapolation is −1.000135. Not bad considering the exact value of the integral is −1.
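The table is easy to reproduce in Octave from the first column alone; the following sketch just applies the three extrapolation formulas above.
N  = [2.356194; -0.4879837; -0.8815732; -0.9709157];
N1 = (4*N(2:4)  - N(1:3))  / 3;
N2 = (16*N1(2:3) - N1(1:2)) / 15;
N3 = (64*N2(2)  - N2(1))   / 63   % should display roughly -1.000135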
10: To summarize the method, let N0 (k) = Tk (1, 3), the trapezoidal rule itself applied with k subintervals. Then
since the error of the trapezoidal rule only contains even powers,
Section 5.2
2: Since there are three points given, the spline consists of two cubic pieces. Each cubic piece has 4 coefficients, so
we will need to construct a system of 8 equations in the 8 unknowns. The spline S takes the form
S(x) = { S1(x) = a1 + b1(x − 1) + c1(x − 1)² + d1(x − 1)³,   x ∈ [0, 1]
       { S2(x) = a2 + b2(x − 2) + c2(x − 2)² + d2(x − 2)³,   x ∈ [1, 2].
The 8 equations come from the three sets of requirements on any free cubic spline.
Interpolation:
• S1 (0) = −9 ⇒ a1 − b1 + c1 − d1 = −9
• S1 (1) = −13 ⇒ a1 = −13
• S2 (1) = −13 ⇒ a2 − b2 + c2 − d2 = −13
• S2 (2) = −29 ⇒ a2 = −29
Derivative matching:
• S10 (1) = S20 (1) ⇒ b1 = b2 − 2c2 + 3d2
• S100 (1) = S200 (1) ⇒ 2c1 = 2c2 − 6d2
Endpoint conditions:
7: Since there are three points given, the spline consists of two cubic pieces. Each cubic piece has 4 coefficients, so
we will need to construct a system of 8 equations in the 8 unknowns. The spline S takes the form
S(x) = { S1(x) = a1 + b1(x − 2) + c1(x − 2)² + d1(x − 2)³,   x ∈ [1, 2]
       { S2(x) = a2 + b2(x − 4) + c2(x − 4)² + d2(x − 4)³,   x ∈ [2, 4].
The 8 equations come from the three sets of requirements on any clamped cubic spline.
Interpolation:
• S1 (1) = 1 ⇒ a1 − b1 + c1 − d1 = 1
• S1 (2) = 3 ⇒ a1 = 3
• S2 (2) = 3 ⇒ a2 − 2b2 + 4c2 − 8d2 = 3
• S2 (4) = 2 ⇒ a2 = 2
Derivative matching:
Endpoint conditions:
Combined with the equation c2 = 0, we find c1 = −9. Now we have the ai and ci. The rest of the
solution amounts to back-substitution. From the left endpoint condition, d1 = (1/3)c1 = −3. From second
derivative matching, d2 = (c2 − c1)/3 = (0 − (−9))/3 = 3. Now we have the di. From the interpolation requirements,
b1 = a1 + c1 − d1 + 9 and b2 = a2 + c2 − d2 + 13, so
b1 = −13 − 9 + 3 + 9 = −10
b2 = −29 + 0 − 3 + 13 = −19.
REMARK: The solution outlined in the text is not the only way to get the solution. Any method of solving
the six equations involving bi , ci , and di can be used.
9e: Following the solution outlined in the text, equation 5.2.8 gives n − 2 = 0 equations in the ci. We can not
use equation 5.2.11 since it was derived from free endpoint conditions. Instead, we need to use the clamped
endpoint conditions to come up with two equations in the ci. Equation 5.2.10 gives us b1 = −1/2 − (4/3)c1 − (2/3)c2.
Solving the second derivative matching equation for d2, we have d2 = (c2 − c1)/6. Substituting expressions for b1,
b2, and d2 into the first derivative matching equation, −1/2 = −(2/3)c1 − (4/3)c2, which simplifies to 4c1 + 8c2 = 3. This
is our first equation in ci. Now solving the left endpoint condition for d1, we have d1 = (2c1 − b1)/3. Substituting
expressions for a1, b1, and d1 into the first interpolation equation, we have
3 − (−1/2 − (4/3)c1 − (2/3)c2) + c1 − [2c1 − (−1/2 − (4/3)c1 − (2/3)c2)]/3 = 1,
which simplifies to 11c1 + 4c2 = −21. The two equations in ci can now be solved to
find c1 = −5/2 and c2 = 13/8. As with the free spline, the rest of the solution amounts to back-substitution:
b1 = −1/2 − (4/3)(−5/2) − (2/3)(13/8) = 7/4
d1 = (2(−5/2) − 7/4)/3 = −9/4
d2 = (13/8 − (−5/2))/6 = 11/16.
REMARK: The solution outlined in the text is not the only way to get the solution. Any method of solving
the six equations involving bi , ci , and di can be used.
b =
-10 -19
c =
-9 0
d =
-3 3
11: First, the declaration of the function must be changed. Left and right endpoint derivatives, m0 and mn , will
be specified, so there must be additional arguments to the function. Also, the name of the function should be
changed:
should become
The rest of the modifications involve the endpoint conditions and their effect on the equations within the
function. We begin by solving the left endpoint condition for d1: b1 + 2c1h1 + 3d1h1² = m0 ⇒
d1 = (m0 − b1 − 2c1h1)/(3h1²).   (6.5.6)
Substituting this equation, ai = yi, and equation 5.2.10 into 5.2.1 with i = 1 gives
y1 + [(y1 − y2)/h2 + (2/3)h2c1 + (1/3)h2c2] h1 + c1h1² + {[m0 − ((y1 − y2)/h2 + (2/3)h2c1 + (1/3)h2c2) − 2c1h1]/(3h1²)} h1³ = y0,
which, after multiplying through by 3/h1, becomes
3(y1 − y2)/h2 + 2h2c1 + h2c2 + 3c1h1 + m0 − [(y1 − y2)/h2 + (2/3)h2c1 + (1/3)h2c2] − 2c1h1 = 3(y0 − y1)/h1
2(y1 − y2)/h2 + 2h2c1 + h2c2 + c1h1 + m0 − [(2/3)h2c1 + (1/3)h2c2] = 3(y0 − y1)/h1
6(y1 − y2)/h2 + 6h2c1 + 3h2c2 + 3c1h1 + 3m0 − (2h2c1 + h2c2) = 9(y0 − y1)/h1
6(y1 − y2)/h2 + 4h2c1 + 2h2c2 + 3c1h1 + 3m0 = 9(y0 − y1)/h1,
and finally
(4h2 + 3h1)c1 + 2h2c2 = 9(y0 − y1)/h1 − 6(y1 − y2)/h2 − 3m0.   (6.5.7)
h1 h2
The right endpoint condition, Sn′(xn) = mn, gives bn = mn. Substituting this information into 5.2.7 with
i = n gives mn = (yn−1 − yn)/hn − (cn−1 + 2cn)hn/3, which simplifies to
hn cn−1 + 2hn cn = 3[(yn−1 − yn)/hn − mn].   (6.5.8)
hn
Equation 6.5.7 should be reflected in the modified code on lines 21 and 22:
m(1,1)=2*(h(1)+h(2)); m(1,2)=h(2);
m(1,n+1)=3*((y(1)-y(2))/h(1)-(y(2)-y(3))/h(2));
becomes
m(1,1)=3*h(1)+4*h(2); m(1,2)=2*h(2);
m(1,n+1)=9*(y(1)-y(2))/h(1)-6*(y(2)-y(3))/h(2)-3*m0;
Similarly, equation 6.5.8 changes the right endpoint condition lines, as reflected in the full listing below.
The solution for the ci remains unchanged. We have only left to modify the computation of b1 and d1 on lines
47 and 48. b1 now comes from 5.2.10, so
b(1)=(y(1)-y(2))/h(1)-2*c(1)*h(1)/3;
becomes
b(1)=(y(2)-y(3))/h(2)+2*c(1)*h(2)/3+h(2)*c(2)/3;
and, reflecting equation 6.5.6,
d(1)=-c(1)/(3*h(1));
becomes
d(1)=(m0-b(1)-2*c(1)*h(1))/(3*h(1)^2);
Of course, the comments at the beginning of the function should be updated as well. The modified code,
then, should look something like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 3 June 2014 %
% Purpose: Calculation of a clamped cubic %
% spline. %
% INPUT: points (x(1),y(1)), (x(2),y(2)), ... %
% spline must interpolate; first %
% derivative at left endpoint, m0; first %
% derivative at right endpoint, mn. %
% OUTPUT: coefficients of each piece of the %
% piecewise cubic spline: %
% S(i,x) = a(i) %
% + b(i)*(x-x(i+1)) %
% + c(i)*(x-x(i+1))^2 %
% + d(i)*(x-x(i+1))^3 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [a,b,c,d] = clampedCubicSpline(x,y,m0,mn)
n=length(x)-1;
for i=1:n
h(i)=x(i)-x(i+1);
end%for
% Left endpoint condition:
% m(1,1)*c(1) + m(1,2)*c(2) = m(1,n+1)
m(1,1)=3*h(1)+4*h(2); m(1,2)=2*h(2);
m(1,n+1)=9*(y(1)-y(2))/h(1)-6*(y(2)-y(3))/h(2)-3*m0;
% Right endpoint condition:
% m(n,n-1)*c(n-1) + m(n,n)*c(n) = m(n,n+1)
m(n,n-1)=h(n); m(n,n)=2*h(n); m(n,n+1)=3*((y(n)-y(n+1))/h(n)-mn);
% Conditions for all splines:
for i=2:n-1
m(i,i-1)=h(i);
m(i,i)=2*(h(i)+h(i+1));
m(i,i+1)=h(i+1);
m(i,n+1)=3*((y(i)-y(i+1))/h(i)-(y(i+1)-y(i+2))/h(i+1));
end%for
% Solve for c(i)
l(1)=m(1,1); u(1)=m(1,2)/l(1); z(1)=m(1,n+1)/l(1);
for i=2:n-1
l(i)=m(i,i)-m(i,i-1)*u(i-1);
u(i)=m(i,i+1)/l(i);
z(i)=(m(i,n+1)-m(i,i-1)*z(i-1))/l(i);
end%for
l(n)=m(n,n)-m(n,n-1)*u(n-1);
c(n)=(m(n,n+1)-m(n,n-1)*z(n-1))/l(n);
for i=n-1:-1:1
c(i)=z(i)-u(i)*c(i+1);
end%for
% Compute a(i), b(i), d(i)
% Endpoint conditions:
b(1)=(y(2)-y(3))/h(2)+2*c(1)*h(2)/3+h(2)*c(2)/3;
d(1)=(m0-b(1)-2*c(1)*h(1))/(3*h(1)^2);
% Conditions for all splines:
a(1)=y(2);
for i=2:n
d(i)=(c(i-1)-c(i))/(3*h(i));
b(i)=(y(i)-y(i+1))/h(i)-(c(i-1)+2*c(i))*h(i)/3;
a(i)=y(i+1);
end%for
b(n)=mn;
end%function
Notice the addition of the final computation, b(n)=mn. The value of b(n) from the loop is subject to
floating point error. Setting bn equal to mn at the end of the program eliminates this potential variation.
clampedCubicSpline.m may be downloaded at the companion website.
12b: >> [a,b,c,d]=clampedCubicSpline([1,2,4],[1,3,2],0,0)
a =
3 2
b =
1.75000 0.00000
c =
-2.5000 1.6250
d =
-2.25000 0.68750
Section 2.7
1: (c) g(a) = g(0) = 2 and g(b) = g(.9) = −.1897 so the bracket is good. Moreover, we now know that if the
value of the function is positive at any given iteration, that iteration becomes the left endpoint. Otherwise it
becomes the right endpoint. Recall, the secant method when applied to a proper bracket will always produce
an iteration inside the bracket, so bisection is never needed.
The method is undefined beyond this point due to division by zero. The method fails.
REMARK: We will see later (question 6h) that Octave is able to handle the division by zero well enough
that the method does continue, and eventually arrives at a solution!
at which point it stops since |.66700 − .66071| = .00629 < .01. The (pure) secant method takes significantly
longer to converge than does its bracketed cousin. This is largely due to the fact that in the secant method,
the third iteration comes from the secant method applied to .9 and .82203, the last two iterations (which do
not comprise a proper bracket), whereas the third iteration in false position comes from the secant method
applied to 0 and .82203 (a proper bracket).
(h) The secant method produces the sequence of approximations
at which point it stops since |2.2192 − 2.2142| = .005 < .01. The (pure) secant method and its bracketed
cousin produce the exact same sequence of iterations. It just happens that, at each step, the secant method
produces an approximation, which when paired with the previous iteration forms a proper bracket!
at which point it stops since |1 − 1.003| = .003 < .01. The (pure) Newton’s method converges to a different
root, one outside the bracket! It is quick, but it fails to produce a root between 0 and .9, something that
should not be surprising from an un-safeguarded method.
(h) Newton’s method produces the sequence of approximations
20, 1062.3, 3803.0, 971.14, 377.14, 2880.5, 1606.3, 330.83, 66.635, 20.301,
−5.5823, −21.983, −10.454, −4.6688, 1.9357, 2.2550, 2.2193, 2.2191
at which point it stops since |2.2191 − 2.2193| = .0002 < .01. The (pure) Newton’s method takes significantly
longer to converge than does its bracketed cousin! Newton’s method is allowed to wander in a seemingly
random pattern before it comes close enough to the root to converge. Bracketing forces the iterations to
approach much more quickly the interval in which Newton’s method will converge.
1. Use the bracketed secant method (false position) to find a root in the indicated interval, accurate to within
10−2 .
[A]
(a) f (x) = 3 − x − sin x; [2, 3]
(b) g(x) = 3x⁴ − 2x³ − 3x + 2; [0, 1]
(c) g(x) = 3x⁴ − 2x³ − 3x + 2; [0, 0.9] [S]
5: (c)
>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=falsePosition(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.744866113620209
b = 0.696903242045358
b = 0.676602659540989
b = 0.669712929388636
b = 0.667578776723430
b = 0.666937771712738
b = 0.666747069128180
b = 0.666690496216585
b = 0.666673727853602
b = 0.666668758921090
b = 0.666667286598371
res = 0.666666850350527
i = 13
>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=falsePosition(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347853
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21893051185485
b = 2.21910087293432
b = 2.21910692606145
res = 2.21910714100071
i = 8
6: (c)
>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> f2p=inline(’12*x^3-6*x^2-3’);
>> [res,i]=bracketedNewton(f2,f2p,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.450000000000000
b = 0.639257968925196
b = 0.665474256136936
b = 0.666663994320019
b = 0.666666666653136
res = 0.666666666666667
i = 6
>> f7=inline(’exp(sin(r))-r’);
>> f7p=inline(’exp(sin(r))*cos(r)-1’);
>> [res,i]=bracketedNewton(f7,f7p,-20,20,10^-6,100)
b = 20
b = 0
warning: division by zero
b = 10
b = 3.66539525575696
b = 1.65966535497164
b = 2.50454805267468
b = 2.22298743934113
b = 2.21911019802387
b = 2.21910714891565
res = 2.21910714891375
i = 9
REMARK: When we tried to compute this solution by hand (question 2h), we quit after the first iteration
due to the division by zero. However, Octave continues, treating the undefined estimate as one that
lands outside the bracket. Thus the second iteration is 10 (the bisection method applied to [0, 20]).
7: (c)
>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=bracketedInverseQuadratic(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.411015207562680
b = 0.729556813485380
b = 0.629464108906733
b = 0.671561434924253
b = 0.666977335665865
b = 0.666666168461076
res = 0.666666666960237
i = 8
>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=bracketedInverseQuadratic(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347854
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21917736990638
b = 2.21910707796098
res = 2.21910714891272
i = 7
>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> g2=inline(’f2(x)+x’);
>> [res,i]=bracketedSteffensens(g2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.559577120523157
b = 0.707986331365555
b = 0.669737865924576
b = 0.666686284030401
b = 0.666666667476795
res = 0.666666667666825
i = 6
>> f7=inline(’exp(sin(r))-r’);
>> g7=inline(’f7(x)+x’);
>> [res,i]=bracketedSteffensens(g7,-20,20,10^-6,100)
b = 20
b = 1.80564417969925
b = 2.18151287547235
b = 2.21873144340028
b = 2.21910711013891
res = 2.21910707929096
i = 5
>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=bracketedInverseQuadraticRE(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.411015207562680
b = 0.729556813485380
b = 0.629464108906733
b = 0.671561434924253
b = 0.666977335665865
b = 0.666666168461076
res = 0.666666666960237
i = 8
>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=bracketedInverseQuadraticRE(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347854
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21917736990638
b = 2.21910707796098
res = 2.21910714891272
i = 7
Section 6.1
1d: The degree of the differential equation equals the degree of the highest degree derivative in the equation. The
only appearance of a derivative in the equation is the f 0 term. That makes the highest degree derivative 1,
so the degree of the differential equation is 1.
2d: In the differential equation f′ + f/x = x², both f and f′ appear. To verify that a given function f is a solution,
we need to substitute both f and f′ into the equation. f′ is not given, so we calculate it:
f′(x) = 3x²/4 − 4/x².
Now that we have everything needed, we substitute f and f 0 into the differential equation and verify that the
equation is true. Substituting:
(3x²/4 − 4/x²) + (x³/4 + 4/x)/x = x².
It is not obvious that this equation is true, so we need to do a little work. To finish the verification, we must
show that the two sides are equal using algebra. Adding or subtracting or doing anything else to both sides
simultaneously supposes that the two sides are equal, so these things are not allowed! Instead, we need to
manipulate the two sides separately. Working with the left side only:
3x²/4 − 4/x² + x²/4 + 4/x² = x²
3x⁴/(4x²) − 16/(4x²) + x⁴/(4x²) + 16/(4x²) = x²
4x⁴/(4x²) = x².
Almost done, but technically, this equation is not true! It is false when x = 0 because the left side is undefined
for x = 0. Luckily we do not have to worry about that case. It was given that x > 0, so we know x ≠ 0 and
we can reduce 4x⁴/(4x²) to x², which finishes the verification.
3d: In order to verify that a function is a solution of an initial value problem, we need to verify that it solves the
differential equation and satisfies the initial value requirement.
• Showing that f solves the differential equation: In the differential equation f′ + f/x = x²,
both f and f′ appear. To verify that a given function f is a solution, we need to substitute both f and
f′ into the equation. f′ is not given, so we calculate it:
f′(x) = 3x²/4 − 16/x².
Now that we have everything needed, we substitute f and f 0 into the differential equation and verify
that the equation is true. Substituting:
(3x²/4 − 16/x²) + (x³/4 + 16/x)/x = x².
It is not obvious that this equation is true, so we need to do a little work. To finish the verification, we
must show that the two sides are equal using algebra. Adding or subtracting or doing anything else to
both sides simultaneously supposes that the two sides are equal, so these things are not allowed! Instead,
we need to manipulate the two sides separately. Working with the left side only:
3x²/4 − 16/x² + x²/4 + 16/x² = x²
3x⁴/(4x²) − 64/(4x²) + x⁴/(4x²) + 64/(4x²) = x²
4x⁴/(4x²) = x².
Almost done, but technically, this equation is not true! It is false when x = 0 because the left side is
undefined for x = 0. Luckily we do not have to worry about that case. It was given that x > 0, so we
know x ≠ 0 and we can reduce 4x⁴/(4x²) to x², which finishes the verification.
• Showing that f (4) = 20: To show that f satisfies the initial value requirement, we simply compute f (4)
and show that it is 20 as required. f(4) = 4³/4 + 16/4 = 64/4 + 16/4 = 80/4 = 20.
4c: The given ẏ = t − sin t can be restated as y 0 (t) = t − sin t. In other words, we are given the derivative of y as
a function of t. The fundamental theorem of calculus tells us that y must be the integral (antiderivative) of
the given function. That is,
y(t) = ∫ (t − sin t) dt = (1/2)t² + cos t + C.
So the (infinitely many) solutions of the o.d.e. are y(t) = (1/2)t² + cos t + C.
5d: Though we could give them, this question is not asking for exact measurements of the error. It is simply
requesting a comment on the accuracy of the approximate solution. It will suffice to compare the graphs of
the exact solution and approximate solution over the interval covered by the approximate solution, [4, 5], and
do a calculation or two. The graph of the exact solution is a graph of the function f(x) = x³/4 + 16/x and the
graph of the approximate solution is a graph of the set {(4, 20), (4.25, 23), (4.5, 26), (4.75, 30), (5, 34)}:
[graph of the exact solution f(x) together with the five points of the approximation]
From the graphs, the only point in the approximation that is visually separate from the graph of the exact
solution is the point (5, 34). And it only misses by a small relative amount. To be more precise, the relative
error there is |f(5) − 34|/|f(5)| = 9/689 ≈ 0.013. Any general comment on the accuracy of an approximation should
take into account the requirements of the situation. In this case, there is no context to say whether we should
hope for 10%, 1%, .1%, or smaller relative error or whether we should be more concerned about absolute
error. Without any such context, we will simply use the visual representation, which shows the points of
the approximation very close to the graph of the exact solution, and conclude the approximation is a good
representation of the exact solution.
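For the record, the relative error calculation can be reproduced at the Octave prompt (a quick check, not part of
the text's solution):
>> f=inline('x^3/4+16/x');
>> abs(f(5)-34)/abs(f(5))   % relative error at x = 5, about 0.013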
6c: The forces acting on a stationary block on an inclined plane are gravity, friction, and the normal force of the
surface on which it is lying. Gravity acts vertically downward. Friction acts parallel to the surface and up the
slope since it is resisting gravity which pulls the block down the slope. The normal force acts perpendicular to
the surface. Representing the block as a rectangle and each force by a vector, the free body diagram should
look something like this:
Note that the line representing the surface is NOT part of the free body diagram, so it is dashed. It is only
there to show the (potential) direction of motion.
6f: The forces acting on a sofa being pushed across a level floor are gravity, friction, the normal force of the floor,
and the applied force. Gravity acts vertically downward. Friction acts parallel to the floor opposing the
applied force. The normal force acts perpendicular to the floor. And the applied force acts in an unspecified
direction not parallel to the floor. Representing the sofa as a rectangle and each force by a vector, the free
body diagram should look something like this:
Note that the line representing the floor is NOT part of the free body diagram, so it is dashed. It is only
there to show the direction of motion.
6m: The forces acting on a sky diver—whether his parachute is open, closed, or in the process of opening does
not matter—are gravity and drag (air resistance). Gravity acts vertically downward and drag acts vertically
upward. Representing the sky diver as a rectangle and each force by a vector, the free body diagram should
look something like this:
7c: (See solution of 6c for free body diagram) Since the block is not moving, the net force in any direction must
be zero! That makes the equation of motion s(t) = 0. The end. This answers the question asked.
In a situation where the block is moving, however, it is necessary to consider the magnitudes of the forces
acting in the direction of motion, friction and gravity. For sake of discussion, here is how they may be resolved.
The normal force acts normal to the motion so has zero tangential component. Friction is proportional to
the normal force, and by convention we use µ for the constant of proportionality, so the magnitude of friction
is µN . Adding an auxiliary line perpendicular to the surface, we see that the component of gravity in the
tangential direction is mg sin α.
Taking the positive direction to be down the slope, the forces acting tangential (parallel) to the surface are
mg sin α − µN . To complete the equation of motion, we need to compute N . Since the block does not move
in the normal direction, the net force in that direction must be zero. The only forces acting in the normal
direction are the normal force itself and a component of gravity. Therefore, N must equal the magnitude
of gravity in the normal direction. Again using the auxiliary line, the component of gravity in the normal
direction is mg cos α. Hence N = mg cos α. Substituting this expression into the tangential forces, we have
mg sin α − µmg cos α acting tangential to the surface. By Newton’s Second Law, this force must equal ma, so
the equation of motion is ms̈ = mg sin α − µmg cos α, which simplifies to
s̈ = g(sin α − µ cos α).
This equation can be used for a block in motion down an inclined plane.
7f: (See solution of 6f for free body diagram) Both gravity and the normal force act normal to the motion, so have
zero tangential components. The only forces that act (with nonzero component) in the direction of motion
are friction and the applied force. Friction is proportional to the normal force, and by convention we use µ
for the constant of proportionality, so the magnitude of friction is µN . Adding an auxiliary line parallel to
the surface, we mark the angle of the applied force and see that the component of the applied force in the
tangential direction is Fapplied cos β.
Taking the positive direction to be left, the forces acting tangential (parallel) to the surface are Fapplied cos β −
µN . To complete the equation of motion, we need to compute N . Since the block does not move in the
normal direction, the net force in the normal direction must be zero. The forces acting in that direction are
N itself, gravity, and a component of the applied force. Therefore, in the normal direction, we must have
N + Fapplied sin β = mg or N = mg − Fapplied sin β. Substituting this expression into the tangential forces,
we have Fapplied cos β − µ(mg − Fapplied sin β) acting tangential to the surface. By Newton’s Second Law, this
force must equal ma, so the equation of motion is ms̈ = Fapplied cos β − µ(mg − Fapplied sin β), which simplifies
to
s̈ = (Fapplied/m)(cos β + µ sin β) − µg.
7m: (See solution of 6m for free body diagram) Both forces in the free body diagram act in the vertical direction, so
the equation of motion is particularly simple in this case. No trigonometry is needed. F = ma simply becomes
Fdrag − mg = ms̈, taking upward to be the positive direction. The drag force is taken to be proportional to
speed but in the opposite direction, so Fdrag may be replaced by −cṡ (for some positive constant c) and the
equation of motion becomes, more precisely, −cṡ − mg = ms̈. With a little bit of algebra, this equation can
be rewritten as
c
s̈ + ṡ + g = 0.
m
Section 6.2
1a: Replacing the t in Euler’s Method (6.2.3) by x, Euler’s Method applied to this problem has the form yi+1 =
yi + h · y 0 (xi , yi ). Because the initial condition is y(1) = 1, we begin with x0 = 1 and y0 = 1. Then
y1 = y0 + 0.5(3x0 − 2y0 )
= 1 + 0.5(3(1) − 2(1))
= 1.5
x1 = x0 + h = 1 + 0.5 = 1.5
y2 = y1 + 0.5(3x1 − 2y1 )
= 1.5 + 0.5(3(1.5) − 2(1.5))
= 2.25
x2 = x1 + h = 1.5 + 0.5 = 2.0
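These two steps are quick to verify at the Octave prompt; the following lines are a sketch, not part of the text's
solution:
>> f=inline('3*x-2*y');
>> x=1; y=1; h=0.5;
>> y=y+h*f(x,y), x=x+h   % first step:  y = 1.5,  x = 1.5
>> y=y+h*f(x,y), x=x+h   % second step: y = 2.25, x = 2.0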
1d: Because the o.d.e. is not written in the form y′ = f(t, y), it is our job to rewrite it in that form, taking what
is given and solving for y′:
y′ = 2cos²(x) sin(x) − sec(x) − y tan(x).
So we have f(x, y) = 2cos²(x) sin(x) − sec(x) − y tan(x). Now replacing the t in Euler's Method (6.2.3) by x,
Euler’s Method applied to this problem has the form yi+1 = yi + h · y 0 (xi , yi ). Because the initial condition
is y(1) = 0, we begin with x0 = 1 and y0 = 0. Then
y1 = y0 + 0.5f (x0 , y0 )
= 0 + 0.5f (1, 0)
= 0.5(2 cos2 (1) sin(1) − sec(1))
≈ −0.67976011062352
x1 = x0 + h = 1 + 0.5 = 1.5
y2 = y1 + 0.5f (x1 , y1 )
≈ −0.67976 + 0.5f (1.5, −0.67976)
≈ −2.9503939532546
x2 = x1 + h = 1.5 + 0.5 = 2.0
2a: For Taylor’s Method of degree 2, we will need the second derivative of y. The only thing we have to work with
is the o.d.e. itself, dy/dx = 3x − 2y. By implicit differentiation,
d²y/dx² = 3 − 2 dy/dx.
Substituting the o.d.e. for dy/dx,
d²y/dx² = 3 − 2(3x − 2y) = 3 − 6x + 4y.
Now we are ready. Symbolically, Taylor's Method of degree 2 is
yi+1 = yi + h · y′(xi, yi) + (1/2)h² · y″(xi, yi)
xi+1 = xi + h
Beginning with the initial conditions, x0 = 1, y0 = 1,
y1 = y0 + h · y′(x0, y0) + (1/2)h² · y″(x0, y0)
   = 1 + 0.5(3 · 1 − 2 · 1) + (1/2)(0.5)² · (3 − 6 · 1 + 4 · 1)
   = 1.625
x1 = x0 + h = 1 + 0.5 = 1.5
y2 = y1 + h · y′(x1, y1) + (1/2)h² · y″(x1, y1)
   = 1.625 + 0.5(3 · 1.5 − 2 · 1.625) + (1/2)(0.5)² · (3 − 6 · 1.5 + 4 · 1.625)
   = 2.3125
x2 = x1 + h = 1.5 + 0.5 = 2.0
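The same two steps can be scripted. Here is a minimal sketch (not code from the text) of Taylor's Method of
degree 2 for this problem; yp and ypp are names chosen here for y′ and y″:
yp  = inline('3*x-2*y');     % y'  = f(x,y)
ypp = inline('3-6*x+4*y');   % y'' from the implicit differentiation above
x = 1; y = 1; h = 0.5;
for i = 1:2
  y = y + h*yp(x,y) + h^2/2*ypp(x,y);
  x = x + h;
end%for
y   % 2.3125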
2d: For Taylor’s Method of degree 2, we will need the second derivative of y. The only thing we have to work
with is the o.d.e. itself (after it’s been solved for dx
dy
: dx
dy
= 2 cos2 (x) sin(x) − sec(x) − y tan(x). By implicit
differentiation,
d2 y dy
= − tan(x) · − sec(x) tan(x) − 4 cos(x) sin2 (x) − y sec2 (x) + 2 cos3 (x).
dx2 dx
d2 y
dx2 (and simplifying a lot!) yields
d2 y
= −y + 8 cos3 (x) − 6 cos(x)
dx2
Now we are ready. Symbolically, Taylor’s Method of degree 2 is
1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi )
2
xi+1 = xi + h
Therefore, we have y(2) = −1.3896462555267. If this exercise does not convince you that Taylor’s Methods
of degree higher than 2 are not particularly user-friendly, just wait until you try Taylor’s Method of degree 3
on this problem.
3a: For Taylor’s Method of degree 3, we will need the second and third derivatives of y. The only thing we have
to work with is the o.d.e. itself, dy/dx = 3x − 2y. By implicit differentiation,
d²y/dx² = 3 − 2 dy/dx.
Substituting the o.d.e. for dy/dx,
d²y/dx² = 3 − 2(3x − 2y) = 3 − 6x + 4y.
Implicitly differentiating the equation for d²y/dx² gives
d³y/dx³ = −6 + 4 · dy/dx = −6 + 4(3x − 2y) = 12x − 8y − 6.
Substituting the o.d.e. for dy/dx in the equation for d²y/dx² (and simplifying a lot!) yields
d²y/dx² = −y + 8 cos³(x) − 6 cos(x).
Implicitly differentiating the equation for d²y/dx² gives
d³y/dx³ = −dy/dx − 24 cos²(x) sin(x) + 6 sin(x)
        = y tan(x) + (6 − 26 cos²(x)) sin(x) + sec(x).
Now we are ready. Symbolically, Taylor’s Method of degree 3 is
1 1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi ) + h3 · y 000 (xi , yi )
2 6
xi+1 = xi + h
Beginning with the initial conditions, x0 = 1, y0 = 0,
1 1
y1 = y0 + h · y 0 (x0 , y0 ) + h2 · y 00 (x0 , y0 ) + h3 · y 000 (x0 , y0 )
2 6
= 0 + 0.5(2 cos2 (1) sin(1) − sec(1))
1
+ (0.5)2 · (8 cos3 (1) − 6 cos(1))
2
1
+ (0.5)3 · (sec(1) + (6 − 26 cos2 (1)) sin(1))
6
≈ −0.91657489783846
x1 = x0 + h = 1 + 0.5 = 1.5
Now x0 and y0 can be forgotten as we compute x2 and y2 :
1 1
y2 = y1 + h · y 0 (x1 , y1 ) + h2 · y 00 (x1 , y1 ) + h3 · y 000 (x1 , y1 )
2 6
1
≈ −0.9166 + 0.5f (1.5, −0.9166) + (0.5)2 · y 00 (1.5, −0.9166)
2
1
+ (0.5)3 · y 000 (1.5, −0.9166)
6
≈ −1.3083937870918
x1 = x0 + h = 1.5 + 0.5 = 2.0
Therefore, we have y(2) = −1.3083937870918. If this exercise does not convince you that Taylor’s Methods
of degree higher than 2 are not particularly user-friendly, nothing will!
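Scripting relieves most of the tedium. A minimal sketch (not code from the text) of Taylor's Method of degree 3
for this problem, with yp, ypp, and yppp standing in for y′, y″, and y‴:
yp   = inline('2*cos(x)^2*sin(x)-sec(x)-y*tan(x)');
ypp  = inline('-y+8*cos(x)^3-6*cos(x)');
yppp = inline('y*tan(x)+(6-26*cos(x)^2)*sin(x)+sec(x)');
x = 1; y = 0; h = 0.5;
for i = 1:2
  y = y + h*yp(x,y) + h^2/2*ypp(x,y) + h^3/6*yppp(x,y);
  x = x + h;
end%for
y   % approximately -1.3084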
7: Remember to document your code! In fact, the documentation for a function should almost always be written
before the function itself. Putting down in print exactly what the intended inputs and outputs of the function
will be should help guide how it is written. From the pseudo-code for Euler’s Method, the inputs are the
differential equation ẏ = f (t, y); initial condition y(t0 ) = y0 ; numbers t0 and t1 ; and the number of steps N .
A reasonable comment for the beginning of the function would list all of these inputs and the output, plus
document who wrote it when and for what reason:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 29 January 2012 %
% Purpose: This function implements Euler’s method where the %
% step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The declaration of the function has to have the five inputs as arguments and the output as a return value.
Something like function [y,x] = eulerode(f,a,ya,b,n) should do, where ya of course is the input y(a).
The rest of the function should follow almost verbatim the pseudo-code. I’ve used x instead of t for the
independent variable. eulerode.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 29 January 2012 %
% Purpose: This function implements Euler’s method where the %
% step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerode(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
y(i+1) = y(i) + h*f(x(i),y(i));
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
14c: The equation of motion is s̈ = g(sin α − µ cos α). It is a second order differential equation with dependent
variable s and independent variable t. The g, α, and µ appearing in the equation are constants. We let u = ṡ
so u̇ = s̈ = g(sin α − µ cos α), and the first order system becomes
u̇ = g(sin α − µ cos α)
ṡ = u
14f: The equation of motion is s̈ = (Fapplied/m)(cos β + µ sin β) − µg. It is a second order differential equation with
dependent variable s and independent variable t. The β, m, Fapplied, and µ appearing in the equation are
constants. We let u = ṡ so u̇ = s̈ = (Fapplied/m)(cos β + µ sin β) − µg, and the first order system becomes
u̇ = (Fapplied/m)(cos β + µ sin β) − µg
ṡ = u
u̇ = g(sin α − µ cos α)
ṡ = u
with initial conditions s(0) = 0, ṡ(0) = 0 and parameter values g = 32.2 ft/s2 , µ = .21, α = .25 rad. No
conversion of units is needed. We plug the parameter values into the system to get the initial value problem
u̇ = 1.41462169238826
ṡ = u
u0 = ṡ(0) = 0
s0 = s(0) = 0
In particular,
u1 = u0 + 0.25u̇(u0 , s0 )
= 0 + 0.25(1.41462169238826) ≈ 0.353655423097065
s1 = s0 + 0.25u0
= 0 + 0.25(0) = 0
t1 = t0 + 0.25 = .25
and
u2 = u1 + 0.25u̇(u1 , s1 )
≈ 0.3536 + 0.25(1.414) ≈ 0.7073108461941298
s2 = s1 + 0.25u1
≈ 0 + 0.25(0.3536) ≈ 0.08841385577426622
t2 = t1 + 0.25 = .5
with initial conditions s(0) = 0, ṡ(0) = .03 and parameter values g = 9.81 m/s², µ = .15, β = π/10 rad, m = 35
kg, and Fapplied = 75 N. No conversion of units is needed. We plug the parameter values into the system to
get the initial value problem
u̇ = (75/35)(cos(π/10) + .15 sin(π/10)) − .15(9.81) ≈ 0.6658051402529905
ṡ = u
u0 = ṡ(0) = .03
s0 = s(0) = 0
In particular,
u1 = u0 + 0.25u̇(u0 , s0 )
= .03 + 0.25(0.6658051402529905) ≈ 0.1964512850632476
s1 = s0 + 0.25u0
= 0 + 0.25(.03) = 0.0075
t1 = t0 + 0.25 = .25
and
u2 = u1 + 0.25u̇(u1 , s1 )
≈ 0.1964 + 0.25(0.6658) ≈ 0.3629025701264953
s2 = s1 + 0.25u1
≈ 0.0075 + 0.25(0.1964) ≈ 0.05661282126581191
t2 = t1 + 0.25 = .5
Therefore, s(0.5) ≈ 0.05661282126581191.
15m: The system we are solving is
u̇ = −(c/m)u − g
ṡ = u
with initial conditions s(0) = 2000, ṡ(0) = −55 and parameter values g = 9.81 m/s2 , c = 26, and m = 70
kg. No conversion of units is needed. We plug the parameter values into the system to get the initial value
problem
u̇ = −(26/70)u − 9.81 = −(13/35)u − 9.81
ṡ = u
u0 = ṡ(0) = −55
s0 = s(0) = 2000
Applying Euler’s method to this system means iterating
un+1 = un + h · u̇(un, sn) = un + 0.25(−(13/35)un − 9.81)
sn+1 = sn + h · ṡ(un, sn) = sn + 0.25un
tn+1 = tn + h
In particular,
u1 = u0 + 0.25u̇(u0, s0)
   = −55 + 0.25(−(13/35)(−55) − 9.81) ≈ −52.34535714285715
s1 = s0 + 0.25u0
   = 2000 + 0.25(−55) = 1986.25
t1 = t0 + 0.25 = .25
and
u2 = u1 + 0.25u̇(u1, s1)
   ≈ −52.34 + 0.25(−(13/35)(−52.34) − 9.81) ≈ −49.9372168367347
s2 = s1 + 0.25u1
   ≈ 1986.25 + 0.25(−52.34) ≈ 1973.163660714286
t2 = t1 + 0.25 = .5
Therefore, s(0.5) ≈ 1973.163660714286.
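The same two steps can be carried out in Octave. A minimal sketch (not code from the text):
u = -55; s = 2000; t = 0; h = 0.25;
for i = 1:2
  unew = u + h*(-(13/35)*u - 9.81);
  s = s + h*u;         % uses the old u, matching the hand computation
  u = unew;
  t = t + h;
end%for
s   % approximately 1973.16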
18a: A number of differential equations solution techniques require you to have some idea what the solution will
be before you know exactly what it is. You then take this “rough guess” and refine it by forcing it to solve the
given differential equation. The method of undetermined coefficients is an example of such a technique. We
know the solution will be a linear combination of certain functions, but we don’t know the right coefficients
to use. To find the coefficients, we plug the solution with unknown (undetermined) coefficients into the
differential equation and match the coefficients of like terms. This process leaves us with a linear system of
equations to solve for the unknowns. In this particular example, we are given that y(x) = Ax² + Bx + C is a
solution of y″ + 5y′ − 8y = 3x², and it is our job to figure out the values of A, B, and C. We will find y′ and
y″ and substitute them into the o.d.e.:
y′(x) = 2Ax + B
y″(x) = 2A
Therefore
y″ + 5y′ − 8y = 2A + 5(2Ax + B) − 8(Ax² + Bx + C).
Thus, if we are to have a solution of the o.d.e., we will need
2A + 5(2Ax + B) − 8(Ax² + Bx + C) = 3x²
Simplifying, that is
−8Ax² + (10A − 8B)x + (2A + 5B − 8C) = 3x².
Matching the coefficients of like terms on the left and the right, we have
−8A = 3
10A − 8B = 0
2A + 5B − 8C = 0.
The solution of this system is A = −3/8, B = −15/32, C = −99/256. Hence the solution of the o.d.e. is
y(x) = −(3/8)x² − (15/32)x − 99/256.
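The 3 × 3 linear system can be checked numerically in Octave (a quick check, not part of the text's solution):
>> M=[-8 0 0; 10 -8 0; 2 5 -8];
>> M\[3; 0; 0]   % gives -0.375, -0.46875, -0.38671875, i.e. A, B, C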
18h: A number of differential equations solution techniques require you to have some idea what the solution will
be before you know exactly what it is. You then take this “rough guess” and refine it by forcing it to solve the
given differential equation. The method of undetermined coefficients is an example of such a technique. We
know the solution will be a linear combination of certain functions, but we don’t know the right coefficients to
use. To find the coefficients, we plug the solution with unknown (undetermined) coefficients into the differential
equation and match the coefficients of like terms. This process leaves us with a linear system of equations to
solve for the unknowns. In this particular example, we are given that θ(t) = At cos t+Bt sin t+C cos t+D sin t
is a solution of θ̈ + (1/10)θ̇ + θ = t cos t, and it is our job to figure out the values of A, B, C, and D. We will find
θ̇ and θ̈ and substitute them into the o.d.e.:
θ̇(t) = (D + A) cos(t) + (B − C) sin(t) + Bt cos(t) − At sin(t)
θ̈(t) = (2B − C) cos(t) + (−D − 2A) sin(t) − At cos(t) − Bt sin(t)
Therefore
θ̈ + (1/10)θ̇ + θ = (2B − C) cos(t) + (−D − 2A) sin(t) − At cos(t) − Bt sin(t)
                 + (1/10)((D + A) cos(t) + (B − C) sin(t) + Bt cos(t) − At sin(t))
                 + At cos t + Bt sin t + C cos t + D sin t
Simplifying, that is
θ̈ + (1/10)θ̇ + θ = ((1/10)D + 2B + (1/10)A) cos(t)
                 + ((1/10)(B − C) − 2A) sin(t)
                 + (1/10)Bt cos(t)
                 − (1/10)At sin(t)
Section 6.3
1a: Each o.d.e. solver has the form
It is the integration formula that gives us the weighted average. In this case, the formula
(h/4)[f(x0) + 3f(x0 + (2/3)h)]
tells us to average f(x0), the value of f at the first node, with f(x0 + (2/3)h) in a 1 : 3 ratio. That is, we sum one
f(x0) with three f(x0 + (2/3)h) and divide by 4. Unfortunately, we are using f here in two different settings. The
f in an o.d.e. solver is not the same f used in deriving the integration formulas. The f from the integration
formulas is a function of one variable, x. The f we need in an o.d.e. solver is a function of two variables, t
and y. Nevertheless, they play the same role. They each hold the values of the function we are integrating. If
we need to sum one f(x0) with three f(x0 + (2/3)h) in the integration formula, then we need to sum one f(ti, yi)
with three f (ti+2/3 , yi+2/3 ) in the o.d.e. solver. Generally, f (x0 + αh) in an integration formula translates to
f (ti+α , yi+α ) in the o.d.e. solver as long as the integration formula is written for an interval of length h.
Each o.d.e. solver begins with k1 = f (ti , yi ) where (ti , yi ) is the last point approximated. Each successive
value in the o.d.e. solver is obtained by using Euler’s method with initial condition (starting point) (ti , yi ).
For this particular integration formula, there is only one node other than x0 , so we will need only one more
stage. We approximate yi+2/3 by yi + (2h/3)k1 (Euler's method using starting point (ti, yi) and approximate
slope k1). This makes k2 = f(ti + 2h/3, yi + (2h/3)k1). The final step is to compute the weighted average. As
discussed, we need to sum one k1 with three k2 and divide by 4. In summary, the o.d.e. solver suggested by
this integration formula is
k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2].
It is the integration formula that gives us the weighted average. In this case, the formula
(h/4)[3f(x0 + (1/3)h) + f(x0 + h)]
tells us to average f(x0 + (1/3)h), the value of f at the first node, with f(x0 + h) in a 3 : 1 ratio. That is, we
sum three f(x0 + (1/3)h) with one f(x0 + h) and divide by 4. Unfortunately, we are using f here in two different
settings. The f in an o.d.e. solver is not the same f used in deriving the integration formulas. The f from
the integration formulas is a function of one variable, x. The f we need in an o.d.e. solver is a function of
two variables, t and y. Nevertheless, they play the same role. They each hold the values of the function we
are integrating. If we need to sum three f(x0 + (1/3)h) with one f(x0 + h) in the integration formula, then we
need to sum three f (ti+1/3 , yi+1/3 ) with one f (ti+1 , yi+1 ) in the o.d.e. solver. Generally, f (x0 + αh) in an
integration formula translates to f (ti+α , yi+α ) in the o.d.e. solver as long as the integration formula is written
for an interval of length h.
Each o.d.e. solver begins with k1 = f (ti , yi ) where (ti , yi ) is the last point approximated. Each successive
value in the o.d.e. solver is obtained by using Euler’s method with initial condition (starting point) (ti , yi ).
For this particular integration formula, there are two nodes other than x0 , so we will need two more stages.
We approximate yi+1/3 by yi + (h/3)k1 (Euler's method using starting point (ti, yi) and approximate slope k1).
This makes k2 = f(ti + h/3, yi + (h/3)k1). We then approximate yi+1 by yi + hk2 (Euler's method using starting
point (ti , yi ) and approximate slope k2 ). The final step is to compute the weighted average. As discussed, we
need to sum three k2 with one k3 and divide by 4. In summary, the o.d.e. solver suggested by this integration
formula is
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h, yi + hk2)
yi+1 = yi + (h/4)[3k2 + k3].
2a: We will modify the test code from the text in two essential ways.
1. The o.d.e. solver will be changed to the one derived in 1a:
k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2]
2. An extra loop will be added so it approximates y(2) for a number of step sizes.
These modifications will make it a simple matter to determine the rate of convergence.
t0=4;
h=-1/4;
n=8;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
y0=20;
disp(’ h y Error’)
disp(’ ------------------------------------’)
for j=1:6
t=t0;
y=y0;
for i=1:n
k1=f(t,y);
k2=f(t+2*h/3,y+2*h/3*k1);
y=y+h/4*(k1+3*k2);
t=t+h;
end%for
x=exact(t);
sprintf(’%12.5g%12.5g%12.5g’,h,y,abs(y-x))
n=n*2;
h=h/2;
end%for
h y Error
------------------------------------
ans = -0.25 9.9391 0.060922
ans = -0.125 9.9846 0.015433
ans = -0.0625 9.9961 0.0038827
ans = -0.03125 9.999 0.00097364
ans = -0.015625 9.9998 0.00024378
ans = -0.0078125 9.9999 6.099e-05
The ratio of the step size on one line to the next is 1/2, and the ratio of consecutive errors is about 1/4 = (1/2)²,
so it appears the o.d.e. solver has rate of convergence O(h²). The integration method has rate of convergence
O(h⁴) so we would expect the o.d.e. solver to be O(h³). Our experiment does not show the expected rate of
convergence.
2e: An extra loop will be added so it approximates y(2) for a number of step sizes.
These modifications will make it a simple matter to determine the rate of convergence.
t0=4;
h=-1/4;
n=8;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
y0=20;
disp(’ h y Error’)
disp(’ ------------------------------------’)
for j=1:6
t=t0;
y=y0;
for i=1:n
k1=f(t,y);
k2=f(t+h/3,y+h/3*k1);
k3=f(t+h,y+h*k2);
y=y+h/4*(3*k2+k3);
t=t+h;
end%for
x=exact(t);
sprintf(’%12.5g%12.5g%12.5g’,h,y,abs(y-x))
n=n*2;
h=h/2;
end%for
h y Error
------------------------------------
ans = -0.25 9.9697 0.03027
8a: The Octave function we wrote to implement Euler’s method takes 5 arguments. As explained in the comment
preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> eulerode(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05000000000000 1.10250000000000 1.15725000000000
Columns 5 through 8:
1.21402500000000 1.27262250000000 1.33286025000000 1.39457422500000
Columns 9 through 12:
1.45761680250000 1.52185512225000 1.58716961002500 1.65345264902250
Columns 13 through 16:
1.72060738412025 1.78854664570823 1.85719198113740 1.92647278302366
Columns 17 through 20:
1.99632550472130 2.06669295424917 2.13752365882425 2.20877129294183
Column 21:
2.28039416364764
The value in Column 21 is the desired result, so y(2) ≈ 2.28039416364764. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.20877129294183. Use [y,x]=eulerode(f,1,1,2,20)
to see all the corresponding x-coordinates.
8d: The Octave function we wrote to implement Euler’s method takes 5 arguments. As explained in the comment
preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> eulerode(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.063348127711403 -0.133556806761731
Columns 4 through 6:
-0.210091730766547 -0.292335849279218 -0.379594108676440
Columns 7 through 9:
-0.471098428249811 -0.566012332190405 -0.663433947473280
Columns 10 through 12:
-0.762393924730387 -0.861836463006993 -0.960521838453174
Columns 13 through 15:
-1.055901027787366 -1.150767311038156 -1.243138035592362
Columns 16 through 18:
-1.331810188637979 -1.415726818259857 -1.493905125626401
Columns 19 through 21:
-1.565422860316011 -1.629418404020635 -1.685095172485204
The value in Column 21 is the desired result, so y(2) ≈ −1.685095172485. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.629418404020. Use [y,x]=eulerode(f,1,0,2,20)
to see all the corresponding x-coordinates.
9a: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for trapezoidal-ode has been written and looks like
function [y,x] = trapode(f,a,ya,b,n)
The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:
>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> trapode(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002583 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552
The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=trapode(f,1,1,2,20)
to see all the corresponding x-coordinates.
9d: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for trapezoidal-ode has been written and looks like
function [y,x] = trapode(f,a,ya,b,n)
The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:
>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> trapode(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066778403380866 -0.139846898631295
Columns 4 through 6:
-0.218610595984683 -0.302399307505556 -0.390473688925680
Columns 7 through 9:
-0.482031924143591 -0.576216643912361 -0.672121275727591
Columns 10 through 12:
-0.768792826665983 -0.865212265743696 -0.959857757799220
Columns 13 through 15:
-1.056576584732967 -1.151350240932434 -1.242238115924874
Columns 16 through 18:
-1.328187356783625 -1.408239476567505 -1.481492346014993
Columns 19 through 21:
-1.547099820528092 -1.604277373646634 -1.652308958787397
The value in Column 21 is the desired result, so y(2) ≈ −1.652308958787. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.604277373646. Use [y,x]=trapode(f,1,0,2,20)
to see all the corresponding x-coordinates.
10a: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for clopen-ode has been written and looks like
function [y,x] = clopen(f,a,ya,b,n)
The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:
>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> clopen(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05120833333333 1.10468084027778 1.16020204697801
Columns 5 through 8:
1.21757698550727 1.27662924238649 1.33719919281938 1.39914240296940
The value in Column 21 is the desired result, so y(2) ≈ 2.28383076622349. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21238894775115. Use [y,x]=clopen(f,1,1,2,20)
to see all the corresponding x-coordinates.
10d: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for clopen-ode has been written and looks like
function [y,x] = clopen(f,a,ya,b,n)
The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:
>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> clopen(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066674788135152 -0.139650010793905
Columns 4 through 6:
-0.218333343735571 -0.302057681694326 -0.390087513340042
Columns 7 through 9:
-0.481626032825074 -0.575822930559361 -0.671782830658695
Columns 10 through 12:
-0.768574489070735 -0.865241984556076 -0.960839121780159
Columns 13 through 15:
-1.051332254162207 -1.136768664871208 -1.218181121459446
Columns 16 through 18:
-1.294632701999881 -1.365219285669536 -1.429077386836689
Columns 19 through 21:
-1.485393339498179 -1.533411938658838 -1.572444496803329
The value in Column 21 is the desired result, so y(2) ≈ −1.572444496803329. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.533411938658838. Use [y,x]=clopen(f,1,0,2,20)
to see all the corresponding x-coordinates.
11a: The Octave function we wrote to implement the midpoint method takes 5 arguments. As explained in the
comment preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> midpoint(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002582 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552
The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=midpoint(f,1,1,2,20)
to see all the corresponding x-coordinates.
11d: The Octave function we wrote to implement the midpoint method takes 5 arguments. As explained in the
comment preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> midpoint(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066766774094073 -0.139831999606821
Columns 4 through 6:
-0.218600428030388 -0.302401486830318 -0.390495486389841
Columns 7 through 9:
-0.482080439082276 -0.576299298636036 -0.672247230148908
Columns 10 through 12:
-0.768977840728485 -0.865503930033315 -0.960754716787988
Columns 13 through 15:
-1.057757600117324 -1.154510687305015 -1.247336119828964
Columns 16 through 18:
-1.335197000042218 -1.417135309027307 -1.492245593754752
Columns 19 through 21:
-1.559677244661507 -1.618640905170988 -1.668415622421331
The value in Column 21 is the desired result, so y(2) ≈ −1.668415622421. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.618640905170. Use [y,x]=midpoint(f,1,0,2,20)
to see all the corresponding x-coordinates.
12a: The Octave function we wrote to implement Ralston’s method takes 5 arguments. As explained in the
comment preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> ralston(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002583 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552
The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=ralston(f,1,1,2,20)
to see all the corresponding x-coordinates.
12d: The Octave function we wrote to implement Ralston’s method takes 5 arguments. As explained in the
comment preceding the function declaration,
they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:
>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> ralston(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066770300283373 -0.139836235303672
Columns 4 through 6:
-0.218602682516778 -0.302399209595394 -0.390486264578185
Columns 7 through 9:
-0.482061961403970 -0.576269242338981 -0.672202937226713
Columns 10 through 12:
-0.768915255605024 -0.865412688887274 -0.960565629810385
Columns 13 through 15:
-1.056164950925061 -1.150218643526626 -1.240368616917767
Columns 16 through 18:
-1.325575733886901 -1.404886704290576 -1.477402196316258
Columns 19 through 21:
-1.542278061151791 -1.598731451393269 -1.646047861531770
The value in Column 21 is the desired result, so y(2) ≈ −1.6460478615317. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.598731451393. Use [y,x]=ralston(f,1,0,2,20)
to see all the corresponding x-coordinates.
Section 6.4
1a: The o.d.e. solver previously derived is
k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2],
so δ2 = 2/3, α1 = 1/4, and α2 = 3/4 (with the values for the absent third stage taken to be zero). Checking
the conditions:
1/4 + 3/4 + 0 = 1
(3/4)(2/3) + 0 · 0 = 1/2
(3/4)(2/3)² + 0 · 0² = 1/3
0 · (2/3) · 0 ≠ 1/6.
Since the only unsatisfied equation was derived from h³ terms, we conclude that this method has local
truncation error O(h³). The integration formula from which it was derived has local truncation error O(h⁴),
so the o.d.e. solver is not quite as accurate as the integration formula. However, local truncation error O(h³)
is consistent with the experimentally determined O(h²) rate of convergence. In fact, it is this local truncation
error that leads to the O(h²) rate of convergence.
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h, yi + hk2)
yi+1 = yi + (h/4)[3k2 + k3].
Checking the conditions:
0 + 3/4 + 1/4 = 1
(3/4)(1/3) + (1/4) · 1 = 1/2
(3/4)(1/3)² + (1/4) · 1² = 1/3
(1/4)(1/3) · 1 ≠ 1/6.
Since the only unsatisfied equation was derived from h³ terms, we conclude that this method has local
truncation error O(h³). The integration formula from which it was derived has local truncation error O(h⁴),
so the o.d.e. solver is not quite as accurate as the integration formula. However, local truncation error O(h³)
is consistent with the experimentally determined O(h²) rate of convergence. In fact, it is this local truncation
error that leads to the O(h²) rate of convergence.
2: From the initial value problem, f (t, y) = ty and y(1) = 12 . For the o.d.e. solver, this means t0 = 1 and y0 = 12 .
To compute y(2) in one step, h = 1 and
k1 = f(ti, yi) = 1 · (1/2) = 1/2
k2 = f(ti + (1/2)h, yi + (1/2)hk1) = (1 + (1/2) · 1)((1/2) + (1/2) · 1 · (1/2)) = (3/2)(3/4) = 9/8
k3 = f(ti + (1/2)h, yi + (1/2)hk2) = (1 + (1/2) · 1)((1/2) + (1/2) · 1 · (9/8)) = (3/2)(17/16) = 51/32
k4 = f(ti + h, yi + hk3) = (1 + 1)((1/2) + 1 · (51/32)) = 2 · (67/32) = 67/16
y1 = y0 + (1/6)h(k1 + 2k2 + 2k3 + k4)
   = 1/2 + (1/6) · 1 · (1/2 + 2 · (9/8) + 2 · (51/32) + 67/16)
   = 35/16 = 2.1875
t1 = t0 + h = 1 + 1 = 2
Thus y(2) ≈ 2.1875. Euler's method with two steps yielded y(2) ≈ 1.3125. Since the exact solution is
y(2) = e^(3/2)/2 ≈ 2.240844535169032, RK4 did a much better job in one step than did Euler's method in two
steps. Incidentally, even four steps of Euler’s method (which means 4 function evaluations—just as many as
one step of RK4), yields y(2) ≈ 1.621398925781250.
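The single RK4 step can be double-checked at the Octave prompt (a sketch, not part of the text's solution):
>> f=inline('t*y');
>> t=1; y=1/2; h=1;
>> k1=f(t,y); k2=f(t+h/2,y+h/2*k1); k3=f(t+h/2,y+h/2*k2); k4=f(t+h,y+h*k3);
>> y+h/6*(k1+2*k2+2*k3+k4)   % 2.1875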
Section 6.5
4: The blanks in the table are to be read as zeros, so β11 = β12 = 0, for example. The only non-zero value for the
βij is β21 = 1. The values in the left column are the δi , so δ2 = 1. The values in the bottom row are the αi ,
so α1 = α2 = 1/2. In summary,
δ2 = 1, β21 = 1, α1 = α2 = 1/2.
Because the tableau has two rows above the row of αi , it is a two-stage method. Therefore, the method takes
the form
k1 = f (ti , yi )
k2 = f (ti + δ2 h, yi + β21 hk1 )
yi+1 = yi + h[α1 k1 + α2 k2 ].
See equation 6.5.2. Plugging in the parameter values, this tableau represents the method
k1 = f (ti , yi )
k2 = f (ti + h, yi + hk1 )
yi+1 = yi + h[(1/2)k1 + (1/2)k2].
This last equation simplifies to yi+1 = yi + (h/2)[k1 + k2]. These equations are exactly those in equation 6.3.3,
trapezoidal-ode, or the improved Euler method.
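For reference, a minimal Octave sketch of the two-stage method this tableau encodes, written in the style of the
text's other solvers (this particular listing is not from the text):
function [y,x] = twostage(f,a,ya,b,n)
  % two-stage method from the tableau: delta2=1, beta21=1, alpha1=alpha2=1/2
  x(1) = a; y(1) = ya; h = (b-a)/n;
  for i = 1:n
    k1 = f(x(i),y(i));
    k2 = f(x(i)+h,y(i)+h*k1);
    y(i+1) = y(i) + h/2*(k1+k2);
    x(i+1) = a + (b-a)*i/n;
  end%for
end%function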
6b: First, decoding the table into the form 6.5.2, we see this is a 4-stage method with formula
k1 = f(ti, yi)
k2 = f(ti + (2/7)h, yi + (2/7)hk1)
k3 = f(ti + (4/7)h, yi − (8/35)hk1 + (4/5)hk2)
k4 = f(ti + (6/7)h, yi + (29/42)hk1 − (2/3)hk2 + (5/6)hk3)
yi+1 = yi + h[(1/6)k1 + (1/6)k2 + (5/12)k3 + (1/4)k4].
Code similar to the samples in sections 6.3 and 6.4 might look like thirdOrder.m, which may be downloaded
at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements a 3rd order Runge-Kutta %
% method where the step size is calculated and held %
% constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = thirdOrder(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+2*h/7, y(i)+2*h/7*k1);
k3 = f(x(i)+4*h/7, y(i)+h/35*(-8*k1+28*k2));
k4 = f(x(i)+6*h/7, y(i)+h/42*(29*k1-28*k2+35*k3));
y(i+1) = y(i) + h/12*(2*k1+2*k2+5*k3+3*k4);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
Using this function to approximate y(2), which we know has exact value 10, with various step sizes yields
>> format(’long’)
>> f=inline(’-y/t+t^2’)
f = f(t, y) = -y/t+t^2
>> [y,x]=thirdOrder(f,4,20,2,5);
>> abs(10-y(length(y)))
ans = 4.14600417808941e-04
>> [y,x]=thirdOrder(f,4,20,2,10);
>> abs(10-y(length(y)))
ans = 5.20403883292886e-05
>> [y,x]=thirdOrder(f,4,20,2,20);
>> abs(10-y(length(y)))
ans = 6.48395888624975e-06
>> [y,x]=thirdOrder(f,4,20,2,40);
>> abs(10-y(length(y)))
ans = 8.08029787080500e-07
Since the number of steps is doubling from one call of thirdOrder to the next, the step size is halving. As
the step size is halved, the error is decreasing by a factor of 8, or by (1/2)³, lending numerical evidence that
the rate of convergence is O(h³).
10: First, decoding the table into the form 6.5.2, we see the embedded methods have 5 and 4 stages with formulas
k1 = f(ti, yi)
k2 = f(ti + (1/4)h, yi + (1/4)hk1)
k3 = f(ti + (3/4)h, yi − (9/4)hk1 + 3hk2)
k4 = f(ti + (1/2)h, yi + (1/18)hk1 + (5/12)hk2 + (1/36)hk3)
k5 = f(ti + h, yi + (7/9)hk1 − (5/3)hk2 − (1/9)hk3 + 2hk4)
{first method} yi+1 = yi + h[(1/6)k1 + (2/3)k4 + (1/6)k5]
{second method} yi+1 = yi + h[(7/9)k1 − (5/3)k2 − (1/9)k3 + 2k4].
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements an adaptive rk3(4) method of %
% Butcher where the step size is controlled by the routine. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
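The body of the routine is not reproduced above. What follows is only a sketch of one way such an adaptive
driver could be organized, using the embedded pair just derived and one common step-size update rule; the
function name rk34sketch and the update constants are assumptions made here for illustration, not the text's
routine:
function [y,x] = rk34sketch(f,a,ya,b,h,eps,N)
  i = 1; x(1) = a; y(1) = ya;
  steps = 0;
  while (x(i) < b && steps < N)
    steps = steps + 1;
    if (x(i) + h > b)
      h = b - x(i);               % do not step past b
    end%if
    k1 = f(x(i), y(i));
    k2 = f(x(i)+h/4,   y(i)+h*k1/4);
    k3 = f(x(i)+3*h/4, y(i)-9*h*k1/4+3*h*k2);
    k4 = f(x(i)+h/2,   y(i)+h*(k1/18+5*k2/12+k3/36));
    k5 = f(x(i)+h,     y(i)+h*(7*k1/9-5*k2/3-k3/9+2*k4));
    y1 = y(i) + h*(k1/6+2*k4/3+k5/6);           % first (higher order) estimate
    y2 = y(i) + h*(7*k1/9-5*k2/3-k3/9+2*k4);    % second (lower order) estimate
    err = abs(y1 - y2);
    if (err <= eps)               % accept the step
      x(i+1) = x(i) + h;
      y(i+1) = y1;
      i = i + 1;
    end%if
    if (err == 0)                 % adjust the step size whether accepted or not
      h = 2*h;
    else
      h = h*min(2, max(0.1, 0.9*(eps/err)^(1/4)));
    end%if
  end%while
end%function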
12b: The method of exercise 6c shares the first three stages with this method. All we need to do is append the
line of αi values from that table to this one, noting that we need to add a zero at the end:

  0    |
  1/2  |  1/2
  3/4  |  0     3/4
  1    |  2/9   1/3   4/9
  -----+------------------------
       |  7/24  1/4   1/3   1/8
       |  2/9   1/3   4/9   0
15a: There are two difficulties with this problem. The more straightforward of the two is knowing what the error
of the approximation really is. This o.d.e. is not solvable exactly, so we can’t compute the exact solution. We
can certainly run the method with a tolerance of 10−4 , but this is only a local truncation error. It does not
necessarily translate into any estimate of the global error (the total accumulated error at the last step). Often
times, they will be similar in magnitude, but there is far from any guarantee of it. In any case, here are the
results of running the method with initial step size 1/10 and tolerance 10⁻⁴:
>> f=inline(’(x+2*exp(y)*cos(exp(x)))/(1+exp(y))’)
f = f(x, y) = (x+2*exp(y)*cos(exp(x)))/(1+exp(y))
>> [y,x]=rk23(f,0,2,4,1/10,1e-4,100000);
>> y(length(y))
ans = 2.37564101044550
[Figure 6.5.1: global error vs. tolerance]
>> length(y)
ans = 152
suggesting that y(4) ≈ 2.37564. Though we should have some confidence that this is a reasonable estimate
(say with error no more than 10−2 ), we should certainly not claim that the error is less than, or really all that
close to 10−4 . The algorithm took 152 steps to arrive at the result, so the error had a chance to accumulate.
If it is extremely important to know that the estimate is accurate to the nearest 10−4 or better, it could be
compared to a second run with a smaller tolerance:
>> [y,x]=rk23(f,0,2,4,1/10,1e-5,100000);
>> y(length(y))
ans = 2.37616344347848
The difference between the estimates is about 5.22(10)−4 . This would suggest that the error in the first
estimate is likely a bit more than 10−4 . But even this evidence is far from iron-clad. The second difficulty
is that small adjustments in the tolerance can lead to large changes in the global error. Global error as a
function of tolerance is very rough and discontinuous (see Figure 6.5.1). The oscillatory nature of the solution
exacerbates this problem with adaptive Runge-Kutta methods. If the global error scaled perfectly with the
truncation error, Figure 6.5.1 would show a perfectly straight line parallel to the line y = x, shown in red.
This figure shows that most tolerances between 10−5 and 10−3 would suffice to give a global error of 10−4 or
less, though there are some exceptions, most notably one right around 10−4 . Figure 6.5.2 shows the solution
over the interval [0, 4], illustrating its oscillations. Generally speaking, comparing multiple approximations
using different tolerances is not how global error is controlled. Global error can be reasonably well controlled
by scaling the tolerance relative to the step size as the solution progresses or using relative errors instead of
absolute. Either way, this concern adds another layer of complexity to the method.
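For example, a sketch of the first idea, scaling the tolerance relative to the step size: only the step-acceptance lines of a routine like the rk23.m printed with the Section 6.5 answers would change, and the particular scaling below is an assumption, not code from the text.
% compare err against a per-step tolerance proportional to the fraction
% of [a,b] that the step covers, so local errors can sum to roughly eps
localeps = eps*h/(b-a);
if (done || err<=localeps)
  y(i+1) = y(i) + h/4*(k1+3*k3);
  t(i+1) = t(i) + h;
  i = i+1;
endif
q = 0.9*realpow(localeps/err,1/3);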
16a: There are two difficulties with this problem. The more straightforward of the two is knowing what the error
of the approximation really is. This o.d.e. is not solvable exactly, so we can’t compute the exact solution. We
can certainly run the method with a tolerance of 10−4 , but this is only a local truncation error. It does not
necessarily translate into any estimate of the global error (the total accumulated error at the last step).
[Figure 6.5.2: the solution over the interval [0, 4], showing its oscillations.]
Oftentimes, they will be similar in magnitude, but there is far from any guarantee of it. In any case, here are the
results of running the method with initial step size 1/10 and tolerance 10−4 :
>> f=inline(’(x^2+y)/(x-y^2)’)
f = f(x, y) = (x^2+y)/(x-y^2)
>> [y,x]=rk23(f,0,5,3,1/10,1e-4,100000);
>> y(length(y))
ans = 3.66765768487404
>> length(y)
ans = 17
suggesting that y(3) ≈ 3.66765. Though we should have some confidence that this is a reasonable estimate
(say with error no more than 10−2 ), we should certainly not claim that the error is less than, or really all
that close to 10−4 . The algorithm took 17 steps to arrive at the result, so the error had a small chance to
accumulate. If it is extremely important to know that the estimate is accurate to the nearest 10−4 or better,
it could be compared to a second run with a smaller tolerance:
>> [y,x]=rk23(f,0,5,3,1/10,1e-5,100000);
>> y(length(y))
ans = 3.66757804370410
The difference between the estimates is about 7.96(10)−5 . This would suggest that the error in the first
estimate is likely right around 10−4 . But even this evidence is far from iron-clad. The second difficulty
is that small adjustments in the tolerance can lead to large changes in the global error. Global error as a
function of tolerance is rough and discontinuous (see Figure 6.5.3). If the global error scaled perfectly with
the truncation error, Figure 6.5.3 would show a perfectly straight line parallel to the line y = x, shown in red.
This figure shows that most tolerances between 10−5 and 10−3 would suffice to give a global error of 10−4 or
less, though there may be some exceptions not plotted. Figure 6.5.4 shows the solution over the interval [0, 3].
Generally speaking, comparing multiple approximations using different tolerances is not how global error is
controlled. Global error can be reasonably well controlled by scaling the tolerance relative to the step size
as the solution progresses or using relative errors instead of absolute. Either way, this concern adds another
layer of complexity to the method.
[Figure 6.5.3: global error as a function of tolerance.]
[Figure 6.5.4: the approximate solution over the interval [0, 3].]
Answers to Selected Exercises
Section 1.1
10e: 0.83333
24a: (i) 8.99999974990351 (ii) 2.5009649(10)−7 (iii) 2.7788499(10)−8 (iv) (10)−14 (v) 2.5009647(10)−7
Section 1.2
1f: T3(x) = x^2. R3(x) = [ξ sin(ξ) − 4 cos(ξ)]x^4/24.
9d: 10.760
12a: ξ(π) = cos^(−1)((12π^2 − 48)/π^4) ≈ 0.7625.
Section 1.3
1d: α = 1
6f: O(1/n)
6h: O(1/√n)
6n: O(1/n)
19e: 4 iterations
Section 1.4
7: (a) 1 more than 4 times the number required for the 2^(n−1) × 2^(n−1) grid. (b) 0 (c) 0
Section 2.1
4c: In 27 iterations, we get 0.666666664928, which is within 10−8 of an actual root.
10: 37π/2
16: 33
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by on %
% Purpose: implementation of the collatz function %
% INPUT: integer n %
% OUTPUT: n/2 or 3n+1 depending on whether n is %
% even or odd %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res=collatz(n)
if (ceil(n/2)==n/2)
res=n/2
else
res=3*n+1
end%if
end%function
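For instance (an illustration only; the starting value 7 is arbitrary and not taken from any exercise), collatz can be applied repeatedly until it reaches 1 while counting the steps:
n = 7;         % arbitrary starting value
count = 0;
while (n != 1)
  n = collatz(n);
  count = count + 1;
end%while
disp(count)    % number of applications of collatz needed to reach 1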
√
25: (a) 20π
Section 2.2
2d: (i) The hypotheses of the MVT are met. (ii) c ≈ −2.540793513382845.
2h: (i) The hypotheses of the MVT are met. (ii) c ≈ 17.41987374102208.
3c: −2 and 5
3d: −1 and −1/3
q q
4−3x2 4−3x5
4c: f1 (x) = 5
2 and f2 (x) = 6 . There are many others.
6c: 1.79047660196506
18: (a) 15 (b) The equations g(x) = x and f (x) = x are equivalent.
23: −1/4
Section 2.3
10: (a) 15. HINT: It is valid to bound the derivative over the interval [1.618033988749895, 2.5] instead of the entire
interval [.5, 3.5]. Why? On the other hand, if you do consider the whole interval [.5, 3.5], you get a bound of
43. (b) It actually takes 15 iterations.
13: a1 ≈ 1.942415717 and a2 ≈ 1.623271404
14: 2.732050809. HINT: use f(x) = (2x^3 + 4x^2 − 4x − 4)^(1/4). Why?
15: a0 = 3, a1 = 3/2, and a2 = 4/3
18: No. Aitken’s delta-squared method is designed to speed up linearly convergent sequences, not superlinearly
convergent sequences.
21: a1 ≈ 2.152904629 and a2 ≈ 1.873464044
23: 3/2 or 0
24: x̂ ≈ 5.259185715
Section 2.4
4c: Using x0 = 2 and x1 = 3, we find x8 = 1.47883214766643.
4d: Using x0 = 3 and x1 = 4, we find x10 = 0.948434069243393.
5c: Using x0 = 2.5, we find x6 = 1.47883214733021.
5d: Using x0 = 3.5, we find x7 = 0.948434069919634.
6c: Using x0 = 2.5, we find x18 = 0.948434068437721.
6d: Using x0 = 3.5, we find x15 = 0.948434069313413.
7c: Using x0 = 2 and x1 = 3, we find x10 = 1.47883214733021. The difference between x10 and x8 is about
3.3(10)−10 , so x8 was indeed accurate to within 10−5 .
7d: Using x0 = 3 and x1 = 4, we find x12 = 0.948434069919636. The difference between x12 and x10 is about
6.7(10)−10 , so x10 was indeed accurate to within 10−5 .
9b: x14 = 0.580055888962675.
15b: x14 = 0.580055888962675. This is different from 0. Why?
Section 2.5
2: f and (a), g and (d), h and (b), l and (c)
8: f and (b), g and (c), h and (d), l and (a)
Section 2.6
6b: g(2) = 5 and g'(2) = −8
8b: x1 = 21/8 and x2 = 241003/100544
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 15 January 2014 %
% Purpose: Implementation of Newton’s Method %
% for polynomials of the form %
% p(x) = c1 + c2*x + c3*x^2 + ... + c(n+1)*x^n %
% using Horner’s Method, n > 1. %
% INPUT: coefficients c; initial value x0; tolerance tol; %
%        maximum number of iterations N                   %
% OUTPUT: approximations to all roots, roots %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function roots = newthornall(c,tol,N,x0)
n=length(c)-1;
for i=1:n-2
res=newtonhorner(c,x0,tol,N)
roots(i)=res;
x0=roots(i);
c=deflate(c,x0);
end%for
[roots(n-1),roots(n)]=quadraticRoots(c(3),c(2),c(1));
end%function
Remark: This code is often successful, but can easily come up empty. For example,
newthornall([56,-152,140,-17,-48,9],1e-5,100,2)
returns
res = 0.763932022500484
res = 5.23606797749979
res = Method failed---maximum number of iterations reached
error: newthornall: A(I) = X: X must have the same size as I
error: called from:
error: .../newthornall.m at line 16, column 13
It fails to come up with the third real root, −2. After finding the first two roots, the polynomial has
been deflated to
14.00000000000065 − 16.99999999999987x + 6.00000000000002x^2 + 9.00000000000000x^3.
With this cubic and initial value 5.23606797749979, Newton’s method does not converge to −2. On the
other hand, newthornall([56,-152,140,-17,-48,9],1e-5,100,-2) returns
res = -2
res = 0.763932022500211
res = 5.23606797749979
ans =
Columns 1 and 2:
  -2.000000000000000 + 0.000000000000000i   0.763932022500211 + 0.000000000000000i
Columns 3 and 4:
   5.236067977499790 + 0.000000000000000i   0.666666666666667 + 0.577350269189623i
Column 5:
   0.666666666666667 - 0.577350269189623i
Having found −2 first, it has no problem finding the other roots.
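newthornall relies on the helper functions newtonhorner, deflate, and quadraticRoots, which are not listed here. For reference, one way deflation by synthetic division might be written (a sketch for illustration, not necessarily the deflate used above):
function cnew = deflate(c,r)
  % c holds the coefficients of c1 + c2*x + ... + c(n+1)*x^n; r is a root.
  % Synthetic division by (x-r) gives the coefficients of the quotient.
  n = length(c)-1;
  b(n) = c(n+1);
  for k = n-1:-1:1
    b(k) = c(k+1) + r*b(k+1);
  end%for
  cnew = b;
end%function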
21: (a)
1.5858
−13
4.4142
−2 + 2.2361i
−2 − 2.2361i
(b)
3 − 1.4142i
−2.6
−2 + 2.2361i
−2 − 2.2361i
3 + 1.4142i
Section 2.7
1: (a) x4 = 2.1806 (e) x10 = −502.19 (j) x3 = 1.0079
8: (a), (e), and (j): Bracketed inverse quadratic interpolation is at least as fast or faster than false position or
bracketed Newton’s method.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 15 January 2014 %
% Purpose: Implementation of bracketed Steffensen’s method %
% INPUT: function f; interval endpoints a and b; tolerance %
% TOL; maximum iterations N0 %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedSteffensens(f,a,b,TOL,N0)
i=1;
A=f(a);
B=f(b);
while (i<=N0)
b
x0=b;
x1=B;
x2=f(x1);
if (abs(x2-x1)<TOL)
x=x2;
disp(" ");
return
end%if
x=x0-(x1-x0)^2/(x2-2*x1+x0);
if (x<min([a,b]) || x>max([a,b]))
x=a+(b-a)/2;
end%if
if (abs(x-x2)<TOL)
disp(" ");
return
end%if
X=f(x);
if ((B<b && X>x) || (B>b && X<x))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function
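A hypothetical call (not one of the exercises) shows the input and output; cos(x) has a fixed point near 0.739:
f = inline('cos(x)');
[x,i] = bracketedSteffensens(f,0,1,1e-10,100);
x   % approximately 0.7390851, the fixed point of cos(x)
i   % number of iterations used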
11: (a), (e), and (j): Bracketed inverse quadratic interpolation is at least as fast or faster than bracketed Steffensen’s
method, counting only number of iterations. However, bracketed Steffensen’s requires two function evaluations
per iteration, so for all practical purposes requires more than twice the computational power of bracketed
inverse quadratic interpolation.
Section 3.2
15: 4
16: 3, (18 + √142)/7, or (18 − √142)/7
21: 8
Section 3.3
5: P2 (x) = −0.001642458785316x2 + 1.64927376355948x + 10
8: P3 (x) = 2x − 1. Is degree 1 what you expected?
14: (a) (3/40000)f^(4)(ξ_8.4) (b) 8.7364(10)−5 max{f^(4)(x) : x ∈ [8.1, 8.7]} (c) .52501
Section 4.1
2cc: f'(x0 + h/2) ≈ [f(x0 + h) − f(x0)]/h
12a: −1
12c: −23
13b: f'(x0 + 3h) ≈ [7f(x0 + 2h) − 15f(x0) + 8f(x0 − h)]/(6h)
13f: f'(x0 − h) ≈ [−f(x0 + 2h) + 9f(x0) − 8f(x0 − h)]/(6h)
15a: ∫[x0, x0+2h] f(x)dx ≈ (h/2)[f(x0) + 3f(x0 + 4h/3)]
15e: ∫[x0, x0+h] f(x)dx ≈ (h/2)[f(x0) + f(x0 + h)]
Section 4.2
1: (b) f'(x0 + h/4) ≈ [f(x0 + h) − f(x0)]/h
(f) f'(x0 − h) ≈ [−3f(x0 − h) + 4f(x0) − f(x0 + h)]/(2h)
(h) f'(x0 − h) ≈ [−7f(x0 − h) + 16f(x0 + 2h) − 9f(x0 + 3h)]/(12h)
(l) f'(x0) ≈ [−3f(x0 − h) − 10f(x0) + 18f(x0 + h) − 6f(x0 + 2h) + f(x0 + 3h)]/(12h)
2: (b) f''(x0 − h) ≈ [f(x0 − h) − 2f(x0) + f(x0 + h)]/h^2
(d) f''(x0 − h) ≈ [f(x0 − h) − 4f(x0 + 2h) + 3f(x0 + 3h)]/(6h^2)
(h) f''(x0) ≈ [11f(x0 − h) − 20f(x0) + 6f(x0 + h) + 4f(x0 + 2h) − f(x0 + 3h)]/(12h^2)
4: (d) ∫[x0, x0+h] f(x)dx ≈ hf(x0)
(f) ∫[x0, x0+2h] f(x)dx ≈ h[f(x0 + 2h/3) + f(x0 + 4h/3)]
(h) ∫[x0, x0+h] f(x)dx ≈ (h/2)[f(x0) + f(x0 + h)]
(j) ∫[x0, x0+2h] f(x)dx ≈ (h/2)[3f(x0 + 2h/3) + f(x0 + 2h)]
Section 4.3
2: f'(−2.7) ≈ −0.9151775; f'(−2.5) ≈ 1.5014075; f'(−2.3) ≈ 2.17825; f'(−2.1) ≈ 1.11535
3c: 0.4897985468241977
3e: 149/24 = 6.2083
4: (c) 0.4693956404725931 (e) 17/2 = 8.5
5: (c) 0.5 (e) 81/16 = 5.0625
6: (c) 8.57775220962087(10)−5 (e) 0.0083
7: (c) 0.02031712882950837 (e) 2.3
8: (c) 0.0102872306978985 (e) 1.1375
10: 0
11b: 288666.8155482048
12b: lower: 1565.147456974753 upper: 2334.925631788689 actual: 1915.502415038936
13b: 3.142092629759007
17a: error term: O(h^2 f'(ξ)) degree of precision: 0
17e: error term: O(h^4 f'''(ξ)) degree of precision: 2
17g: error term: O(h^4 f'''(ξ)) degree of precision: 2
17i: error term: O(h^5 f^(4)(ξ)) degree of precision: 3
18a: O(h f''(ξ))
18e: O(h^4 f^(5)(ξ))
23a: 0.0134k for some constant k depending on the approximation formula, not the function sin x.
25: (a) O(h^3) (b) 1 (c) (√3/2)π ≈ 2.720699046351327 (d) π^3/36 ≈ 0.8612854633416616 (e) actual absolute error:
0.7206990463513265
27: − 12
28: O(h^2)
30: 0
31: 10506.03569166666
Section 4.4
1: (c) 17.52961733248352 (e) 1.560867019857898
8: 141
11b:
∫[x0, x0+2h] f(x)dx ≈ (h/(3n)) [ f(x0) + f(x0 + 2h) + 2 Σ_{i=1}^{n−1} f(x0 + 2ih/n) + 4 Σ_{i=1}^{n} f(x0 + (2i − 1)h/n) ]
11c:
∫[x0, x0+3h] f(x)dx ≈ (3h/(8n)) [ f(x0) + f(x0 + 3h) + 2 Σ_{i=1}^{n−1} f(x0 + 3ih/n) + 3 Σ_{i=1}^{n} ( f(x0 + (3i − 2)h/n) + f(x0 + (3i − 1)h/n) ) ]
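A direct translation of the formula in 11b into Octave (a sketch; the function name and argument list are illustrative, not from the text) might read:
function s = compSimpson(f,x0,h,n)
  % composite Simpson's rule for the integral of f from x0 to x0+2h
  % using 2n subintervals, following the formula in 11b
  s = f(x0) + f(x0+2*h);
  for i = 1:n-1
    s = s + 2*f(x0 + 2*i*h/n);
  end%for
  for i = 1:n
    s = s + 4*f(x0 + (2*i-1)*h/n);
  end%for
  s = h/(3*n)*s;
end%function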
16: 0.386259562814567
26: 0.3862939349171364; 5
####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Implementation of adaptive Simpson’s #
# rule #
# INPUT: function f, interval endpoints a and b, #
# desired accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################
A pair of functions that minimizes the number of evaluations of f , aSimp() and adaptiveSimpsons():
####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Wrapper for aSimp() #
# INPUT: function f, interval endpoints a and b, #
# desired accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################
function res = adaptiveSimpsons(f,a,b,TOL)
res = aSimp(f,a,b,f(a),f((a+b)/2),f(b),TOL);
end#function
####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Implementation of adaptive Simpson’s #
# rule #
# INPUT: function f, interval endpoints a and b, #
# f0=f(a), f2=f((a+b)/2), f4=f(b), desired #
# accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################
function res = aSimp(f,a,b,f0,f2,f4,TOL)
h = (b-a)/4;
f1 = f(a+h);
f3 = f(a+3*h);
error = abs(h*(f0-4*(f1+f3)+6*f2+f4))/45;
if (error <= TOL)
res = h/3*(f0+4*(f1+f3)+2*f2+f4);
else
% error too large: apply Simpson's rule recursively to each half,
% asking for half the tolerance from each
res = aSimp(f,a,(a+b)/2,f0,f1,f2,TOL/2) + aSimp(f,(a+b)/2,b,f2,f3,f4,TOL/2);
end%if
end%function
>> format(’long’);
>> f=inline(’x*sin(x^2)’);
>> adaptiveSimpsons(f,0,2*pi,10^-5)
ans = 0.603500307287469
(ii) (1 − cos(4π^2))/2 − 0.603500307287469 ≈ 6.175(10)−7 (iii) yes
Section 4.5
1: [8 sin(πh/2) − sin(πh)]/(3h)
3: O(h^9)
4: 16/9
####################################################
# Written by Dr. Len Brin 16 May 2014 #
# Purpose: Implementation of Romberg integration #
# INPUT: function f, interval endpoints a and b, #
# tolerance tol #
# OUTPUT: approximate integral of f(x) from a to b #
####################################################
function integral = romberg(f,a,b,tol)
N(1,1)=compositeTrapezoidal(f,a,b,1);
N(2,1)=compositeTrapezoidal(f,a,b,2);
N(2,2)=(4*N(2,1)-N(1,1))/3;
i=2;
% continue extrapolating until successive diagonal entries agree to within tol
while (abs(N(i,i)-N(i-1,i-1))>tol)
i=i+1;
N(i,1)=compositeTrapezoidal(f,a,b,2^(i-1));
for j=2:i
N(i,j)=(4^(j-1)*N(i,j-1)-N(i-1,j-1))/(4^(j-1)-1);
end%for
end%while
integral=N(i,i);
end%function
12a: (i)
>> romberg(inline(’x*sin(x^2)’),0,2*pi,10^-5)
ans = 0.603500924593406
(ii) (1 − cos(4π^2))/2 − 0.603500924593406 ≈ 2.34(10)−10 (iii) yes, and not just barely
Section 5.2
9c:
S(x) = −.28 + 3.1861(x − .2) − 3.208(x − .2)^2 − 10.693333(x − .2)^3 for x ∈ [.1, .2],
       .0066 + 2.5465(x − .3) − 3.188(x − .3)^2 + .066667(x − .3)^3 for x ∈ [.2, .3],
       .24 + 2.2277(x − .4) + 10.626667(x − .4)^3 for x ∈ [.3, .4]
9f:
S(x) = −.28 + 3.84613(x − .2) − 20.0773(x − .2)^2 − 245.387(x − .2)^3 for x ∈ [.1, .2],
       .0066 + 2.91347(x − .3) + 10.7507(x − .3)^2 + 102.76(x − .3)^3 for x ∈ [.2, .3],
       .24 + 0.1(x − .4) − 38.8853(x − .4)^2 − 165.453(x − .4)^3 for x ∈ [.3, .4]
b =
3.1861 2.5465 2.2277
c =
-3.20800 -3.18800 0.00000
d =
-10.693333 0.066667 10.626667
b =
3.84613 2.91347 0.10000
c =
-20.077 10.751 -38.885
d =
-245.39 102.76 -165.45
Section 6.1
1a: one
1c: two
1f: two
2c: ṡ(t) = (1/2)e^(−t/2)[√3 cos(√3 t/2) − sin(√3 t/2)] and s̈(t) = −(1/2)e^(−t/2)[√3 cos(√3 t/2) + sin(√3 t/2)].
Substituting into the o.d.e. yields a true statement.
2f: ṙ(t) = 1/(2√t) and r̈(t) = −1/(4t√t). Substituting into r̈ ṙ t^2 = −1/8 yields (−1/(4t√t)) · (1/(2√t)) · t^2 = −1/8, a true statement for t > 0.
3a: ẏ(t) = 4e^t. Substituting into ẏ = y yields 4e^t = 4e^t, a true statement. Furthermore, y(0) = 4e^0 = 4 as required.
3c: ṡ(t) = −te^(−t^2). Substituting into ṡ = (1 − 2s)t yields −te^(−t^2) = (1 − 2 · (1/2)(1 + e^(−t^2)))t, a true statement. Furthermore, s(0) = (1/2)(1 + e^0) = 1 as required.
3f: ṙ(t) = 1/(2√t) and r̈(t) = −1/(4t√t). Substituting into r̈ ṙ t^2 = −1/8 yields (−1/(4t√t)) · (1/(2√t)) · t^2 = −1/8, a true statement for t > 0. Furthermore, r(9) = √9 − 3 = 0 and ṙ(9) = 1/(2√9) = 1/6 as required.
4a: y(x) = x^5 + C
5a: From the graphs of the exact and approximate solutions, it appears the approximation is reasonable, but gets
progressively worse as t increases. The greatest error occurs at 1, and to be more precise, the relative error
there is about 0.099, less than 10%.
5c: From the graphs of the exact and approximate solutions, it appears the approximation is very good at t = 0
and t = 2, but is not particularly accurate between. To be more precise, the relative errors at t = 0.5, 1, 1.5
are about .124, .097, and .095. At three of the five points, the relative error is 9.5% or more.
5f: From the graphs of the exact and approximate solutions, the approximation looks very good for all values of t.
The greatest errors seem to occur at t = 11 and t = 13. To get an idea of just how good the approximation
is, the absolute errors at t = 11 and t = 13 are about .0066 and .0044, respectively. The relative errors are
about .021 and .0073, respectively. All small errors.
6a:
6b:
6g:
6h:
6i:
6j:
6l:
6n:
6o:
7: (6a) θ̈ + (g/ℓ) sin θ = 0; (6b) with downhill as the positive direction: s̈ = g(sin α − µ cos α); (6e) s̈ = (1/m)Fapplied − µg;
(6g) with uphill as the positive direction: s̈ = (1/m)Fapplied cos(β − α) − g(sin α + µ cos α); (6h) with the direction
of the sled’s motion as the positive direction: s̈ = −µg; (6i) with downhill as the positive direction: s̈ = g(sin α − µ cos α);
(6j) with the direction of the puck’s motion as the positive direction: s̈ = −µg; (6l) with up as the positive direction:
s̈ = (c/m)ṡ − g; (6n) with up as the positive direction: s̈ = (c/m)ṡ − g; (6o) with up as the positive direction: s̈ = −g
8: Kinetic friction: µmg versus µ(mg + Fapplied sin 20°). Necessary applied force to overcome friction: µmg versus
µmg/(cos 20° − µ sin 20°). The applied force pushing parallel to the floor will need to be only (cos 20° − µ sin 20°) times
as great as when pushing at 20° from parallel. For example, when µ = .3, cos 20° − µ sin 20° ≈ .837 so the
necessary force pushing parallel to the floor is only 83.7% of that needed pushing at 20° from parallel.
Section 6.2
1c: y(2) ≈ 1.3125
2c: y(2) ≈ 1.88671875
Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1 .
Input: Differential equation ẏ = f (t, y); formula ÿ(t, y); initial condition y(t0 ) = y0 ; numbers t0 and t1 ;
number of steps N .
Step 1: Set t = t0 ; y = y0 ; h = (t1 − t0 )/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set y = y + hf (t, y) + (1/2)h^2 ÿ(t, y)
Step 4: Set t = t0 + (j/N)(t1 − t0 )
Output: Approximation y of the solution at t = t1 .
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 13 November 2015 %
% Purpose: This function implements Taylor’s method of order 2 %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); function (df/dx)(x,y); interval [a,b]; %
% y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = taylor2ode(f,ft,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
y(i+1) = y(i) + h*(f(x(i),y(i)) + 0.5*h*ft(x(i),y(i)));
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
11:
y(2) ≈ 2.3125, 2.28814697265625, 2.28469446951954, 2.28402793464698
absolute errors are approximately
0.02866617919084, 0.004313151847096, 8.606487103870(10)−4 , 1.941138378267(10)−4
error ratios are approximately 6.6, 5.0, 4.4.
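These ratios approach 4 because Taylor's method of order 2 converges at rate O(h^2). The behavior can be reproduced for any o.d.e. (a sketch; the f, ft, interval, and initial value below are placeholders, not those of the exercise) by doubling the number of steps and comparing successive approximations:
f  = inline('x - y','x','y');        % placeholder o.d.e. y' = x - y
ft = inline('1 - (x - y)','x','y');  % its derivative along solutions, f_x + f_y*f
approx = [];
for n = [10 20 40 80]
  [y,x] = taylor2ode(f,ft,0,1,2,n);  % y(0)=1 on [0,2] with n steps
  approx(end+1) = y(end);
end%for
diff(approx)  % successive differences shrink by a factor of about 4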
14a:
u̇ = −(g/ℓ) sin θ
θ̇ = u
14b:
u̇ = g(sin α − µ cos α)
ṡ = u
14e:
u̇ = (1/m)Fapplied − µg
ṡ = u
14g:
u̇ = (1/m)Fapplied cos(β − α) − g(sin α + µ cos α)
ṡ = u
14h:
u̇ = −µg
ṡ = u
14i:
u̇ = g(sin α − µ cos α)
ṡ = u
14j:
u̇ = −µg
ṡ = u
14l:
u̇ = (c/m)u − g
ṡ = u
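To see how such a system is used, here is a small sketch applying Euler's method to the pendulum system of 14a (the values of g, ℓ, the initial conditions, and the number of steps are made up for illustration):
g = 9.81; l = 1;          % made-up constants
theta = pi/6; u = 0;      % made-up initial conditions: released from rest at 30 degrees
h = 0.01;
for j = 1:200             % 200 Euler steps of the system in 14a
  unew = u + h*(-g/l*sin(theta));
  thetanew = theta + h*u;
  u = unew; theta = thetanew;
end%for
theta                     % the angle after 2 units of simulated time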
15: (a) −0.6656470478206087 (b) 0.2384138557742662 (e) 0.05695982142857142 (g) 0.2313498206324268 (h) 14.979875
(i) 5.988821238748838 (j) 43.9939625 (l) 4.387767857142857
18c: y(x) = (3/2)x − 5/4
18d: y(x) = (2/7)x^2 + (11/7)x + 143/49
Section 6.3
1b:
k1 = f(ti, yi)
k2 = f(ti + h/2, yi + (h/2)k1)
yi+1 = yi + hk2
1c:
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
yi+1 = yi + (h/2)[3k2 − k1]
1g:
k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h/2, yi + (h/2)k2)
k4 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/2)[3k2 − 4k3 + 3k4]
1j:
k1 = f(ti, yi)
k2 = f(ti + ((√5 − √3)/(2√5))h, yi + ((√5 − √3)/(2√5))hk1)
k3 = f(ti + h/2, yi + (h/2)k2)
k4 = f(ti + ((√5 + √3)/(2√5))h, yi + ((√5 + √3)/(2√5))hk1)
yi+1 = yi + (h/18)[5k2 + 8k3 + 5k4]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 28 May 2016 %
% Purpose: This function implements the Midpoint method where %
% the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = midpoint(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+h/2,y(i)+h/2*k1);
y(i+1) = y(i) + h*k2;
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
This code may be downloaded at the companion website.
7:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 28 May 2016 %
% Purpose: This function implements Ralston’s method where %
% the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = ralston(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+2*h/3,y(i)+2*h/3*k1);
y(i+1) = y(i) + h/4*(k1+3*k2);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
This code may be downloaded at the companion website.
8c: 2.071336302192492
9c: 2.237523715781341
10c: 2.240722979472185
11c: 2.235615854209425
12c: 2.236251636584492
Section 6.4
1b: O(h3 ); equal to that of underlying integration formula; yes, one degree higher than rate of convergence.
1c: O(h3 ); equal to that of underlying integration formula; yes, one degree higher than rate of convergence.
1g: NOTE: Since this is a four-stage method, equations 6.4.5-6.4.14 must be used to determine the rate of conver-
gence. O(h4 ); less than that of underlying integration formula; yes, one degree higher than rate of convergence.
1j: NOTE: Since this is a four-stage method, equations 6.4.5-6.4.14 must be used to determine the rate of conver-
gence. O(h3 ); less than that of underlying integration formula; yes, one degree higher than rate of convergence.
4: eulerimp.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements improved Euler’s method %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerimp(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+h,y(i) + h*k1);
y(i+1) = y(i) + h/2*(k1+k2);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements Heun’s third order method %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = heun(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+h/3, y(i)+h/3*k1);
k3 = f(x(i)+2*h/3, y(i)+2*h/3*k2);
y(i+1) = y(i) + h/4*(k1+3*k3);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 1 June 2016 %
% Purpose: This function implements Runge-Kutta 4th order (RK4) %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = rk4(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+h/2, y(i)+h/2*k1);
k3 = f(x(i)+h/2, y(i)+h/2*k2);
k4 = f(x(i)+h, y(i)+h*k3);
y(i+1) = y(i) + h/6*(k1+2*k2+2*k3+k4);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
Section 6.5
1: One way to code it would be the following. rk23.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements an adaptive rk2(3) method %
% where the step size is controlled by the routine. %
% Heun’s third order method is combined with open-ode. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,t] = rk23(f,a,ya,b,h,eps,N)
i = 1;
t(i) = a;
y(i) = ya;
done = 0;
while (!done && i<=N)
if ((b-t(i)-h)*(b-a)<=0)
h=b-t(i);
done = 1;
endif
k1 = f(t(i), y(i));
k2 = f(t(i)+h/3, y(i)+h/3*k1);
k3 = f(t(i)+2*h/3, y(i)+2*h/3*k2);
err = abs(h/4*(k1-2*k2+k3));
if (done || err<=eps)
y(i+1) = y(i) + h/4*(k1+3*k3);
t(i+1) = t(i) + h;
if (t(i+1) == t(i))
disp("Procedure failed. Step size reached zero.")
return
endif
i = i+1;
endif
q = 0.9*realpow(eps/err,1/3);
q = max(q,0.1);
q = min(5.0,q);
h = q*h;
end%while
if (!done)
disp("Procedure failed. Maximum number of iterations reached.")
endif
end%function
0    |
1/3  | 1/3
2/3  | −1/3  1
1    | 1     −1    1
-----+-----------------------
     | 1/8   3/8   3/8   1/8
     | 0     1/2   1/2   0
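If this pair were used in the same framework as rk23.m above, only the stages and the two weight rows would change; the difference of the rows gives the local error estimate. A one-step sketch (the name rk24step and its argument list are illustrative only):
function [ynew,err] = rk24step(f,t,y,h)
  % one step of the embedded pair tabulated above: the 3/8-rule value
  % and an error estimate from the second order row
  k1 = f(t, y);
  k2 = f(t+h/3, y+h/3*k1);
  k3 = f(t+2*h/3, y-h/3*k1+h*k2);
  k4 = f(t+h, y+h*(k1-k2+k3));
  ynew = y + h/8*(k1+3*k2+3*k3+k4);
  err = abs(ynew - (y + h/2*(k2+k3)));
end%function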
14: One way to code it would be the following. merson.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements the method of Merson (1957) %
% where the step size is controlled by the routine. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,t] = merson(f,a,ya,b,h,eps,N)
i = 1;
t(i) = a;
y(i) = ya;
done = 0;
while (!done && i<=N)
if ((b-t(i)-h)*(b-a)<=0)
h=b-t(i);
done = 1;
endif
k1 = f(t(i), y(i));
k2 = f(t(i)+h/3, y(i)+h/3*k1);
k3 = f(t(i)+h/3, y(i)+h/6*(k1+k2));
k4 = f(t(i)+h/2, y(i)+h/8*(k1+3*k3));
k5 = f(t(i)+h, y(i)+h/2*(k1-3*k3+4*k4));
err = abs(h/30*(2*k1-9*k3+8*k4-k5));
if (done || err<=eps)
y(i+1) = y(i) + h/6*(k1+4*k4+k5);
t(i+1) = t(i) + h;
if (t(i+1) == t(i))
disp("Procedure failed. Step size reached zero.")
return
endif
i = i+1;
endif
q = 0.9*realpow(eps/err,1/4);
q = max(q,0.1);
q = min(5.0,q);
h = q*h;
end%while
if (!done)
disp("Procedure failed. Maximum number of iterations reached.")
endif
end%function
15d: As can be seen from the diagram, most tolerances greater than 10−4 do not produce a global error of 10−4
or less, though there are exceptions. If just guessing and checking, likely you will end up with a tolerance of
5(10)−5 or less.
[Figure: Cash-Karp, global error as a function of tolerance.]
15f: As can be seen from the diagram, most tolerances less than 10−3 produce a global error of 10−4 or less, as do
some greater tolerances.
[Figure: RK2(4), global error as a function of tolerance.]
16d: As can be seen from the diagram, tolerances less than 10−4 produce a global error of 10−4 or less, as do some
slightly higher tolerances.
[Figure: Cash-Karp, global error as a function of tolerance.]
16f: As can be seen from the diagram, most tolerances less than about 5(10)−3 produce a global error of 10−4 or
less, as do some slightly greater tolerances.
[Figure: RK2(4), global error as a function of tolerance.]
19: (a) y(5) ≈ 6.40926980783945; error ≈ 1.75(10)−4 , 75% greater than the tolerance. (b) y(5) ≈ 6.40708478227220;
error ≈ 2.36(10)−3 , nearly 24 times the tolerance. (c) y(5) ≈ 6.40937679658180; error ≈ 6.82(10)−5 , about
68% of the tolerance. (d) y(5) ≈ 6.40885618182156; error ≈ 5.88(10)−4 , nearly 6 times the tolerance.
20: In order from most to least efficient: Cash-Karp, Merson, RK2(3), Bogacki-Shampine, with evaluations 42, 50,
69, and 138, respectively.
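One way to gather such counts (a sketch; the wrapper below is an illustration, not necessarily how the counts above were obtained) is to wrap the right-hand side so every call increments a counter:
global fevals
fevals = 0;
function z = countedf(x,y)
  global fevals
  fevals = fevals + 1;
  z = (x+2*exp(y)*cos(exp(x)))/(1+exp(y));  % the right-hand side from 15a
end%function
[y,t] = rk23(@countedf,0,2,4,1/10,1e-4,100000);
fevals  % total number of evaluations of the right-hand side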
Bibliography
[1] Robert E. Barnhill and Richard F. Riesenfeld, editors. Computer Aided Geometric Design : Proceedings of
a conference held at the University of Utah, Salt Lake City, Utah, March 18-21, 1974. Academic Press, New
York, 1974.
[3] R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal,
14(4):422–425, 1971.
[4] John Briggs and F. David Peat. Turbulent Mirror, page 69. Harper & Row Publishers, New York, 1989.
[5] Richard L. Burden and J. Douglas Faires. Numerical Analysis. Thomson Brooks/Cole, 8th edition, 2005.
[6] J.C. Butcher. The Numerical Analysis of Ordinary Differential Equations : Runge-Kutta and General Linear
Methods. John Wiley & Sons, 1987.
[7] J.C. Butcher. A history of Runge-Kutta methods. Applied Numerical Mathematics, 20:247–260, 1996.
[8] J.R. Cash and Alan H. Karp. A variable order Runge-Kutta method for initial value problems with rapidly varying right-hand sides. ACM Transactions on Mathematical Software, 16(3):201–222, September 1990.
[10] Paul de Faget de Casteljau. De Casteljau’s autobiography : My time at Citroën. Computer Aided Geometric
Design, 16(7):583–586, August 1999.
[11] David Goldberg. What every computer scientist should know about floating-point arithmetic. http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html, Accessed June 2014.
[12] S. W. Golomb. Checker boards and polyominoes. Amer. Math. Monthly, 61:675–682, 1954.
[14] Denny Gulick. Encounters with Chaos, page 2. McGraw-Hill, New York, 1992.
[15] Bryce Harrington and Johan Engelen. Inkscape. Software available at http://www.inkscape.org/.
[16] K. Heun. Neue methode zur approximativen integration der differentialgleichungen einer unabhängigen veränderlichen. Zeitschrift für Mathematik und Physik, 45:23–38, 1900.
[17] Jeffery J. Leader. Numerical Analysis and Scientific Computing. Pearson, 2004.
[18] Eugene Loh and G. William Walster. Rump’s example revisited. Reliable Computing, 8(3):245–248, 2002.
[19] Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, March
1963.
[20] Michael R. Matthews. Time for science education : how teaching the history and philosophy of pendulum
motion can contribute to science literacy. Kluwer Academic/Plenum Publishers, New York, 2000.
[21] Michael R. Matthews, Michael P. Clough, and Craig Ogilvie. Pendulum motion: The value of idealization in
science. http://www.storybehindthescience.org/pdf/pendulum.pdf.
[22] Cleve Moler. Numerical Computing with MATLAB, chapter 4. The MathWorks, Natick, MA, 2004. https://www.mathworks.com/moler/index_ncm.html.
[23] David E. Müller. A method for solving algebraic equations using an automatic computer. Mathematical Tables
and Other Aids to Computation, 10(56):208–215, October 1956.
[24] L. Mumford. Technics and Civilization. Harcourt Brace Jovanovich, New York, 1934.
[25] Ron Naylor. Galileo, copernicanism and the origins of the new science of motion. The British Journal for the
History of Science, 36(2):151–181, June 2003.
[26] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C :
The Art of Scientific Computing. Cambridge University Press, New York, 2nd edition, 1999.
[29] J. R. Sharma. A family of methods for solving nonlinear equations using quadratic interpolation. Computers
and Mathematics with Applications, 48(5-6):709–714, September 2004.
[30] Avram Sidi. Generalization of the secant method for nonlinear equations. Applied Mathematics E-Notes,
8:115–123, 1999. Available free at mirror sites of http://www.math.nthu.edu.tw/~amen/.
[34] Charles F. Van Loan. Introduction to Scientific Computing : A Matrix Vector Approach Using MATLAB.
Prentice-Hall, Upper Saddle River, NJ, 2nd edition, 2000.
[35] Christopher Vickery. IEEE-754 analysis. http://babbage.cs.qc.cuny.edu/IEEE-754/. Accessed June 2013.
Index
Heun’s third order method, 222, 227
  code, 343
Horner’s method, 82, 88
  code, 326
  pseudo-code, 84
Huygens, Christiaan, 193, 194
implicit Runge-Kutta method, 232
improved Euler method, 222, 234
  code, 342
initial value problem, 196, 197
interpolating function, 106, 114
interpolating polynomial, 114
inverse quadratic interpolation method, 94, 98
  order of convergence, 95
iteration, 46
  modified Euler, see modified Euler method
  Neville’s, see Neville’s method
  Newton’s, see Newton’s method
  Ralston’s, see Ralston’s method
  regula falsi, see bracketed secant method
  RK4, see RK4 method
  Runge-Kutta, see Runge-Kutta method
  secant, see secant method
  seeded secant, see seeded secant method
  Sidi’s, see Sidi’s method
  Steffensen’s, see Steffensen’s method
  Taylor’s, see Taylor’s method
midpoint method, 213
  code, 341
midpoint rule, 156
modified Euler method, 213
Taylor
  Brook, 14
  error term, 11
  polynomial, 10, 13
  remainder term, 10
Taylor’s method, 201, 205