
LEON Q. BRIN

Tea Time Numerical Analysis
Experiences in Mathematics, 2nd edition

the first in a series of tea time textbooks

CC BY-SA

SOUTHERN CONNECTICUT STATE UNIVERSITY

2016. Tea Time Numerical Analysis by Leon Q. Brin is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License.

The code printed within and accompanying Tea Time Numerical Analysis electronically is distributed under the
GNU Public License (GPL).
This code is free software: you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later
version.
The code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details. For a copy of the GNU General Public License, see GPL.

To
Victorija, Cecelia, and Amy
Contents

Preface ix
About Tea Time Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
How to Get Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
How to Get the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1 Preliminaries 1
1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Measuring Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Sources of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Taylor Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4 Recursive Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
The Mathemagician . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Trominos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2 Root Finding 37
2.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
The Bisection Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Analysis of the bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Root Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
The Fixed Point Iteration Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.3 Order of Convergence for Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Convergence Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Steffensen’s Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A Geometric Derivation of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Newton’s Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Secant Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Seeded Secant Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5 More Convergence Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.6 Roots of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Synthetic division revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Finding all the roots of polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Newton’s method and polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Müller’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.7 Bracketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Bracketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Inverse Quadratic Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3 Interpolation 99
3.1 A root-finding challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
The function f and its antiderivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
The derivative of f and more graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.2 Lagrange Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
An application of interpolating polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Neville’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3 Newton Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Sidi’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
More divided differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

4 Numerical Calculus 127


4.1 Rudiments of Numerical Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
The basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Stencils . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2 Undetermined Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

The basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Errors for first derivative formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Errors for other formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Gaussian quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Some standard formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.4 Composite Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Composite Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Adaptive quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.5 Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

5 More Interpolation 175


5.1 Osculating Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Bèzier Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.2 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Piecewise polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Cubic splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Natural spline Octave code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
An application of natural cubic splines? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

6 Ordinary Differential Equations 193


6.1 The Motion of a Pendulum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
A brief history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
The equation of motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Forces in a free body diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Solutions of ordinary differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2 Taylor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Euler’s Method (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Higher Degree Taylor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Taylor’s Method of Degree 3 (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Reducing a second order equation to a first order system . . . . . . . . . . . . . . . . . . . . . . . . 204
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.3 Foundations for Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.4 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
A Note About Convention and Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Higher Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.5 Adaptive Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Adaptive Runge-Kutta (pseudo-code) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
General Runge-Kutta Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

Solutions to Selected Exercises 239

Answers to Selected Exercises 323

Bibliography 349

Index 350
Preface

About Tea Time Numerical Analysis


Greetings! And thanks for giving Tea Time Numerical Analysis a read. This textbook was born of a desire
to contribute a viable, completely free, introductory Numerical Analysis textbook for instructors and students
of mathematics. When this project began (summer 2012), there were traditionally published (very expensive
hardcover) textbooks, notably the excellent Numerical Analysis by Burden and Faires, which was in its ninth
edition. As you might guess by the number of editions, this text is a classic. It is one of very few numerical
analysis textbooks geared for the mathematician, not the scientist or engineer. In fact, I studied from an early
edition in the mid 1990’s! Also in the summer of 2012 there were a couple of freely available websites, notably the
popular http://nm.mathforcollege.com/, complete with video lectures. However, no resource I could find included
a complete, single-pdf downloadable textbook designed for mathematics classes. To be just that is the ultimate
goal of Tea Time Numerical Analysis.
The phrase “tea time” is meant to do more than give the book a catchy title. It is meant to describe the general
nature of the discourse within. Much of the material will be presented as if it were being told to a student during tea
time at University, but with the benefit of careful planning. There will be no big blue boxes highlighting the main
points, no stream of examples after a short introduction to a topic, and no theorem...proof...theorem...proof
structure. Instead, the necessary terms and definitions and theorems and examples will be woven into a more
conversational style. My hope is that this blend of formal and informal mathematics will be easier to digest, and
dare I say, students will be more invited to do their reading in this format.
Those who enjoy a more typical presentation might still find this textbook suits their preference to a large extent.
There will be a summary of the key concepts at the end of each conversation and a number of the exercises will be
solved in complete detail in the appendix. So, one can get a closer-to-typical presentation by scanning for theorems
in the conversations, reading the key concepts, and then skipping to the exercises with solutions. I hope most
readers won’t choose to do so, but it is an option. In any case, the exercises with solutions will be critical reading
for most. Learning by example is often the most effective means. After reading a section, or at least scanning
it, readers are strongly encouraged to skip to the statements of the exercises with solutions (marked by [S] or [A]),
contemplate their solutions, solve them if they can, and then turn to the back of the book for full disclosure. The
hope is that, with their placement in the appendix, readers will be more apt to consider solving the exercises on
their own before looking at the solutions.
The topical coverage in Tea Time Numerical Analysis is fairly typical. The book starts with an introductory
chapter, followed by root finding methods, interpolation (part 1), numerical calculus, interpolation (part 2), and
the second edition introduces a chapter on differential equations. The first five chapters cover what, at SCSU,
constitutes a first semester course in numerical analysis. As this book is intended for use as a free download or
an inexpensive print-on-demand volume, no effort has been made to keep the page count low or to spare copious
diagrams and colors. In fact, I have taken the inexpensive mode of delivery as liberty to do quite the opposite. I
have added many passages and diagrams that are not strictly necessary for the study of numerical analysis, but
are at least peripherally related, and may be of interest to some readers. Most of these passages will be presented
as digressions, so they will be easy to identify. For example, Taylor’s theorem plays such a central role in the
subject that not only is its statement presented; its proof and a bit of history are added as "crumpets". Of course
they can be skipped, but are included to provide a more complete understanding of this fundamental theorem of
numerical analysis. For another example, as a fan of dynamical systems, I found it impossible to refrain from
including a section on visualizing Newton’s Method. The powerful and beautiful pictures of Newton’s Method as a

dynamical system should be eyebrow-raising and question-provoking even if only tangentially important. There are,
of course, other examples of somewhat less critical content, but each is there to enhance the reader’s understanding
or appreciation of the subject, even if the material is not strictly necessary for an introductory study of numerical
analysis.
Along the way, implementation of the numerical methods in the form of computer code will also be discussed.
While one could simply ignore the programming sections and exercises and still get something out of this text, it
is my firm belief that full appreciation for the content cannot be achieved without getting one's hands "dirty" by
doing some programming. It would be nice if readers have had at least some minimal exposure to programming
whether it be Java, or C, web programming, or just about anything else. But I have made every effort to give
enough detail so that even those who have never written even a one-line program will be able to participate in this
part of the study.
In keeping with the desire to produce a completely free learning experience, GNU Octave was chosen as the
programming language for this book. GNU Octave (Octave for short) is offered freely to anyone and everyone! It
is free to download and use. Its source code is free to download and study. And anyone is welcome to modify or
add to the code if so inclined. As an added bonus, users of the much better-known MATLAB will not be burdened
by learning a new language. Octave is a MATLAB clone. By design, nearly any program written in MATLAB will
run in Octave without modification. So, if you have access to MATLAB and would prefer to use it, you may do so
without worry. I have made considerable effort to ensure that every line of Octave in this book will run verbatim
under MATLAB. Even with this earnest effort, though, it is possible that some of the code will not run under
MATLAB. It has only been tested in Octave! If you find any code that does not run in MATLAB, please let me
know.
I hope you enjoy your reading of Tea Time Numerical Analysis. It was my pleasure to write it. Feedback is
always welcome.

Leon Q. Brin
[email protected]

How to Get GNU Octave


Octave is developed by the GNU Project for the GNU operating system, which is most often paired with a Linux
kernel. At its core, Octave is, therefore, GNU/Linux software. It runs natively on GNU/Linux machines. It must be
ported (converted somehow) to run on other operating systems like Windows or OS X. Ports (converted programs)
do exist for these operating systems, but are significantly more complicated to install than native Windows or native
OS X programs. Nonetheless, the advantage to this approach is the end result which looks and runs very much
like a native application, desktop shortcut/alias and all. The disadvantage is the somewhat lengthy installation
procedure with parts that sometimes don’t work together as expected, resulting in a failed installation.
Windows and Mac users may also install hardware virtualizing software. Such software is freely available as
native Windows and native OS X software. Then a complete GNU/Linux operating system can be installed inside
the virtualizer as a so-called virtual machine. Octave can be installed in the virtual machine as a native program.
With some configuring, the virtual machine can be made to look and feel almost like other Windows or OS X
apps. The advantage to this approach is that installation is relatively straightforward. The disadvantage is that it
requires a lot of computing resources. People with an old (slow) machine or a machine with little RAM (memory)
will likely be disappointed in performance. Octave will be even slower than other programs if it runs at all.
GNU/Linux can also be installed “side-by-side” with Windows or OS X, creating a dual-boot machine. The
advantage to this approach is it relieves all of the issues of the other two methods. Octave is installed as a native
application and all computer resources are dedicated to GNU/Linux so Octave will run as quickly as possible on
your machine. The primary disadvantage to this approach is that you will have to decide whether to run your
usual (Windows or OS X) operating system or GNU/Linux every time the computer starts. You will not be able
to switch between Octave and the apps you are used to running. For example, switching from iTunes to Octave,
or from Word to Octave and back, is not possible. You get one or the other. A secondary disadvantage is the
need to repartition the computer’s hard drive (or the need to add an additional hard drive to the machine), making
the installation process potentially devastating to the machine. A complete backup of your machine is required to
maintain safety.
All that may not mean much to you. To see how it translates into advice and step-by-step instructions on
installing GNU Octave, see this textbook’s companion website,

http://lqbrin.github.io/tea-time-numerical/more.html.

How to Get the Code


All the code appearing in the textbook can be downloaded from this textbook’s companion website,

http://lqbrin.github.io/tea-time-numerical/ancillaries.html.

The code printed within and accompanying Tea Time Numerical Analysis electronically is distributed under the
GNU Public License (GPL). Details are available at the website.

Acknowledgments
I gratefully acknowledge the generous support I received during the writing of this textbook, from the patience
my immediate family, Amy, Cecelia, and Victorija exercised while I was absorbed by my laptop’s screen, to the
willingness of my Spring 2013 Seminar class, Elizabeth Field, Rachael Ivison, Amanda Reyher, and Steven Warner
to read and criticize an early version of the first chapter. In between, the Woodbridge Public Library staff, especially
Pamela Wilonski, helped provide a peaceful and inspirational environment for writing the bulk of the text. Many
thanks to Dick Pelosi for his extensive review and many kind words and encouragements throughout the endeavor.
Chapter 1
Preliminaries

1.1 Accuracy
Measuring Error
Numerical methods are designed to approximate one thing or another. Sometimes roots, sometimes derivatives
or definite integrals, or curves, or solutions of differential equations. As numerical methods produce only approx-
imations to these things, it is important to have some idea how accurate they are. Sometimes accuracy comes
down to careful algebraic analysis—sometimes careful analysis of the calculus, and often careful analysis of Taylor
polynomials. But before we can tackle those details, we should discuss just how error and, therefore, accuracy are
measured.
There are two basic measurements of accuracy: absolute error and relative error. Suppose that p is the value
we are approximating, and p̃ is an approximation of p. Then p̃ misses the mark by exactly the quantity p̃ − p, the
so-called error. Of course, p̃ − p will be negative when p̃ misses low. That is, when the approximation p̃ is less
than the exact value p. On the other hand, p̃ − p will be positive when p̃ misses high. But generally, we are not
concerned with whether our approximation is too high or too low. We just want to know how far off it is. Thus,
we most often talk about the absolute error, |p̃ − p|. You might recognize the expression |p̃ − p| as the distance
between p̃ and p, and that’s not a bad way to think about absolute error.
The absolute error in approximating p = π by the rational number p̃ = 22/7 is |22/7 − π| ≈ 0.00126. The absolute
error in approximating π^5 by the rational number 16525/54 is |16525/54 − π^5| ≈ 0.00116. The absolute errors in these
two approximations are nearly equal. To make the point more transparent, π ≈ 3.14159 and 22/7 ≈ 3.14285, while
π^5 ≈ 306.01968 and 16525/54 ≈ 306.01851. Each approximation begins to differ from its respective exact value in the
thousandths place. And each is off by only 1 in the thousandths place.
But there is something more going on here. π is near 3 while π^5 is near 300. To approximate π accurate to the
nearest one hundredth requires the approximation to agree with the exact value in only 3 place values—the ones,
tenths, and hundredths. To approximate π^5 accurate to the nearest one hundredth requires the approximation
to agree with the exact value in 5 place values—the hundreds, tens, ones, tenths, and hundredths. To use more
scientific language, we say that 22/7 approximates π accurate to 3 significant digits while 16525/54 approximates π^5
accurate to 5 significant digits. Therein lies the essence of relative errors—weighing the absolute error against the
magnitude of the number being approximated. This is done by computing the ratio of the error to the exact value.
Hence, the relative error in approximating π by 22/7 is |22/7 − π|/|π| ≈ 4.02(10)^−4 while the relative error in
approximating π^5 by 16525/54 is |16525/54 − π^5|/|π^5| ≈ 3.81(10)^−6. The relative errors differ by a factor of about
100 (equivalent to about two significant digits of accuracy) even though the absolute errors are nearly equal. In
general, the relative error in approximating p by p̃ is given by |p̃ − p|/|p|.
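These numbers are easy to reproduce with a few keystrokes in Octave (Octave notation is covered at the end of this section); the lines below are only a quick check, not part of the discussion above:

abs(22/7 - pi)                    % absolute error in 22/7, about 1.26e-03
abs(22/7 - pi)/abs(pi)            % relative error, about 4.02e-04
abs(16525/54 - pi^5)              % absolute error in 16525/54, about 1.17e-03
abs(16525/54 - pi^5)/abs(pi^5)    % relative error, about 3.81e-06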

Sources of Error
There are two general categories of error: algorithmic error and floating-point error. Algorithmic error is any error
due to the approximation method itself. That is, these errors are unavoidable even if we do exact calculations at
every step. Floating-point error is error due to the fact that computers and calculators generally do not do exact
arithmetic, but rather do floating-point arithmetic.

Crumpet 1: IEEE Standard 754

Floating-point values are stored in binary. According to the IEEE Standard 754, which most computers use, the
mantissa (or significand) is stored using 52 bits, or binary places. Since the leading bit is always assumed to
be 1 (and, therefore, not actually stored), each floating point number is represented using 53 consecutive binary
place values.
Now let's consider how 1/7 is represented exactly. In binary, one seventh is equal to 0.001001001...
because 1/7 = Σ_{i=1}^∞ 2^(−3i) = 1/8 + 1/64 + 1/512 + ···. To see that this is true, remember from calculus that

    Σ_{i=1}^∞ 2^(−3i) = Σ_{i=1}^∞ (2^(−3))^i = 2^(−3)/(1 − 2^(−3)) = (1/8)/(7/8) = 1/7.

But in IEEE Standard 754, 1/7 is chopped to

    1.0010010010010010010010010010010010010010010010010010 × 2^(−3),

or Σ_{i=1}^{18} 2^(−3i), which is exactly 2573485501354569/18014398509481984. The floating point error in calculating 1/7 is, therefore,

    |2573485501354569/18014398509481984 − 1/7| = 1/126100789566373888 ≈ 7.93(10)^(−18).

References [35, 11]
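A quick sanity check of the numbers in this crumpet is possible in Octave. The sketch below is my own check, using the fact that the denominator 18014398509481984 equals 2^54; the printf format is just one way to display enough digits:

printf("%.20f\n", 1/7)    % the stored double, 0.14285714285714284921...
1/(7*2^54)                % 1/7 minus the stored value, about 7.93e-18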

In floating-point arithmetic, a calculator or computer typically stores its values with about 16 significant digits.
For example, in a typical computer or calculator (using double precision arithmetic), the number 1/7 is stored as
about 0.1428571428571428, while the exact value is 0.1428571428571428... In the exact value, the pattern of
142857 repeats without cease, while in the floating point value, the repetition ceases after the third 8. The value
is chopped to 16 decimal places in the floating-point representation. So the floating point error in calculating 1/7
is around 5(10)−17 . I say “around” or “about” in this discussion because these claims are not precisely true, but
the point is made. There is a small error in representing 1/7 as a floating point real number. And the same is true
about all real numbers save a finite set.
Yes, there is some error in the floating-point representation of real numbers, but it is always small in comparison
to the size of the real number being represented. The relative error is around 10−17 , so it may seem that the
consideration of floating-point error is strictly an academic exercise. After all, what’s an error of 7.93(10)−18 among
friends? Is anyone going to be upset if they are sold a ring that is .14285714285714284921 inches wide when it
should be .14285714285714285714 inches wide? Clearly not. But it is not only the error in a single calculation (sum,
difference, product, or quotient) that you should be worried about. Numerical methods require dozens, thousands,
and even millions of computations. Small errors can be compounded. Try the following experiment.

Experiment 1
Use your calculator or computer to calculate the numbers p0 , p1 , p2 , . . . , p7 as prescribed here:
• p0 = π
• p1 = 10p0 − 31
• p2 = 100p1 − 41
• p3 = 100p2 − 59
• p4 = 100p3 − 26
• p5 = 100p4 − 53
• p6 = 100p5 − 58
• p7 = 100p6 − 97
According to your calculator or computer, p7 is probably something like one of these:
0.93116 (Octave)
.9311599796346854 (Maxima)
1 (CASIO fx-115ES)

However, a little algebra will show that p7 = 10000000000000π − 31415926535897 exactly (which is approximately
0.932384). Even though p0 is a very accurate approximation of π, after just a few (carefully selected) computations,
round-off error has caused p7 to have only one or two significant digits of accuracy!
This experiment serves to highlight the most important cause of floating-point error: subtraction of nearly equal
numbers. We repeatedly subtract numbers whose tens and ones digits agree. Their two leading significant digits
match. For example, 10π −31 = 31.415926 . . .−31. 10π is held accurate to about 16 digits (31.41592653589793) but
10π − 31 is held accurate to only 14 significant digits (0.41592653589793). Each subsequent subtraction decreases
the accuracy by two more significant digits. Indeed, p7 is represented with only 2 significant digits. We have
repeatedly subtracted nearly equal numbers. Each time, some accuracy is lost. The error grows.
In computations that don’t involve the subtraction of nearly equal quantities, there is the concern of algorithmic
error. For example, let f (x) = sin x. Then one can prove from the definition of derivative that
    f'(1) = lim_{h→0} [sin(1 + h) − sin(1 − h)]/(2h).

Therefore, we should expect, in general, that p̃(h) = [sin(1 + h) − sin(1 − h)]/(2h) is a good approximation of f'(1)
for small values of h; and that the smaller h is, the better the approximation is.

Experiment 2
Using a calculator or computer, compute p̃(h) for h = 10^−2, h = 10^−3, and so on through h = 10^−7. Your results
should be something like this:

    h        p̃*(h)
    10^−2    0.5402933008747335
    10^−3    0.5403022158176896
    10^−4    0.5403023049677103
    10^−5    0.5403023058569989
    10^−6    0.5403023058958567
    10^−7    0.5403023058738121
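One way to generate this table is with a short Octave loop; the loop and printf formatting below are my own sketch, not part of the experiment itself:

for k = 2:7
  h = 10^(-k);
  printf("10^-%d  %.16f\n", k, (sin(1+h) - sin(1-h))/(2*h));
end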

The second column is labeled p̃∗ (h) to indicate that the approximation p̃(h) is calculated using approximate
(floating-point) arithmetic, so it is technically an approximation of the approximation. Since f 0 (1) = cos(1) ≈
.5403023058681398, each approximation is indeed reasonably close to the exact value. Taking a closer look, though,
there is something more to be said. First, the algorithmic error of p̃(10^−2) is

    |p̃(10^−2) − f'(1)| = |50 [sin(101/100) − sin(99/100)] − cos(1)| ≈ 9.00(10)^−6,

accurate to three significant digits. That is, if we compute p̃(10^−2) using exact arithmetic, the value still misses
f'(1) by about 9(10)^−6. The floating-point error is only how far the computed value of p̃(10^−2), what we have
labeled p̃*(10^−2) in the table, deviates from the exact value of p̃(10^−2). That is, the floating-point error is given by
|p̃* − p̃|:

    |0.5402933008747335 − 50 [sin(101/100) − sin(99/100)]| ≈ 1.58(10)^−17,
as small as one could expect. The absolute error |p̃∗ (10−2 ) − f 0 (1)| = |0.5402933008747335 − cos(1)| is essentially
all algorithmic. The round-off error is dwarfed by the algorithmic error. The fact that we have used floating-point
arithmetic is negligible.
On the other hand, the algorithmic error of p̃(10^−7) is

    |p̃(10^−7) − f'(1)| = |5000000 [sin(10000001/10000000) − sin(9999999/10000000)] − cos(1)| ≈ 9.00(10)^−16,

accurate to three significant digits. But we should be a little bit worried about the floating-point error since
sin(10000001/10000000) ≈ 0.8414710388 and sin(9999999/10000000) ≈ 0.8414709307 are nearly equal. We are subtracting numbers
whose five leading significant digits match! Indeed, the floating-point error is, again |p̃* − p̃|, or

    |0.5403023056738121 − 5000000 [sin(10000001/10000000) − sin(9999999/10000000)]| ≈ 1.94(10)^−10.

Perhaps this error seems small, but it is very large compared to the algorithmic error of about 9(10)−16 . So, in
this case, the error is essentially all due to the fact that we are using floating-point arithmetic! This time, the
algorithmic error is dwarfed by the round-off error. Luckily, this will not often be the case, and we will be free to
focus on algorithmic error alone.

Crumpet 2: Chaos

Edward Lorenz, a meteorologist at the Massachusetts Institute of Technology, was among the first to recognize
and study the mathematical phenomenon now called chaos. In the early 1960’s he was busy trying to model
weather systems in an attempt to improve weather forecasting. As one version of the story goes, he wanted to
repeat a calculation he had just made. In an effort to save some time, he used the same initial conditions he
had the first time, only rounded off to three significant digits instead of six. Fully expecting the new calculation
to be similar to the old, he went out for a cup of coffee and came back to look. To his astonishment, he
noticed a completely different result! He repeated the procedure several times, each time finding that small
initial variations led to large long-term variations. Was this a simple case of floating-point error? No. Here’s a
rather simplified version of what happened. Let f (x) = 4x(1 − x) and set p0 = 1/7. Now compute p1 = f (p0 ),
p2 = f (p1 ), p3 = f (p2 ), and so on until you have p40 = f (p39 ). You should find that p40 ≈ 0.080685. Now set
p0 = 1/7 + 10−12 (so we can run the same computation only with an initial value that differs from the original
by the tiny amount, 10−12 ). Compute as before, p1 = f (p0 ), p2 = f (p1 ), p3 = f (p2 ), and so on until you have
p40 = f (p39 ). This time you should find that p40 ≈ 0.91909—a completely different result! If you go back and
run the two calculations using 100 significant digit arithmetic, you will find that beginning with p0 = 1/7 leads
to p40 ≈ .080736 while beginning with p0 = 1/7 + 10−12 leads to p40 ≈ 0.91912. In other words, it is not the
fact that we are using floating-point approximations that makes these two computations turn out drastically
different. Using 1000 significant digit arithmetic would not change the conclusion, nor would any more precise
calculation. This is a demonstration of what’s known as sensitivity to initial conditions, a feature of all chaotic
systems including the weather. Tiny variations at some point lead to vast variations later on. And the “errors”
are algorithmic. This is the basic principle that makes long-range weather forecasting impossible. In the words
of Edward Lorenz, “In view of the inevitable inaccuracy and incompleteness of weather observations, precise
very-long-range forecasting would seem non-existent.”

References [19, 14, 4]
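The iteration described in this crumpet takes only a few lines of Octave. Here is a sketch in ordinary double precision, so the results should resemble the floating-point values quoted above rather than the 100-digit ones:

f = @(x) 4*x*(1 - x);     % the map from the crumpet
p = 1/7;                  % original initial value
for k = 1:40, p = f(p); end
p                         % roughly 0.080685
p = 1/7 + 1e-12;          % slightly perturbed initial value
for k = 1:40, p = f(p); end
p                         % roughly 0.91909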

Experiment 3
Let a = 77617 and b = 33096, and compute

    333.75b^6 + a^2(11a^2 b^2 − b^6 − 121b^4 − 2) + 5.5b^8 + a/(2b).

You will probably get a number like −1.180591620717411(10)^21 even though the exact value is

    −54767/66192 ≈ −0.8273960599468214.
That’s an incredible error! But it’s not because your calculator or computer has any problem calculating each term
to a reasonable degree of accuracy. Try it.

    333.75b^6 = 438605750846393161930703831040
    a^2(11a^2 b^2 − b^6 − 121b^4 − 2) = −7917111779274712207494296632228773890
    5.5b^8 = 7917111340668961361101134701524942848
    a/(2b) = 77617/66192 ≈ 1.172603940053179
The reason the calculation is so poor is that nearly equal values are subtracted after each term is calculated.
a2 (11a2 b2 − b6 − 121b4 − 2) and 5.5b8 have opposite signs and match in their greatest 7 significant digits, so
calculating their sum decreases the accuracy by about 7 significant digits. To make matters worse, a2 (11a2 b2 − b6 −
121b4 − 2) + 5.5b8 = −438605750846393161930703831042, which has the opposite sign of 333.75b6 and matches it in
every place value except the ones. That’s 29 digits! So we lose another 29 significant digits of accuracy in adding
this sum to 333.75b6 . Doing the calculation exactly, the sum 333.75b6 + a2 (11a2 b2 − b6 − 121b4 − 2) + 5.5b8 is −2.
But the computation needs to be carried out to 37 significant digits to realize this. Calculation using only about
16 significant digits, as most calculators and computers do, results in 0 significant digits of accuracy since 36 digits
of accuracy are lost during the calculation. That’s why you can get a number like −1.180591620717411(10)21 for
your final answer instead of the exact answer a/(2b) − 2 ≈ −0.8273960599468214.
What may be even more surprising is that a simple rearrangement of the expression leads to a completely
different result. Try computing
    (333.75 − a^2)b^6 + a^2(11a^2 b^2 − 121b^4 − 2) + 5.5b^8 + a/(2b)
instead. This time you will likely get a number like 1.172603940053179. Again the result is entirely inaccurate, and
the reason is the same. This time the individual terms are

    (333.75 − a^2)b^6 = −7917110903377385049079188237280149504
    a^2(11a^2 b^2 − 121b^4 − 2) = −437291576312021946464244793346
    5.5b^8 = 7917111340668961361101134701524942848
    a/(2b) = 77617/66192 ≈ 1.172603940053179
so the problem persists. We still end up subtracting numbers of nearly equal value. The difference between this
calculation and the last is rounding. In the first case, rounding caused two of the large numbers to disagree in their
last significant digit, so they added up to something huge. In the second case, the sum of the first three terms turns
out to be 0 because the large numbers agree in all significant digits. Note that in the second case, the final result
is simply the value of a/(2b).
As these examples show, sometimes floating-point error and sometimes algorithmic error can spoil a calculation.
In general, it is very difficult to catch floating-point error, though. Algorithmic error is much more accessible. And
most of the algorithms we will explore are not susceptible to floating point error. In almost all cases, the lion’s
share of the error will be algorithmic.

References [28, 18]

Key Concepts
p The exact value being approximated.

p̃ An approximation of the value p.

Absolute error: |p̃ − p| is known as the absolute error in using p̃ to approximate the value p.
Relative error: |p̃ − p|/|p| is known as the relative error in using p̃ to approximate the value p.

Accuracy: We say that p̃ is accurate to n significant digits if the leading n significant digits of p̃ match those of
p. More precisely, we say that p̃ is accurate to d(p̃) = −log |(p̃ − p)/p| significant digits.

Floating-point arithmetic: Arithmetic using numbers represented by a fixed number of significant digits.
Algorithmic error: Error caused solely by the algorithm or equation involved in the approximation, |p̃ − p| where
p̃ is an approximation of p and is computed using exact arithmetic.
Truncation error: Algorithmic error due to use of a partial sum in place of a series. In this type of error, the tail
of the series is truncated—thus the name.
Floating-point error: Error caused solely by the fact that a computation is done using floating-point arithmetic,
|p̃∗ − p̃| where p̃∗ is computed using floating-point arithmetic, p̃ is computed using exact arithmetic, and both
are computed according to the same formula or algorithm.
Round-off error: Another name for floating-point error.

Octave
The computations of this section can easily be done using Octave. All you need are arithmetic operations and a
few standard functions like the absolute value and sine and cosine. Luckily, none of these is very difficult using
Octave. The arithmetic operations are done much like they would be on a calculator. There is but one important
distinction. Most calculators will accept an expression like 3x and understand that you mean 3 × x, but Octave
will not. The expression 3x causes a syntax error in Octave. Octave needs you to specify the operation as in 3*x.
Standard functions like absolute value, sine, and cosine (and many others) have simple abbreviations in Octave.
They all take one argument, or input. Think function notation and it will become clear how to find the sine or
absolute value of a number. You need to type the name of the function, a left parenthesis, the argument, and a right
parenthesis, as in sin(7.2). Some common functions and their abbreviations are listed in Table 1.1. Functions and

Table 1.1: Some common functions and their Octave abbreviations.


Function   Octave         Function      Octave     Function      Octave
n!         factorial(n)   sin(x)        sin(x)     cos(x)        cos(x)
|x|        abs(x)         tan(x)        tan(x)     cot(x)        cot(x)
e^x        exp(x)         sin^−1(x)     asin(x)    cos^−1(x)     acos(x)
ln(x)      log(x)         tan^−1(x)     atan(x)    cot^−1(x)     acot(x)
√x         sqrt(x)        sinh(x)       sinh(x)    cosh(x)       cosh(x)
⌊x⌋        floor(x)       ⌈x⌉           ceil(x)    b^x           b^x

arithmetic operations can be combined in the obvious way. A few examples from this section appear in Table 1.2.
There are two things to observe. First, Octave notation is very much like calculator notation. Second, by default

Table 1.2: Octave computations of some expressions.


Expression                        Octave                          Result
|22/7 − π|                        abs(22/7-pi)                    0.0012645
|16525/54 − π^5| / |π^5|          abs(16525/54-pi^5)/abs(pi^5)    3.8111e-06
(sin(1.01) − sin(0.99))/0.02      (sin(1.01)-sin(0.99))/0.02      0.54029

Octave displays results using 5 significant digits. Don’t be fooled into thinking Octave has only computed those
five digits of the result, though. In fact, Octave has computed at least 15 digits correctly. And if you want to know
what they are, use the format('long') command. This command only needs to be used once per session. All
numbers printed after this command is run will be shown with 15 significant digits. For example, 1/7 will produce
0.142857142857143 instead of just 0.14286. If you would like to go back to the default format, use the format()
command with no arguments. We will discuss finer control over output later. For now, here are a few ways you
might do experiment 1 using Octave. The only differences are the amount of output and the format of the output.
The numbers are being calculated exactly the same way and with exactly the same precision.
Experiment 1 in Octave, example 1


octave:1> p0=pi;
octave:2> p1=10*p0-31; p2=100*p1-41; p3=100*p2-59;
octave:3> p4=100*p3-26; p5=100*p4-53; p6=100*p5-58;
octave:4> p7=100*p6-97
p7 = 0.93116

Experiment 1 in Octave, example 2


octave:1> format('long')
octave:2> p0=pi
p0 = 3.14159265358979
octave:3> p1=10*p0-31
p1 = 0.415926535897931
octave:4> p2=100*p1-41
p2 = 0.592653589793116
octave:5> p3=100*p2-59
p3 = 0.265358979311600
octave:6> p4=100*p3-26
p4 = 0.535897931159980
octave:7> p5=100*p4-53
p5 = 0.589793115997963
octave:8> p6=100*p5-58
p6 = 0.979311599796347
octave:9> p7=100*p6-97
p7 = 0.931159979634685

Experiment 1 in Octave, example 3


octave:1> 10*pi-31
ans = 0.41593
octave:2> 100*ans-41
ans = 0.59265
octave:3> 100*ans-59
ans = 0.26536
octave:4> 100*ans-26
ans = 0.53590
octave:5> 100*ans-53
ans = 0.58979
octave:6> 100*ans-58
ans = 0.97931
octave:7> 100*ans-97
ans = 0.93116

Experiment 3 in Octave
octave:1> a=77617;
octave:2> b=33096;
octave:3> t1=333.75*b^6;
octave:4> t2=a^2*(11*a^2*b^2-b^6-121*b^4-2);
octave:5> t3=5.5*b^8;
octave:6> t4=a/(2*b);
octave:7> t1+t2+t3+t4
ans = -1.18059162071741e+21
octave:8> t1=(333.75-a^2)*b^6;
octave:9> t2=a^2*(11*a^2*b^2-121*b^4-2);
octave:10> t1+t2+t3+t4
ans = 1.17260394005318

In the end, the way you choose to complete an exercise in Octave will be a matter of preference, and will depend on
your goal. You should ask yourself questions like the following. How many significant digits do I need? How many
intermediate results do I need to see? Which ones? The answers to such questions should guide your solution.
When needed, Octave has abbreviations for most common constants. Table 1.3 shows the three most common.

Table 1.3: Some Octave constants.


Constant Octave Result
e e 2.7183
π pi 3.1416
i i or j 0 + 1i

Exercises

1. Besides round-off error, how may the accuracy of a numerical calculation be adversely affected?

2. Compute the absolute and relative errors in the approximation of π by 3.

3. Calculate the absolute error in approximating p by p̃. [A]
   (a) p = 123; p̃ = 1106/9 [S]
   (b) p = 1/e; p̃ = .3666
   (c) p = 2^10; p̃ = 1000 [S]
   (d) p = 24; p̃ = 48
   (e) p = π^−7; p̃ = 10^−4 [S]
   (f) p = (0.062847)(0.069234); p̃ = 0.0042

4. Calculate the relative errors in the approximations of question 3. [S]

5. How many significant digits of accuracy do the approximations of question 3 have? [S]

6. Compute the absolute error and relative error in approximations of p by p̃.
   (a) p = √2, p̃ = 1.414
   (b) p = 10^π, p̃ = 1400
   (c) p = 9!, p̃ = √(18π)(9/e)^9

7. Calculate 1103√8/9801 using Octave.

8. The number in question 7 is an approximation of 1/π. Using Octave, find the absolute and relative errors in the approximation.

9. Using Octave, calculate
   (a) ⌊ln(234567)⌋
   (b) e^⌈ln(234567)⌉
   (c) ∛⌊sin(e^5.2)⌋
   (d) −e^(iπ)
   (e) 4 tan^−1(1)
   (f) ⁵√⌊cos(3) − ln(3)⌋ / ⌈arctan(3) − e^3⌉

10. Find f(2) using Octave.
    (a) f(x) = e^sin(x) [S]
    (b) f(x) = sin(e^x)
    (c) f(x) = tan^−1(x − 0.429) [S]
    (d) f(x) = x − tan^−1(0.429)
    (e) f(x) = 10^x/5!
    (f) f(x) = 5!/x^10

11. All of these equations are mathematically true. Nonetheless, floating point error causes some of them to be false according to Octave. Which ones? HINT: Use the boolean operator == to check. For example, to check if sin(0) = 0, type sin(0)==0 into Octave. ans=1 means true (the two sides are equal according to Octave—no round-off error) and ans=0 means false (the two sides are not equal according to Octave—round-off error).
    (a) (2)(12) = 9^2 − 4(9) − 21
    (b) e^(3 ln(2)) = 8
    (c) ln(10) = ln(5) + ln(2)
    (d) g((1 + √5)/2) = (1 + √5)/2 where g(x) = ∛(x^2 + x)
    (e) ⌊153465/3⌋ = 153465/3
    (f) 3π^3 + 7π^2 − 2π + 8 = ((3π + 7)π − 2)π + 8

12. Find an approximation p̃ of p with absolute error .001.
    (a) p = π [S]
    (b) p = √5
    (c) p = ln(3) [S]
    (d) p = √23^√23/10
    (e) p = ln(1.1) [S]
    (f) p = tan(1.57079)

13. Find an approximation p̃ of p with relative error .001 for each value of p in question 12. [S]

14. p̃ approximates what value with absolute error .0005? [A]
    (a) p̃ = .2348263818643
    (b) p̃ = 23.89627345677
    (c) p̃ = −8.76257664363

15. Repeat question 14 except with relative error .0005. [A]

16. p̃ approximates p with absolute error 1/100 and relative error 3/100. Find p and p̃. [A]

17. p̃ approximates p with absolute error 3/100 and relative error 1/100. Find p and p̃.

18. Suppose p̃ must approximate p with relative error at most 10^−3. Find the largest interval in which p̃ must lie if p = 900.

19. The number e can be defined by e = Σ_{n=0}^∞ (1/n!). Compute the absolute error and relative error in the following approximations of e:
    (a) Σ_{n=0}^{5} 1/n!
    (b) Σ_{n=0}^{10} 1/n!

20. The golden ratio, (1 + √5)/2, is found in nature and in mathematics in a variety of places. For example, if Fn is the nth Fibonacci number, then

    lim_{n→∞} F_{n+1}/F_n = (1 + √5)/2.

    Therefore, F11/F10 may be used as an approximation of the golden ratio. Find the relative error in this approximation. HINT: The Fibonacci sequence is defined by F0 = 1, F1 = 1, Fn = F_{n−1} + F_{n−2} for n ≥ 2.

21. Find values for p and p̃ so that the relative and absolute errors are equal. Make a general statement about conditions under which this will happen. [A]

22. Find values for p and p̃ so that the relative error is greater than the absolute error. Make a general statement about conditions under which this will happen.

23. Find values for p and p̃ so that the relative error is less than the absolute error. Make a general statement about conditions under which this will happen.

24. Calculate (i) p̃* using a calculator or computer, (ii) the absolute error, |p̃* − p|, and (iii) the relative error, |p̃* − p|/|p|. Then use the given value of p̃ to compute (iv) the algorithmic error, |p̃ − p| and (v) the round-off error, |p̃* − p̃|.
    (a) Let f(x) = x^4 + 7x^3 − 63x^2 − 295x + 350 and let p = f'(−2). The value p̃ = [f(−2 + 10^−7) − f(−2 − 10^−7)]/(2(10)^−7) is a good approximation of p. p̃ is exactly 8.99999999999999. [A]
    (b) Let f'(x) = e^x sin(10x) and f(0) = 0 and let p = f(1). It can be shown that p = (1/101)e(sin 10 − 10 cos 10) + 10/101. Euler's method produces the approximation p̃ = (1/10) Σ_{i=1}^{10} e^(i/10) sin i. Accurate to 28 significant digits, p̃ is 0.2071647018159241499410798569.
    (c) Let a0 = (5 + √5)/8 and a_{n+1} = 4a_n(1 − a_n), and consider p = a51. It can be shown that p = a51 = (5 − √5)/8. The most direct algorithm for calculating a51 is to calculate a1, a2, a3, ... a51 in succession, according to the given recursion relation. Use this algorithm to compute p̃* and p̃.

1.2 Taylor Polynomials


One of the cornerstones of numerical analysis is Taylor’s theorem about which you learned in Calculus. A short
study bears repeating here, however.
Theorem 1. Suppose that f (x) has n + 1 derivatives on (a, b), and x0 ∈ (a, b). Then for each x ∈ (a, b), there
exists a ξ, depending on x, lying strictly between x and x0 such that
f(x) = f(x_0) + \sum_{j=1}^{n} \frac{f^{(j)}(x_0)}{j!} (x - x_0)^j + \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1}.

Proof. Let I be the open interval between x and x_0 and Ī be the closure of I. Since I ⊂ Ī ⊂ (a, b) and f has n + 1 derivatives on (a, b), we have that f, f′, f″, . . . , f^{(n)} are all continuous on Ī and that f^{(n+1)} exists on I. We now define

F(z) = f(x) - f(z) - \sum_{j=1}^{n} \frac{f^{(j)}(z)}{j!} (x - z)^j

and will prove the theorem by showing that F(x_0) = \frac{(x - x_0)^{n+1}}{(n+1)!} f^{(n+1)}(\xi) for some ξ ∈ I. Note that F′(z), a telescoping sum, is given by

F'(z) = -f'(z) - \sum_{j=1}^{n} \left[ \frac{f^{(j+1)}(z)}{j!} (x - z)^j - \frac{f^{(j)}(z)}{(j-1)!} (x - z)^{j-1} \right]
      = -f'(z) - \left[ \frac{f^{(n+1)}(z)}{n!} (x - z)^n - f'(z) \right]
      = -\frac{f^{(n+1)}(z)}{n!} (x - z)^n.

Now define g(z) = F(z) - \left( \frac{x - z}{x - x_0} \right)^{n+1} F(x_0). It is easy to verify that g satisfies the premises of Rolle's theorem. Indeed, g(x_0) = g(x) = 0 and the continuity and differentiability criteria are met. By Rolle's theorem, there exists ξ ∈ I such that g'(\xi) = 0 = F'(\xi) + (n+1) \frac{(x - \xi)^n}{(x - x_0)^{n+1}} F(x_0). Hence,

F(x_0) = -F'(\xi) \frac{(x - x_0)^{n+1}}{(n+1)(x - \xi)^n}
       = \frac{f^{(n+1)}(\xi)}{n!(n+1)} (x - x_0)^{n+1}
       = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1}.
This completes the proof.
We will use the notation

T_n(x) = f(x_0) + \sum_{j=1}^{n} \frac{f^{(j)}(x_0)}{j!} (x - x_0)^j

and call this the nth Taylor polynomial of f expanded about x_0. We will also use the notation

R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1}

and call this the remainder term for the nth Taylor polynomial of f expanded about x_0.
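As a concrete check of this notation, the third Taylor polynomial of f(x) = e^x expanded about x_0 = 0 is T_3(x) = 1 + x + x²/2 + x³/6, and by Taylor's theorem the difference f(0.5) − T_3(0.5) is exactly R_3(0.5). A quick Octave computation (a sketch, not part of the text's code) confirms the size of that difference:

% third Taylor polynomial of e^x about 0, evaluated at x = 0.5
T3 = 1 + 0.5 + 0.5^2/2 + 0.5^3/6;
err = exp(0.5) - T3        % this is R3(0.5), roughly 2.9e-3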

Crumpet 3: ξ

ξ is the (lower case) fourteenth letter of the Greek alphabet and is pronounced ksee. It is customary, but, of
course, not necessary to use this letter for the unknown quantity in Taylor’s theorem. The capital version of ξ is
Ξ, a symbol rarely seen in mathematics.

It will not be uncommon, for sake of brevity, to call Tn (x) the nth Taylor polynomial and Rn (x) the remainder
term when the function and center of expansion, x0 , are either unspecified or clear from context.
In calculus, you likely focused on the Taylor polynomial, or Taylor series, and did not pay much attention to the
remainder term. The situation is quite the reverse in numerical analysis. Algorithmic error can often be ascertained
by careful attention to the remainder term, making it more critical than the Taylor polynomial itself. The Taylor
polynomial will, however, be used to derive certain methods, so won’t be entirely neglected.
The most important thing to understand about the remainder term is that it tells us precisely how well T_n(x) approximates f(x). From Taylor's theorem, f(x) = T_n(x) + R_n(x), so the absolute error in using T_n(x) to approximate f(x) is given by |T_n(x) − f(x)| = |R_n(x)|. But |R_n(x)| = \left| \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1} \right| for some ξ between x and x_0. Therefore,

|T_n(x) - f(x)| = |R_n(x)| \le \max_{\xi} \left| \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1} \right| = \frac{|x - x_0|^{n+1}}{(n+1)!} \max_{\xi} \left| f^{(n+1)}(\xi) \right|.
We learn several things from this observation:

1. The remainder term is precisely the error in using Tn (x) to approximate f (x). Hence, it is sometimes referred
to as the error term.

2. The absolute error in using Tn (x) to approximate f (x) depends on three factors:

(a) |x − x_0|^{n+1}
(b) 1/(n + 1)!
(c) |f^{(n+1)}(ξ)|

3. We can find an upper bound on |T_n(x) − f(x)| by finding an upper bound on |f^{(n+1)}(ξ)|.

Figure 1.2.1: For small n, T_n(x) is a good approximation only for small x. [Plot of T_4(x), T_10(x), and cos(x) for −10 ≤ x ≤ 10.]

Because |R_n(x)| measures exactly the absolute error |T_n(x) − f(x)|, we will be interested in conditions that force |R_n(x)| to be small. According to observation 2, there are three quantities to consider. First, |x − x_0|^{n+1}, or |x − x_0|, the distance between x and x_0. The approximation T_n(x) will generally be better for x closer to x_0. Second, 1/(n + 1)!. This suggests that the more terms we use in our Taylor polynomial (the greater n is), the better the approximation will be. Finally, |f^{(n+1)}(ξ)|, the magnitude of the (n + 1)st derivative of f. The tamer this derivative, the better T_n(x) will approximate f(x). Be warned, however, these are just rules of thumb for making |R_n(x)| small. There are exceptions to these rules.

Figure 1.2.2: The actual error |T_n(x) − f(x)| is often much smaller than the theoretical bound. [Plot of T_2(x), T_11(x), and ln(x) for 0 ≤ x ≤ 18, with the point (e², 2) marked.]

To see these factors in action, consider f(x) = ln(x) expanded about x_0 = e². According to Taylor's theorem,

T_2(x) = 2 + \frac{x - e^2}{e^2} - \frac{(x - e^2)^2}{2e^4}  and  R_2(x) = \frac{1}{3\xi^3} (x - e^2)^3;

T_{11}(x) = 2 + \sum_{j=1}^{11} \frac{(-1)^{j-1} (x - e^2)^j}{j e^{2j}}  and  R_{11}(x) = \frac{-1}{12\xi^{12}} (x - e^2)^{12}.

After you have convinced yourself these formulas are correct, suppose that we are interested in approximating ln(x) with an absolute error of no more than 0.1. Since |ξ^{−3}| and |ξ^{−12}| are decreasing functions of ξ, they attain their maximum values on a closed interval at the lower endpoint of that interval. Hence, for x ≥ e², we have

|R_2(x)| \le \max_{\xi \in [e^2, x]} \left| \frac{1}{3\xi^3} \right| (x - e^2)^3 = \frac{1}{3e^6} (x - e^2)^3.

But for 0 < x < e², we have

|R_2(x)| \le \max_{\xi \in [x, e^2]} \frac{1}{3\xi^3} (e^2 - x)^3 = \frac{1}{3x^3} (e^2 - x)^3.

To determine where these remainders are less than 0.1, we need to solve the equations \frac{1}{3e^6}(x - e^2)^3 = 0.1 and \frac{1}{3x^3}(e^2 - x)^3 = 0.1. The values we seek are

x = \left( 1 + \sqrt[3]{\tfrac{3}{10}} \right) e^2 ≈ 12.33  and  x = \frac{\sqrt[3]{8100} + 10\sqrt[3]{90} - 30}{13\sqrt[3]{90}} e^2 ≈ 4.427.

So Taylor's theorem guarantees that T_2(x) will approximate ln(x) to within 0.1 over the entire interval [4.427, 12.33]. Since e² ≈ 7.389, T_2(x) approximates ln(x) to within 0.1 from about 3 below e² to about 5 above e². In other words, as long as x is close enough to x_0 = e², the approximation is good. A similar calculation for R_{11}(x) reveals that T_{11}(x) is guaranteed to approximate ln(x) to within 0.1 over the interval [3.667, 14.89]. In other words, for a larger value of n, x doesn't need to be as close to x_0 to achieve the same accuracy.
But remember, these are only theoretical bounds on the errors. The actual errors are often much smaller than the bounds. For example, our analysis gives the upper bound |R_2(3)| \le \frac{1}{3 \cdot 3^3}(e^2 - 3)^3 ≈ 1.05, whereas the actual error is

|T_2(3) - \ln(3)| = \left| 2 + \frac{3 - e^2}{e^2} - \frac{(3 - e^2)^2}{2e^4} - \ln(3) \right| ≈ .131.

The bound is about 8 times the actual error. If we take this point a bit further, the graphs of T_2(x) and T_{11}(x) versus ln(x) (and a bit of calculation we will discuss later) reveal that T_2(x) actually approximates ln(x) to within 0.1 over the interval [3.296, 13.13] and T_{11}(x) actually approximates ln(x) to within 0.1 over the interval [0.9030, 15.33]. These intervals are a bit larger than the theoretical guaranteed intervals. See Figure 1.2.2. This figure reveals something else too. T_2(18) does a much better job of approximating ln(18) than does T_{11}(18). It's not always the case that more terms means a better approximation.
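The actual error quoted above is easy to verify in Octave using an inline function (inline functions are introduced at the end of this section). This is just a sketch of the check, not code from the text:

% actual error of T2(x) for ln(x) about e^2, evaluated at x = 3
T2 = inline('2 + (x-exp(2))/exp(2) - (x-exp(2)).^2/(2*exp(4))');
abs(T2(3) - log(3))     % about 0.131, well under the theoretical bound 1.05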
We now turn our attention to perhaps the most often analyzed Taylor polynomials—those for the sine and cosine
functions. They provide examples with beautiful visualization and simple analysis. The nth Taylor polynomial for
f(x) = cos(x) expanded about 0 is

T_n(x) = \cos(0) + \sum_{j=1}^{n} \frac{\left. \frac{d^j}{dx^j} (\cos x) \right|_{x=0}}{j!} (x - 0)^j
       = \cos(0) - \sin(0) \cdot x - \frac{\cos(0)}{2} x^2 + \frac{\sin(0)}{6} x^3 + \frac{\cos(0)}{24} x^4 - \cdots
       = 1 - \frac{1}{2} x^2 + \frac{1}{24} x^4 - \cdots

and its remainder term is

R_n(x) = \frac{\left. \frac{d^{n+1}}{dx^{n+1}} (\cos x) \right|_{x=\xi}}{(n+1)!} (x - 0)^{n+1}
       = \frac{x^{n+1}}{(n+1)!} \cdot \begin{cases} -\sin(\xi) & \text{when } n \bmod 4 \equiv 0 \\ -\cos(\xi) & \text{when } n \bmod 4 \equiv 1 \\ \phantom{-}\sin(\xi) & \text{when } n \bmod 4 \equiv 2 \\ \phantom{-}\cos(\xi) & \text{when } n \bmod 4 \equiv 3 \end{cases}

Since the sine and cosine functions are bounded between −1 and 1 we know that

-\frac{|x|^{n+1}}{(n+1)!} \le R_n(x) \le \frac{|x|^{n+1}}{(n+1)!}.

There are two ways this remainder term will be small. First, if x is close to 0, then |x| is small, making R_n(x) small. Second, if n is large, then 1/(n + 1)! is small, making R_n(x) small. In other words, for small values of n, the remainder term is small for small values of x. T_n(x) is a good approximation of cos(x) for such combinations of x and n. On the other hand, for large values of n, the remainder term is small even for large values of x. For example, |R_61(x)| ≤ |x|^{62}/62!, so |R_61(x)| will remain less than 1 for all x with magnitude less than \sqrt[62]{62!} ≈ 23.933.
Figures 1.2.1 and 1.2.3 illustrate these points.
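These bounds are easy to test numerically. The following loop is a sketch (not code from the text; loops are covered in the next section's Octave notes) that sums T_60(x) for cos(x) and compares it with cos(x) at x = 20, where the bound |R_60(20)| ≤ 20^61/61! is roughly 5 × 10^{−5}:

% T_60 for cos(x) about 0, evaluated at x = 20
x = 20;
T = 1;                        % the j = 0 term
for j = 2:2:60                % only even powers appear
  T = T + (-1)^(j/2) * x^j / factorial(j);
end%for
abs(T - cos(x))               % small, as the remainder bound promises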

Figure 1.2.3: For large n, T_n(x) is a good approximation even for large x. [Plot of T_60(x) and cos(x) for −30 ≤ x ≤ 30.]

Key Concepts
Rolle's theorem: Suppose that f(x) is continuous on [a, b] and differentiable on (a, b). If f(a) = f(b), then there exists ξ ∈ (a, b) such that f′(ξ) = 0.

Taylor's theorem: Suppose that f(x) has n + 1 derivatives on (a, b), and x_0 ∈ (a, b). Then for each x ∈ (a, b), there exists ξ, depending on x, lying strictly between x and x_0 such that

f(x) = f(x_0) + \sum_{j=1}^{n} \frac{f^{(j)}(x_0)}{j!} (x - x_0)^j + \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1}.

nth Taylor polynomial: T_n(x) = f(x_0) + \sum_{j=1}^{n} \frac{f^{(j)}(x_0)}{j!} (x - x_0)^j.

Maclaurin polynomial: A Taylor polynomial expanded about x_0 = 0 is also called a Maclaurin polynomial.

Remainder term: R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1} is precisely −(T_n(x) − f(x)).

Error term: Another name for the remainder term.

Crumpet 4: The original theorem of Brook Taylor

The original theorem of Brook Taylor was published in his opus magnum Methodus Incrementorum Directa & Inversa of 1715. In Methodus, it appears as the second corollary to Proposition VII Theorem III, bearing faint resemblance to any modern statement of the theorem.

There is no mention of a remainder term. There is no use of the familiar f(x)-type function notation. It's written in Latin. And there is no laundry list of hypotheses.

Here is the original statement of Taylor's theorem in English as translated by Ian Bruce. Proposition VII. Theorem III: There are two variable quantities, z & x, of which z is regularly increased by the given increment \dot{z}, and n\dot{z} = v, v − \dot{z} = \grave{v}, \grave{v} − \dot{z} = \grave{\grave{v}}, and thus henceforth. Moreover, I say that in the time z increases to z + v, x increases likewise to become

x + \dot{x}\frac{v}{1\dot{z}} + \ddot{x}\frac{v\grave{v}}{1 \cdot 2\dot{z}^2} + \dddot{x}\frac{v\grave{v}\grave{\grave{v}}}{1 \cdot 2 \cdot 3\dot{z}^3} + \&c.

Corollary II: If for the evanescent increments, the fluxions of the proportionals themselves are written, now with all the \grave{v}, \grave{\grave{v}}, &c. equal to v, as the time z uniformly flows to become z + v, x becomes

x + \dot{x}\frac{v}{1\dot{z}} + \ddot{x}\frac{v^2}{1 \cdot 2\dot{z}^2} + \dddot{x}\frac{v^3}{1 \cdot 2 \cdot 3\dot{z}^3} + \&c.

Crumpet 5: Interpretation of the original theorem of Brook Taylor

Unfortunately, the English translation of Taylor's theorem is only moderately helpful to anyone who is not well acquainted with early 18th century mathematics. In 1715, function notation was still 20 years in the making. Today, we would interpret the declaration of the two variables as declaring that x is a function of z. The claim in Theorem III is that we can rewrite x(z + v) as x + \dot{x}\frac{v}{1\dot{z}} + \ddot{x}\frac{v\grave{v}}{1 \cdot 2\dot{z}^2} + \dddot{x}\frac{v\grave{v}\grave{\grave{v}}}{1 \cdot 2 \cdot 3\dot{z}^3} + \&c. Just as x should be interpreted as a function of z, so should \dot{x}, \ddot{x}, and \dddot{x}. More precisely, \dot{x} means x(z + \dot{z}) − x(z), the amount x is incremented as z is incremented by \dot{z}. Likewise, \ddot{x} is the amount \dot{x} is incremented as z is incremented by \dot{z}, so

\ddot{x} = \dot{x}(z + \dot{z}) - \dot{x}(z) = \left[ x(z + 2\dot{z}) - x(z + \dot{z}) \right] - \left[ x(z + \dot{z}) - x(z) \right] = x(z + 2\dot{z}) - 2x(z + \dot{z}) + x(z).

Similarly, \dddot{x} is the amount \ddot{x} is incremented as z is incremented by \dot{z}. Now would be a good time to break from reading to verify that \dddot{x} = x(z + 3\dot{z}) − 3x(z + 2\dot{z}) + 3x(z + \dot{z}) − x(z), that \ddddot{x} = x(z + 4\dot{z}) − 4x(z + 3\dot{z}) + 6x(z + 2\dot{z}) − 4x(z + \dot{z}) + x(z), and so on. With this understanding and the conventions x_0 for x, x_1 for \dot{x}, x_2 for \ddot{x}, v_0 for v, v_1 for \grave{v}, v_2 for \grave{\grave{v}}, and so on, it is then an algebraic exercise to see that

x(z + n\dot{z}) = \sum_{j=0}^{n} \binom{n}{j} x_j = x_0 + x_1 \frac{n}{1} + x_2 \frac{n(n-1)}{1 \cdot 2} + x_3 \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3} + \cdots + x_n \frac{n(n-1) \cdots 1}{1 \cdot 2 \cdot 3 \cdots n}
              = x_0 + x_1 \frac{n\dot{z}}{1\dot{z}} + x_2 \frac{n\dot{z}(n-1)\dot{z}}{1 \cdot 2\dot{z}^2} + x_3 \frac{n\dot{z}(n-1)\dot{z}(n-2)\dot{z}}{1 \cdot 2 \cdot 3\dot{z}^3} + \cdots + x_n \frac{n\dot{z}(n-1)\dot{z} \cdots 1\dot{z}}{1 \cdot 2 \cdot 3 \cdots n\dot{z}^n}
              = x_0 + x_1 \frac{v}{1\dot{z}} + x_2 \frac{v v_1}{1 \cdot 2\dot{z}^2} + x_3 \frac{v v_1 v_2}{1 \cdot 2 \cdot 3\dot{z}^3} + \cdots + x_n \frac{v v_1 \cdots v_{n-1}}{1 \cdot 2 \cdot 3 \cdots n\dot{z}^n}.

This calculation is essentially Taylor's proof of Theorem III.

Corollary II (which we would consider the theorem) is not proved by Taylor beyond the "obvious" application of Newton's theory of fluxions. In today's language, corollary II follows by applying the limit as n → ∞ to the expression from Theorem III. It makes for another nice exercise to verify that \lim_{n \to \infty} \frac{x_k}{\dot{z}^k} = x^{(k)}(z), the kth derivative of x. And one final exercise to see that \lim_{n \to \infty} v_k = v. As Taylor took these results for granted, so shall we. Applying them to Theorem III, we see that x(z + v) = x(z) + x'(z)\frac{v}{1!} + x''(z)\frac{v^2}{2!} + x'''(z)\frac{v^3}{3!} + \cdots. In the notation of Taylor, \frac{\dot{x}}{\dot{z}} is the first derivative of x, \frac{\ddot{x}}{\dot{z}^2} is the second derivative of x, and so on. So we in fact have x + \dot{x}\frac{v}{1\dot{z}} + \ddot{x}\frac{v^2}{1 \cdot 2\dot{z}^2} + \dddot{x}\frac{v^3}{1 \cdot 2 \cdot 3\dot{z}^3} + \&c as claimed.
It is interesting that Theorem III is true for any function x defined on the interval [z, z + v]. No matter if x is differentiable, or even continuous. It is a statement about finite differences. It is the corollary that requires many more assumptions because that is where we pass to the limit.

Octave
Two things that will come in handy time and again when using Octave are inline functions and .m files. Creating
an inline function is a simple way to make a “custom” function in Octave. Creating a .m file is an organized way
to execute a number of commands and save your work for later.
In the last section we saw many built-in functions like sin(x), log(x), and abs(x). These have predefined
meaning in Octave. But what if you want to define f(x) = 3x²? There is no built-in "3 x squared" function. That's where an inline function is useful. The syntax for an inline function is
name = inline('function definition')
where name is the name of the function and function definition is its formula. In the case of f(x) = 3x², the Octave code looks like f=inline('3*x^2'). Then you can use f the same way you would use sin or log or abs. Write the name of the function, left parenthesis, argument, right parenthesis. So, after defining f with the f=inline('3*x^2') statement, f(7) will result in 147:

octave:1> f=inline('3*x^2');
octave:2> f(7)
ans = 147
Now we may complete Experiment 1 of section 1.1 a fourth way. Instead of doing the computations on the
command line, we can create a text file with the commands in it. Saved as a .m file, Octave will recognize it as a list
of instructions. If you are familiar with programming, this way of working with Octave will come very naturally.
Writing a .m file is the equivalent of writing a program. After it is written, it needs to be processed. On the Octave
command line, a .m file is run by typing the name of the file, without the .m. That’s it, so it isn’t exactly like
writing a program. There is no compiling. It’s a little bit more like scripting that way.
To begin, use any text editor you like to create the list of commands. Note well, Microsoft Word, LibreOffice,
and other word processors are not text editors. They are word processors. They have font formatting features,
page set up features, and so on. Now imagine your last report or letter to Mom and remove all the formatting,
save separation of paragraphs. That’s a text file. No bold, no centering, no images, no special fonts, no margins,
no pages. Just the typed words. There is no need for all the decorations a word processor allows. All Octave needs
is a list of commands. The only formatting you will need is the line feed (new line) and tabs. If you don’t already
have a favorite text editor (and maybe even if you do), you should use the one that comes with Octave. If you use
this program, you will have no problems. So, first create the text document experiment1.m exactly as shown here:
format('long')
p1 = 10*pi-31
p2 = 100*p1-41
p3 = 100*p2-59
p4 = 100*p3-26
p5 = 100*p4-53
p6 = 100*p5-58
p7 = 100*p6-97
Then, on the Octave command line, type experiment1 to get the results:

octave:1> experiment1
p1 = 0.415926535897931
p2 = 0.592653589793116
p3 = 0.265358979311600
p4 = 0.535897931159980
p5 = 0.589793115997963
p6 = 0.979311599796347
p7 = 0.931159979634685
This way of writing Octave commands has two distinct advantages. First, if you make errors, it’s a simple matter
to correct them. Just edit the text file and save the changes. Second, you have a record of your work. You can
share it, print it, or just save it for later. There is only one real disadvantage. It’s more involved than just executing
a few commands on the command line. So, for simple computations, it is more headache than necessary.
Note well that the .m file has to be saved in the same directory from which Octave was started. This type of
detail will be taken care of for you if you use an IDE, but if you are using a command line and text editor, you
need to be sure .m files are saved to the proper location.

Exercises

1. Find T_3(x) and R_3(x) for the function expanded about x_0.

(a) f(x) = sin(x); x_0 = 0. [S]
(b) f(x) = sin(x); x_0 = π/2. [S]
(c) f(x) = sin(x); x_0 = π.
(d) f(x) = e^x; x_0 = 0.
(e) f(x) = e^x; x_0 = ln 2. [A]
(f) f(x) = x sin(x); x_0 = 0.
(g) f(x) = cos²(x); x_0 = 0.

2. Let f(x) = 4x³ − 2x² + 8x − 9.

(a) Find T_3(x) and R_3(x) expanded about x_0 = 0.
(b) Find T_3(x) and R_3(x) expanded about x_0 = 2.
(c) Make a conjecture based on your answers to parts (a) and (b). Can you prove it?

3. Find the 36th Maclaurin Polynomial for f(x) = e^x.

4. Suppose f(x) is a function whose fourth derivative exists on the whole real line, (−∞, ∞), and that f(2) = 3, f′(2) = −1, f″(2) = 2, and f‴(2) = −1.

(a) Write down the third Taylor polynomial for f(x) expanded about x_0 = 2.
(b) Use the Taylor polynomial to approximate f(4).
(c) Find a bound on the absolute error of the approximation using the fact that −3 ≤ f^{(4)}(ξ) ≤ 5 for all ξ ∈ [2, 4].

5. Compute the 3rd Taylor Polynomial for f(x) = x⁵ − 2x⁴ + x³ − 9x² + x − 1 expanded about x_0 = 1.

6. Find the second Taylor Polynomial for f(x) = csc x expanded about x_0 = π/4. Here are some facts you may find useful:
f′(x) = −csc(x) cot(x)          csc(x) = 1/sin(x)
f″(x) = csc(x)(1 + 2 cot²(x))   cot(x) = cos(x)/sin(x)

7. The hyperbolic sine, sinh(x), and hyperbolic cosine, cosh(x), are derivatives of one another. That is, d/dx (sinh(x)) = cosh(x) and d/dx (cosh(x)) = sinh(x). Find the remainder term, R_43, associated with the 43rd Maclaurin polynomial for f(x) = cosh(x).

8. Use an inline function to evaluate the Taylor polynomial T_4(x) = 1 − (1/2)x² + (1/24)x⁴ at the given value of x. [S]

(a) 0
(b) 1/2
(c) 1
(d) π

9. Use an inline function to evaluate the Taylor polynomial T_3(x) = 1 + x + (1/2)x² + (1/6)x³ at the given value of x.

(a) 0
(b) 3/2
(c) 2
(d) e [A]

10. Write and run a .m file that finds all the answers for exercise 8. [S]

11. Write and run a .m file that finds all the answers for exercise 9.

12. Find ξ(x) as guaranteed by Taylor's theorem in the following situation. [A]

(a) f(x) = cos(x), x_0 = 0, n = 3, x = π.
(b) f(x) = e^x, x_0 = 0, n = 3, x = ln 4.
(c) f(x) = ln(x), x_0 = 1, n = 4, x = 2.

13. Let f(x) = x³.

(a) Find the second Taylor polynomial, P_2(x), about x_0 = 0.
(b) Find the remainder term, R_2(0.5), and the actual error in using P_2(0.5) to approximate f(0.5).
(c) Repeat part (a) using x_0 = 1.
(d) Repeat part (b) using the polynomial from part (c).

14. Find the second Taylor polynomial, P_2(x), for f(x) = e^x cos x about x_0 = 0.

(a) Use P_2(0.5) to approximate f(0.5). Find an upper bound on the error |f(0.5) − P_2(0.5)| using the remainder term and compare it to the actual error.
(b) Find a bound on the error |f(x) − P_2(x)| good on the interval [0, 1].
(c) Approximate \int_0^1 f(x)\,dx by calculating \int_0^1 P_2(x)\,dx instead.
(d) Find an upper bound for the error in (c) using \int_0^1 |R_2(x)|\,dx and compare the bound to the actual error.

15. Let f(x) = e^x.

(a) Find the nth Maclaurin polynomial P_n(x) for f(x).
(b) Find a bound on the error in using P_4(2) to approximate f(2).
(c) How many terms of the Maclaurin polynomial would you need to use in order to approximate f(2) to within 10^{−10}? In other words, for what n does P_n(2) have an error bound less than or equal to 10^{−10}?

16. Find the fourth Taylor Polynomial for ln x expanded about x_0 = 1.

17. What is the 50th term of T_100(e^x) expanded about x_0 = 6?

18. The Maclaurin series for the arctangent function converges for −1 < x ≤ 1 and is given by

\arctan x = \lim_{n \to \infty} P_n(x) = \lim_{n \to \infty} \sum_{i=1}^{n} (-1)^{i+1} \frac{x^{2i-1}}{2i-1}.

Use the fact that tan(π/4) = 1 to determine the number of terms, n, of the series that need to be summed to ensure that |4P_n(1) − π| < 10^{−3}.

19. Exercise 18 details a rather inefficient means of obtaining an approximation to π. The method can be improved substantially by observing that π/4 = arctan(1/2) + arctan(1/3) and evaluating the series for the arctangent at 1/2 and at 1/3. Determine the number of terms that must be summed to ensure an approximation to π within 10^{−3}.

20. For f(x) = tan^{−1}(x),

f^{(n)}(0) = 0 if n is even, and f^{(n)}(0) = (−1)^{(n−1)/2}(n − 1)! if n is odd.

Find the nth Maclaurin polynomial P_n(x) for f.

21. How many terms of the Maclaurin Series of sin x are needed to guarantee an approximation with error no more than 10^{−2} for any value of x between 0 and 2π?

22. Suppose you are approximating f(x) = e^x using the tenth Maclaurin polynomial. Find the largest interval over which the approximation is guaranteed to be accurate to within 10^{−3}.

23. Find a bound on the error in approximating e^{10} by using the twenty-fifth Taylor polynomial of g(x) = e^x expanded about x_0 = 0.

24. Find a bound on the error of the approximation

e² ≈ 1 + 2 + (1/2)(2)² + (1/6)(2)³ + (1/24)(2)⁴ + (1/120)(2)⁵

according to Taylor's Theorem. Compare this bound to the actual error.

25. Suppose f^{(8)}(x) = e^x cos x for some function f. Find a bound on the error in approximating f(x) over the interval [0, π/2] using T_7(x) expanded about x_0 = 0.

26. Let f(x) = 1/x, and x_0 = 5. [S]

(a) Find T_2(x).
(b) Find R_2(x).
(c) Use T_2(x) to approximate f(1) and f(9).
(d) Find a theoretical upper bound on the absolute error of each of the approximations in part (c).
(e) Find a theoretical lower bound on the absolute error of each of the approximations in part (c).
(f) Find the actual absolute error for each of the approximations in part (c). Verify that they are indeed between the theoretical bounds.
(g) Sketch graphs of f(x) and T_2(x) on the same set of axes for x ∈ [1, 9].

27. Let f(x) = ln(1 + x) and x_0 = 0.

(a) Find T_3(x).
(b) Find R_3(x).
(c) Use T_3(x) to approximate f(1) and f(26).
(d) Find a theoretical upper bound on the absolute error of each of the approximations in part (c).
(e) Find a theoretical lower bound on the absolute error of each of the approximations in part (c).
(f) Find the actual absolute error for each of the approximations in part (c). Verify that they are indeed between the theoretical bounds.
(g) Sketch graphs of f(x) and T_2(x) on the same set of axes for x ∈ [1, 26].

28. Suppose f(x) is such that −3 ≤ f^{(10)}(x) ≤ 7 for all x ∈ [0, 10]. Find lower and upper bounds on the absolute error in using T_9(x) expanded about x_0 = 3 to approximate

(a) f(0).
(b) f(10).

29. Suppose you wish to approximate the value of −e⁴ sin 4 using separate Maclaurin polynomials (Taylor polynomials expanded about x_0 = 0) for the sine and exponential functions instead of a single Maclaurin polynomial for the function f(x) = −e^x sin x. How many terms of each would you need in order to get accuracy within 10^{−20}? Ignore round-off error.

30. Find a theoretical upper bound, as a function of x, for the absolute error in using T_4(x) to approximate f(x).

(a) e^x sin x; x_0 = 0. [S]
(b) e^{−x²}; x_0 = 0.
(c) 10/x + sin(10x); x_0 = π.

31. The Maclaurin Series for f(x) = e^{−x} is

\sum_{i=0}^{\infty} \frac{(-1)^i}{i!} x^i = 1 - x + \frac{1}{2}x^2 - \frac{1}{6}x^3 + \ldots

Find a bound on the error in approximating 1/e by 1 − 1 + 1/2 − 1/6 + 1/24.

32. The Taylor series for f(x) = e^x is

T(x) = 1 + x + \frac{1}{2!}x^2 + \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \frac{1}{5!}x^5 + \cdots.

This series converges to f(x) for all values of x. In particular, for x = 1, this means that

f(1) = T(1) = 1 + 1 + \frac{1}{2!}(1)^2 + \frac{1}{3!}(1)^3 + \frac{1}{4!}(1)^4 + \cdots

Simplifying this equation, we see that

e = 1 + 1 + \frac{1}{2} + \frac{1}{6} + \frac{1}{24} + \frac{1}{120} + \cdots

Use Taylor Series to find infinite sums that sum to

(a) ln(2)
(b) 2/3
(c) π/4
(d) √2

1.3 Speed
Besides accuracy, there is nothing more important about a numerical method than speed. There is almost always a
trade-off between one and the other, however. Fast computations are often not particularly accurate, and accurate
calculations are often not particularly fast. There are certain algorithms that produce accurate results quickly,
however. Deriving them, or identifying them once derived is what numerical analysis is all about.
The first type of numerical method we will encounter produces a sequence of approximations that, when everything is working, approach some desired value, say p. With these methods, we will get a sequence ⟨p_n⟩ with lim_{n→∞} p_n = p. You should be familiar with the concept of the limit of a sequence from Calculus, but the purpose
there was much different from ours here. Generally, you were concerned with whether a given sequence converged
at all. And when it did converge, and you were very lucky, you were able to determine the limit. In numerical
analysis, we know certain sequences converge, and are only interested in how quickly they do so.
Simple observation (and a little common sense) can tell you which cars on a highway are traveling faster than
which. Simple observation (and a little common sense) will also often tell you which sequences converge faster
than which. Consider the sequences in Table 1.4 which all converge to e ≈ 2.71828182845904. ⟨t_n⟩ is accurate

Table 1.4: Some sequences that converge to e.


n qn rn sn tn
0 3 3 3 3
1 2.9436563656918 2.86799618929986 2.82129001274358 2.78177393100014
2 2.89858145824525 2.78315514435127 2.73850656616954 2.72150682612711
3 2.86252153228801 2.73974041668143 2.71973377603211 2.71829014894701
4 2.83367359152222 2.72324781752852 2.71830229432561 2.71828182851442
5 2.81059523890958 2.71899828870116 2.71828184916891 2.71828182845904
6 2.79213255681947 2.71833715075158 2.71828182845934 2.71828182845904
7 2.77736241114739 2.71828369688657 2.71828182845904 2.71828182845904
8 2.76554629460972 2.71828184959225 2.71828182845904 2.71828182845904
9 2.75609340137958 2.71828182851528 2.71828182845904 2.71828182845904
10 2.74853108679547 2.71828182845907 2.71828182845904 2.71828182845904
.. .. .. .. ..
. . . . .

to 15 significant digits by the sixth term; ⟨s_n⟩ is accurate to 15 significant digits by the eighth term; ⟨r_n⟩ is still not accurate to 15 significant digits by the eleventh term, but seems likely to gain 15 significant digits of accuracy on the twelfth term; and ⟨q_n⟩ is only accurate to 2 significant digits by the eleventh term, so seems likely to take considerably more than twelve terms to gain 15 significant digits of accuracy. Since they all started at 3, it seems reasonable to say that, ordered from fastest to slowest, they are ⟨t_n⟩, ⟨s_n⟩, ⟨r_n⟩, ⟨q_n⟩. And that is correct as we will see soon. But just like knowing which cars are faster than which is different from knowing how fast each is going, knowing which sequences converge faster than which is different from knowing how quickly each one converges. To measure the speed of a given car, you need access to its speedometer or a radar gun. To measure the order of convergence (speed) of a sequence, you need a definition and a little algebra.

Order of convergence of a sequence


Suppose the sequence ⟨p_n⟩ converges to p. Then we say ⟨p_n⟩ converges with order α ≥ 1 if

\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} = \lambda

for some real number λ > 0.


Let's see how to use this definition to calculate the orders of convergence of the sequences in Table 1.4. According to the definition, α, should it exist, gives the speed (or order) of convergence of a sequence. Now assuming that α does exist, we have that \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} = \lambda, so for large enough n,

\frac{|p_{n+1} - p|}{|p_n - p|^\alpha} \approx \frac{|p_{n+2} - p|}{|p_{n+1} - p|^\alpha} \approx \lambda.

In particular, we can solve for α to find

\alpha \approx \frac{\ln\left( \frac{p_{n+2} - p}{p_{n+1} - p} \right)}{\ln\left( \frac{p_{n+1} - p}{p_n - p} \right)}.
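Applied to the first three terms of ⟨q_n⟩ from Table 1.4, this estimate is a one-liner in Octave (a sketch, not code from the text):

p = exp(1);                                 % the limit of the sequence
q = [3, 2.9436563656918, 2.89858145824525]; % q0, q1, q2
log(abs(q(3)-p)/abs(q(2)-p)) / log(abs(q(2)-p)/abs(q(1)-p))
% the result is very nearly 1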

Crumpet 6: Order of Convergence Less than or equal to 1?

There is no such thing as an order of convergence less than one because if \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} = \lambda for some 0 < α < 1, then

\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} \cdot |p_n - p|^{\alpha - 1},

a contradiction. On the one hand, the ratio test implies that \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} exists and is less than or equal to 1. On the other hand, α < 1 ⟹ α − 1 < 0 so for |p_n − p| small, |p_n − p|^{α−1} is large. Hence, \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} \cdot |p_n - p|^{\alpha - 1} does not exist. To be rigorous, let M be any real number. Then there exists an N_1 such that n > N_1 implies \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} > 0.9\lambda. There also exists N_2 such that n > N_2 implies |p_n - p| < \left( \frac{0.9\lambda}{M} \right)^{\frac{1}{1-\alpha}}, so |p_n - p|^{\alpha - 1} > \frac{M}{0.9\lambda}. Letting N = max{N_1, N_2} we have that n > N implies both \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} > 0.9\lambda and |p_n - p|^{\alpha - 1} > \frac{M}{0.9\lambda}. Hence, for n > N, we have

\frac{|p_{n+1} - p|}{|p_n - p|} = \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} \cdot |p_n - p|^{\alpha - 1} > 0.9\lambda \cdot \frac{M}{0.9\lambda} = M.

Therefore, \lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} does not exist. When α = 1, it must be that λ ≤ 1 because otherwise the ratio test implies that ⟨|p_n − p|⟩ diverges, and, therefore, ⟨p_n⟩ diverges.


For example,

\frac{\ln\left( \frac{q_2 - e}{q_1 - e} \right)}{\ln\left( \frac{q_1 - e}{q_0 - e} \right)} = \frac{\ln\left( \frac{2.8985 - e}{2.9436 - e} \right)}{\ln\left( \frac{2.9436 - e}{3 - e} \right)} \approx 1  and  \frac{\ln\left( \frac{q_{10} - e}{q_9 - e} \right)}{\ln\left( \frac{q_9 - e}{q_8 - e} \right)} = \frac{\ln\left( \frac{2.7485 - e}{2.7560 - e} \right)}{\ln\left( \frac{2.7560 - e}{2.7655 - e} \right)} \approx 1.

And if we try other sets of three consecutive terms of ⟨q_n⟩, we get the same results. The order of convergence of ⟨q_n⟩ is about 1. Of course, we would need a formula for |q_n − e| to determine whether the limit were truly 1, but we have some evidence. Repeating the calculations for ⟨r_n⟩, ⟨s_n⟩, and ⟨t_n⟩, we get approximate orders of convergence 1.322, 1.618, and 2, respectively. Again we see that, ordered from fastest to slowest, they are ⟨t_n⟩, ⟨s_n⟩, ⟨r_n⟩, ⟨q_n⟩.
If you attempted to calculate the orders of convergence yourself, you may have noticed that more information is
needed to use sn with n > 6 or tn with n > 4. All of these terms in the table are equal, so the formula for α fails to
produce a real number! A more useful table for calculating orders of convergence is one listing absolute errors: In

Table 1.5: Absolute errors.


n |qn − e| |rn − e| |sn − e| |tn − e|
0 2.817(10)−1 2.817(10)−1 2.817(10)−1 2.817(10)−1
1 2.253(10)−1 1.497(10)−1 1.03(10)−1 6.349(10)−2
2 1.802(10)−1 6.487(10)−2 2.022(10)−2 3.224(10)−3
3 1.442(10)−1 2.145(10)−2 1.451(10)−3 8.32(10)−6
4 1.153(10)−1 4.965(10)−3 2.046(10)−5 5.538(10)−11
5 9.231(10)−2 7.164(10)−4 2.07(10)−8 2.453(10)−21
6 7.385(10)−2 5.532(10)−5 2.953(10)−13 4.817(10)−42
7 5.908(10)−2 1.868(10)−6 4.263(10)−21 1.856(10)−83
8 4.726(10)−2 2.113(10)−8 8.777(10)−34 2.757(10)−166
9 3.781(10)−2 5.623(10)−11 2.608(10)−54 6.084(10)−332
10 3.024(10)−2 2.22(10)−14 1.595(10)−87 2.961(10)−663

addition to making it easier to calculate α, this chart makes it painfully obvious that our common sense conclusion

about which sequences converge faster than which was quite right. Just compare the accuracy (absolute errors) of
the eleventh terms.
So now we can calculate orders of convergence, but what does it all mean? What does the order of convergence tell us about successive terms in the sequence? Solving the approximation \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} \approx \lambda gives us that |p_{n+1} − p| ≈ λ|p_n − p|^α. So, roughly speaking, convergence of order α means that, for large enough n, the approximation p_{n+1} is about λ|p_n − p|^{α−1} times closer to the limit p than is p_n. To rephrase in terms of significant digits of accuracy, a little bit of algebra:

|p_{n+1} - p| \approx \lambda |p_n - p|^\alpha

\left| \frac{p_{n+1} - p}{p} \right| \approx \lambda \left| \frac{p_n - p}{p} \right|^\alpha \cdot |p|^{\alpha - 1}

-\log\left| \frac{p_{n+1} - p}{p} \right| \approx -\alpha \log\left| \frac{p_n - p}{p} \right| - \log\left( \lambda |p|^{\alpha - 1} \right)

d(p_{n+1}) \approx \alpha d(p_n) - \log\left( \lambda |p|^{\alpha - 1} \right).
Based on this calculation, we conclude these rules of thumb:

1. for linear convergence (α = 1), d(p_{n+1}) ≈ d(p_n) − log λ, so each term has a fixed number more significant digits of accuracy (approximately equal to − log λ) than the previous;

2. for quadratic convergence (α = 2), d(p_{n+1}) ≈ 2d(p_n) − log(λ|p|), so each term has double the number of significant digits of accuracy of the previous, give or take some;

3. for cubic convergence (α = 3), d(p_{n+1}) ≈ 3d(p_n) − log(λ|p|²), so each term has triple the number of significant digits of accuracy of the previous, give or take some;

and so on. Summarizing, for large n, you can expect that each term will have − log(λ|p|^{α−1}) more than α times as many significant digits of accuracy as the previous term. We can see this claim in action by calculating λ for the sequences ⟨t_n⟩, ⟨s_n⟩, ⟨r_n⟩, and ⟨q_n⟩. Using the fact that λ ≈ \frac{|p_{n+1} - p|}{|p_n - p|^\alpha}, we find that λ = 0.8 for each sequence.
Therefore, ⟨q_n⟩ should show each term having − log 0.8 ≈ .1 more significant digits of accuracy than the previous. More sensibly, this means the sequence will show about one more significant digit of accuracy every ten terms. This is borne out by observing that q_0 has error about 3(10)^{−1} while q_{10} has error about 3(10)^{−2}. For ⟨r_n⟩, we should expect each term to have about − log(0.8 · e^{.322}) ≈ −0.04 more than 1.322 times as many significant digits of accuracy as the previous. For example, r_3 has about log(e/(2.145(10)^{−2})) ≈ 2.1 significant digits of accuracy while r_4 has about 1.322(2.1) − .04 ≈ 2.73 significant digits of accuracy, r_5 has 1.322(2.73) − .04 ≈ 3.57 significant digits of accuracy, and so on until r_8 has about 8.1 significant digits of accuracy. Again this is borne out by the table as log(e/|r_8 − e|) = log(e/(2.113(10)^{−8})) ≈ 8.1. Though we can do a similar calculation for ⟨t_n⟩, it's easier just to eyeball it since all we need to see is that the exponent in the scientific notation doubles, give or take a little, from one term to the next. Indeed it does as it goes from 1 to 2 to 3 to 6 to 11, and so on.
Note that in all this analysis, we have ignored the requirement that n be “large”. That was acceptable in this
case since these sequences were contrived so that even n = 0 was large enough! In practical applications this will
not be the case.
To appreciate just how much faster one order of convergence is over another, consider the relation

d(p_{n+1}) \approx \alpha d(p_n) - \log\left( \lambda |p|^{\alpha - 1} \right)

again. Now suppose we know that d(p_{n_0}) = d_{n_0} for some particular n_0 large enough that the approximation is reasonable. Then it can be shown that, for α > 1,

d(p_{n_0+k}) \approx (d_{n_0} - C)\alpha^k + C

where C = \frac{\log\left( \lambda |p|^{\alpha - 1} \right)}{\alpha - 1}.

Crumpet 7: Solving a Recurrence Relation

The relation d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}) is an example of a recurrence relation. In particular, a first order linear nonhomogeneous recurrence relation with constant coefficients since it has the form

a_{n+1} = k_1 a_n + k_2

where k_1 and k_2 are constants. Linear nonhomogeneous recurrence relations can be solved by summing a homogeneous solution and a particular solution. For the particular solution, we seek a solution of the form a_n = A (for all n) by substituting this assumed solution into the recurrence relation. Doing so gives A = k_1 A + k_2, so A = \frac{k_2}{1 - k_1} is such a solution. For the homogeneous solution, we seek a sequence of the form a_n = r^n that satisfies a_{n+1} = k_1 a_n + 0. Substituting our assumed solution into the modified (homogeneous) recurrence relation gives r^{n+1} = k_1 r^n. Rearranging, r^n(r − k_1) = 0 so r = 0 or r = k_1. Notice that Bk_1^n is also a solution for any constant B. This includes the solution a_n = 0 which would arise from setting r = 0. Finally, putting the particular and homogeneous solutions together, the solution of a_{n+1} = k_1 a_n + k_2 is a_n = Bk_1^n + \frac{k_2}{1 - k_1} for any constant B. In the case of d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}), k_1 = α and k_2 = −log(λ|p|^{α−1}) so d(p_n) = B\alpha^n + \frac{\log(\lambda|p|^{\alpha-1})}{\alpha - 1}. The value of B is determined by substituting any known element of the sequence into this formula and solving for B. Supposing d(p_{n_0}) = d_{n_0} yields d(p_{n_0+k}) = \left( d_{n_0} - \frac{\log(\lambda|p|^{\alpha-1})}{\alpha - 1} \right)\alpha^k + \frac{\log(\lambda|p|^{\alpha-1})}{\alpha - 1}.

The important thing to see here is that d(p_{n_0+k}) is an exponential function when α > 1. The number of significant digits of accuracy grows exponentially with base α. As we saw before, for α = 1, the number of significant digits grows linearly. In calculus you learned that any exponential function grows much faster than any polynomial function, so it is reasonable and correct to conclude that sequences converging with orders greater than 1 are markedly faster converging than are sequences converging with linear (α = 1) order.
But be careful. Based on this same memory of calculus, you would also conclude that the sequence ⟨2^{−n}⟩ converges to 0 much faster than does ⟨n^{−2}⟩. By some measures, that's true, but not by all measures. Consider the orders of convergence of these two sequences. We seek values α_1 and α_2 such that

\lim_{n \to \infty} \frac{|2^{-(n+1)} - 0|}{|2^{-n} - 0|^{\alpha_1}} = \lambda_1  and  \lim_{n \to \infty} \frac{|(n+1)^{-2} - 0|}{|n^{-2} - 0|^{\alpha_2}} = \lambda_2

for some real numbers λ_1 and λ_2. A little bit of algebra will lead to solutions:

\frac{|2^{-(n+1)} - 0|}{|2^{-n} - 0|^{\alpha_1}} = \frac{2^{-n-1}}{2^{-\alpha_1 n}} = 2^{(\alpha_1 - 1)n - 1}

while

\frac{|(n+1)^{-2} - 0|}{|n^{-2} - 0|^{\alpha_2}} = \frac{n^{2\alpha_2}}{n^2 + 2n + 1}.

The only way \lim_{n \to \infty} 2^{(\alpha_1 - 1)n - 1} will be a nonzero constant is if α_1 = 1. The only way \lim_{n \to \infty} \frac{n^{2\alpha_2}}{n^2 + 2n + 1} will be a nonzero constant is if the degrees of the numerator and denominator are equal. That means α_2 must be 1 as well. So ⟨2^{−n}⟩ and ⟨n^{−2}⟩ both converge to zero with linear order. They are equally extremely slow to converge by this measure! Still, something should not feel quite right about claiming that ⟨2^{−n}⟩ and ⟨n^{−2}⟩ converge at the same speed.

Rate of Convergence of a Sequence


For sequences that converge with linear order, we need a finer measure than order to determine which is faster than
which. Recall from calculus,

\lim_{n \to \infty} \frac{2^{-n}}{n^{-2}} = \lim_{n \to \infty} \frac{n^2}{2^n} = \lim_{n \to \infty} \frac{2n}{2^n \ln 2} = \lim_{n \to \infty} \frac{2}{2^n (\ln 2)^2} = 0,

indicating that ⟨2^{−n}⟩ approaches 0 much faster than does ⟨n^{−2}⟩. You may also recall comparisons between power functions:

\lim_{n \to \infty} \frac{n^{-p}}{n^{-q}} = 0

whenever p > q > 0; and between exponential functions:

\lim_{n \to \infty} \frac{a^{-n}}{b^{-n}} = 0

whenever a > b ≥ 1; and between the two:

\lim_{n \to \infty} \frac{a^{-n}}{n^{-q}} = 0

whenever a > 1. In other words, sequences of the form ⟨1/a^n⟩ converge to zero faster than sequences of the form ⟨1/n^p⟩ whenever a > 1. The sequence ⟨1/a^n⟩ converges to zero faster than ⟨1/b^n⟩ whenever a > b ≥ 1. The sequence ⟨1/n^p⟩ converges to zero faster than ⟨1/n^q⟩ whenever p > q > 0. Not all functions are as simple as these, but we can use these as our yard sticks. Suppose ⟨p_n⟩ converges to p, ⟨b_n⟩ converges to 0 and |p_n − p| ≤ λ|b_n| for some constant λ and all sufficiently large n. Then we say that ⟨p_n⟩ converges to p with rate of convergence O(b_n), read "big-oh of b_n". Since we are familiar with sequences of the forms ⟨1/a^n⟩ for some constant a > 1 and ⟨1/n^p⟩ for some constant p > 0, and they are simple enough, typically ⟨b_n⟩ will be one of them. For example, ⟨(2n + 1)/(4n)⟩ converges to 1/2, and

\left| \frac{2n + 1}{4n} - \frac{1}{2} \right| = \frac{1}{4n} \le \frac{1}{4} \cdot \frac{1}{n},

so ⟨(2n + 1)/(4n)⟩ converges with rate O(1/n). We may also say that (2n + 1)/(4n) = 1/2 + O(1/n) to convey exactly the same message. Normally, when we find a rate of convergence, we try to find the fastest converging sequence from our stock of simple examples that satisfies the definition. In this case, there is none faster.
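The bound just stated is easy to see numerically (a quick sketch, not code from the text): the error of (2n + 1)/(4n) is exactly 1/(4n), which never exceeds one quarter of the comparison sequence 1/n.

% errors of (2n+1)/(4n) compared with the sequence 1/n
n = [1 10 100 1000];
err = abs((2*n+1)./(4*n) - 1/2);     % equals 1./(4*n)
disp([n' err' (1./n)'])              % err is always one quarter of 1/n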
Basically all the sequences studied in any depth in calculus converge with linear order. So what does it take to converge with a higher order? Let's have a look at ⟨e^{−2^n}⟩.

\lim_{n \to \infty} \frac{|e^{-2^{n+1}} - 0|}{|e^{-2^n} - 0|^\alpha} = \lim_{n \to \infty} \frac{e^{-2 \cdot 2^n}}{e^{-\alpha 2^n}} = 1

when α = 2. So ⟨e^{−2^n}⟩ is quadratically convergent. Essentially, it takes an exponentially growing exponent to converge with an order greater than 1.
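The three-term estimate of α from earlier in this section confirms the calculation (a sketch, not code from the text); with limit p = 0 the estimate lands on 2:

% estimate alpha for p_n = e^(-2^n) from three consecutive terms
p = exp(-2.^(3:5));               % p_3, p_4, p_5; the limit is 0
log(p(3)/p(2)) / log(p(2)/p(1))   % evaluates to 2, quadratic convergence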

Crumpet 8: Approximating π

The sequence

\frac{1103 \cdot 2^{3/2}}{9801}, \frac{1130173253125}{313826716467 \cdot 2^{7/2}}, \frac{1029347477390786609545}{1116521080257783321 \cdot 2^{23/2}}, \ldots

converges to \frac{1}{\pi}. Its terms are given by the formula

\left\langle \frac{\sqrt{8}}{9801} \sum_{j=0}^{n} \frac{(4j)!(1103 + 26390j)}{(j!)^4 \cdot 396^{4j}} \right\rangle_{n=0,1,2,3,\ldots}

of Srinivasa Ramanujan. For all practical purposes, it converges very quickly. The first term already has about 8 significant digits of accuracy:

\frac{1103 \cdot 2^{3/2}}{9801} \approx 0.31830987844047012321768445317

\frac{1}{\pi} \approx 0.31830988618379067153776752674,

and the second has about 16:

\left| \frac{1130173253125}{313826716467 \cdot 2^{7/2}} - \frac{1}{\pi} \right| \approx 6.48(10)^{-17},

double the accuracy of the first term. The third term is already more than double-precision accurate.
It's tempting to believe, or hope, the sequence is quadratically convergent, but it is not. The third term has an accuracy of about 24 significant digits. Each term in the sequence is approximately 8 significant digits more accurate than the previous, the hallmark of a linearly convergent sequence.
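The partial sums are easy to reproduce in Octave (a sketch, not code from the text), keeping in mind that in double precision the round-off floor is reached by the second or third term:

% first few terms of Ramanujan's sequence converging to 1/pi
s = 0;
for j = 0:3
  s = s + factorial(4*j)*(1103 + 26390*j)/(factorial(j)^4 * 396^(4*j));
  disp(abs(sqrt(8)/9801*s - 1/pi))   % errors drop by roughly 8 digits per term
end%for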

Key Concepts
Order of convergence: The sequence ⟨p_n⟩ converges to p with order of convergence α ≥ 1 if

\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^\alpha} = \lambda

for some real number λ > 0.

Absolute error: For a sequence ⟨p_n⟩ that converges to p with order α, the absolute errors of consecutive terms are related by the approximation

|p_{n+1} - p| \approx \lambda |p_n - p|^\alpha

for large enough n.

Significant digits of accuracy: For a sequence ⟨p_n⟩ that converges to p with order α, the numbers of significant digits of accuracy of consecutive terms are related by the approximation

d(p_{n+1}) \approx \alpha d(p_n) - \log\left( \lambda |p|^{\alpha - 1} \right)

for large enough n. In closed form (for α ≠ 1)

d(p_{n+k}) = (d_n - C)\alpha^k + C

where C = \frac{\log\left( \lambda |p|^{\alpha - 1} \right)}{\alpha - 1}.

Rate of convergence: The sequence ⟨p_n⟩ converges to p with rate of convergence O(b_n) if ⟨b_n⟩ converges to 0 and

|p_n - p| \le \lambda |b_n|

for some constant λ and all sufficiently large n.

Octave
An invaluable tool in any kind of programming is looping. When you need to perform some procedure multiple
times for varying input, a loop is probably the right solution. While there are several types of loops available in
Octave, we will discuss only for loops right now. The idea is to have a variable, sometimes called a counter, that
counts how many times the procedure has been performed. When the procedure has been performed the desired
number of times, the looping ends, and the program continues from there. You almost certainly encountered this
idea before you ever wrote a computer program. If you ever went to the fair and paid a dollar to toss a dozen rings
in hopes of landing one on the neck of a soda bottle, you have experienced looping. You may have even counted
the rings as you tossed them. You were the counter! You had to perform the procedure of throwing a ring into
the field of bottles 12 times. So, perhaps you threw one and counted to yourself “1”. Then you threw another and
counted “2”. And another and counted “3”. And so on through “12”. When the last ring was tossed, you continued
about your day at the fair.
The for loop is an abstract analogy of this situation. Suppose you want to calculate 1!, 2!, 3!, and so on through
12!. In Octave, you could create the following .m file and run it.

factorial(1)
factorial(2)
factorial(3)
factorial(4)
factorial(5)
factorial(6)
factorial(7)
factorial(8)
factorial(9)
factorial(10)
factorial(11)
factorial(12)

But this can be tedious and not particularly reader-friendly, especially if we are interested in doing some computation
many more than 12 times. The purpose of the loop is to reduce the repetitiveness of this approach. We want to
perform the procedure of calculating the factorial of 12 different integers, so a loop is appropriate. The syntax for
the loop is to set up the counter, write the code to perform the procedure, and mark the end of the loop. It looks
something like this.

for j=first:last
do something.
end%for

This will cause Octave to perform the procedure once for each integer from first to last, including both first
and last. The value of the counter, j in this case, may be used in the procedure. So to calculate 1! through 12!,
we might write

for j=1:12
factorial(j)
end%for

This will produce exactly the same output as the program with one line for each factorial. And if later you want
to calculate 1! through 20! instead, all you have to do is change the 12 to a 20. The for loop is your friend!
Now suppose we want to calculate α for each set of three consecutive values of |s_n − e| from Table 1.5. Since there are 9 such sets, we need to create a loop that will run through 9 times. And inside the loop, we will need to perform the calculation

\alpha = \frac{\ln\left( \frac{s_{n+2} - e}{s_{n+1} - e} \right)}{\ln\left( \frac{s_{n+1} - e}{s_n - e} \right)}.

But before we can start, we need to tell Octave about the 11 values from the table. The most convenient way to do so is in an array. An array is like a vector. It has components. In this
case, each component will hold one value from the table. And the syntax for creating the array is a lot like vector
notation. We will use square brackets to delimit the components of the array, and we will separate the components
by commas. So, the first line of our Octave code will look like this.

errs = [2.817*10^(-1), 1.03*10^(-1), 2.022*10^(-2), 1.451*10^(-3), ...


2.046*10^(-5), 2.07*10^(-8), 2.953*10^(-13), 4.263*10^(-21), ...
8.777*10^(-34), 2.608*10^(-54), 1.595*10^(-87)]

The ellipses (three consecutive dots) at the ends of the first two lines are needed to tell Octave that the command
continues onto the next line. Without them, separating a single command over multiple lines will cause a syntax
error. Starting a new line in Octave is the signal to start a new command as well.
Now Octave knows the values of |sn − e|. Using this vector is a lot like using subscripts. The first value,
2.817(10)−1 , is called errs(1). The second is called errs(2). The third is called errs(3), and so on. The length
of the array errs can be retrieved using the length() function of Octave. The command length(errs)-2 will be
used instead of hard-coding the 9. So we can finish the Octave code like so.

errs = [2.817*10^(-1), 1.03*10^(-1), 2.022*10^(-2), 1.451*10^(-3), ...


2.046*10^(-5), 2.07*10^(-8), 2.953*10^(-13), 4.263*10^(-21), ...
8.777*10^(-34), 2.608*10^(-54), 1.595*10^(-87)];
for j=1:length(errs)-2
alpha = log(errs(j+2)/errs(j+1))/log(errs(j+1)/errs(j))
end%for

This code produces these results:

alpha = 1.6182
alpha = 1.6181
alpha = 1.6176
alpha = 1.6182
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180
alpha = 1.6180

Not bad, but we can do better. Let's calculate α, λ, and d(s_n) by two different methods, directly and using the formula d(p_{n+1}) ≈ αd(p_n) − log(λ|p|^{α−1}). Then let's display the results in a nicely formatted table.


We will need the disp() command and a two-index array. The disp() command is used to display some text
or some quantity. When used for text, the text needs to be delimited by single quotation marks. When used for
quantities, not. So, we might have an Octave program output the word "hello" with the command disp('hello')
or have it output the value of ln(2) with the command disp(log(2)). The disp() command can also handle
variables, so if p1 and p2 have been assigned values, then we can display their difference using disp(p2-p1). A
two-index array can be thought of as a table, or a matrix. It holds values in what can be imagined as rows and
columns. So, instead of having errs(j) as we did before, we may have errs(j,k) where j indicates the row and
k indicates the column. The program
A(2,4) = 7;
disp(A);
produces
0 0 0 0
0 0 0 7
OK, back to the task at hand. We will combine everything we have learned about Octave into one program.
errs = [2.817*10^(-1), 1.03*10^(-1), 2.022*10^(-2), 1.451*10^(-3), ...
2.046*10^(-5), 2.07*10^(-8), 2.953*10^(-13), 4.263*10^(-21), ...
8.777*10^(-34), 2.608*10^(-54), 1.595*10^(-87)];

d = inline('-log(x/exp(1))/log(10)');
for j=1:9
% alpha:
T(j,1) = log(errs(j+2)/errs(j+1))/log(errs(j+1)/errs(j));
% lambda:
T(j,2) = errs(j+2)/errs(j+1)^T(j,1);
% d (explicit):
T(j,3) = d(errs(j+2));
end

alpha = 1.61804;
lambda = 0.8;
constant = log(lambda*exp(alpha-1))/log(10);
T(1,4) = T(1,3);
for j=2:9
% d (recursive)
T(j,4) = alpha * T(j-1,4) - constant;
end%for

disp('   alpha     lambda    d (expl)    d (rec)');
disp('  --------------------------------------------');
disp(T)
produces
alpha lambda d (expl) d (rec)
--------------------------------------------
1.61816 0.80015 2.12851 2.12851
1.61814 0.80010 3.27263 3.27252
1.61764 0.79855 5.12339 5.12356
1.61822 0.80158 8.11832 8.11863
1.61797 0.79941 12.96403 12.96477
1.61804 0.80045 20.80458 20.80601
1.61805 0.80059 33.49095 33.49346
1.61804 0.80031 54.01799 54.02225
1.61804 0.80031 87.23153 87.23866

It is worth taking some time to make sure you understand all the lines of this program. It uses assignment, built-in
functions, inline functions, simple output, arrays, and for loops. The % on a line tells Octave to ignore it and
everything on the line that follows. These tidbits are called comments. They are strictly for the human user to
document what the program does. Lengthy programs should always be documented so any user of the program will
be better able to understand what it does. Here the comments are simple, but they may be much more elaborate.
Exercises

1. Some convergent sequences and their limits are given. Find the order of convergence for each.

(a) ⟨n!/n^n⟩ → 0
(b) ⟨1/(3e^n)⟩ → 0 [S]
(c) ⟨(2^{2^n} − 2)/2^{2^n}⟩ → 1 [S]
(d) ⟨(1 − n²)/(1 + n²)⟩ → −1 [A]
(e) ⟨e^n/e^{e^n}⟩ → 0

2. Show that the sequence ⟨(n + 1)/(n − 1)⟩ converges to 1 linearly.

3. Show that the sequence p_n = 2^{1−2^n} is quadratically convergent.

4. Give an example of a sequence which converges to 0 with order α = 10.

5. Approximate the order of convergence of the sequence p_n and explain your answer.

n     |p_{n+1} − p|/|p_n − p|^{1.2}     |p_{n+1} − p|/|p_n − p|^{1.3}     |p_{n+1} − p|/|p_n − p|^{1.4}
25    9.07(10)^{−6}                     .0110                             13.39
26    1.88(10)^{−7}                     .00303                            48.65
27    1.01(10)^{−9}                     .000530                           277.8
28

6. Some linearly convergent sequences and their limits are given. Find the (fastest) rate of convergence of the form O(1/n^p) or O(1/a^n) for each. If this is not possible, suggest a reasonable rate of convergence.

(a) 6, 6/7, 6/49, 6/343, 6/2401, . . . → 0
(b) ⟨(11n − 2)/(n + 3)⟩ → 11
(c) ⟨sin(n)/√n⟩ → 0 [S]
(d) ⟨4/(10^n + 35n + 9)⟩ → 0 [S]
(e) ⟨4/(10^n − 35n − 9)⟩ → 0 [S]
(f) ⟨2n/√(n² + 3n)⟩ → 2 [A]
(g) ⟨(5^n − 2)/(5^n + 3)⟩ → 1
(h) ⟨√(n + 47) − √n⟩ → 0 [A]
(i) ⟨n²/(3n² + 1)⟩ → 1/3
(j) ⟨π/(e^n − π^n)⟩ → 0
(k) ⟨(n/2^n)²⟩ → 0 [S]
(l) ⟨(7 + cos(5n))/(n³ + 1)⟩ → 0
(m) ⟨8n²/(3n² + 12) + n/(3n + 10)⟩ → 3
(n) ⟨(2n + 3)/(2n² + 3n) − 2⟩ → −2 [A]
(o) ⟨(3n⁵ − 5n)/(1 − n⁵)⟩ → −3

7. Find the rates of convergence of the following sequences as n → ∞.

(a) lim_{n→∞} sin(1/n) = 0
(b) lim_{n→∞} sin(1/n²) = 0
(c) lim_{n→∞} (sin(1/n))² = 0
(d) lim_{n→∞} [ln(n + 1) − ln(n)] = 0

For questions on this page and on the next page, use the following definition for rate of convergence for a function. For a function f(h), we say lim_{h→a} f(h) = L with rate of convergence g(h) if |f(h) − L| ≤ λ|g(h)| for some λ > 0 and all sufficiently small |h − a|.

8. Use a Taylor polynomial to find the rate of convergence of lim_{h→0} (2 − e^h) = 1.

9. Use a Taylor polynomial to find the rate of convergence of lim_{h→0} (sin(h) − e^h + 1)/h = 0.

10. Find rates of convergence for the following functions as h → 0.

(a) lim_{h→0} sin(h)/h = 1
(b) lim_{h→0} (1 − cos(h))/h = 0
(c) lim_{h→0} (sin(h) − h cos(h))/h = 0
(d) lim_{h→0} (1 − e^h)/h = −1

11. Find the rate of convergence of lim_{h→0} (h² + cos(h) − e^h)/h = −1.

12. Show that (sin h)(1 − cos h) = 0 + O(h³).

13. Write an Octave program (.m file) that uses a loop and the disp() command to produce the following output (powers of 7). [S]

1
7
49
343
2401
16807
117649
823543
5764801
40353607

14. Write an Octave program (.m file) that uses a loop and the disp() command to output the first 10 powers of 5 starting with 5⁰.

15. Write an Octave program (.m file) that uses a loop, an array, and the disp() command to find the values of f(n) = (2^{2^n} − 2)/(2^{2^n} + 3) for n = 0, 1, 2, 4, 6, 10. [S]

16. Write an Octave program (.m file) that uses a loop, an array, and the disp() command to find the values of f(n) = 2n/√(n² + 3n) for n = 0, 2, 5, 10, 100, 1000, 20000.

17. The following Octave code is intended to calculate the sum \sum_{k=1}^{30} \frac{1}{k^2} but it does not. Find as many mistakes in the code as you can. Classify each mistake as either a compilation error (an error that will prevent the program from running at all) or a bug (an error that will not prevent the program from running, but will cause improper calculation of the sum).

sum=1;
for k=1:30
sum=sum+1.0/k*k;
end
disp(sum)

18. Some sequences do not have an order of convergence. Let p_n = 2^n/n!.

(a) Show that lim_{n→∞} p_n = 0.
(b) Show that lim_{n→∞} |p_{n+1}|/|p_n| = 0.
(c) Show that ⟨|p_{n+1}|/|p_n|^α⟩ diverges for any α > 1.

19. Use the rules of thumb for order of convergence to approximate the number of iterations it will take to achieve 12 significant digits of accuracy of π for each order of convergence. Assume each sequence starts with one significant digit of accuracy.

(a) α = 1, λ = 0.8
(b) α = 1, λ = 0.5 [S]
(c) α = 1, λ = 0.1
(d) α = 1.5 [A]
(e) α = 2
(f) α = 3

20. Prove that the order of convergence of a sequence is unique.

21. Write a for loop that outputs the sequence of numbers.

(a) 7, 8, 9, 10, 11, 12, 13, 14, 15
(b) 20, 19, 18, 17, 16, 15, 14, 13
(c) 12, 12.333, 12.667, 13, 13.333, 13.667, 14
(d) 1, 9, 25, 49, 81, 121, 169, 225, 289, 361, 441
(e) 1, .5, .25, .125, .0625, .03125, .015625

1.4 Recursive Procedures


The Mathemagician
Mathemagician: I have here an ordinary bed sheet. Nothing up my sleeves. No secret pockets. Maybe just a
touch of magic dust. But other than that, an ordinary bed sheet. When lain flat it is, of course one
layer thick. As I take these corners in my hands and place them over the opposite corners, folding the
bed sheet in half, how many layers thick does it become?

Audience: Two!

Mathemagician: Very good. Allow me to fold it in half again. Now how many layers thick has it become?

Audience: Four!

Mathemagician: Excellent. Watch very closely as I fold it for a third time. Think hard and tell me how many
layers thick is the folded sheet now.

Audience: Six! (from a few) Eight! (from more)

Mathemagician: That’s right. Eight. So much for the warm up. I shall now have my lovely assistant bring
out another perfectly ordinary bed sheet. This time already folded. Crystal! The bed sheet please ...
(Crystal brings out the bed sheet, already folded). Again, an ordinary bed sheet. This time folded.
I shall now fold it in half as I have done before and ask again, how many layers thick has the sheet
become?

Audience: (Mostly silent–just some murmurings)

Mathemagician: I see. Well, I don’t know either...

Audience: (Laughing)

Mathemagician: ...but I can tell you it is twice as many layers thick as it was before!

Audience: (Mostly silent–just a few groans)

Mathemagician: I know. I know. A cheap parlor trick. But wait! Watch as I slowly unfold the sheet, one fold at
a time. One! ... Two! (he peers toward the sky as if in thought) ... Three! ... (again seemingly deep in
thought) ... Four! ... Four times folded in half and now, as you can plainly see, the sheet is three layers
thick. The first fold was in thirds. (he peers off into space, waves his wand, stares deep into the eyes of
the audience) Forty-eight!!!

Audience: (Silent but clearly wanting of an explanation)

Mathemagician: The sheet started 3 layers thick, and was doubled in thickness four times ... 3 ... 6 ... 12 ... 24
... 48.

Though it was meant to seem like a wisecrack, the observation that folding a sheet in half doubles the number of
layers was the key to counting the layers in the folded sheet. Recursive procedures are magical in the same way.
They seem to hold nothing of value when, in fact, they hold the key. They are based on the principle that no matter
what the current state of affairs (no matter how many layers thick the sheet is), following the procedure (folding it
in half) will produce a predictable result (double the thickness).
Perhaps the simplest numerical example of this idea comes from thinking of a bag of marbles—an opaque bag
with an unknown number of marbles inside. One marble is added, and you are asked how many are inside. Of
course the best you can say is something like “one more than there were before.” Even though you do not know
how many marbles are in the bag to begin with, when one is added to the bag, you know the new total is one more
than the previous total. This is recursive thinking.

Figure 1.4.1: 2 × 3 and 6 × 9 grids can be tiled with trominos.

Figure 1.4.2: A 2^n × 2^n grid can be (almost) tiled recursively.

Trominos
Connect three squares edge-to-edge in the shape of an L, and you have a tromino. Trominos aren’t used in games
like dominoes are, but are often used in interesting mathematical questions involving tiling. Tiling with trominos
means covering without overlapping trominos and without having any parts of trominos lying outside the shape
being tiled. For example, a 2 × 3 grid can be tiled with trominos as can a 6 × 9 grid. See Figure 1.4.1. If n is a
positive integer, then a 2^n × 2^n grid can almost be tiled with trominos. All but one square can be covered. Try it,
first with a 2 × 2 grid. That one’s not too hard. Then try it with a 4 × 4 grid or an 8 × 8 grid.
How about a 1024 × 1024 grid? I can’t recommend that you actually get yourself a 1024 × 1024 grid of squares
and start filling in with trominos. It would take 349,525 trominos. You may not finish in your lifetime! Instead,
it is time to start thinking recursively. Use the previous result in your answer. The same way you can just say
the marble bag “has one more than before”, we can phrase the solution to tiling the 1024 × 1024 grid in terms of
the tiling of the 512 × 512 grid. Here’s how it goes. Take a 1024 × 1024 grid and section it off into four 512 × 512
subgrids by dividing it down the middle both horizontally and vertically. In the upper left 512 × 512 grid, tile all
but the bottom right corner. In the lower left 512 × 512 grid, tile all but the upper right corner. In the lower right
grid, tile all but the upper left corner. Finally, in the upper right 512 × 512 grid, tile all but the upper right corner
(Figure 1.4.2). This leaves room for a single L-shaped tromino in the middle, and one square left over. That’s it!
It should feel a little bit like cheating since we didn’t specify how to deal with the 512 × 512 grid, but the same
argument applies to the 512 × 512 grid. You can section it off into four subgrids, tile those and be done.
The same tiling argument can be made for any 2^n × 2^n grid based on the 2^(n−1) × 2^(n−1) tiling, except when n = 1.

Figure 1.4.3: The 32 × 32 grid recursively tiled.

You just have to tile the 2 × 2 grid yourself! But once that's done, you have a complete solution for any 2^n × 2^n
grid. A similar exception applies to every recursive procedure. The recursion is only good most of the time. At
some point, you have to get your hands dirty and supply a solution or answer. Such an answer is often called an
initial condition.

Crumpet 9: Proof by induction

Proof by induction also uses a sort of recursive thinking. In the method, one must prove that a claim is true for
some value of the variable. This part is analogous to having an initial condition. Then one must prove that the
truth of the claim for the value n implies the truth of the claim for n + 1. This is analogous to the recursive
relationship between states. In fact, the construction of a tiling for the 2^n × 2^n grid based on the 2^(n−1) × 2^(n−1)
grid plus the tiling of the 2 × 2 grid just presented essentially form a proof by induction that the 2^n × 2^n grid,
save one corner, can be tiled by trominos for any n ≥ 1. In this way, all proofs by induction boil down to the
ability to see the recursive relationship between states.
In 1954, Solomon Golomb published a proof by induction that the 2^n × 2^n grid minus any single square (not
necessarily a corner), called a deficient square, can be tiled by trominos. Can you construct a (recursive) tiling
of a 2^n × 2^n deficient square? You may use the tiling of a 2^k × 2^k grid minus one corner in your construction.

Reference [12]

Octave
Custom functions
As any modern useful programming language does, Octave allows custom functions beyond those that can be written
as a single inline formula. Let’s say you are interested in the maximum value a function takes over an evenly
spaced set of values. That function has a very special purpose and is not commonly used. Consequently, it is not
built into any programming language, so if you really want a function that does that, it is your job to write it.
Similarly, if you want a function that calculates the symmedian point of a triangle, you need to write it. In fact,
most anything computational beyond evaluating basic functions will not be built into Octave.
Custom functions are written around three basic pieces of information: a name for the function, a list of inputs,
and a description of the output. These three things should be well defined before the work of writing the function

begins. Actually writing the function involves simply telling Octave the desired name, inputs, and how to determine
the output. The basic format for a function is this:

function ans = myName(input1, input2, ... )


...
ans = final answer;
end%function

The first line holds the name of the function and a list of inputs. The rest of the function is dedicated to computing
the output, ans.
The function that determines the maximum value of a function over an evenly spaced set of values might be
written following these steps. First, we decide to name it “maxOverMesh”. Notice there are no spaces and no special
characters in the name. There’s a very limited supply of non-alphabetic characters that can go into the name of a
function. It’s usually safe to assume an underscore and numbers are acceptable, but you can’t count on anything
else! It’s best to keep it at that. Second, we need to think about what inputs are necessary for this function. Of
course, the function to maximize is required, and somehow the mesh of points where it should be checked needs to
be specified. There are multiple ways to do this, but perhaps the one that is easiest for the user is to require the
lower end point, upper end point, and number of intervals in the mesh. Finally, we need to write some code that
will take those inputs and determine the maximum value of the function over the mesh. One way to do it is this:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% maxOverMesh() written by Leon Q. Brin 21 January 2013 %
% INPUT: Interval [a,b]; function f; and number of %
% subintervals n. %
% OUTPUT: maximum value of the function over the end %
% points of the subintervals. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = maxOverMesh(f,a,b,n)
ans = f(a);
for i=1:n
x = (i*b + (n-i)*a)/n;
F = f(x);
if (F>ans)
ans = F;
end%if
end%for
end%function

It is good practice to preface each function you write with a comment containing a three-point description of the
function—the name, inputs, and output. If you or anyone else looks at it later, you will have a quick summary of
how to use the function and for what.
Whatever value is last assigned to ans when the function completes will be the output of the function. The
function starts by assigning the value of the function at the left end point to ans. Then it loops through the rest of
the subinterval end points, calculating the value of the function at each one. Each time it finds a value higher than
ans, it (re-)assigns ans to that value. At the end of the loop, the greatest value of the function has been assigned
to ans.
To use a custom function, save it in a .m file with the same name as that of the function. For example, the
maxOverMesh() function would be saved in a file named maxOverMesh.m. Then your custom function can be called
just as any built-in Octave function as long as the .m file is saved in the same directory in which the program using
it is saved. Or, if using it from the command line, the working directory of Octave (the one from which Octave was
started, unless explicitly changed during your session) must be the directory in which the .m file is saved:

octave:1> maxOverMesh(inline(’(x^2-6*x+8)*exp(x)’), 0, 4, 99)


ans = 8.6728
octave:2> f = inline(’(x^2+3*x-5)/(x^2-3*x+5)’)
f = f(x) = (x^2+3*x-5)/(x^2-3*x+5)
octave:3> maxOverMesh(f, -5, 5, 225)
ans = 2.6362

maxOverMesh.m may be downloaded at the companion website.



Recursive functions
Thinking recursively, what would you say if I asked you what 10! was? Think about it for a moment before reading
on. That’s right! 10 factorial is just 10 times 9!:

10! = 10 · 9 · 8 · 7 · 6 · 5 · 4 · 3 · 2 · 1
= 10 · (9 · 8 · 7 · 6 · 5 · 4 · 3 · 2 · 1)
= 10 · (9!).

No need to come up with a number. Just a recursive idea, because of course the idea works just as well for 9!, and
so on . . . up to (or should I say down to?) a point. At what point is it no longer true that n! = n · (n − 1)!? When
n = 0. We need to specify that 0! = 1 and not rely on recursive thinking in this case. But only this case!
Let’s see how this recursive calculation works for 5!. According to the recursion, 5! = 5 · 4!. But 4! = 4 · 3!
so we have 5! = 5 · (4 · 3!). But 3! = 3 · 2! so we now have 5! = 5(4(3 · 2!)). Continuing, 2! = 2 · 1! = 2 · 1 · 0!
so we now have 5! = 5(4(3(2(1 · 0!)))). And now the recursion stops and we simply plug in 1 for 0! to find out
that 5! = 5(4(3(2(1(1))))). Maybe you were expecting 5 · 4 · 3 · 2 · 1 for a final result instead. Of course you get
120 either way, so from the standpoint of getting things right, either way is fine. Pragmatically, the point is moot.
Computing factorials recursively is dreadfully inefficient and impossible beyond the maximum depth of recursion
for the programming language in use, so should never be used in practice anyway. Its only value is as an exercise
in recursive thinking and programming.
Generally, a recursive function will look like this:
function ans = recFunction(input1, input2, ... )
if (recursion does not apply)
return appropriate ans
else
return recFunction(i1, i2, ... )
end%if
end%function
Determining whether the recursion applies is the first item of business. If not, an appropriate output must be
supplied. Otherwise, the recursive function simply calls itself with modified inputs. Since the recursive (wise-guy)
definition of n! is n · (n − 1)! and applies whenever n > 0, and 0! = 1, the recursive factorial function might look
like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% recFactorial() written by Leon Q. Brin 21 January 2013 %
% is a recursively defined factorial function. %
% INPUT: nonnegative integer n. %
% OUTPUT: n! %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = recFactorial(n)
if (n==0)
ans = 1;
else
ans = n*recFactorial(n-1);
end%if
end%function
Note the == when checking if n equals 0. This is not a typographical error. This is very important. All programming
languages must distinguish between assignments and conditions. On paper, it may seem natural to write x = 3
when you want to set x equal to 3. It may also seem natural to write “if x = 3, everything is good.” We use
the “equation” x = 3 exactly the same way on paper to mean two very different things. When we set x = 3 we
are making a statement, or assignment of the value 3 to the variable x. But when we write “if x = 3 . . .” we are
making a hypothetical statement, or a conditional statement. The value of x is unknown. In Octave the distinction
is made by using a single equals sign, =, to mean assignment and two equals signs, ==, to mean conditional equals.
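As a quick illustration of the difference (a standalone snippet, not part of recFactorial.m), consider the following lines. The first line is an assignment; the expression inside the if statement is a condition that evaluates to true or false.

x = 3;           % assignment: the variable x now holds the value 3
if (x == 3)      % condition: does x equal 3? evaluates to true here
  disp('x equals 3')
else
  disp('x does not equal 3')
end%if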
recFactorial.m may be downloaded at the companion website.

Exercises

1. Write a .m file with a function that takes one input, squares it, and returns the result. Your file should
   (a) contain a comment block at the beginning containing your name, the date, and an explanation of what the program does and how to use it.
   (b) have a function of the form foo(x) in it that returns the square of its input (argument) x.
   Make sure to test your function from the Octave command prompt.

2. The Octave function foo(x) is shown below.

   function res = foo(x)
     if (x<1)
       res = 0;
     else
       half = x/2;
       floorhalf = floor(half);
       if (half == floorhalf)
         res = 0 + foo(floorhalf);
       else
         res = 1 + foo(floorhalf);
       endif
     endif
   endfunction

   (a) Find foo(2).
   (b) Find foo(23). [A]

3. Write a recursive Octave function that will calculate the sum Σ_{i=1}^{n} 1/i.

4. Write a recursive Octave function that calculates a_n for any n ≥ 0 given
   a_0 = 100,000
   a_n = 1.05a_{n−1} − 1200, n > 0.

5. The Fibonacci sequence, ⟨F_n⟩, is recursively defined by
   F_{n+1} = F_n + F_{n−1}, n ≥ 1
   F_0 = 1
   F_1 = 1
   so the first few terms are 1, 1, 2, 3, 5, 8.
   (a) Write a recursive function that calculates the nth Fibonacci number. Your function should have one argument, n.
   (b) Write a function that uses a for loop to calculate the nth Fibonacci number. Your function should have one argument, n.
   (c) Write a program that calls the function from 5a to calculate F30.
   (d) Write a program that calls the function from 5b to calculate F30.
   (e) Which code is simpler (recursive or nonrecursive)?
   (f) Which code is faster?
   (g) Which code is more accurate?
   NOTE: F30 = 1346269.

6. Let the sequence ⟨a_n⟩ be defined by
   a_{n+1} = (1/4)(5a_n^2 − 30a_n + 25), n ≥ 1
   a_0 = (17 + 2√41)/5.
   (a) Calculate a1, a2 and a3 exactly.
   (b) Find a20 and a51 exactly.
   (c) Write a recursive function that calculates the nth term of the sequence. Your function should have one argument, n. Write a program that calls this function to calculate a1, a2, a3, a20, and a51.
   (d) Write a function that uses a for loop to calculate the nth term of the sequence. Your function should have one argument, n. Write a program that calls this function to calculate a1, a2, a3, a20, and a51.
   (e) Which code is simpler (recursive or nonrecursive)?
   (f) Which function is faster?
   (g) Which code is more accurate, and why?
   (h) Which function is better, and why?
   (i) Do you trust either function to calculate a600 accurately? If not, why not?

7. Trominos, part 1.
   (a) Recursively speaking, how many trominos are needed to tile a 2^n × 2^n grid, save one corner?
   (b) What is the greatest (integer) value of n for which the recursive definition does not apply?
   (c) For the value of n of part 7b, how many trominos are needed? [S]

8. Trominos, part 2.
   (a) Write a recursive Octave function for calculating the number of trominos needed to tile a 2^n × 2^n grid, save one corner.
   (b) Use your function to verify that 349,525 trominos are needed to tile a 1024 × 1024 grid, save one corner.

9. The Tower of Hanoi, part 1. The Tower of Hanoi is a game played with a number of different sized disks stacked on a pole in decreasing size, the largest on the bottom and the smallest on top. There are two other poles, initially with no disks on them. The goal is to move the entire stack of disks to one of the initially empty poles following two rules. You are allowed to move only one disk at a time from one pole to another. You may never place a disk upon a smaller one. [S]
   (a) Starting with a stack of three disks, what is the minimum number of moves it takes to complete the game? Answer this question with a number.
   (b) Starting with a stack of four disks, what is the minimum number of moves it takes to complete the game?
       i. Answer this question recursively.
       ii. Answer this question with a number based on your recursive answer. [S]

10. The Tower of Hanoi, part 2.
    (a) Starting with a "stack" of one disk, what is the minimum number of moves it takes to complete the game?
    (b) Use your answer to (a) plus a generalization of your answer to question 9(b)i to write a recursive Octave function for calculating the minimum number of moves it takes to complete the game with a stack of n disks.
    (c) Use your Octave function to verify that it takes a minimum of 1023 moves to complete the game with a stack of 10 disks.

11. The Tower of Hanoi, part 3. The Tower of Hanoi with adjacency requirement. Suppose the rules of The Tower of Hanoi are modified so that each disk may only be moved to an adjacent pole, and the goal is to move the entire stack from the left-most pole to the right-most pole.
    (a) What is the minimum number of moves it takes to complete the game with a "stack" of one disk?
    (b) Find a recursive formula for the minimum number of moves it takes to complete the game with a stack of n disks, n > 1.
    (c) Write a recursive Octave function for the minimum number of moves to complete the game with a stack of n disks.
    (d) Use your Octave function to compute the minimum number of moves it takes to complete the game with a stack of 5 disks. 10 disks.

12. Stirling numbers of the second kind, part 1. Let S(n, k) be the number of ways to partition a set of n elements into k nonempty subsets. A partition of a set A is a collection of subsets of A such that each element of the set A must be an element of exactly one of the subsets. The order of the subsets is irrelevant as the partition is a collection (a set of sets). For example, the partition {{1}, {2, 3}, {4}} is a partition of {1, 2, 3, 4}. {{4}, {1}, {2, 3}} is the same partition of {1, 2, 3, 4}.
    (a) Find S(10, 1). [S]
    (b) Find S(3, 2).
    (c) Find S(4, 3).
    (d) Find S(4, 2). [S]
    (e) Find S(8, 8). [S]

13. Stirling numbers of the second kind, part 2.
    (a) Find S(n, 1).
    (b) Find S(n, n).

14. Stirling numbers of the second kind, part 3. Let A = {1, 2, 3, . . . , n}. [A]
    (a) How many partitions of A into k nonempty subsets include the subset {n}? Give an answer in terms of Stirling numbers of the second kind.
    (b) How many partitions of A into k nonempty subsets do not include the subset {n}? Give an answer in terms of Stirling numbers of the second kind. Hint, consider partitions of B = {1, 2, 3, . . . , n − 1} into k nonempty subsets.

15. Stirling numbers of the second kind, part 4.
    (a) Use your answers to questions 13 and 14 to derive a recursive formula with initial conditions for the number of ways a set of n elements can be partitioned into k subsets.
    (b) Write a recursive Octave function that calculates Stirling numbers of the second kind.
    (c) Use your Octave function to verify that S(10, 4) = 34105.

16. A set of blocks contains some that are 1 inch high and some that are 2 inches high. How many ways are there to make a stack of blocks 15 inches high? [S]

17. A male bee (drone) has only one parent since drones are the unfertilized offspring of a queen bee. A female bee (queen) has two parents. Therefore, 0 generations back, a male bee has one ancestor (the bee himself). 1 generation back, the bee also has 1 ancestor (the bee's mother). 2 generations back, the bee has 2 ancestors (the mother's two parents). How many direct ancestors does a male bee have n generations back?

18. Argue that any polygon can be triangulated (covered with non-overlapping triangles). An example of a triangulation of a dodecagon follows.
    [Figure: a triangulated dodecagon.]

19. In questions 5 and 6, you should have noticed that the recursive functions were slower than their for loop counterparts. How many times slower? Why is the Fibonacci recursion so many more times slower than its for loop counterpart?

20. Let the sequences ⟨b_n⟩ and ⟨c_n⟩ be defined as follows.
    b_0 = 1/3; b_{n+1} = 4b_n − 1, n ≥ 0
    c_0 = 1/10; c_{n+1} = 4c_n(1 − c_n), n ≥ 0
    (a) Write a function that uses a for loop to calculate the nth term of ⟨b_n⟩. Your function should have one argument, n.
    (b) Write a function that uses a for loop to calculate the nth term of ⟨c_n⟩. Your function should have one argument, n.
    (c) Write a program that calls these functions to calculate b30 and c30. How accurate are these calculations? HINT b30 = 1/3 and c30 = .32034 accurate to 5 decimal places.
    (d) Can you think of a way to make these calculations more dependable (more accurate)?
Chapter 2
Root Finding

2.1 Bisection
In Section 1.2 (page 12), we claimed that “T2 (x) actually approximates ln(x) to within 0.1 over the interval
[3.296, 13.13]”, with a promise that we would discuss the calculation later. It is now later. First, we rephrase
the claim as “the distance between T2 (x) and ln(x) is less than or equal to 0.1 for all x ∈ [3.296, 13.13].” In other
words,
|T2(x) − ln(x)| < 1/10 for all x ∈ [3.296, 13.13].

One way to begin solving this inequality is to consider the pair of equations T2(x) − ln(x) = ±1/10. With a focus on
solving

T2(x) − ln(x) = 1/10,     (2.1.1)

recall that T2(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4). We are thus looking to solve the equation

2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) = 1/10.
Finally, having written the equation in full detail, it should come as no surprise that we will not be solving for
x exactly. There is no analytic method for solving such an equation. Generally, equations with both polynomial
terms and transcendental terms will not be solvable. However, from the graph in Figure 1.2.2, we can get a first
approximation of the solution. We are looking for the place where T2 (x) exceeds ln(x) by 0.1. Since the two
graphs essentially overlap at x = 6, we might aver that T2 (6) exceeds ln(x) by less than 0.1 there. Since there is a
reasonably large gap between the graphs at x = 2, we might also aver that T2 (2) exceeds ln(x) by more than 0.1
there. In other words, T2(2) − ln(2) > 1/10 while T2(6) − ln(6) < 1/10. Since T2(x) − ln(x) is continuous on the interval
[2, 6], the Intermediate Value theorem guarantees there is a value c ∈ (2, 6) such that T2(c) − ln(c) = 1/10. It is this
value of c we are after. And we know it is between 2 and 6. It’s a start, but we can do better!
What about 4? Well, T2 (4) − ln(4) ≈ .04986 < 0.1, so now we know T2 (4) exceeds ln(4) by less than 0.1. Now
the Intermediate Value theorem tells us that c is between 2 and 4 (T2 (2) exceeds ln(x) by more than 0.1). Shall we
check on x = 3? Yes. T2 (3) − ln(3) ≈ .131 > 0.1, so now we know T2 (3) exceeds ln(3) by more than 0.1. Recapping,
T2(4) − ln(4) < 0.1 while T2(3) − ln(3) > 0.1. By the Intermediate Value theorem again, we know c is between 3 and
4. And we may continue the process, limited only by our patience. This is the process we call the bisection method:

1. Identify an interval [a, b] such that either a or b overshoots the mark while the other undershoots it.

2. Calculate the midpoint, m, of the identified interval.

3. If a and m both overshoot or both undershoot the mark, the desired value lies in [m, b].

4. If b and m both overshoot or both undershoot the mark, the desired value lies in [a, m].

5. Return to step 2 using the newly identified interval.


Figure 2.1.1: + indicates T2(x) − ln(x) > 1/10 and − indicates T2(x) − ln(x) < 1/10. [Number line marking the points 2, 3, 3.25, 3.5, 4, and 6; the midpoints are labeled m1 = 4, m2 = 3, m3 = 3.5, and m4 = 3.25.]

Using a + sign for values of x for which T2(x) − ln(x) overshoots the desired value 1/10 and a − sign for values of x
for which T2(x) − ln(x) undershoots the desired value 1/10, we may diagram this procedure, including the next two
iterations, as in Figure 2.1.1. We might also reproduce the calculations in a table:

a m b T2 (a) − ln(a) T2 (m) − ln(m) T2 (b) − ln(b)


2 4 6 .3116 .04986 .002582
2 3 4 .3116 0.131 .04986
3 3.5 4 0.131 0.0824 .04986
3 3.25 3.5

No matter how the procedure is understood, the sequence of approximations

4, 3, 3.5, 3.25, . . .

is produced. What is the next value? Answer on page 45.


Not only do we have a sequence of numbers approaching the solution, we know for certain that 4 is accurate to
within 2 units of the exact value. 3 is accurate to within 1 unit. 3.5 is accurate to within 0.5 units. And 3.25 is
accurate to within 0.25 units. In general, each approximation is accurate to within half the length of the interval
from which it was computed as midpoint. After all, the exact value is guaranteed to lie within the interval. The
farthest the midpoint can possibly be from the exact value is half the length of the interval.
Though the method works perfectly well as described, normally the equation to be solved is simplified so that
one side is zero. In that way, the other side can be thought of as a function whose roots are desired. Plus, it
simplifies the implementation of the method slightly. For example, we would consider solving the equation
T2(x) − ln(x) − 1/10 = 0

instead of 2.1.1. Then the procedure boils down to finding a root of f(x) = T2(x) − ln(x) − 1/10. This is why this
method is called a root-finding method. It is used to find zeros, or roots, of functions. In this light, we might
summarize the first 8 iterations of this procedure as follows:

a m b f (a) f (m) f (b)


2 4 6 >0 <0 <0
2 3 4 >0 >0 <0
3 3.5 4 >0 <0 <0
3 3.25 3.5 >0 >0 <0
3.25 3.375 3.5 >0 <0 <0
3.25 3.3125 3.375 >0 <0 <0
3.25 3.28125 3.3125 >0 >0 <0
3.28125 3.296875 3.3125

Notice two things. The actual values of f (a), f (m), and f (b) are not needed. Only their sign is important because
all we need to do is maintain one endpoint where the function is greater than 0 (overshoots) and one where the
function is less than 0 (undershoots). Furthermore, the f (a) and f (b) columns are not strictly necessary either. If
the procedure is carried out faithfully, they will never change sign. In fact, that’s what it means to carry out the
procedure faithfully! In steps 3 and 4, you choose which subinterval to keep by maintaining opposite signs of the
function on opposite endpoints.
As the last line indicates, the desired value is approximately 3.296 as promised. The other value, 13.13, is
determined by finding a root of the function g(x) = T2(x) − ln(x) + 1/10. Give it a shot! Start with a = 10 and
b = 14, for example. Solution on page 45.
Though it works, the only real point of carrying out the procedure using a table is to make sure you understand
exactly how it works. If we were actually to use the method in practice, we would write a short computer program

instead. Computers are very good at repetitious calculations, something at which humans are not particularly
adept. In this procedure, we need to calculate a midpoint, decide whether this midpoint should then become the
left or right endpoint, make it so, and repeat.
That leaves only one question—how many repetitions, or iterations, should we compute? And that depends on
the user. Perhaps an answer to within 10−2 of the exact value will suffice, and maybe only 10−6 accuracy will do.
The program we write should be flexible enough to calculate the answer to whatever accuracy is desired, within
reason. With that in mind, here is some pseudo-code for the bisection method.

The Bisection Method (pseudo-code)


Though technically not necessary for coding, when we can, we will preface each method’s pseudo-code with math-
ematical assumptions that guarantee success. The implication is that if the method is run in a situation where the
assumptions are not met, then the method should not be expected to provide dependable results. It may or may
not give useful information. The old adage “garbage in...garbage out” applies!

Assumptions: f is continuous on [a, b]. f (a) and f (b) have opposite signs.
Input: Interval [a, b]; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: Set err = |b − a|; L = f (a);
Step 2: For j = 1 . . . N do Steps 3-5:
Step 3: Set m = (a + b)/2; M = f (m); err = err/2;
Step 4: If M = 0 or err ≤ tol then return m;
Step 5: If LM < 0 then set b = m; else set a = m and L = M ;
Step 6: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation m within tol of exact root, or message of failure.

As noted earlier, this method should calculate a midpoint (Step 3), decide whether this midpoint should then
become the left or right endpoint (Step 5), make it so (Step 5), and repeat some number of times (Steps 1, 2, and 4).
Much of the code is dedicated to determining when to stop. This is typical of numerical methods. The calculations
are half the battle. Controlling the calculations is the other half. If we didn’t have to worry about stopping, the
pseudo-code might look something like this:

Step 1: Set L = f (a);


Step 2: Set m = (a + b)/2; M = f (m);
Step 3: If LM < 0 then set b = m; else set a = m and L = M ;
Step 4: Go to Step 2.

There would be no need for j, err, tol, or N , making the algorithm quite a bit simpler. Of course, programmed
this way, the program would never stop, so j, err, tol, and N , are indeed necessary. Nonetheless, this pseudo-code
without the ability to stop is important. It can be thought of as the guts of the program. This is the code that
executes the method. Sometimes it is easiest to start with the guts and then add the controls afterward.
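For instance, a bare-bones Octave translation of the guts might look like the following sketch. It assumes f, a, and b have already been defined, and it simply repeats an arbitrary 50 times since the stopping logic has not been added yet.

L = f(a);
for j=1:50          % arbitrary repetition count; no stopping logic yet
  m = (a+b)/2;      % compute the midpoint
  M = f(m);
  if (L*M<0)        % keep the half interval where the sign change lives
    b = m;
  else
    a = m;
    L = M;
  end%if
end%for
disp(m)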
As for determining whether the midpoint should become the left or right endpoint, Step 5 (Step 3 of the
guts) uses a somewhat slick method. By slick, I mean short, efficient, and not immediately obvious. The sign of
LM = f (a) · f (m) is checked. If it is negative (LM < 0) then m should become the right endpoint (should replace
b) because this means f (a) and f (m) have opposite signs. That’s the only way LM can be negative. On the other
hand, if LM > 0 then we know f (a) and f (m) have the same sign, so m should become the left endpoint (should
replace a). In Step 3 the midpoint is calculated without any fanfare.
The rest of the code is there to make sure the program doesn’t do more than necessary and doesn’t end up
spinning its wheels indefinitely. It is important to be able to separate, at least in your mind, the guts of the program
from the stopping logic. As for the stopping logic, in Step 4, we stop if err ≤ tol as we should. But we also check
the unlikely event that M = 0 in which case we happened to hit the root exactly so should quit. Though it could
be argued overkill to set a maximum number of iterations, N , in this program, it’s a good habit to get into. Some
numerical methods provide no guarantee the required tolerance will ever be reached. For these methods, a fallback
exit criterion is needed. Also, if tol were accidentally set to a negative value, it would certainly never be reached.
The algorithm would have no way to stop without N .

Analysis of the bisection method


There are two good reasons to study the bisection method. First, its assumptions for guaranteed success are much
simpler to verify than those of other methods. Even so, be somewhat cautious. Faithful execution of any numerical
method is subject to proper programming, accurate computation, and proper input. Programmers and users are
not infallible. Nor are computers. Remember the lessons of Section 1.1. At the same time you should be wary of
the results, you should temper your skepticism with a good dose of confidence in the method. It is only in rare
circumstances that the computer will be the source of any problems.
Second, error analysis is straightforward. Let m1 = (a + b)/2, the midpoint of [a, b]. Let succeeding midpoints be
m2, m3, m4, and so on. Then the Intermediate Value theorem guarantees |mj − p| ≤ (b − a)/2^j for some root p of
f(x). As we learned in section 1.3, this means the sequence ⟨mn⟩ converges to p with linear order, and rate of
convergence O(1/2^n). This method should be considered slow to converge because it does so with linear order. But
among those methods with linear order, it should be considered fast. The error decays exponentially—faster than
any polynomial decay.
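The error bound also tells us, before running the method, how many iterations are needed: (b − a)/2^j ≤ tol as soon as j ≥ (ln(b − a) − ln(tol))/ln 2. A short Octave calculation of this bound (the interval and tolerance here are just sample values):

a = 2; b = 6; tol = 10^-4;                  % sample interval and tolerance
N = ceil((log(b-a) - log(tol))/log(2));     % smallest j with (b-a)/2^j <= tol
disp(N)                                     % displays 16 for these values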

Key Concepts
The Intermediate Value Theorem: Suppose f is a continuous function on [a, b] and y is between f (a) and f (b).
Then there is a number c between a and b such that f (c) = y. (The word “between” in this theorem can be interpreted as inclusive or exclusive of the endpoint values as long as the same interpretation is made for each instance of the word.)

Iteration: (1) Repeating a computation or other process, using the output of one computation as the input of the
next.

Iteration: (2) Any of the intermediate results of an iteration. Also called an iterate.

The bisection method: Produces a sequence of approximations ⟨mj⟩ that converges to some root in [a, b].

Error bound for the bisection method: The error of approximation mj is no more than (b − a)/2^j. That is, |mj − p| ≤ (b − a)/2^j for some root p of f(x).

Convergence for the bisection method: The bisection method converges with linear order and has rate of
convergence O(1/2^n).


Octave
Roughly half the work in writing pseudo-code for the bisection method was dedicated to the logic of the method—
the determination of when to stop. In programming, this type of logic is handled by if then [else] statements,
and variations thereof. It is common practice in programming to use square brackets to denote something that is
optional. So the template if then [else] should be read to mean that logic is handled by if then statements or
if then else statements. The exact syntax looks like this:

if (condition)
execute code here
[else
execute code here]
end%if

Again, the square brackets indicate optional code.


The if then statement works as you might imagine. In the if then form of the statement, all code between
then and end is executed whenever the condition is true. It is skipped whenever the condition is false. The if
then else form of the statement is similar. All code between then and else is executed whenever the condition
is true. The code between else and end is skipped in this case. Exactly the reverse happens when the condition is
false. The code between then and else is skipped while the code between else and end is executed. The simplest
use of an if then else statement might look like this.

if (n>10)
disp(’n is big’)
else

disp(’n is small’)
end%if

In Octave, if then [else] statements are written almost exactly as they are in pseudo-code. In fact, much of
the pseudo-code in this text will translate nearly verbatim into Octave. One notable exception is the symbol used
in the condition. Octave requires a boolean operator in the condition. That is, an operator that will evaluate to
either true or false. The = operator assigns a value to a variable. It is not a boolean operator so should not be
used in an if condition. Instead, == (two equals signs) should be used. This table summarizes the six most common
boolean operators in Octave.

Comparison Operator
greater than, less than >, <
greater than or equal, less than or equal >=, <=
equal ==
not equal !=

If you needed to check if x ≥ 0, you would use if (x>=0) in Octave. If you needed to check if t equaled 1, you
would use if (t==1) in Octave. And so on. Logical operators are often needed as well.

Logical Operator Octave Code


and &&
or ||

For example, if you need to check whether x is between a and b, as in a ≤ x < b, a logical operator is needed. In
this case, we need logical “and” since a ≤ x < b means a ≤ x and x < b. The Octave code would be if (a<=x &&
x<b) or something logically equivalent.
Technically, an if then statement is concluded with an end statement. However, to emphasize the type of
statement being ended, we will make a habit of ending an if then statement with end%if and ending a for loop
with end%for. The %if and %for are just comments since they start with %. Consequently, they are not strictly
necessary, but they may aid in the readability of your code, especially when you have nested constructs. When you
have an if statement inside a for loop or vice versa, using end to end both of them is not as informative as using
end%if and end%for.
An Octave program to find a root of f(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) − 1/10 between 2 and 6 to within 10^-4
using the bisection method with a maximum of 100 iterations might look like this.

f = inline(’2+(x-exp(2))/exp(2)-(x-exp(2))^2/(2*exp(4))-log(x)-1/10’);
a=2;
b=6;
err=b-a;
L=f(a);
for i=1:100
m=(a+b)/2;
M=f(m);
err=err/2;
if (M==0 || err<=10^-4)
disp(m);
return;
end%if
if (L*M<0)
b=m;
else
a=m;
L=M;
end%if
end%for
disp(’Method failed. Maximum iterations exceeded.’);

This code would produce the correct result, 3.2952. Compare this code to the pseudo-code. You will see the main
difference is syntax. However, there is one major disadvantage to writing the code this way. In order to change the

function, the endpoints, the tolerance, or the maximum number of iterations, the code needs to be modified in just
the right place. That is no real disadvantage if you never need to run the bisection method again. But, generally,
we should imagine that we will be running the methods we write many times over with different inputs. Or that we
will be handing our code over to someone else to run many times over with different inputs. Imagine me handing
you this code and asking you to find a root of f (x) = cos x − x between 0 and 3 to within 10−6 . It is not good
practice to hard code the inputs to a method. Instead, they should be given as inputs to a programmed function. In
Octave, this is done in a .m file. That doesn’t mean that we will simply take the code as written and save it in a .m
file. The .m file will assume that the inputs—interval [a, b]; function f ; desired accuracy tol; maximum number of
iterations N —will be supplied from another source—the user. The code inside the .m file should execute properly
regardless of the (yet unknown) inputs. The syntax for an Octave function is:
function result=name(input1,input2,...)
execute these lines
end%function
function is a keyword that tells Octave a function is to be defined. result is the name of the variable that holds
the answer, or result, of the function. name is the name of the function. It must also be the name of the .m file. A
completed bisection.m file might look like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bisection method written by Leon Q. Brin 09 July 2012 %
% Purpose: Implementation of the bisection method %
% INPUT: Interval [a,b]; function f; tolerance tol; and %
% maximum number of iterations maxits. %
% OUTPUT: root res to within tol of exact or message of %
% failure. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res=bisection(a,b,f,tol,maxits)
err=b-a;
L=f(a);
for i=1:maxits
m=(a+b)/2;
M=f(m);
err=err/2;
if (M==0 || err<=tol)
res=m;
return;
end%if
if (L*M<0)
b=m;
else
a=m;
L=M;
end%if
end%for
res=’Method failed. Maximum iterations exceeded.’;
end%function
Writing this way has not only the advantage of being easily reusable. It is also simpler! No need to worry about what
function the root of which is desired; or over what interval; and so on. And it more closely resembles the pseudo-code.
Once written and properly functioning, it can be saved as a .m file and never be looked at again (except for study).
It just works! If you hand it off to someone to use, they should be able to use it without modification. bisection.m
may be downloaded at the companion website. Now finding a root of f(x) = 2 + (x − e^2)/e^2 − (x − e^2)^2/(2e^4) − ln(x) − 1/10 between
2 and 6 to within 10^-4 using the bisection method with a maximum of 100 iterations might look like this.

octave:9>
f = inline(’2+(x-exp(2))/exp(2)-(x-exp(2))^2/(2*exp(4))-log(x)-1/10’);
octave:10> bisection(2,6,f,10^-4,100)
ans = 3.2952

After bisection.m is written, the bisection() function becomes part of the Octave language. It can be called just
like any built-in function. As a second example, we can find a root of f (x) = cos x − x between 0 and 3 like so:
octave:12> bisection(0,3,inline(’cos(x)-x’),10^-5,100)
ans = 0.73909

Exercises

1. Write an Octave function implementing the bisection method as shown on the facing page. Save it as a .m file for future use.

2. Use the Intermediate Value Theorem to show that the function has a root in the indicated interval.
   (a) f(x) = 3 − x − sin x; [2, 3]
   (b) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 1]
   (c) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 0.9] [S]
   (d) h(x) = 10 − cosh(x); [−3, −2]
   (e) f(t) = (4 + 5 sin t)/2 − 2.5; [−6, −5] [S]
   (f) g(t) = (3t tan t)/(1 − t^2); [21.5, 22.5]
   (g) h(t) = ln(3 sin t) − 3t/5; [1, 2]
   (h) f(r) = e^(sin r) − r; [−20, 20]
   (i) g(r) = sin(e^r) + r; [−3, 3]
   (j) h(r) = 2^(sin r) − 3^(cos r); [1, 3]

3. Create a table showing three iterations of the bisection method with the function and starting interval indicated in question 2. [S]

4. Use your bisection.m code to find a root of the function in the interval of question 2 to within 10^-8. [A]

5. Use the bisection method to find m3 for the given function on the given interval. Do this without a computer program. Just use a pencil, paper, and a calculator. You may check your answers with a computer program if you wish. [A]
   (a) f(x) = x − cos x on [0, 1]
   (b) f(x) = 3(x + 1)(x − 1/2)(x − 1) on [−1.25, 2.5]

6. Use the Bisection Method to find m4 for g(x) = x sin x + 1 on [9, 10].

7. Use the bisection method to find m3 for the equation x cos x − ln x = 0 on the interval [7, 8].

8. Use the bisection method to find a root of g(x) = sin x − x^2 between 0 and 1 with absolute error no more than 1/4.

9. Approximate the root of g(x) = 2 + x − e^x between 1 and 2 to within 0.05 of the exact value using the bisection method.

10. There are 21 roots of the function f(x) = cos(x) on the interval [0, 65]. To which root will the bisection method converge, starting with a = 0 and b = 65? [A]

11. Find a bound on the number of iterations needed to achieve an approximation with accuracy 10^-3 to the solution of x^3 + x − 4 = 0 on the interval [1, 4] using the bisection method. Do not actually compute the approximation. Just find the bound. [S]

12. Find a bound on the number of iterations needed to achieve an approximation with accuracy 10^-4 to the solution of x^3 − x − 1 = 0 on the interval [1, 2] using the bisection method. Do not actually compute the approximation. Just find the bound.

13. The graph of f(x) over the interval [0.75, 2] is shown below. Notice f(x) has three roots on this interval: approximately .795, 1.06, and 1.59. To which of the three roots does the bisection method converge if we let a = .75 and b = 2? How do you know?
    [Figure: graph of f(x) over the interval [0.75, 2].]

14. Suppose you are trying to find the root of f(x) = x − e^(−x) using the bisection method. Find an integer a such that the interval [a, a + 2] is an appropriate one in which to start the search.

15. Find a lower bound on the number of iterations it would take to guarantee accuracy of 10^-20 in question 6.

16. How many steps (iterations) of the bisection method are necessary to guarantee a solution with 10^-10 accuracy if a root is known to be within [4.5, 5.3]? [A]

17. Suppose you are using the bisection method on an interval of length 3. How many iterations are necessary to guarantee accuracy of the approximation to within 10^-6?

18. Suppose a function g satisfies the assumptions of the bisection method on the given interval. Starting with that interval, how many iterations are needed to approximate the root to within the given tolerance?
    (a) [−7, 10]; 10^-6
    (b) [5, 9]; 10^-3
    (c) [9, 15]; 10^-10
    (d) [−6, −1]; 10^-105 (assume the computer calculates with 300 significant digits so round-off error is not a problem)

19. 1 is a root of f(x) = ln(x^4 − x^3 − 7x^2 + 13x − 5) that can not be found by the bisection method.
    (a) Use a graph of the function near 1 to explain why. You may use the Octave code below to produce an appropriate graph.
    (b) Run the bisection method on f over the interval [0.8, 1.2] anyway. What happens instead of finding the root?

    x=0.8:.01:1.2;
    f=inline("log(x.^4-x.^3-7*x.^2+13*x-5)");
    plot(x,f(x))

20. 4 is a root of g(x) = | sin(πx)| that can not be found by the bisection method.
    (a) Use a graph of the function near 4 to explain why. You may use the Octave code below to produce an appropriate graph.
    (b) Run the bisection method on f over the interval [3.5, 4.5] anyway. What happens instead of finding the root?

    x=3.5:.05:4.5;
    f=inline("abs(sin(pi*x))");
    plot(x,f(x))

21. Let f(x) = sin(x^2). f is continuous on [4, 5], but f(4) < 0 and f(5) < 0, so the assumptions of the bisection method are not met. Nonetheless, using the bisection method as described in the pseudo-code on f over the interval [4, 5] does produce a root. Explain. [S]

22. The functions in questions 2e, 2f, and 2g all fail to meet the assumptions of the bisection method on the interval [−4, −0.5]. For each one, explain how so.

23. Write an Octave function called collatz that takes one integer input, n, and returns 3n + 1 if n is odd and n/2 if n is even. Save it as a collatz.m file. Use an if then else statement in your function. HINT: Use the Octave ceiling function. If ceil(n/2) equals n/2, then n must be even (no remainder when divided by 2). Use your collatz function to calculate [A]
    (a) collatz(17)
    (b) collatz(10)
    (c) collatz(109)
    (d) collatz(344)

24. Write your own absolute value function called absval (abs is already defined by Octave, so it is best to use a different name) that takes a real number input and returns the absolute value of the input. Use an if then else statement in your function. Save it as absval.m and test it on the following computations.
    (a) |−3|
    (b) |123.2|
    (c) |π − 22/7|
    (d) |10 − π^2|

25. f(x) = sin(x^2) has five roots on the interval [7, 8]. f(7) < 0, f(8) > 0, and f is continuous on [7, 8], so the assumptions of the bisection method are met. The method will converge to a root.
    (a) Use your bisection.m file (Exercise 1) to determine which one. [A]
    (b) Find 4 different intervals for which the bisection method will converge to the other four roots in [7, 8].

26. The function shown has roots at approximately 2.41, 4.11, 5.62, 7.01, 8.32, 9.57, 10.78, and 11.94. To which root will the bisection method converge with the given starting interval?
    [Figure: graph of the function.]
    (a) [2, 3]
    (b) [6, 8]
    (c) [2, 6]
    (d) [5, 9]
    (e) [10, 12] Note: the assumptions of the bisection are not met on this interval. Nonetheless, the method as outlined in the pseudo-code will converge to a root!

27. Find an interval of length 1 over which the bisection method may be applied in order to find a root of f(x) = x^4 − 7.6746x^3 − 40.7477022x^2 + 200.9894434x + 319.0914281.

28. The following algorithm is one possible incarnation of the bisection method.

    Assumptions: f is continuous on [a, b]. f(a) and f(b) have opposite signs.
    Input: Interval [a, b]; function f
    Step 1: For j = 1 . . . 15 do Steps 2 and 3:
    Step 2: Set m = (a + b)/2;
    Step 3: If f(a)f(m) < 0 then set b = m; else set a = m;
    Step 4: Print m.
    Output: Approximation m.

    (a) Apply this algorithm to the function f(x) = (x)(x − 2)(x + 2) over the interval [−3, 3]. Which root will this algorithm approximate?
    (b) How accurate is the approximation guaranteed to be according to the formula |pn − p| ≤ (b − a)/2^n?
    (c) How accurate is the approximation in reality? Compare this to the bound in (b).
    (d) Modify the algorithm so it will approximate a different root using the same starting interval.
    (e) Modify the algorithm so it does not use multiplication.

29. Use the following pseudo-code to write a slightly different implementation of the bisection method. Refer to Table 1.1 if you are unsure how to program the quantity ⌈(ln(b − a) − ln(TOL))/ln 2⌉. The while loop is discussed on page 61.

    Input function f, endpoints a and b; tolerance TOL.
    Return approximate solution p and f(p) and the number of iterations done N0.
    Step 1 Set i = 1; FA = f(a); N0 = ⌈(ln |b − a| − ln(TOL))/ln 2⌉;
    Step 2 While i ≤ N0 do Steps 3-6.
    Step 3 Set p = (a + b)/2; FP = f(p);
    Step 4 If FP = 0 then Return(p, f(p), N0); STOP.
    Step 5 Set i = i + 1;
    Step 6 If FA · FP > 0 then Set a = p; FA = FP; else Set b = p;
    Step 7 Return(p, f(p), N0); STOP.

    (a) Discuss the advantages/disadvantages of this algorithm compared to the one on page 42.
    (b) Where does the calculation N0 = ⌈(ln(b − a) − ln(TOL))/ln 2⌉ come from?

30. Use the code you wrote for question 29 to find solutions accurate to within 10^-5 for the following problems.
    (a) x − 2^(−x) = 0 on [0, 1]
    (b) e^x − x^2 + 3x − 2 = 0 on [0, 1]
    (c) 2x cos(2x) − (x + 1)^2 = 0 on [−3, −2] and on [−1, 0]

31. Find an approximation of √3 correct to within 10^-4 using the bisection method. Write an essay on how you solved this problem. Include your bisection code, what function and what interval you used and why.

32. A trough of length L has a cross section in the shape of a semicircle with radius r. When filled with water to within a distance h of the top, the volume V of water is
    V = L[0.5πr^2 − r^2 arcsin(h/r) − h√(r^2 − h^2)].
    Suppose L = 10 ft, r = 1 ft, and V = 12.4 ft^3. Find the depth of the water in the trough to within 0.01 ft. Note: In Octave, use asin(x) for arcsin(x) and pi for π.

Answers
What is the next value?: T2(3.25) − ln(3.25) ≈ .10429, which overshoots the mark. So 3.25 becomes the new
left endpoint, and the next value is (3.25 + 3.5)/2 = 3.375, the midpoint of 3.25 and 3.5.
The right endpoint is 13.13: Starting with a = 10 and b = 14, note that g(a) ≈ .088 > 0 and g(b) ≈ −.044 < 0,
so g of the left endpoint should always be positive and g of the right endpoint should always be negative:

a m b g(m)
10 12 14 .044 ⇒ m becomes left endpoint
12 13 14 .006 ⇒ m becomes left endpoint
13 13.5 14 −.017 ⇒ m becomes right endpoint
13 13.25 13.5 −.005 ⇒ m becomes right endpoint
13 13.125 13.25 .0004 ⇒ m becomes left endpoint
13.125 13.1875 13.25 −.002 ⇒ m becomes right endpoint
13.125 13.15625 13.1875 −.0009 ⇒ m becomes right endpoint
13.125 13.140625 13.15625 −.0002 ⇒ m becomes right endpoint
13.125 13.1328125 13.140625

2.2 Fixed Point Iteration


Grab your calculator. Anything with a cosine button will do nicely. Presuming you have a simple scientific
calculator, press the all-clear button, usually marked AC or just C. The screen should now display 0. Press the
cosine button, which should be marked cos. The screen should display 1. Press the cosine button again. The
screen should display 0.540302 . . .. Repeat. Repeat again. In fact, continue pressing the cosine button until you
notice a pattern.
If you have a fancier calculator with a previous-answer button, usually marked Ans, press 0 and then Enter or =.
Then press the cosine button and then the previous-answer button. Then press Enter or = to do the computation.
The first time around, the screen should display 1 (just as with a scientific calculator). To repeat, however, just
press Enter or = again. This will repeat the last computation. In this case, the cosine of the previous answer. The
screen should display 0.540302 . . .. Now repeat until you notice a pattern.
After about 30 repetitions, or, as we will call them, iterations, your calculator should display a number like
0.739083847 . . .. And no matter how many times you repeat, or iterate, the calculation, it won’t change much. In
fact, once it reaches 0.7390851332 . . ., it won’t change at all (unless your calculator shows more decimal places—after
about 90 iterations, a calculator showing 15 decimal places will display 0.739085133215161 and it won’t change from
there). What that means is cos(0.7390851332 . . .) = 0.7390851332 . . .. And we call 0.7390851332 . . . a fixed point of
the cosine function. The value is fixed (does not change) when the cosine function is applied. Put another way, at
0.7390851332 . . ., the input and output of the cosine function are equal. See a simulation of this iteration online at
the companion website.
Perhaps a whole series of questions now comes to mind. Why does this work? What if we start with a number
other than 0? Does this work with any function? Can we predict when it will or won’t work? Can we find roots
this way? Is convergence fast? In this section and the next, we will give at least partial answers to all of these
questions. We start with “Why does this work?”.
Consider solving the system

    y = cos(x)
    y = x.
One way to do so is by the method of substitution. If we substitute y = x into y = cos x we get x = cos x or
cos x = x. The solutions of the system coincide exactly with the fixed points of the cosine function, for any solution
of cos x = x is a value x that is fixed by the cosine. Since systems of two equations in two unknowns can be solved,
at least approximately, by graphing, this suggests that we might take a look at the graph of the system in order to
learn more about what is happening during iteration.

Figure 2.2.1: Finding the fixed point of cos(x). [Panel (a): graphs of y = cos(x) and y = x. Panel (b): the web diagram of the iteration.]

Figure 2.2.1(a) shows the graphs of y = cos(x) and y = x over the interval [0, 1]. We can see the intersection at
around (0.75, 0.75) so we should think that the fixed point is around 0.75 (which of course we know is true from our
calculator experiment). Figure 2.2.1(b) illustrates the exercise of computing cos(0), cos(1), cos(0.540302 . . .), . . ..
Following the vertical line segment from (0, 0) to (0, 1) represents calculating cos(0). Following the horizontal
continuation from (0, 1) to (1, 1) and subsequently the vertical line segment from (1, 1) to (1, 0.540302 . . .) rep-
resents calculating cos(1). Following the horizontal line from (1, 0.540302 . . .) to (0.540302 . . . , 0.540302 . . .) and

subsequently the vertical line from (0.540302 . . . , 0.540302 . . .) to (0.540302 . . . , 0.857553 . . .) represents calculating
cos(0.540302 . . .), and so on. With each pair of line segments, one going horizontally from the graph of y = cos(x)
to the graph of y = x followed by one going vertically from the line y = x to the graph of y = cos(x), another
iteration is shown. Figure 2.2.1(b) is sometimes called a web diagram [2], and is commonly used to illustrate the
concept of iteration. That the path of the web diagram tends toward (0.739085 . . . , 0.739085 . . .) is an unavoidable
consequence of the geometry of the graph of cos(x).
What if we start with a number other than 0? Using figure 2.2.1, you should be able to convince yourself that
convergence to the point (0.7390851332 . . . , 0.7390851332 . . .) is assured for any initial value between 0 and 1. Try
it. Start anywhere on the line y = x. Proceed vertically to the graph of y = cos(x). Then horizontally to the line
y = x. And repeat. You should find that the path of the web diagram always tends toward the intersection of the
graphs. Now consider starting with any real number, r. The cosine of any real number is a number in the interval
[−1, 1] so cos(r) ∈ [−1, 1]. And the cosine of any number in the interval [−1, 1] is a number in the interval [0, 1] so
cos(cos(r)) ∈ [0, 1]. That is, the second iteration is in the interval from 0 to 1. So after only two iterations, any
initial value will become a value between 0 and 1. And our web diagram implies that further iteration will lead to
the fixed point. So, regardless of the initial value, iteration leads to the fixed point. And the preceding argument
forms the seed for a proof of this fact.
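A quick numerical check of this argument, using an arbitrary starting value well outside [0, 1]:

r = 100;              % any real number will do
disp(cos(r))          % the first iterate lies in [-1, 1]
disp(cos(cos(r)))     % the second iterate lies in [0, 1]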
Not all functions are so well behaved, however. For example, 1^2 = 1. In other words, 1 is a fixed point of
the function y = x^2. However, iteration starting with any number other than 1 or −1 does not lead to this fixed
point. If we start with any number greater than 1 and square it, it becomes greater. And if we square the result, it
becomes greater still. And squaring again only increases the value, without bound. Hence, iteration starting with
any value greater than 1 (or less than −1) does not lead to convergence to the fixed point 1. Nor does iteration
starting with any number of magnitude less than 1. Figure 2.2.2 illustrates iteration of y = x^2 with initial value 0.9.

Figure 2.2.2: Visualizing the iteration of f(x) = x^2.

Follow the web diagram from the point (0.9, 0.9) vertically to the graph of y = x^2 and then horizontally back to
the line y = x, and so on, to check for yourself. This is a nice illustration of the fact that the square of any number
between 0 and 1, exclusive, is smaller than the number itself. With starting values between −1 and 1 exclusive
of ±1, iteration gives a sequence converging to 0, not 1. To summarize, excepting −1 and 1, no initial value will
produce a sequence converging to 1 under iteration of the function y = x2 .
There is a fundamental difference between the fixed point 0.7390851332 . . . of f(x) = cos(x) and the fixed point 1 of g(x) = x². Fixed point iteration converges to 0.7390851332 . . . under f(x) = cos(x) for any initial value. Fixed point iteration fails to converge to 1 under g(x) = x² for all initial values but ±1.² Examining the graphs of f(x) and g(x) each superimposed against the line y = x in the neighborhood of their respective fixed points can give a clue [Figure 2.2.3] as to the difference. True, f(x) has a negative slope at its fixed point while g(x) has a positive slope at its fixed point. You can see this from the graphs or you can “do the calculus”. The important difference, though, is more subtle. It’s not the sign of the slope at the fixed point that matters. It’s the magnitude of the slope at the fixed point that matters. For smooth functions, neighborhoods of points with slopes of magnitude
² For a third type of behavior, fixed point iteration converges to 0 under g(x) for initial values near 0, but not for others!

Figure 2.2.3: Left: f(x) = cos(x) and y = x. Right: g(x) = x² and y = x.

greater than 1 tend to be expansive. That is, points move away from one another under application of the function.
However, neighborhoods of points with slopes of magnitude less than 1 tend to be contractive. That is, points move
toward one another under application of the function.

Proposition 2. If h(x) is differentiable on (a, b) with |h′(x)| < 1 for all x ∈ (a, b), then whenever x1, x2 ∈ (a, b), |h(x2) − h(x1)| < |x2 − x1|.

Proof. Let x1, x2 ∈ (a, b) and, without loss of generality, let x2 > x1 so that we may properly refer to the interval from x1 to x2. Since h(x) is continuous on [x1, x2] and differentiable on (x1, x2), the mean value theorem gives us c ∈ (x1, x2) ⊆ (a, b) such that h′(c) = (h(x2) − h(x1))/(x2 − x1). But |h′(c)| < 1 by assumption, so |h(x2) − h(x1)|/|x2 − x1| < 1, from which we immediately conclude that |h(x2) − h(x1)| < |x2 − x1|.

Moreover, a function whose derivative has magnitude less than 1 can only cross the line y = x one time. Once it
has crossed, it can never “catch up” because that would require a slope greater than 1, the slope of the line y = x.

Proposition 3. Suppose h(x) is continuous on [a, b], differentiable on (a, b) with |h′(x)| < 1 for all x ∈ (a, b), and h([a, b]) ⊆ [a, b]. Then h has a unique fixed point in [a, b].

Proof. If h(a) = a or h(b) = b, we have proved existence, so suppose h(a) ≠ a and h(b) ≠ b. Since h([a, b]) ⊆ [a, b] it must be the case that h(a) > a and h(b) < b. It immediately follows that h(a) − a > 0 and h(b) − b < 0. Since the auxiliary function f(x) = h(x) − x is continuous on [a, b], the Intermediate Value Theorem guarantees the existence of c ∈ (a, b) such that f(c) = 0. By substitution, h(c) − c = 0, implying h(c) = c, so c is a fixed point of h. The existence of a fixed point is established. Now suppose c1 ∈ [a, b] and c2 ∈ [a, b] are distinct fixed points of h. Then

    (h(c1) − h(c2))/(c1 − c2) = (c1 − c2)/(c1 − c2) = 1.

By the mean value theorem, there exists c3 between c1 and c2 such that h′(c3) = 1, contradicting the fact that |h′(x)| < 1 for all x ∈ (a, b). Hence, it is impossible that c1 and c2 are distinct.

Hence, we can reasonably expect that when the derivative at a fixed point has magnitude less than 1, iteration is
a viable method for approximating (finding) the fixed point, but when the derivative at a fixed point has magnitude
greater than 1, iteration is not a viable method of approximating the fixed point. We must be careful, though,
not to take this rule of thumb as absolute. It only applies to so-called well-behaved functions. In this case, that
the function has a continuous first derivative in the neighborhood of the fixed point is well-behaved enough. The
following theorem establishes that fixed point iteration will converge in a neighborhood of a fixed point if the
magnitude of the function’s derivative is less than 1 there.

Theorem 4. (Fixed Point Convergence Theorem) Given a function f(x) with continuous first derivative and fixed point x̂, if |f′(x̂)| < 1 then there exists a neighborhood of x̂ in which fixed point iteration converges to the fixed point for any initial value in the neighborhood.

Proof. By continuity, there exists ε > 0 such that |f′(x)| < 1 for all x ∈ (x̂ − ε, x̂ + ε). Let 0 < δ < ε and set M = max over x ∈ [x̂ − δ, x̂ + δ] of |f′(x)|. Since |f′| is continuous on the closed interval [x̂ − δ, x̂ + δ] ⊆ (x̂ − ε, x̂ + ε), this maximum is attained, so M < 1. Now suppose x0 is a particular but arbitrary value in (x̂ − δ, x̂ + δ). As in proposition 2, the Mean Value Theorem is applied. This time, we are guaranteed c ∈ (x̂ − δ, x̂ + δ) such that f′(c) = (f(x̂) − f(x0))/(x̂ − x0). But |f′(c)| ≤ M so |f(x̂) − f(x0)| ≤ M|x̂ − x0|. Furthermore x̂ is a fixed point, so f(x̂) = x̂, from which it follows that |x̂ − f(x0)| ≤ M|x̂ − x0|. Now we define xk = f(xk−1) for all k ≥ 1 and prove by induction that |x̂ − xk| ≤ M^k |x̂ − x0| for all k ≥ 1. Since x1 = f(x0), we have already shown |x̂ − x1| ≤ M|x̂ − x0|, so the claim is true when k = 1. Now suppose |x̂ − xk| ≤ M^k |x̂ − x0| for some particular but arbitrary value k ≥ 1. Note that |x̂ − xk| ≤ M^k |x̂ − x0| implies xk ∈ (x̂ − δ, x̂ + δ), so we apply the Mean Value Theorem as before and conclude that |x̂ − f(xk)| ≤ M|x̂ − xk|. Substituting xk+1 for f(xk) and using the inductive hypothesis, we have |x̂ − xk+1| ≤ M · M^k |x̂ − x0| = M^(k+1) |x̂ − x0|. Hence, we have 0 ≤ |x̂ − xk| ≤ M^k |x̂ − x0|. Of course lim_{k→∞} 0 = 0 and, since M < 1, lim_{k→∞} M^k |x̂ − x0| = 0, so by the squeeze theorem, lim_{k→∞} |x̂ − xk| = 0.

As suggested earlier, we should not expect fixed point iteration to converge when the derivative at a fixed
point has magnitude greater than one. In fact, more or less the opposite happens. There is a neighborhood of
the fixed point in which fixed point iteration is guaranteed to escape the neighborhood for any initial value in the
neighborhood not equal to the fixed point itself. Given that fact, it is tempting to think that perhaps the Fixed
Point Convergence Theorem could be strengthened to a bi-directional implication, an if-and-only-if claim. And it
“almost” can. What can be said here has direct parallels to the ratio test for series. Recall, for any sequence of real

numbers a0, a1, a2, . . ., the limit L = lim_{k→∞} |ak+1/ak| helps determine the convergence of Σ_{k=0}^∞ ak in the following way:

• If L < 1, then Σ_{k=0}^∞ ak converges (absolutely).

• If L > 1, then Σ_{k=0}^∞ ak diverges.

• If L = 1, then Σ_{k=0}^∞ ak may converge (absolutely or conditionally) or may diverge.

Analogously, for any function f(x) with continuous first derivative and fixed point x̂, the derivative f′(x̂) helps determine the convergence of the fixed point iteration method in the following way:

• If |f′(x̂)| < 1, then fixed point iteration converges to x̂ for any initial value in some neighborhood of x̂.

• If |f′(x̂)| > 1, then fixed point iteration escapes some neighborhood of x̂ for any initial value in the neighborhood other than x̂.

• If |f′(x̂)| = 1, then fixed point iteration may converge to x̂ for any initial value in some neighborhood of x̂; or may escape some neighborhood for any initial value in the neighborhood other than x̂; or may have no neighborhood in which all initial values lead to convergence and no neighborhood in which all values other than x̂ escape.

The graphs in Figure 2.2.4 of functions with derivative equal to one at their fixed point help illustrate this last case.

For one of these functions, fixed point iteration converges for all values in a neighborhood of the fixed point. For
another of these functions, fixed point iteration escapes some neighborhood of the fixed point for all initial values in
the neighborhood except the fixed point itself. And for the third of these functions, fixed point iteration converges
to the fixed point for some initial values and escapes a neighborhood of the fixed point for others (and every
neighborhood of the fixed point will have both types of initial values). Can you tell which is which? Figure it out
by creating web diagrams for each. Answer on page 55.
The proof of the Fixed Point Convergence Theorem can easily be extended to include initial values in any
neighborhood of the fixed point in which the magnitude of the derivative remains less than 1. The size and
symmetry of the interval are not important. For example, f(x) = (1/8)x³ − x² + 2x + 1 has a fixed point at x̂ = 2. The
proof of the Fixed Point Convergence Theorem establishes convergence to 2 in a symmetric interval about 2 such

Figure 2.2.4: Convergence behavior when the derivative at the fixed point is 1.

as [1.9, 2.1]. But this interval is far from the largest neighborhood of initial values for which fixed point iteration
converges to 2. We can find bounds on the largest such interval by solving the equation |f′(x)| = 1. To that end:

    (3/8)x² − 2x + 2 = ±1
    3x² − 16x + 16 = ±8
    3x² − 16x + 24 = 0   or   3x² − 16x + 8 = 0
    x = (8 ± 2i√2)/3   or   x = (8 ± 2√10)/3
    (8 − 2√10)/3 ≈ 0.558   and   (8 + 2√10)/3 ≈ 4.775,

so we should expect fixed point iteration to converge to 2 on any closed interval contained in

    ( (8 − 2√10)/3 , (8 + 2√10)/3 ).
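If you would rather let Octave do the algebra, its built-in roots function returns the zeros of a polynomial given its coefficients; a quick check of the two quadratics above might look like this:

roots([3 -16 24])   % complex pair, approximately 2.6667 +/- 0.9428i
roots([3 -16 8])    % approximately 4.7749 and 0.5584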

Now, if we have the computer execute fixed point iteration for a large number of evenly spaced initial values, say 100, on the interval [−2, 8] and record the results on a number line where we color an initial value black if it does not converge to 2 and green if it does converge to 2 (we will call such a diagram a convergence diagram), we get a diagram which shows that fixed point iteration converges to 2 on approximately [−0.5, 6.5]. Indeed, the experiment confirms that fixed point iteration converges on any closed interval contained in ( (8 − 2√10)/3 , (8 + 2√10)/3 ) as predicted. But the diagram shows convergence on an even larger set. We can conclude that the Fixed Point Convergence Theorem gives sufficient but not necessary conditions for convergence in a neighborhood of a fixed point.
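The following is a minimal Octave sketch of such an experiment. The cap of 100 iterations per initial value and the 10⁻⁶ closeness test are our own choices, not part of the description above.

f = @(x) x.^3/8 - x.^2 + 2*x + 1;
x0s = linspace(-2, 8, 100);           % 100 evenly spaced initial values
green = zeros(size(x0s));             % 1 = converges to 2, 0 = does not
for j = 1:length(x0s)
  x = x0s(j);
  for k = 1:100
    x = f(x);
  end%for
  green(j) = abs(x - 2) < 1e-6;
end%for
plot(x0s(green==1), 0*x0s(green==1), 'g.', ...
     x0s(green==0), 0*x0s(green==0), 'k.')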
A graph of the function f(x) superimposed on the line y = x (Figure 2.2.5) gives some insight as to why the bounds (8 ± 2√10)/3 do not tell a complete story. By imagining the web diagram for any initial value between the two fixed points other than 2, that is, −0.61 and 6.61, you should be able to convince yourself that fixed point iteration
converges to 2 for any initial value in the interval (−0.61, 6.61). Can you prove it? Graphs like those in Figures
2.2.3, 2.2.4, and 2.2.5 are indispensable and should always be consulted when trying to understand fixed point
iteration, but they should not be relied upon as proof. For that, we need to rely on theorems like the Fixed Point
Convergence Theorem.

Crumpet 10: One interesting quadratic



Figure 2.2.5: f(x) = (1/8)x³ − x² + 2x + 1 and the line y = x

Trying to find roots of the logistic equation

g(x) = (α − 1)x − αx²

by applying fixed point iteration to the corresponding function f (x) = x + g(x) = αx(1 − x) is a famous exercise
in dynamical systems which has a nasty habit of not working! Complete the following investigation to see what
happens.

1. Show that f (x) = αx(1 − x) as claimed.


2. For each of the values α = 2.5, α = 3.2, α = 3.833, and α = 4, do the following.
(a) Find the positive fixed point of f (root of g) analytically (using a pencil, paper, and some algebra).
(b) Set x0 = 0.1 and use a computer program to calculate x975 through x1000 (a sketch of one such program appears at the end of this investigation).
(c) Examine the 26 iterations of part (b) and describe what you see.
3. Draw a connection between your results from part 2 and the following diagram.

[Diagram: vertical axis from 0 to 1; horizontal axis α from 2.4 to 4.]

4. Use the diagram to predict a value of α for which you would expect fixed point iteration to lead to x975
through x1000 cycling through 4 different values. Check your prediction.
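Below is a minimal Octave sketch of the computation requested in part 2(b). The value α = 3.2 is just one of the four values to try, and the printf format is our choice.

alpha = 3.2;                  % repeat with 2.5, 3.833, and 4
f = @(x) alpha*x.*(1 - x);
x = 0.1;                      % x0
for k = 1:1000
  x = f(x);
  if (k >= 975)
    printf("x%d = %.15f\n", k, x)
  end%if
end%for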

Figure 2.2.6: Convergence diagrams for 6 functions with the same fixed points.

f1 :

f2 :

f3 :

f4 :

f5 :

f6 :
black: does not converge; green: converges to 3; red: converges to 1 + √3; blue: converges to 1 − √3

Root Finding
When successful, fixed point iteration finds solutions of an equation of the form f (x) = x. A root finding problem
requires the solution of an equation of the form g(x) = 0. However, the equation f (x) = x has exactly the same
solutions as the equation f (x)−x = 0, so finding fixed points of f (x) is equivalent to finding roots of g(x) = f (x)−x.
Indeed, we can rephrase the example of finding fixed points of f(x) = (1/8)x³ − x² + 2x + 1 as the problem of finding roots of g(x) = f(x) − x = (1/8)x³ − x² + x + 1. But it is the opposite problem that is much more common. We have
the question of finding the roots of a function and need to rephrase it in terms of a fixed point problem.
Suppose we want the roots of g(x) = −x3 + 5x2 − 4x − 6. We can rephrase the question of solving g(x) = 0 as
the problem of finding the fixed points of many different functions! But you will have to ignore some sage advice of
your algebra teacher to derive them! The key is to use algebra to rewrite the equation −x3 + 5x2 − 4x − 6 = 0 as
an equation of the form x = f (x). The simplest way is to add x to both sides of the equation. This manipulation
and several others are shown in the following list.
• −x³ + 5x² − 4x − 6 = 0 ⇒ −x³ + 5x² − 3x − 6 = x

• −x³ + 5x² − 4x − 6 = 0 ⇒ −x³ + 5x² − 6 = 4x ⇒ (−x³ + 5x² − 6)/4 = x

• −x³ + 5x² − 4x − 6 = 0 ⇒ −x³ − 4x − 6 = −5x² ⇒ (x³ + 4x + 6)/5 = x² ⇒ ±√((x³ + 4x + 6)/5) = x

• −x³ + 5x² − 4x − 6 = 0 ⇒ 5x² − 4x − 6 = x³ ⇒ ∛(5x² − 4x − 6) = x

Can you see what has been done for each one? Thus, we have five candidates for fixed point iteration, f1(x) = −x³ + 5x² − 3x − 6, f2(x) = (−x³ + 5x² − 6)/4, f3(x) = √((x³ + 4x + 6)/5), f4(x) = −√((x³ + 4x + 6)/5), and f5(x) = ∛(5x² − 4x − 6),
all of which will potentially give roots of g(x). There is a sixth function we will discuss in much more detail later: f6(x) = (2x³ − 5x² − 6)/(3x² − 10x + 4).³ The roots of g(x) are 1 − √3 ≈ −0.73, 1 + √3 ≈ 2.73, and 3, so we will consider convergence diagrams over the interval [−2, 5]. Fixed point iteration converges to different fixed points for the different functions despite the fact that all 6 functions have exactly the same three fixed points. The convergence diagrams of Figure 2.2.6 are color-coded to reflect this fact. Black indicates lack of convergence just as before. Green, red, and blue indicate convergence to 3, 1 + √3, and 1 − √3, respectively. Notice that only f6 provides convergence for, as far as we can tell, every initial value in [−2, 5], and is also the only one for which fixed point iteration converges to different fixed points for different initial values. See if you can understand why each function has the convergence behavior it does by looking at the graphs of f1, f2, . . . , f6. Pay special attention to the graphs around 1 + √3 and 3. Looks can be deceiving in that area because the two fixed points are so close together. Also, see if you can find two initial values in [−2, 5] for which fixed point iteration on f6 does not converge. What happens instead? For an extra challenge, see if you can find a third point in [−2, 5] for which fixed point iteration on f6 does not converge. Hint: you may need to use a computer algebra system to find such a point exactly or use fixed point iteration to approximate it! Answers on page 55.

³ By calculating f6(1 − √3), f6(1 + √3), and f6(3), you can verify that f6 has these three values as fixed points as well.

The Fixed Point Iteration Method (pseudo-code)


Though we spent a lot of time talking about how to determine whether we should expect the fixed point iteration
method to converge or not, none of that information is strictly relevant to coding the method. Any implementation
of the method should allow the user to try fixed point iteration for any function with any initial value. It is the user’s
responsibility to understand that when the assumptions are not met, the results are unpredictable. Remember,
“garbage in...garbage out.”
The fixed point iteration method presents a problem that the bisection method did not. In the bisection method,
there was a simple and convenient formula for an upper bound on the error. To provide something similar in the
fixed point iteration method, one would have to sacrifice simplicity or convenience or both, but the benefits do
not outweigh the sacrifice. Instead, a more general stopping criterion is used. When two consecutive iterations are
closer to one another than a given tolerance, the method stops. At this point, the difference between iterations,
say xk and xk+1 , is smaller than the tolerance. For a sequence derived from fixed point iteration, xk+1 = f (xk ), so
|xk+1 − xk | = |f (xk ) − xk |. When |xk+1 − xk | is small, |f (xk ) − xk | is small, so f (xk ) ≈ xk . xk is “almost” a fixed
point.

Assumptions: f is differentiable. f has a fixed point x̂. x0 is in a neighborhood (x̂ − δ, x̂ + δ) where the
magnitude of f′ is less than one.
Input: Initial value x0 ; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: For j = 1 . . . N do Steps 2-4:
Step 2: Set x = f (x0 );
Step 3: If |x − x0 | ≤ tol then return x;
Step 4: Set x0 = x;
Step 5: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
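One possible translation of this pseudo-code into Octave is sketched below; the file name fixedpoint.m and the argument order are our own choices, not prescribed by the text.

function x = fixedpoint(f, x0, tol, N)
  % A direct translation of the pseudo-code above.
  for j = 1:N
    x = f(x0);               % Step 2
    if (abs(x - x0) <= tol)  % Step 3
      return
    end%if
    x0 = x;                  % Step 4
  end%for
  disp("Method failed. Maximum iterations exceeded.")
end%function

For example, fixedpoint(@cos, 0, 1e-10, 100) should return 0.7390851332 . . . after a few dozen iterations.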

Key Concepts
Fixed point: x0 is a fixed point of the function f (x) if f (x0 ) = x0 .
Fixed point iteration: Calculating the sequence x0 , x1 = f (x0 ), x2 = f (x1 ), x3 = f (x2 ), . . . given the function f
and initial value x0 .
Attractive fixed point: A fixed point is called attractive (or attracting) if there is a neighborhood of the fixed
point in which fixed point iteration converges for all initial values in the neighborhood.
Repulsive fixed point: A fixed point is called repulsive (or repelling) if fixed point iteration escapes some neigh-
borhood of the fixed point for any initial value in the neighborhood other than the fixed point itself.
Mean Value Theorem: If f is continuous on [a, b] and has a derivative on (a, b), then there exists c ∈ (a, b) such that f′(c) = (f(b) − f(a))/(b − a).

Fixed Point Convergence Theorem: Given a function f(x) with continuous first derivative and fixed point x̂, if |f′(x̂)| < 1 then there exists a neighborhood of x̂ in which fixed point iteration converges to the fixed point for any initial value in the neighborhood.

Exercises Value Theorem are met for the function over the inter-
val. (ii) If the hypotheses are met, find a value c as
1. Write an Octave implementation of the fixed point it- guaranteed by the theorem.
eration method. Save it as a .m file for future use.

2. (i) Decide whether or not the hypotheses of the Mean (a) f (x) = 3 − x − sin x; [2, 3]

(b) g(x) = 3x4 − 2x3 − 3x + 2; [0, 1] 9. Use proposition 3 to show that g(x) = 2x(1 − x) has a
4 3
(c) g(x) = 3x − 2x − 3x + 2; [0, 0.9] [S] unique fixed point on [0.3, 0.7].
3x2 −1 [S]
(d) h(x) = 10 − cosh(x); [−3, −2] [A] 10. Let f (x) = 6x+4
.

(e) f (t) = 4 + 5 sin t − 2.5; [−6, −5] (a) Show that f has a unique fixed point on
3t2 tan t [S] [−4, −0.9].
(f) g(t) = 1−t2
; [20, 23]
3t [A]
(b) Use fixed point iteration to find an approximation
(g) h(t) = ln(3 sin t) − 5
; [2, 4] to the fixed point that is accurate to within 10−2 .
(h) f (r) = esin r − r; [−20, 20] [A]
11. Let g(x) = π + 0.5 sin(x/2).
(i) g(r) = sin(e ) + r; [−3, 3]
r
(a) Show that g has a unique fixed point on [0, 2π].
(j) h(r) = 2sin r − 3cos r ; [1, 3]
(b) Use fixed point iteration to find an approximation
3. Find the fixed point(s) of the function exactly. Use to the fixed point that is accurate to within 10−2 .
algebra.
√ 12. Show that√the fixed point iteration method applied to
(a) f (x) = 3 2x3 − x2 − x f (x) = 3 8 − 4x will converge to a root of g(x) =
ln(2 x3 + 4x − 8 for any initial value x0 ∈ [1.2, 1.5]. [S]
(b) f (x) = 2
13. Show that fixed point iteration is guaranteed to con-
(c) f (x) = log(x2 − 3x) − 1 + x [A]
verge to the fixed point of
(d) g(x) = 3x2 + 5x + 1 [A] √
5000
f (x) = ( 2)x
(e) g(x) = x + 1+2e−3t
− 2500 √
for any x0 ∈ [1, 3]. HINT: f 0 (x) = 21 ln(2) · ( 2)x .
(f) g(x) = eln(x+1)−3
√ 14. Let g(x) = x2 − 3x − 2.
(g) h(x) = 4x2 + 4x + 1
[S] (a) Find a function f on which fixed point iteration
(h) h(x) = x − 10 + 3x + 25 · 3−x
will converge to a root of g.
(i) h(x) = x + 6 − 3 log5 (2x)
(b) Use your function to find a root of g to within
4. Find at least two candidate functions, f1 (x) and f2 (x), 10−3 of the exact value.
for finding roots of g(x) via fixed point iteration. In (c) State the initial value you used and how many
other words, convert the problem of finding a root of g iterations it took to get the approximation.
into a problem of finding a fixed point of f1 or f2 .
15. Use fixed point iteration with p0 = −1 to approximate
(a) g(x) = 7x2 + 5x − 9 a root of g(x) = x3 − 3x + 3 accurate to the nearest
(b) g(x) = x + cos x 10−4 .
(c) g(x) = 6x5 + 12x2 − 8 [A] 16. Use a fixed√point iteration method to find an approx-
imation of 3 that is accurate to within 10−4 . What
(d) g(x) = x2 − e3x+4 [S]
function and initial value did you use?
(e) g(x) = 7x − 3 cos(πx − 2) + ln |2x2 + 4x − 8| 17. The function f (x) = x4 + 2x2 − x − 3 has two roots.
One of them is in [−1, 0] and the other is in [1, 2].
2 2
−5x−1 [A]
(f) g(x) = 3x −5x+1
− 2−x

5. Compute the first 5 iterations of the fixed point itera- (a) In preparation for finding a root of f (x) using
tion method using the given function and initial value. fixed point iteration, one way to manipulate the
Based on these 5 iterations, do you expect the method equation x4 + 2x2 − x − 3 = 0 is to add x to both
to converge? sides. This gives

(a) f (x) = 3 − sin x; x0 = 2 x = x4 + 2x2 − 3


[S]
(b) g(x) = 10 + x − cosh(x); x0 = −3 Draw appropriate graphs to determine whether it-
(c) h(t) = ln(3 sin t) + 2t ; t0 = 1 [A] eration of the function g(x) = x4 +2x2 −3 will find
5
sin r cos r
the root in [−1, 0]. How about the root in [1, 2]?
(d) w(r) = 2 −3 + r; r0 = 1 Explain how you came to your conclusions.
6. Use your Octave function from question 1 with the (b) Manipulate the equation x4 + 2x2 − x − 3 = 0 in
function and initial value in question 5. Set the tol- such a way that fixed point iteration does work
erance to 10−10 and the maximum iterations to 100. to find the root in [−1, 0]. Draw the graphs that
Does the method converge within 100 iterations? If so, demonstrate that your method will work.
to what value? Report at least 10 significant digits. (c) Does the same manipulation allow you to find the
[S][A]
root in [1, 2]? If not, find another manipulation
7. Construct a web diagram for each function/initial value that will. Again, show the graphs that demon-
pair in question 5. [S][A] strate that your method will work.
8. Compare the results from question 6 with the results (d) Use your method(s) from parts 17b and 17c to
of question 7. Are they consistent with one another? find the two roots accurate to 3 decimal places.


24. Let g(x) = 12 + 1x
x
18. Fixed point iteration on f (x) = 3 2x3 − x2 − x will 5
− 10−5 .
not converge to a fixed point. However,
√ fixed point
iteration on the function g(x) = 3 x2 + x will con- (a) Show that if g(x) has a zero at p, then the func-
verge to approximately 1.618033988749895 for any x0 tion f (x) = x + cg(x) has a fixed point at p.
in [0.5, 3.5]. [A] (b) Find a value of c for which fixed point iteration
of f (x) will successfully converge for any start-
(a) How many iterations does it take to achieve 10 −4
ing value, p0 , in the interval [16, 17]. Sketch the
accuracy using g(x) with x0 = 2.5?
graphs that demonstrate that your choice of c is
(b) Explain why f (x) and g(x) have the same fixed appropriate.
points.
(c) Use the function from part 24b with the value of
19. Find a zero (any zero) of g(x) = x2 + 10 cos x accurate c you have determined to find a root of g(x) ac-
to within 10−4 using fixed point iteration. State curate to within 10−4 . State the value you used
for p0 . Show the last 3 iterations. How many
(a) the function f to which you fixed point iteration iterations did it take?
(b) the initial value, x0 , you used
25. Prove that for f (x) = cos x, fixed point iteration con-
(c) how many iterations it took
verges for any initial value.
20. Let c be a nonzero real number. Argue that any fixed 26. The Fixed Point Convergence Theorem can be
point of f (x) = xec·g(x) is a root of g. strengthened. The requirement that the first deriva-

21. Approximate 3 using the method suggested by ques- tive be continuous can be replaced. Modify the proof
tion 20. in the text to show the following claim.
22. Suppose g(x̂) = 0 and g has a continuous first deriva- Given a differentiable function f (x) with fixed point x̂,
tive. Argue that there exists a value c for which fixed if |f 0 (x)| ≤ M < 1 for all x in some neighborhood of
point iteration on f (x) = x + cg(x) will converge to x̂ x̂, then fixed point iteration converges to the fixed point
on some neighborhood of x̂. for any initial value in the neighborhood.
23. Find a value of c for which fixed point iteration is guar- 27. Create three graphs similar to those in Figure 2.2.4 to
anteed to converge for the function f (x) = x + c(x − analyze the situation when the derivative at the fixed
5 cos x) with any initial value x0 ∈ [0, π/2]. Explain. point equals −1. Does the situation differ from that
[A]
when the derivative at the fixed point equals 1?

Answers
Figure 2.2.4: From left to right: every neighborhood of the fixed point will have both types of initial values; fixed point iteration converges for all values in a neighborhood of the fixed point; fixed point iteration escapes some neighborhood of the fixed point for all initial values in the neighborhood except the fixed point itself.
Figure 2.2.6: When its denominator is zero, f6(x) will be undefined (there is a vertical asymptote in the graph), so we solve 3x² − 10x + 4 = 0 to find two initial values for which fixed point iteration will fail (since the first iteration will be undefined). They are x = (5 ± √13)/3 ≈ 0.4648 and 2.868. To find a third point for which fixed point iteration will fail, we solve the equation f6(x) = (5 + √13)/3 (we could just as easily have solved f6(x) = (5 − √13)/3 instead). Then the second iteration will be undefined since the first iteration will be (5 + √13)/3. The only real solution is approximately 1.055909763230534, which can be found by fixed point iteration on

    ∛( ((10 + √13)/2)x² − ((25 + 5√13)/3)x + (19 + 2√13)/3 ).

Prove it. Note, though, the claim that fixed point iteration will fail is based on the assumption of exact arithmetic. The fact that any reasonable implementation of the fixed point iteration method will involve floating point arithmetic might provide just enough error for the method to converge even for these initial values.

2.3 Order of Convergence for Fixed Point Iteration


Suppose f is a function with fixed point x̂ and f′(x̂) exists. Let x0, x1, x2, . . . be a sequence derived from fixed point iteration (xk+1 = f(xk) for all k ≥ 1) such that lim_{k→∞} xk = x̂ and xk ≠ x̂ for all k = 0, 1, 2, . . .. Then

    |xn+1 − x̂| / |xn − x̂| = | (f(xn) − f(x̂)) / (xn − x̂) |

and

    lim_{n→∞} | (f(xn) − f(x̂)) / (xn − x̂) | = |f′(x̂)|.        (2.3.1)
Therefore, fixed point iteration is linearly convergent as long as f′(x̂) ≠ 0. The following proposition could be presented as a corollary to the Fixed Point Convergence Theorem since much of the argument simply repeats what was noted there, but we choose to present it as a separate claim based on equation 2.3.1. To be more precise, we have the following result.
Proposition 5. (Fixed Point Error Bound) Let f be a differentiable function with fixed point x̂ and let [a, b] be an interval containing x̂. If |f′(x)| ≤ M < 1 for all x ∈ [a, b] and f([a, b]) ⊆ [a, b], then for any initial value x0 ∈ [a, b], fixed point iteration, with xk+1 = f(xk) for all k ≥ 0, gives an approximation of x̂ with absolute error no more than M^k |x0 − x̂|.
Proof. An elementary induction proof (requested in the exercises) will establish that xk ∈ [a, b] for all k ≥ 0. We proceed to prove the error bound. The absolute error in approximating x̂ by x0 is |x0 − x̂| = M⁰|x0 − x̂| so the claim is true for k = 0. Now suppose the claim is true for some particular but arbitrary k ≥ 0. By the Mean Value Theorem, there is a c in the interval from x̂ to xk such that f′(c) = (f(xk) − f(x̂))/(xk − x̂). Since x̂ and xk are both in [a, b], so is c. It follows that |f′(c)| ≤ M, so |f(xk) − f(x̂)| ≤ M|xk − x̂|. But x̂ is a fixed point of f, so f(x̂) = x̂, from which it follows that |f(xk) − x̂| ≤ M|xk − x̂|, and, therefore, that |xk+1 − x̂| ≤ M|xk − x̂|. By the inductive hypothesis, |xk − x̂| ≤ M^k |x0 − x̂|, so |xk+1 − x̂| ≤ M · M^k |x0 − x̂| = M^(k+1) |x0 − x̂|.
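For a concrete illustration: f(x) = cos x maps [0.6, 0.9] into itself and |f′(x)| = sin x ≤ sin 0.9 ≈ 0.783 on that interval, so taking M = 0.783, any x0 ∈ [0.6, 0.9] satisfies |xk − x̂| ≤ (0.783)^k |x0 − x̂| ≤ (0.783)^k (0.3). This bound falls below 10⁻⁴ once k ≥ ln(10⁻⁴/0.3)/ln(0.783) ≈ 33, so at most 33 iterations are needed to guarantee 10⁻⁴ accuracy from any such starting value.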
When f′(x̂) = 0, equation 2.3.1 shows that fixed point iteration does not converge linearly. For any sequence ⟨pn⟩ converging to p, if lim_{n→∞} |pn+1 − p|/|pn − p| = 0 we say the sequence is superlinearly convergent or that convergence is faster than linear.
Consider the functions f(x) = (1/8)x³ − x² + 2x + 1 and f1(x) = −x³ + 5x² − 3x − 6 from section 2.2. Recall 2 is a fixed point of f and 3 is a fixed point of f1, and observe that f′(2) = (3/8)·2² − 2·2 + 2 = −1/2 and f1′(3) = −3·3² + 10·3 − 3 = 0. Consequently, we should expect fixed point iteration of f1 to converge to 3 faster than that of f converges to 2. With s0, s1, s2, . . . = 1.75, f(1.75), f(f(1.75)), . . . and t0, t1, t2, . . . = 2.75, f1(2.75), f1(f1(2.75)), . . ., table 2.1 shows the

Table 2.1: Comparing order of convergence for fixed point iteration when the derivative at the fixed point is not
zero (sn ) to that when the derivative at the fixed point is zero (tn ).
n |2 − sn | |3 − tn |
0 2.5(10)−1 2.5(10)−1
1 1.074(10)−1 2.343(10)−1
2 5.644(10)−2 2.068(10)−1
3 2.740(10)−2 1.623(10)−1
4 1.388(10)−2 1.010(10)−1
5 6.894(10)−3 3.984(10)−2
6 3.459(10)−3 6.286(10)−3
7 1.726(10)−3 1.578(10)−4
8 8.640(10)−4 9.966(10)−8
9 4.318(10)−4 3.973(10)−14
10 2.159(10)−4 6.317(10)−27

relative speeds of convergence. ⟨sn⟩ is converging linearly as expected, and ⟨tn⟩ seems to be converging quadratically. The last four exponents in the |3 − tn| column are −4, −8, −14, −27, indicating that the number of significant digits of accuracy is approximately doubling with each iteration. In other words, the error of one term is roughly the square of the previous error (meaning α = 2 in the definition of order of convergence).

Table 2.2: Accelerating the convergence of a linearly converging sequence.



n    cn    an    |cn − c|    |an − c|    (an − c)/(cn − c)    |an+1 − c|/|an − c|²
0 1 .728010 2.609(10)−1 1.107(10)−2 .0934 .0110
1 .5403 .733665 1.987(10)−1 5.419(10)−3 .0639 44.19
2 .8575 .736906 1.184(10)−1 2.178(10)−3 .0400 74.17
3 .6542 .738050 8.479(10)−2 1.034(10)−3 .0274 217.9
4 .7934 .738636 5.439(10)−2 4.490(10)−4 .0180 419.4
5 .7103 .738876 3.771(10)−2 2.085(10)−4 .0122 1034
6 .7639 .738992 2.487(10)−2 9.289(10)−5 .0081
7 .7221
8 .7504

Taylor’s theorem will provide the proof we need that this convergence really is quadratic. Suppose f has a third derivative in a neighborhood of x̂. Define en = x̂ − xn. Then according to Taylor’s theorem, x̂ = f(x̂) = f(xn + en) = f(xn) + en f′(xn) + (1/2)en² f″(xn) + O(en³). But f(xn) = xn+1 so we get

    x̂ − xn+1 = en+1 = en f′(xn) + (1/2)en² f″(xn) + O(en³).        (2.3.2)

Also from Taylor’s theorem, f′(x̂) = f′(xn + en) = f′(xn) + en f″(xn) + O(en²). But f′(x̂) = 0 so

    f′(xn) = −en f″(xn) − O(en²).        (2.3.3)

Substituting 2.3.3 into 2.3.2,

    en+1 = en(−en f″(xn) − O(en²)) + (1/2)en² f″(xn) + O(en³)
         = −(1/2)en² f″(xn) + O(en³).

Hence, (x̂ − xn+1)/(x̂ − xn)² = en+1/en² = −(1/2)f″(xn) + O(en) and

    lim_{n→∞} |x̂ − xn+1| / |x̂ − xn|² = lim_{n→∞} |(1/2)f″(xn) + O(en)| = (1/2)|f″(x̂)|,

showing that convergence is at least quadratic. If f″(x̂) happens to be 0, then the convergence is superquadratic.
To summarize, on the off-chance that, at a fixed point x̂, f′(x̂) = 0, fixed point iteration is successful and fast for initial values near x̂. But when f′(x̂) ≠ 0, fixed point iteration may fail to converge to x̂, and when it does converge, the convergence is slow. There is a quick fix (quick to implement, not quick to explain) for some of this deficiency when f′(x̂) ≠ 0, however. We will first concentrate on the speed of convergence.
Let the sequence ⟨cn⟩ be defined by

    c0 = 1
    ck = cos(ck−1),  k > 0.

You should be able to verify that the first few terms of this sequence are (approximately)

    1, .5403, .8575, .6542, .7934, . . .

This is exactly the sequence you created in the calculator experiment on page 46 of section 2.2. Define a new sequence ⟨an⟩ by

    an = cn − (cn+1 − cn)² / (cn+2 − 2cn+1 + cn).
Table 2.2 shows the first few terms of each sequence along with some error analysis. As promised, the sequence ⟨an⟩ is converging more quickly than ⟨cn⟩, evidenced by the fact that (an − c)/(cn − c) is tending to zero. The last column of the table indicates that the convergence of ⟨an⟩ to c is not quadratic, however.

More generally, suppose ⟨pn⟩ is any sequence that converges linearly to p. Then we have lim_{n→∞} |p − pn+1|/|p − pn| = λ ≠ 0, so we should expect |p − pn+2|/|p − pn+1| ≈ λ ≈ |p − pn+1|/|p − pn| for large enough n, from which we get |(p − pn+2)(p − pn)| ≈ |p − pn+1|². Assuming p − pn+2 and p − pn have the same sign for large n⁴, we can remove the absolute values to find

    (p − pn+2)(p − pn) ≈ (p − pn+1)²
    p² − (pn+2 + pn)p + pn+2 pn ≈ p² − 2pn+1 p + pn+1²
    (−pn+2 + 2pn+1 − pn)p ≈ −pn+2 pn + pn+1²
    p ≈ (pn+2 pn − pn+1²) / (pn+2 − 2pn+1 + pn).

Therefore, we may take any three consecutive terms of ⟨pn⟩ and predict p from this formula. For large enough n, this prediction will be a much better estimate of p than is pn. But just as we were able to claim |(p − pn+2)(p − pn)| ≈ |p − pn+1|², it must also be the case that pn+2 pn ≈ pn+1², so the numerator of our approximation is nearly zero. Of course, that means the denominator must be nearly zero as well, since the quotient is p, a value that may not be zero. To avoid some of the error inherent in this calculation, it is advisable to compute the algebraically equivalent approximation

    p ≈ pn − (pn+1 − pn)² / (pn+2 − 2pn+1 + pn)        (2.3.4)
instead. Let’s go back and revisit the sequence ⟨sn⟩ and apply this approximation.
Define an = sn − (sn+1 − sn)²/(sn+2 − 2sn+1 + sn) and consider table 2.3 comparing the two sequences ⟨sn⟩ and ⟨an⟩. ⟨an⟩

Table 2.3: Comparing fixed point iteration when the derivative at the fixed point is not zero, sn , to the Aitken’s
delta-squared sequence, an .
n sn |2 − sn | an |2 − an |
0 1.75 2.5(10)−1 1.99506842493985 4.931(10)−3
1 2.107421875 1.074(10)−1 1.999022858310434 9.771(10)−4
2 1.943559146486223 5.644(10)−2 1.999737171760319 2.628(10)−4
3 2.027401559734717 2.740(10)−2 1.999937151202653 6.284(10)−5
4 1.986114080555812 1.388(10)−2 1.999983969455146 1.603(10)−5
5 2.006894420349172 6.894(10)−3
6 1.996540947531514 3.459(10)−3

converges significantly faster than the linearly convergent sequence from which it was derived, just as before! The fact that |2 − an| ≈ |2 − sn+2|² is evidence of this claim, but the convergence of ⟨an⟩ is still linear. Make sure you can calculate the an in this table yourself before reading on.
On a practical note, there is no sense in calculating all the terms a0, a1, . . . , an−2 as done in the table. The terms of ⟨an⟩ are dependent only on those of ⟨sn⟩ so an−2 can be calculated just as well without having calculated a0, a1, . . . , an−3. The table shows all of them only for illustrative purposes and so you can get some practice with formula 2.3.4. The important thing to notice is that an has approximately twice as many significant digits of accuracy as does sn+2. Consequently, a0 is a much better approximation than is s2.
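In Octave, formula 2.3.4 is a one-liner. The following sketch (the handle f and the starting value 1.75 are the ones used for ⟨sn⟩ above) reproduces the a0 entry of Table 2.3:

f  = @(x) x.^3/8 - x.^2 + 2*x + 1;
s0 = 1.75;
s1 = f(s0);
s2 = f(s1);
a0 = s0 - (s1 - s0)^2/(s2 - 2*s1 + s0)   % approximately 1.99506842493985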

Crumpet 11: Aitken’s delta-squared method is designed for any linearly convergent sequence, not
just sequences derived from fixed point iteration.

The derivation of 2.3.4, referred to as Aitken’s delta-squared formula, makes no reference to fixed point iteration.
In fact it makes no assumptions about the origin of the sequence. It makes no difference. It may be a sequence of
partial sums, a sequence of partial products, a sequence derived from any recurrence relation, a sequence derived
from number theory, or anything else. The only important characteristics are that the sequence converges and it
does so linearly.

⁴ This will happen in the common events that the x̂ − xn all have the same sign or the x̂ − xn have alternating signs, so this is not an unrealistic assumption.

Table 2.4: Steffensen’s method applied to f (x) = cos x.


n    an    g(an)    g(g(an))    |an − c|    |an+1 − c|/|an − c|²
0 1 .5403023058681398 .8575532158463934 2.609(10)−1 .162
1 .7280103614676171 .7464997560452203 .7340702837365296 1.107(10)−2 .148
2 .7390669669086738 .7390973701357808 .7390768902228948 1.816(10)−5 .148
3 .7390851331660755 .739085133248225 .739085133192888 4.908(10)−11 .148
4 .7390851332151607 3.063(10)−17

The sum 1/1 − 1/3 + 1/5 − 1/7 + 1/9 − · · · converges to π/4 linearly, so Aitken’s delta-squared method should be helpful. If we let pn = Σ_{k=0}^{n} (−1)^k/(2k + 1) be the nth partial sum, then p2 = 13/15, p3 = 76/105, p4 = 263/315, and p5 = 2578/3465. Aitken’s extrapolation gives

    a2 = 13/15 − (76/105 − 13/15)² / (263/315 − 2·76/105 + 13/15) = 1321/1680

and

    a3 = 76/105 − (263/315 − 76/105)² / (2578/3465 − 2·263/315 + 76/105) = 989/1260.

Since |π/4 − p4|²/|π/4 − a2| ≈ 2.6 and |π/4 − p5|²/|π/4 − a3| ≈ 3.5, extrapolation gives an error less than the square of the error in the original sequence.

Perhaps this fact gives you an idea. Once s2 is calculated, we can use equation 2.3.4, also known as Aitken’s
delta-squared method, to calculate a better approximation than we already have. And once we have this good
approximation, it seems a bit silly to cast it aside and continue computing s3 = f (s2 ), s4 = f (s3 ), and so on. What
if we use a0 in place of s3 in our iteration? In other words, we would have s1 = f (s0 ), s2 = f (s1 ), s3 = a0 , s4 = f (s3 ),
and so on. That should improve s3 , s4 , and s5 . And once we have s5 we again have three consecutive fixed point
iterations, so we can apply Aitken’s delta squared method again. Instead of calculating s6 = f (s5 ), we can get what
should be a better approximation by using equation 2.3.4 on s3 , s4 , and s5 . In other words, s6 = a3 , s7 = f (s6 ),
s8 = f (s7 ). Again, we have three consecutive fixed point iterations, so s9 = a6 , and so on. This gives the sequence

1.75, 2.107421875, 1.943559146486222,


1.995068424939850, 2.002459692429676, 1.998768643123618,
1.999997974970982, 2.000001012513483, 1.999999493743001,
1.999999999999658, 2.000000000000170, 1.999999999999914,
1.999999999999999, ...

which converges to 2 very quickly compared to ⟨sn⟩. If we consider the calculations of s1, s2, s4, s5, s7, s8, . . . to be


intermediary and focus on the subsequence s0 , s3 , s6 , s9 , . . . = s0 , a0 , a3 , a6 , . . . as a sequence itself we have

1.75, 1.995068424939850, 1.999997974970982, 1.999999999999658, 1.999999999999999, . . .

which converges very rapidly! The construction of this subsequence as a sequence in and of itself is called Steffensen’s method and the convergence is quadratic as long as ⟨sn⟩ is convergent. The following is a heuristic argument that
Steffensen’s method gives quadratic convergence. As seen, the error in s2 is not significantly different from the error
in s0 . But a0 has an error approximately equal to the square of the error in s2 , so the error in a0 is approximately
the square of the error in s0 . Similarly, the error in s5 is not significantly different from that in a0 = s3 . But the
error in a1 is approximately the square of the error in s5 , so the error in a1 is approximately the square of the error
in a0 . Similarly, the error in an+1 is approximately the square of the error in an .
Applying Steffensen’s method to the function f(x) = cos x with x0 = 1, we can accelerate the convergence of the sequence ⟨cn⟩ dramatically. Table 2.4 shows the first few terms of ⟨an⟩ with some error analysis. The last column of the table indicates that

    lim_{n→∞} |an+1 − c| / |an − c|² ≈ .148

and, consequently, that the sequence ⟨an⟩ converges quadratically.


Finally, we have two ways to get quick convergence from fixed point iteration. One, we simply iterate when the
function has derivative zero at the fixed point. Two, we use Steffensen’s method.

Figure 2.3.1: Convergence diagrams for 5 functions with the same fixed points—Steffensen’s method.

f1 :

f2 :

f3 :

f4 :

f5 :
black: does not converge; green: converges to 3; red: converges to 1 + √3; blue: converges to 1 − √3

Convergence Diagrams
Speeding up fixed point iteration only takes care of one deficiency of the method. There is still the problem of diver-
gence from fixed points where the derivative of the function has magnitude equal to or greater than 1. Steffensen’s
method helps. Compare Figure 2.3.1 with Figure 2.2.6. The convergence diagrams for Steffensen’s method show
convergence over larger intervals of initial values. Moreover, where f1 and f2 are concerned, Steffensen’s method
finds all three fixed points, just as fixed point iteration on f6 did.

Steffensen’s Method (pseudo-code)


Since Steffensen’s method is particularly prone to floating-point error, we do a preliminary check for convergence
before the Aitken’s delta-squared step. This helps prevent large errors or division by zero in Step 4.

Assumptions: Fixed point iteration converges to a fixed point of f with initial value x0 .
Input: Initial value x0 ; function f ; desired accuracy tol; maximum number of iterations N .
Step 1: For j = 1 . . . N do Steps 2-6:
Step 2: Set x1 = f (x0 ); x2 = f (x1 )
Step 3: If |x2 − x1 | ≤ tol then return x2
Step 4: Set x = x0 − (x1 − x0)² / (x2 − 2x1 + x0);
2 −2x1 +x0

Step 5: If |x − x0 | ≤ tol then return x;


Step 6: Set x0 = x;
Step 7: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
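A minimal Octave sketch of this pseudo-code follows; the file name steffensen.m and the argument order are our own choices.

function x = steffensen(f, x0, tol, N)
  for j = 1:N
    x1 = f(x0);                            % Step 2
    x2 = f(x1);
    if (abs(x2 - x1) <= tol)               % Step 3: preliminary check
      x = x2;
      return
    end%if
    x = x0 - (x1 - x0)^2/(x2 - 2*x1 + x0); % Step 4: Aitken's delta-squared
    if (abs(x - x0) <= tol)                % Step 5
      return
    end%if
    x0 = x;                                % Step 6
  end%for
  disp("Method failed. Maximum iterations exceeded.")
end%function

For example, steffensen(@cos, 1, 1e-10, 20) settles on 0.7390851332 . . . within a handful of iterations, consistent with Table 2.4.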

Key Concepts
Aitken’s delta-squared method: If ⟨pn⟩ converges to p linearly, the sequence ⟨an⟩ defined by an = pn − (pn+1 − pn)²/(pn+2 − 2pn+1 + pn) converges to p superlinearly.

Fixed Point Error Bound: Let f be a differentiable function with fixed point x̂ and let [a, b] be an interval containing x̂. If |f′(x)| ≤ M < 1 for all x ∈ [a, b] and f([a, b]) ⊆ [a, b], then for any initial value x0 ∈ [a, b], fixed point iteration, with xk+1 = f(xk) for all k ≥ 0, gives an approximation of x̂ with absolute error no more than M^k |x0 − x̂|.

Fixed Point Iteration Order of Convergence: Suppose f is a function with fixed point x̂ and f′(x̂) exists. Let x0, x1, x2, . . . be a sequence derived from fixed point iteration (xk+1 = f(xk) for all k ≥ 1) such that lim_{k→∞} xk = x̂ and xk ≠ x̂ for all k = 0, 1, 2, . . .. Then the sequence ⟨xn⟩ converges linearly to x̂ if f′(x̂) ≠ 0 and at least quadratically if f′(x̂) = 0.

Steffensen’s method: A modification of fixed point iteration where every third term is calculated using Aitken’s delta-squared method.

Superlinear convergence: If the sequence p0, p1, p2, . . . converges to p and lim_{k→∞} |pk+1 − p| / |pk − p| = 0, then the sequence is said to converge superlinearly.

Superquadratic convergence: If the sequence p0, p1, p2, . . . converges to p and lim_{k→∞} |pk+1 − p| / |pk − p|² = 0, then the sequence is said to converge superquadratically.

Octave
In section 1.3, we learned about for loops. With a for loop, you have to know how many times you want the loop
to run or at least you need a maximum. You can quit a for loop before it is done by exiting (returning) from
the function. There are times, however, when you don’t know how many times you need a loop to run and you
don’t even have a convenient maximum at hand. In this case, a while loop is more appropriate. A while loop will
continue to loop as long as a certain condition is met, and you set the condition. The syntax for a while loop is
while (condition)
do something.
end%while
but must be used with caution. for loops always have an end, but while loops do not if programmed carelessly. If
the condition of a while loop is always met, the loop runs indefinitely! Here is a simple example of a while loop
that never ends. Do not run it!
i=0;
while (i<12)
disp("Help! I’m stuck in a never-ending loop!!")
end%while
The problem is that i is set to a value less than 12 and never changes, so it always remains less than 12. Thus the condition of this while loop is always met. This loop can easily be modified to terminate. If we increment i inside the loop, it will end. This modification of the never-ending loop does end and displays a message 12 times:
i=0;
while (i<12)
disp("That’s better. I can handle a dozen iterations.")
i=i+1;
end%while
Incidentally, any for loop can be replaced by a while loop like this one.
We are human. Inevitably, we will program a while loop that never ends. What to do once it starts running?
Of course, you can power down the machine, but that is a little like bringing your coffee mug to the kitchen using a
bulldozer. There is an easier way. You can simply stop the application in which you are running Octave. If you are
using a command line (terminal) window or the Octave GUI, you can simply close it. But, if you remember, you
can also press Ctrl-c. That is, tap the c key while holding down the Ctrl key. This will interrupt the never-ending
loop.
For a more practical example, the bisection method can easily be re-programmed using a while loop. First, the
pseudo-code:

Assumptions: f is continuous on [a, b]. f (a) and f (b) have opposite signs.
Input: Interval [a, b]; function f ; desired accuracy tol.
Step 1: Set m = (a + b)/2; err = |b − a|/2; L = f(a);

Step 2: While err > tol do Steps 3-5:


Step 3: Set m = (a + b)/2; M = f(m); err = err/2;
Step 4: If M = 0 then return m;
Step 5: If LM < 0 then set b = m; else set a = m and L = M ;
Step 6: Return m.
Output: Approximation m within tol of exact root.

Now the Octave code. If you decide to use this code, it should be saved in a file named bisectionWhile.m.

function p = bisectionWhile(f,a,b,tol)
p = a + (b-a)/2;
err = abs(b-a);
FA = f(a);
while (err>tol)
p = a + (b-a)/2;
FP = f(p);
err=err/2;
if (FP == 0)
return
end%if
if (FA*FP > 0)
a = p;
FA = FP;
else
b = p;
end%if
end%while
end%function

Use this code with caution! It can run as a never-ending loop! If the function is called with a negative value for tol,
as in bisectionWhile(g,1,2,-10), it will run until forcibly stopped (using Ctrl-c or shutting down the Octave
app) as err will always be greater than −10.

Error checking
The most useful software includes error checking. In the case of the bisectionWhile function, we want to avoid
the endless loop in every instance we can imagine. Adding a couple lines at the beginning of the function provides
some security:

function p = bisectionWhile(f,a,b,tol)
if (tol<=0)
p = "ERROR:tol must be positive.";
return
end%if
p = a + (b-a)/2;
err = abs(b-a);
FA = f(a);
while (err>tol)
p = a + (b-a)/2;
FP = f(p);
err=err/2;
if (FP == 0)
return
end%if
if (FA*FP > 0)
a = p;
FA = FP;

else
b = p;
end%if
end%while
end%function

In general, having your program check for input errors like this is called error checking or validation. Most
of the time, we will write code assuming the input is valid and will not do any error checking. This makes the
programming simpler, but also allows for problems like never-ending loops! bisectionWhile.m may be downloaded
at the companion website.

Exercises (c) 12, 12.333, 12.667, 13, 13.333, 13.667, 14


1. Supply the proof that xk ∈ [a, b] for all k ≥ 0 in propo- (d) 1, 9, 25, 49, 81, 121, 169, 225, 289, 361, 441
sition 5. (e) 1, .5, .25, .125, .0625, .03125, .015625
2. Show that √
pn+2 pn − p2n+1 9. The function g(x) = 3 5 − 3x satisfies the hypotheses
pn+2 − 2pn+1 + pn of proposition 5 over the interval [1, 1.3]. Find a bound
on the number of iterations required to find the fixed
and
(pn+1 − pn )2 point to within 10−5 accuracy starting with initial value
pn − x0 of your choice.
pn+2 − 2pn+1 + pn √
are algebraically equivalent. 10. Fixed point iteration on the function g(x) = 3 x2 + x
will converge to approximately 1.618033988749895 for
3. Write an Octave function that implements Steffensen’s
any x0 in [0.5, 3.5]. [A]
method.
4. Write an Octave program (.m file) that uses a while (a) Find a bound on the number of iterations it will
loop and the disp() command to output the first 10 take to achieve 10−4 accuracy with x0 = 2.5.
powers of 5 starting with 50 . (b) How many iterations does it actually take to
achieve 10−4 accuracy with x0 = 2.5?
5. Write an Octave program (.m file) that uses a while
2
loop, an array, and the disp() command to find the 11. Let f (x) = 3x −1
6x+4
. In exercise 10 of section 2.2, you
n
22 − 2 were asked to show that f has a unique fixed point on
values of f (n) = 2n for n = 0, 1, 2, 4, 6, 10. [S]
2 +3 [−4, −0.9]. [S]

6. Write an Octave program (.m file) that uses (a) Find a bound on the number of iterations required
a while loop, an array, and the disp() command to approximate the fixed point to with 10−11 ac-
2n curacy using fixed point iteration with any initial
to find the values of f (n) = √ for n =
n2 + 3n value in [−4, −0.9].
0, 2, 5, 10, 100, 1000, 20000.
(b) Use fixed point iteration with x0 = −4 to find an
7. The following Octave code is intended to calculate approximation to the fixed point that is accurate
the sum to within 10−11 . The fixed point is x = −1.
30
X 1 (c) Compare the bound to the actual number of iter-
k2 ations needed.
k=1
but it does not. Find as many mistakes in the code as 12. Let g(x) = π + 0.5 sin(x/2). In exercise 11 of section
you can. Classify each mistake as either a compilation 2.2, you were asked to show that g has a unique fixed
error (an error that will prevent the program from run- point on [0, 2π].
ning at all) or a bug (an error that will not prevent the
program from running, but will cause improper calcu- (a) Find a bound on the number of iterations required
lation of the sum). to achieve 10−2 accuracy using fixed point itera-
tion with any initial value in [0, 2π].
sum=1;
(b) Use fixed-point iteration with x0 = 0 to find an
k=1;
approximation to the fixed point that is accurate
while k<30
to within 10−2 . The fixed point is x =???.
sum=sum+1.0/k*k;
end (c) Compare the bound to the actual number of iter-
diss(sum) ations needed.

8. Write a while loop that outputs the sequence of 13. Calculate


√ two iterations of Steffensen’s method for
numbers. g(x) = 3 x2 + x with x0 = 2.5. [A]
14. Use Steffensen’s method to find the root of g(x) =
(a) 7, 8, 9, 10, 11, 12, 13, 14, 15 x4 − 2x3 − 4x2 + 4x + 4 in [2, 3] accurate to five siginif-
(b) 20, 19, 18, 17, 16, 15, 14, 13 icant digits. [A]

15. Compute a0 , a1 , and a2 of Aitken’s delta-squared hp1 , p2 , p3 , . . .i ≈ h.84147, .95885, .98158, . . .i converges
method for the sequence in problem 2 on page 27. to 1, albeit very slowly. Generate the first three terms
Since the sequence has an undefined term at n = 1, of the sequence han i using Aitken’s delta-squared cal-
start the sequence h n−1 n+1
i with n = 2. In other words, culation. Does it seem to be approaching 1 faster than
consider the sequence in problem 2 on page 27 to be does hpn i?
3, 2, 53 , 32 , 75 . . . so p0 = 3, p1 = 2, p2 = 35 , and so on. 20. Fixed point iteration applied to f (x) = sin(x) with
16. The following sequences are linearly convergent. Gen- x0 = 1 takes 29, 992 iterations to reach a number be-
erate the first five terms of the sequence han i using low 0.01 on its way to the fixed point 0. Incidentally,
Aitken’s delta-squared calculation. x29992 ≈ 0.099999. How many iterations does it take
Steffensen’s method with x0 = 1 to reach a number
(a) p0 = 0.5, pn = (2 − epn−1 + p2n−1 )/3 for n ≥ 1 [S]
below 0.01? Comment. [S]
p
(b) p0 = 0.75, pn = epn−1 /3 for n ≥ 1 21. Let f (x) = 1 + (sin x)2 and p0 = 1. Find a1 and a2 of
17. Use Aitken’s delta squared method to find p = lim pn Steffensen’s method with a calculator. [A]
n→∞
accurate to 3 decimal places. 22. Compute the first three √iterations of Steffensen’s
method applied to g(x) = ( 2)x using p0 = 3.
pn = {−2, −1.85271, −1.74274, −1.66045, 23. Steffensen’s method is applied to a function f (x) using
− 1.59884, −1.55266, −1.51804, p0 = 1. If f (f (p0 )) = 3 and a1 = 0.75, what is f (p0 )?
[A]
− 1.49208, −1.47261, . . .}
24. Find the fixed point of f (x) = x−0.002(ex cos(x)−100)
18. The sequence han i of question 15 converges faster than in [5, 6] using Steffensen’s method. [A]
does the sequence in problem 2 on page 27. If you
were to apply Aitken’s delta-squared method to the se- 25. In question 24 you found a fixed point x̂. For what
quence han i, would you expect the convergence to be function g(x) is x̂ a root?
even faster? Explain. [A]
26. Write a while loop that outputs the numbers
19. Recall from calculus that limn→∞ n sin n1 = 1.

1, .5, .25, .125, .0625, .03125, .015625, . . . until it reaches
Therefore, if we let pn = n sin n1 , then the sequence a number below 10−4 .


2.4 Newton’s Method


In section 2.3 we addressed some of the deficiency in fixed point iteration, but delayed deep discussion of the
mysterious function f6 of the root finding investigation on page 52. The time has come to discuss f6 in some detail.
We start with some number crunching. Recall that f6(x) = (2x³ − 5x² − 6)/(3x² − 10x + 4) and let x0 = 4. Proceeding with fixed point

iteration,
x1 = f6 (x0 ) = 3.5
x2 = f6 (x1 ) ≈ 3.217391304347826
x3 = f6 (x2 ) ≈ 3.072749058541597
x4 = f6 (x3 ) ≈ 3.013730618589344
x5 = f6 (x4 ) ≈ 3.000683798275568
x6 = f6 (x5 ) ≈ 3.000001860777997
x7 = f6 (x6 ) ≈ 3.000000000013848.
You can see two things. The sequence x0, x1, x2, . . .
1. is converging to (the fixed point) 3; and
2. it looks like the convergence is quadratic since, starting with the step from x4 to x5, the number of significant digits is roughly doubling with each iteration.
In the analysis in section 2.3 on page 56, we found that fixed point iteration converges quadratically (or better) only when the derivative at the fixed point is zero. These observations should lead you to believe f6′(3) = 0. Let’s check. First, the derivative is f6′(x) = (6x⁴ − 40x³ + 74x² − 4x − 60)/(3x² − 10x + 4)² (you should verify this). Evaluating the numerator at the fixed point, x = 3, we get 6(3)⁴ − 40(3)³ + 74(3)² − 4(3) − 60 = 486 − 1080 + 666 − 12 − 60 = 0. So we have convergence to a fixed point where the derivative of the function is zero, and we indeed have that convergence is quadratic.
Starting with x0 = 2, fixed point iteration on f6 converges to 1 + √3, and starting with x0 = −1, fixed point iteration converges to 1 − √3. You should be able to verify this from the convergence diagram in Figure 2.2.6 or from calculating the first several iterations for each yourself. What you do not get from the convergence diagram is the speed of convergence. For that, you need to look at the iterates. You should do so. Does convergence look
is the speed of convergence. For that, you need to look at the iterates. You should do so. Does convergence look
quadratic in these cases too? Answer on page 72.
From the convergence diagram, we see that fixed point iteration will converge for virtually any initial value,
and all three fixed points can be estimated by fixed point iteration. Moreover, from our calculations, it looks like
convergence is quadratic for all three. It’s hard to ask for more from a function. Fast convergence to any fixed
point! So whence did f6 come?
Suppose g(x) is differentiable and g(x̂) = 0 so g has a root at x̂. Consider f(x) = x − g(x)/g′(x). x̂ is a fixed point of f as long as g′(x̂) ≠ 0:

    f(x̂) = x̂ − g(x̂)/g′(x̂) = x̂ − 0/g′(x̂) = x̂.

Moreover, as long as g has a second derivative near x̂,

    f′(x̂) = 1 − [g′(x̂) · g′(x̂) − g(x̂)g″(x̂)] / [g′(x̂) · g′(x̂)]
           = 1 − 1 + [0 · g″(x̂)] / [g′(x̂) · g′(x̂)]
           = 0.
From these calculations, we conclude if g(x) is twice differentiable, g(x̂) = 0 and g 0 (x̂) 6= 0, then fixed point iteration
of f (x) with initial value in a neighborhood of x̂ will converge quadratically to x̂. What a great way to turn a root
finding problem into a fixed point problem!
Now is a good time to recall that f6 was just one of 6 candidate functions designed to find the roots of
g(x) = −x^3 + 5x^2 − 4x − 6 by fixed point iteration. Indeed, g'(x) = −3x^2 + 10x − 4 and

    x − g(x)/g'(x) = x − (−x^3 + 5x^2 − 4x − 6)/(−3x^2 + 10x − 4)
                   = (2x^3 − 5x^2 − 6)/(3x^2 − 10x + 4)
                   = f6(x).

Using fixed point iteration on f6(x) = x − g(x)/g'(x) to find roots of g(x), as done here, is called Newton's method.

A Geometric Derivation of Newton’s Method


The following figure shows how to compute the first two iterations of Newton’s method on g(x) = −x3 + 5x2 −4x−6
with initial value x0 = −2.5 geometrically.

To compute x1 , the tangent line to g at (x0 , g(x0 )) is drawn and its intersection with the x-axis is x1 . Similarly,
the tangent line to g at (x1 , g(x1 )) is drawn and its intersection with the x-axis is x2 . And so on. For example,
(x0, g(x0)) = (−2.5, 50.875) and g'(x0) = g'(−2.5) = −47.75. Hence, the "rise" (0 − 50.875) over the "run" (x1 + 2.5) between (−2.5, 50.875) and (x1, 0) must equal −47.75. We thus have −50.875/(x1 + 2.5) = −47.75, so

    x1 = −50.875/(−47.75) − 2.5 ≈ −1.43455497382199.
In symbols, the "rise" (−g(x0)) over the "run" (x1 − x0) must equal g'(x0). In other words,

    −g(x0)/(x1 − x0) = g'(x0)  ⇒  −g(x0)/g'(x0) = x1 − x0  ⇒  x1 = x0 − g(x0)/g'(x0).

Similar calculation shows x2 = x1 − g(x1)/g'(x1), and more generally x_{n+1} = x_n − g(x_n)/g'(x_n). This recurrence relation describes Newton's method—iterating the function f(x) = x − g(x)/g'(x).

Newton’s Method (pseudo-code)


Unlike Steffensen’s method, the denominator appearing in Newton’s method is not expected to approach zero as
the iterates converge, so generally there is much less trouble with stability of the calculation and no intermediate
checks are done before computing one iteration from the previous.

Assumptions: g is twice differentiable. g has a root at x̂. x0 is in a neighborhood (x̂ − δ, x̂ + δ) where the magnitude of f'(x) = 1 − [g'(x)·g'(x) − g(x)g''(x)]/[g'(x)·g'(x)] is less than one.
Input: Initial value x0; function g and its derivative g'; desired accuracy tol; maximum number of iterations N.
Step 1: For j = 1 . . . N do Steps 2-4:
Step 2: Set x = x0 − g(x0)/g'(x0);
Step 3: If |x − x0 | ≤ tol then return x;
Step 4: Set x0 = x;
Step 5: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
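For reference, here is one minimal Octave sketch of this pseudo-code (the function name newton and the handle-based inputs g and gprime are my own choices, not the text's):

    % newton.m -- a minimal sketch of the pseudo-code above.
    % g and gprime are function handles, x0 the initial value.
    function x = newton(g, gprime, x0, tol, N)
      for j = 1:N
        x = x0 - g(x0)/gprime(x0);   % Step 2: one Newton update
        if abs(x - x0) <= tol        % Step 3: stop when successive
          return                     %         iterates agree to tol
        end%if
        x0 = x;                      % Step 4
      end%for
      error("Method failed. Maximum iterations exceeded.")
    end%function

Called as, say, newton(@(x) -x^3+5*x^2-4*x-6, @(x) -3*x^2+10*x-4, 4, 1e-10, 25), it should reproduce the convergence to 3 observed above.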

Table 2.5: The secant method applied to g(x) = −x3 + 5x2 − 4x − 6 with x0 = 5 and x1 = x0 + g(x0 ) = −21.
n xn |3 − xn |
0 5 2(10)0
1 −21 2.4(10)1
2 4.9415730337078 1.941(10)0
3 4.8869924815972 1.886(10)0
4 4.0502898397912 1.050(10)0
5 3.7088949488497 7.088(10)−1
6 3.412824115541 4.128(10)−1
7 3.232292913133 2.322(10)−1
8 3.1141957095727 1.141(10)−1
9 3.0465011115969 4.650(10)−2
10 3.0132833760752 1.328(10)−2
11 3.0020189248976 2.018(10)−3
12 3.0001014520965 1.014(10)−4
13 3.0000008128334 8.128(10)−7
14 3.0000000003297 3.297(10)−10

Secant Method
The greatest weakness of Newton's method is the requirement that g' be known and used in the calculation. The derivative is not always accessible or manageable or even known, though. In such a case, it is better to use Steffensen's method or the secant method. The secant method is derived by replacing the g' of Newton's method with a difference quotient. In order for this to make any sense, though, we will need to restate Newton's method in terms of x_n. In Newton's method we are iterating f(x) = x − g(x)/g'(x), so x_{n+1} = x_n − g(x_n)/g'(x_n).
Now suppose you have a function g and some iterate x_{n−1}. That is enough to locate one point on the graph of g, namely (x_{n−1}, g(x_{n−1})). But we need another point in order to form a difference quotient (the slope of the line through two points). So suppose we have a second value, x_n, near x_{n−1}. Then (g(x_n) − g(x_{n−1}))/(x_n − x_{n−1}) ≈ g'(x_n), so we can substitute (g(x_n) − g(x_{n−1}))/(x_n − x_{n−1}) for g'(x_n) in Newton's method. This yields the secant method, x_{n+1} = x_n − g(x_n)/[(g(x_n) − g(x_{n−1}))/(x_n − x_{n−1})], which simplifies to

    x_{n+1} = x_n − g(x_n) · (x_n − x_{n−1})/(g(x_n) − g(x_{n−1})).        (2.4.1)

Notice this is not quite a fixed point iteration scheme. Each iteration depends on the previous two values, not one.
The analysis we’ve done so far does not apply, but there’s hope that convergence will be fast since this method is a
reasonable approximation of Newton’s method near a root, assuming g is differentiable near there. Table 2.5 provides
evidence that the secant method indeed converges quickly. In the particular case of g(x) = −x3 + 5x2 − 4x − 6 with
x0 = 5 and x1 = x0 + g(x0 ) = −21, it takes a while to settle in, but after the first 8 iterations or so, convergence is
very fast. Not quite quadratic, but superlinear for sure.


Crumpet 12: The secant method converges with order (1 + √5)/2.

Suppose g is a function with root x̂, g'(x̂) ≠ 0, g''(x̂) ≠ 0, and g'''(x) exists in a neighborhood of x̂. Let x0, x1, x2, . . . be a sequence derived from the secant method (x_{n+1} = x_n − g(x_n)(x_n − x_{n−1})/(g(x_n) − g(x_{n−1})) for all n ≥ 1) such that lim_{k→∞} x_k = x̂. Define e_n = x_n − x̂ so x_n = x̂ + e_n. Making this substitution into 2.4.1 we have

    e_{n+1} = e_n − g(x̂ + e_n) · (e_n − e_{n−1})/(g(x̂ + e_n) − g(x̂ + e_{n−1})).        (2.4.2)

Taylor's theorem allows g(x̂ + e_k) = g(x̂) + e_k g'(x̂) + (1/2)e_k^2 g''(x̂) + O(e_k^3). Noting that g(x̂) = 0 and substituting

into 2.4.2,

    e_{n+1} = e_n − (e_n − e_{n−1}) · [e_n g'(x̂) + (1/2)e_n^2 g''(x̂) + O(e_n^3)] / [(e_n − e_{n−1})g'(x̂) + (1/2)(e_n^2 − e_{n−1}^2)g''(x̂) + O(e_{n−1}^3)]

            = e_n − [e_n + e_n^2 g''(x̂)/(2g'(x̂)) + O(e_n^3)] / [1 + (e_n + e_{n−1})g''(x̂)/(2g'(x̂)) + O(e_{n−1}^3)/(e_n − e_{n−1})]

            = [e_n(1 + (e_n + e_{n−1})g''(x̂)/(2g'(x̂)) + O(e_{n−1}^3)/(e_n − e_{n−1})) − (e_n + e_n^2 g''(x̂)/(2g'(x̂)) + O(e_n^3))] / [1 + (e_n + e_{n−1})g''(x̂)/(2g'(x̂)) + O(e_{n−1}^3)/(e_n − e_{n−1})]

            = [e_n e_{n−1} g''(x̂)/(2g'(x̂)) + e_n/(e_n − e_{n−1}) · O(e_{n−1}^3) + O(e_n^3)] / [1 + (e_n + e_{n−1})g''(x̂)/(2g'(x̂)) + O(e_{n−1}^3)/(e_n − e_{n−1})].        (2.4.3)

Using equality 2.4.3 to find a value α for which lim_{n→∞} |x̂ − x_{n+1}|/|x̂ − x_n|^α = λ ≠ 0, we have

    lim_{n→∞} |x̂ − x_{n+1}|/|x̂ − x_n|^α = lim_{n→∞} |e_{n+1}|/|e_n|^α

        = lim_{n→∞} |e_n^{1−α} e_{n−1} g''(x̂)/(2g'(x̂)) + e_n^{1−α}/(e_n − e_{n−1}) · O(e_{n−1}^3) + O(e_n^{3−α})| / |1 + (e_n + e_{n−1})g''(x̂)/(2g'(x̂)) + O(e_{n−1}^3)/(e_n − e_{n−1})|

        = λ ≠ 0.

But lim_{n→∞} e_n = lim_{n→∞} e_{n−1} = 0. Hence, lim_{n→∞} e_n^{1−α} e_{n−1} must not be 0 or divergent, for if it were, lim_{n→∞} |x̂ − x_{n+1}|/|x̂ − x_n|^α would be 0 or divergent, respectively. Consequently, there is a positive constant C such that lim_{n→∞} |e_n^{1−α} e_{n−1}| = lim_{n→∞} |e_{n+1}^{1−α} e_n| = C ⇒ lim_{n→∞} |e_{n+1} e_n^{1/(1−α)}| = C^{1/(1−α)}. Now we have

    lim_{n→∞} |e_{n+1}|/|e_n|^α = λ ≠ 0   and   lim_{n→∞} |e_{n+1}|/|e_n|^{1/(α−1)} = C^{1/(1−α)} ≠ 0.

Since the order of convergence of a sequence is unique (Exercise 20 of section 1.3) it must be that α = 1/(α − 1), or α^2 − α − 1 = 0. The quadratic formula supplies the desired result.
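As a quick numerical check (my own illustration, not part of the text), the order can be estimated from the tail of Table 2.5: for superlinear convergence, log|e_{n+1}|/log|e_n| approaches α.

    % estimate the order of convergence from the last few errors of Table 2.5
    e = [1.141e-1 4.650e-2 1.328e-2 2.018e-3 1.014e-4 8.128e-7 3.297e-10];
    alpha = log(e(2:end)) ./ log(e(1:end-1))   % ratios creep toward (1+sqrt(5))/2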

So far we have only applied Newton's method and the secant method to the cubic polynomial g(x) = −x^3 + 5x^2 − 4x − 6, a task not strictly necessary. The rational roots theorem, a basic tool from pre-calculus, would give you the roots exactly. The method would have you check ±1, ±2, ±3, and ±6 as possible roots of g. Assuming you did your checks by synthetic division, your work might look something like this:

    3   −1    5   −4   −6
              −3    6    6
        −1    2    2    0

meaning g(x) = (x − 3)(−x^2 + 2x + 2). The other two roots would then come from the quadratic formula applied to −x^2 + 2x + 2 and would be (−2 ± √(4 + 8))/(−2) = 1 ± √3.

Crumpet 13: Solving the cubic

The solutions of the quadratic equation ax^2 + bx + c = 0 are given by the well-known quadratic formula. Less well-known, and significantly more involved, is any formula for the solutions of the cubic equation ax^3 + bx^2 + cx + d = 0. One method of solution follows. First, we let

    p = (3ac − b^2)/(3a^2)   and
    q = (2b^3 − 9abc + 27a^2 d)/(27a^3).

Then we set

    w = ∛( −q/2 − √(q^2/4 + p^3/27) ).

Third, we set w1, w2, and w3 to the three possible (complex) values of w. Finally, the three solutions of ax^3 + bx^2 + cx + d = 0 are

    x_i = w_i − p/(3w_i) − b/(3a),   i = 1, 2, 3.
This is essentially the method of Cardano, published in the 16th century!
For example, to solve the equation −x^3 + 5x^2 − 4x − 6 = 0, we start with

    p = (3(−1)(−4) − 5^2)/(3(−1)^2) = −13/3   and
    q = (2·5^3 − 9(−1)(5)(−4) + 27(−1)^2(−6))/(27(−1)^3) = 92/27.

Then

    w^3 = −92/(2·27) − √(92^2/(4·27^2) − 13^3/27^2)
        = −46/27 − √(92^2 − 4·13^3)/54
        = −46/27 − √(−324)/54
        = −46/27 − i/3.

In polar form, w^3 = (13√13/27) e^{i(tan⁻¹(9/46) − π)}, so we may set w1 = (√13/3) e^{i(tan⁻¹(9/46) − π)/3}, one of the cube roots of w^3. Unfortunately, finding the angle (tan⁻¹(9/46) − π)/3 exactly amounts to solving a cubic equation! However, with a calculator in hand, one can get the approximation −0.982793723247329, which in the end will be good enough. So, the real part of w1 is approximately (√13/3) cos(−0.982793723247329) ≈ .6666666666666667 and the imaginary part is approximately (√13/3) sin(−0.982793723247329) ≈ −1. w1 is suspiciously close to 2/3 − i. And we can check, (2/3 − i)^3 = (2/3)^3 + 3(2/3)^2(−i) + 3(2/3)(−i)^2 + (−i)^3 = 8/27 − (4/3)i − 2 + i = −46/27 − (1/3)i. Therefore, w1 = 2/3 − i, and we let

    w2 = (2/3 − i)(−1/2 + (√3/2)i) = (3√3 − 2)/6 + ((3 + 2√3)/6)i   and
    w3 = (2/3 − i)(−1/2 − (√3/2)i) = −(3√3 + 2)/6 + ((3 − 2√3)/6)i.

Finally,

    x1 = w1 + 13/(9w1) + 5/3 = w1 + 13w̄1/(9|w1|^2) + 5/3 = w1 + w̄1 + 5/3 = 3
    x2 = w2 + 13/(9w2) + 5/3 = w2 + 13w̄2/(9|w2|^2) + 5/3 = w2 + w̄2 + 5/3 = √3 + 1
    x3 = w3 + 13/(9w3) + 5/3 = w3 + 13w̄3/(9|w3|^2) + 5/3 = w3 + w̄3 + 5/3 = −√3 + 1
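The recipe is easy to carry out numerically as well. The following is a minimal Octave sketch of it (the function name cardano is my own; it assumes a ≠ 0 and that the cube root computed below is nonzero):

    % cardano.m -- a minimal sketch of the procedure above; not production code.
    function x = cardano(a,b,c,d)
      p = (3*a*c - b^2)/(3*a^2);
      q = (2*b^3 - 9*a*b*c + 27*a^2*d)/(27*a^3);
      % principal cube root of -q/2 - sqrt(q^2/4 + p^3/27); Octave returns a
      % complex square root automatically when the radicand is negative
      w = (-q/2 - sqrt(q^2/4 + p^3/27))^(1/3);
      w = w * exp(2i*pi*[0 1 2]/3);   % the three complex cube roots
      x = w - p./(3*w) - b/(3*a);     % the three solutions
    end%function

Calling cardano(-1,5,-4,-6) should return values very close to 3, 1 + √3, and 1 − √3 (possibly with tiny imaginary parts left over from round-off).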

For an equation you most likely did not see in pre-calculus, or calculus for that matter, consider

    x − e^x cos(√(e^{2x} − x^2)) = 0.

You might try to solve this equation exactly, with a pencil and paper, but you would soon run into a dead end. This equation can not be solved explicitly. The best you can hope for is to approximate the solutions with a numerical method. To get some idea what we are in for, look at the graph of x − e^x cos(√(e^{2x} − x^2)) in Figure 2.4.1. The
function oscillates wildly, and only oscillates more wildly as x increases. The graph crosses the x-axis 29 times on
the interval from 0 to 4.5 so has 29 roots there! They are
.3181315052047641, 1.668024051576096, 2.062277729598284,
2.439940377216816, 2.653191974038697, . . .
and can be found by Newton’s method with initial values 0, 1.5, 2, 2.4, 2.6, . . .. Can you find the next root? Answer
on page 72.


Figure 2.4.1: The graph of x − e^x cos(√(e^{2x} − x^2)) crosses the x-axis infinitely many times.

Secant Method (pseudo-code)


A straightforward implementation of the secant method can easily be inefficient due to the number of times g appears in formula 2.4.1 on page 67. The pseudo-code below takes great care not to compute each value of g more than once. If it seems more complicated than necessary, this is likely the source of the complication.

Assumptions: g has a root at x̂. g is differentiable in a neighborhood of x̂. x0 and x1 are sufficiently close
to x̂.
Input: Initial values x0 and x1 ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: Set y0 = g(x0 ); y1 = g(x1 )
Step 2: For j = 1 . . . N do Steps 3-5:

Step 3: Set x = x1 − y1(x1 − x0)/(y1 − y0);

Step 4: If |x − x1 | ≤ tol then return x;


Step 5: Set x0 = x1 ; y0 = y1 ; x1 = x; y1 = g(x1 )
Step 6: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
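As with Newton's method, a minimal Octave sketch of this pseudo-code might look as follows (the name secant and the handle-based input g are my own choices):

    % secant.m -- a minimal sketch of the pseudo-code above.
    function x = secant(g, x0, x1, tol, N)
      y0 = g(x0);  y1 = g(x1);           % Step 1: evaluate g only once per point
      for j = 1:N
        x = x1 - y1*(x1 - x0)/(y1 - y0); % Step 3
        if abs(x - x1) <= tol            % Step 4
          return
        end%if
        x0 = x1;  y0 = y1;               % Step 5: shift the stored values
        x1 = x;   y1 = g(x1);
      end%for
      error("Method failed. Maximum iterations exceeded.")
    end%function

For instance, secant(@(x) -x^3+5*x^2-4*x-6, 5, -21, 1e-10, 50) should retrace the iterates of Table 2.5.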

Seeded Secant Method (pseudo-code)


The greatest drawback to the secant method is the necessity of two initial values. They should be near one another, but how near, and how do you determine them? These are tough questions, and the answers are complicated at best.
One reasonable approach is to let x1 = x0 + g(x0 ). Assuming x0 is near a root, g(x0 ) will be small, so x1 will be
near x0 . Taking this approach relieves the user from the burden of selecting a second initial value. There are times
when such automated selection is not desirable, so both methods have their place. This method only works well
when the initial approximation is good.

Assumptions: g has a root at x̂. g is differentiable in a neighborhood of x̂. x0 is sufficiently close to x̂.
Input: Initial value x0 ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: Set y0 = g(x0 ); x1 = x0 + y0 ; y1 = g(x1 )
Step 2: For j = 1 . . . N do Steps 3-5:
Step 3: Set x = x1 − y1(x1 − x0)/(y1 − y0);

Step 4: If |x − x1 | ≤ tol then return x;


Step 5: Set x0 = x1 ; y0 = y1 ; x1 = x; y1 = g(x1 )
Step 6: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.
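If you already have a secant function like the sketch above, seeding amounts to one extra line; a hypothetical wrapper might read:

    % seededsecant.m -- a minimal sketch; relies on the secant sketch above.
    function x = seededsecant(g, x0, tol, N)
      x = secant(g, x0, x0 + g(x0), tol, N);   % seed x1 = x0 + g(x0)
    end%function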

Key Concepts
Rational Roots Theorem: If the polynomial p(x) = a0 + a1 x + · · · + ak x^k has integer coefficients, then any rational roots of p are in the set { n/d : n is a factor of a0 and d is a factor of ak }.

Synthetic division: A method for calculating the quotient of a polynomial by a monomial. Example on page 68.

Newton's method: A root finding method that generally converges to a root of g(x) quadratically, but requires the use of the derivative. In this method, x0 is chosen and x_{n+1} = x_n − g(x_n)/g'(x_n) is computed for each n ≥ 0.

Secant method: A root finding method that generally converges to a root of g(x) with order approximately 1.618, but does not require the use of the derivative. In this method, x0 and x1 are chosen and x_{n+1} = x_n − g(x_n)(x_n − x_{n−1})/(g(x_n) − g(x_{n−1})) is computed for each n > 0.

Seeded secant method: A modification of the secant method where x0 is chosen and x1 = x0 + g(x0 ).

Exercises

1. Write Octave code that implements Newton's method as a function.

2. Write Octave code that implements the secant method as a function.

3. Write Octave code that implements the seeded secant method as a function.

4. Use your secant method function from question 2 with a tolerance of 10^−5 to find a solution of
   (a) e^x + 2^−x + 2 cos x − 6 = 0 using 1 ≤ x0 ≤ 2.
   (b) ln(x − 1) + cos(x − 1) = 0 using 1.3 ≤ x0 ≤ 2.
   (c) 2x cos x − (x − 2)^2 = 0 using 2 ≤ x0 ≤ 3. [A]
   (d) 2x cos x − (x − 2)^2 = 0 using 3 ≤ x0 ≤ 4. [A]
   (e) (x − 2)^2 − ln x = 0 using 1 ≤ x0 ≤ 2.
   (f) (x − 2)^2 − ln x = 0 using e ≤ x0 ≤ 4.

5. Repeat exercise 4 using your Newton's method code from question 1. [A]

6. Repeat exercise 4 using your seeded secant method code from question 3. [A]

7. Repeat exercise 4 using a tolerance of 10^−10. Taking this new value as the exact value, did using a tolerance of 10^−5 give a result accurate to within 10^−5 of the exact value? [A]

8. Let g(x) = (100/x^2) sin(10/x) and x0 = 1.25. Find x1 and x2 of Newton's method. [S]

9. Let g(x) = 2 ln(1 + x^2) − x. Find x14 using Newton's method with
   (a) x0 = 5 [A]
   (b) x0 = 1.2

10. Let g(x) = 2 ln(1 + x^2) − x. Find x2 and x3 using the secant method with [S]
    (a) x0 = 5 and x1 = 6
    (b) x0 = 1 and x1 = 2

11. Compare the secant method and Newton's method based on questions 4 and 5. Which finds roots in fewer iterations? Which one fails least often? Which is better?

12. Compute the first three iterations of Newton's method applied to g(x) = x − (√2)^x with x0 = 3.

13. Find a value of x0 for which Newton's method will fail to converge to a root of g(x) = 2 + x − e^x.

14. Explain why Newton's method fails to converge for the function g(x) = x^2 + x + 1 with x0 = 1.

15. Let g(x) = (2 ln(1 + x^2) − x)/(1 + x^2). Using Newton's method to find a root of g(x) with x0 = 5 yields x14 = 8.6624821192 and with x̃0 = 1.2 yields x̃14 = 0. Compare the values of x14 and x̃14 with the fourteenth iterations from question 9 and explain any similarities or differences. [A]

16. Let g(x) = e^{3x} − 27x^6 + 27x^4 e^x − 9x^2 e^{2x} and let p0 = 4. Find p10 using Newton's method. HINT: g'(x) = 3e^{3x} − 18(x + x^2)e^{2x} + 27(x^4 + 4x^3)e^x − 162x^5. [A]

17. Newton's method does not introduce spurious solutions. Suppose f(x) = x − g(x)/g'(x) and g'(x̂) ≠ 0. Prove that x̂ is a root of g if and only if x̂ is a fixed point of f. Hint: one direction is proven in the text of this section.

18. The polynomial g(x) = x^4 + 2x^3 − x − 3 has a root x̂ ≈ 1.097740792. Find the largest neighborhood (a, b) of x̂ such that Newton's method converges to x̂ for any initial value x0 ∈ (a, b). [S]

19. Use Newton's method to find a negative solution of 0 = 12x^4 − 13x^3 + 7x^2 + x − 130 accurate to the nearest 10^−4. What initial value did you use? How many iterations did it take?

20. Consider the function g(x) = e^{6x} + 3(ln 2)^2 e^{2x} − (ln 8)e^{4x} − (ln 2)^3. Compute enough iterations of Newton's method with x0 = 0 to approximate a zero of g with tolerance 0.0002. Construct the Aitken's delta squared sequence ⟨an⟩. Is the order of convergence improved? [A]

21. As with Newton's method, the secant method can easily be described geometrically: Draw the line through the two points (x0, f(x0)) and (x1, f(x1)). Find the intersection of this line with the x-axis. The x-coordinate of the intersection is x2. Find x3 by intersecting the line through (x1, f(x1)) and (x2, f(x2)) with the x-axis. And so on. Graph the polynomial p(x) = x^3 − 3x + 3, and demonstrate the first iteration of the secant method graphically for x0 = −1 and x1 = −2. [S]

22. Suppose you are using the secant method with x0 = 1 and x1 = 1.1 to find a root of f(x).
    (a) Find x2 given that f(1) = 0.3 and f(1.1) = 0.23.
    (b) Create a sketch (graph) that illustrates the calculation. HINT: x2 will be located where the line through (x0, f(x0)) and (x1, f(x1)) crosses the x-axis.

23. Use the graph of g to answer the following questions. g has roots at −2π, −π, π, and 2π.
    (a) To which root will Newton's method converge if x0 = 2.5?
    (b) What will happen if x0 = 0?
    (c) Find a positive integer value of x0 for which Newton's method will converge to 2π.
    (d) Find a negative value of x0 for which Newton's method will converge to 2π.

24. Graph the polynomial p(x) = x^3 − 3x + 3, and demonstrate Newton's method graphically for x0 = −1.

25. Use your code from question 2 to find a root of the function in the interval of question 2 on page 43 to within 10^−8. Compare your answer to that from question 4 on page 43. [A]

26. The sum of two numbers is 20. If each number is added to its square root, the product of the two sums is 172.2. Determine the two numbers to within 10^−4 of their exact values. [S]

27. Find an example of a situation in which Newton's method will fail on the second iteration (i.e., x1 may be calculated but x2 may not). [S]

28. Let h(x) = 2.2x^3 − 6.6x^2 + 4.4x and let g(x) = h◦3(x). That is, g(x) = h(h(h(x))). Approximate a root of g'(x).

29. For what values of x0, approximately, will Newton's method converge to −2.5? [A]

30. For the function shown in question 29, find x2 and x3 for the secant method with x0 = −10 and x1 = 6.

31. Let f(x) = 10 − ∫_0^x e^t/(1 + t) dt. Approximate the positive root of f. [A]

32. Of the root finding methods we have surveyed so far (Bisection, Fixed Point, Newton's, Secant, and Steffensen's), which one do you feel is the best? Why?

Answers
Quadratic convergence?

    n    xn (x0 = 2)            xn (x0 = −1)
    0    2                      −1
    1    2.5                    −.7647058823529411
    2    2.666666666666667      −.7326286052763475
    3    2.722222222222227      −.7320509933083684
    4    2.731741086881274      −.7320508075688965
    5    2.732050478023325
    6    2.732050807568503
    ...  ...                    ...
         2.732050807568877      −.7320508075688772

The convergence looks quadratic since the number of significant digits of accuracy roughly doubles over the last couple of iterations.
Next root? The next root is approximately 2.872257717171606. This can be found using Newton’s method with
x0 = 2.81, for example. Note this computation is very sensitive to initial conditions because there are so many
roots near one another. Starting with x0 = 2.8, for example, leads to the root at 9.662623060421268!

2.5 More Convergence Diagrams


The cubic function g(x) = 1 − x3 has one real root, 1. But it also has two complex roots. If you have studied
complex analysis, you probably know what the other two are. And even if you have not studied complex analysis,
you can figure them out by basic techniques of pre-calculus. Since 1 is a root, you can use synthetic division to
deflate the polynomial:

1 −1 0 0 1
−1 −1 −1
−1 −1 −1 0

This division shows that g(x) = (x − 1)(−x^2 − x − 1), so the other two roots are the solutions of the equation −x^2 − x − 1 = 0, thus deflating the problem to a quadratic. The solutions are (1 ± √(1 − 4))/(−2) = −1/2 ± (√3/2)i. By the way, you may also recognize 1 − x^3 as one of the special forms of polynomials, the difference of cubes.
Of course this is all fascinating, but what does this have to do with numerical analysis? What may surprise
you is that fixed point iteration (and, therefore, Newton’s method), the secant method, and Steffensen’s method
can all be used to find complex roots just as well as real ones! In fact, the algorithms need no modification! The
programming language used to implement the methods, of course, does need to be able to handle complex number
arithmetic. Octave does so without ado.
First, finding a root of g(x) = 1 − x3 and finding a fixed point of f (x) = 1/x2 are equivalent. Why? Answer
on page 80. Setting x0 = −1 + i and applying Newton’s method and the secant method to g(x) = 1 − x3 , and
Steffensen’s method to f (x) = 1/x2 we get the following:

xi
i Steffensen’s Secant Newton’s
0 −1 + i −1 + i −1 + i
1 −0.85 + 0.8i −0.66666666 + 0.83333333i −0.66666666 + 0.83333333i
2 −0.60313824 + 0.67770639i −0.55034016 + 0.82376444i −0.50869191 + 0.84109987i
3 −0.39846066 + 0.84671567i −0.49763752 + 0.85554014i −0.49932999 + 0.86626917i
4 −0.51660491 + 0.84998590i −0.49932718 + 0.86627140i −0.49999991 + 0.86602490i
5 −0.49910537 + 0.86543351i −0.50000774 + 0.86602504i −0.50000000 + 0.86602540i
6 −0.50000228 + 0.86602568i −0.49999999 + 0.86602540i
7 −0.50000000 + 0.86602540i −0.50000000 + 0.86602540i
.. .. ..
8 . . .

Each sequence quickly converges to the complex root −1/2 + (√3/2)i. And this is not a fluke or a contrived example.
Generally, these methods work just as well in the complex plane as they do on the real line. One can find real roots
starting with complex numbers too. If we change the initial value x0 to 1 + i, Newton’s method converges to 1, for
example.
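No special handling is needed in Octave; complex arithmetic happens automatically. A minimal sketch (an inline Newton loop of my own, not code from the text):

    g  = @(x) 1 - x^3;   gp = @(x) -3*x^2;   % the cubic and its derivative
    x = -1 + 1i;                             % complex initial value
    for k = 1:7
      x = x - g(x)/gp(x);                    % Newton's method, unmodified
    end%for
    x    % should be very near -0.5 + 0.86603i, i.e. -1/2 + (sqrt(3)/2)i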
Having expanded our view of the methods to include complex numbers, there is a new type of convergence
diagram to consider. We can now look at convergence patterns for the three methods over a host of initial values
in the complex plane, not just the real line. Figure 2.5.1 shows convergence diagrams for Newton’s method with
g(x) = 1 − x3 , the seeded secant method with g(x) = 1 − x3 , and Steffensen’s method with f (x) = 1/x2 . Each
diagram covers the part of the complex plane with real parts in [−5, 5] and imaginary parts in [−3.75, 3.75]. The top
left corner of each diagram represents initial value −5 + 3.75i and the bottom right corner represents initial value
5 − 3.75i. The center of each diagram represents the initial value 0. The colors correspond to the three roots, red to 1, green to −1/2 + (√3/2)i, and blue to −1/2 − (√3/2)i. Black corresponds to failure to converge. The different intensities of red, green, and blue correspond to the number of iterations the method took to converge. The greater the intensity, the fewer iterations. We can see that for x0 = 5 − 3.75i, Newton's method and the seeded secant method both converge to −1/2 + (√3/2)i, because the upper right hand corner of each diagram is colored green. Steffensen's method,
on the other hand, fails to converge to any root if begun with x0 = 5 − 3.75i, evidenced by the blackness in the
upper right hand corner of the convergence diagram.
The dwell represents the maximum number of iterations allowed, so actually the black dots represent initial
values for which convergence was not achieved within a number of iterations equal to or less than the dwell. That’s
different from claiming the method does not converge at all for these initial values. There’s a chance that some of
the blackened initial values would still lead to convergence if allowed more iterations.

Figure 2.5.1: Convergence diagrams over the complex plane. From top to bottom: Newton's method with g(x) = 1 − x^3 and dwell 20; seeded secant method with g(x) = 1 − x^3 and dwell 40; Steffensen's method with f(x) = 1/x^2 and dwell 40. Each diagram covers the part of the complex plane with real parts in [−5, 5] and imaginary parts in [−3.75, 3.75].

Figure 2.5.2: A vertical line and its image under the exponential function.

Two things are very striking about these convergence diagrams. First, the seeded secant method and Newton’s
method converge for a much larger set of initial values than does Steffensen’s method. This is, at least in part,
due to the function chosen. For other functions, there may be a fixed point scheme for which Steffensen’s method
converges on large sets of initial values too. Second, the patterns of colors are extremely intricate, even fractal
in nature. Predicting to which root a method will converge for a given initial value, and indeed whether it will
converge at all, are very difficult questions! And this analysis is done on a rather benign (simple) function.
Consider now a much more complicated problem—finding the roots of g(z) = ez − z or, equivalently, finding the
fixed points of f (z) = ez . A graph of f (z) (over the real numbers) will quickly convince you that there are no real
number solutions. It will take some thought to determine the nature of any complex solutions.
To that end, fix a real number a0 and consider the vertical line in the complex plane, L_{a0} = {a0 + ib : b ∈ R}. The image of L_{a0} under the exponential function is the circle C_{a0} of radius e^{a0} centered at the origin. Indeed, e^{a0+ib} = e^{a0}e^{ib} = e^{a0}(cos b + i sin b). Thus b parameterizes the circle about the origin with radius e^{a0}. Now, suppose L_{a0} contains a fixed point, ẑ = a0 + ib̂, of the exponential function, f(z) = e^z. Then ẑ = f(ẑ), or a0 + ib̂ = e^{a0}(cos b̂ + i sin b̂). We conclude that the line and the circle intersect at the fixed point. Every fixed point of f is necessarily an intersection of the line L_{a0} with the circle C_{a0} for some a0. Figure 2.5.2 shows a representative
example. In fact, the diagram shows an interesting case: x = a0 ≈ 2.439940377216816. The coordinates of the two
intersections are
(2.439940377216816, ±11.2098911414971).
The interesting thing is
e2.439940377216816+11.2098911414971i ≈ 2.439940377216816 − 11.2098911414971i
and
e2.439940377216816−11.2098911414971i ≈ 2.439940377216816 + 11.2098911414971i.
The two points are images of one another under the exponential function! What we have found here are called periodic points. If we let z1 = 2.439940377216816 − 11.2098911414971i and z2 = 2.439940377216816 + 11.2098911414971i, then e^{z1} = z2 and e^{z2} = z1. Hence, if we iterate z2 = f(z1), z3 = f(z2), z4 = f(z3), z5 = f(z4), and so on, the
sequence z1 , z2 , z3 , z4 , . . . actually looks like
z1 , z2 , z1 , z2 , z1 , z2 , . . . .
The sequence just flops back and forth between z1 and z2 in a periodic fashion. We call such values period 2 points.
They are not fixed points of f (z) but they are fixed points of f (f (z))!

Crumpet 14: Periodic points.

If a sequence ⟨pn⟩ has the form

    p1, p2, . . . , pk, p1, p2, . . . , pk, p1, . . . ,    k > 1,

then we say p1 is a period k point (and p2, p3, . . . , pk are too!).

Figure 2.5.3: More convergence diagrams over the complex plane.

From left to right: Newton’s method with g(z) = z − ez and dwell 20; secant method with g(z) = z − ez and
dwell 40; Steffensen’s method with f (z) = ez and dwell 40. Each diagram covers the part of the complex plane
with real parts in [−10, 30] and imaginary parts in [0, 73].

On the other hand, ẑ = 2.062277729598284 + 7.588631178472513i is (approximately) a fixed point of f (z) since

e2.062277729598284+7.588631178472513i = 2.062277729598284 + 7.588631178472513i.

Moreover, the conjugate of ẑ, namely 2.062277729598284 − 7.588631178472513i, is also a fixed point. Verify it with a calculator or with Octave!
Generally, if ẑ is a fixed point of e^z then so is its conjugate conj(ẑ):

    ẑ = e^ẑ  ⟹  conj(ẑ) = conj(e^ẑ) = e^{conj(ẑ)}.

So if we find one fixed point, we actually have found two, the fixed point and its conjugate.
We're ready to get back to considering intersections of L_{a0} and C_{a0}. Assume a0 + ib is a fixed point of e^z. Then a0 + ib = e^{a0+ib} = e^{a0}(cos b + i sin b), so

    a0 = e^{a0} cos b
    b  = e^{a0} sin b        (2.5.1)

Now, because a0 + ib is a point of intersection, it is on C_{a0}, so a0^2 + b^2 = e^{2a0} ⇒ b = ±√(e^{2a0} − a0^2). Finally, substituting b = √(e^{2a0} − a0^2) into 2.5.1, we find an intersection point will be a fixed point if and only if

    a0 = e^{a0} cos √(e^{2a0} − a0^2)   and   √(e^{2a0} − a0^2) = e^{a0} sin √(e^{2a0} − a0^2).        (2.5.2)

You should pause long enough to consider why it is not necessary to substitute b = −√(e^{2a0} − a0^2) into 2.5.1. Hint: make the substitution and simplify. You should find out that the two equations you get are equivalent to those in 2.5.2.
For example, 2.439940377216816 − 11.2098911414971i and 2.062277729598284 + 7.588631178472513i both sat-
isfy the first equation of 2.5.2, but 2.439940377216816 − 11.2098911414971i does not satisfy the second while
2.062277729598284 + 7.588631178472513i does. So, as observed earlier, 2.439940377216816 − 11.2098911414971i is
not a fixed point but 2.062277729598284 + 7.588631178472513i is.

Do you recognize the first equation of 2.5.2? We first saw it on page 69 in section 2.4. As noted there, the
smallest five solutions are

.3181315052047641, 1.668024051576096, 2.062277729598284,


2.439940377216816, 2.653191974038697, . . .

The values 2.062277729598284 and 2.439940377216816 provided the examples for this discussion. What about the
other three values in this list? Do they give fixed points of the exponential function? Period two points? Something
else? Take a moment to investigate. Answers are on page 80. Using Octave to investigate 2.062277729598284,
which we know is a fixed point:

octave:1> format(’long’)
octave:2> a0=2.062277729598284
a0 = 2.06227772959828
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 7.58863117847251
octave:4> exp(a0+I*b)
ans = 2.06227772959828 + 7.58863117847251i

verifies that ea0 +ib = a0 + ib for a0 = 2.062277729598284, at least to machine precision. The exact value of the
fixed point is not known, but that is the nature of numerical analysis.
Figure 2.5.3 shows convergence to 12 of the fixed points of ez , one for each of the 12 different colors. The
coordinates of each fixed point can be approximated by locating the spot of greatest intensity within each colored
band.
As was done in Figure 2.5.3, convergence diagrams for the secant method can be created by setting x1 = x0 + δ
for some small number δ. It does not matter whether δ is real or complex. Selecting x1 automatically this way
allows the diagram to show convergence or divergence based on x0 alone, just as is done for the other convergence
diagrams. You will notice that the convergence diagram for the secant method and the convergence diagram for
Newton’s method are quite similar. For sufficiently small δ, this will be the case in general. The secant method
convergence diagram and the Newton’s method convergence diagram for the same function over the same region will
look very much the same. The only significant difference will be the number of iterations needed for convergence.
The secant method will need more iterations to converge.

Exercises

1. Match the function with its Newton’s method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [S]

f (x) = 56 − 152x + 140x2 − 17x3 − 48x4 + 9x5


g(x) = (x2 )(ln x) + (x − 3)ex
h(x) = 1 + 2x + 3x2 + 4x3 + 5x4 + 6x5
l(x) = (ln x)(x3 + 1)
78 CHAPTER 2. ROOT FINDING

(a) (b)

(c) (d)

2. Match the function with its Newton’s method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [A]

f (x) = sin x
g(x) = sin x − e−x
h(x) = ex + 2−x + 2 cos x − 6
l(x) = x4 + 2x2 + 4

(a) (b)

(c) (d)

3. Find a polynomial that has the following roots and no others.

(a) −7, 2, 1 ± 5i
(b) −7, 2, 1 + 5i [S]
(c) −4, −1, 2, ±2i [S]
(d) −4, −1, 2, 2i
(e) 0, −1 ± i, 1 ± i
(f) −3 + i, −2 − i, −3i, 1 − 2i

4. Create Newton’s method convergence diagrams for the polynomials of question 3. Make sure you capture a region that
shows at least a small area converging to each root. Octave code may be downloaded at the companion website.
5. The functions f(x) = e^x and g(x) = 1/(x^2 + 1) have no roots, real or complex. Find at least two others that also have no roots.

6. Let f(x) = (x^2 − 7x + 10)/2 + sin(3x).

(a) Find all the real roots of f . This is not a polynomial, so deflation will not work. Instead, graph the function and
use Newton’s method to find the real roots accurate to 10−8 . There are four of them.
(b) Create a Newton’s method convergence diagram for f to see if there are any complex roots. If so, use Newton’s
method to approximate them. Use the convergence diagram to help you choose initial values.
(c) Can you find all the roots of f ?

7. Match the function with its seeded secant method convergence diagram. The real axis passes through the center of
each diagram, and the imaginary axis is represented, but is not necessarily centered. [S]

f (x) = sin x
g(x) = sin x − e−x
h(x) = ex + 2−x + 2 cos x − 6
l(x) = 56 − 152x + 140x2 − 17x3 − 48x4 + 9x5

(a) (b)

(c) (d)

8. Match the function with its seeded secant method convergence diagram. The real axis passes through the center of
each diagram, and the imaginary axis is represented, but is not necessarily centered. [A]

f (x) = x4 + 2x2 + 4
g(x) = (x2 )(ln x) + (x − 3)ex
h(x) = 1 + 2x + 3x2 + 4x3 + 5x4 + 6x5
l(x) = (ln x)(x3 + 1)

(a) (b)

(c) (d)

9. Create seeded secant method convergence diagrams for the polynomials of question 3. Make sure you capture a region
that shows at least a small area converging to each root. Octave code may be downloaded at the companion website.
10. The Newton’s method convergence diagram for one polynomial is much like the Newton’s method convergence diagram
for another. Interesting changes in the Newton’s method convergence diagrams and seeded secant method convergence
diagrams can be achieved by multiplying a polynomial by a non-polynomial function with no roots. Create Newton’s
method and seeded secant method convergence diagrams for products of functions in question 3 with functions in
question 5.
11. Discuss the relative strengths and weaknesses of Newton’s method, the secant method, and the seeded secant method.

Answers
Why equivalent? The equations g(x) = 0 and f(x) = x have exactly the same solutions. g(x) = 0 ⇔ 1 − x^3 = 0 ⇔ 1 = x^3 ⇔ 1/x^2 = x ⇔ f(x) = x.

Nature of roots? .3181315052047641 is a fixed point of the exponential function:

octave:1> format(’long’)
octave:2> a0=.3181315052047641;
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 1.33723570143069
octave:4> exp(a0+I*b)
ans = 0.318131505204764 + 1.337235701430689i

1.668024051576096 is a period two point of the exponential function:

octave:1> format(’long’)
octave:2> a0=1.668024051576096;
octave:3> b=sqrt(exp(2*a0)-a0^2)
b = 5.03244706448616
octave:4> exp(a0+I*b)
ans = 1.66802405157609 - 5.03244706448616i

2.653191974038697 is a fixed point of the exponential function:



octave:5> a0=2.653191974038697;
octave:6> b=sqrt(exp(2*a0)-a0^2)
b = 13.9492083345332
octave:7> exp(a0+I*b)
ans = 2.65319197403878 + 13.94920833453319i

2.6 Roots of Polynomials


Synthetic division revisited
You may recall using the rational roots theorem and synthetic division to find roots of polynomials of degree 3 or
more in algebra. The process was something like this. You made a list of possible roots based on the rational roots
theorem. You checked each one using synthetic division until you either found a root or ran out of candidates. It
is possible that was as far as your class took the process, but there is more to say.
Suppose we have a polynomial p(x) and a number t. Synthetic division gives coefficients of q(x) such that
p(x) = q(x) · (x − t) + p(t). For example, the synthetic division

t p(x)
z}|{ z }| {
−3 −4 2 3 −6
12 −42 117
−4 14 −39 111
| {z } | {z }
q(x) p(t)

tells us that p(x) = −4x3 + 2x2 + 3x − 6 = (−4x2 + 14x − 39)(x + 3) + 111. While it is a small burden to evaluate
the expression −4x3 + 2x2 + 3x − 6 when x = −3, it is no burden at all to evaluate (−4x2 + 14x − 39)(x + 3) + 111
when x = −3. The (x + 3) factor is zero, so it doesn’t matter to what (−4x2 + 14x − 39) evaluates. The product is
zero and (−4x2 + 14x − 39)(x + 3) + 111 evaluates to 111. Therefore, p(−3) = 111. Synthetic division gives a quick
way to evaluate a polynomial. The number at the end of the division is the value of the polynomial at the value of
the divisor.
More generally, here is a dissection of the division of p(x) = a0 + a1 x + · · · + an xn by x − t using synthetic
division:

    t    a_n    a_{n−1}            a_{n−2}                         · · ·    a_0
                a_n t              t(a_n t + a_{n−1})              · · ·    t(· · · (t(a_n t + a_{n−1}) + a_{n−2}) + · · · + a_1)
         a_n    a_n t + a_{n−1}    t(a_n t + a_{n−1}) + a_{n−2}    · · ·    p(t)

Beginning with t in the upper left corner, we end up with p(t) in the lower right corner. It is not only when the number in the lower right corner is zero that we find something of interest. Every synthetic division gives something of interest! The number in the bottom right corner is p(t) whether it turns out to be zero or not. And there is
more.
The numbers an , an t + an−1 , an (an t + an−1 ) + an−2 , and so on, appearing in the bottom row of the synthetic
division give the coefficients of the quotient, q(x). Every synthetic division gives a decomposition of the polynomial
into quotient and remainder. Thus, with every synthetic division, we get an equivalent expression of the form
q(x) · (x − t) + p(t). There is still more.
Differentiating the equation p(x) = q(x) · (x − t) + p(t) with respect to x gives

    p'(x) = q'(x) · (x − t) + q(x).

Hence, p'(t) = q'(t) · (t − t) + q(t) = q(t). So, not only do the numbers in the bottom row give the coefficients of the quotient, they double as coefficients appropriate for evaluating p'(t). Returning to the previous example, if we desire to calculate p'(−3), we simply continue the synthetic division as in

−3 −4 2 3 −6
12 −42 117
−3 −4 14 −39 111
12 −78
−4 26 −117

and find out p'(−3) = −117. The procedure of calculating p(t) and p'(t) by simultaneous synthetic divisions is known as Horner's method and is especially convenient for use in Newton's method. If we were trying to find a root of p(x) = −4x^3 + 2x^2 + 3x − 6 with initial approximation x0 = −3 we would have, at this point, x1 = x0 − p(x0)/p'(x0) = −3 − 111/(−117) ≈ −2.05128. Yet there is more.

Finding all the roots of polynomials


When we happen upon a root of the polynomial p(x), the result of the synthetic division, p(x) = q(x)(x − t) + p(t),
reduces to p(x) = q(x)(x − t) since t is a root, meaning p(t) = 0. In this case, we have a factorization of p(x). The
rest of the roots of p are exactly the roots of q, so having found one root, we have reduced the problem of finding
roots of p to (a) noting the root we have found plus (b) finding the roots of the polynomial q, a polynomial of
one degree less than that of p. In this way, we have deflated the problem of finding the n roots of the nth degree
polynomial p to finding the n − 1 roots of the (n − 1)-degree polynomial q. Taking it a step further, when we have
found a root of q, we can use synthetic division to reduce the problem again. We (a) note the root of q and (b)
continue searching for roots of the quotient, an (n − 2)-degree polynomial. We continue this way, deflating the
problem by one degree each time we find a root until we have reduced the problem to a 2nd degree polynomial. At
this point, we have a quadratic polynomial and can use the quadratic equation to find the last two roots.
For example, −1.18985 is (approximately) a root of p(x) = −4x3 + 2x2 + 3x − 6. Synthetic division of p(x) by
(x + 1.18985) gives

−1.18985 −4 2 3 −6
4.7594 −8.04267 6.00002
−4 6.7594 −5.04267 0.00002

The (near) zero at the bottom-right indicates that −1.18985 is approximately a root. There is no appreciable remainder upon division of −4x^3 + 2x^2 + 3x − 6 by x + 1.18985. Moreover, the numbers −4, 6.7594, −5.04267 in the bottom row give the coefficients of q(x). Thus, we find from this division that −4x^3 + 2x^2 + 3x − 6 ≈ (−4x^2 + 6.7594x − 5.04267)(x + 1.18985). We can now find the other two roots by locating the roots of q(x) = −4x^2 + 6.7594x − 5.04267. Using the quadratic formula, they are

    (−6.7594 ± √(6.7594^2 − 4(−4)(−5.04267)))/(−8) ≈ .84493 ± .73944i.

Our process will lead us to finding n roots of any nth degree polynomial. It is important to note that some of
these roots may be complex and some of them may be repeated.
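Deflation itself is nothing more than a synthetic division that keeps the quotient and throws away the (near zero) remainder. A minimal Octave sketch, using the same ascending coefficient array convention as the pseudo-code later in this section, might read:

    % deflate.m -- a minimal sketch; c holds the coefficients of
    % c(1) + c(2)x + ... + c(n+1)x^n and r is (approximately) a root,
    % so the remainder of the synthetic division is simply discarded.
    function d = deflate(c,r)
      n = length(c) - 1;          % degree of the polynomial
      d = zeros(1,n);             % coefficients of the degree n-1 quotient
      d(n) = c(n+1);              % leading coefficient drops straight down
      for j = n:-1:2
        d(j-1) = r*d(j) + c(j);   % one synthetic division step
      end%for
    end%function

For example, deflate([-6 3 2 -4], -1.18985) should return coefficients close to those of q(x) = −4x^2 + 6.7594x − 5.04267 (stored as [−5.04267 6.7594 −4]).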

Crumpet 15: The Fundamental Theorem of Algebra

The process of finding one root of a given polynomial, deflating, and finding another mirrors quite closely the
mathematical theorems of algebra. The Fundamental Theorem of Algebra states that every polynomial with
complex coefficients and degree at least one has a complex root. Thus our search for a root is not in vain! We can
then write our polynomial in factored form and continue. The Fundamental Theorem says that there is again a
root of the deflated polynomial. And if we keep track of all the roots as we find them, we end up writing our
polynomial in the form
p(x) = a(x − r1 )e1 (x − r2 )e2 · · · (x − rk )ek , (2.6.1)
where a is a nonzero constant, r1 , r2 , . . . , rk are the k distinct complex roots, and e1 , e2 , . . . , ek are the so-called
(positive integer) multiplicities of the roots. From this form, we see that the degree of the polynomial equals the
sum of the multiplicities, e1 + e2 + · · · + ek . This is what we mean when we say the number of roots, counting
multiplicity, is equal to the degree of the polynomial. Thus when searching for the roots of a polynomial of degree
n, we know we are looking for n roots, but not necessarily n distinct roots. Some of them may be repeated and
the repetitions are accounted for in the multiplicities. To formalize the claim in equation 2.6.1, we have the following theorem.

Theorem 6. (Fundamental Factorization Theorem) If n ≥ 1 and p is a degree n polynomial, then

p(x) = a(x − r1 )e1 (x − r2 )e2 · · · (x − rk )ek

for some constant a ≠ 0, roots r1, r2, . . . , rk, and positive integer exponents e1, e2, . . . , ek where

    ∑_{j=1}^{k} e_j = n.

Proof. Suppose n = 1 so p(x) takes the form ax + b with a ≠ 0. Then p(x) = a(x − (−b/a))^1 and thus takes the required form. Now suppose all polynomials of some degree n ≥ 1 take the required form and let p be a
polynomial of degree n + 1. By the Fundamental Theorem of Algebra, p has a root. Call it ρ. Then x − ρ is
a factor of p so p can be written as p(x) = (x − ρ) · q(x) for some polynomial q of degree n. By the inductive
hypothesis, we have that q takes the required form, so

p(x) = (x − ρ) · a(x − r1 )e1 (x − r2 )e2 · · · (x − rk )ek

where e1 + e2 + · · · + ek = n. If ρ is distinct from r1 , r2 , . . . , rk , then p takes the form

p(x) = a(x − r1 )e1 (x − r2 )e2 · · · (x − rk )ek (x − ρ)1 .

If ρ equals one of r1 , r2 , . . . , rk , say rj , then p takes the form

p(x) = a(x − r1 )e1 (x − r2 )e2 · · · (x − rj )ej +1 · · · (x − rk )ek .

In either case, p takes the required form and the proof is complete.

Pseudo-pseudo-code for this procedure might look something like this:

Assumptions: p is a polynomial of degree n > 2.


Input: Polynomial p(x); tolerance tol; maximum number of iterations N .
Step 1: For i = 1 to n − 2 do Steps 2-5:
Step 2: Find a root x0 of p(x) [using tol, N , and some root-finding method];
Step 3: If error trying to find x0 then
return “Method failed. Root of degree n − i + 1 not found.”;
Step 4: Factor p(x) as q(x) · (x − x0 );
Step 5: Set xi = x0 ; p(x) = q(x);
Output: Approximate roots.

To refine the pseudo-pseudo-code into pseudo-code, we will use Newton’s method, assisted by Horner’s method,
in Step 2. The usual drawback of Newton’s method, the requirement that the derivative be known and calculated, is
but a small inconvenience when Horner’s method is employed. But how do we represent polynomials in a computer
program so that we can accomplish Steps 4 and 5? The same way we implement code to execute Horner’s method.
Pseudo-code for Horner’s method, with an array:

Assumptions: p is a polynomial of degree n ≥ 1.


Input: array [c] of coefficients of p(x) = c1 + c2 x + c3 x2 + · · · + cn+1 xn ; x0 .
Step 1: Set y = cn+1 ; z = cn+1 ;
Step 2: For j = n, n − 1, . . . , 2 do Step 3
Step 3: Set y = x0 y + cj ; z = x0 z + y;
Step 4: Set y = x0 y + c1 ;
Output: y = p(x0 ) and z = p0 (x0 ).
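A minimal Octave sketch of this pseudo-code, using the first line suggested in the exercises, might read:

    % horner.m -- a minimal sketch of the pseudo-code above; c holds the
    % coefficients of c(1) + c(2)x + ... + c(n+1)x^n.
    function [p,pprime] = horner(x0,c)
      n = length(c) - 1;
      p = c(n+1);  pprime = c(n+1);       % Step 1
      for j = n:-1:2                      % Step 2
        p = x0*p + c(j);                  % Step 3: nest one more coefficient
        pprime = x0*pprime + p;           %         and update the derivative
      end%for
      p = x0*p + c(1);                    % Step 4: last step updates p only
    end%function

With the running example, [y,z] = horner(-3,[-6 3 2 -4]) should give y = 111 and z = −117, matching the synthetic divisions above.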

As in synthetic division, there is no need to retain the variable to various exponents. Only the coefficients are
needed to define a polynomial. So, in the program, a polynomial is represented by an array of numbers. Putting
together our pseudo-pseudo code, Newton’s method and Horner’s method into a single program, we have a method
for finding all the roots of a polynomial:

Assumptions: p is a polynomial of degree n > 2 and c1 , the constant coefficient of p, is nonzero.


Input: array [c] of coefficients of p(x) = c1 + c2 x + c3 x2 + · · · + cn+1 xn ; tolerance tol; maximum number of
iterations N ; initial value x0 .
Step 1: Set m = n;

Step 2: For i = 1 to n − 2 do Steps 3-13:


Step 3: Set k = 0; Set x = x0 ;
Step 4: While |x − x0 | > tol or k = 0 do Steps 5-12:
Step 5: If k = N then return “Method failed. Not all roots found.”
Step 6: Set x0 = x;
Step 7: Set dm = cm+1 ; z = cm+1 ;
Step 8: For j = m, m − 1, . . . , 2 do Step 9
Step 9: Set dj−1 = x0 dj + cj ; z = x0 z + dj−1 ;
Step 10: Set y = x0 d1 + c1 ;
Step 11: Set x = x0 − yz ;
Step 12: Set k = k + 1;
Step 13: Set ri = x; [c] = [d]; m = m − 1;
Step 14: Set D = √(c2^2 − 4 c1 c3); s1 = −c2 + D; s2 = −c2 − D;
Step 15: If the real part of c2 is negative, then set r_{n−1} = s1/(2c3) and r_n = 2c1/s1; else set r_{n−1} = s2/(2c3) and r_n = 2c1/s2;

Output: Array [r1 , r2 , . . . , rn ] of approximate roots.

Steps 4 through 12 implement Newton’s method to find a single root, using Horner’s method in Steps 7 through 10
to calculte the value of the polynomial and its derivative at x0 . Care is taken to calculate and store the coefficients
[d] of the quotient for easy referral in Step 13. It is assumed that the square root calculated in Step 14 is the
principle branch of the complex square root. Steps 14 and 15 utilize an alternate form of the quadratic formula
that avoids the subtraction of nearly equal quantities so much as possible.

Crumpet 16: Alternate Quadratic Formula


When the roots of p(x) = ax^2 + bx + c are small, the numerator of the quadratic formula, x = (−b ± √(b^2 − 4ac))/(2a), is necessarily small. In this case, it is best to match the signs of −b and ±√(b^2 − 4ac) in order to avoid subtracting quantities of nearly equal value. Choosing the sign of the square root term this way gives one of the roots as accurately as possible, but leaves the other root undetermined. Multiplying both numerator and denominator by the conjugate of the numerator gives an alternate expression of the quadratic formula:

    (−b ± √(b^2 − 4ac))/(2a) · (−b ∓ √(b^2 − 4ac))/(−b ∓ √(b^2 − 4ac)) = (b^2 − (b^2 − 4ac))/(2a(−b ∓ √(b^2 − 4ac)))
                                                                       = 4ac/(2a(−b ∓ √(b^2 − 4ac)))
                                                                       = 2c/(−b ∓ √(b^2 − 4ac)).

Expanding, we have

    (−b + √(b^2 − 4ac))/(2a) = 2c/(−b − √(b^2 − 4ac))   and   (−b − √(b^2 − 4ac))/(2a) = 2c/(−b + √(b^2 − 4ac)).

However, there is little that can be done at this point if zero happens to be a double root. In this instance, both c1 and c2 will be zero or nearly zero, making both s1 and s2 very small. This is why the set of assumptions includes the stipulation c1 ≠ 0. This ensures that zero is not a root of p.
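A minimal Octave sketch of the crumpet's formula, in the shape the exercises suggest, might read (the test on real(b) is one reasonable way of matching signs; it is my assumption, not the text's prescription):

    % quadraticRoots.m -- a minimal sketch of the alternate quadratic formula.
    function [r1,r2] = quadraticRoots(a,b,c)
      D = sqrt(b^2 - 4*a*c);        % complex when the discriminant is negative
      if real(b) >= 0
        s = -b - D;                 % match signs to avoid cancellation
      else
        s = -b + D;
      end%if
      r1 = s/(2*a);                 % standard form for one root
      r2 = (2*c)/s;                 % alternate form for the other
    end%function

For instance, quadraticRoots(-4, 6.7594, -5.04267) should reproduce the pair .84493 ± .73944i found earlier.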

Newton’s method and polynomials


There is one more issue to address regarding the use of Newton’s method for finding roots of polynomials. For a
polynomial with real coefficients, if x0 is real, so will be x1 , and x2 , and every successive iteration! There will be
no hope of finding complex roots. This is not a problem if the polynomial has at most two complex roots. The
real roots will be found and the resulting quadratic will hold the two complex roots. The complex roots will be
uncovered by the quadratic formula. In general, though, we can not count on a polynomial having at most two
complex roots. Our method should work for polynomials with arbitrarily many complex roots, including the case
when all roots are complex.
The fix is not difficult, with one proviso. Mathematically, Newton’s method and Horner’s method work just as
well with complex numbers as they do with real numbers. As long as the programming language you are using can
handle complex numbers, just begin with a complex (not purely real) initial approximation x0 , and complex roots
will be found! Even so, it is possible that all the real roots are found first and what remains will be a polynomial
with more than two complex roots and no real roots. This is where the inaccuracy of floating point arithmetic is
actually helpful! Neither the coefficients nor the value of x0 will be purely real due to round-off error. The complex
roots will generally be found.

Müller’s Method
Another very fast method for finding roots of equations is Müller’s method . In principle, it is very much like the
secant method. With the secant method, two initial approximations p0 and p1 are made. The secant line through
the points (p0 , f (p0 )) and (p1 , f (p1 )) is drawn and its intersection with the x-axis gives p2 . With Müller’s method,
three initial approximations p0 , p1 , and, p2 are needed. The parabola through the points (p0 , f (p0 )), (p1 , f (p1 )),
and (p2 , f (p2 )) is drawn and its intersection with the x-axis gives p3 . There are a couple of issues to deal with,
however. First, if the parabola so drawn crosses the x-axis at all, it crosses it twice. We need to choose one of the
zeros for p3 . Second, it is possible the parabola will not cross the x-axis at all.
Solving the problem of which root to choose is simple. We assume the approximation p2 is better than the
others, so we choose the root that is closest to p2 . Actually, that solves the second “problem” too. Even when the
parabola does not cross the x-axis, it has zeros. They are complex. And we do not worry about that. We simply
take the complex root that is closest to p2 . This has the nice advantage that even when the coefficients of p(x) are
all real and p0 , p1 , and, p2 are all real, and all the roots of p(x) are complex, it will find a complex root.
As to the business of finding the parabola passing through (p0 , f (p0 )), (p1 , f (p1 )), and (p2 , f (p2 )), we will seek
a parabola P (x) of the form
P (x) = a(x − p2 )2 + b(x − p2 ) + c.

Making the substitutions x = pi and P (x) = f (pi ) leads to the three equations

f (p0 ) = a(p0 − p2 )2 + b(p0 − p2 ) + c


f (p1 ) = a(p1 − p2 )2 + b(p1 − p2 ) + c
f (p2 ) = c

So we find out immediately that c = f (p2 ) and we must solve the simultaneous equations

f (p0 ) − f (p2 ) = a(p0 − p2 )2 + b(p0 − p2 )


f (p1 ) − f (p2 ) = a(p1 − p2 )2 + b(p1 − p2 )

for a and b. The solution is

    b = [(p0 − p2)^2 (f(p1) − f(p2)) − (p1 − p2)^2 (f(p0) − f(p2))] / [(p0 − p2)(p1 − p2)(p0 − p1)]
    a = [(p1 − p2)(f(p0) − f(p2)) − (p0 − p2)(f(p1) − f(p2))] / [(p0 − p2)(p1 − p2)(p0 − p1)].

Now plugging a, b, and c into the quadratic formula gives us roots x = p2 − 2c/(b ± √(b^2 − 4ac)). To choose the one closest to p2, we compare |b + √(b^2 − 4ac)| with |b − √(b^2 − 4ac)| and use the larger. This gives us the smallest value for |x − p2|, the distance of the root from p2.

For example, we will use Müller's method with p0 = 1, p1 = 2, and p2 = 3 to find a root of f(x) = x^3 + 1. We calculate

    δ0 = f(p0) − f(p2) = 2 − 28 = −26
    δ1 = f(p1) − f(p2) = 9 − 28 = −19
    h0 = p0 − p2 = −2
    h1 = p1 − p2 = −1
    h2 = p0 − p1 = −1

so we get c = 28, b = (h0^2 δ1 − h1^2 δ0)/(h0 h1 h2) = (4(−19) − 1(−26))/(−2) = 25, and a = (h1 δ0 − h0 δ1)/(h0 h1 h2) = (−1(−26) − (−2)(−19))/(−2) = 6. A close look at the graphs of f(x) and P(x) = 6(x − 3)^2 + 25(x − 3) + 28 shows that they do meet three times (at the required points), and that P(x) does not have real roots:

    [Graph of f(x) and P(x) over 0 ≤ x ≤ 3.5.]

b ± √(b^2 − 4ac) = 25 ± √(625 − 672) = 25 ± i√47. Since |25 + i√47| = |25 − i√47|, it does not matter which root we take. Selecting p3 = p2 − 2c/(b − √(b^2 − 4ac)), we get p3 = 3 − 56/(25 − i√47) = 11/12 − (√47/12)i. Continuing this process gives the iterates 0.75238 − 0.75810i, 0.57069 − 0.84288i, . . . , 0.50000 − 0.86603i, converging to 1/2 − (√3/2)i.
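A minimal Octave sketch of Müller's method along these lines might read (the name muller, the stopping test, and the requirement of three distinct starting values are my own choices):

    % muller.m -- a minimal sketch following the derivation above; f is a
    % function handle, p0,p1,p2 the three (distinct) starting values.
    function p = muller(f, p0, p1, p2, tol, N)
      for k = 1:N
        c  = f(p2);
        d0 = f(p0) - c;   d1 = f(p1) - c;
        h0 = p0 - p2;     h1 = p1 - p2;
        b = (h0^2*d1 - h1^2*d0)/(h0*h1*(h0 - h1));
        a = (h1*d0 - h0*d1)/(h0*h1*(h0 - h1));
        D = sqrt(b^2 - 4*a*c);            % complex roots handled automatically
        if abs(b - D) > abs(b + D)        % pick the larger denominator ...
          E = b - D;
        else
          E = b + D;
        end%if
        p = p2 - 2*c/E;                   % ... so the new point stays near p2
        if abs(p - p2) < tol
          return
        end%if
        p0 = p1;  p1 = p2;  p2 = p;       % shift the three points
      end%for
      error("Method failed. Maximum iterations exceeded.")
    end%function

Called as muller(@(x) x^3+1, 1, 2, 3, 1e-10, 50), it should follow the iterates just listed.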

Crumpet 17: Orders of convergence

The order of convergence of Müller's method to a simple root (one that is not repeated) is

    ( √11/(3√3) + 19/27 )^{1/3} + 4/( 9( √11/(3√3) + 19/27 )^{1/3} ) + 1/3 ≈ 1.839286755214161

and to a double root,

    ( √139/(24√3) + 8/27 )^{1/3} + 7/( 36( √139/(24√3) + 8/27 )^{1/3} ) + 1/6 ≈ 1.233751928528259.

The method of Laguerre converges to a simple root with order 3.

References [23, 26]

The following chart summarizes the relative strengths and weaknesses of Newton’s method, the secant method,
and Müller’s method.

                                               Newton's    Secant      Müller's
    Initial values needed                      1           2           3
    Derivative needed?                         Yes         No          No
    Order of Convergence                       2           ≈ 1.618     ≈ 1.839
    Automatic discovery of complex roots?      No          No          Yes
    Simplified in the case of polynomials?     Yes         No          No

Key Concepts
Synthetic division: A method for dividing a polynomial p(x) by a monomial (x − x0 ) using only addition, multi-
plication, and the coefficients of p. The process is identical to evaluating a polynomial by nesting. Synthetic
division simply provides an organizational tool so that nesting can be accomplished simply with pencil and
paper.
Horner’s method: A method where the value of a polynomial and its derivative at a single point are calculated
simultaneously via synthetic division.
Müller’s method: A root-finding method similar to the secant method where instead of using a secant line a
parabola is used.
Deflation: The method of replacing a polynomial p(x) by the product of a monomial (x − x0 ) and a polynomial
q(x) of degree one less than that of the original polynomial.

Exercises

1. Write an Octave function that calculates the roots of a quadratic function using the alternate quadratic formula when appropriate. The first line of your function should be

       function [r1,r2] = quadraticRoots(a,b,c)

   where r1 and r2 are the roots of p(x) = ax^2 + bx + c. This way, the values r1 and r2 are returned by the function in an array. The function is called like this:

       [s,t]=quadraticRoots(1,2,3),

   setting s to the value of one of the roots and t to the other. Test your code well by comparing outputs of your function to hand/calculator computations.

2. Write an Octave function that implements Horner’s method. The first line of your function should be

       function [p,pprime] = horner(x0,c)

   where c is an array containing the coefficients of the polynomial, x0 is the number at which to evaluate it, p is the value of the polynomial at x0, and pprime is the value of the derivative of the polynomial at x0. This way, the values p and pprime are returned by the function in an array. The function is called like this:

       [y,yy]=horner(-2,[5,4,3,2,1]),

   setting y to the value of the polynomial and yy to the value of its derivative. Test your code well by comparing outputs of your function to hand/calculator computations.

3. Write an Octave function that implements Newton’s method with Horner’s method. The first line of your function should be

       function x = newtonhorner(c,x0,tol,N)

   where c is an array containing the coefficients of the polynomial, x0 is the initial value, tol is the tolerance, and N is the maximum number of iterations before giving up. The code should be similar to code you wrote to implement Newton’s method before, but this code will only work for polynomials. Inside your newtonhorner function, DO NOT write Horner’s method code. Just call the horner function you wrote in question 2. Test your code well by comparing outputs of your function to outputs from the code you wrote in question 1 on page 71.

4. Complete the code for the deflate function begun here.

       % This function will deflate a polynomial
       % given a root.
       % INPUT: coefficients c of the polynomial;
       %        a root r of the polynomial.
       % OUTPUT: coefficients d of the deflated
       %         polynomial.
       function d = deflate(c,r)

       end%function

5. Write an Octave function implementing Müller’s method.

6. Use Horner’s method/synthetic division to find g(2) and g′(2). Do not use a computer.

   (a) g(x) = 3x^3 + 12x^2 − 13x − 8 [S]
   (b) g(x) = −7 + 8x − 3x^2 + 5x^3 − 2x^4 [A]

7. Use Horner’s method to calculate g(−2) and g′(−2) where g(x) = 4x^4 − 5x^3 + 6x − 7. Do not use a computer.

8. Use your work from question 6 to help execute two iterations of Newton’s method using a pencil, paper, calculator, and Horner’s method/synthetic division. Use initial value x0 = 2. [S][A]

9. Use your work from question 7 to help execute two iterations of Newton’s method using a pencil, paper, calculator, and Horner’s method/synthetic division. Use initial value x0 = −2.

10. Compute x2 of Newton’s method by hand (using Horner’s method/synthetic division) for f(x) = x^3 + 4x − 8 starting with x0 = 0.

11. Find x2 of Newton’s method by hand (using Horner’s method/synthetic division) for f(x) = x^4 − 2x^3 − 4x^2 + 4x + 4 using x0 = 2.

12. Using Horner’s method as an aid, and not using your calculator, find the first iteration of Newton’s method for the function f(x) = 2x^3 − 10x + 1 using x0 = 2.

13. Demonstrate two iterations of Newton’s method (using Horner’s method/synthetic division) applied to f(x) = 5x^3 − 2x^2 + 7x − 3 with p0 = 1 by hand.

14. Find all the roots of the polynomial as follows. Use Newton’s method with tolerance 10^−5 to approximate a root of the polynomial. You may use your newtonhorner function from question 3. Then use synthetic division to deflate the polynomial one degree. Do not use a computer for deflation. Then use Newton’s method with tolerance 10^−5 to approximate a root of the deflated polynomial. Then use synthetic division to deflate the deflated polynomial one degree. Repeat until the deflated polynomial is quadratic. Once this happens, use the quadratic formula (or alternate quadratic formula) to find the last two roots.

   (a) g(x) = x^4 + 6x^3 − 59x^2 + 144x − 144 [S]
   (b) g(x) = −280 + 909x − 154x^2 − 178x^3 + 54x^4 + 9x^5 [A]

15. Find all the roots of the polynomial as follows. Use Newton’s method with tolerance 10^−5 to approximate a root of the polynomial. You may use your newtonhorner function from question 3. Then use synthetic division to deflate the polynomial one degree. You may use your deflate function from question 4 for deflation. Then use Newton’s method with tolerance 10^−5 to approximate a root of the deflated polynomial. Then use synthetic division to deflate the deflated polynomial one degree. Repeat until the deflated polynomial is quadratic. Once this happens, use the quadratic formula to find the last two roots. You may use your quadraticRoots function from question 1 for solving the quadratic.

   (a) g(x) = x^4 − 2x^3 − 12x^2 + 16x − 40 [S]
   (b) g(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5 [A]

16. For each root you found in question 14 except the first one, use it as an initial approximation in Newton’s method with tolerance 10^−5 to see if you can refine your roots. Do they change? [S][A]

17. f(x) = x^3 − 1.255x^2 − .9838x + 1.2712 has a root at x = 1.12.

   (a) Use Newton’s method with an initial approximation x0 = 1.13 to attempt to find this root. Explain what happens.
   (b) Find all the roots of f(x).

18. About 800 years ago John of Palermo challenged mathematicians to find a solution of the equation x^3 + 2x^2 + 10x = 20. In 1224, Fibonacci answered the call in the presence of Emperor Frederick II. He approximated the only real root using a geometric technique of Omar Khayyam (1048-1131), arriving at the estimate

       1 + 22(1/60) + 7(1/60)^2 + 42(1/60)^3 + 33(1/60)^4 + 4(1/60)^5 + 40(1/60)^6.

   How accurate was his approximation?

   Reference [5, pg. 96 ex. 10]

19. Calculate the value of the polynomial at the given value of x in two different ways. (i) Use your horner function from question 2; and (ii) use an inline() function. Then (iii) compare the two results using Octave’s == operator.

   (a) p(x) = x^4 − 2x^3 − 12x^2 + 16x − 40 at x = √3 [S]
   (b) q(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5 at x = π/2 [A]
   (c) r(x) = x^6 + 11x^4 − 34x^3 − 130x^2 − 275x + 819 at (1 − √5)/2 [A]
   (d) s(x) = 5x^10 + 3x^8 − 46x^6 − 102x^4 + 365x^2 + 1287 at 1/e

20. Write an Octave function that uses your functions from questions 1, 3, and 4 to find all the roots of a polynomial. Test your function well on polynomials of various degrees for which you know the roots. You may base your function on the pseudo-code on page 84, but your code should be significantly simpler since you are calling functions instead of writing their code. [A]

21. Use your code from question 20 to find all the solutions of the equation. [A]

   (a) x^5 + 11x^4 − 34x^3 − 130x^2 − 275x + 819 = 0
   (b) 5x^5 + 3x^4 − 46x^3 − 102x^2 + 365x + 1287 = 0

22. Find all the roots of g(x) = 25x^3 − 105x^2 + 148x − 174.

23. Recall that there are some similarities between the secant method and Müller’s method. They each require multiple initial approximations. They each involve calculating the zero of some function passing through these initial points. They both give superlinear convergence to simple roots. And, of course, they are both root finding methods. Let’s tweak the idea in the following way. To find roots of g, start as with the secant method, using two approximations, x0 and x1. Then, instead of using the zero of a line through (x0, g(x0)) and (x1, g(x1)), find the function of the form

       h(x) = ax^3 + b

   passing through (x0, g(x0)) and (x1, g(x1)). Let x2 be the zero of h. Then repeat with x1 and x2 to get x3, and so on.

   (a) Let g(x) = 2 ln(1 + x^2) − x, x0 = 5 and x1 = 6. Find x2 using this method.
   (b) Find a formula for x2 given any function g(x) and any initial conditions x0 and x1. Your formula should be in terms of x0, x1, g(x0), and g(x1).
   (c) Find a general formula for xn in terms of xn−2, xn−1, g(xn−2), and g(xn−1).
   (d) Write an Octave function that implements this method and prints out each iteration.
   (e) Use your Octave function to decide whether the order of convergence for this method is linear or superlinear.

24. Pick a function whose root(s) you know exactly. Use Müller’s method to find one of the roots. Use three consecutive iterations to estimate the order of convergence.

25. The errors in three consecutive iterations of Müller’s method are shown in the table. Use this information to estimate the order of convergence.

        n     |xn − x|
       12     1.53627(10)^−349
       13     1.67365(10)^−642
       14     1.83922(10)^−1181

26. The graph of f(x) is shown. Find distinct sets of values p0, p1, and p2 for which Müller’s method

   (a) will lead to a complex value for p3.
   (b) will lead to the root at x ≈ 4.4.
   (c) will lead to the root at x ≈ 2.8.

   [Graph of f over the interval [1, 6], taking values between about −3 and 4.]

27. The function shown in question 26 is f(x) = (x^2 − 7x + 10)/2 + sin(3x). Use this information to test your conjectures in question 26.
2.7. BRACKETING 91

2.7 Bracketing
Bisection is called a bracketed root-finding method. A root is known to lie within a certain interval. Each iteration
reduces the size of the interval and maintains the guarantee the root is within. At each step of the algorithm, the
root is known to be between the latest estimate and one of the previous. These bounds form a bracket around the
root. As the algorithm proceeds, the bracket decreases in size until it is smaller than some tolerance, at which point
the root is known to be close and the algorithm stops.
The problem with bisection is its linear order of convergence. Compared to superlinear methods like the secant
method and Newton’s method, the bisection method just creeps along. But the bisection method has something
the secant method and Newton’s method do not—certainty of convergence. Yes, the secant method and Newton’s
method are fast when they converge, but there is no guarantee they will converge at all.
Methods combining the virtues of the bisection method (guaranteed convergence) and some higher order method
(speed) are called safeguarded methods. They are guaranteed to converge and can do so quickly when the root is
near. Any superlinear method may be bracketed, producing a safeguarded method.

Bracketing
Bracketing means maintaining an interval in which a root is known to lie. Bracketing is used in the bisection method.
With each iteration, the root is known to lie between the two latest approximations. Bracketing is not used in the
secant method nor Newton’s method. There is no guarantee a root remains near the latest approximations.
It is not difficult, however, to combine the bisection method with the secant method or Newton’s method, or
any other high order method for that matter, to form a hybrid method where the root remains bracketed and there
is a chance for fast convergence. In such a method, a candidate for the next iteration is computed according to the
high order method. If this candidate lies within the bracket, it becomes the next iteration. If the candidate lies
outside the bracket, the bisection method is used to compute the next iteration instead.
Bracketed secant method, better known as the method of false position or regula falsi, provides an elementary
example. In fact, the high order method (the secant method) always produces a value inside the bracket, so checking
that point is not necessary. Where false position and the secant method differ is choosing which of the previous
two iterations to keep. In the secant method, it is always the latest iteration which is kept for the next. In false
position, the latest iteration which maintains a bracket about the root is kept for the next whether that iteration
is the latest or not. Bracketed Newton’s method provides a slightly more advanced example because it is entirely
possible an iteration of Newton’s method will land outside the bracket.
Take the function g(x) = 3 − x − sin(x) over the interval [2, 3]. g is continuous on [2, 3], and g(2) ≈ 0.09 and
g(3) ≈ −0.14 have opposite signs. Thus [2, 3] brackets a root of g, so let x0 = 2 and x1 = 3. The table shows the
computation of the next iteration for bracketed secant method and bracketed Newton’s method.

                            x0    x1    candidate x2                                            x2
    bracketed secant         2     3    x1 − g(x1)(x1 − x0)/(g(x1) − g(x0)) ≈ 2.3912          2.3912
    bracketed Newton’s       2     3    x1 − g(x1)/g′(x1) ≈ −11.101                            2.5

In bracketed secant, the candidate x2 is accepted, but in bracketed Newton’s method, the candidate x2 is outside
the bracket so it is discarded and x2 according to the bisection method (2.5) is taken instead.
To set up the next iteration, g(x2 ) is calculated. Since g(x2 ) is negative in both methods, the old x1 , which was
3, is discarded and x0 = 2 is “upgraded” to x1 in order to maintain the bracket. This way, g has opposite signs at
x1 and x2 . The following table demonstrates this decision process plus the computation of the next iteration.

                            g(x2)        x1     x2       candidate x3                                        x3
    bracketed secant        −0.073141      2    2.3912   x2 − g(x2)(x2 − x1)/(g(x2) − g(x1)) ≈ 2.2165      2.2165
    bracketed Newton’s      −0.098472      2    2.5      x2 − g(x2)/g′(x2) ≈ 2.0048                        2.0048

Can you fill in x4 based on the values in the following table? Notice the old x1 must be “upgraded” in bracketed
secant but not in bracketed Newton’s. Why? Answers on page 98.

                            g(x3)        x2     x3       candidate x4                                        x4
    bracketed secant        −0.015215      2    2.2165   x3 − g(x3)(x3 − x2)/(g(x3) − g(x2)) ≈ 2.1854        ?
    bracketed Newton’s       0.087906    2.5    2.0048   x3 − g(x3)/g′(x3) ≈ 2.1565                          ?

The next 5 iterations of each method are given here in case you would like to try your hand at computing a few.
And now is a good time to do so. These values were computed using the subsequent Octave code.

              bracketed secant       bracketed Newton’s
    x5        2.18062942638407       2.17925592233708
    x6        2.17988957044102       2.17975682599184
    x7        2.17977718322867       2.17975706647997
    x8        2.17976012038625       2.17975706648003
    x9        2.17975753008587       2.17975706648003

False position (bracketed secant method) Octave code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 20 May 2014 %
% Purpose: Implementation of the Method of %
% False Position. %
% INPUT: function g; initial values a and b; %
% tolerance TOL; maximum iterations N %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = falsePosition(g,a,b,TOL,N)
i=1;
A=g(a);
B=g(b);
while (i<N)
b
x=b-B*(b-a)/(B-A);
if (abs(x-b)<TOL)
return
end%if
X=g(x);
if ((B<0 && X>0) || (B>0 && X<0))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function

Bracketed Newton’s method Octave code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 20 May 2014 %
% Purpose: Implementation of bracketed Newton’s %
% method. %
% INPUT: function g; its derivative gp; initial %
% values a and b; tolerance TOL; maximum %
% iterations N %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedNewton(g,gp,a,b,TOL,N)
i=1;
A=g(a);

B=g(b);
while (i<N)
b
x=b-B/gp(b);
if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if
if (abs(x-b)<TOL)
return
end%if
X=g(x);
if ((B<0 && X>0) || (B>0 && X<0))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function

falsePosition.m and bracketedNewton.m may be downloaded at the companion website.


The code for bracketed secant method and bracketed Newton’s method are very similar. In fact, they are nearly
identical. There are only two differences besides the commentary at the beginning. Where bracketed secant has the
line x=b-B*(b-a)/(B-A);, bracketed Newton’s has the line x=b-B/gp(b);. This is the essential difference between
the two as this is where the high order method is executed. The only other difference is that bracketed Newton’s
includes three lines where it checks whether x lands within the bracket and executes one step of the bisection method
if not:

if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if

Actually, we could add these three lines to the bracketed secant method and it would run just the same. It is
impossible for the secant method to produce a value of x outside the bracket, so the bisection step would never be
executed. The only essential difference between the two functions is the execution of the high order method.
We can use this observation to create a sort of blueprint for bracketing any high order method. Steffensen’s,
Müller’s (as long as the approximation stays real), or Sidi’s (section 3.2), for example, can be bracketed this way.
The following pseudo-pseudo-code represents such a blueprint, giving guidance on how to safeguard a high order
method by combining it with bisection.

Assumptions: g is continuous on [a, b]. g(a) and g(b) have opposite signs.
Input: Interval [a, b]; function g; desired accuracy tol; maximum number of iterations N ; any other variables,
like g 0 in the case of Newton’s method, needed to iterate the superlinear method.
Step 1: Set A = g(a); B = g(b); i = 2;
Step 2: Initialize any other variables needed for superlinear();
Step 3: While i < N do Steps 4-10:
Step 4: Set x = superlinear(a, b, g, . . .);
Step 5: If (x − a)(x − b) > 0 then x = b + (a − b)/2;
Step 6: If |x − b| < tol then return x
Step 7: Set X = g(x);
Step 8: If BX < 0 then set a = b; A = B;
Step 9: Set b = x; B = X; i = i + 1;
Step 10: Update any other variables needed for superlinear();
Step 11: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation m within tol of exact root, or message of failure.
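
A minimal Octave sketch of this blueprint follows. It assumes the superlinear step is supplied as a function handle superlinearStep(a,b,g), a name introduced here for illustration only; methods needing extra state, such as Müller’s, would also need Steps 2 and 10 filled in.

function [x,i] = safeguarded(g,superlinearStep,a,b,TOL,N)
  A=g(a); B=g(b); i=2;                   % Step 1
  while (i<N)                            % Step 3
    x=superlinearStep(a,b,g);            % Step 4: high order candidate
    if ((x-a)*(x-b)>0)                   % Step 5: candidate outside the bracket?
      x=b+(a-b)/2;                       %         take a bisection step instead
    end%if
    if (abs(x-b)<TOL)                    % Step 6: close enough; done
      return
    end%if
    X=g(x);                              % Step 7
    if (B*X<0)                           % Step 8: keep a and b straddling the root
      a=b; A=B;
    end%if
    b=x; B=X; i=i+1;                     % Step 9
  end%while
  x="Method failed---maximum number of iterations reached";
end%function

With the handle @(a,b,g) b-g(b)*(b-a)/(g(b)-g(a)), for example, this sketch behaves like the falsePosition function given earlier.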

Figure 2.7.1: A troublesome function for the bracketed secant method.

As motivation for the need to develop bracketed versions of other high order methods, consider the particularly
problematic function g(x) = (1 + 10x)/(1 − 10x). It has a root at −1/10, but the bracketed secant method can be very slow to
converge to this root. Figure 2.7.1 illustrates this slow convergence beginning with the bracket [a, b] = [−4, .05].
With this unfortunate choice of bracket, the method takes 45 iterations to achieve 10−5 accuracy. A smarter
algorithm would not only check that each iterate lands within the brackets, but would also check to see that the
high order method is making quick progress toward the root. If it detected that convergence was slow, say slower
than bisection would be, it would take a bisection step instead. Note that bracketed Newton’s method does not
have a significant problem with this function. Given the same initial bracket, it converges to within 10−5 of the root
in only 10 iterations (the first 4 of which are bisection steps). Alas, Newton’s method requires use of the derivative.
A fast bracketed root-finding method that does not require knowledge of the derivative would be quite useful.
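
The slow convergence described above can be observed directly by handing this function to the falsePosition code given earlier (the exact iteration count may vary slightly with the stopping test):

g = @(x) (1+10*x)./(1-10*x);             % root at x = -1/10
[x,i] = falsePosition(g,-4,0.05,1e-5,100)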
In the early 1970s, Richard Brent built upon the work of van Wijngaarden and Dekker to produce a bracketed
method that combines bisection, the secant method, and inverse quadratic interpolation, all the while checking
to make sure the high order method is making sufficiently quick progress toward a root. The result is what is
now known as Brent’s method [3]. It does not require knowledge of the derivative. It is fast. It is guaranteed to
converge. Consequently, it is a popular all-purpose method for finding a root within a bracket when the derivative
is not accessible. The full details of Brent’s method will not be presented here, but a significant step toward that
method will. The method presented here is similar to the MATLAB function fzero [22].

Inverse Quadratic Interpolation

You may recall, in Müller’s method, three initial approximations, say a, b, and, c are needed. The parabola through
the points (a, g(a)), (b, g(b)), and (c, g(c)) is drawn and its intersection with the x-axis gives the next iteration. The
key elements of this method, the process of fitting a quadratic function to the three points, is called interpolation.
Thus Müller’s method could just as well be called the “quadratic interpolation method”.
As you may have guessed, the method of inverse quadratic interpolation is similar. Instead of fitting a quadratic
function to the points (a, g(a)), (b, g(b)), and (c, g(c)), the roles of x and y are reversed. A quadratic function is
fitted to the points (g(a), a), (g(b), b), and (g(c), c) instead. Since x is a function of y in this case, the quadratic
will cross the x-axis exactly once, when y = 0. Evaluating the quadratic at 0 gives the next iteration. Figure 2.7.2
shows quadratic interpolation and inverse quadratic interpolation on the same set of three points. In quadratic
interpolation, y is treated as a function of x. In inverse quadratic interpolation, x is treated as a function of
y. Inverse quadratic interpolation avoids the main complication of quadratic interpolation—calculating its x-axis
crossings. In quadratic interpolation, the quadratic may cross the x-axis twice or not at all! Either way, some choice
needs to be made at every step, and the roots of the quadratic involve the quadratic formula. In inverse quadratic
interpolation, the quadratic is guaranteed to cross the x-axis exactly once, and finding the crossing is just a matter
of evaluating the quadratic at 0. That is, y = 0. Remember, the quadratic gives x as a function of y.
Referring back to the derivation of Müller’s method on page 86, forcing the parabola to pass through the points
(a, A), (b, B), and (c, C), and swapping the roles of x and y, a formula for the inverse parabola, q, just falls out:

q(y) = q0 (y − B)2 + q1 (y − B) + q2

where

    q2 = b
    q1 = [(A − B)^2 (c − b) − (C − B)^2 (a − b)] / [(A − B)(C − B)(A − C)]
    q0 = [(C − B)(a − b) − (A − B)(c − b)] / [(A − B)(C − B)(A − C)].

Crumpet 18: Quadratic interpolation order of convergence

The method of inverse quadratic interpolation has order of convergence about 1.84 under reasonable assumptions.
If the function whose root is being determined has three continuous derivatives in a neighborhood of the root,
the latest three approximations are sufficiently close, and the root is simple, then the order of convergence is the
real solution of
α3 − α2 − α − 1 = 0.
We can use inverse quadratic interpolation to approximate it!

>> format('long')
>> f=inline('x^3-x^2-x-1')
f = f(x) = x^3-x^2-x-1
>> [res,i]=inverseQuadratic(f,1,2,10^-12,100)
res = 1.83928675521416
i = 8

The exact solution is

    α = (19/27 + √11/(3√3))^(1/3) + 4/(9 (19/27 + √11/(3√3))^(1/3)) + 1/3.

You may recognize this as the order of convergence for Müller’s method. Indeed, any quadratic interpolation
method converges to a simple root with this order.

Reference [29]

The x-axis crossing is, therefore,

    x = q(0)
      = B^2 q0 − B q1 + q2
      = B^2 [(C − B)(a − b) − (A − B)(c − b)] / [(A − B)(C − B)(A − C)]
          − B [(A − B)^2 (c − b) − (C − B)^2 (a − b)] / [(A − B)(C − B)(A − C)] + b
      = ( [B^2 (C − B) + B(C − B)^2](a − b) − [B^2 (A − B) + B(A − B)^2](c − b) ) / [(A − B)(C − B)(A − C)] + b
      = ( [−B^2 C + BC^2](a − b) − [−B^2 A + BA^2](c − b) ) / [(A − B)(C − B)(A − C)] + b
      = [BC(C − B)(a − b) − BA(A − B)(c − b)] / [(A − B)(C − B)(A − C)] + b
      = b − [(B/A − 1)(A/C)(c − b) + (C/B − 1)(B/A)(a − b)] / [(B/A − 1)(C/B − 1)(A/C − 1)],

the last step coming from dividing the numerator and denominator by ABC.

Figure 2.7.2: Quadratic and inverse quadratic interpolation.

To make the computation of x a little more programmer-friendly, some new variables are introduced. Let

    r = B/A − 1,    s = C/B − 1,    t = A/C − 1

so

    x = b − [r(t + 1)(c − b) + s(r + 1)(a − b)] / (rst).    (2.7.1)
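
As a quick numerical sanity check of formula (2.7.1), one can fit x as a quadratic in y through three sample points with polyfit and compare its value at y = 0 with the formula. The function and points below are made up purely for illustration; both computed values agree.

g = @(x) x.^3 - 2;                                   % simple test function
a=1; b=1.5; c=2;
A=g(a); B=g(b); C=g(c);
xq = polyval(polyfit([A B C],[a b c],2),0)           % inverse quadratic evaluated at y=0
r=B/A-1; s=C/B-1; t=A/C-1;
x  = b - (r*(t+1)*(c-b)+s*(r+1)*(a-b))/(r*s*t)       % formula (2.7.1); same value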

Inverse quadratic interpolation can be bracketed just like any other high order method. But it does present an
interesting question that not all high order methods do. Three points are necessary for a quadratic interpolation,
so when they are used to produce the next iteration, a fourth point is generated. Of the four points, the computer
needs to decide which two will become the next bracket, and which point should be the third needed for the next
interpolation. But we are getting ahead of ourselves.
Each iteration begins with three points, (a, g(a)), (b, g(b)), and (c, g(c)) where a and b bracket a root and c is a
third point. For the first iteration, only the bracket is given. c is set equal to a. For every iteration, the signs of
g(a) and g(b) are checked to ensure that a and b bracket a root. If they are opposite, the method proceeds. If they
are the same, that means g(b) and g(c) must have opposite signs, so a is set equal to c. Next, the absolute values of
g(a) and g(b) are checked. If |g(a)| < |g(b)|, the labels of a and b are switched and c is set equal to the new value of
a. After these initial checks, the computation of the next iteration begins with assurance that a root lies between
a and b; b is likely the best estimate of the root to date; and c is likely the worst estimate of the root to date.
If c = a after the initial checks and possible relabeling, then quadratic interpolation is impossible. The next
iteration is generated by the secant method (linear interpolation) instead. If c 6= a after the initial checks and
possible relabeling, a candidate for the next iteration, x, is calculated according to inverse quadratic interpolation.
If the candidate lies within the bracket, it is accepted as the next iteration. If it lies outside the bracket, a step
of the bisection method is used instead. In either case, c is set equal to b and b is set equal to x. For bracketed
inverse quadratic interpolation, this completes one iteration. The method is then repeated until a sufficiently good
approximation is found.
In the best-case scenario, inverse quadratic interpolation is used at every step and convergence is superlinear
with order about 1.84. In the worst-case scenario, one of the high order methods is used at every step, but the
function is pathological and convergence is slow, possibly even slower than bisection. Slow convergence is rare,
though, and the actual order of convergence can not be pinned down in general. The method switches between
methods of different orders. The best we can say is it is usually fast.

Bracketed inverse quadratic interpolation Octave code


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 21 May 2014 %
% Purpose: Implementation of bracketed inverse %
% quadratic interpolation method. %
% INPUT: function g; initial values a and b; %
% tolerance TOL; maximum iterations N0 %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedInverseQuadratic(g,a,b,TOL,N0)
i=1;

A=g(a);
B=g(b);
c=a; C=A;
while (i<N0)
b
if (B*A>0)
a=c; A=C;
end%if
if (abs(A) < abs(B))
c=b; C=B;
b=a; B=A;
a=c; A=C;
end%if
if (a==c)
x=(b*A-a*B)/(A-B);
else
r=B/A-1; s=C/B-1; t=A/C-1;
p=(t+1)*r*(c-b)+(r+1)*s*(a-b);
q=t*s*r;
x=b-p/q;
end%if
if (x<min([a,b]) || x>max([a,b]))
x=b+(a-b)/2;
end%if
if (abs(x-b)<TOL)
disp(" ");
return
end%if
c=b; C=B;
b=x; B=g(b);
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function

Applying the bracketed inverse quadratic interpolation method to the problematic function g(x) = (1 + 10x)/(1 − 10x) over the
interval [−4, .05] yields the result within 10^−5 accuracy in only 11 iterations. The method took only 1 iteration

more than bracketed Newton’s without requiring knowledge of the derivative of g! bracketedInverseQuadratic.m
may be downloaded at the companion website.

Stopping
In all of our root-finding methods, the algorithm stops when the difference between consecutive iterations is less
than some tolerance. This criterion is based on the assumption that the error will be no more than this difference.
And that is a safe assumption for any method that is converging superlinearly when it quits. Indeed, it is even
safe for the linearly converging bisection method where the difference between consecutive iterations is exactly the
theoretical bound on the error.
The criterion is not safe when a superlinear method is used far enough from a root that superlinear convergence
is not observed. This is exactly what happens in the figure on page 94. The difference between consecutive iterations
is actually larger than the absolute error at every step. This is an unusual situation, but it can happen.
The criterion is also not safe when a method is linearly convergent with a limiting convergence constant λ > 1/2.
However, linearly convergent methods should never be used on their own as there is always a faster alternative.
There is one more important consideration regarding stopping. Stopping when the difference between consecutive
iterations is less than some tolerance is dependent on the absolute error. When roots could be very small or very
large, it is perhaps better to use a criterion based on relative error. Instead of stopping when |xn+1 − xn | < tol, for
example, we would instead stop when |xn+1 − xn | < tol · |xn+1 |.

Key Concepts
Bracketing: Iteratively refining an interval, also known as the bracket, in which a root is known to lie until it is
small beyond some tolerance.
Inverse quadratic interpolation: A quadratic in y is fit to three consecutive approximations of a root. The
intersection of the quadratic with the x-axis becomes the next iteration.
Bracketed secant method: A combination of the secant method and bisection method employing bracketing. At
each iteration, if the secant method produces a value inside the current bracket, it becomes the next iteration.
Otherwise bisection is used to produce the next iteration.
False position: Another name for the bracketed secant method.
Regula falsi: Another name for the bracketed secant method.
Bracketed Newton’s method: A combination of Newton’s method and the bisection method employing brack-
eting. At each iteration, if Newton’s method produces a value inside the current bracket, it becomes the next
iteration. Otherwise bisection is used to produce the next iteration.
Bracketed inverse quadratic interpolation: A combination of inverse quadratic interpolation, the secant method,
and bisection employing bracketing. At each iteration, if inverse quadratic interpolation produces a value in-
side the current bracket, it becomes the next iteration. Otherwise either the secant method or bisection is
used to produce the next iteration.

Exercises

1. Use the bracketed secant method (false position) to find a root in the indicated interval, accurate to within 10^−2. [A]

   (a) f(x) = 3 − x − sin x; [2, 3]
   (b) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 1]
   (c) g(x) = 3x^4 − 2x^3 − 3x + 2; [0, 0.9] [S]
   (d) h(x) = 10 − cosh(x); [−3, −2]
   (e) f(t) = √(4 + 5 sin^2 t) − 2.5; [−600, −500] [A]
   (f) g(t) = 3t tan t/(1 − t^2); [3490, 3491]
   (g) h(t) = ln(3 sin t) − 3t/5; [1, 2]
   (h) f(r) = e^(sin r) − r; [−20, 20] [S]
   (i) g(r) = sin(e^r) + r; [−3, 3]
   (j) h(r) = 2 sin r/(−3 cos r); [1, 3] [A]

2. Repeat question 1 using bracketed Newton’s method. [S][A]

3. Repeat question 1 using the secant method. Compare your answer with that of false position. [S][A]

4. Repeat question 1 using Newton’s method. Compare your answer with that of bracketed Newton’s method. [S][A]

5. Repeat question 1 using Octave and a tolerance of 10^−6. [S][A]

6. Repeat question 2 using Octave and a tolerance of 10^−6. [S][A]

7. Repeat question 1 using Octave, bracketed inverse quadratic interpolation, and a tolerance of 10^−6. [S][A]

8. Compare the results of questions 5, 6, and 7. [A]

9. Write a bracketed Steffensen’s method Octave function. REMARK: Steffensen’s method is a fixed point finding method. It solves the equation f(x) = x, not f(x) = 0. So a proper bracket [a, b] is one for which (f(a) > a and f(b) < b) or (f(a) < a and f(b) > b). Geometrically, this means the points (a, f(a)) and (b, f(b)) are on opposite sides of the line f(x) = x, analogous to a root-finding bracket where the two points are on opposite sides of the line f(x) = 0.

10. Use your code from question 9 to repeat question 1 using Octave, bracketed Steffensen’s method, and a tolerance of 10^−6. Given that you are looking for a root of g(x), use f(x) = g(x) + x in your call to Steffensen’s method. [S][A]

11. Compare the results of questions 7 and 10. [A]

12. Rewrite the inverseQuadraticInterpolation Octave function so that it stops when the (approximated) relative error is less than the tolerance.

13. Use your code from question 12 to repeat question 1 with a tolerance of 10^−6. [S][A]

14. Compare the results of questions 7 and 13. [A]

Answers
x4 : In both methods, the candidate x4 is accepted since in each case, x4 is within the bracket formed by x2 and
x3 . So, for bracketed secant, x4 = 2.1854, and for bracketed Newton’s, x4 = 2.1565. x1 is upgraded to x2 in
bracketed secant because g(x3 ) is negative. g(x2 ) and g(x3 ) must have opposite signs in order to maintain
the bracket. x1 is not upgraded in bracketed Newton’s because g(x3 ) is positive.
Chapter 3
Interpolation

3.1 A root-finding challenge


We open this chapter by combining its content with that of the previous chapter. In the present chapter, we will
discuss interpolating functions (functions whose graphs must contain a prescribed set of points) and interpolation
(the exercise of finding such a function). In the previous chapter, we discussed approximating roots of functions by
numerical computation. Putting these ideas together in the present section, we present an interpolating function,
which we will call f , and challenge the reader to find all 6 roots of f , f 0 , and a particular antiderivative of f as
accurately and efficiently as possible. Graphs of the three functions and the definition of f follow. Should you
accept the challenge, be prepared to use all of what you know about root-finding and Octave. This problem is not
easily solved!
If you would like to get right to it, you can skip most of the content of this section. Use the three graphs and
the Octave code as a starting point to find the roots of F, f, and f′. The rest of the material is here to help you
understand the definition and construction of the functions, but is not prerequisite to taking the challenge.

The function f and its antiderivative


The function

[graph of a function on [0, 1] taking values roughly between −0.02 and 0.08],

which we will call F , could easily be mistaken for a cubic or higher degree polynomial, but it is far from so nice.
First, its domain is the interval [0, 1], so the graph shown is the entire graph. Second, it has but two derivatives.
Third, its definition is a touch unusual. More on that soon.
What we have here is the antiderivative of a fractal interpolating function. An interpolating function is a function
that contains a set of prescribed points. This one happens to be fractal in nature, thus a fractal interpolating
function. The fractal interpolating function, f , passes through

(0, .123), (.33, −.123), and (1, .5) (3.1.1)


in such a way that the graph shown is that of its antiderivative. The unusual nature of the definition of F is derived
from the unusual nature of the definition of f :

    f(x) = { f1 + c1 (x/α) + d1 f(x/α),                           0 ≤ x ≤ α
           { f2 + c2 ((x − α)/(1 − α)) + d2 f((x − α)/(1 − α)),   α ≤ x ≤ 1

where

    f1 = 8979/100000,    c1 = −34779/100000,   d1 = 27/100
    f2 = −75891/550000,  c2 = 317391/550000,   d2 = 67/550
    α = 33/100.

Crumpet 19: Fractal Interpolating Functions

Fractal interpolating functions are not restricted to passing through three points. Actually, three is the minimum.
In general, for n ≥ 3, suppose x1 < x2 < · · · < xn . The linear fractal interpolating function (there are other
types of fractal interpolating functions) passing through each of the points

(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )

and having domain [x1 , xn ] is defined by the linear transformations


      
       [x]   [ai  0 ][x]   [ei]
    Li [y] = [ci  di][y] + [fi] ,    i = 1, 2, . . . , n − 1.

The ai, ci, ei, and fi are calculated based on the requirement that the function interpolate the given points. In
particular, we require

       [x1]   [xi]             [xn]   [xi+1]
    Li [y1] = [yi]    and   Li [yn] = [yi+1] .

The di are free parameters with the restriction |di| < 1. It is a straightforward algebraic exercise to show

    ai = (xi+1 − xi)/(xn − x1)
    ci = (yi+1 − yi − di (yn − y1))/(xn − x1)
    ei = xi − ai x1
    fi = yi − ci x1 − di y1.

In concert, the Li define the function f, each Li responsible for the subset [xi, xi+1] of the domain.

       [x]   [ai x + ei       ]
    Li [y] = [ci x + di y + fi] ,

so as Li takes x to ai x + ei, it simultaneously takes y to ci x + di y + fi. Noting that Li takes this action on the
function f, we must have that f(ai x + ei) = ci x + di f(x) + fi on [x1, xn], or equivalently,

    f(x) = fi + ci ((x − ei)/ai) + di f((x − ei)/ai)   on [xi, xi+1].

Putting all the pieces together, f is defined by

    f(x) = { h1(x),      x1 ≤ x ≤ x2
           { h2(x),      x2 ≤ x ≤ x3
           {   ⋮
           { hn−1(x),    xn−1 ≤ x ≤ xn

where

    hi(x) = fi + ci ((x − ei)/ai) + di f((x − ei)/ai).

Consequently, F(x) = ∫_{x1}^{x} f(t) dt is defined by

    F(x) = { ∫_{x1}^{x} h1(t) dt,                      x1 ≤ x ≤ x2
           { F(x2) + ∫_{x2}^{x} h2(t) dt,              x2 ≤ x ≤ x3
           {   ⋮
           { F(xn−1) + ∫_{xn−1}^{x} hn−1(t) dt,        xn−1 ≤ x ≤ xn

without qualification, and f′(x) is defined by

    f′(x) = { h1′(x),      x1 ≤ x ≤ x2
            { h2′(x),      x2 < x ≤ x3
            {   ⋮
            { hn−1′(x),    xn−1 < x ≤ xn

as long as f′ exists! If di /ai < 1 for all i, then the derivative will exist almost everywhere, but will generally be
discontinuous. If we also have h0i (xi+1 ) = h0i+1 (xi+1 ) for all i = 1, 2, . . . , n − 2, then the derivative will exist and
will be continuous.

Reference [2, Chapter 6]
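
The coefficient formulas in this crumpet translate directly into a few lines of Octave. The sketch below uses a name, fifCoefficients, introduced here for illustration; it is not part of the text’s code. For the data and d1, d2 of this section it reproduces constants consistent with those given above.

% Coefficients of the maps L_i for a linear fractal interpolating function.
% INPUT: data x, y (increasing x); free parameters d with |d(i)| < 1.
% OUTPUT: arrays a, c, e, f as in the formulas above.
function [a,c,e,f] = fifCoefficients(x,y,d)
  n = length(x);
  a = zeros(1,n-1); c = a; e = a; f = a;
  for i = 1:n-1
    a(i) = (x(i+1)-x(i))/(x(n)-x(1));
    c(i) = (y(i+1)-y(i)-d(i)*(y(n)-y(1)))/(x(n)-x(1));
    e(i) = x(i)-a(i)*x(1);
    f(i) = y(i)-c(i)*x(1)-d(i)*y(1);
  end%for
end%function

For example, fifCoefficients([0 .33 1],[.123 -.123 .5],[27/100 67/550]) returns c = [−.34779, .57707...] and f = [.08979, −.13798...], matching c1, c2, f1, and f2 of this section.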

The definition of f is self-referential. Its values are defined by, among other terms, values of itself! This makes
evaluating the function a bit different from evaluating a typical function. For example, by virtue of the fact that f
passes through the points 3.1.1, we must have f (0) = .123, f (.33) = −.123, and f (1) = .5, facts we can check easily
enough. According to the definition,
f (0) = f1 + d1 f (0) = .08979 + .27f (0)
so f (0) is defined in part by itself. We need to solve the equation f (0) = .08979 + .27f (0) to find f (0). Thus we
have f(0) = .08979/.73 = .123, as promised. Again according to the definition,

    f(1) = f2 + c2 + d2 f(1) = −75891/550000 + 317391/550000 + (67/550) f(1).

Solving for f(1), we have f(1) = (−75891/550000 + 317391/550000)/(1 − 67/550) = 1/2, as promised. Since α = .33, the
definition actually gives two ways to calculate f(.33). According to the first part of f,

    f(.33) = f(α) = f1 + c1 + d1 f(1)
           = 8979/100000 − 34779/100000 + (27/100) · (1/2)
           = −.123.
Now is a good time to verify that f (α) = −.123 according to the second part of f as well. Try it! Calculating other
values of f can be a bit more challenging, but there are still a few that are not so bad. α2 < α and α + (1 − α)α > α,
so
    f(α^2) = f1 + c1 α + d1 f(α)
           = 8979/100000 − (34779/100000) · (33/100) + (27/100) · (−123/1000)
           = −.0581907
    f(α + (1 − α)α) = f2 + c2 α + d2 f(α)
           = −75891/550000 + (317391/550000) · (33/100) + (67/550) · (−123/1000)
           = 2060703/55000000
           = .037467327

With a similar level of difficulty, you can now calculate

f (α3 ), f (α(α + (1 − α)α)), f (α + (1 − α)α2 ),


and f (α + (1 − α)(α + (1 − α)α)).

Answers on page 105. More generally, once you have calculated f(x) for some value x, you can then calculate f(αx)
and f(α + (1 − α)x) from it.
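
In Octave, this propagation is only a couple of lines. The following sketch uses the exact fractions for the constants; the two values it prints match the two computed above.

f1=8979/100000; c1=-34779/100000; d1=27/100;
f2=-75891/550000; c2=317391/550000; d2=67/550;
t = 33/100;  ft = -123/1000;          % known: f(alpha) = -.123
fleft  = f1 + c1*t + d1*ft            % f(alpha*t),           here f(alpha^2)
fright = f2 + c2*t + d2*ft            % f(alpha+(1-alpha)*t), here f(alpha+(1-alpha)*alpha)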
Now that we have a handle on f, we define F by F(x) = ∫_0^x f(t) dt for all x ∈ [0, 1]. Integrating f(x) we have

    F(x) = { f1 x + c1 x^2/(2α) + α d1 F(x/α),                                              0 ≤ x ≤ α
           { F(α) + f2 (x − α) + c2 (x − α)^2/(2(1 − α)) + (1 − α) d2 F((x − α)/(1 − α)),   α ≤ x ≤ 1

where again both formulas are applicable when x = α. Just like f , F is self-referential. We must go through the same
process in finding values of F as we did finding values of f . To get started, F (0) = αd1 F (0) ⇒ (1 − αd1 ) · F (0) = 0,
but α and d1 are both less than 1, so 1 − αd1 ≠ 0. Therefore,

    F(0) = 0/(1 − αd1) = 0.
We could have computed this value by integration just as well: F(0) = ∫_0^0 f(t) dt = 0. Now, according to the
formula,

    F(1) = F(α) + (1 − α)(f2 + c2/2 + d2 F(1))

and

    F(α) = α(f1 + c1/2 + d1 F(1)),
a system of two equations in the two unknowns, F (α) and F (1). Its solution is
    F(α) = −121012947/6081400000 ≈ −.01989886325517151
    F(1) = 5361861/60814000 ≈ 0.0881682014009932.
Now that we have the few values, F (0), F (α), and F (1), we can calculate others as before. The values F (αx) and
F (α + (1 − α)x) will both depend on the value of F (x). So we can compute F (α2 ) and F (α + (1 − α)α):

    F(α^2) = f1 α^2 + c1 α^3/2 + α d1 F(α)
           = 10678194456039/6081400000000000
           ≈ .001755877668964219
    F(α + (1 − α)α) = F(α) + f2 (1 − α)α + c2 (1 − α)α^2/2 + (1 − α) d2 F(α)
           = −94196657189979/3040700000000000
           ≈ −.03097860926430723.

Now you can calculate F (α3 ), F (α(α + (1 − α)α)), F (α + (1 − α)α2 ), and F (α + (1 − α)(α + (1 − α)α)) yourself.
Answers on page 105. You shouldn’t worry about calculating these values exactly. That would require a computer
algebra system with arbitrary precision and is not really the point. The point is to make sure you understand how
to do the calculations. Use a calculator or Octave and the approximate values already calculated.
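
A sketch in the same spirit as the earlier snippet, this time for F, using F(α) from above (approximate values are fine; the point is the recursion, not exactness):

f1=8979/100000; c1=-34779/100000; d1=27/100;
f2=-75891/550000; c2=317391/550000; d2=67/550;
alpha=33/100;
t = alpha;  Ft = -121012947/6081400000;                  % known: F(alpha)
Fleft  = alpha*(f1*t + c1*t^2/2 + d1*Ft)                 % F(alpha*t) = F(alpha^2)
Fright = Ft + (1-alpha)*(f2*t + c2*t^2/2 + d2*Ft)        % F(alpha+(1-alpha)*t)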

The derivative of f and more graphs


The function f has a continuous derivative. In fact, the parameters defining f were specifically chosen so the
derivative would exist and be continuous. Differentiating f gives us

    f′(x) = { c1/α + (d1/α) f′(x/α),                            0 ≤ x ≤ α
            { c2/(1 − α) + (d2/(1 − α)) f′((x − α)/(1 − α)),    α ≤ x ≤ 1

Figure 3.1.1: Graph of f .



Figure 3.1.2: Graph of f 0 .



and we can check as before that the definition is consistent when x = α:


    f′(0) = c1/α + (d1/α) f′(0)  ⇒  f′(0) = c1/(α − d1) = −11593/2000 = −5.7965
    f′(1) = c2/(1 − α) + (d2/(1 − α)) f′(1)  ⇒  f′(1) = c2/(1 − α − d2) = 105797/100500 ≈ 1.052706467661692
    f′(α) = c1/α + (d1/α) f′(1) = −141949/737000 ≈ −.1926037991858887
    f′(α) = c2/(1 − α) + (d2/(1 − α)) f′(0) = −141949/737000 ≈ −.1926037991858887.
Other values of f 0 can be computed as done for f and F . The graphs of f and f 0 are shown in Figures 3.1.1 and
3.1.2.
That’s it. Now see if you can find the roots of the three functions. The following Octave code will help you
evaluate the functions at any points, a real time saver!

Octave
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 19 February 2014 %
% Purpose: Calculate values of the fractal interpolating %
% function, f, passing through %
% (0,f_0), (alpha,f_alpha), and (1,f_1), %
% its derivative and its integral. %
% INPUT: value at which to evaluate, x; array of values, %

% f = [f_0,f_alpha,f_1]; alpha; scaling factors %


% d1 and d2. %
% OUTPUT: y=f’(x); yy=f(x); yyy=F(x). %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,yy,yyy] = fractalInterpolator(x,f,alpha,d1,d2)
f1=f(1)*(1-d1);
c1=f(2)-d1*f(3)-f1;
f2=f(2)-d2*f(1);
c2=(1-d2)*f(3)-f2;
F1=(alpha*(f1+c1/2)+(1-alpha)*(f2+c2/2))/(1-(1-alpha)*d2-alpha*d1);
FA=alpha*(f1+c1/2+d1*F1);
l=0;
r=1;
a=[];
if (alpha>1/2)
its=floor(log(10^-16)/log(alpha));
else
its=floor(log(10^-16)/log(1-alpha));
end%if
for i=1:its
if (alpha>1/2)
h = (r-l)*alpha;
m = l+h;
else
h = (r-l)*(1-alpha);
m = r-h;
end%if
if (x<m)
a(i)=0;
r=m;
else
a(i)=1;
l=m;
end%if
end%for
x=0;
y=c1/(alpha-d1);
yy=f(1);
yyy=0;
for i=its:-1:1
if (a(i)==0)
y=(c1+d1*y)/alpha;
yy=c1*x+d1*yy+f1;
yyy=alpha*(f1*x+c1/2*x*x+d1*yyy);
x=alpha*x;
else
y=(c2+d2*y)/(1-alpha);
yy=c2*x+d2*yy+f2;
yyy=FA+(1-alpha)*(f2*x+c2/2*x*x+d2*yyy);
x=alpha+(1-alpha)*x;
end%if
end%for
end%function

fractalInterpolator.m may be downloaded at the companion website.



Answers
Evaluating f : The following are a few values of f :

f (α3 ) ≈ .03620418000000000
f (α(α + (1 − α)α)) ≈ −.09176089063636364
f (α + (1 − α)α2 ) ≈ −.08222890363636364
f (α + (1 − α)(α + (1 − α)α)) ≈ .1846063473223140.

Evaluating F : The following are a few values of F :

F (α3 ) ≈ .002702687013731212
F (α(α + (1 − α)α)) ≈ −.003859289400223274
2
F (α + (1 − α)α ) ≈ −.02753062961856850
F (α + (1 − α)(α + (1 − α)α)) ≈ −.01466250212441314.

3.2 Lagrange Polynomials


A function that is required to have a graph passing through some set of prescribed points is called an interpolating
function, and we say that such a function interpolates the prescribed points. Further, the exercise of finding such
a function is called interpolation.
In exercise 3a of section 2.5, you are asked to find a polynomial with roots at −7, 2, and 1 ± 5i (and no others).
The function, therefore, must be a polynomial and have a graph passing through the points
(−7, 0), (2, 0), (1 + 5i, 0), and (1 − 5i, 0). (3.2.1)
In retrospect, then, the question could have been phrased as: find a polynomial passing through the points 3.2.1
(and not having any roots besides −7, 2, 1 + 5i, and 1 − 5i), a question of interpolation. We now expand upon this
idea by considering polynomials with graphs passing through points with arbitrary ordinates (not just 0).
We start on familiar ground. The polynomial p(x) = (x + 7)(x − 2) has roots −7 and 2 so has a graph passing
through (−7, 0) and (2, 0). Suppose we want to modify p so it also passes through (−1, 1). That is, we want
p(−7) = 0, p(−1) = 1, and p(2) = 0. Beginning with p(x) = (x + 7)(x − 2), we already have p(−7) = 0 and p(2) = 0,
so really we only need to concentrate on p(−1) = 1. As is, p(−1) = (−1 + 7)(−1 − 2) = 6(−3) = −18, a far cry
from 1. But p(x) = (x + 7)(x − 2) is not the only polynomial passing through (−7, 0) and (2, 0). Let a be any
real number and note that q(x) = a(x + 7)(x − 2) also passes through (−7, 0) and (2, 0). If we choose a such that
q(−1) = 1, we have the desired function:
    q(−1) = a(−1 + 7)(−1 − 2) = −18a = 1  ⇒  a = −1/18.

q(x) = −(1/18)(x + 7)(x − 2) passes through all three of the points, (−7, 0), (2, 0), and (−1, 1). But let us not lose
sight of whence this came. −1/18 = 1/p(−1), so, actually, the desired function can be written as q(x) = p(x)/p(−1). Indeed,
q(−7) = p(−7)/p(−1) = 0, q(2) = p(2)/p(−1) = 0, and q(−1) = p(−1)/p(−1) = 1.

Now suppose we want a polynomial passing through (−7, 0), (2, 0), and (−1, 2). As before, we know p(x) =
(x + 7)(x − 2) has the desired roots and q(x) = p(x)/p(−1) has the nice feature that q(−1) = 1. We use these two facts
to come up with an answer. In fact, without doing any calculation, we know the polynomial

    l(x) = (p(x)/p(−1)) · 2

is the desired function. Take a moment to check that l(−7) = 0, l(2) = 0, and l(−1) = 2, and understand its
construction. This idea is the seed for what is called the Lagrange form of interpolating polynomials.
We are now ready to let the ordinates fly! Suppose we would like a polynomial passing through (−7, y1 ),
(2, y2 ), and (−1, y3 ). We know the polynomial p3 (x) = (x + 7)(x − 2) has zeros at −7 and 2, so the polynomial
l3(x) = [p3(x)/p3(−1)] y3 has zeros at −7 and 2 and, conveniently, l3(−1) = y3. This is a good first step. It has the correct
ordinate at −1 and zeros at −7 and 2. Similarly, we can construct the polynomial p2(x) = (x + 7)(x + 1) with
zeros at −7 and −1, from which we can construct the polynomial l2(x) = [p2(x)/p2(2)] y2 with zeros at −7 and −1 and,
conveniently, l2(2) = y2. This is a good second step. It has the correct ordinate at 2 and zeros at −7 and −1. Now
consider the sum (l3 + l2). l3(−1) = y3 and l2(−1) = 0, so (l3 + l2)(−1) = y3. Similarly, l3(2) = 0 and l2(2) = y2, so
(l3 + l2)(2) = y2. Moreover, (l3 + l2)(−7) = 0. We now have a polynomial passing through two of the three required
points and having a zero at the abscissa of the third point. If we had a polynomial with the correct ordinate at −7
and zeros at 2 and −1, we could add it to the sum and be done. But this is exactly the type of polynomial we have
been constructing! We let p1(x) = (x + 1)(x − 2) and l1(x) = [p1(x)/p1(−7)] y1, and note that l1 has the correct ordinate at
−7 and zeros at 2 and −1, just as we needed. Finally, the desired polynomial is (l1 + l2 + l3 ). Table 3.1 summarizes
the construction.
And now we are ready for complete generalization. Suppose n ≥ 1 and x0, x1, . . . , xn are n + 1 distinct real numbers.
We use the notation Pn (x) for the polynomial of least degree interpolating the points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ).
Setting pi(x) = ∏_{j=0, j≠i}^{n} (x − xj) = (x − x0) · · · (x − xi−1)(x − xi+1) · · · (x − xn), one formula for Pn is

    Ln(x) = Σ_{i=0}^{n} [pi(x)/pi(xi)] yi.    (3.2.2)

Table 3.1: A polynomial passing through (−7, y1 ), (2, y2 ), and (−1, y3 ).


     x     l1(x) = [p1(x)/p1(−7)] y1     l2(x) = [p2(x)/p2(2)] y2     l3(x) = [p3(x)/p3(−1)] y3     (l1 + l2 + l3)(x)
    −7                y1                             0                              0                        y1
     2                 0                            y2                              0                        y2
    −1                 0                             0                             y3                        y3

As written, Ln is called the Lagrange form of Pn . For sake of brevity, it is often called the Lagrange interpolating
polynomial, or even Lagrange polynomial. However, the interpolating polynomial of least degree by any other name
would be but Pn . We will adhere to the practice of calling it the interpolating polynomial of least degree, or use
the notation Pn , when the form is unimportant and will add the phrase Lagrange form, or use the notation Ln ,
when it is.
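
A direct Octave translation of formula (3.2.2) is short. The sketch below, under the hypothetical name lagrangeEval (not part of the text’s code), evaluates the Lagrange form at the points in x given nodes xs and ordinates ys:

% Evaluate the Lagrange form L_n at the points in x.
function y = lagrangeEval(xs,ys,x)
  n = length(xs);
  y = zeros(size(x));
  for i = 1:n
    p = ones(size(x));
    for j = [1:i-1, i+1:n]
      p = p .* (x - xs(j)) / (xs(i) - xs(j));   % builds p_i(x)/p_i(x_i)
    end%for
    y = y + ys(i)*p;
  end%for
end%function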
The main use for interpolating polynomials in numerical analysis is to approximate non-polynomial functions in
the following way. Suppose we know the value of f at a selection of points. That is, we know f(x0) = y0, f(x1) =
y1, . . . , f(xn) = yn and perhaps not much more. The interpolating polynomial of least degree passing through the
n + 1 points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn )
will, by construction, agree with f at x0 , x1 , . . . , xn and we can say with some precision how closely this interpolating
polynomial agrees with f at other points as well. The values of the interpolating polynomial at these “other points”
are what we refer to as approximations of the non-polynomial function.
Setting a = min(x0 , . . . , xn , x) and b = max(x0 , . . . , xn , x), we have the following result. If f has n+1 derivatives
on (a, b) and f, f 0 , f 00 , . . . , f (n) are all continuous on [a, b], then there is a value ξx ∈ (a, b) such that

    f(x) − Pn(x) = [f^(n+1)(ξx)/(n + 1)!] (x − x0)(x − x1) · · · (x − xn).    (3.2.3)

Ironically, this result is proven by considering the Lagrange form of an interpolating polynomial in t that is equal
to the error at x and equal to zero at each xi . That polynomial is

    Λ(t) = [Pn(x) − f(x)] · [(t − x0)(t − x1) · · · (t − xn)] / [(x − x0)(x − x1) · · · (x − xn)].

Crumpet 20: Λ

Λ is the (capital) eleventh letter of the Greek alphabet and is pronounced lam-duh . The lower case version, λ,
appears much more commonly in mathematics and often represents an eigenvalue.

Subtracting this polynomial from the error, e(t) = Pn (t) − f (t), we have a function,

g(t) = e(t) − Λ(t),

that is zero for all t = x0 , x1 , . . . , xn , x. Since g, g 0 , . . . , g (n) are all continuous on [a, b] and g (n+1) exists on (a, b),
by Generalized Rolle’s Theorem, there is a value ξx ∈ (a, b) such that g (n+1) (ξx ) = 0. On the other hand,

    g^(n+1)(ξx) = e^(n+1)(ξx) − Λ^(n+1)(ξx)
                = [Pn^(n+1)(ξx) − f^(n+1)(ξx)] − Λ^(n+1)(ξx),

and Pn is a polynomial of degree at most n. Hence, Pn^(n+1)(t) = 0 for all t and we have g^(n+1)(ξx) = −f^(n+1)(ξx) −
Λ^(n+1)(ξx) = 0. It follows that

    f^(n+1)(ξx) = −Λ^(n+1)(ξx).

But, Λ is a polynomial of degree n + 1 in t, so its (n + 1)st derivative with respect to t is constant with respect to
t. We write Λ as
    Λ(t) = [ (Pn(x) − f(x)) / ((x − x0)(x − x1) · · · (x − xn)) ] t^(n+1) + bn t^n + · · · + b0 t^0

for some constants bn, bn−1, . . . , b0, and consequently,

    Λ^(n+1)(t) = [ (Pn(x) − f(x)) / ((x − x0)(x − x1) · · · (x − xn)) ] · (n + 1)!,

and we have, by substitution,

    f^(n+1)(ξx) = [ (f(x) − Pn(x)) / ((x − x0)(x − x1) · · · (x − xn)) ] · (n + 1)!

or, equivalently,

    [f^(n+1)(ξx)/(n + 1)!] (x − x0)(x − x1) · · · (x − xn) = f(x) − Pn(x)
as desired.
Figure 3.2.1 shows interpolating polynomials for three different functions. The x-coordinates of the prescribed
points are the same for each interpolating polynomial. The x-coordinates are

0, .1951846177977887, .3554400571592862, .4823905248516196, .9138095996128959, and 1.

The four numbers between 0 and 1 were selected by a random number generator. The interpolating polynomial
closely resembles the function only in the first case. The sixth derivative of f helps explain why.
Our error term,
    [f^(6)(ξ)/6!] (x − x0)(x − x1) · · · (x − x5)

implies that the sixth derivative of f and the polynomial h(x) = (x − x0)(x − x1) · · · (x − x5)/6! determine how much f and
L6 will differ. By bounding both f (6) and |h| over the interval [0, 1], we can get a bound on the difference between
f and L6 . The graphs of f (6) are shown in Figure 3.2.1. The graph of h is

[graph of h over [0, 1], with values on the order of 10^−6]

so max_{x∈[0,1]} |h(x)| occurs around 0.75. We can use a root-finding method applied to h′ to find that the maximum
of |h| is approximately h(.7409254943919) ≈ 2.506891519629(10)^−6, a relatively small number. On the other hand,
for f(x) = e^(sin((x+1)^2)), we find max_{x∈[0,1]} |f^(6)(x)| ≈ f^(6)(.6777170541644) ≈ 44013.74605321, a relatively large
number. Their product,

    max_{x∈[0,1]} |h(x)| · max_{x∈[0,1]} |f^(6)(x)| ≈ .11,

gives a bound on the error. The absolute furthest L6 can be from f over the interval [0, 1] is 0.11, a relatively small
number. The actual error is considerably smaller, so can barely be noticed in the top left graph of Figure 3.2.1.

Figure 3.2.1: Three interpolating functions. From top to bottom, e^(sin((x+1)^2)), sin(e^((x+1)^2)), and a fractal function
as defined in section 3.1. f is shown in black and the interpolant, L6, in red.


[Left column of panels: f(x) and L6(x). Right column of panels: f^(6)(x).]

 
For f(x) = sin(e^((x+1)^2)), we find max_{x∈[0,1]} |f^(6)(x)| ≈ f^(6)(1) ≈ 8.552147927657737(10)^13, a relatively large
number. This time the product,

    max_{x∈[0,1]} |h(x)| · max_{x∈[0,1]} |f^(6)(x)| ≈ 2.1439307114460004(10)^8,

is a huge number relative to the values of f . So the theoretical error bound does not predict good results for
this interpolation. In fact, it suggests that the interpolation could have been much, much worse! L6 might have
differed from f by over 2 million, a fact that should be worrisome considering f takes values between −1 and 1. An
approximation that is off by even 1 is completely useless for this particular f . As it is, we should not be surprised
that L6 is not a good approximation of f since the error term can be quite large. Nonetheless, the method is sound.
Failure to approximate f well should not be seen as a flaw in the method, but rather a flaw in its application. If
we really wanted to approximate f well, we would need to find a different set of points over which to interpolate.
For the fractal function in the bottom left of Figure 3.2.1, our error estimate is entirely irrelevant. The sixth
derivative of f does not exist. In fact, even the first derivative of f does not exist. We have no way to estimate
the error except to look at the graphs. And as we see, L6 again does a very poor job of approximating f . Failure,
again, should not be seen as a flaw in the method, but rather in its application. Approximating a function with an
interpolating polynomial presumes that the function has sufficient derivatives.

Crumpet 21: Bernstein polynomials

Suppose f is a continuous function on the interval [0, 1], and define the polynomial
    Bn(x) = Σ_{ν=0}^{n} f(ν/n) (n choose ν) x^ν (1 − x)^(n−ν),   n = 1, 2, 3, . . .

Then

    lim_{n→∞} Bn(x) = f(x)

uniformly. That is, limn→∞ max{|Bn (x) − f (x)| : x ∈ [0, 1]} = 0. The Bn are Bernstein polynomials. Shown
below are B4 , B20 , B100 , and B500 for the fractal function in figure 3.2.1.

[Graphs of B4, B20, B100, and B500 on [0, 1], with values roughly between 0 and 0.4.]
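
A sketch of the construction in Octave, practical for modest n (nchoosek loses precision for large n, so the B100 and B500 shown above would need more care):

% Bernstein polynomial B_n of f evaluated at the points in x.
function y = bernstein(f,n,x)
  y = zeros(size(x));
  for nu = 0:n
    y = y + f(nu/n)*nchoosek(n,nu)*x.^nu.*(1-x).^(n-nu);
  end%for
end%function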

An application of interpolating polynomials


Again we find ourselves connecting the content of the previous chapter with that of the current. The secant method
is actually an application of interpolating polynomials to root-finding. The secant line whose slope is used to

calculate any given iteration can be viewed as an interpolating line! It passes through two points lying on g. Hence,
it is an approximation of g.
Having taken this point of view, we can now imagine generalizing the method by using the derivative of a higher
degree interpolating polynomial to approximate g 0 at each step. Such a generalized method, which we will call
Sidi’s k th degree method [30], is summarized by the formula
    xn+1 = xn − g(xn)/p′n,k(xn)
where pn,k is the interpolating polynomial passing through the points
(xn , g(xn )), (xn−1 , g(xn−1 )), . . . , (xn−k , g(xn−k )).
When k = 1, this is exactly the secant method. When k = 2, this method uses the same parabola as does Müller’s
method, but in a different way. In Müller’s method, the next iteration is found by locating a root of the interpolating
polynomial. In this method, the next iteration is found by locating a root of a tangent line to the interpolating
polynomial.
As k increases, more initial values are needed, but the order of convergence increases as a benefit. Letting αk
be the order of convergence of Sidi’s k th degree method, we have α1 = (1 + √5)/2 ≈ 1.618, the order of convergence of
the secant method, and
α2 ≈ 1.839, α3 ≈ 1.928, α4 ≈ 1.966.
For any k, Sidi’s method has an order of convergence less than 2 (the order of convergence of Newton’s method)
but it approaches 2 as k increases.
At this point, you might wonder just how practical such a method might be. After all, calculating a new
Lagrange interpolating polynomial and evaluating its derivative at each step can be a cumbersome process. We will
take up this issue in the next section.
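
For experimentation in the meantime, one step of the method can be written with Octave’s polynomial helpers. This is a sketch only, not an efficient implementation, and it assumes g accepts a vector argument:

% One step of Sidi's k-th degree method: x holds the last k+1 iterates.
function xnew = sidiStep(g,x)
  k = length(x)-1;
  p = polyfit(x,g(x),k);                          % interpolating polynomial p_{n,k}
  xnew = x(end) - g(x(end))/polyval(polyder(p),x(end));
end%function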

Neville’s Method
The Lagrange form of an interpolating polynomial is as convenient as it gets for a human. With a little care and
patience, it is possible to write down such a polynomial without even the aid of a calculator. However, adding
points to the interpolation and evaluating the polynomial for non-interpolated points can be cumbersome tasks.
Consider a simple example: the polynomial interpolating f(x) = e^x at x = 0, 1, 2:

    L2(x) = [(x − 1)(x − 2)/((0 − 1)(0 − 2))] e^0 + [(x − 0)(x − 2)/((1 − 0)(1 − 2))] e^1 + [(x − 0)(x − 1)/((2 − 0)(2 − 1))] e^2
          = (x − 1)(x − 2)/2 + [x(x − 2)/(−1)] e + [x(x − 1)/2] e^2.
Evaluating L2(1.5), for example, requires either

1. computing the values of the three separate terms, each a quadratic polynomial, and adding:

    L2(1.5) = (1.5 − 1)(1.5 − 2)/2 + [1.5(1.5 − 2)/(−1)] e + [1.5(1.5 − 1)/2] e^2
            = −.125 + .75e + .375e^2
            ≈ 4.684607408443278
or
2. the unpleasant business of simplifying L2 into a simpler form and then evaluating:

    L2(x) = (x − 1)(x − 2)/2 + [x(x − 2)/(−1)] e + [x(x − 1)/2] e^2
          = (1/2)(x^2 − 3x + 2) − e(x^2 − 2x) + (e^2/2)(x^2 − x)
          = (1/2 − e + e^2/2) x^2 + (−3/2 + 2e − e^2/2) x + 1
          ≈ 1.47624622100628x^2 + 0.242035607452765x + 1

   so L2(1.5) ≈ 1.47624622100628(1.5)^2 + 0.242035607452765(1.5) + 1 = 4.684607408443277.

Method 2 is better if you have more points at which to evaluate, and method 1 is better if you plan to add points
of interpolation. However, neither method is particularly convenient. Even less convenient than evaluating the
polynomial is the task of requiring another point of interpolation. Previous work is of limited use. And we haven’t
even begun to discuss the trouble of writing a computer program to automate the calculations. Neville’s method
can be used to overcome these limitations when the value of the polynomial at a specific point is required.
Neville’s method is based on the observation that interpolating polynomials can be constructed recursively.
Suppose Pk,l is the polynomial of degree at most l interpolating the data

(xk , f (xk )), (xk+1 , f (xk+1 )), . . . , (xk+l , f (xk+l )).

Then, by definition, P0,n is the polynomial of degree at most n interpolating the data

(x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn )).

Moreover, P0,n can be computed using the recursive formula

    Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1)
    Pi,0(x) = f(xi),   i = 0, . . . , n.                                        (3.2.4)

This claim can be checked by noting five things:

1. Pi,0 is the degree 0 polynomial interpolating the one datum (xi , f (xi )).

2. Pi,m and Pi+1,m are polynomials of degree at most m, so Pi,m+1 is a polynomial of degree at most m + 1.

3. Pi,m+1(xi) = (xi − xi+m+1)Pi,m(xi) / (xi − xi+m+1) = Pi,m(xi) = f(xi).

4. For any j = i + 1, . . . , i + m,

    Pi,m+1(xj) = [(xj − xi+m+1)Pi,m(xj) − (xj − xi)Pi+1,m(xj)] / (xi − xi+m+1)
               = [(xj − xi+m+1)f(xj) − (xj − xi)f(xj)] / (xi − xi+m+1)
               = f(xj) [(xj − xi+m+1) − (xj − xi)] / (xi − xi+m+1)
               = f(xj).

5. Pi,m+1(xi+m+1) = −(xi+m+1 − xi)Pi+1,m(xi+m+1) / (xi − xi+m+1) = Pi+1,m(xi+m+1) = f(xi+m+1).

A rigorous proof by induction on m, requested in the exercises, should follow closely these notes. Points 1
and 2 establish that Pk,l has degree at most l. Points 3 through 5 establish that Pk,l interpolates the points
(xk , f (xk )), (xk+1 , f (xk+1 )), . . . , (xk+l , f (xk+l )). Formula 3.2.4 succinctly summarizes Neville’s method.
While Neville’s method (formula 3.2.4) can be used to find formulas for interpolating polynomials as in

    P0,1(x) = [(x − x1)P0,0(x) − (x − x0)P1,0(x)] / (x0 − x1)
            = [(x − x1)/(x0 − x1)] f(x0) + [(x − x0)/(x1 − x0)] f(x1),

it is normally used to find the value of an interpolating polynomial at a specific point. We earlier determined that L2(1.5) = 4.684607408443277 for the polynomial, L2(x), interpolating f(x) = e^x at x = 0, 1, 2. We now find this value using Neville’s method. P0,0(1.5) = f(0) = 1, P1,0(1.5) = f(1) ≈ 2.718281828459045, and P2,0(1.5) = f(2) ≈ 7.38905609893065. So

    P0,1(1.5) = [(1.5 − x1)P0,0(1.5) − (1.5 − x0)P1,0(1.5)] / (x0 − x1)
              = [(1.5 − 1)(1) − (1.5 − 0)(2.718281828459045)] / (0 − 1)
              ≈ 3.577422742688568
    P1,1(1.5) = [(1.5 − x2)P1,0(1.5) − (1.5 − x1)P2,0(1.5)] / (x1 − x2)
              = [(1.5 − 2)(2.718281828459045) − (1.5 − 1)(7.38905609893065)] / (1 − 2)
              ≈ 5.053668963694848
    P0,2(1.5) = [(1.5 − x2)P0,1(1.5) − (1.5 − x0)P1,1(1.5)] / (x0 − x2)
              = [(1.5 − 2)(3.577422742688568) − (1.5 − 0)(5.053668963694848)] / (0 − 2)
              ≈ 4.684607408443278.

Table 3.2: Neville’s method example, calculating P0,2(1.5).

    xi   Pi,0 = f(xi)         Pi,1                 Pi,2
    0    1                    3.577422742688568    4.684607408443278
    1    2.718281828459045    5.053668963694848
    2    7.38905609893065

A tabulation of the computation may make it easier to internalize the recursion and imagine how this process might
be automated. Table 3.2 shows such a tabulation. The use of this recursive formula may be more difficult than
direct computation for a human being, but for a computer, using the recursion is much quicker and simpler as
evidenced by a look at the pseudo-code.

Assumptions: Pn (x) is the degree at most n polynomial interpolating the data

(x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn ))

and the value Pn (x̂) is desired.


Input: Value x̂; abscissas x0 , x1 , . . . , xn ; ordinates f (x0 ), f (x1 ), . . . , f (xn ).
Step 1: For i = 0 . . . n do Step 2:
Step 2: Set Pi,0 = f (xi );
Step 3: For j = 1 . . . n do Steps 4-5:
Step 4: For i = 0 . . . n − j do Step 5:
Step 5: Set Pi,j = [(x̂ − xi+j)Pi,j−1 − (x̂ − xi)Pi+1,j−1] / (xi − xi+j)

Output: Table of values, P . P0,n holds the desired value, Ln (x̂).

Uniqueness
There are some subtleties we have thus far glossed over. When we introduced the Lagrange form, we casually stated
“Ln is called the Lagrange form of Pn ”, implying that the Lagrange form gives the interpolating polynomial of least
degree (since Pn is defined as such)! This fact is far from obvious. Nonetheless, we went on as if it were obvious that
Ln and Pn were one and the same polynomial. Worse yet, when we came around to discussing Neville’s method,
we calculated P0,2 (1.5) and compared it to L2 (1.5) from earlier with the implication that they should be the same,
again as if it were simply given that P0,2 and L2 should be the same polynomial. The following result shows that
our blind faith that Pn , Ln , and P0,n amount to different names for the same object was not misplaced (by virtue
of the fact that they all interpolate the same data and have degree at most n).

Theorem 7. The polynomial, Pn , of least degree interpolating the data (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) exists and is
unique. Moreover, any interpolating polynomial of degree at most n is equal to Pn .
Proof. By construction, Ln interpolates the data. Moreover, the degree of Ln is at most n since it is the sum of
polynomials pi each with degree exactly n. Thus Pn exists and has degree at most n [at this point, we must
admit that the degree of Pn may be less than that of Ln ]. Now suppose q is any polynomial interpolating
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) with degree n or less. Then the polynomial f = Pn − q also has degree n or less.
Moreover, f(xi) = Pn(xi) − q(xi) = yi − yi = 0 for all i = 0, . . . , n. Thus f has n + 1 roots. Alas, the only way f
can have n + 1 roots and have degree n or less is if f is identically 0. Hence, f (x) = Pn (x) − q(x) = 0, implying
Pn (x) = q(x) for all x.

Octave
The indices presented in the pseudo-code are predicated on indexing starting with 0, as in the mathematical
description. In Octave, however, indices can not be 0. They are always positive integers. A slight modification of
the indices is required to accommodate this discrepancy.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 22 March 2014 %
% Purpose: This function implements Neville’s method for %
% computing the value P(xhat) of the interpolating %
% polynomial P passing through the data (x(1),y(1)), %
% (x(2),y(2)),...,(x(n),y(n)). %
% INPUT: value xhat; array x of abscissas; array y of %
% ordinates. %
% OUTPUT: table of values Q; Q(1,n)=P(xhat). %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function Q = nevilles(xhat,x,y)
  n=length(x);
  for i=1:n
    Q(i,1)=y(i);
  end%for
  for j=2:n
    for i=1:n+1-j
      Q(i,j)=((xhat-x(i+j-1))*Q(i,j-1)-(xhat-x(i))*Q(i+1,j-1))/(x(i)-x(i+j-1));
    end%for
  end%for
end%function
nevilles.m may be downloaded at the companion website.
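As a quick check (assuming nevilles.m is in Octave’s path), the worked example above can be reproduced in a few lines; the value computed earlier appears in the upper right entry of the returned table.

x = [0 1 2];              % nodes
y = exp(x);               % f(x) = e^x at the nodes
Q = nevilles(1.5, x, y);  % Neville's method at xhat = 1.5
Q(1,3)                    % P_{0,2}(1.5); should agree with 4.684607408443278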

Key Concepts
Interpolating function: A function whose graph is required to pass through a set of prescribed points.
Interpolating polynomial: A polynomial whose graph is required to pass through a set of prescribed points.
Interpolating polynomial of least degree: The polynomial of least degree interpolating a given set of n + 1
data points is unique. We denote this polynomial by Pn .
Interpolating polynomial of degree at most n: The polynomial interpolating n + 1 distinct points has degree
at most n and is equal to the polynomial of least degree interpolating the points.
Generalized Rolle’s theorem: Suppose that f has n derivatives on (a, b) and f, f′, f″, . . . , f^(n−1) are all continuous on [x0, xn]. If f(x0) = f(x1) = · · · = f(xn) for some x0 < x1 < · · · < xn, then there exists ξ ∈ (a, b) such that f^(n)(ξ) = 0.
Lagrange form of an interpolating polynomial: The Lagrange form, Ln , of the polynomial of degree at most
n interpolating the points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) is given by the formula
    Ln(x) = Σ_{i=0}^{n} [pi(x)/pi(xi)] yi,

where pi(x) = Π_{j=0, j≠i}^{n} (x − xj) = (x − x0) · · · (x − xi−1)(x − xi+1) · · · (x − xn).

Interpolation error: For Pn , the interpolating polynomial of least degree passing through the n + 1 points
(x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), there is a value ξx ∈ (a, b) such that

    f(x) − Pn(x) = [f^(n+1)(ξx)/(n + 1)!] (x − x0)(x − x1) · · · (x − xn),

assuming f has n + 1 derivatives on (a, b) and f, f′, f″, . . . , f^(n) are all continuous on [a, b], and where a = min(x0, . . . , xn, x) and b = max(x0, . . . , xn, x).
Sidi’s method: A root-finding method summarized by the formula

    xn+1 = xn − f(xn)/p′n,k(xn)

where pn,k is the interpolating polynomial passing through the points

(xn , f (xn )), (xn−1 , f (xn−1 )), . . . , (xn−k , f (xn−k )).

Neville’s method: A method for computing the interpolating polynomial of least degree or values of it based on
the recursive relation
    Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1)
    Pi,0(x) = f(xi)

where Pk,l is the polynomial of least degree interpolating the data

(xk , f (xk )), (xk+1 , f (xk+1 )), . . . , (xk+l , f (xk+l )).

Exercises

1. Write down the Lagrange interpolating polynomial passing through (1, 2), (1.5, −0.83), and (2.11, −1).

2. Find a polynomial that passes through the four points (0, 0), (1, 2), (4, −3), and (10, −1).

3. Construct the (at most) quadratic Lagrange Polynomial interpolating the data.

   (a) (1, 1), (2, 1), and (3, 2)
   (b) (0, 10), (30, 58), (1029, −32)
   (c) (−10, 10), (20, 58), (1019, −32) [S]
   (d)
        x      f(x)
        5      15
        200    2
        10     15
   (e)
        x      f(x)
        −5     15
        −2     2
        3      15

4. Suppose the data from question 3 were taken from an appropriately differentiable function f. Use the interpolating polynomial you found in question 3 to estimate f(1.3). [S]

5. Find the estimate in question 4 using Neville’s method. [S]

6. Given the following data for f(x), approximate f(0.3) using an interpolating polynomial of degree at most

   (a) 1
   (b) 2
   (c) 3

        x      0     1     2      3
        f(x)   0.8   0.7   0.75   0.5

7. Given the following data for f(x), approximate f(3) using an interpolating polynomial of degree at most [S]

   (a) 1
   (b) 2
   (c) 3

        x      2     3.5   4      5
        f(x)   0.8   0.7   0.75   0.5

8. Use interpolating polynomials of degrees one, two, and three to approximate each of the following:

   (a) f(0.43) if f(0) = 1, f(0.25) = 1.64872, f(0.5) = 2.71828, f(0.75) = 4.48169.
   (b) f(0.18) if f(0.1) = −0.29004986, f(0.2) = −0.56079734, f(0.3) = −0.81401972, f(0.4) = −1.0526302. [S]
   (c) f(2.26) if f(1) = 1.654, f(1.5) = −2.569, f(2) = −1.329, f(2.5) = 1.776. [S]
   (d) f(11.26) if f(10) = −0.7865, f(11) = −1.2352, f(12) = −0.8765, f(13) = 0.0021.

9. Let x0 = 1, x1 = 1.25, and x2 = 1.6. Using data at these xi, construct interpolating polynomials of degrees at most one and at most two and use them to approximate f(1.4). Find the absolute errors.

   (a) f(x) = sin πx [S]
   (b) f(x) = ∛(x − 1)
   (c) f(x) = e^(2x−4)
   (d) f(x) = ln(10x)

10. Use formula 3.2.3 to find theoretical error bounds for the approximations in question 9. Compare the bound to the actual error. [S]

11. A Lagrange interpolating polynomial is constructed for the function f(x) = (√2)^x using x0 = 0, x1 = 1, x2 = 2, x3 = 3. It is used to approximate f(1.5). Find a bound on the error in this approximation.

12. Find the polynomial referred to in question 11. Then

    (a) use the polynomial to approximate f(1.5); and
    (b) calculate the actual error of this approximation, and compare it to the bound you calculated in question 11.

13. Use Neville’s method to find the approximation in question 11.

14. The height of a model rocket is given at several times in the following table. Approximate the height of the rocket at time t = 0.6 sec using at least two different sets of points. Comment on which approximation is likely most accurate.

        Time (sec)   Height (ft)
        0.53238      30.0534
        0.56040      32.7929
        0.58842      35.4956
        0.61644      38.1575

15. The following table results from using Neville’s method to approximate f(0.4). Determine f(0.5). [A]

        0      1      2.6    P0,2   3.016
        0.25   2      P1,1   2.96
        0.5    P2,0   2.4
        0.75   8

16. L3(x) = −7x^3 + 57x^2 − 134x + 78 is the degree (at most) 3 interpolating polynomial for the data in the table. Find ω. [A]

        x   0.5      0.8     ω   1.4
        y   24.375   3.696   0   −17.088

17. Let P3(x) be the interpolating polynomial for the data (0, 0), (0.5, y), (1, 3), (2, 2). Find y if the coefficient of x^3 in P3(x) is 6.

18. Let f(x) = √(x − x^2) and P2(x) be the interpolating polynomial on x0 = 0, x1, and x2 = 1. Find the largest value of x1 in (0, 1) for which f(0.5) − P2(0.5) = −0.25.

19. The interpolating polynomial on n + 1 points does not always have degree n. It has degree at most n. Plot the data (1, 1), (2, 3), (3, 5), and (4, 7), and make a conjecture as to the degree of the polynomial interpolating these four points. What led you to your conjecture?

20. Use Neville’s method to find the polynomial described in question 19. Does it have the degree you expected?

21. Let

        xj = 1 − 1/(j + 1) for j = 0, 1, 2, . . .
        f(x) = 5 + 3x^2018
        Pn(x) = the interpolating polynomial passing through (x0, f(x0)), . . . , (xn, f(xn)).

    Find lim_{n→∞} Pn(1). [A]

22. Let f(x) = e^(−x). Two different numbers are chosen at random from the interval [0, 1], say x0 and x1. Then the points (x0, f(x0)) and (x1, f(x1)) are used to get a linear Lagrange interpolation approximation to f over the interval [0, 1]. Find a bound (good for the entire interval and every pair of points x0 and x1) for the error in using this approximation.

23. Supply the inductive proof that P0,n is the polynomial of degree at most n interpolating the data (x0, f(x0)), (x1, f(x1)), . . . , (xn, f(xn)). See notes on page 112.

3.3 Newton Polynomials


In this section, we are interested in an efficient automated process for calculating interpolating polynomials. The
Lagrange form of an interpolating polynomial is best suited for pencil and paper calculations, not computer au-
tomation. Neville’s method is well suited for computing the value of an interpolating polynomial at a particular
point, not calculation of the polynomial itself. True, Neville’s method can be used to calculate the interpolating
polynomials themselves, but it lends itself to this task no better than the Lagrange form. Presently, we will discover
how the same recursive formula used in Neville’s method is used to derive a very efficient, computer-friendly method
for calculating interpolating polynomials themselves. The result of the computation is a set of coefficients for the
Newton form of a polynomial.
Suppose we have already computed the polynomial Nn (x) interpolating the data

(x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn )).

We now wish to compute the polynomial Nn+1 (x) interpolating the data

(x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn+1 , f (xn+1 )),

and we would like to recycle the work we have already done (much the same way we could add a point of interpolation
in Neville’s method and reuse all previous work)! One way to attack the problem is to find a polynomial q(x) such
that
Nn+1 (x) = Nn (x) + q(x).
If the attack is to be successful, we must have q(x) = Nn+1 (x) − Nn (x) for all x, and, in particular, q(xj ) =
Nn+1 (xj ) − Nn (xj ) for j = 0, 1, . . . , n + 1. But Nn+1 (xj ) − Nn (xj ) = f (xj ) − f (xj ) = 0 for j = 0, 1, . . . , n, and
Nn+1 (xn+1 ) − Nn (xn+1 ) = f (xn+1 ) − Nn (xn+1 ). In other words, we seek the polynomial q interpolating the points

(x0 , 0), (x1 , 0), . . . , (xn , 0), (xn+1 , (f − Nn )(xn+1 )).

Ironically, this is a job for the Lagrange form:

    q(x) = (f − Nn)(xn+1) · [(x − x0) · · · (x − xn)] / [(xn+1 − x0) · · · (xn+1 − xn)]
         = [(f − Nn)(xn+1) / ((xn+1 − x0) · · · (xn+1 − xn))] (x − x0) · · · (x − xn).          (3.3.1)

But (f − Nn)(xn+1)/[(xn+1 − x0) · · · (xn+1 − xn)] is just a constant, so we replace it by an+1 so that we have q(x) = an+1(x − x0) · · · (x − xn). Of course we can calculate an+1 using the formula (f − Nn)(xn+1)/[(xn+1 − x0) · · · (xn+1 − xn)], but there is a better way, which we will see shortly. We can also learn from the upcoming computation the most convenient form for Nn.
When n = 0, q has the form a1 (x − x0 ); when n = 1, q has the form a2 (x − x0 )(x − x1 ); when n = 2, q has
the form a3 (x − x0 )(x − x1 )(x − x2 ); and so on. Of course N0 (x) = a0 is constant since it is the interpolating
polynomial of least degree passing through a single point. So N1 (x) = N0 (x) + a1 (x − x0 ) immediately takes the
form a0 + a1 (x − x0 ); N2 (x) immediately takes the form a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ); N3 (x) immediately
takes the form a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + a3 (x − x0 )(x − x1 )(x − x2 ); and so on. This would suggest
that the most convenient form for Nn+1 , the one that requires no simplification, is

Nn+1 (x) = a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + · · · + an+1 (x − x0 ) · · · (x − xn ). (3.3.2)

Given in this form, the unknown quantity, an+1 , appears as the coefficient of the xn+1 term. Consequently, an+1 is
potentially the leading coefficient of Nn+1 . If an+1 were zero, then we would not call it the leading coefficient. We
will facilitate the rest of this discussion by introducing the following term. For an interpolating polynomial on k + 1
points, the coefficient of its xk term is called its potential leading coefficient (even if it happens to be zero).
Since this potential leading coefficient is the crux of our problem, we focus attention on determining the potential
leading coefficient of any interpolating polynomial.
Here is where the recursive formula
    Pi,m+1(x) = [(x − xi+m+1)Pi,m(x) − (x − xi)Pi+1,m(x)] / (xi − xi+m+1)
    Pi,0(x) = f(xi)

used in devising Neville’s method comes in handy. In as much as Pi,m and Pi+1,m both have degree at most m,
their potential leading coefficients are the coefficients of their xm terms. It follows that the coefficient of the xm+1
term of (x − xi+m+1 )Pi,m (x) equals the potential leading coefficient of Pi,m (x), and, similarly, the coefficient of
the xm+1 term of (x − xi )Pi+1,m equals the potential leading coefficient of Pi+1,m . Therefore, the coefficient of the
xm+1 term of (x − xi+m+1 )Pi,m (x) − (x − xi )Pi+1,m (x) is the difference of the potential leading coefficients of Pi,m
and Pi+1,m . To simplify the discussion, we use the notation fi,j for the potential leading coefficient of Pi,j . Now the
coefficient of the xm+1 term of (x − xi+m+1 )Pi,m (x) − (x − xi )Pi+1,m (x) is just fi,m − fi+1,m . Hence, the potential
leading coefficient fi,m+1 of Pi,m+1 (the coefficient of the xm+1 term of Pi,m+1 ) is given by

    fi,m+1 = (fi,m − fi+1,m) / (xi − xi+m+1)                                    (3.3.3)
    fi,0 = f(xi).

Crumpet 22: Divided Differences

While we choose to use the notation fi,j for the potential leading coefficient of Pi,j , it is much more customary
to use the expanded notation f [xi , xi+1 , . . . , xi+j ] for this quantity, and to call it a j th divided difference.

Finally, we have a formula for the potential leading coefficient that recycles previous calculations. Since Nn+1
and P0,n+1 interpolate the same set of points and both have degree at most n + 1, they are equal by theorem
7. Therefore, their potential leading coefficients, an+1 and f0,n+1 are equal. By recursion 3.3.3, we then have
an+1 = f0,n+1 = (f0,n − f1,n)/(x0 − xn+1).
It can not be stressed enough that we have not discovered a new polynomial. We have only discovered a new
way to calculate the same old interpolating polynomials. Nn , Ln , and P0,n all interpolate the same data and all
have degree at most n. They are, therefore, equal by theorem 7. Just the forms in which they are written possibly
differ. The polynomial form in equation 3.3.2 is called the Newton form.

Crumpet 23: Newton Polynomials

Typically, the Newton form and divided differences are presented completely independent of Neville’s recursive
formula, an approach that takes considerably more work to develop. There are reasons to do so, however. Refrain-
ing from the use of Neville’s formula follows more closely the historical development of the subject since Newton
(1643–1727) preceded Neville (1889-1961) by over 200 years! Moreover, following the historical development more
naturally leads to further study of divided differences.

As an example, take the polynomial interpolating f(x) = e^x at x = 0, 1, 2, as we did in the discussion of Neville’s method on page 111. f0,0 = f(0) = 1, f1,0 = f(1) ≈ 2.718281828459045, and f2,0 = f(2) ≈ 7.38905609893065. So

    f0,1 = (f0,0 − f1,0)/(x0 − x1) = (1 − 2.718281828459045)/(0 − 1) ≈ 1.718281828459045
    f1,1 = (f1,0 − f2,0)/(x1 − x2) = (2.718281828459045 − 7.38905609893065)/(1 − 2) ≈ 4.670774270471606
    f0,2 = (f0,1 − f1,1)/(x0 − x2) = (1.718281828459045 − 4.670774270471606)/(0 − 2) ≈ 1.47624622100628.

Table 3.3: Newton form example, calculating N2 (x).


    xi   fi,0 = f(xi)         fi,1                 fi,2
    0    1                    1.718281828459045    1.47624622100628
    1    2.718281828459045    4.670774270471606
    2    7.38905609893065

Therefore, N2(x) = 1 + 1.718281828459045(x) + 1.47624622100628(x)(x − 1). In general, the entries f0,i are the coefficients of Nn. Though
this computation is manageable without a table, it is most convenient to tabulate the values of fi,j as they are
computed (just as is the case for Neville’s method). This is true for both humans and computers! A tabulation
of the computation makes it easier to internalize the recursion and imagine how this process might be automated.
Table 3.3, which is called a table of divided differences (or divided difference table), shows such a tabulation. Adding
a data point to the interpolation is as easy as computing another diagonal of coefficients (just like Neville’s method).
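The same double loop used in nevilles.m builds the divided difference table itself. Here is a minimal sketch (compare exercise 2 at the end of this section); the name divdiff is only illustrative.

function F = divdiff(x, y)
  % Divided difference table for the data (x(1),y(1)),...,(x(n),y(n)).
  % F(i,j) holds f_{i-1,j-1}, so the Newton coefficients appear in F(1,1:n).
  n = length(x);
  for i = 1:n
    F(i,1) = y(i);
  end%for
  for j = 2:n
    for i = 1:n+1-j
      F(i,j) = (F(i,j-1) - F(i+1,j-1))/(x(i) - x(i+j-1));
    end%for
  end%for
end%function

For example, divdiff([0 1 2], exp([0 1 2])) reproduces Table 3.3, and its first row holds the coefficients 1, 1.718281828459045, and 1.47624622100628 of N2.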

Sidi’s Method
We now return attention to Sidi’s kth degree root-finding method,

    xn+1 = xn − g(xn)/p′n,k(xn),

where pn,k is the interpolating polynomial passing through the points

(xn , g(xn )), (xn−1 , g(xn−1 )), . . . , (xn−k , g(xn−k )).

In its Newton form,

    pn,k(x) = gn,0 + gn−1,1(x − xn) + gn−2,2(x − xn)(x − xn−1) + · · · + gn−k,k(x − xn) · · · (x − xn−k+1),

so

    p′n,k(xn) = gn−1,1 + gn−2,2(xn − xn−1) + · · · + gn−k,k(xn − xn−1) · · · (xn − xn−k+1).          (3.3.4)

In particular,

    p′n,2(xn) = gn−1,1 + (xn − xn−1)gn−2,2

and

    p′n,3(xn) = gn−1,1 + (xn − xn−1)gn−2,2 + (xn − xn−1)(xn − xn−2)gn−3,3

and so on. As a nested product,

    p′n,k(xn) = gn−1,1 + (xn − xn−1) [gn−2,2 + (xn − xn−2) [· · · + (xn − xn−k+1) [gn−k,k] · · · ]] .

The nested form is particularly efficient for implementation.


Assumptions: g is k times differentiable.
Input: Initial values x0 , x1 , . . . , xk ; diagonal entries gk,0 , gk−1,1 , . . . , g0,k of the divided difference table for
g.
Step 1: Set s = g0,k ;
Step 2: For i = 1, 2, . . . , k − 1 do Step 3:
Step 3: Set s = (xk − xi )s + gi,k−i ;
gk,0
Step 4: Set xk+1 = xk − ;
s
Output: Approximation xk+1 .
While this pseudo-code is good as far as it goes, it is far from complete. The most obvious deficiency is that it only
executes one step of Sidi’s method. A less obvious deficiency is that its input and output do not match in type or
quantity, so at the end of the routine, the computer is still not ready to compute another iteration. What we get
from this routine is xk+1 . What we need to run it again are the two arrays x0 , x1 , . . . , xk and gk,0 , gk−1,1 , . . . , g0,k .
In order to prepare these arrays for the next iteration, we must re-index the values of xi and then compute new
values for the gi,k−i .

Assumptions: g is k times differentiable.


Input: Initial values x0 , x1 , . . . , xk ; diagonal entries gk,0 , gk−1,1 , . . . , g0,k of the divided difference table for
g.
Step 1: Set xk+1 according to Sidi’s method applied to x0 , x1 , . . . , xk and gk,0 , gk−1,1 , . . . , g0,k ;
Step 2: Set gk+1,0 = g(xk+1 );
Step 3: For i = k, k − 1, . . . , 1 do Step 4:
gi+1,k−i − gi,k−i
Step 4: Set gi,k+1−i = ;
xk+1 − xi
Output: Approximations x1 , . . . , xk+1 and corresponding diagonal entries gk+1,0 , gk,1 , . . . , g1,k of the divided
difference table for g.

This new pseudo-code, which utilizes the previous pseudo-code in its first step is an improvement. Now the input
and output match in type and quantity, meaning the output of this routine may be used as input for the next
iteration. However, this routine still only calculates one step of Sidi’s method. Moreover, we have been ignoring
another issue. Each of the routines spelled out in pseudo-code so far assume we have the diagonal entries of the
corresponding divided difference table. It is not good practice to make the user of the code worry about this detail.
The routine we write should supply these values. After all, the end-user, the person trying to find a root of a
function, will only have immediate access to the function and some number of initial values. The routine must
supply the rest. Finally, we present pseudo-code in the spirit of other root-finding methods.

Assumptions: g has a root at x̂; g is k times differentiable; x0 , x1 , . . . , xk are sufficiently close to x̂.
Input: Initial values x0 , x1 , . . . , xk ; function g; desired accuracy tol; maximum number of iterations N .
Step 1: For i = 0, 1, . . . , k do Step 2:
Step 2: Set gi,0 = g(xi );
Step 3: For j = 1, 2, . . . , k do Steps 4-5:
Step 4: For i = 0, 1, . . . , k − j do Step 5:
gi+1,j−1 − gi,j−1
Step 5: Set gi,j =
xi+j − xi
Step 6: For i = 1 . . . N do Steps 7-11:
Step 7: Compute x = xk+1 according to Sidi’s method applied to
x0 , x1 , . . . , xk and gk,0 , gk−1,1 , . . . , g0,k ;
Step 8: If |x − xk | ≤ tol then return x;
Step 9: Compute gk+1,0 , gk,1 , . . . , g1,k ;
Step 10: Set x0 = x1 ; x1 = x2 ; · · · xk−1 = xk ; xk = x;
Step 11: Set gk,0 = gk+1,0 ; gk−1,1 = gk,1 ; · · · g0,k = g1,k ;
Step 12: Print “Method failed. Maximum iterations exceeded.”
Output: Approximation x near exact fixed point, or message of failure.

As complete as this latest pseudo-code is, it leaves one item unaddressed. It requires k initial values to run Sidi’s k th
degree method. When we encountered the secant method, we noted that needing two initial values as opposed to
one was a disadvantage. The disadvantage is only magnified in Sidi’s method where k + 1 initial values are required.
However, just as with the secant method, we can automatically generate initial values if needed. If Sidi’s method is
given one initial value, x0 , and we are trying to find a root of the function g, then we can set x1 = x0 + g(x0 ) just
as we did for the secant method. You may recall, this was not particularly successful, however. The secant method
often failed to converge with this selection of initial condition.
Much less is known about Sidi’s method and how the selection of intial values affects convergence. It might
make an interesting project to analyze good and bad practices for selecting initial values. In any case, if you have
initial values x0 , x1 , . . . , xj with 1 < j < k, the remaining k + 1 − j intial values can be found using Sidi’s method
of degree j (on x0 , x1 , . . . , xj ) to get xj+1 followed by using Sidi’s method of degree j + 1 (on x0 , x1 , . . . , xj+1 ) to
get xj+2 followed by using Sidi’s method of degree j + 2 (on x0 , x1 , . . . , xj+2 ) to get xj+3 , and so on until xk is
computed.

Octave
As is the case with Neville’s method, the Octave code follows identically its corresponding pseudo-code except that
indices have been modified to accommodate indexing beginning with 1, not 0.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 1 April 2014 %
% Purpose: Implementation of Sidi’s Method %
% INPUT: function g; initial values x0,x1,...,xk; %
% tolerance TOL; maximum number of %
% iterations N %
% OUTPUT: approximation X and number of iterations %
% i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [X,i] = sidi(x, TOL, N, g)
  n=length(x);
  for i=1:n
    G(i,1)=g(x(i));
  end%for
  for j=2:n
    for i=1:n+1-j
      G(i,j)=(G(i+1,j-1)-G(i,j-1))/(x(i+j-1)-x(i));
    end%for
  end%for
  for i=1:N
    s=G(1,n);
    for j=2:n-1
      s=(x(n)-x(j))*s+G(j,n+1-j);
    end%for
    X=x(n)-G(n,1)/s;
    if (abs(X-x(n))<TOL)
      return
    end%if
    G(n+1,1)=g(X);
    for j=n:-1:2
      G(j,n+2-j)=(G(j+1,n+1-j)-G(j,n+1-j))/(X-x(j));
    end%for
    for j=1:n-1
      x(j)=x(j+1);
    end%for
    x(n)=X;
    for j=1:n
      G(n+1-j,j)=G(n+2-j,j);
    end%for
  end%for
  X = "Method failed. Maximum iterations exceeded.";
end%function

sidi.m may be downloaded at the companion website.
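As a quick trial, the sample function below (with a root near 0.7390851) and three initial values give Sidi’s 2nd degree method; the call should return an approximation of the root accurate to within the tolerance, along with the number of iterations used. The choice of g and initial values here is only illustrative.

g = @(x) cos(x) - x;                      % sample function; root near 0.7390851
[X, it] = sidi([0 0.5 1], 1e-12, 50, g)   % three initial values, so k = 2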

More divided differences


Divided difference tables are generally computed for the sake of finding coefficients for one interpolating polynomial,
and one interpolating polynomial only. However, each table of divided differences is rife with representations of
interpolating polynomials. One of the strengths of a divided difference table is that its entries may be reused should
more data be added. This same property can be thought of in reverse. Suppose you have a divided difference table
computed over 4 data values but you are only interested in an at-most-degree-2 interpolating polynomial. The
divided difference table

    x0   f0,0   f0,1   f0,2   f0,3
    x1   f1,0   f1,1   f1,2
    x2   f2,0   f2,1
    x3   f3,0

actually gives us two different at-most-quadratic interpolating polynomials with four representations for each! First,
the table was devised to compute the interpolating polynomial

P3 (x) = f0,0 + f0,1 (x − x0 ) + f0,2 (x − x0 )(x − x1 ) + f0,3 (x − x0 )(x − x1 )(x − x2 ).

Notice that if we simply truncate the f0,3 (x − x0 )(x − x1 )(x − x2 ) term, we still have an interpolating polynomial
with nodes x0 , x1 , x2 . We can support this claim in at least two ways. First, the term f0,3 (x − x0 )(x − x1 )(x − x2 )
is 0 at x0 , x1 , x2 so it does not contribute to the interpolation at the nodes x0 , x1 , x2 . Second, we can “reverse
engineer” the table, simply erasing the bottom-most diagonal. The remaining table is still a legitimate divided
difference table since none of the remaining entries depends on any of the erased entries:

    x0   f0,0   f0,1   f0,2
    x1   f1,0   f1,1
    x2   f2,0

So
P2 (x) = f0,0 + f0,1 (x − x0 ) + f0,2 (x − x0 )(x − x1 )
is one of the degree at most 2 interpolating polynomials. Erasing the top row of the table also leaves a legitimate
divided difference table:
x1 f1,0 f1,1 f1,2
x2 f2,0 f2,1
x3 f3,0
so
Q2 (x) = f1,0 + f1,1 (x − x1 ) + f1,2 (x − x1 )(x − x2 )
is another degree at most 2 interpolating polynomial. Notice that P2 and Q2 are not just different representations
of the same polynomial. They are two different polynomials! P2 interpolates over the nodes x0 , x1 , x2 while Q2
interpolates over the nodes x1 , x2 , x3 .
The bottom diagonals of each truncated table give degree at most 2 interpolating polynomials as well. Remember,
fi,j represents the potential leading coefficient of the interpolating polynomial over the nodes xi , xi+1 , . . . , xi+j .
Hence,
Q̃2 (x) = f3,0 + f2,1 (x − x3 ) + f1,2 (x − x3 )(x − x2 )
interpolates over the nodes x3 , x2 , x1 and

P̃2 (x) = f2,0 + f1,1 (x − x2 ) + f0,2 (x − x2 )(x − x1 )

interpolates over the nodes x2 , x1 , x0 . These are not new polynomials. These are new representations for P2 and
Q2 . Actually, P̃2 = P2 and Q̃2 = Q2 .
The critical feature of each of these interpolating polynomial representations is that each successive coefficient
depends on all the same nodes as its predecessor, plus one new one. For example, f2,0 depends on x2 , f1,1 depends
on x2 and x1 , and f0,2 depends on x2 , x1 , and x0 . Hence, these three coefficients can be used to produce the
interpolating polynomial over the nodes x0 , x1 , x2 in the form of polynomial P̃2 (which, as we have already noted,
equals P2 ). Another representation for the same polynomial can be written by utilizing f1,0 (which depends on x1 ),
f0,1 (which depends on x1 and x0 ), and f0,2 (which depends on x1 , x0 , x2 ):

P̂2 (x) = f1,0 + f0,1 (x − x1 ) + f0,2 (x − x1 )(x − x0 )

to give a representation of the polynomial interpolating over x0 , x1 , x2 (which, therefore, must equal P2 ). There is
one more representation of P2 that can be extracted from the original divided difference table. It comes from the
coefficients f1,0 , f1,1 , f0,2 . Can you write it down? Answer on page 126. There are two more representations of Q2
that can be extracted from the original divided difference table. Can you write them down? Answers on page 126.

Key Concepts
Newton form of an interpolating polynomial: The Newton form, Nn , of the polynomial of degree at most n interpo-
lating the points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) is

Nn (x) = a0 + a1 (x − xi0 ) + a2 (x − xi0 )(x − xi1 ) + · · · + an (x − xi0 ) · · · (x − xin−1 )

for n distinct indices i0 , i1 , . . . , in−1 from the set {0, 1, 2, . . . , n}. The Newton form for a particular set of data is not
unique.
Potential leading coefficient: For an interpolating polynomial on k + 1 points, the coefficient of its xk term is called its
potential leading coefficient.
Divided differences: The coefficients of the Newton form of an interpolating polynomial are called divided differences.

Exercises
1. Modify the Neville’s method pseudo-code on page 113 to produce pseudo-code for computing the coefficients of Nn .

2. Modify the Neville’s method Octave code on page 114 to produce octave code for computing the coefficients of Nn .
Test it by computing N2 interpolating f (x) = ex at x = 0, 1, 2 and comparing your result to that on page 118.
3. Let f (0.1) = 0.12, f (0.2) = 0.14, f (0.3) = 0.13, and f (0.4) = 0.15.

(a) Find the leading coefficient of the polynomial of least degree interpolating these data.
(b) Suppose, additionally, that f (0.5) = 0.11. Use your previous work to find the leading coefficient of the polynomial
of least degree interpolating all of the data.
[S]
4. Find a Newton form of the polynomial of degree at most 3 interpolating the points (1, 2), (2, 2), (3, 0) and (4, 0).
5. Use the method of divided differences to find the at-most-second-degree polynomial interpolating the points (0, 10),
(30, 58), (1029, −32). [A]
6. Use divided differences to find an interpolating polynomial for the data f (1) = 0.987, f (2.2) = −0.123, and f (3) =
0.432. [S]
7. Create a divided differences table for the following data using only pencil and paper.

f (1.2) = 2.2 f (1.4) = 2.1 f (1.6) = 2.3

(a) What is the interpolating polynomial of degree at most 2? Does it actually have degree 2?
(b) Write down two distinct linear interpolating polynomials for this data based on your table.

8. Use divided differences to find the at-most-cubic polynomial of exercise 19 of section 3.2. Does it have the expected
degree? [A]
9. Find the degree at most two interpolating polynomial of the form

pn (x) = a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + · · · + an (x − x0 )(x − x1 ) · · · (x − xn−1 )

for the data in the table.


x 2 3 4
f (x) 3 5 4

10. Use the Octave code from question 2 to compute the interpolating polynomial of at most degree four for the data:
x f (x)
0.0 −6.00000
0.1 −5.89483
0.3 −5.65014
0.6 −5.17788
1.0 −4.28172
Then add f (1.1) = −3.9958 to the table, and compute the interpolating polynomial of degree at most 5 using a
calculator. You may use the Octave code to check your work. [S]

11. Use the Octave code from question 2 to find interpolating polynomials of degrees (at most) one, two, and three for
the following data. Approximate f (8.4) using each polynomial.

f (8.1) = 16.94410, f (8.3) = 17.56492,


f (8.6) = 18.50515, f (8.7) = 18.82091

12. Find a bound on the error in using the interpolating polynomial of question 6 to approximate f (2) assuming that all
derivatives of f are bounded between −2 and 1 over the interval [1, 3]. [S]

13. Regarding the polynomial of question 9,

(a) use the polynomial to approximate f (2.5); and

(b) assuming f ∈ C 3 , find a theoretical bound on the error of approximating f (x) on the interval [2, 4].

[A]
14.

(a) Find an error bound, in terms of f (4) (ξ8.4 ), for the approximation P3 (8.4) in question 11.

(b) Find an error bound, in terms of f (4) (x), for the approximation P3 (x) in question 11 good for any x ∈ [8.1, 8.7].

(c) Suppose f (4) (x) = x cos x − ex for the function f (x) of question 11. Use this information to find an error bound
for the approximation P3 (x) good for any x ∈ [8.1, 8.7].

15. Buck spilled coffee on his divided differences table, obscuring several numbers. Nevertheless, there is enough legible
information to find the at-most-degree-3 polynomial interpolating the data. Find it. [A]

16. Show that the polynomial interpolating the following data has degree 3.

x −2 −1 0 1 2 3
f (x) 1 4 11 16 13 −4

17. For a function f , Newton’s divided difference formula gives the interpolating polynomial

    N3(x) = 1 + 4x + 4x(x − 0.25) + (16/3)x(x − 0.25)(x − 0.5)

on the nodes x0 = 0, x1 = 0.25, x2 = 0.5, x3 = 0.75. Find f(0.75). [S]

18. Match the function with its Seeded Sidi method convergence diagram. In each case, Sidi’s 6th degree method was used.
The real axis passes through the center of each diagram, and the imaginary axis is represented, but is not necessarily
centered. [S]

f(x) = sin x
g(x) = sin x − e^(−x)
h(x) = e^x + 2^(−x) + 2 cos x − 6
l(x) = 56 − 152x + 140x^2 − 17x^3 − 48x^4 + 9x^5

[Convergence diagrams (a)–(d) not reproduced here.]

19. Match the function with its Seeded Sidi method convergence diagram. The real axis passes through the center of each
diagram, and the imaginary axis is represented, but is not necessarily centered. [A]

f(x) = x^4 + 2x^2 + 4
g(x) = (x^2)(ln x) + (x − 3)e^x
h(x) = 1 + 2x + 3x^2 + 4x^3 + 5x^4 + 6x^5
l(x) = (ln x)(x^3 + 1)

[Convergence diagrams (a)–(d) not reproduced here.]

20. You have found the following Octave function with no comments (boo to the author of the function!).

function ans = foo(x,y,x0)
  n = length(x);
  ans = 0;
  for i=1:n
    a=1;
    for j=1:n
      if (j==i)
        a=a*y(i);
      else
        a=a*(x0-x(j))/(x(i)-x(j));
      endif
    endfor
    ans=ans+a;
  endfor
endfunction

What is the output (ans) of the Octave command

foo([1.1,1.2,1.3,1.4],[.78,.81,.79,.75],1.2)

and why?

Answers
P2 from f1,0, f1,1, f0,2: P̄2(x) = f1,0 + f1,1(x − x1) + f0,2(x − x1)(x − x2)

Q2 two new ways: Q̂2(x) = f2,0 + f1,1(x − x2) + f1,2(x − x2)(x − x1) and Q̄2(x) = f2,0 + f2,1(x − x2) + f1,2(x − x2)(x − x3)
Chapter 4
Numerical Calculus

4.1 Rudiments of Numerical Calculus


The basic idea
g(x) = x − (2π/3) sin(x) has a root between 0 and π. You are trying various methods and become interested in how
the choice of initial value affects the results. Using Newton’s method, you do some research into how the choice of
x0 affects x2 . You run some tests and come up with the following data.

x0 x2
93/70 2.084603181618954
95/70 2.055494116570853
97/70 2.030278824314539
99/70 2.009751835391139
101/70 1.993574976724822
103/70 1.981091507449763
105/70 1.971614474758557

Using fixed point iteration on f(x) = (2π/3) sin(x), you decide to examine how the choice of x0 affects x10, not x2, since
fixed point iteration generally converges slowly. You run some tests on this method and come up with the following
data.
x0 x10
1/7 1.949880891899200
2/7 1.951091775564697
3/7 1.923339403354019
4/7 1.941460911122824
5/7 1.960870620285721
6/7 1.965674866641883
1 1.961228252911260

In the Newton’s method experiment, x2 is a function of x0 , and in the fixed point iteration experiment, x10 is
a function of x0 . So you start to think of them completely independently from the original root-finding question.
As they sit in their tabular form, they are just two functions for which you know a handful of values and not much
more. What do these functions look like? Do we have enough information to perhaps find their derivatives, and,
hence, local extrema? Can we find their antiderivatives? This is the stuff of numerical calculus. We can certainly
approximate these things.
In chapter 3 we learned how to approximate functions by interpolation, so we know we can use the tabular data
to approximate the functions themselves. But what about their derivatives and integrals? Well, polynomials are
easy to differentiate and integrate. Perhaps we can use the derivatives and integrals of interpolating polynomials
to approximate the derivatives and integrals of x2 (x0 ) and x10 (x0 ). Indeed we can!
In order to avoid the confusion of using x0 for multiple purposes, we will rename our functions ν(x) for x2 (x0 )
and ϕ(x) for x10 (x0 ). Hence, we have ν(93/70) = 2.0846 . . ., ν(95/70) = 2.0554 . . ., and so on. Similarly, we


have now ϕ(1/7) = 1.9498 . . ., ϕ(2/7) = 1.9510 . . ., and so on. We will also take up the practice of calling the
x-coordinates of the prescribed interpolation points nodes. Hence, the nodes we have for ν are 93/70, 95/70, and
so on. The nodes we have for ϕ are 1/7, 2/7, and so on.

Crumpet 24: ν and ϕ

ν is the (lower case) thirteenth letter of the Greek alphabet and is pronounced noo. ϕ is the (lower case) twenty-
first letter of the Greek alphabet and is pronounced fee. The letter fee is also written φ, but in mathematics it
is much more common to see the variant ϕ, perhaps to avoid confusion between fee and the empty set, ∅. The
capital versions of ν and ϕ are N and Φ, respectively.

We begin by considering interpolating polynomials on three nodes. For ν, we use the nodes 93/70, 99/70, and
1.5, and get
    P2,ν(x) = 2.498590686342254x^2 − 7.726543017101505x + 7.939599956140455.

For ϕ, we use the nodes 1/7, 4/7, and 1, and get

    P2,ϕ(x) = .07673215587088045x^2 − .07445530457646088x + 1.95895140161684.

We have added a second subscript to P2 in order to distinguish the interpolating polynomial for ν from that for ϕ.
Now we can approximate derivatives and integrals for both ν and ϕ using P2,ν and P2,ϕ , respectively:

    ν′(x) ≈ P′2,ν(x) = 4.997181372684508x − 7.726543017101505
    ϕ′(x) ≈ P′2,ϕ(x) = .1534643117417609x − .07445530457646088
    ∫ ν dx ≈ ∫ P2,ν dx = .8328635621140847x^3 − 3.863271508550753x^2 + 7.939599956140455x + C
    ∫ ϕ dx ≈ ∫ P2,ϕ dx = .02557738529029348x^3 − .03722765228823044x^2 + 1.95895140161684x + D.

So, for example,

    ν′(1.4) ≈ P′2,ν(1.4) = 4.997181372684508(1.4) − 7.726543017101505 = −.7304890953431942
    ϕ′(0.5) ≈ P′2,ϕ(0.5) = .1534643117417609(0.5) − .07445530457646088 = .002276851294419568

and
    ∫_{1.4}^{1.5} ν(x) dx ≈ ∫_{1.4}^{1.5} P2,ν(x) dx
        = [.8328635621140847x^3 − 3.863271508550753x^2 + 7.939599956140455x] evaluated from 1.4 to 1.5
        = .1991481658283149
    ∫_0^1 ϕ(x) dx ≈ ∫_0^1 P2,ϕ(x) dx
        = [.02557738529029348x^3 − .03722765228823044x^2 + 1.95895140161684x] evaluated from 0 to 1
        = 1.947301134618903.

That’s it! This exercise encapsulates the entire strategy. Given some values of an otherwise unknown function, we
will approximate the unknown function with a polynomial. We will then approximate derivatives and integrals of

Table 4.1: Estimating the derivatives and integrals of ν and ϕ.


    quantity                    using P2                 using P6
    ν′(1.4)                     −.7304890953431942       −.7178145479410887
    ϕ′(0.5)                     .002276851294419568      .1447147284558277
    ∫_{1.4}^{1.5} ν(x) dx       .1991481658283149        .1991932206801721
    ∫_0^1 ϕ(x) dx               1.947301134618903        1.925578216262883

the unknown function by differentiating and integrating the polynomial. There is very little more to be said about
the idea. There is, however, a lot more to be said about automation, accuracy, and efficiency, the focus of the rest
of the chapter. But before we tackle those issues, we will have another look at ν and ϕ.
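Octave’s built-in polynomial routines can carry out the whole strategy in a few lines. The following sketch reproduces the two P2,ν-based estimates above directly from the three values of ν; the particular nodes and values are just those tabulated earlier.

xs = [93/70 99/70 105/70];                                      % nodes used for P_{2,nu}
ys = [2.084603181618954 2.009751835391139 1.971614474758557];   % nu at the nodes
p = polyfit(xs, ys, 2);            % coefficients of the interpolating parabola
polyval(polyder(p), 1.4)           % estimate of nu'(1.4)
q = polyint(p);                    % an antiderivative of the parabola
polyval(q, 1.5) - polyval(q, 1.4)  % estimate of the integral of nu over [1.4, 1.5]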
Using all the nodes of ν, and the help of a computer algebra system, we compute the sixth degree interpolating
polynomial

    P6,ν(x) = −1342.393417879939x^6 + 11632.43754466623x^5 − 41996.4789301455x^4
              + 80851.91317212582x^3 − 87536.60487741232x^2 + 50528.3026241064x
              − 12144.27629915625.

Using all the nodes of ϕ (and a computer algebra system) we compute the sixth degree interpolating polynomial

    P6,ϕ(x) = −25.41848741926543x^6 + 97.00017832506126x^5 − 147.1805326076494x^4
              + 111.7996194440324x^3 − 43.71110414341027x^2 + 8.049781257197147x
              + 1.421773396945804.

Again we have added a second subscript in order to distinguish the interpolating polynomial for ν from that for ϕ. Now we can get second estimates for ν′(1.4), ϕ′(0.5), ∫_{1.4}^{1.5} ν dx, and ∫_0^1 ϕ dx:

    ν′(1.4) ≈ P′6,ν(1.4) ≈ −.7178145479410887
    ϕ′(0.5) ≈ P′6,ϕ(0.5) ≈ .1447147284558277
    ∫_{1.4}^{1.5} ν(x) dx ≈ ∫_{1.4}^{1.5} P6,ν(x) dx ≈ .1991932206801721
    ∫_0^1 ϕ(x) dx ≈ ∫_0^1 P6,ϕ(x) dx ≈ 1.925578216262883.
Table 4.1 summarizes the eight estimates we have made so far. The first four digits of the estimates of ∫_{1.4}^{1.5} ν(x)dx agree, and the first two of ∫_0^1 ϕ(x)dx agree. So there is some agreement for the estimates of the integrals. The estimates for the derivatives don’t agree quite as well, however. The estimates for ν′(1.4) only agree in their first significant digit. They both suggest ν′(1.4) ≈ −.7. But there is essentially no agreement between the estimates of ϕ′(0.5). One approximation is more than 60 times the other! Based on this simple analysis, we should have a hard time believing either estimate of ϕ′(0.5). And we should only trust the first few digits of the others. We will see later that we can use this type of comparison to have the computer decide whether an approximation is good or not.

Issues
There are three issues with the method of estimating derivatives and integrals just outlined.
1. Efficiency. For illustrative purposes and understanding the basic concept of numerical calculus, it is a good
idea to calculate some interpolating polynomials as done in the previous subsection. However, it is cumbersome
and time-consuming to do so. We will dedicate significant energy to finding shortcuts to this direct method,
thus making it more efficient and practical.
2. Automation. Numerical methods are meant to be run by a computer, not a human with a calculator. We
need to find ways that a computer can handle interpolating polynomials. This issue has intimate ties with
efficiency. After all, what will make an algorithm efficient is if it can be executed quickly by a computer!

3. Accuracy. So far we have done very little to determine how accurate our approximations are. We need to
get a better handle on the error terms in order to understand how to use the method accurately.

Presently, we make strides toward addressing all three of these issues, but we leave the bulk of it for the upcoming
sections.
In chapter 3, we labeled the nodes of an interpolating function x0 , x1 , . . . , xn . It will be beneficial to begin calling
them x0 + θ0 h, x0 + θ1 h, . . . , x0 + θn h instead. And for most of our analysis, we will use x0 + θh instead of x for
the point at which we desire an estimate. One might call this substitution a change of variables or a recalibration
of the x-axis.
To see how this helps with the analysis, consider the degree at most 2 interpolating polynomial of f with nodes

x0 + θ0h, x0 + θ1h, and x0 + θ2h.

In the notation of chapter 3, we have

    P2(x) = [(x − x1)(x − x2)/((x0 − x1)(x0 − x2))] f(x0) + [(x − x0)(x − x2)/((x1 − x0)(x1 − x2))] f(x1) + [(x − x0)(x − x1)/((x2 − x0)(x2 − x1))] f(x2),

but with the new notation, we replace x0 by x0 + θ0 h, x1 by x0 + θ1 h, x2 by x0 + θ2 h, and x by x0 + θh, giving us

    P2(x0 + θh) = [(θ − θ1)(θ − θ2)/((θ0 − θ1)(θ0 − θ2))] f(x0 + θ0h)
                + [(θ − θ0)(θ − θ2)/((θ1 − θ0)(θ1 − θ2))] f(x0 + θ1h)
                + [(θ − θ0)(θ − θ1)/((θ2 − θ0)(θ2 − θ1))] f(x0 + θ2h).          (4.1.1)

For the most part, we have just swapped x for θ and xi for θi . This benign-looking change is actually a huge step
forward! This formula makes it apparent that the actual values of the xi are not important. It is only their location
relative to some base point, x0 , measured by some characteristic length, h, that matters. θ and the θi are those
measures. Essentially this makes x0 the origin and h the unit of measure on the x-axis. We measure all values by
how many lengths of h they are from x0 .
To illustrate the benefit, let us assume that we have three nodes, equally spaced, so the least and greatest
nodes are the same distance from the third, middle node. Setting the central node as the base point, x0 , and the
characteristic length, h, to the distance from this central node to the others, we can then label them

x0 − h, x0 , and x0 + h.

And we have already arrived at the essential point. It doesn’t matter if the set of nodes is {1, 2, 3} or {80, 90, 100}
or {−4.3, −4.2, −4.1}. In each of these sets, we have three nodes, one of which is the midpoint of the other two.
Each set of nodes is equal to the set {x0 − h, x0 , x0 + h} for some values of x0 and h. Hence, if we can do any
analysis with the set {x0 − h, x0 , x0 + h}, then we get information about working with any of the sets of nodes
{1, 2, 3} or {80, 90, 100} or {−4.3, −4.2, −4.1} and so on.
Back to the set of nodes {x0 − h, x0 , x0 + h}. For this set of nodes, we have θ0 = −1, θ1 = 0, and θ2 = 1.
Substituting into 4.1.1,

    P2(x0 + θh) = [(θ)(θ − 1)/((−1)(−2))] f(x0 − h) + [(θ + 1)(θ − 1)/((1)(−1))] f(x0) + [(θ + 1)(θ)/((2)(1))] f(x0 + h)
                = [(θ^2 − θ)/2] f(x0 − h) + (1 − θ^2) f(x0) + [(θ^2 + θ)/2] f(x0 + h).

Now this formula can be used to get the interpolating parabola over any set of three equally spaced nodes.
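As a quick numerical check, for any function and any three equally spaced nodes this formula must agree with direct interpolation, here done with Octave’s polyfit; the particular f, x0, h, and θ below are arbitrary sample choices.

f = @(x) cos(x);
x0 = 1; h = 0.1; theta = 0.37;
lhs = (theta^2-theta)/2*f(x0-h) + (1-theta^2)*f(x0) + (theta^2+theta)/2*f(x0+h);
rhs = polyval(polyfit([x0-h x0 x0+h], f([x0-h x0 x0+h]), 2), x0+theta*h);
lhs - rhs      % difference should be at roundoff level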
In an attempt to apply this formula to ν, consider the nodes 93/70, 99/70, and 105/70. Since 99/70 − 93/70 = 105/70 − 99/70, we have a set of nodes of the form {x0 − h, x0, x0 + h} with x0 = 99/70 and h = 6/70 = 3/35. It just so happens that

1.4 = 99/70 − (1/6) · (3/35), so we use θ = −1/6 to calculate P2,ν(1.4):

    P2,ν(1.4) = P2,ν(x0 − (1/6)h)
              = [((−1/6)^2 + 1/6)/2] ν(93/70) + [1 − (−1/6)^2] ν(99/70) + [((−1/6)^2 − 1/6)/2] ν(105/70)
              = [7ν(93/70) + 70ν(99/70) − 5ν(105/70)] / 72
              = [7(2.084603181618954) + 70(2.009751835391139) − 5(1.971614474758557)] / 72
              = 2.019677477429439.
This seems a pretty good estimate since it is between ν(93/70) ≈ 2.085 and ν(99/70) ≈ 2.009 but significantly closer
to 2.009. After all, 1.4 is between 93/70 ≈ 1.328 and 99/70 ≈ 1.414 but significantly closer to 1.414. Equation 3.2.3
gives us some idea how good we might expect this estimate to be.
But let’s back this calculation up just a couple steps. The constants of the [7ν(93/70) + 70ν(99/70) − 5ν(105/70)]/72 step were determined purely from the values of θ and the θi. And the 93/70, 99/70, and 105/70 are just the three nodes, x0 − h, x0, x0 + h, so what we really have here is a prescription, or formula, for the value P2(x0 − (1/6)h) for any degree at most 2 interpolating polynomial over the nodes x0 − h, x0, and x0 + h:

    ν(x0 − (1/6)h) ≈ P2,ν(x0 − (1/6)h) = [7ν(x0 − h) + 70ν(x0) − 5ν(x0 + h)] / 72.
And there is nothing special about the particular ν in this formula either. None of the constants −1/6, 7, 70, −5, nor 72 is dependent on ν, but rather only dependent on the spacing of the nodes. Therefore, given any function f, we can extract from this calculation the succinct approximation formula

    f(x0 − (1/6)h) ≈ [7f(x0 − h) + 70f(x0) − 5f(x0 + h)] / 72.          (4.1.2)
This formula illustrates the real purpose in reframing the values of the xi in terms of x0 , h, and the θi . This way,
we get formulas applicable to a whole class of nodes, not just one particular set of nodes.
As for ϕ, the nodes 1/7, 4/7, and 1 are equally spaced, so the set {1/7, 4/7, 1} has the form {x0 − h, x0, x0 + h} where x0 = 4/7 and h = 3/7. Not by accident, it happens that 4/7 − (1/6) · (3/7) = 0.5, so ϕ(0.5) = ϕ(x0 − (1/6)h) where x0 = 4/7 and h = 3/7. Now we can use formula 4.1.2 to approximate ϕ(0.5)!

    ϕ(0.5) ≈ P2,ϕ(0.5) = [7ϕ(x0 − h) + 70ϕ(x0) − 5ϕ(x0 + h)] / 72
                       = [7(1.9498808918992) + 70(1.941460911122824) − 5(1.96122825291126)] / 72
                       = 1.94090678829633.
This time, we have completely circumvented any direct calculation and evaluation of P2,ϕ . Formula 4.1.2 allows us
to calculate P2,ϕ (0.5) directly from the values of ϕ at the three nodes. No need to calculate, refer back to, evaluate,
or simplify P2,ϕ ! All of that has been done in deriving the formula. Very quick. Very efficient.
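Formula 4.1.2 amounts to a single weighted average, so it is a one-liner in Octave. The sketch below applies it to the ϕ data and reproduces the value just computed; the second call illustrates that the formula is exact whenever f itself is quadratic (since then P2 = f). The argument names fm1, f0, fp1 stand for f(x0 − h), f(x0), and f(x0 + h).

est = @(fm1, f0, fp1) (7*fm1 + 70*f0 - 5*fp1)/72;
est(1.9498808918992, 1.941460911122824, 1.96122825291126)   % about 1.94090678829633
est(0.9^2, 1.0^2, 1.1^2) - (1.0 - 0.1/6)^2                   % zero up to roundoff for f(x) = x^2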

Stencils
A formula such as 4.1.2 is only applicable to a set of nodes and point of evaluation with the same geometry (relative
positioning) as those used to derive the formula. Therefore, it will be important to keep track of the geometry used
to derive such formulas. To that end, we often refer to a particular set of nodes with its corresponding point of
evaluation as a stencil. For example, the nodes x0 − h, x0, x0 + h with point of evaluation x0 − (1/6)h form a stencil—a
relative positioning of points that can be scaled (by changing the value of h) and translated (by changing the value
of x0). On a number line, this particular stencil looks like

[Stencil diagram omitted: nodes at x0 − h, x0, and x0 + h, with the point of evaluation x0 − (1/6)h marked.]

x0 can be located anywhere and h can be any size, even negative. It is this flexibility that makes formulas like 4.1.2
useful.
Now let’s suppose we do not have evenly spaced data, but we are interested in a point midway between two
others. An appropriate three-point stencil would use the nodes x0 − h, the leftmost node, x0 + h, the rightmost
node, x0 + θ1 h for some θ1 between −1 and 1, the middle node, and point of evaluation x0 , the point midway
between the leftmost and rightmost nodes. For θ1 = 1/3, this stencil looks like

[Stencil diagram omitted: nodes at x0 − h, x0 + (1/3)h, and x0 + h, with the point of evaluation x0 marked.]
And we can derive a formula for P2 (x0 ) based on the values of f at the three nodes. Plugging θ = 0, θ0 = −1,
θ1 = 1/3, and θ2 = 1 into equation 4.1.1, we get

    P2(x0) = [(−1/3)(−1)/((−4/3)(−2))] f(x0 − h) + [(1)(−1)/((4/3)(−2/3))] f(x0 + (1/3)h) + [(1)(−1/3)/((2)(2/3))] f(x0 + h)
           = [f(x0 − h) + 9f(x0 + (1/3)h) − 2f(x0 + h)] / 8,
again a succinct formula applicable to any function f . No need to calculate the interpolating polynomial or evaluate
it directly for any data that fit this stencil. That part has already been done and simplified.

Derivatives
Derivative formulas can be derived likewise. Once derived for a given stencil, they can be used very easily and
efficiently for other data fitting the same stencil. We now find the formula for the first derivative, P′2(x0 − (1/6)h), over the stencil used earlier. We begin by recognizing that in 4.1.1 x is a function of θ. In particular, x(θ) = x0 + hθ, so (d/dθ)x(θ) = h. By the chain rule, (d/dθ)P2(θ) = (d/dx)P2(x) · (d/dθ)x(θ) = h (d/dx)P2(x). From equation 4.1.1, we then have

    (d/dx)P2(x) = (1/h)(d/dθ)P2(θ)
                = [(θ − θ1) + (θ − θ2)] / [h(θ0 − θ1)(θ0 − θ2)] · f(x0 + θ0h)
                + [(θ − θ0) + (θ − θ2)] / [h(θ1 − θ0)(θ1 − θ2)] · f(x0 + θ1h)
                + [(θ − θ0) + (θ − θ1)] / [h(θ2 − θ0)(θ2 − θ1)] · f(x0 + θ2h).          (4.1.3)
h(θ2 − θ0 )(θ2 − θ1 )
In particular, when θ0 = −1, θ1 = 0, θ2 = 1, and θ = −1/6, we have

    P′2(x0 − (1/6)h) = [(−1/6) + (−7/6)] / [h(−1)(−2)] · f(x0 − h) + [(5/6) + (−7/6)] / [h(1)(−1)] · f(x0) + [(5/6) + (−1/6)] / [h(2)(1)] · f(x0 + h)          (4.1.4)
                     = [−2f(x0 − h) + f(x0) + f(x0 + h)] / (3h).
We now have a formula for P′2(x0 − (1/6)h) ≈ f′(x0 − (1/6)h) for the stencil with nodes x0 − h, x0, x0 + h and x = x0 − (1/6)h. We can now apply this formula to approximate ν′(1.4) and ϕ′(0.5).

    ν′(1.4) ≈ [−2ν(93/70) + ν(99/70) + ν(105/70)] / (3 · (3/35))
            = [−2(2.084603181618954) + 2.009751835391139 + 1.971614474758557] / (9/35)
            = −.7304890953430477.

Notice this is not exactly what we got in table 4.1 for ν′(1.4) using P2. The two estimates differ in the last few
digits. This is due to floating-point error affecting the calculations in different ways. Generally there is more error
in calculating directly from the interpolating polynomial because the data are processed much more heavily. Best
not to trust the last several digits in either calculation, however. Now

    ϕ′(0.5) ≈ [−2ϕ(1/7) + ϕ(4/7) + ϕ(1)] / (3 · (3/7))
            = [−2(1.9498808918992) + 1.941460911122824 + 1.96122825291126] / (9/7)
            = .002276851294420679.

Again, this is close to the approximation in table 4.1, but not exactly the same due to different floating-point errors
for the two calculations. But the point is made. Using a formula based on a stencil is preferable to working directly
from the interpolating polynomial. It is easier, more efficient, and can be automated.
Before moving on to integration, we make one more observation. When trying to approximate f using an
interpolating polynomial, it does not make much sense to consider a stencil like

where the point of evaluation is one of the nodes. We know, by definition of Pn , that Pn (xi ) = f (xi ) for each
node xi . Hence, the “formula” would be f (xi ) = P2 (xi ), and it would be exact, not an approximation. And not
particularly informative since this is one of the facts from which we calculated P2 ! On the other hand, it does make
sense to consider such a stencil when trying to approximate derivatives of f . There is no guarantee the derivative
of Pn will agree with the derivative of f anywhere, even at the nodes. Substituting θ0 = −1, θ1 = 0, θ2 = 1, and
θ = 0 into 4.1.3, we find
1 1 + (−1) 1
P20 (x0 ) = f (x0 − h) + f (x0 ) + f (x0 + h)
h(−1)(−2) h(1)(−1) h(2)(1)
f (x0 + h) − f (x0 − h)
= , (4.1.5)
2h
for example.

Integrals
For integration formulas, we use a modified stencil. We need the nodes plus the endpoints of integration, which will
be identified by square brackets, [ for the left endpoint and ] for the right endpoint. But the process is analogous.
We find a formula for the interpolating polynomial and, in place of integrating the unknown function, we integrate
the interpolating polynomial.
Following this procedure, we can derive a formula for the integral of f over the stencil

for example. The algebra is straightforward but tedious, so we do not show it here. It is best to use a computer
algebra system to derive such a formula. The result, an approximation of the integral over [x0 + 2.5h, x0 + 6h] using
nodes x0 , x0 + h, x0 + 2h, x0 + 3h, x0 + 4h, x0 + 5h, and x0 + 6h, is
ˆ x0 +6h
h
f (x)dx ≈ [42056f (x0 + 6h) + 201831f (x0 + 5h) + 63357f (x0 + 4h)
x0 +2.5h 138240
+195902f (x0 + 3h) − 28518f (x0 + 2h) + 10731f (x0 + h) − 1519f (x0 )] .
´ 1.5
This formula can now be used to approximate 1.4 ν(x)dx instead of integrating the interpolating polynomial
directly as done on page 129. You are invited to plug in the appropriate values of ν and compare your answer to
the one in table on page 129. Answer on´page 136.
1
The stencil for the approximation of 0 ϕ(x)dx using P6,ϕ looks like
134 CHAPTER 4. NUMERICAL CALCULUS

,
´ 1.5
different from the one we used to approximate 1.4 ν(x)dx. Consequently, the approximation formula is different
too. We need a formula for the integral over [x0 − h, x0 + 6h] with nodes x0 , x0 + h, x0 + 2h, x0 + 3h, x0 + 4h,
x0 + 5h, and x0 + 6h. The nodes are the same as before, but the interval of integration is different. The result is
ˆ x0 +6h
h
f (x)dx ≈ [5257f (x0 + 6h) − 5880f (x0 + 5h) + 59829f (x0 + 4h)
x0 −h 8640
−81536f (x0 + 3h) + 102459f (x0 + 2h) − 50568f (x0 + h) + 30919f (x0 )] . (4.1.6)

Again, a computer algebra system should


´ 1 be used to derive such a formula. You are now invited to plug in the
appropriate values of ϕ to approximate 0 ϕ(x)dx and compare your result to the one in table on page 129. Answer
on page 136.

Key Concepts
node: the abscissa (first coordinate) of a data point used in interpolation.
polynomial approximation: approximating the value of a function, its derivative or integral based on the cor-
responding value of an interpolating polynomial.
stencil: relative positioning of the abscissas used in a polynomial approximation.

1
Exercises (c) Substitute θ = 2
into your formula from (b) and
simplify. [A]
1. Derive an approximation formula for the first derivative
over the stencil 3. Derive an approximation formula for the first derivative
over the stencil

[S]
following these steps.
following these steps.
(a) Write down L1 (x), the Lagrange form of the inter-
(a) Calculate N2 (x), the Newton form of the interpo-
polating polynomial passing through the points
lating polynomial passing through the points
(x0 , f (x0 )) and (x1 , f (x1 )). (x0 , f (x0 )), (x1 , f (x1 )), and (x2 , f (x2 )).

(b) Calculate the derivative L01 (x). (b) Calculate the derivative N20 (x).
(c) Substitute x0 + 21 h for x and x0 + h for x1 in your (c) Substitute x0 + 21 h for x, x0 +h for x1 , and x0 +2h
formula from (b) and simplify. for x2 in your formula from (b) and simplify. [A]
2. Derive an approximation formula for the first derivative 4. Derive an approximation formula for the second deriva-
over the stencil tive over the stencil

[S]
following these steps.
following these steps.
(a) Calculate N2 (x(θ)) = N2 (x0 + θh), the New-
(a) Write down L1 (x(θ)) = L1 (x0 + θh), the La- ton form of the interpolating polynomial passing
grange form of the interpolating polynomial pass- through the points
ing through the points
(x0 , f (x0 )), (x0 + h, f (x0 + h)),
(x0 , f (x0 )) and (x0 + h, f (x0 + h)) and (x0 + 2h, f (x0 + 2h))
in terms of θ, h, and x0 . in terms of θ, h, and x0 .
2
(b) Calculate the derivative dx
d
L1 (x(θ)). Remember, (b) Calculate the derivative dx
d
2 N2 (x(θ)). Remem-
x(θ) = x0 + θh, and use the chain rule. ber, x(θ) = x0 + θh, and use the chain rule.
4.1. RUDIMENTS OF NUMERICAL CALCULUS 135

1
(c) Substitute θ = 2
into your formula from (b) and (b) for the first derivative.
simplify. (c) for the second derivative.
5. Formula 4.1.5 and the formula you got from question (d) for the third derivative. What can you say about
1 should be different. However, they were derived over this formula?
essentially the same stencil—two nodes with the point
of evaluation centered between them. Only the labels 11. The polynomial p(x) = 3x4 − 2x2 + x − 7 is an interpo-
on the stencils were different. In other words, they lating polynomial for f . Use p to approximate
were derived from the same geometry, so, in some sense,
(a) f (1)
must be the same. In question 1, x0 plays the same role
[A]
as x0 − h does in 4.1.5. Moreover, in question 1, the (b) f (2)
distance from the point of evaluation to either node is (c) f 0 (1)
h
while in 4.1.5, that distance is h. Make the substitu-
2 (d) f 0 (2) [S]
tion x0 for x0 − h in 4.1.5. Then make the substitution ˆ 1
h
2
for the h in the denominator of 4.1.5. With these (e) f (x)dx
substitutions, formula 4.1.5 should match exactly the 0
formula you got in question 1. In other words, different ˆ 2
[A]
labelings in a stencil produce different labelings in the (f) f (x)dx
0
associated formula. Nothing more.
6. Use formula 4.1.6 to approximate the integral. 12. The polynomial q(x) = −7x4 + 3x2 − x + 4 is an inter-
polating polynomial for g. Use q to approximate
ˆ 3
(a) ex dx [A] (a) g(1) [A]
−4
ˆ 6 (b) g(2)
(b) sin x dx (c) g 0 (1) [A]
−1
ˆ 17 (d) g (2)
0
1
(c) dx [S] ˆ 1
x−5 [S]
10
ˆ (e) g(x)dx
4 0
(d) x5 − 4 dx ˆ

2
−3
ˆ (f) g(x)dx
1 0
−x [A]
(e) e dx
0 13. Use 4.1.3 to find the formula for the first derivative over
ˆ π/2 the stencil
(f) cos x dx
−π/2 (a)
ˆ 2
1 [A]
(g) dx
1 x
ˆ 6.1
[A]
(h) 9 − x4 dx (b)

4

7. For each integral in question 6, (i) calculate the inte-


gral exactly, and (ii) calculate the absolute error in the (c)
approximation. [S][A]
8. Let f (x) = (x − 1)2 sin x. Use formula 4.1.4 to approx-
imate f 0 (0) using
[S]
(d)
(a) h = 1
1 [A]
(b) h = 2
(c) h = 1 (e)
4
1
(d) h = 8 [A]
(f)
9. Calculate the absolute error in each approximation of
question 8. Does the error get smaller as h gets smaller?
[A] (g)

10. Derive an approximation formula over the stencil [A]


(h)

14. Find a general approximation formula for the integral


using two nodes by doing the following.

(a) Write down the (linear) interpolating polynomial


(a) for the value of the function. with nodes x0 + θ2 h and x0 + θ3 h.
136 CHAPTER 4. NUMERICAL CALCULUS

(b) Integrate the polynomial over the interval [x0 + (e) [A]

θ0 h, x0 + θ1 h].
[A]
(c) Simplify.
15. Use the general approximation formula you derived in 16. A general three point formula for the first derivative
question 14 to find an approximation formula over the using f (x0 ), f (x0 + αh), and f (x0 + 2h), α 6= 0 and
stencil. α 6= 2, is given by
[A]
(a)
1 2+α
h
f 0 (x0 ) = − f (x0 )
2h α
(b) 4
+ f (x0 + αh)
α(2 − α)
α
i
(c) [S] − f (x0 + 2h) + O(h2 )
2−α

(d) Use Taylor expansions of f (x0 + αh) and f (x0 + 2h) to


derive the given formula.

Answers
´ x0 +6h
x0 +2.5h
f (x)dx:
1/35
[42056(1.971614474758557) + 201831(1.981091507449763)
138240
+63357(1.993574976724822) + 195902(2.009751835391139)
−28518(2.030278824314539) + 10731(2.055494116570853)
−1519(2.084603181618954)]
´ x0 +6h
x0 −h
f (x)dx:
1/7
[5257(1.96122825291126) − 5880(1.965674866641883)
8640
+59829(1.960870620285721) − 81536(1.941460911122824)
+102459(1.923339403354019) − 50568(1.951091775564697)
+30919(1.9498808918992)]
4.2. UNDETERMINED COEFFICIENTS 137

4.2 Undetermined Coefficients


The basic idea
According to equation 3.2.3, the difference between f and an interpolating polynomial is a multiple of f (n+1) (ξx ).
In other words, the error in approximating f by the interpolating polynomial Pn depends directly on f (n+1) . But
f (n+1) (x) is identically zero whenever f is a polynomial of degree less than n + 1. Consequently, (f − Pn )(x) is
identically zero in this case. At the risk of sounding redundant, this last thought is worthy of repeating. If f is
any polynomial of degree less than n + 1, then Pn , computed for any set of n + 1 nodes, equals f exactly, for all
x. As a result, derivatives of Pn and integrals of Pn are not just approximations of the corresponding derivatives
and integrals of f . They are exact because Pn = f for all x. This observation can be used to derive formulas for
derivatives and integrals without ever computing Pn or its derivatives or integrals!
All the formulas we have been deriving for approximating derivatives and integrals of the arbitrary function f
have taken the form
Xn
ai f (xi )
i=0

where x0 , x1 , . . . , xn are the nodes of the interpolating polynomial, places where the value of f is known, and the
ai are constants resulting from the derivation. The Method of Undetermined Coefficients takes a direct approach
to calculating the constants ai . Knowing that the “approximation” formula must be exact for all polynomials of
degree 0, 1, . . . , n, we can create n + 1 equations in the n + 1 unknowns, a0 , a1 , . . . , an . The solution of the resulting
system of equations gives the values of the coefficients.

Derivatives
We seek an approximation of the k th derivative of f based on knowledge of the values f (x0 + θ0 h), f (x0 +
θ1 h), . . . , f (x0 + θn h). To be precise, we desire an approximation of the form
n
f (k) (x0 + θh) ≈
X
ai f (x0 + θi h). (4.2.1)
i=0

Due to equation 3.2.3, the approximation must be exact for all polynomials of degree n or less. In particular, it
must be exact for the polynomials pj (x) = (x − x0 )j , j = 0, 1, . . . , n. Symbolically, it must be that
n
(k)
X
pj (x0 + θh) = ai pj (x0 + θi h)
i=0

for j = 0, 1, . . . , n. Notice the approximation has become an (exact) equality. Noting that pj (x0 + θi h) = ((x0 +
θi h) − x0 )j = (θi h)j , the system of equations becomes
n
(k)
X
pj (x0 + θh) = a0 + (θi h)j ai (4.2.2)
i=1

for j = 0, 1, . . . , n. It is the solution of this system that will yield the ai .

Crumpet 25: Vandermonde Matrices

In general, a system of linear equations may have zero, one, or many solutions. However, system 4.2.2 has a
special form. In each equation, the constants (θi h)j form a geometric progression. Such a matrix of coefficients
is called a Vandermonde matrix, and it is known that as long as the θi are distinct, this system will have one
solution.

To illustrate, suppose we have the stencil


138 CHAPTER 4. NUMERICAL CALCULUS

and are interested in formulas for both the first and second derivatives of f (at x0 ). For this stencil, θ = 0, θ0 = −1,
θ1 = 0, and θ2 = 1, so we are looking for formulas of the forms

f 0 (x0 ) ≈ a0 f (x0 − h) + a1 f (x0 ) + a2 f (x0 + h)


and
f (x0 ) ≈ b0 f (x0 − h) + b1 f (x0 ) + b2 f (x0 + h).
00

Each of these formulas must be exact when f = p0 , when f = p1 , and when f = p2 . These three requirements give
three equations in the three unknowns.
Beginning with the first derivative formula, we detail system 4.2.2 with k = 1 and n = 2:

p00 (x0 ) = a0 p0 (x0 − h) + a1 p0 (x0 ) + a2 p0 (x0 + h)


p01 (x0 ) = a0 p1 (x0 − h) + a1 p1 (x0 ) + a2 p1 (x0 + h)
p02 (x0 ) = a0 p2 (x0 − h) + a1 p2 (x0 ) + a2 p2 (x0 + h)

By definition, p0 (x) = (x − x0 )0 = 1 so p00 (x0 ) = 0; p1 (x) = (x − x0 )1 = x − x0 so p01 (x0 ) = 1; and p2 (x) = (x − x0 )2


so p02 (x) = 2(x − x0 ) giving p02 (x0 ) = 0. Substituting this information into the equations above,

0 = a0 + a1 + a2
1 = −ha0 + ha2
0 = h2 a0 + h2 a2 .

The system can be solved by substitution, elimination, or computer algebra system. The solution is a0 = 2h ,
−1
1
a1 = 0, and a2 = 2h , giving the approximation formula

f (x0 + h) − f (x0 − h)
f 0 (x0 ) ≈
2h

just as we got on page 133 in formula 4.1.5.


The second derivative formula is derived in the same manner. Since the second derivative formula must be exact
when f = p0 , when f = p1 , and when f = p2 , the ai must satisfy

p000 (x0 ) = b0 p0 (x0 − h) + b1 p0 (x0 ) + b2 p0 (x0 + h)


p001 (x0 ) = b0 p1 (x0 − h) + b1 p1 (x0 ) + b2 p1 (x0 + h)
p002 (x0 ) = b0 p2 (x0 − h) + b1 p2 (x0 ) + b2 p2 (x0 + h),

system 4.2.2 with k = 2 and n = 2. Notice the right-hand sides are exactly the same as they are for the first
derivative formula, save the name change from ai to bi . Only the left-hand side changes substantively. p000 (x) = 0 so
p000 (x0 ) = 0; p001 (x) = 0 so p1 (x0 ) = 0; and p002 (x) = 2 so p002 (x0 ) = 2. Making these substitutions into the equations
above,

0 = b0 + b1 + b2
0 = −hb0 + hb2
2 = h2 b0 + h2 b2 .

Again, the system can be solved by substitution, elimination, or computer algebra system. The solution is b0 =
b2 = h12 and b1 = h22 , giving the approximation formula

f (x0 + h) − 2f (x0 ) + f (x0 − h)


f 00 (x0 ) ≈ .
h2
4.2. UNDETERMINED COEFFICIENTS 139

Integrals

The idea for estimating integrals is identical to that of estimating derivatives. The mechanics only change nominally.
´b
Where there were derivatives before, we will have integrals now. We seek an approximation of a f (x)dx based on
knowledge of the values f (x0 + θ0 h), f (x0 + θ1 h), . . . , f (x0 + θn h):

ˆ b n
X
f (x)dx ≈ ai f (x0 + θi h). (4.2.3)
a i=0

The approximation will be exact for all polynomials of degree n or less. In particular, it will be exact for pj (x) =
(x − x0 )j , j = 0, 1, . . . , n. Therefore, the system of equations

ˆ b n
X
pj (x)dx = a0 + (θi h)j ai j = 0, 1, . . . , n (4.2.4)
a i=1

must be satisfied by the ai .


To illustrate, suppose we have the stencil

For this stencil, a = x0 − h, b = x0 + 6h, and θi = ih, i = 0, 1, . . . , 6. Therefore, we will have a system of seven
equations in the seven unknowns. First, the left-hand sides:

ˆ b ˆ x0 +6h ˆ x0 +6h
x +6h
p0 (x)dx = p0 (x)dx = 1dx = (x − x0 )|x00 −h = 7h
a x0 −h x0 −h
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
1 2 35 2
p1 (x)dx = p1 (x)dx = (x − x0 )dx = (x − x0 ) =

h
a x0 −h x0 −h 2 x0 −h 2
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
2 1 3 217 3
p2 (x)dx = p2 (x)dx = (x − x0 ) dx = (x − x0 ) =

h
a x0 −h x0 −h 3 x0 −h 3
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
3 1 4 1295 4
p3 (x)dx = p3 (x)dx = (x − x0 ) dx = (x − x0 ) =

h
a x0 −h x0 −h 4 x0 −h 4
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
4 1 5 7777 5
p4 (x)dx = p4 (x)dx = (x − x0 ) dx = (x − x0 ) =

h
a x0 −h x0 −h 5 x0 −h 5
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
5 1 6 46655 6
p5 (x)dx = p5 (x)dx = (x − x0 ) dx = (x − x0 ) =

h
a x0 −h x0 −h 6 x0 −h 6
ˆ ˆ x0 +6h ˆ x0 +6h x0 +6h
b
6 1 7
p6 (x)dx = p6 (x)dx = (x − x0 ) dx = (x − x0 ) = 39991h7 .

a x0 −h x0 −h 7 x0 −h
140 CHAPTER 4. NUMERICAL CALCULUS

Now putting them together with the right-hand sides (and swapping sides):
6
(θi h)0 ai
X
= a0 + a1 + a2 + a3 + a4 + a5 + a6 = 7h
i=0
6
35 2
(θi h)1 ai
X
= ha1 + 2ha2 + 3ha3 + 4ha4 + 5ha5 + 6ha6 = h
i=0
2
6
217 3
(θi h)2 ai h2 a1 + 4h2 a2 + 9h2 a3 + 16h2 a4 + 25h2 a5 + 36h2 a6 =
X
= h
i=0
3
6
1295 4
(θi h)3 ai h3 a1 + 8h3 a2 + 27h3 a3 + 64h3 a4 + 125h3 a5 + 216h3 a6 =
X
= h
i=0
4
6
7777 5
(θi h)4 ai h4 a1 + 16h4 a2 + 81h4 a3 + 256h4 a4 + 625h4 a5 + 1296h4 a6 =
X
= h
i=0
5
6
46655 6
(θi h)5 ai h5 a1 + 32h5 a2 + 243h5 a3 + 1024h5 a4 + 3125h5 a5 + 7776h5 a6 =
X
= h
i=0
6
6
(θi h)6 ai h6 a1 + 64h6 a2 + 729h6 a3 + 4096h6 a4 + 15625h6 a5 + 46656h6 a6 = 39991h7
X
=
i=0

The system again may be solved by substitution, elimination, or computer algebra, at least in principle. Not many
humans have sufficient patience and precision to solve such a system with paper and pencil, though. Trusting a
computer algebra system, the solution is a0 = 30919 2107 34153 1274 19943
8640 h, a1 = − 360 h, a2 = 2880 h, a3 = − 135 h, a4 = 2880 h,
49 5257
a5 = − 72 h, and a6 = 8640 h giving the approximation formula
ˆ x0 +6h
h
f (x)dx ≈ [5257f (x0 + 6h) − 5880f (x0 + 5h) + 59829f (x0 + 4h) − 81536f (x0 + 3h)
x0 −h 8640
+102459f (x0 + 2h) − 50568f (x0 + h) + 30919f (x0 )] (4.2.5)

just as we got on page 134 in formula 4.1.6.

Practical considerations
We have used stencils like

and

not because the results are particularly helpful, but rather to (a) illustrate the methods and (b) emphasize that these
methods work in general for any stencil you may dream up. Most of the differentiation and integration formulas
presented in numerical analysis sources stick to a small host of regularly spaced stencils where, for derivatives the
point of evaluation is a node, and for integrals, all the nodes lie between the endpoints or there are nodes at both
endpoints. It is possible the regularly-spaced stencils are all you will ever need, but it is good to know that you can
derive appropriate formulas for more unusual stencils should the need arise.
As for their derivation, the main advantage of the method of undetermined coefficients over working directly
with interpolating polynomials is the ease of automation and lessening of the necessary and often laborious algebra
needed. In the method of undetermined coefficients, the only polynomials that need to be differentiated or integrated
4.2. UNDETERMINED COEFFICIENTS 141

are the polynomials pj = (x−x0 )j , a much simpler task than integrating or differentiating interpolating polynomials.
Formulas with up to three or four nodes can be handled this way with pencil and paper. The trade-off is the necessity
of solving a system of equations, again a simpler task than differentiating and simplifying interpolating polynomials
of degree 3 or 4. As a final benefit to the method of undetermined coefficients, it is a general solution technique
used not only in numerical analysis for deriving calculus approximations, but in other studies as well, particularly
differential equations. The method is applicable whenever the form of a solution or formula is known, but the
constants (coefficients) remain a mystery.

Crumpet 26: Undetermined Coefficients in Differential Equations

In differential equations, we know that a particular solution of the equation

y − 2y 0 + 3y 00 = 5 sin x (4.2.6)

has the form y = A sin x+B cos x, but we do not immediately know the values of A and B. They are undetermined
coefficients (at this point). They are determined by substituting the known form into the equation being solved.

y0 = A cos x − B sin x
y 00 = −A sin x − B cos x

So the equation being solved becomes

(A sin x + B cos x) − 2(A cos x − B sin x) + 3(−A sin x − B cos x) = 5 sin x.

Collecting the coefficients of sin x and cos x on the left side,

(−2A + 2B) sin x + (−2A − 2B) cos x = 5 sin x.

We now match coefficients on left and right sides to get the system of equations

−2A + 2B = 5
−2A − 2B = 0

whose solution is A = − 45 and B = 54 . Therefore, y = − 45 sin x + 5


4
cos x solves equation 4.2.6.
Conceptually, this process is no different from the method of undetermined coefficients used in deriving
numerical calculus formulas. The solution to some problem is known, save for some (undetermined) coefficients.
The parameters of the problem require the coefficients to satisfy some system of linear equations. The system is
solved, and the solution to the original problem is consequently known completely, coefficients determined.

When we get involved with stencils with more than 3 or 4 nodes, solving the resulting (relatively large) system of
linear equations by hand is not a task to which most of us would look forward. However, it is a standard calculation
any computer algebra system can do easily and efficiently. Yes, it is advisable to use a computer algebra system to
derive formulas as complicated as 4.1.6. We have used Maxima1 to handle or double check a number of the more
tedious calculations presented in this text.

Crumpet 27: wxMaxima

The best way to solve a large system of linear equations is with the aid of a computer algebra system. Figure
4.2.1 shows how wxMaxima may be used to derive formula 4.2.5.
Notice the similarities between Maxima code and Octave code. Maxima allows for statements, print state-
ments, variable assignments, arrays, and suppression of output. The syntax for these things is not the same, but

1 See http://maxima.sourceforge.net/
142 CHAPTER 4. NUMERICAL CALCULUS

Figure 4.2.1: wxMaxima deriving an integration formula

the principles behind them are. Once you have learned how to do these things in one language, learning how to
do them in another is usually straightforward.
Also notice the main difference between Maxima and Octave. Maxima was designed for symbolic manipulation
while Octave was designed for numerical computation. Octave can be made to do symbolic calculation and
Maxima can be made to do numerical computation, but the old carpenter’s adage “use the right tool for the
job” is worth consideration. Maxima is much more adept at symbolic manipulation than is Octave, and Octave
is much more adept at number crunching than is Maxima.

Reference
http://andrejv.github.io/wxmaxima/

It is unusual to use stencils with more than five nodes anyway. It is not because the formulas for more nodes
are significantly more complicated or difficult to use, however. As evidenced by formula 3.2.3, the error term for
an interpolating polynomial involves higher and higher derivatives of f as more nodes are added. This is generally
fine as long as f has sufficiently many derivatives and the values of the high derivatives are not prohibitively
large. However, numerical methods are often employed when the smoothness of f is known to be limited, the high
derivatives are known to be large, or the properties of its derivatives are unknown completely. For these functions,
stencils with fewer nodes, which give rise to formulas with lower order error terms, are often more accurate, not
less. And in the case of unknown smoothness, the lower order methods have a better chance of being accurate.
As a final note, some care must be taken not to ask too much of a derivative formula. With n+1 nodes, the error
term for the interpolating polynomial involves f (n+1) , so there is no hope of using these nodes to estimate f (n+1)
or any higher derivatives at any point. If you, however, forget this fact, it shows up in a direct way in the method
4.2. UNDETERMINED COEFFICIENTS 143

of undetermined coefficients. If k > n, then the system of equations with undetermined coefficients becomes
n
X
(θi h)j ai = 0, j = 0, 1, . . . , n
i=0

because the k th derivative of pj is identically 0 for all j ≤ n < k. The only solution to this system is a0 = a1 =
· · · = an = 0 giving the “approximation” formula
f (k) (x0 + θh) = 0.
Indeed, this is exact for all polynomials of degree n or less. However, the error in using this formula is exactly
f (k) (x0 + θh), a relative error of exactly 1, making it completely useless.

Stability
In Experiment 2 on page 3, section 1.1, we took a brief look at approximating the first derivative of f (x) = sin x
using the fact that
sin(1 + h) − sin(1 − h)
f 0 (1) = lim .
h→0 2h
The conclusion we drew was that this computation was highly susceptible to floating-point error. If calculations
are done exactly, then we expect sin(1+h)−sin(1−h)
2h to approximate f 0 (1) better and better as h becomes smaller and
smaller. Not so for floating-point calculations, as the experiment revealed. There was a point at which making
h smaller made the approximation worse! And this example is not unique. This problem always arises when
approximating f 0 using the centered difference formula
f (x + h) − f (x − h)
f 0 (x) ≈ . (4.2.7)
2h
But how can we predict at what value of h that might happen without comparing our results to the exact value of
the derivative? After all, numerical differentiation is employed most often when the exact formula for the derivative
is unknown or prohibitively difficult to compute.
Suppose f can be computed to near machine precision. In typical floating point calculations, including Octave,
that means a relative floating-point error of approximately 10−15 or absolute floating-point error εf ≈ 10−15 |f (x)|.
Since we assume h is small, we can approximate both |f˜(x + h) − f (x + h)| and |f˜(x − h) − f (x − h)| by εf giving
an absolute error of approximately 2εf in calculating the numerator f (x + h) − f (x − h). Assuming h is calculated
exactly, we have the absolute error
2εf εf |f (x)| 1
εr = |f˜0 (x) − f 0 (x)| ≈ = = · . (4.2.8)
2h h 1015 h
000
As we will see shortly, the algorithmic error, εa , is caused by truncation and equals f 6(ξ) h2 for some value of ξ

near x. Since ξ is near x, we approximate f 000 (ξ) by f 000 (x) and conclude that
|f 000 (x)| 2
εa ≈ h . (4.2.9)
6
We now minimize the value of εr + εa by setting its derivative (with respect to h) equal to zero and solving the
resulting equation:
d |f (x)| 1 |f 000 (x)| 2
 
d
0= (εr + εa ) ≈ · + ·h
dh dh 1015 h 6
|f (x)| 1 |f 000 (x)|
= − 15 · 2 + ·h
10 h 3

|f 000 (x)| |f (x)| 1
·h ≈ ·
3 1015 h2
|f (x)| 3
h3 ≈ ·
|f 000 (x)| 1015
s
3 3|f (x)|
h ≈ · 10−5 .
|f 000 (x)|
144 CHAPTER 4. NUMERICAL CALCULUS

q
For Experiment 2 on page 3, this means we should expect the optimal value of h to be around 3 3sin(1)
sin(1)
· 10−5 ≈
1.44(10)−5 . We reproduce the table from Experiment 2 here with the addition of a third column, the actual absolute
error:
h p̃∗ (h) |p̃∗ (h) − f 0 (1)|
10−2 0.5402933008747335 9.00(10)−6
10−3 0.5403022158176896 9.00(10)−8
10−4 0.5403023049677103 9.00(10)−10
10−5 0.5403023058569989 1.11(10)−11
10−6 0.5403023058958567 2.77(10)−11
10−7 0.5403023056738121 1.94(10)−10

Indeed, when h = 10−5 , we get our best results! However, the prediction of the optimal value of h was based on
knowledge of f 000 , something we generally will not be able to do. Unless we happen to know that |f|f000(x)|
(x)| is far from
1, we assume it is reasonably close to 1, in which case the optimal value of h is around 10−5 . Similar estimates can
be made for other derivative formulas.
Because numerical differentiation is so sensitive to floating-point error, we say that it is unstable. The root
finding methods and numerical integration we have discussed are all stable methods. Their sensitivity to floating-
point error is commensurate with that of calculating f .

Key Concepts
undetermined coefficients: A method for solving problems in which the solution is known save for a set of
(undetermined) coefficients.

Exercises [S]
(j)
1. Using the method of undetermined coefficients, derive
an approximation formula for the first derivative over
the stencil. (k)
(a)

[A]
(l)
[A]
(b)
2. Using the method of undetermined coefficients, derive
an approximation formula for the second derivative
over the stencil.
(c)
(a)

[S]
(d) [A]
(b)

(c)
(e)

[A]
(d)
[A]
(f)

(g) (e)

[S]
(f)

[A]
(h)
(g)
(i)
4.2. UNDETERMINED COEFFICIENTS 145

[A]
(h) (e)

3. Use the method of undetermined coefficients to derive


[A]
an approximation formula over the stencil (f)

(g)

(a) for the value of the function.


[A]
(b) for the first derivative. (h)
(c) for the second derivative.
(d) for the third derivative. What can you say about (i)
this formula?
(e) compare the method of undetermined coefficients
to the direct method employed in question 10 of (j) [A]
section 4.1.
4. Use the method of undetermined coefficients to derive
an approximation formula for the integral over the sten- (k)
cil.
(a)
[S]
(l)
[S]
(b)
(m)
(c)
5. Using the method of undeterminedˆ coefficients, find a
x0 +θ1 h

(d) [A] general approximation formula for f (x)dx us-


x0 +θ0 h
ing the two nodes x0 + θ2 h and x0 + θ3 h.
146 CHAPTER 4. NUMERICAL CALCULUS

4.3 Error Analysis


Errors for first derivative formulas
In section 3.2, we found that if f has sufficient derivatives, then f and Pn , an interpolating polynomial of degree
at most n, differ according to equation 3.2.3 on page 107, copied here for convenience:

f (n+1) (ξx )
f (x) − Pn (x) = (x − x0 )(x − x1 ) · · · (x − xn ).
(n + 1)!

We can use this formula to derive a concise formula for the error in approximating f 0 (x) by Pn0 (x).
As done in section 3.2, suppose n ≥ 1 and x0 , x1 , . . . , xn are n distinct real numbers. Set w(x) = (x − x0 )(x −
x1 ) · · · (x − xn ), a = min(x0 , . . . , xn , x), and b = max(x0 , . . . , xn , x). We know from equation 3.2.3 that, assuming
f has n + 1 derivatives on (a, b) and f 0 , f 00 , . . . , f (n) are all continuous on [a, b], for each x ∈ [a, b],

f (n+1) (ξx )
f (x) − Pn (x) = w(x)
(n + 1)!
for some ξx ∈ (a, b). Hence,

d f (n+1) (ξx ) f (n+1) (ξx ) 0


 
f (x) −
0
Pn0 (x) = w(x) + w (x).
dx (n + 1)! (n + 1)!

Since w vanishes at each node, this formula simplifies nicely when x is a node. Without loss of generality, we
evaluate for x = x0 and get
f (n+1) (ξx0 ) 0
f 0 (x0 ) − Pn0 (x0 ) = w (x0 ).
(n + 1)!
From here on, the error formula is only valid at a node! This last expression can be simplified further by noting
that
Xn Y n n
X
w0 (x) = (x − xj ) = pi (x),
i=0 j=0 i=0
i6=j

where pi is as defined for equation 3.2.2 on page 106. But pi (x0 ) = 0 for all i except i = 0, so

w0 (x0 ) = p0 (x0 ) = (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ).

Substituting this expression for w0 , we have the first derivative error formula

f (n+1) (ξx0 )
f 0 (x0 ) − Pn0 (x0 ) = (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ).
(n + 1)!

Making the substitutions x0 + θi h for xi , i = 1, 2, . . . , n, to get a formula in terms of h and the θi :

f (n+1) (ξx0 )
f 0 (x0 ) − Pn0 (x0 ) = (−θ1 h)(−θ2 h) · · · (−θn h).
(n + 1)!

This error formula simplifies just a bit:

f (n+1) (ξ)
f 0 (x0 ) − Pn0 (x0 ) = θ1 θ2 · · · θn (−h)n . (4.3.1)
(n + 1)!

For the stencil

n = 4, θ1 = −1, θ2 = 1, θ3 = 2, and θ4 = 3, so the error in calculating f 0 over this stencil is

f (5) (ξ) f (5) (ξ) 4


(−1)(1)(2)(3)(−h)4 = − h .
120 20
4.3. ERROR ANALYSIS 147

Error terms for the first derivative over other stencils are computed similarly as long as the derivative is evaluated
at a node. Table 4.2 summarizes some common first derivative formulas, including error terms.
Notice that the error term contains (x0 − x1 )(x0 − x2 ) · · · (x0 − xn ), the product of the differences between the
point of evaluation and all other nodes, as a factor. When the differences between the point of evaluation and
the other nodes is small, the product is small. Consequently, first derivative approximation formulas are generally
more accurate when the point of evaluation is centrally located among the nodes. Hence, we might expect a first
derivative formula involving nodes x0 < x1 < x2 to be more accurate when the point of evaluation is x1 rather
than when the point of evaluation is x0 or x2 . The same can be said about higher derivative formulas. The more
centrally located the point of evaluation, the more accurate the approximation.

Errors for other formulas


It is tempting to think we can simply repeat the procedure we used with first derivatives, taking the second
(n+1)
(ξx )
derivative of f (x) − Pn (x) = f (n+1)! w(x) to find the error for second derivative estimates, and the third derivative
f (n+1) (ξx )
of f (x) − Pn (x) = (n+1)! w(x) to find the error for third derivative estimates, and so on. Alas, the matter is
(ξx )
(n+1)
(ξx ) (n+1)
not so simple. Higher derivatives of f (x) − Pn (x) = f (n+1)! w(x) involve derivatives of the factor f (n+1)! which
do not vanish even when x is a node. Since ξx is entirely unknown, so are its derivatives, making this approach
unworkable. Other methods for producing precise bounds for certain higher derivative formulas or certain integral
formulas are limited in scope.
There is, however, a general method for determining good enough error terms for any derivative or integral
formula. We replace each evaluation of f in the approximation by a Taylor series expanded about x0 and simplify.
This gives an expression for the approximation in terms of f (x0 ), f 0 (x0 ), f 00 (x0 ), and so on. We compare it to
the Taylor series representation of the quantity being estimated. The difference between the two is the error. In
summary, that’s it. Making a rigorous argument of this method takes some care and is worthy of an example. We
demonstrate the method for the approximation of the first derivative over the stencil

Again, we choose this stencil not because the stencil is generally useful, but rather to emphasize that the method is
generally useful.
In subsection 4.1 on page 132, we derived the approximation

1 −2f (x0 − h) + f (x0 ) + f (x0 + h)


 
f 0 x0 − h ≈ . (4.3.2)
6 3h

The left hand side, the quantity being approximated, as a Taylor series looks like

1 1 1 1 3 (4)
 
f x0 − h = f 0 (x0 ) − hf 00 (x0 ) + h2 f 000 (x0 ) −
0
h f (x0 ) + · · · .
6 6 72 1296

The terms of the right hand side, the approximation, as Taylor series look like
1 1 3 000 1
f (x0 − h) = f (x0 ) − hf 0 (x0 ) + h2 f 00 (x0 ) − h f (x0 ) + h4 f (4) (x0 ) − · · ·
2 6 24
f (x0 ) = f (x0 )
1 1 3 000 1
f (x0 + h) = f (x0 ) + hf 0 (x0 ) + h2 f 00 (x0 ) + h f (x0 ) + h4 f (4) (x0 ) + · · · .
2 6 24
We now substitute these Taylor series into the right hand side of 4.3.2 and simplify. To facilitate the algebra, we
begin by summing −2f (x0 − h) + f (x0 ) + f (x0 + h):

−2f (x0 − h) = −2f (x0 ) + 2hf 0 (x0 ) − h2 f 00 (x0 ) + 31 h3 f 000 (x0 ) − 12 1 4 (4)
h f (x0 ) − · · ·
f (x0 ) = f (x0 )
f (x0 + h) = f (x0 ) + hf 0 (x0 ) + 21 h2 f 00 (x0 ) + 16 h3 f 000 (x0 ) + 24
1 4 (4)
h f (x0 ) + · · ·
1 2 00 1 3 000 1 4 (4)
−2f (x0 − h) + f (x0 ) + f (x0 + h) = 3hf (x0 ) − 2 h f (x0 ) + 2 h f (x0 ) − 24 h f (x0 ) + · · · .
0
148 CHAPTER 4. NUMERICAL CALCULUS

Hence, we have

−2f (x0 − h) + f (x0 ) + f (x0 + h) 3hf 0 (x0 ) − 21 h2 f 00 (x0 ) + 12 h3 f 000 (x0 ) − 24


1 4 (4)
h f (x0 ) + · · ·
=
3h 3h
1 00 1 2 000 1
= f (x0 ) − hf (x0 ) + h f (x0 ) − h3 f (4) (x0 ) + · · · .
0
6 6 72
−2f (x0 −h)+f (x0 )+f (x0 +h)
For the error, e(h) = f 0 x0 − 16 h − , we then get

3h

1 1 1 3 (4)
 
f 0 (x0 ) − hf 00 (x0 ) + h2 f 000 (x0 ) − h f (x0 ) + · · ·
6 72 1296
1 1 1
 
− f 0 (x0 ) − hf 00 (x0 ) + h2 f 000 (x0 ) − h3 f (4) (x0 ) + · · ·
6 6 72
11 2 000 17 3 (4)
= − h f (x0 ) + h f (x0 ) + · · · .
72 1296
We now know that we have an error of the form O(h2 f 000 (ξh )), the form of the remaining term with least degree,
but we do not have rigorous proof of that fact. Think of what has been done so far as discovery. Now that we know
the f 000 terms do not cancel, we go back and truncate all the Taylor series after the f 00 terms, replacing higher order
derivatives with an error term, and “redo” the algebra. We thus have

1 1 1
 
0
f x0 − h = f 0 (x0 ) − hf 00 (x0 ) + h2 f 000 (ξ1 )
6 6 72
1 1
f (x0 − h) = f (x0 ) − hf 0 (x0 ) + h2 f 00 (x0 ) − h3 f 000 (ξ2 )
2 6
f (x0 ) = f (x0 )
1 1
f (x0 + h) = f (x0 ) + hf 0 (x0 ) + h2 f 00 (x0 ) + h3 f 000 (ξ3 )
2 6

where ξ1 ∈ (x0 − 16 h, x0 ), ξ2 ∈ (x0 − h, x0 ), and ξ3 ∈ (x0 , x0 + h). And now when we compute e(h) = f 0 x0 − 61 h −

−2f (x0 −h)+f (x0 )+f (x0 +h)
3h , we know all the terms involving f , f 0 , and f 00 vanish. The only terms left are those
involving f :000

1 2 000 −2(− 61 h3 f 000 (ξ2 )) + 16 h3 f 000 (ξ3 )


e(h) = h f (ξ1 ) −
72 3h
1 2 000 1 2 000 1
= h f (ξ1 ) − h f (ξ2 ) − h2 f 000 (ξ3 )
72  9 18 
h2 1 000 1
= f (ξ1 ) − f 000 (ξ2 ) − f 000 (ξ3 ) .
9 8 2

The final formality is that of converting this expression into big-oh notation:
2
h 1 000 1

|e(h)| = f (ξ1 ) − f 000 (ξ2 ) − f 000 (ξ3 )

9 8 2
2
h 1 000 1 000
 
(ξ ) + 000
(ξ )| + (ξ )

≤ f 1
|f 2
f 3

9 8 2
h2 13
≤ · max {|f 000 (ξ1 )| , |f 000 (ξ2 )| , |f 000 (ξ3 )|}
9 8
= h2 · M |f 000 (ξh )|
13
for some ξh ∈ (x0 − h, x0 + h) and M = 72 (the value of ξh is ξ1 , ξ2 , or ξ3 ). We conclude

e(h) = O(h2 f 000 (ξh )).

In general, ξh is guaranteed to be between the least node and the greatest node. In the case of an integral
approximation, the endpoints of integration are treated as nodes for the purpose of locating ξh .
4.3. ERROR ANALYSIS 149

Gaussian quadrature
Ultimately, the accuracy of a numerical calculus formula is measured by its error term, a quantity having the form
O(hn f (k) (ξh )). If we are interested in the rate of convergence, we consider n, the power of h appearing in the error
term. The greater the power, the speedier the convergence. However, if we are interested in the largest class of
polynomials for which the formula is exact, we need to consider the value k, the order of the derivative appearing
in the error term. The greater k is, the larger the class of polynomials for which the formula is exact. In fact, if the
error term contains a factor of f (k) (ξh ), then the formula is exact for all polynomials up to (and including) degree
k − 1. The further implication is that there are degree k polynomials for which the formula is not exact, for if this
were not the case, then the error term would involve a higher derivative. We call the value k − 1 the degree of
precision. Formally, the degree of precision of a numerical calculus formula is the integer m such that the formula
is exact for all polynomials of degree up to and including m but is not exact for all polynomials of degree m + 1.
Gaussian quadrature formulas aim to maximize the degree of precision for integral formulas.
The numerical derivatives and integrals over a stencil with n + 1 points that we have derived so far are exact
for all polynomials up to degree n as they must be. They have degree of precision at least n. As it turns out, a
select few have degree of precision greater than n. Consider the second derivative approximation over the stencil

The stencil has three points, so we expect it to be exact for all polynomials up to degree 2 (and it is). However, its
error term is O(h2 f (4) (ξh )), indicating that the formula is exact for all polynomials up to degree 3. The degree of
precision is actually 3, not 2. The first derivative formula over the same stencil is similar. Though it has an error
2
term of h6 f 000 (ξh ), indicating that the formula has degree of precision 2 as expected, the formula itself only involves
two of the three points available! The coefficient of f (x0 ) turns out to be zero. It follows that we can derive the
same formula using the stencil

having only two points yet having degree of precision 2. Several other centered differences have this attribute. The
Newton-Cotes formulas with an odd number of nodes also have this property. Their error terms exceed degree of
precision expectations by one degree. We noted earlier that a centrally located point of evaluation tends to increase
accuracy, and now we see that the increase can be dramatic.
What we might gather from these observations is that it is not only the number of nodes that determines the
error term of a numerical calculus formula. The location of the nodes is also important. Up to now, we have only
seen how node location affects derivative approximation. We know that centrally locating the point of evaluation
generally increases accuracy. We now take up the question of how to locate nodes in order to increase the accuracy
of integral formulas. The idea of a centralized point of evaluation has no meaning in this context, however. Integrals
do not have a single point of evaluation. They are taken over an interval. It is the locations of the nodes relative
to the endpoints of evaluation that are important. We now find out where to put the nodes to attain the greatest
degree of precision for any given number of nodes.
Let Gn be the nth Legendre polynomial, defined recursively by

(2n + 1)xGn (x) − nGn−1 (x)


Gn+1 (x) =
n+1
G0 (x) = 1
G1 (x) = x.

We set the θi equal to the roots of Gn to derive the n-point quadrature formula over the interval [x0 − h, x0 + h]
with greatest degree of precision possible. With placement of the nodes chosen, we force the formula to be exact
for polynomials up to degree n − 1 as we did earlier. The difference this time is, due to the particular values of θi ,
the resulting formula will be exact for all polynomials up to degree 2n − 1. When the nodes are placed at the roots
´ x +h
of the nth Legendre polynomial, we get a quadrature formula for x00−h f (x)dx that exceeds the expected degree of
precision by n, the number of nodes!
We demonstrate for n = 1 and n = 3.

G1 (x) = x
150 CHAPTER 4. NUMERICAL CALCULUS

has for its only root, 0. Hence, we seek a formula of the form
ˆ x0 +h
f (x)dx ≈ a0 f (x0 )
x0 −h

which is exact for polynomials up to degree 0. The one equation for the one unknown, a0 , is
ˆ x0 +h
(1)dx = a0 (1)
x0 −h

or 2h = a0 . Hence, we have
ˆ x0 +h
f (x)dx ≈ 2hf (x0 ),
x0 −h

which we claim has degree of precision 1, not 0. Indeed, for f (x) = x − x0 ,


ˆ x0 +h x0 +h
1
f (x)dx = (x − x0 )2 =0

x0 −h 2 x0 −h

and
2hf (x0 ) = 2h(x0 − x0 ) = 0,

so it is exact for degree one polynomials. However, for f (x) = (x − x0 )2 ,


ˆ x0 +h x0 +h
1 2
f (x)dx = (x − x0 )3 = h3

x0 −h 3 x0 −h 3

and
2hf (x0 ) = 2h(x0 − x0 )2 = 0,

so it is not exact for all degree two polynomials. Therefore, its degree of precision is 1. Note the formula
ˆ x0 +h
f (x)dx ≈ 2hf (x0 ) is equivalent to the Midpoint Rule as found in Table 4.5.
x0 −h
Now
3xG1 (x) − G0 (x)
G2 (x) =
2
1 2
= (3x − 1)
2
so
5xG2 (x) − 2G1 (x)
G3 (x) =
3
5 3
2 (3x − x) − 2x
=
3
5(3x3 − x) − 4x
=
6
15x3 − 9x
=
6
1
= (5x3 − 3x),
2
q q
3 3
which has roots − 5 , 0, 5. Hence, we seek a formula of the form

ˆ x0 +h
! !
3 3
r r
f (x)dx ≈ a0 f x0 − h + a1 f (x0 ) + a2 f x0 + h
x0 −h 5 5
4.3. ERROR ANALYSIS 151

which is exact for polynomials up to degree 2. The three equations for the three unknowns are
ˆ x0 +h
(1)dx = 2h = a0 + a1 + a2
x0 −h
ˆ x0 +h
3 3
r r
(x − x0 )dx = 0 = − ha0 + ha2
x0 −h 5 5
ˆ x0 +h
2 3 3 2 3
(x − x0 )2 dx = h = h a0 + h2 a2 .
x0 −h 3 5 5

The solution is
5 8
a0 = a2 = h and a1 = h,
9 9
so the quadrature formula is
ˆ x0 +h
" ! !#
3 3
r r
h
f (x)dx ≈ 5f x0 − h + 8f (x0 ) + 5f x0 + h .
x0 −h 9 5 5

The formula was derived to be exact for polynomials up to degree 2, so its degree of precision is at least 2. We
claim the degree of precision is actually 5. For f (x) = (x − x0 )3 ,
ˆ x0 +h x0 +h
1
f (x)dx = (x − x0 )4 =0

x0 −h 4 x0 −h

and
" ! !#  !3 !3 
3 3 3 3
r r r r
h h
5f x0 − h + 8f (x0 ) + 5f x0 + h = 5 − h +0+5 h  = 0,
9 5 5 9 5 5

so it is exact for degree three polynomials. For f (x) = (x − x0 )4 ,


ˆ x0 +h x0 +h
1 5 2
f (x)dx = (x − x0 ) = h5

x0 −h 5 x0 −h 5

and
" ! !#  !4 !4 
3 3 3 3
r r r r
h h
5f x0 − h + 8f (x0 ) + 5f x0 + h = 5 − h +0+5 h
9 5 5 9 5 5

5 9 4 9 4
 
= h h + h
9 25 25
2 5
= h ,
5
so it is exact for degree four polynomials. For f (x) = (x − x0 )5 ,
ˆ x0 +h x0 +h
1 6
f (x)dx = (x − x0 ) =0

x0 −h 6 x0 −h

and
" ! !#  r !5 r !5 
3 3 3 3
r r
h h
5f x0 − h + 8f (x0 ) + 5f x0 + h = 5 − h +0+5 h  = 0,
9 5 5 9 5 5

so it is exact for degree five polynomials. However, for f (x) = (x − x0 )6 ,


ˆ x0 +h x0 +h
1 7 2
f (x)dx = (x − x0 ) = h7

x0 −h 7 x0 −h 7
152 CHAPTER 4. NUMERICAL CALCULUS

and
" ! !#  !6 !6 
3 3 3 3
r r r r
h h
5f x0 − h + 8f (x0 ) + 5f x0 + h = 5 − h +0+5 h
9 5 5 9 5 5

5 27 6 27 6
 
= h h + h
9 125 125
3 7
= h ,
25
so it is not exact for all degree six polynomials. Its degree of precision is 5. The formula is listed as the second
Gaussian quadrature formula in table 4.5.
We can also find the degree of precision of any numerical calculus formula by observing the form of its error
term. If the error term has the form O(hn f (k) (ξh )), then its degree of precision is k − 1.

Some standard formulas


Tables 4.2 , 4.3 , 4.4 , and 4.5 summarize some standard formulas for derivatives and integrals. Notice there are no
one-point formulas for any derivatives, no two-point formulas for second derivatives or higher, and no three-point
formulas for third derivatives or higher. The stencils have been streamlined to show only the values of θi . Hence,
the stencil

appears in the table as

Key Concepts
Degree of precision: The integer m such that a numerical calculus formula is exact for all polynomials of degree
up to and including m but is not exact for all polynomials of degree m + 1.

Error terms: Error terms for numerical calculus approximations can be found by replacing all occurrences of f
in an approximation formula by Taylor series expansions about x0 and reducing.
Gaussian quadrature: A quadrature method which maximizes the degree of precision relative to the number of
nodes used.

Quadrature: Another name for a numerical integration formula.


Weighted Mean Value Theorem: Assume that f and g are continuous on [a, b]. If g never changes sign and is
non-negative in [a, b], then we have that,
ˆ b ˆ b
f (x)g(x)dx = f (c) g(x)dx
a a

for some c in (a, b).


Stencil Formula Name

2-point formulas
4.3. ERROR ANALYSIS

−f (x0 ) + f (x0 + h) h 00
f 0 (x0 ) = − f (ξh ) Forward Difference
h 2
−f (x0 − h) + f (x0 ) h 00
f 0 (x0 ) = + f (ξh ) Backward Difference
h 2

3-point formulas

−3f (x0 ) + 4f (x0 + h) − f (x0 + 2h) h2 000


f 0 (x0 ) = + f (ξh ) Forward Difference
2h 3
−f (x0 − h) + f (x0 + h) h2 000
f 0 (x0 ) = + f (ξh ) Centered Difference
2h 6
f (x0 − 2h) − 4f (x0 − h) + 3f (x0 ) h2 000
f 0 (x0 ) = + f (ξh ) Backward Difference
2h 3

5-point formulas

−25f (x0 ) + 48f (x0 + h) − 36f (x0 + 2h) + 16f (x0 + 3h) − 3f (x0 + 4h) h4 (5)
f 0 (x0 ) = + f (ξh ) Forward Difference
12h 5
−3f (x0 − h) − 10f (x0 ) + 18f (x0 + h) − 6f (x0 + 2h) + f (x0 + 3h) h4 (5)
Table 4.2: Some standard first derivative formulas.

f 0 (x0 ) = + f (ξh )
12h 20
f (x0 − 2h) − 8f (x0 − h) + 8f (x0 + h) − f (x0 + 2h) h4 (5)
f 0 (x0 ) = + f (ξh ) Centered Difference
12h 30
−f (x0 − 3h) + 6f (x0 − 2h) − 18f (x0 − h) + 10f (x0 ) + 3f (x0 + h) h4 (5)
f 0 (x0 ) = + f (ξh )
12h 20
3f (x0 − 4h) − 16f (x0 − 3h) + 36f (x0 − 2h) − 48f (x0 − h) + 25f (x0 ) h4 (5)
f 0 (x0 ) = + f (ξh ) Backward Difference
12h 5
153
154

Stencil Formula Name

3-point formulas

f (x0 ) − 2f (x0 + h) + f (x0 + 2h)


f 00 (x0 ) = + O(hf (3) (ξh )) Forward Difference
h2
f (x0 − h) − 2f (x0 ) + f (x0 + h)
f 00 (x0 ) = + O(h2 f (4) (ξh )) Centered Difference
h2

4-point formulas

2f (x0 ) − 5f (x0 + h) + 4f (x0 + 2h) − f (x0 + 3h)


f 00 (x0 ) = + O(h2 f (4) (ξh )) Forward Difference
h2
f (x0 − h) − 2f (x0 ) + f (x0 + h)
f 00 (x0 ) = + O(h2 f (4) (ξh ))
h2

5-point formulas

35f (x0 ) − 104f (x0 + h) + 114f (x0 + 2h) − 56f (x0 + 3h) + 11f (x0 + 4h)
Table 4.3: Some second derivative formulas.

f 00 (x0 ) = + O(h3 f (5) (ξh )) Forward Difference


12h2
11f (x0 − h) − 20f (x0 ) + 6f (x0 + h) + 4f (x0 + 2h) − f (x0 + 3h)
f 00 (x0 ) = + O(h3 f (5) (ξh ))
12h2
−f (x0 − 2h) + 16f (x0 − h) − 30f (x0 ) + 16f (x0 + h) − f (x0 + 2h)
f 00 (x0 ) = + O(h4 f (6) (ξh )) Centered Difference
12h2
CHAPTER 4. NUMERICAL CALCULUS
Stencil Formula Name
4.3. ERROR ANALYSIS

4-point formulas

−f (x0 ) + 3f (x0 + h) − 3f (x0 + 2h) + f (x0 + 3h)


f 000 (x0 ) = + O(h)f (4) (ξh ) Forward Difference
h3
−f (x0 − h) + 3f (x0 ) − 3f (x0 + h) + f (x0 + 2h)
f 000 (x0 ) = + O(hf (4) (ξh ))
h3
−f (x0 − 2h) + 3f (x0 − h) − 3f (x0 ) + f (x0 + h)
f 000 (x0 ) = + O(hf (4) (ξh ))
h3
−f (x0 ) + 3f (x0 + h) − 3f (x0 + 2h) + f (x0 + 3h)
f 000 (x0 ) = + O(hf (4) (ξh )) Backward Difference
h3

5-point formulas

−5f (x0 ) + 18f (x0 + h) − 24f (x0 + 2h) + 14f (x0 + 3h) − 3f (x0 + 4h)
f 000 (x0 ) = + O(h2 f (5) (ξh )) Forward Difference
2h3
−3f (x0 − h) + 10f (x0 ) − 12f (x0 + h) + 6f (x0 + 2h) − f (x0 + 3h)
f 000 (x0 ) = + O(h2 f (5) (ξh ))
2h3
Table 4.4: Some third derivative formulas.

−f (x0 − 2h) + 2f (x0 − h) − 2f (x0 + h) + f (x0 + 2h)


f 000 (x0 ) = + O(h2 f (5) (ξh )) Centered Difference
2h3
f (x0 − 3h) − 6f (x0 − 2h) + 12f (x0 − h) − 10f (x0 ) + 3f (x0 + h)
f 000 (x0 ) = + O(h2 f (5) (ξh ))
2h3
3f (x0 − 4h) − 14f (x0 − 3h) + 24f (x0 − 2h) − 18f (x0 − h) + 5f (x0 )
f 000 (x0 ) = + O(h2 f (5) (ξh )) Backward Difference
2h3
155
156

Stencil Formula Name

open Newton-Cotes formulas


ˆ x0 +2h
f (x)dx = 2hf (x0 + h) + O(h3 f 00 (ξh )) Midpoint Rule
x0
ˆ x0 +3h
3h
f (x)dx = [f (x0 + h) + f (x0 + 2h)] + O(h3 f 00 (ξh ))
x0 2
ˆ x0 +4h
4h
f (x)dx = [2f (x0 + h) − f (x0 + 2h) + 2f (x0 + 3h)] + O(h5 f (4) (ξh ))
x0 3
ˆ x0 +5h
5h
f (x)dx = [11f (x0 + h) + f (x0 + 2h) + f (x0 + 3h) + 11f (x0 + 4h)] + O(h5 f (4) (ξh ))
x0 24

closed Newton-Cotes formulas


ˆ x0 +h
h
f (x)dx = [f (x0 ) + f (x0 + h)] + O(h3 f 00 (ξh )) Trapezoidal Rule
x0 2
ˆ x0 +2h
h
f (x)dx = [f (x0 ) + 4f (x0 + h) + f (x0 + 2h)] + O(h5 f (4) (ξh )) Simpson’s Rule
x0 3
ˆ x0 +3h
3h 3
Table 4.5: Some integration formulas.

f (x)dx = [f (x0 ) + 3f (x0 + h) + 3f (x0 + 2h) + f (x0 + 3h)] + O(h5 f (4) (ξh )) Simpson’s 8 Rule
x0 8
ˆ x0 +4h
2h
f (x)dx = [7f (x0 ) + 32f (x0 + h) + 12f (x0 + 2h) + 32f (x0 + 3h) + 7f (x0 + 4h)] + O(h7 f (6) (ξh )) Bode’s Rule
x0 45

Gaussian quadrature formulas


ˆ x0 +h     
1 1
f (x)dx = h f x0 − √ h + f x0 + √ h + O(h5 f (4) (ξh ))
x0 −h 3 3
ˆ x0 +h " r ! r !#
h 3 3
f (x)dx = 5f x0 − h + 8f (x0 ) + 5f x0 + h + O(h7 f (6) (ξh ))
x0 −h 9 5 5
CHAPTER 4. NUMERICAL CALCULUS
4.3. ERROR ANALYSIS 157

Exercises
1. Let f (x) = ex − sin x. Complete the following table using the approximation formula

−3f (x0 ) + 4f (x0 + h) − f (x0 + 2h)


f 0 (x0 ) ≈ .
2h
h approximate f 0 (2) abs. error
.01
.005
−.005
−.01

Is it OK to use negative values for h?


[A]
2. For each value of x in the table, use the most accurate three-point formula to approximate f 0 (x).

x f (x) f 0 (x)
−2.7 0.054797
−2.5 0.11342
−2.3 0.65536
−2.1 0.98472

3. Approximate the integral using Simpson’s rule.


ˆ 0
(a) x ln(x + 1)dx [S]
−0.5
´3
(b) 1
ln(x + 1) dx
ˆ 0.25
(c) (cos x)2 dx [A]

−0.25
´3
(d) 1
esin x dx
´2
(e) 1
x4 dx [A]

[S][A]
4. Do question 3 using the Trapezoidal rule.
[S][A]
5. Do question 3 using the Midpoint rule.
[S][A]
6. Find the error of the approximation in question 3.
[S][A]
7. Find the error of the approximation in question 4.
8. Find the error of the approximation in question 5. [S][A]
´ 11 √
9. Find the error in approximating −7 (32x2 + 7x − 2)dx using Simpson’s 38 Rule.
´ 36
10. Find the error in approximating −17 (32x5 + 7x3 − 2)dx using Bode’s Rule. [A]
11. For the following values of f , x0 , and h, use the formula

f (x0 + h) − f (x0 − h) h2 000


f 0 (x0 ) = − f (ξ)
2h 6
to approximate f 0 (x0 ).
[S]
(a) f (x) = ex ; x0 = 2; h = 0.1.
(b) f (x) = (cosh 2x)2 − sin x; x0 = π; h = 0.05. [A]

(c) f (x) = ln(2x − 3) + 5x; x = 10; h = 1.

12. Compute both a lower bound and an upper bound on the error for the approximation in question 11. Verify that the
actual error is between these bounds. [S][A]
[S][A]
13. For each part of question 11, find the value of ξ guaranteed by the formula.
14. State the degree of precision of the closed Newton-Cotes formula on 5 nodes, Bode’s Rule.
[S]
15. State the degree of precision of the five point formula.
1
f 0 (x0 ) = [−25f (x0 ) + 48f (x0 + h) − 36f (x0 + 2h)
12h
h4 (5)
+16f (x0 + 3h) − 3f (x0 + 4h)] + f (ξ)
5
158 CHAPTER 4. NUMERICAL CALCULUS

16. Find the degree of precision of the quadrature formula


ˆ 5
1 11
h   i
f (x) dx ≈ 3f + f (5) .
3 2 3

17. Find the error term for the quadrature method, and state its degree of precision.
ˆ x0 +h
(a) f (x) dx ≈ hf (x0 ) [A]
x0
ˆ x0 +h
h
 
(b) f (x) dx ≈ hf x0 +
x0 4
ˆ x0 +h
h 2
h   i
[S]
(c) f (x)dx ≈ 3f x0 + h + f (x0 )
x0 4 3
ˆ x0 +2h
h 4
h   i
(d) f (x)dx ≈ 3f x0 + h + f (x0 )
x0 2 3
ˆ x0 +3h
3h [A]
(e) f (x)dx ≈ [f (x0 ) + 3f (x0 + 2h)]
x0 4
ˆ x0 +2h
h h 3
h    i
(f) f (x)dx ≈ f x0 − + 3f x0 + h
x0 2 2 2
ˆ x0 +2h
h [A]
(g) f (x)dx ≈ [f (x0 − h) − 2f (x0 ) + 7f (x0 + h)]
x0 3
ˆ x0 +3h
3 3
h    i
(h) f (x)dx ≈ 3h 3f x0 + h − 6f (x0 + h) + 4f x0 + h
x0 2 4
ˆ x0 +3h
h 3 3 3
h      i
[A]
(i) f (x)dx ≈ − 208f x0 + h − 891f (x0 + h) + 1344f x0 + h − 625f x0 + h
x0 12 2 4 5

18. Find the error term for the derivative approximation:


f (x0 + 2h) − f (x0 ) [A]
(a) f 0 (x0 ) ≈
2h
0 f (x0 + 2h) − f (x0 − h)
(b) f (x0 ) ≈
3h
−3f (x 0 ) + 4f (x0 + h2 ) − f (x0 + h) [S]
(c) f 0 (x0 ) ≈
h
−13f (x 0 − 10h) − 12f (x0 + 5h) + 25f (x0 + 8h)
(d) f 0 (x0 ) ≈
270h
1 1 1 1
−7f (x 0 + h) + 416f (x 0 + 2 h) − 2916f (x0 + 3 h) + 5632f (x0 + 4 h) − 3125f (x0 + 5 h) [A]
(e) f 0 (x0 ) ≈
12h
00 2f (x0 − h) − 3f (x0 ) + f (x0 + 2h)
(f) f (x0 ) ≈
3h2
7f (x0 − 5h) − 12f (x0 ) + 5f (x0 + 7h) [A]
(g) f 00 (x0 ) ≈
210h2
5f (x0 − 5h) − 12f (x0 + 2h) + 7f (x0 + 7h)
(h) f 00 (x0 ) ≈
210h2
5f (x0 − 2h) + 32f (x0 − h) − 60f (x0 ) + 25f (x0 + 2h) − 2f (x0 + 4h) [A]
(i) f 00 (x0 ) ≈
60h2
19. Diffy Rence writes down the following approximation:

f 00 (3.0) ≈ 25[sin(2.8) − 2 sin(3.0) + sin(3.2)].


[S]
What is f (x)?
20. Let f (x) = sin x.

(a) Find a bound on the error of the approximation


−3 sin 6 + 4 sin 6.1 − sin 6.2
f 0 (6) ≈
0.2
according to the appropriate error term.
4.3. ERROR ANALYSIS 159

(b) Compare this bound to the actual error.

21. What can you say about the error in approximating the first derivative of

f (x) = −13x4 + 17x3 − 15x2 + 12x − 99

using a 5-point formula?


22. Let f (x) = 3x3 − 2x2 + x.

(a) Compute the error (not a bound on the error) in estimating f 0 (2) using the forward difference

f (x0 + h) − f (x0 )
h
with h = 0.1.
(b) Find ξ0.1 as guaranteed by the error term.

23. Let f (x) = sin x. Find a bound on the error of the approximation.
[A]
(a) f 00 (3.0) ≈ 25[sin(2.8) − 2 sin(3.0) + sin(3.2)]
(b) f 00 (3.0) ≈ 1600 [2 sin(3.0) − 5 sin(3.025) + 4 sin(3.05) − sin(3.075)]
[S]
(c) f 000 (3.0) ≈ 500000 [−5 sin(3.0) + 18 sin(3.01) − 24 sin(3.02) + 14 sin(3.03) − 3 sin(3.04)]
(d) f 000 (3.0) ≈ 1000 [− sin(2.8) + 3 sin(2.9) − 3 sin(3.0) + sin(3.1)]
ˆ 4
1
(e) f (x)dx ≈ [sin(3) + 4 sin(3.5) + sin(4)]
3 6
ˆ 4
1 7 1 7 1
    
[S]
(f) f (x)dx ≈ sin − √ + sin + √
3 2 2 2 3 2 2 3
[S]
24. Suppose you have the following data on a function f .

x 0 1 2 3 4
f (x) −0.2381 −0.3125 −0.4545 −0.8333 −5

(a) Approximate f 0 (4) and f 0 (2) using 5-point formulas.


(b) Which approximation would you expect to be more accurate, and why?
1
(c) Did it turn out that way? The data came from f (x) = x−4.2
.

25. Refer to the quadrature method


ˆ x0 +h
h h 2h h3 00
h    i
f (x) dx = f x0 + + f x0 + + f (ξ)
x0 2 3 3 36
[A]
in all of the following questions.

(a) What is the rate of convergence?


(b) What is the degree of precision?
´π
(c) Use the method to approximate 0
sin x dx.
(d) Find a bound on the error of this approximation.
(e) Compare the bound to the actual error.
ˆ 2
26. The Trapezoidal rule applied to f (x)dx gives the value 5, and the Midpoint rule gives the value 4. What value
0
does Simpson’s rule give?
´2 [A]
27. The Trapezoidal Rule applied to 0
f (x) dx gives the value 4, and Simpson’s Rule gives the value 2. What is f (1)?
[A]
28. When approximating f 000 (x0 ) using five nodes, the rate of convergence will be at least what?
29. Show that the average of the forward difference, −f (x0 )+f
h
(x0 +h)
, and backward difference, −f (x0 −h)+f (x0 )
h
, approxima-
f (x0 +h)−f (x0 −h)
tions of f (x0 ) gives the central difference approximation,
0
2h
, of f 0 (x0 ).
30. Chuck was “approximating” a definite integral using Simpson’s Rule. As you can see from his work below, he was
integrating a cubic polynomial. Calculate the error he incurred even though you can not read all the coefficients. [A]
160 CHAPTER 4. NUMERICAL CALCULUS

[A]
31. Repeat 30 supposing Chuck was using the Trapezoidal Rule.
32. Sketch the graph of a function f (x), and indicate on it values for x0 and h so that the backward difference f (x0 )−fh(x0 −h)
gives a better approximation of f 0 (x0 ) than does the central difference f (x0 +h)−f
2h
(x0 −h)
.
33. Sketch the graph of a function f(x) for which the Trapezoidal Rule gives a better approximation of ∫_0^1 f(x) dx than does Simpson’s Rule, and explain how you know. [S]
34. Suppose a 5-point formula is used to approximate f″(x₀) for stepsizes h = 0.1 and h = 0.02. If E₀.₁ represents the error in the approximation for h = 0.1 and E₀.₀₂ represents the error in the approximation for h = 0.02, what would you expect E₀.₀₂/E₀.₁ to be, approximately? [S]
35. A general three point formula using nodes x₀, x₀ + αh, and x₀ + 2h, (α ≠ 0, 2) is given by
    f′(x₀) ≈ (1/(2h))[ −((2 + α)/α)f(x₀) + (4/(α(2 − α)))f(x₀ + αh) − (α/(2 − α))f(x₀ + 2h) ].
    (a) Show that this formula reduces to one of the standard formulas when α = 1.
    (b) Find the error term for this formula. [A]
36. Find three different approximations for f′(0.2) using three-point formulas.
       x     f(x)
       0     1
       0.1   1.10517
       0.2   1.22140
       0.3   1.34986
       0.4   1.49182
    The graph of f‴(x) is shown below. Use it to rank your three approximations in order from least expected error to greatest expected error, and explain why you ranked them the way you did.
    [Figure: graph of f‴(x) over 0 ≤ x ≤ 0.4, with values between about 1 and 1.4.]
37. Verify numerically that the error in using the formula f′(x₀) = [−2f(x₀ − h) − 3f(x₀) + 6f(x₀ + h) − f(x₀ + 2h)]/(6h) to approximate f′(3) using the function f(x) = (cos 3x)² + ln x is really O(h³).
38. Numerically approximate the best estimate that can be obtained from the formula
    f′(3) = [−2f(3 − h) − 3f(3) + 6f(3 + h) − f(3 + 2h)] / (6h)
    with double precision (standard Octave) computation and f(x) = (cos 3x)² + ln x. What value of h gives this optimal approximation? [A]
39. Find the degree of precision of the quadrature formula
    ∫_{−1}^{1} f(x) dx = f(−√3/3) + f(√3/3).
40. The quadrature formula ∫_0^2 f(x) dx = c₀f(0) + c₁f(1) + c₂f(2) is exact for all polynomials of degree less than or equal to 2. Determine c₀, c₁, and c₂.

4.4 Composite Integration


In section 4.3 we supplied error terms that took the form O(h^k f^(l)(ξ_h)). As a prime example, the trapezoidal rule,

∫_{x₀}^{x₀+h} f(x)dx = (h/2)[f(x₀) + f(x₀ + h)] + O(h³f″(ξ_h)),

has error term O(h³f″(ξ_h)). This conclusion follows directly from a Taylor series analysis, but what does it mean?
Error terms for derivative approximations are comparatively easy to understand. Consider the first derivative approximation

f′(x₀) = [−f(x₀ − h) + f(x₀ + h)]/(2h) + (h²/6)f‴(ξ_h).

The smaller h is, the smaller the error in approximating f′(x₀) is (as long as the f‴(ξ_h) term doesn’t counteract the benefit of shrinking h). Error terms for integral approximations are not as straightforward because, in each case, the quantity being approximated depends on h. Changing h in the integration formula also changes the quantity being approximated. This is true of each formula in table 4.5. The trapezoidal rule is as good an example as any. The left hand side, the quantity being approximated, is ∫_{x₀}^{x₀+h} f(x)dx, so smaller h means approximating the integral over a smaller interval. So how does having a smaller error in approximating a different number tell us anything about the potential benefit of computing with smaller values of h? Careful study of the trapezoidal rule will reveal the answer.
According to the trapezoidal rule, (h/2)[f(x₀) + f(x₀ + h)] approximates the integral of f over the interval [x₀, x₀ + h]. If h is replaced by h/2, the resulting approximation, (h/4)[f(x₀) + f(x₀ + h/2)], is an approximation of the integral of f over the interval [x₀, x₀ + h/2]. It is no longer an approximation of the integral over [x₀, x₀ + h]! To use the trapezoidal rule to approximate the original quantity, the integral of f over [x₀, x₀ + h], using h/2 instead of h requires two applications of the trapezoidal rule: one over the interval [x₀, x₀ + h/2] and one over the interval [x₀ + h/2, x₀ + h]. The sum of these two approximations is an approximation for the integral of f over [x₀, x₀ + h]. Reducing h further requires more applications of the trapezoidal rule over more intervals. In general, reducing h to h/n for any whole number n requires n applications of the trapezoidal rule:

∫_{x₀}^{x₀+h} f(x)dx = ∫_{x₀}^{x₀+h/n} f(x)dx + ∫_{x₀+h/n}^{x₀+2h/n} f(x)dx + ··· + ∫_{x₀+(n−1)h/n}^{x₀+h} f(x)dx
  ≈ (h/2n)[f(x₀) + f(x₀ + h/n)] + (h/2n)[f(x₀ + h/n) + f(x₀ + 2h/n)] + ··· + (h/2n)[f(x₀ + (n−1)h/n) + f(x₀ + h)].   (4.4.1)

Decomposing ∫_{x₀}^{x₀+h} f(x)dx into the sum ∫_{x₀}^{x₁} f(x)dx + ∫_{x₁}^{x₂} f(x)dx + ··· + ∫_{x_{n−1}}^{x_n} f(x)dx and summing approximations of these integrals is called composite integration.
As for using the trapezoidal rule to do the approximating, the error in a single application of the trapezoidal rule is O(h³f″(ξ_h)). The error in the above sum is, therefore, bounded by Σ_{i=1}^{n} M(h/n)³f″(μᵢ) = Mh(h/n)² · (1/n)Σ_{i=1}^{n} f″(μᵢ) for some μᵢ with x₀ + (i − 1)h/n < μᵢ < x₀ + ih/n. Assuming f″ is continuous on [x₀, x₀ + h], the intermediate value theorem allows us to replace (1/n)Σ_{i=1}^{n} f″(μᵢ) with f″(ξₙ) for some ξₙ ∈ (x₀, x₀ + h) because (1/n)Σ_{i=1}^{n} f″(μᵢ) is the average of the f″(μᵢ), which is no more than the maximum of the f″(μᵢ) and no less than the minimum of the f″(μᵢ). Making this replacement gives us the error bound Mh(h/n)²f″(ξₙ). In conclusion, the trapezoidal rule used multiple times when necessary to approximate ∫_{x₀}^{x₀+h} f(x)dx actually has error O((1/n)²f″(ξₙ)), where n is the number of subintervals used in the calculation and ξₙ depends on n. Now the nature of the error is clearer. It is measured by how many subintervals are used in the calculation. More subintervals (greater n) means less error (assuming the benefit of more subintervals is not counteracted by the f″ factor). Other composite integration formulas are similar. If a single-interval quadrature formula has error O(h^k f^(l)(ξ_h)), then the corresponding composite version has error O((1/n)^(k−1) f^(l)(ξₙ)). More intervals generally means smaller error.


Composite Trapezoidal Rule


Equation 4.4.1 encapsulates the composite trapezoidal rule but does not represent the most efficient way to use it.
Simplifying the expression will help. Notice that all of the function evaluations except f (x0 ) and f (x0 + h) occur
twice, so we can condense the formula to

∫_{x₀}^{x₀+h} f(x)dx ≈ (h/2n)[f(x₀) + f(x₀ + h)] + (h/n)[f(x₀ + h/n) + ··· + f(x₀ + (n−1)h/n)]
                    = (h/2n)[ f(x₀) + f(x₀ + h) + 2 Σ_{i=1}^{n−1} f(x₀ + i·h/n) ].

Table 4.6: Minimum number of subintervals needed to achieve certain accuracies using the composite trapezoidal rule to approximate ∫_0^3 e^(−x²) dx.
  accuracy       2.2(10)⁻²   5(10)⁻⁵   10⁻⁵   10⁻⁷   10⁻¹¹   10⁻¹⁵
  subintervals   2           3         8      75     7453    > 745300

This leads to the following pseudo-code where we make the substitutions a = x₀ and b = x₀ + h.

Assumptions: f has a continuous second derivative on [a, b].
Input: Function f; interval over which to integrate [a, b]; number of subintervals n.
Step 1: Set s = (b − a)/n; I = (f(a) + f(b))/2;
Step 2: For i = 1, 2, . . . , n − 1 do Step 3:
Step 3: Set I = I + f(a + i·s);
Step 4: Set I = s·I;
Output: Approximate value of ∫_a^b f(x)dx.

Other composite integration formulas should be simplified likewise to minimize the number of times f is evaluated.
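For readers who want to try this immediately, here is one possible Octave rendering of the pseudo-code above (a minimal sketch; the function name trapcomp and its argument order are our own choices, not taken from the text):

    % Composite trapezoidal rule following the pseudo-code above.
    % f is a function handle, [a,b] the interval, n the number of subintervals.
    function I = trapcomp(f, a, b, n)
      s = (b - a)/n;               % subinterval width
      I = (f(a) + f(b))/2;         % endpoints counted once each
      for i = 1:n-1
        I = I + f(a + i*s);        % interior nodes counted twice overall
      end
      I = s*I;
    end

For example, trapcomp(@(x) exp(-x^2), 0, 3, 75) should land within roughly 10⁻⁷ of ∫_0^3 e^(−x²) dx, in line with Table 4.6.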

Adaptive quadrature

∫_0^3 e^(−x²) dx ≈ 4.57837939409486,

and it is simple enough to approximate this value with the composite trapezoidal rule. Table 4.6 shows the minimum number of subintervals needed to achieve various accuracies, assuming the calculations are done with enough significant digits that floating point error does not overwhelm the calculation. It should be apparent that achieving high accuracy results using the trapezoidal rule is not practical. It requires too many computations. We will take up this deficiency in the next section.

Crumpet 28: error function

The error function is defined as

erf(x) = (2/√π) ∫_0^x e^(−t²) dt

and is critical in the study of statistics as it is used to calculate probabilities associated with the normal distribution. The factor 2/√π comes from the fact that ∫_{−∞}^{∞} e^(−t²) dt = √π, an interesting fact itself.
Computer algebra systems will have the error function built-in just as they do the sine or logarithm functions. Hence, the easiest way to evaluate ∫_0^3 e^(−x²) dx is to have a computer algebra system (or perhaps your calculator) compute (√π/2) erf(3).

For now, let’s analyze the usefulness of the error bound O((1/n)²f″(ξₙ)). Assuming f″(ξₙ) is roughly


constant, we should expect to improve our estimate from an accuracy of 2.2(10)⁻² to an accuracy of 5(10)⁻⁵, an increase in accuracy of 2.2(10)⁻²/5(10)⁻⁵ ≈ 440 times, by increasing the number of subintervals by a factor of about √440 ≈ 21. In other words, we should expect it to take approximately 42 subintervals to achieve 5(10)⁻⁵ accuracy based on accuracy of 2.2(10)⁻² with 2 intervals. Since it only takes 3, we conclude that the assumption that f″(ξ₂) ≈ f″(ξ₃) is bad! Luckily, the badness of this assumption actually works in our favor. It takes fewer, not more, than the expected number of intervals to achieve 5(10)⁻⁵ accuracy. On the other hand, increasing the accuracy from 5(10)⁻⁵ to 10⁻⁵, an increase by a factor of 5, we should expect to need about √5 ≈ 2.2 times as many subintervals. 3 × 2.2 = 6.6, so the 8 needed is just about what we would expect. Similarly, to increase the accuracy from 10⁻⁵ to 10⁻⁷, an increase in accuracy by a factor of 100, we should expect to need about 10 times as many subintervals. Indeed, 75 is about 10 times as many as 8. Likewise, to increase accuracy by a factor of 10,000 (as in going from 10⁻⁷ to 10⁻¹¹ or from 10⁻¹¹ to 10⁻¹⁵), we should expect to need to increase the number of subintervals by a factor of 100. Indeed, the table bears this estimate out as well.
Just remember, if f″ does not exist or is wildly discontinuous, or just wildly varying, the assumption that f″(ξₙ) is constant could be a bad one, no matter how many subintervals are used. The more common case is when f″ is continuous and reasonably tame, though. Even in this case, when the number of subintervals is small, the assumption is often not a good one, but when the number of subintervals is large, it is a pretty reliable assumption. The exact number of subintervals needed before this assumption is reasonable changes from one function to another, however.
Taking this lesson to heart, we approximate

∫_0^3 ( x − eˣ cos√(e^(2x) − x²) ) dx

using the trapezoidal rule with 50 subintervals and find that it is accurate to within about 10⁻¹ of the exact value. How many subintervals should we expect to need to achieve 10⁻³ accuracy? About 10 times as many, or about 500. With 500 subintervals, we actually attain accuracy of about .997(10)⁻³, spot on! The assumption that f″(ξₙ) is constant seems to be valid for this integral with n ≥ 50 (and maybe for some n < 50 too). Alas, this is the type of analysis that can not be done in practice. In practice, we calculate integrals numerically because we don’t know how to compute their values exactly! In “real life” situations, we have no way of knowing how accurate an integral estimate is with 3 or 50 or 500 or 3000 subintervals. We need the computer to estimate errors as it calculates, just as we had it do for root-finding algorithms.
Even though we know the assumption is not perfect, especially for small n, we assume f″(ξₙ) is constant, so the error of the trapezoidal rule becomes O((1/n)²). The f″ factor is subsumed by the implied constant of the big-oh notation. Accordingly, halving the number of intervals can be expected to increase the error by a factor of about 4. Introducing the notation T_k(a, b) for the composite trapezoidal rule approximation of ∫_a^b f(x)dx with k subintervals and e_k = ∫_a^b f(x)dx − T_k(a, b) for its error,

e_n ≈ M(1/n)²   and   e_{2n} ≈ M(1/(2n))²

so

e_n / e_{2n} ≈ M(1/n)² / M(1/(2n))² = 4,   which implies e_n ≈ 4e_{2n}.

Because ∫_a^b f(x)dx = T₂(a, b) + e₂ = T₁(a, b) + e₁,

T₂(a, b) − T₁(a, b) = e₁ − e₂ ≈ 4e₂ − e₂ = 3e₂

so e₂ ≈ (1/3)(T₂(a, b) − T₁(a, b)). Explicitly,

∫_a^b f(x)dx − T₂(a, b) ≈ (1/3)(T₂(a, b) − T₁(a, b)).

We now have a way of approximating the error numerically, a significant breakthrough! The error is approximately one third the difference between the trapezoidal rule approximations with one subinterval and with two.
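As a quick numerical illustration (ours, not the book's), we can compare this estimate with the true error for ∫_0^1 sin x dx, whose value 1 − cos 1 we happen to know:

    f = @(x) sin(x);  a = 0;  b = 1;
    T1 = (b - a)/2*(f(a) + f(b));                   % trapezoidal rule, 1 subinterval
    T2 = (b - a)/4*(f(a) + 2*f((a + b)/2) + f(b));  % trapezoidal rule, 2 subintervals
    (T2 - T1)/3                                     % estimated error of T2, about 0.0098
    (1 - cos(1)) - T2                               % actual error of T2, about 0.0096

Even with only two subintervals, the estimate tracks the actual error closely.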

To harness this knowledge, we need to incorporate this estimate into our calculation. Suppose we wish to estimate ∫_a^b f(x)dx to within an accuracy of tol. We begin by calculating T₂(a, b) and T₁(a, b). If (1/3)|T₂(a, b) − T₁(a, b)| < tol, we are done. T₂(a, b) is our approximation. In the more likely case that (1/3)|T₂(a, b) − T₁(a, b)| ≥ tol, we divide the interval [a, b] into two subintervals, [a, (a+b)/2] and [(a+b)/2, b], and compare our error estimates on these subintervals to tol/2. If (1/3)|T₂(a, (a+b)/2) − T₁(a, (a+b)/2)| < tol/2, we are done with the subinterval [a, (a+b)/2]. T₂(a, (a+b)/2) is a satisfactory approximation of ∫_a^{(a+b)/2} f(x)dx. If not, we bisect the interval again and compare error estimates to tol/4. On the other half of [a, b], if (1/3)|T₂((a+b)/2, b) − T₁((a+b)/2, b)| < tol/2, we are done with the subinterval [(a+b)/2, b]. T₂((a+b)/2, b) is a satisfactory approximation of ∫_{(a+b)/2}^b f(x)dx. If not, we bisect the interval again and compare error estimates to tol/4. Each time a subinterval fails to meet the error tolerance, we divide it in half and try again. The process will normally end successfully because, with each subinterval division, we will generally have the error decreasing by a factor of 4 while the error requirement is decreasing by a factor of only 2. In the end, the sum of the T₂ estimates where the error tolerance is met will be our approximation for ∫_a^b f(x)dx.
The simplest way to code this algorithm is to use a recursive function. It is possible to do without, but the record keeping is burdensome. Depending on the programming language you are using, the trade-off may be simplicity for speed. Some languages do not handle recursive functions quickly.

Assumptions: f has a continuous second derivative on [a, b].
Input: Function f; interval over which to integrate [a, b]; tolerance tol.
Step 1: Set m = (b + a)/2; I1 = T₁(a, b); I2 = T₂(a, b);
Step 2: If |I2 − I1| < 3·tol then return I2;
Step 3: Do Steps 1-5 with inputs f; [a, (a+b)/2]; and tol/2; and set A equal to the result;
Step 4: Do Steps 1-5 with inputs f; [(a+b)/2, b]; and tol/2; and set B equal to the result;
Step 5: Return A + B;
Output: Approximate value of ∫_a^b f(x)dx.
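One possible Octave realization of this recursion (a sketch under the same assumptions; the name adaptrap is ours) is:

    % Adaptive trapezoidal rule, implemented recursively as in the pseudo-code.
    function I = adaptrap(f, a, b, tol)
      m  = (a + b)/2;
      I1 = (b - a)/2*(f(a) + f(b));             % T1(a,b)
      I2 = (b - a)/4*(f(a) + 2*f(m) + f(b));    % T2(a,b)
      if abs(I2 - I1) < 3*tol                   % error estimate (1/3)|I2-I1| < tol
        I = I2;
      else
        A = adaptrap(f, a, m, tol/2);           % left half, half the tolerance
        B = adaptrap(f, m, b, tol/2);           % right half, half the tolerance
        I = A + B;
      end
    end

Calling adaptrap(@(x) log(3 + x), 0, 3, 0.006) should reproduce the calculation tabulated next.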
A tabulated example of such a computation might help clarify any confusion over how this algorithm works. The following table approximates the integral ∫_0^3 ln(3 + x)dx with a tolerance of .006.

  a      b      T₁(a,b)    T₂(a,b)    (1/3)|T₂(a,b) − T₁(a,b)|   tol
  0      3      4.33555    4.42389    .02944                     .00600
  0      1.5    1.95201    1.96732    .00510                     .00300
  0      0.75   0.90763    0.90997    .00077                     .00150
  0.75   1.5    1.05968    1.06124    .00051                     .00150
  1.5    3      2.47187    2.47961    .00257                     .00300

∫_0^3 ln(3 + x)dx ≈ 0.90997 + 1.06124 + 2.47961 = 4.45082
The calculation in the table requires 7 evaluations of f and underestimates the integral by about .00390. In order
of occurrence, the evaluations happen at x = 0, 3, 1.5, .75, .375, 1.125, 2.25. The composite trapezoidal rule with
7 evaluations (6 subintervals each of length .5) underestimates the integral by about .00346. The non-adaptive
composite trapezoidal rule gives a slightly better estimate with essentially the same amount of computation. But
remember, it is not necessarily efficiency we are after. It is automatic error estimates. The adaptive trapezoidal
rule does something the conventional composite trapezoidal rule does not. It monitors itself for accuracy, so when
the routine completes, you not only get an estimate, but you can have some confidence in its accuracy even when
you have no way to calculate the integral exactly for comparison.

Key Concepts
Composite numerical integration: Dividing the interval of integration into a number of subintervals, applying
a simple quadrature formula to each subinterval and summing the results.
Adaptive numerical integration: Leveraging the error term of a simple quadrature formula in order to obtain
automatic calculation of the number and nature of subintervals needed to obtain a definite integral with some
prescribed accuracy.

Exercises

1. Use the composite midpoint rule with 3 subintervals to approximate
   (a) ∫_1^3 ln(sin(x)) dx [S]
   (b) ∫_5^7 √x cos x dx
   (c) ∫_1^4 (eˣ ln(x))/x dx [A]
   (d) ∫_10^13 √(1 + cos²x) dx
   (e) ∫_{ln 3}^{ln 7} eˣ/(1 + x) dx [A]
   (f) ∫_0^1 (x² − 1)/(x² + 1) dx
2. Redo question 1 using the composite trapezoidal rule. [S][A]
3. Redo question 1 using the composite Simpson’s rule. [S][A]
4. Redo question 1 using the composite Simpson’s 3/8 rule. [S][A]
5. Redo question 1 using the composite version of the quadrature rule
   ∫_{x₀}^{x₀+3h} f(x)dx = (3h/2)[f(x₀ + h) + f(x₀ + 2h)]. [S][A]
6. Use a composite version of the quadrature rule
   ∫_{x₀}^{x₀+h} f(x) dx ≈ (h/2)[f(x₀ + h/3) + f(x₀ + 2h/3)]
   with three subintervals to approximate ∫_0^3 x³/(x³ + 1) dx.
7. Use the (simple) trapezoidal rule on ∫_0^π sin⁴x dx to help estimate the number of intervals [0, π] must be divided into in order to approximate ∫_0^π sin⁴x dx to within 10⁻⁴ using the composite trapezoidal rule. NOTE: ∫_0^π sin⁴x dx = (3/8)π. [S]
8. Repeat question 7 using the midpoint rule. [A]
9. Repeat question 7 using Simpson’s rule.
10. Suppose composite Simpson’s rule with 100 subintervals was used to estimate ∫_5^12 f(x) dx, and the absolute error turned out to be less than 10⁻⁵. What function might f(x) have been?
11. Derive a summation formula for the composite version of
    (a) the midpoint rule.
    (b) Simpson’s rule.
    (c) Simpson’s 3/8 rule.
    (d) the quadrature formula ∫_{x₀}^{x₀+h} f(x) dx ≈ (h/2)[f(x₀ + h/3) + f(x₀ + 2h/3)].
12. Based on our discussion of composite integration, the error term for composite Simpson’s rule applied to ∫_a^b f(x) dx with n subintervals is O((1/n)⁴ f⁽⁴⁾(ξₙ)). With a bit more work, it can be shown that the error term is actually −((b − a)/90)h⁴f⁽⁴⁾(ξₙ) where h = (b − a)/n. No big-oh needed. This error is exact for some ξₙ ∈ [a, b]. Use this error term to find a theoretical bound on the error in estimating ∫_2^4 1/(1 − x) dx using (composite) Simpson’s rule with h = 0.1. [A]
13. Why does the composite trapezoidal rule ALWAYS (for any h) give an underestimate of ∫_0^π sin x dx?
14. Demonstrate geometrically and with some words the approximation of ∫_7^8 x sin(x/8) dx using the composite trapezoidal rule with 4 trapezoids (that is, 4 subintervals).
15. Approximate ∫_1^3 ln(sin(x))dx using adaptive Simpson’s method with tolerance 0.002. [S]
16. Use adaptive Simpson’s method to approximate ∫_0^1 ln(x + 1)dx accurate to within 10⁻⁴. [A]
17. Derive a quadrature formula for ∫_a^b f(x) dx using unspecified nodes a ≤ x₀ < x₁ ≤ b. In other words, derive a “general trapezoidal rule” where x₀ and x₁ are allowed to be any two distinct values in [a, b].
18. In your formula from question 17, make the substitutions x₀ = a, x₁ = b, and x₁ − x₀ = h, and show that it thus reduces to the trapezoidal rule.
19. Let I = ∫_0^2 x² ln(x² + 1) dx. [A]
    (a) Approximate I using the Midpoint rule.
    (b) Use your answer to (a) to estimate the number of subintervals needed to approximate I to within 10⁻⁴. NOTE: I = (24 ln(5) − 6 tan⁻¹(2) − 4)/9.
20. Let I = ∫_0^2 x² ln(x² + 1) dx.
    (a) Approximate I using Simpson’s rule.
    (b) Use your answer to (a) to estimate the number of subintervals needed to approximate I to within 10⁻⁴. NOTE: I = (24 ln(5) − 6 tan⁻¹(2) − 4)/9.
21. Use Octave to calculate the estimate suggested in question 19b. Is the absolute error less than 10⁻⁴? [A]
22. Use Octave to calculate the estimate suggested in question 20b. Is the absolute error less than 10⁻⁴? [A]

23. Use the composite trapezoidal rule to estimate ∫_0^1 ln(x + 1)dx accurate to within 10⁻⁶. How many subintervals are needed? [S]
24. Repeat question 23 using the composite midpoint rule.
25. Use composite Simpson’s rule to estimate ∫_0^1 ln(x + 1)dx accurate to within 10⁻⁶. How many subintervals are needed?
26. Repeat question 25 using composite Simpson’s 3/8 rule. [A]
27. Write an Octave function that implements adaptive Simpson’s rule as a recursive function. Some notes about the structure:
    (a) The inputs to the function should be f(x), a, b, and a maximum overall error, tol.
    (b) The output of the function should be the estimate and, if you are feeling particularly stirred, the number of function evaluations.
28. Use your code from question 27 to approximate ∫_1^3 ln(sin(x))dx with tolerance 0.002. [A]
29. Use your code from question 27 to approximate ∫_0^1 ln(x + 1)dx accurate to within 10⁻⁴.
30. (i) Use your code from question 27 to approximate the integral using tol = 10⁻⁵. (ii) Calculate the actual error of the approximation. (iii) Is the approximation accurate to within 10⁻⁵ as requested?
    (a) ∫_0^{2π} x sin(x²)dx [A]
    (b) ∫_{0.1}^2 (1/x) dx
    (c) ∫_0^2 x² ln(x² + 1) dx [A]
    NOTE: ∫_0^2 x² ln(x² + 1) dx = (24 ln(5) − 6 tan⁻¹(2) − 4)/9.
31. Write an Octave function that implements the general trapezoidal rule of question 17 in such a way that x₀ and x₁ are chosen at random.
32. Write an Octave function that implements a composite version of the quadrature method in question 31.
33. Do some numerical experiments to compare the (standard) composite trapezoidal rule to the (random) composite trapezoidal rule of question 32. What do you find?

4.5 Extrapolation
In calculus, you undoubtedly encountered Euler’s number, e, which you were probably told is approximately 2.718,
or maybe just 2.7. And unless you were involved in a digits-of-e memorization contest, you probably never saw
more digits of e than your calculator could show. We’re about to change that. The first 50 digits of e are

2.7182818284590452353602874713526624977572470936999.

How many of them do you remember? Not to worry if it is not very many. No quiz on the digits of e is imminent.

Crumpet 29: Digits of e

The first 1000 digits of e, 50 per line, are

2.7182818284590452353602874713526624977572470936999
59574966967627724076630353547594571382178525166427
42746639193200305992181741359662904357290033429526
05956307381323286279434907632338298807531952510190
11573834187930702154089149934884167509244761460668
08226480016847741185374234544243710753907774499206
95517027618386062613313845830007520449338265602976
06737113200709328709127443747047230696977209310141
69283681902551510865746377211125238978442505695369
67707854499699679468644549059879316368892300987931
27736178215424999229576351482208269895193668033182
52886939849646510582093923982948879332036250944311
73012381970684161403970198376793206832823764648042
95311802328782509819455815301756717361332069811250
99618188159304169035159888851934580727386673858942
28792284998920868058257492796104841984443634632449
68487560233624827041978623209002160990235304369941
84914631409343173814364054625315209618369088870701
67683964243781405927145635490613031072085103837505
10115747704171898610687396965521267154688957035035

However, do you recall from calculus that

lim_{h→0} (1 + h)^(1/h) = e?

Can you prove it? Proof on page 174. Based on this fact, we might use

ẽ(h) = (1 + h)^(1/h)

to approximate e. No time like the present!

ẽ(0.01) ≈ 2.704813829421529
ẽ(0.005) ≈ 2.711517122929293
ẽ(0.0025) ≈ 2.714891744381238
ẽ(0.00125) ≈ 2.716584846682473
ẽ(0.000625) ≈ 2.717432851769196.

Sadly, this sequence of approximations is not converging very quickly. We have two digits of accuracy in the first
approximation and still only three digits of accuracy in the fifth. We could, of course, continue to make h smaller to
get more accurate approximations, but based on the slow improvement observed so far, this does not seem like a very
promising route. Instead, we can combine the estimates we already have to get an improved approximation. This
idea should remind you, at least on the surface, of Aitken’s delta-squared method. In that method, we combined
three consecutive approximations to form another that was generally a better approximation than any of the original
three. We will do something similar here, combining inadequate approximations to find better ones. We will name
the various new approximations for continued reuse.

2ẽ(0.005) − ẽ(0.01) ≡ ẽ1 (0.01) = 2.718220416437056


2ẽ(0.0025) − ẽ(0.005) ≡ ẽ1 (0.005) = 2.718266365833184
2ẽ(0.00125) − ẽ(0.0025) ≡ ẽ1 (0.0025) = 2.718277948983707
2ẽ(0.000625) − ẽ(0.00125) ≡ ẽ1 (0.00125) = 2.718280856855920. (4.5.1)

Each of these new approximations is accurate to 5 or 6 significant digits! Already a significant improvement. We
can combine them further to find yet better approximations:

[4ẽ₁(0.005) − ẽ₁(0.01)] / 3   ≡ ẽ₂(0.01)   = 2.718281682298560
[4ẽ₁(0.0025) − ẽ₁(0.005)] / 3 ≡ ẽ₂(0.005)  = 2.718281810033881
[4ẽ₁(0.00125) − ẽ₁(0.0025)] / 3 ≡ ẽ₂(0.0025) = 2.718281826146657.   (4.5.2)

The first of these approximations is accurate to seven significant digits, the second to eight, and the third to nine! And we can combine them further:

[8ẽ₂(0.005) − ẽ₂(0.01)] / 7   ≡ ẽ₃(0.01)  = 2.718281828281785
[8ẽ₂(0.0025) − ẽ₂(0.005)] / 7 ≡ ẽ₃(0.005) = 2.718281828448482.   (4.5.3)
Now we have approximations accurate to ten and eleven significant digits! Looking back, we took five approximations
that had no better than 3 significant digits of accuracy and combined them to get two approximations that were
accurate to at least 10 significant digits each. Magic! Okay, not magic, mathemagic! Here is how it works.
Suppose we are approximating p using the formula p̃(h), and we know that

p̃(h) = p + c₁·h^(m₁) + c₂·h^(m₂) + c₃·h^(m₃) + ···.

Then

p̃(αh) = p + c₁·(αh)^(m₁) + c₂·(αh)^(m₂) + c₃·(αh)^(m₃) + ···.

Now, if we multiply the second equation by α^(−m₁) and subtract the first from it, the h^(m₁) terms vanish, and we get an approximation with error term beginning with c₂·h^(m₂):

  α^(−m₁) p̃(αh) = α^(−m₁) p + c₁·h^(m₁) + c₂α^(m₂−m₁)·h^(m₂) + c₃α^(m₃−m₁)·h^(m₃) + ···
− [ p̃(h)        = p + c₁·h^(m₁) + c₂·h^(m₂) + c₃·h^(m₃) + ··· ]
  α^(−m₁) p̃(αh) − p̃(h) = (α^(−m₁) − 1)p + c₂(α^(m₂−m₁) − 1)·h^(m₂) + c₃(α^(m₃−m₁) − 1)·h^(m₃) + ···

With a little rearranging,

[α^(−m₁) p̃(αh) − p̃(h)] / (α^(−m₁) − 1) = p + d₂·h^(m₂) + d₃·h^(m₃) + ···   (4.5.4)

for some constants d₂, d₃, . . .. If m₂ > m₁, then this method will tend to improve on the two approximations p̃(h) and p̃(αh) by combining them into a single approximation with error commensurate with some constant multiple of h^(m₂). This calculation is the basis for Richardson’s extrapolation.
It just so happens ẽ(h) has exactly the form needed.

ẽ(h) = e + c₁h + c₂h² + c₃h³ + c₄h⁴ + O(h⁵)   (4.5.5)


for some constants c₁, c₂, c₃, c₄. The actual values of the constants are not relevant for this computation. To understand the computation of ẽ₁, we use equation 4.5.4 with α = 1/2 and m₁ = 1 to get

ẽ₁(h) = [2ẽ(h/2) − ẽ(h)] / (2 − 1)
      = 2e + c₁h + (1/2)c₂h² + (1/4)c₃h³ + (1/8)c₄h⁴ + O(h⁵) − [e + c₁h + c₂h² + c₃h³ + c₄h⁴ + O(h⁵)]
      = e + d₂h² + d₃h³ + d₄h⁴ + O(h⁵)

for some constants d₂, d₃, d₄. ẽ₁(h) is the formula that gave us the round of approximations accurate to 5 or 6 significant digits. It is not hard to find the constants dᵢ in terms of the constants cᵢ, but, again, the values of the constants are immaterial and can only serve to complicate further refinements. What is important is the form of the error. Now that we know ẽ₁(h) = e + d₂h² + d₃h³ + d₄h⁴ + O(h⁵), we find ẽ₂(h) using formula 4.5.4 with α = 1/2 and m₁ = 2:

ẽ₂(h) = [4ẽ₁(h/2) − ẽ₁(h)] / 3
      = e + k₃h³ + k₄h⁴ + O(h⁵)

for some constants k₃ and k₄. ẽ₂(h) is the formula that gave us the round of approximations accurate to 7 to 9 significant digits. We can again use formula 4.5.4, this time with α = 1/2 and m₁ = 3:

ẽ₃(h) = [8ẽ₂(h/2) − ẽ₂(h)] / 7
      = e + l₄h⁴ + O(h⁵)

for some constant l₄. ẽ₃(h) is the formula that gave us the approximations accurate to 10 and 11 significant digits. Now is a good time to see if you can use the expression for ẽ₃(h) and formula 4.5.4 to derive an O(h⁵) formula for ẽ₄(h). Then use your formula to compute ẽ₄(0.01) using the previously given values of ẽ₃(0.01) and ẽ₃(0.005). How accurate is ẽ₄(0.01)? Answers on page 174.
As a special case, Richardson’s extrapolation with α = 1/2 applied to any approximation of the form

p̃₀(h) = p + c₁h + c₂h² + c₃h³ + ···

gives the recursively defined refinements

p̃ₖ(h) = [2^k p̃ₖ₋₁(h/2) − p̃ₖ₋₁(h)] / (2^k − 1),   k = 1, 2, 3, . . .

which are expected to increase in accuracy as k increases. For other α or other forms of error, the formula for p̃ₖ(h) changes according to 4.5.4.
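The refinements above are easy to reproduce in Octave. The following short script (our own illustration, not from the text) builds ẽ₁, ẽ₂, and ẽ₃ from ẽ(h) = (1 + h)^(1/h) using exactly this recursion:

    format long
    etilde = @(h) (1 + h).^(1./h);       % e~(h) = (1+h)^(1/h)
    E = etilde(0.01 ./ 2.^(0:4));        % e~ at h = 0.01, 0.005, ..., 0.000625
    for k = 1:3
      % p~_k(h) = (2^k p~_{k-1}(h/2) - p~_{k-1}(h)) / (2^k - 1)
      E = (2^k*E(2:end) - E(1:end-1)) / (2^k - 1);
      disp(E)                            % k-th round of refinements
    end

The three lines printed should match 4.5.1, 4.5.2, and 4.5.3.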

Crumpet 30: A Taylor polynomial for ẽ(h)

ẽ is undefined at 0, so its derivatives at 0 are as well. However, if we extend the definition of ẽ to

ẽ(h) = (1 + h)^(1/h) if h ≠ 0,   and   ẽ(0) = e,

thus defining ẽ at 0, then ẽ(h) becomes infinitely differentiable at 0, and its fifth Taylor polynomial, for example, is

ẽ(h) = e − (e/2)·h + (11e/24)·h² − (7e/16)·h³ + (2447e/5760)·h⁴ + (f⁽⁵⁾(ξ)/120)·h⁵

for some ξ ∈ (0, h).

Differentiation

Using extrapolation, high order differentiation approximation formulas can be derived from low order formulas. We begin with the lowest order approximation, f′(x₀) = [−f(x₀) + f(x₀ + h)]/h − (h/2)f″(ξ_h). The standard error term, −(h/2)f″(ξ_h), does not give the error in the form c·h^(m₁) + O(h^(m₂)) as required by Richardson’s extrapolation, so we return to Taylor series to determine the O(h^(m₂)) term:

f(x₀ + h) = f(x₀) + hf′(x₀) + (1/2)h²f″(x₀) + (1/6)h³f‴(x₀) + ···

so

[−f(x₀) + f(x₀ + h)]/h = f′(x₀) + (1/2)hf″(x₀) + (1/6)h²f‴(x₀) + ···.

Hence,

f′(x₀) − [−f(x₀) + f(x₀ + h)]/h = −(1/2)hf″(x₀) − (1/6)h²f‴(x₀) − ··· = c₁h + O(h²)

and extrapolation will yield an O(h²) formula. Letting p̃(h) = [−f(x₀) + f(x₀ + h)]/h, α = 2, and m₁ = 1, formula 4.5.4 tells us the approximation

[(1/2)p̃(2h) − p̃(h)] / (1/2 − 1)

will be an O(h²) formula for f′(x₀). Simplifying,

[(1/2)p̃(2h) − p̃(h)] / (1/2 − 1)
  = [ (1/2)·(−f(x₀) + f(x₀ + 2h))/(2h) − (−f(x₀) + f(x₀ + h))/h ] / (−1/2)
  = [ (−f(x₀) + f(x₀ + 2h))/(4h) − (−4f(x₀) + 4f(x₀ + h))/(4h) ] / (−1/2)
  = [ (3f(x₀) − 4f(x₀ + h) + f(x₀ + 2h))/(4h) ] / (−1/2)
  = [−3f(x₀) + 4f(x₀ + h) − f(x₀ + 2h)] / (2h).

Hence, we have f′(x₀) = [−3f(x₀) + 4f(x₀ + h) − f(x₀ + 2h)]/(2h) + O(h²), but this is not news. This is the first 3-point formula in table 4.2! Other high order derivative formulas can be derived by extrapolation too, but, generally, nothing new is learned from the result. We simply have a new way of deriving high order differentiation formulas.
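A short Octave experiment (our own illustration) confirms the improvement; with f(x) = sin x and x₀ = 1, the extrapolated value is noticeably closer to cos 1 than the forward difference it was built from:

    f = @(x) sin(x);  x0 = 1;  h = 0.1;
    fd = @(h) (f(x0 + h) - f(x0))/h;        % forward difference, O(h)
    ex = (0.5*fd(2*h) - fd(h))/(0.5 - 1);   % Richardson with alpha = 2, m1 = 1
    abs(cos(x0) - fd(h))                    % error roughly 4.3e-2
    abs(cos(x0) - ex)                       % error roughly 1.6e-3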

Integration

Applying extrapolation to definite integrals is more rewarding. We begin with any composite integration formula and apply Richardson’s extrapolation. We now consider the composite trapezoidal rule and use the notation T_k(a, b) to represent the approximation of ∫_a^b f(x)dx using the trapezoidal rule with k subintervals.
Before continuing we need to have a good idea what it means for the composite trapezoidal rule to have error term O((1/n)²). In essence, it means we should expect the error to decrease by a factor of about 4 when the number of intervals is doubled. We should expect the error to decrease by a factor of about 9 when the number of intervals is tripled. And generally we should expect the error to decrease by a factor of about β² when the number of intervals is multiplied by β. To see this effect in action, consider the definite integral

∫_0^1 sin x dx

whose exact value is 1 − cos(1) ≈ .4596976941318602. The absolute errors of T₅(0, 1), T₁₀(0, 1), and T₁₅(0, 1) are

|∫_0^1 sin x dx − T₅(0, 1)|  ≈ 1.533(10)⁻³
|∫_0^1 sin x dx − T₁₀(0, 1)| ≈ 3.831(10)⁻⁴
|∫_0^1 sin x dx − T₁₅(0, 1)| ≈ 1.702(10)⁻⁴

We should expect the error |∫_0^1 sin x dx − T₅(0, 1)| to be about four times the error |∫_0^1 sin x dx − T₁₀(0, 1)| and nine times the error |∫_0^1 sin x dx − T₁₅(0, 1)|. To check, we compute the ratios:

|∫_0^1 sin x dx − T₅(0, 1)| / |∫_0^1 sin x dx − T₁₀(0, 1)| = 1.533(10)⁻³ / 3.831(10)⁻⁴ ≈ 4.001
|∫_0^1 sin x dx − T₅(0, 1)| / |∫_0^1 sin x dx − T₁₅(0, 1)| = 1.533(10)⁻³ / 1.702(10)⁻⁴ ≈ 9.007.

What should you expect the ratio |∫_0^1 sin x dx − T₁₀(0, 1)| / |∫_0^1 sin x dx − T₁₅(0, 1)| to be, approximately? Answer on page 174.
Finally, we apply Richardson’s extrapolation with α = 1/2 and m₁ = 2 to produce the higher order estimate,

T_{k,1}(a, b) ≡ [4T_{2k}(a, b) − T_k(a, b)] / 3.

We defer to numerics to get a handle on the error term of the refinement T_{k,1}. We begin by collecting some data. Continuing with the analysis of ∫_0^1 sin x dx, note that

T₅(0, 1)  ≈ .4581643459604436
T₁₀(0, 1) ≈ .4593145488579763
T₂₀(0, 1) ≈ .4596019197882473
T₄₀(0, 1) ≈ .4596737512942187.

Hence,

T₅,₁(0, 1)  = [4T₁₀(0, 1) − T₅(0, 1)] / 3  ≈ .4596979498238206
T₁₀,₁(0, 1) = [4T₂₀(0, 1) − T₁₀(0, 1)] / 3 ≈ .4596977100983375
T₂₀,₁(0, 1) = [4T₄₀(0, 1) − T₂₀(0, 1)] / 3 ≈ .4596976951295424

and

|∫_0^1 sin x dx − T₅,₁(0, 1)|  / |∫_0^1 sin x dx − T₁₀,₁(0, 1)| ≈ 16.01
|∫_0^1 sin x dx − T₁₀,₁(0, 1)| / |∫_0^1 sin x dx − T₂₀,₁(0, 1)| ≈ 16.00.

When we double the number of subintervals, the error is decreased by a factor of 16. That’s 2⁴, not 2³ as we might have expected! The first refinement takes us from an O((1/n)²) approximation to an O((1/n)⁴) approximation. In other words, the error of T_{n,1} is O((1/n)⁴).

Table 4.7: Romberg’s method

  T₁   T₁,₁   T₁,₂   T₁,₃   ···
  T₂   T₂,₁   T₂,₂   ⋱
  T₄   T₄,₁   ⋱
  T₈   ⋱
  ⋮

Now that we know the error of T_{n,1} is O((1/n)⁴) we can extrapolate again. Applying Richardson’s extrapolation with α = 1/2 and m₁ = 4, we have

T₅,₂(0, 1)  = [16T₁₀,₁(0, 1) − T₅,₁(0, 1)] / 15  ≈ .4596976941166387
T₁₀,₂(0, 1) = [16T₂₀,₁(0, 1) − T₁₀,₁(0, 1)] / 15 ≈ .4596976941316228.

We now have approximations T₅,₂ and T₁₀,₂ whose errors are only about 1.522(10)⁻¹¹ and 2.374(10)⁻¹³, respectively. Use this information to calculate T₅,₃ and its absolute error. Answers on page 174.
The method of combining Richardson’s extrapolation with the trapezoidal rule is known as Romberg’s method or Romberg integration. The calculation is often tabulated for organizational purposes as in Table 4.7. Rows are added until the differences |T_{k,n} − T_{k,n+1}| and |T_{2k,n} − T_{k,n+1}| are both less than some tolerance.
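For concreteness, here is a minimal Octave sketch of Romberg’s method (our own arrangement, not the book’s code: each refinement is stored in the row of the finer grid rather than in the row of T_k as in Table 4.7, and the stopping test simply compares successive diagonal entries; it assumes f accepts vector arguments):

    % Romberg integration sketch for the integral of f over [a,b].
    function R = romberg_sketch(f, a, b, tol)
      maxrows = 20;
      R = zeros(maxrows);
      n = 1;
      R(1,1) = (b - a)/2*(f(a) + f(b));                         % T_1
      for i = 2:maxrows
        n = 2*n;  h = (b - a)/n;
        R(i,1) = h*((f(a) + f(b))/2 + sum(f(a + (1:n-1)*h)));   % T_n
        for j = 2:i                                             % Richardson refinements
          R(i,j) = (4^(j-1)*R(i,j-1) - R(i-1,j-1)) / (4^(j-1) - 1);
        end
        if abs(R(i,i) - R(i-1,i-1)) < tol
          R = R(1:i,1:i);  return
        end
      end
    end

The entries produced are the same T_{k,j} values as in Table 4.7, just positioned one row lower.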
Though Richardson’s extrapolation may be applied to any composite integration formula, the computations of the error terms above help explain why the trapezoidal rule is the right one to use. We might infer from our calculations (and it can be proven true) that the error term of the composite trapezoidal rule contains only even powers of 1/n. To be explicit, we have

∫_a^b f(x)dx = T_n(a, b) + c₂(1/n)² + c₄(1/n)⁴ + c₆(1/n)⁶ + ···

so each refinement increases the least degree in the error term by 2, not 1. Skipping the odd degrees makes this particular choice very efficient. But this method comes with a price. Hidden within c₂ is the assumption that f has a continuous second derivative. Hidden within c₄ is the assumption that f has a continuous fourth derivative. And so on. The accuracy of each refinement depends on f having two more continuous derivatives. The more refinements we do, the smoother f must be for this method to work. For this reason, it is advisable to use Romberg’s method only when the integrand is known to have sufficient derivatives.

Key Concepts
Richardson’s extrapolation: If approximation p̃ is known to have the form

p̃(h) = p + c₁h^(m₁) + O(h^(m₂)),

then the approximation

[α^(−m₁) p̃(αh) − p̃(h)] / (α^(−m₁) − 1)

will have error O(h^(m₂)).
Romberg integration: The application of Richardson’s extrapolation to the trapezoidal method.

Exercises

1. One can use Taylor Polynomials to show that
   π = (1/h) sin(hπ) + K₂h² + K₄h⁴ + K₆h⁶ + ···.
   Therefore, N(h) = (1/h) sin(hπ) is an O(h²) approximation of π. Use Richardson’s extrapolation to derive an O(h⁴) approximation of π. [A]
2. It is interesting to note that we can reverse engineer Richardson refinements in order to approximate the cᵢ of equation 4.5.5 on page 168. For example, ẽ(h) = e + c₁h + O(h²), and we assume the O(h²) term is relatively small, so we can rearrange this equation to find
   (ẽ(h) − e)/h ≈ c₁.
   To take a specific example, (ẽ(.005) − e)/.005 = (2.711517122929293 − e)/.005 ≈ −1.35 so c₁ ≈ −1.35. If we pay careful attention to how the constants are affected as we refine our initial approximations, we can find c₂, c₃, and c₄ as well.
   ẽ₁(h) = 2ẽ(h/2) − ẽ(h)
         = 2e + c₁h + (c₂/2)h² + (c₃/4)h³ + (c₄/8)h⁴ + O(h⁵) − (e + c₁h + c₂h² + c₃h³ + c₄h⁴ + O(h⁵))
         = e − (c₂/2)h² − (3c₃/4)h³ − (7c₄/8)h⁴ + O(h⁵).
   Therefore, ẽ₁(h) − e ≈ −(c₂/2)h², from which we conclude
   −2(ẽ₁(h) − e)/h² ≈ c₂.
   (a) Use this formula and the values in 4.5.1 to verify that c₂ ≈ 1.24.
   (b) Approximate c₃ using values in 4.5.2.
   (c) Approximate c₄ using values in 4.5.3.
   (d) Compare these approximations of c₁, c₂, c₃, c₄ to the exact values in crumpet 30.
3. Suppose N approximates M according to N(h) = M + K₁h³ + K₂h⁵ + K₃h⁷ + ···. Of what order will N₃(h) (the third generation Richardson’s extrapolation) be? [A]
4. Suppose N approximates M according to N(h) = M + K₁h² + K₂h⁴ + K₃h⁶ + ···. What would you expect the value of
   |M − N(h/3)| / |M − N(h/4)|
   to be for small h, approximately? [A]
5. N(h) = (1 − cos h)/h² can be used to approximate lim_{h→0} (1 − cos h)/h². [A]
   (a) Compute N(1.0) and N(0.5).
   (b) Compute N₁(1.0), the first Richardson’s extrapolation, assuming
       i. N(h) has an error of the form K₁h + K₂h² + K₃h³ + ···
       ii. N(h) has an error of the form K₂h² + K₄h⁴ + K₆h⁶ + ···
   (c) Which of the assumptions in part 5b do you think gives the correct error and why?
6. The backward difference formula can be expressed as
   f′(x₀) = (1/h)[f(x₀) − f(x₀ − h)] + (h/2)f″(x₀) − (h²/6)f‴(x₀) + O(h³).
   (a) Use Richardson’s extrapolation to derive an O(h²) formula for f′(x₀).
   (b) The formula you derived should look familiar. What formula does it look like? Is it exactly the same? Why or why not?
7. Derive an O(h³) formula for approximating M that uses N(h), N(h/2), and N(h/3), and is based on the assumption that
   M = N(h) + K₁h + K₂h² + K₃h³ + ···. [S]
8. The following data give estimates of the integral M = ∫_0^{3π/2} cos x dx.
   N(h) = 2.356194      N(h/2) = −0.4879837
   N(h/4) = −0.8815732  N(h/8) = −0.9709157
   Assuming M − N(h) = K₁h² + K₂h⁴ + K₃h⁶ + ···, find a third Richardson’s extrapolation for M. [S]
9. Suppose that N(h) is an approximation of M for every h > 0 and that
   M − N(h) = K₁h + K₂h² + K₃h³ + ···
   for some constants K₁, K₂, K₃, . . .. Use the values N(h), N(h/3), and N(h/9) to produce an O(h³) approximation of M. [A]
10. Use Romberg integration to compute the integral with tolerance 10⁻⁴.
    (a) ∫_1^3 ln(sin(x))dx [S]
    (b) ∫_5^7 √x cos x dx
    (c) ∫_1^4 (eˣ ln(x))/x dx
    (d) ∫_10^13 √(1 + cos²x) dx
    (e) ∫_{ln 3}^{ln 7} eˣ/(1 + x) dx [A]
    (f) ∫_0^1 (x² − 1)/(x² + 1) dx
    (g) ∫_0^2 x² ln(x² + 1)dx [A]
11. Write a Romberg integration Octave function. [A]
12. (i) Use your code from question 11 to approximate the integral using tol = 10⁻⁵. (ii) Calculate the actual error of the approximation. (iii) Is the approximation accurate to within 10⁻⁵ as requested?
    (a) ∫_0^{2π} x sin(x²)dx [A]
    (b) ∫_{0.1}^2 (1/x) dx
    (c) ∫_0^2 x² ln(x² + 1) dx
    NOTE: ∫_0^2 x² ln(x² + 1) dx = (24 ln(5) − 6 tan⁻¹(2) − 4)/9.
13. Compare the results of question 12 with those of question 30 on page 166.

Answers

lim_{h→0} (1 + h)^(1/h) = e: Begin by noting ln[(1 + h)^(1/h)] = ln(1 + h)/h. Set

L = lim_{h→0} ln(1 + h)/h
  = lim_{h→0} [d/dh (ln(1 + h))] / [d/dh (h)]
  = lim_{h→0} 1/(1 + h)
  = 1.

Thus L = 1, and due to continuity of the exponential function, eˣ,

e = e^L = e^(lim_{h→0} ln(1+h)/h) = lim_{h→0} e^(ln(1+h)/h) = lim_{h→0} e^(ln[(1+h)^(1/h)]) = lim_{h→0} (1 + h)^(1/h).

ẽ₄(h): We use formula 4.5.4 with α = 1/2, m₁ = 4, and m₂ = 5 to find

ẽ₄(h) = [16ẽ₃(h/2) − ẽ₃(h)] / 15 = e + O(h⁵).

Applying this formula to ẽ₃(0.01) and ẽ₃(0.005) we get

ẽ₄(0.01) = [16(2.718281828448482) − 2.718281828281785] / 15 = 2.718281828459595,

a value that is accurate to 13 significant digits!

error ratio: We should expect |∫_0^1 sin x dx − T₁₀| / |∫_0^1 sin x dx − T₁₅| to be about 1.5² = 2.25 because 15 (the number of intervals used in the approximation of the denominator) is 1.5 times 10 (the number of intervals used in the approximation of the numerator).

T₅,₃ and its error: |∫_0^1 sin x dx − T₅,₂| / |∫_0^1 sin x dx − T₁₀,₂| ≈ 1.522(10)⁻¹¹ / 2.374(10)⁻¹³ ≈ 64 so

T₅,₃ = [64T₁₀,₂ − T₅,₂] / 63 ≈ .4596976941318606
|∫_0^1 sin x dx − T₅,₃| ≈ 4(10)⁻¹⁶

Chapter 5
More Interpolation

5.1 Osculating Polynomials


The Taylor polynomials of Section 1.2 and interpolating polynomials of Chapter 3 represent opposite extremes in
the spectrum of osculating polynomials. Taylor polynomials require the value of the polynomial at a single point
while interpolating polynomials require the value of the polynomial at, generally anyway, multiple points. Taylor
polynomials require the values of, generally anyway, multiple derivatives while interpolating polynomials do not
allow derivative specification.
The set of osculating polynomials contains Taylor polynomials, interpolating polynomials, and hybrids. Any
polynomial required to pass through any set of points with any number of derivatives specified at those points is
called an osculating polynomial. Thus a Taylor polynomial is the special case of an osculating polynomial specified
by one point and any number of derivatives at that point. An interpolating polynomial is the special case of
an osculating polynomial specified by any number of points and no derivatives at any point. To be precise, an
osculating polynomial is one that is required to pass through a set of points

(t0 , y0 ), (t1 , y1 ), . . . , (tn , yn )

with the first mi derivatives specified at (ti , yi ), i = 0, 1, . . . , n. As before, the t0 , t1 , . . . , tn are called nodes.
One useful type of osculating polynomial is the Hermite polynomial in which the value of the polynomial and
its first derivative are both given at each node. Even more specifically, third degree, or cubic, Hermite polynomials
play an important role in approximation theory. Since a third degree polynomial has four parameters, data—the
ordinate and first derivative—at two nodes is sufficient to specify such a polynomial. So suppose we wish to find a
polynomial p of degree at most three that passes through (t0 , y0 ) and (t1 , y1 ) with derivative ẏ0 at t0 and ẏ1 at t1 .
Remembering the lessons of our study of interpolating polynomials, we might begin with the Lagrange form of the interpolating polynomial passing through (t₀, y₀) and (t₁, y₁) and worry about the derivatives later. That gives us f(t) = [(t − t₁)/(t₀ − t₁)]y₀ + [(t − t₀)/(t₁ − t₀)]y₁ to begin. Of course f passes through the required points, but it is not even potentially cubic, and its derivative is f′(t) = y₀/(t₀ − t₁) + y₁/(t₁ − t₀), a constant. It would be nice if we could add to it a third degree polynomial that has zeroes at t₀ and t₁ and whose derivatives we can control. Well, g(t) = (t − t₀)(t − t₁)², for example, is cubic, has zeroes at t₀ and t₁, and has derivative (t − t₁)² + 2(t − t₀)(t − t₁), so we have at least some control over its derivative. Great, now let us look at it a little more closely:

g′(t) = (t − t₁)² + 2(t − t₀)(t − t₁) = (t − t₁)[(t − t₁) + 2(t − t₀)].

So g′(t₁) = 0 and g′(t₀) = (t₀ − t₁)² is nonzero. That should remind you of how we developed the Lagrange interpolating polynomial. Only, there, the value of the polynomial was either 0 or 1 at each node before we added an unknown coefficient. Of course, ĝ(t) = g(t)/(t₀ − t₁)² has derivative 1 at t₀ and 0 at t₁. Putting it all together, ĝₐ(t) = a(t − t₀)(t − t₁)²/(t₀ − t₁)² has everything we need to control the derivative at t₀. Similarly, ĥ_b(t) = b(t − t₀)²(t − t₁)/(t₁ − t₀)² has everything we need to control the derivative at t₁. The sum of ĝₐ and ĥ_b is a degree at most three polynomial with


zeroes at t₀ and t₁ and easily specified derivatives at t₀ and t₁. Finally, a polynomial p of the form

p(t) = [(t − t₁)/(t₀ − t₁)]y₀ + [(t − t₀)/(t₁ − t₀)]y₁ + ĝₐ(t) + ĥ_b(t)
     = [(t − t₁)/(t₀ − t₁)]y₀ + [(t − t₀)/(t₁ − t₀)]y₁ + a(t − t₀)(t − t₁)²/(t₀ − t₁)² + b(t − t₀)²(t − t₁)/(t₁ − t₀)²

would be the Hermite polynomial we are after. The first two terms form the interpolating polynomial passing through the required points. The last two terms are zero at t₀ and t₁ so do not affect this interpolation. Moreover, the last two terms are chosen so that their derivatives are convenient at t₀ and t₁. The derivative of (t − t₁)²(t − t₀)/(t₀ − t₁)² is 1 at t₀ and 0 at t₁. The derivative of (t − t₀)²(t − t₁)/(t₁ − t₀)² is 0 at t₀ and 1 at t₁. These characteristics ensure simple values for a and b in terms of the specified derivatives. To find out exactly what they should be, it remains to force ṗ(t₀) = ẏ₀ and ṗ(t₁) = ẏ₁:

ṗ(t) = (y₁ − y₀)/(t₁ − t₀) + 2a(t − t₁)(t − t₀)/(t₀ − t₁)² + 2b(t − t₀)(t − t₁)/(t₁ − t₀)² + a(t − t₁)²/(t₀ − t₁)² + b(t − t₀)²/(t₁ − t₀)²

so

ṗ(t₀) = (y₁ − y₀)/(t₁ − t₀) + a

and

ṗ(t₁) = (y₁ − y₀)/(t₁ − t₀) + b.

Therefore, we need a = ẏ₀ − (y₁ − y₀)/(t₁ − t₀) and b = ẏ₁ − (y₁ − y₀)/(t₁ − t₀). The desired degree at most three Hermite (osculating) polynomial is

p(t) = [(t − t₁)/(t₀ − t₁)]y₀ + [(t − t₀)/(t₁ − t₀)]y₁ + (ẏ₀ − m)(t − t₁)²(t − t₀)/(t₀ − t₁)² + (ẏ₁ − m)(t − t₀)²(t − t₁)/(t₁ − t₀)²   (5.1.1)

where m = (y₁ − y₀)/(t₁ − t₀).
.
This form of the Hermite cubic polynomial is convenient for humans. It is formulaic and requires very little
computation to write down. We will call it the Human form of the Hermite cubic polynomial. A more computer-
friendly form, which we will refer to as the Computer form of the Hermite cubic is obtained via divided differences.
In general, for an osculating polynomial where the first k derivatives are specified at ti , ti and yi must be repeated
k +1 times in the divided differences table. Quotients that would otherwise be undefined as a result of the repetition
are replaced by the specified derivatives, first derivatives for first divided differences, second derivatives for second
divided differences, and so on.
For the cubic Hermite polynomial p passing through (t0 , y0 ) and (t1 , y1 ) with derivative ẏ0 at t0 and ẏ1 at t1 ,
the table looks like so:
t0 y0 y00
t 0 y0
t1 y1 y10
t1 y1
The four remaining entries are to be filled in by the usual divided difference method. Can you compute them in
general (in terms of t0 , t1 , y0 , y1 , ẏ0 , ẏ1 )? Answers on page 183. Using the results, we write down the interpolating
polynomial in two ways:
 
y1 − y0 ẏ0
p(t) = y0 + [ẏ0 ] (t − t0 ) + − (t − t0 )2
(t1 − t0 )2 t1 − t0
ẏ1 + ẏ0
 
y1 − y0
+ − 2 (t − t0 )2 (t − t1 )
(t1 − t0 )2 (t1 − t0 )3
and
 
ẏ1 y1 − y0
p(t) = y1 + [ẏ1 ] (t − t1 ) + − (t − t1 )2
t1 − t0 (t1 − t0 )2
ẏ1 + ẏ0
 
y1 − y0
+ − 2 (t − t1 )2 (t − t0 ).
(t1 − t0 )2 (t1 − t0 )3
Just as we had for interpolating polynomials, we have two ways to find cubic Hermite osculating polynomials. One
way is convenient for humans and the other for computers.
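For instance, a small Octave function (our own sketch, not from the text) can evaluate the Human form 5.1.1 directly from the data (t₀, y₀, ẏ₀) and (t₁, y₁, ẏ₁):

    % Evaluate the cubic Hermite polynomial of equation 5.1.1 at the points t.
    % (t0,y0) and (t1,y1) are the data; yd0 and yd1 are the derivatives there.
    function p = hermite3(t0, y0, yd0, t1, y1, yd1, t)
      m = (y1 - y0)/(t1 - t0);
      p = (t - t1)/(t0 - t1)*y0 + (t - t0)/(t1 - t0)*y1 ...
          + (yd0 - m)*(t - t1).^2.*(t - t0)/(t0 - t1)^2 ...
          + (yd1 - m)*(t - t0).^2.*(t - t1)/(t1 - t0)^2;
    end

For example, hermite3(t0, y0, yd0, t1, y1, yd1, linspace(t0, t1)) returns values along the whole segment, which is handy for plotting.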

Bèzier Curves

Bèzier curves are parametric curves with parameter t ∈ [0, 1] connecting two points. The simplest Bèzier curve is a straight line passing through the two points. For example, the simplest Bèzier curve from (−1, 2) to (5, −2) is given by the parametric linear functions

x(t) = (1 − t)(−1) + t(5)
y(t) = (1 − t)(2) + t(−2),

which we choose to write down in Lagrange form. You can check that x(0) = −1, x(1) = 5, y(0) = 2, and y(1) = −2. In other words, x passes through (0, −1) and (1, 5) while y passes through (0, 2) and (1, −2). This parametrization is unique because x and y are interpolating polynomials.
On the other hand, if we allow x and y to be quadratic, there are infinitely many (parametric) pairs of functions connecting (−1, 2) to (5, −2) even if we require x and y to be interpolating polynomials and restrict the parameter t to the interval [0, 1]. That is not to say we do not have quadratic Bèzier curves, but rather that we need to specify more than just the two points to be connected. Allowing the parameter functions to be quadratic, we have, say,

x(t) = a_x + b_x t + c_x t²
y(t) = a_y + b_y t + c_y t²,

giving six unknowns or undetermined coefficients, if you will. Forcing (x(0), y(0)) = (−1, 2), we need

x(0) = a_x = −1
y(0) = a_y = 2.

Forcing (x(1), y(1)) = (5, −2), we need

x(1) = a_x + b_x + c_x = −1 + b_x + c_x = 5
y(1) = a_y + b_y + c_y = 2 + b_y + c_y = −2

or

b_x + c_x = 6
b_y + c_y = −4.

That leaves two conditions that may yet be imposed on the parameter functions.
Any particular quadratic Bèzier curve is prescribed by specifying a control point distinct from the two endpoints.
The two linear Bèzier curves, one connecting (−1, 2) to the control point and the other connecting the control point
to (5, −2), then determine the quadratic Bèzier curve. Suppose B ~ 1,0 (t) is the linear Bèzier curve from (−1, 2) to
the control point and B1,1 (t) is the linear Bèzier curve from the control point to (5, −2). These two curves define
~
a family of linear Bèzier curves, namely the set of linear Bèzier curves from B ~ 1,0 (t0 ) to B
~ 1,1 (t0 ), where t0 ∈ [0, 1].
Letting B2,0,t0 (t) be the linear Bèzier curve from B1,0 (t0 ) to B1,1 (t0 ), the point B2,0,t0 (t0 ) is on the quadratic Bèzier
~ ~ ~ ~
curve from (−1, 2) to (5, −2) via the given control point. The collection of all such points as t0 varies from 0 to 1
is the quadratic Bèzier curve we are after. Different control points determine different quadratics. For example, if
we have (0, 4) as our control point, B ~ 1,0 is the linear Bèzier curve connecting (−1, 2) to (0, 4) and B ~ 1,1 is the linear
Bèzier curve from (0, 4) to (5, −2):
(1 − t)(−1)
 
~ 1,0 (t) =
B
(1 − t)(2) + t(4)
and
 
~ 1,1 (t) = t(5)
B .
(1 − t)(4) + t(−2)
~ 2,0,t is the linear Bèzier curve connecting B
B ~ 1,0 (t0 ) to B
~ 1,1 (t0 ). Therefore, B~ 2,0,t (t) = (1 − t)B
~ 1,0 (t0 ) + tB
~ 1,1 (t0 )
0 0
or
(1 − t0 )(−1) t0 (5)
   
B2,0,t0 (t) = (1 − t)
~ +t .
(1 − t0 )(2) + t0 (4) (1 − t0 )(4) + t0 (−2)

Then
(1 − t0 )(−1) t0 (5)
   
~ 2,0,t (t0 ) = (1 − t0 )
B + t0 .
0
(1 − t0 )(2) + t0 (4) (1 − t0 )(4) + t0 (−2)

~ 2,0,1 (1) = 5 .
   
Observe that B~ 2,0,t is quadratic as a function of t0 and that B ~ 2,0,0 (0) = −1 and B
0
2 −2
But the notation B ~ 2,0,t (t0 ) is cumbersome and we are really interested in a parametrization of the quadratic
0

anyway. Letting B ~ 2,0 (t) = B ~ 2,0,t (t), we get the quadratic Bèzier curve from (−1, 2) to (5, −2) via control point
(0, 4):
(1 − t)(−1)
   
t(5)
B2,0 (t) = (1 − t)
~ +t
(1 − t)(2) + t(4) (1 − t)(4) + t(−2)
and we have cleaner notation.
With some algebra, the expression for B ~ 2,0 can be simplified, but leaving it unsimplified emphasizes whence it
came. It is the result of nested linear interpolations. Higher order Bèzier curves are constructed by continued nesting.
We now use this idea to define the Bèzier curve from P~0 to P~n via control points P~1 , P~2 , . . . , P~n−1 . Commonly, P~0
and P~n are also considered control points and so this Bèzier curve is also referred to as the Bèzier curve with control
points P~0 , P~1 , . . . , P~n . Such a Bèzier curve will have degree at most n.
We begin by defining the linear Bèzier curves

B⃗₁,ᵢ(t) = (1 − t)P⃗ᵢ + (t)P⃗ᵢ₊₁,   i = 0, 1, . . . , n − 1.   (5.1.2)

Note that B⃗₁,ᵢ is the linear Bèzier curve from P⃗ᵢ to P⃗ᵢ₊₁. Then

B⃗ⱼ,ᵢ(t) = (1 − t)·B⃗ⱼ₋₁,ᵢ(t) + (t)·B⃗ⱼ₋₁,ᵢ₊₁(t),   j = 2, 3, . . . , n;  i = 0, 1, . . . , n − j.   (5.1.3)

Note that B⃗₂,ᵢ(t) is the quadratic Bèzier curve connecting P⃗ᵢ to P⃗ᵢ₊₂ via control point P⃗ᵢ₊₁. With a little algebra, you can confirm that B⃗₃,ᵢ(t) is at-most-cubic and connects P⃗ᵢ to P⃗ᵢ₊₃. An inductive proof will show that B⃗ⱼ,ᵢ(t) is an at-most-degree-j polynomial parametrization connecting P⃗ᵢ to P⃗ᵢ₊ⱼ. Can you provide it? Answer on page 5.1.
It follows that B⃗ₙ,₀(t) is the degree at most n Bèzier curve connecting P⃗₀ to P⃗ₙ.
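The recursion 5.1.2/5.1.3 translates almost verbatim into Octave. Here is a sketch (ours, not the book's) that evaluates the degree-n Bèzier curve at a parameter value t from a 2×(n+1) matrix of control points, one point per column:

    % Evaluate the Bezier curve with control points P (one point per column)
    % at parameter t, by repeated linear interpolation (the recursion 5.1.3).
    function B = bezier_eval(P, t)
      while size(P, 2) > 1
        P = (1 - t)*P(:, 1:end-1) + t*P(:, 2:end);   % one level of the recursion
      end
      B = P;                                          % this is B_{n,0}(t)
    end

For example, bezier_eval([-1 5; 2 -2], 0.5) returns the midpoint of the simple linear Bèzier curve from (−1, 2) to (5, −2).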
Returning to our previous example, we add the control point (5, 1) so we have now four control points:

0 5 5
       
−1
P~0 = , P~1 = , P~2 = , P~3 = .
2 4 1 −2

By equation 5.1.2,

0 −1 + t
     
~ 1,0 (t) −1
B = (1 − t)P0 + (t)P1 = (1 − t)
~ ~ +t =
2 4 2 + 2t
0 5 5t
     
B1,1 (t) = (1 − t)P1 + (t)P2 = (1 − t)
~ ~ ~ +t =
4 1 4 − 3t
5 5 5
     
B1,2 (t) = (1 − t)P2 + (t)P3 = (1 − t)
~ ~ ~ +t = .
1 −2 1 − 3t

And by equation 5.1.3,

−1 + t 5t −1 + 2t + 4t2
     
~ 2,0 (t)
B = (1 − t)B
~ 1,0 (t) + (t)B
~ 1,1 (t) = (1 − t) +t =
2 + 2t 4 − 3t 2 + 4t − 5t2
5t 5 10t − 5t2
     
~ 2,1 (t)
B = (1 − t)B
~ 1,1 (t) + (t)B
~ 1,2 (t) = (1 − t) +t = ,
4 − 3t 1 − 3t 4 − 6t

and
~ 3,0 (t)
B = (1 − t)B~ 2,0 (t) + (t)B
~ 2,1 (t)
−1 + 2t + 4t2 10t − 5t2
   
= (1 − t) +t
2 + 4t − 5t2 4 − 6t
2 3
−1 + 3t + 12t − 9t
 
= . (5.1.4)
2 + 6t − 15t2 + 5t3

Figure 5.1.1: Three points on a cubic Bèzier curve constructed by recursive linear interpolation.

5 0 5
       
~ 3,0 (t) is the cubic Bèzier curve from −1
B to via control points and . Figure 5.1.1 shows this
2 −2 4 1
Bèzier curve and the construction of three of its points via recursive linear interpolation. The blue points lie along
the linear Bèzier curves B ~ 1,0 , B ~ 1,2 . The orange points lie along the quadratic Bèzier curves B
~ 1,1 , B ~ 2,0 and B ~ 2,1 .
The black points lie along the cubic Bèzier curve. The graphs of the quadratics have been suppressed to avoid
overcomplicating the figure.
Figure 5.1.1 may help you grasp the recursion, but maybe more importantly, may help you understand the
relationship between the control points and the Bèzier curve. For example, upon close examination, you may be
led to believe the line segments B ~ 1,0 and B ~ 1,2 are tangent to the cubic Bèzier curve B
~ 3,0 at P~0 and P~3 , respectively.
Close examination of the formulas will confirm it.
According to formulas 5.1.2 and 5.1.3, the (at most) cubic Bèzier curve with control points P~0 , P~1 , P~2 , P~3 is
computed thus:
~ 1,0 (t)
B = (1 − t)P~0 + (t)P~1
~ 1,1 (t)
B = (1 − t)P~1 + (t)P~2
~ 1,2 (t)
B = (1 − t)P~2 + (t)P~3
so
h i h i
~ 2,0 (t)
B = (1 − t)B
~ 1,0 (t) + (t)B
~ 1,1 (t) = (1 − t) (1 − t)P~0 + (t)P~1 + t (1 − t)P~1 + (t)P~2

= (1 − t)2 P~0 + 2t(1 − t)P~1 + t2 P~2


h i h i
~ 2,1 (t)
B = (1 − t)B
~ 1,1 (t) + (t)B
~ 1,2 (t) = (1 − t) (1 − t)P~1 + (t)P~2 + t (1 − t)P~2 + (t)P~3

= (1 − t)2 P~1 + 2t(1 − t)P~2 + t2 P~3


so
~ 3,0 (t)
B = (1 − t)B
~ 2,0 (t) + (t)B
~ 2,1 (t)
h i h i
= (1 − t) (1 − t)2 P~0 + 2t(1 − t)P~1 + t2 P~2 + t (1 − t)2 P~1 + 2t(1 − t)P~2 + t2 P~3 (5.1.5)

=(1 − t)3 P~0 + 3t(1 − t)2 P~1 + 3t2 (1 − t)P~2 + t3 P~3 .

Hence, dt
d ~
B3,0 (t) = −3(1 − t)2 P~0 + 3 (1 − t)2 − 2t(1 − t) P~1 + 3 2t(1 − t) − t2 P~2 + 3t2 P~3 , from which it follows
   

d ~
B3,0 (t) = −3P~0 + 3P~1 = 3(P~1 − P~0 )

dt t=0

d ~
B3,0 (t) = −3P~2 + 3P~3 = 3(P~3 − P~2 ).

dt t=1

Indeed, the derivative of B ~ 3,0 at 0 is in the direction of the line segment from P~1 to P~2 , and the derivative of B
~ 3,0
at 1 is in the direction of the line segment from P2 to P3 . Moreover, these derivatives have magnitude exactly three
~ ~
times the magnitudes of the line segments.
Though we took a somewhat circuitous route, we now see another way to compute cubic Bèzier curves besides
using recursion 5.1.2/5.1.3 or formula 5.1.5. Control points P~0 and P~3 give us two points x and y must pass through.
Control points P~1 and P~2 give us ẋ and ẏ at those two points. Thus specified, x and y are cubic Hermite polynomials!

To be precise, let P~i = (xi , yi ) for i = 0, 1, 2, 3. Then x(t) is the cubic Hermite polynomial with x(0) = x0 ,
ẋ(0) = 3(x1 − x0 ), x(1) = x3 , and ẋ(1) = 3(x3 − x2 ); and y(t) is the cubic Hermite polynomial with y(0) = y0 ,
ẏ(0) = 3(y1 − y0 ), y(1) = y3 , and ẏ(1) = 3(y3 − y2 ).
5 0 5
       
−1
We close this section by computing the Bèzier curve from to via control points and
2 −2 4 1
using equation 5.1.1 and comparing our results to 5.1.4. With x(0) = −1, ẋ(0) = 3, x(1) = 5, and ẋ(1) = 0 (and
5+1
the understood substitution of x for y), equation 5.1.1 gives m = 1−0 = 6 and
t−1 t (t − 1)2 t t2 (t − 1)
x(t) = (−1) + (5) + (3 − 6) + (−6) .
−1 1 1 1
Using equation 5.1.1 with y(0) = 2, ẏ(0) = 6, y(1) = −2, and ẏ(1) = −9 gives m = −2−2
1−0 = −4 and
t−1 t (t − 1)2 t t2 (t − 1)
y(t) = (2) + (−2) + (6 + 4) + (−9 + 4) .
−1 1 1 1
While these equations are complete and correct, it is difficult to compare them to 5.1.4 without some simplification.
Can you show
x(t) = −1 + 3t + 12t2 − 9t3
y(t) = 2 + 6t − 15t2 + 5t3
as required? Answer on page 183.

Crumpet 31: Bézier curves and CAGD

Bézier curves were originally developed around 1960 by employees at French automobile manufacturing companies. Paul de Casteljau of Citroën was first, but Pierre Bèzier of Renault popularized the method and so has his name associated with the polynomials.
Nowadays, almost all computer aided graphic design, or CAGD, software uses Bèzier curves, particularly
cubic, for drawing smooth objects. CAGD software with cubic Bèzier tools will display the four control points
and allow the user to move them about. In fact, the software will draw the two linear Bézier curves at the
endpoints as well. This gives the user “handles” to manipulate the curve. Some software will include the third
linear Bèzier curve as well. The three linear Bèzier curves together form the so-called control polygon. Since the
relationship between the control points and the curve is intuitive, manipulation of the control points, whether it
be by handles or control polygons, provides a means for swift modeling of smooth shapes.
Some shapes are too intricate to model with a single cubic Bèzier curve, however. To handle such shapes,
CAGD software allows a user to string cubic Bèzier curves together end to end, forming a composite, or piecewise,
Bèzier curve, such as that shown here.

This particular curve is made of two cubic Bèzier curves, one with control points P ~0 , P
~1 , P ~3 and the other with
~2 , P
control points P
~3 , P
~4 , P ~6 . Since Bèzier curves are intended to model smooth objects, software will provide the
~5 , P
option of forcing derivative matching at a common point such as P ~3 . This is done by making sure the common
point is on the line segment between its two adjacent control points (P ~2 and P~4 in this diagram). You may view
an interactive version of this diagram at the companion website.
Free open source software such as Inkscape, LibreOffice Drawing, and Dia provide Bezier curve drawing
tools, but not all of them use the technical term. Inkscape has a Bezier curve tool by that name, but LibreOffice
Drawing’s Bezier curve tool is simply called “curve”, and Dia’s tool for single Bezier curves only, not composite,
goes by the name of “Bezierline”.

References [1, 10, 9, 15, 27, 32]

Key Concepts
osculating polynomial: A polynomial whose graph is required to pass through a set of prescribed points $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$ and whose first $m_i$ derivatives may also be specified at $x_i$.

Hermite polynomial: An osculating polynomial required to pass through two points with its first derivative specified at each point.

Bézier curve: A curve connecting two points via parametric osculating polynomials.

Exercises
1. Find the cubic Hermite polynomial interpolating the data.

      x   f(x)   f'(x)
      1    2       1
      5    3      −1

2. Find the Hermite polynomial of degree (at most) 5 interpolating the data.

      x     f(x)   f'(x)
      0      2       1
      0.5    2       0
      1      2       1

3. Let g(x) = (√2)^x.
   (a) Using x0 = 1 and x1 = 2, find a Hermite interpolating polynomial for g.
   (b) Use the Hermite polynomial to approximate g(1.5).
   (c) Calculate the actual error of this approximation, and compare it to the error you got in question 15 of section 3.2 on page 116.
   (d) Which polynomial approximated g(1.5) with smaller absolute error, the Hermite or the Lagrange interpolating polynomial?

4. Find a polynomial that passes through the points (0, 0) and (4, −3) and whose derivative passes through the points (0, 1) and (4, 1).

5. Construct the Hermite interpolating polynomial for the given data. Do this using a pencil, paper and calculator, or a spreadsheet. Do not use Octave code.

      x     f(x)           f'(x)
      0.1   −0.29004996    −2.8019975
      0.2   −0.56079734    −2.6159201
      0.3   −0.81401972    −2.4533949

6. Find parametric equations for the cubic Bézier curve. The ends of the "handles" are the four control points.

7. Write down the parametric equations of the Bézier curve with control points (−1, 2), (−3, 2), (3, 1), and (3, 0). It is not necessary to simplify your answer.

8. Construct the parametric equations for the Bézier curve with control points (1, 1), (2, 1.5), (7, 1.5), (6, 2).

9. Find equations for the cubic polynomials that make up the composite Bézier curve.

10. The data in question 5 were generated using f(x) = x² cos(x) − 3x.
   (a) Approximate f(0.18) using the polynomial from question 5.
   (b) Calculate the absolute error of this approximation.

11. Suppose H(x) = x⁵ − 3x⁴ + 2x³ − 6x + 2 is a Hermite polynomial interpolating the data

      x   f(x)   f'(x)
      0     2     −6
      1    −4
      2   −10      2

   collected from a function f. Find the missing datum.

12. A Hermite polynomial H(x) is constructed using the data

      x       0.3    0.5    0.6    0.8
      f(x)    0.8    0.6    0.3    0.5
      f'(x)   1.5   −1.2   −5.3   −2

   (a) Find (H ∘ H)'(0.6). That is, the derivative of H(H(x)) evaluated at x = 0.6.
   (b) Find (f ∘ f)'(0.8).

13. The Hermite interpolating polynomial for the following data has the form H(x) = a0 + a1(x − 0.3) + a2(x − 0.3)² + . . ..

      x      f(x)     f'(x)
      0.30   0.295    −0.155
      0.32   0.314    −0.149
      0.35   0.342    −0.139

   (a) Fill in the missing part of the form of H(x).
   (b) What is the maximum possible degree of H(x)?
   (c) Find a0 and a1.

14. Construct the divided differences table that led to the Hermite polynomial
   p(x) = 2 − (x − 1) + ¼(x − 1)² + ¼(x − 1)²(x − 3).

15. The Bézier curve
   x(t) = 11t³ − 18t² + 3t + 5
   y(t) = t³ + 1
   has control points (5, 1), (6, 1), and (1, 2). Find the fourth control point.

16. What is the minimum number of cubic Bézier curves in the diagram, and why?

17. Refer to the following graph.
   (a) The graph can not be the graph of a single cubic Bézier curve. Why not?
   (b) The graph is that of a composite cubic Bézier curve. At least how many cubic Bézier curves have been spliced together, and why?

18. Give three reasons that might make you use a Bézier curve rather than a Lagrange polynomial to model a certain graph.

19. The osculating polynomial p(x) passing through (x0, f(x0)) with p'(x0) = f'(x0), p''(x0) = f''(x0), and p'''(x0) = f'''(x0) is also called what? Be as specific as you can.

20. A cubic polar Bézier curve is the unique (parametrized) cubic polar function (r(t), θ(t)) satisfying the following data.

      t   r(t)   θ(t)   ṙ(t)   θ̇(t)
      0   r0     θ0     δ0     µ0
      1   r1     θ1     δ1     µ1

   (a) A standard cubic Bézier curve is given by the control points (0, 0), (2, 0), (0, 1), and (0, 3) (in that order). Convert this data into polar coordinate data. Recall that the conversion from Cartesian coordinates to polar coordinates involves the formulas
      r = √(x² + y²)  and  tan θ = y/x.
   (b) Find the cubic polar Bézier curve based on your results from (a).

21. Write an Octave function to compute Hermite polynomials.

22. A car traveling along a straight road is clocked at a number of points. The data from the observations are given in the following table, where the time is in seconds, the distance is in feet, and the speed is in feet per second.

      Time        0     3     5     8    13
      Distance    0   225   383   623   993
      Speed      75    77    80    74    72

   (a) Compute a Hermite interpolating polynomial for the data.
   (b) Use your polynomial from part (a) to predict the position (distance) of the car and its speed when t = 10 seconds.
   (c) Determine whether the car ever exceeds the 55 mph speed limit on the road. If so, what is the first time the car exceeds this speed?
   (d) What is the predicted maximum speed for the car?
   NOTES: Speed is the derivative of distance.
      55 miles/hour = 55 miles/hour × 5280 feet/mile × 1 hour/3600 seconds ≈ 80.67 feet/second

23. Complete the following code.

#######################################
# Written by Dr. Len Brin             #
# 13 March 2012                       #
# Purpose: Evaluate an interpolating  #
#   polynomial at the value z.        #
# INPUT: number z                     #
#   Data x0,x1,...,xn used to         #
#   calculate the polynomial: x       #
#   Entries a0;0, a1;0,1, ...         #
#   an;0,1,...,n as an array: c       #
# OUTPUT: P(z), the value of the      #
#   interpolating polynomial at z.    #
#######################################
function ans = divDiffEval(z,x,c)
  n = length(x);
  ans = c(n);
  for i=1:n-1
    ans=(z-x(???))*ans+c(???);
  end#for
end#function

Answers
Hermite polynomial computer form: The four remaining entries are
$$\begin{aligned}
f_{1,1} &= \frac{y_1 - y_0}{t_1 - t_0} \\
f_{0,2} &= \frac{f_{1,1} - \dot{y}_0}{t_1 - t_0} = \frac{y_1 - y_0}{(t_1 - t_0)^2} - \frac{\dot{y}_0}{t_1 - t_0} \\
f_{1,2} &= \frac{\dot{y}_1 - f_{1,1}}{t_1 - t_0} = \frac{\dot{y}_1}{t_1 - t_0} - \frac{y_1 - y_0}{(t_1 - t_0)^2} \\
f_{0,3} &= \frac{f_{1,2} - f_{0,2}}{t_1 - t_0} = \frac{\dot{y}_1 + \dot{y}_0}{(t_1 - t_0)^2} - 2\frac{y_1 - y_0}{(t_1 - t_0)^3}
\end{aligned}$$

Bézier curve $\vec{B}_{j,i}(t)$ is an at-most-degree-j polynomial connecting $\vec{P}_i$ to $\vec{P}_{i+j}$: Proof. We proceed by induction on j, beginning with j = 1: Since
$$\vec{B}_{1,i}(t) = (1-t)\vec{P}_i + (t)\vec{P}_{i+1}, \qquad i = 0, 1, \ldots, n-1,$$
$\vec{B}_{1,i}(0) = \vec{P}_i$ and $\vec{B}_{1,i}(1) = \vec{P}_{i+1}$, so $\vec{B}_{1,i}$ connects $\vec{P}_i$ to $\vec{P}_{i+1}$. Furthermore, $\vec{B}_{1,i}(t) = \vec{P}_i + t(\vec{P}_{i+1} - \vec{P}_i)$, so $\vec{B}_{1,i}$ is an at-most-degree-1 polynomial. Now assume $\vec{B}_{j,i}(t)$ is an at-most-degree-j polynomial connecting $\vec{P}_i$ to $\vec{P}_{i+j}$ for some j ≥ 1 (and all applicable i). By definition, $\vec{B}_{j+1,i}(0) = \vec{B}_{j,i}(0)$ and $\vec{B}_{j+1,i}(1) = \vec{B}_{j,i+1}(1)$. By the inductive hypothesis, $\vec{B}_{j,i}(0) = \vec{P}_i$ and $\vec{B}_{j,i+1}(1) = \vec{P}_{i+j+1}$, so $\vec{B}_{j+1,i}$ connects $\vec{P}_i$ to $\vec{P}_{i+j+1}$. Furthermore,
$$\vec{B}_{j+1,i}(t) = (1-t)\cdot\vec{B}_{j,i}(t) + (t)\cdot\vec{B}_{j,i+1}(t)$$
has degree at most j + 1 because $\vec{B}_{j,i}(t)$ and $\vec{B}_{j,i+1}(t)$ have at most degree j (by the inductive hypothesis). This completes the proof.

Bézier curve via Hermite cubics: The simplification may be done as follows.
$$\begin{aligned}
x(t) &= (-1)\frac{t-1}{-1} + (5)\frac{t}{1} + (3-6)\frac{(t-1)^2 t}{1} + (-6)\frac{t^2(t-1)}{1} \\
&= (t-1) + 5t - 3t(t-1)^2 - 6t^2(t-1) \\
&= 6t - 1 - 3t(t^2 - 2t + 1) - 6t^3 + 6t^2 \\
&= 6t - 1 - 3t^3 + 6t^2 - 3t - 6t^3 + 6t^2 \\
&= -9t^3 + 12t^2 + 3t - 1
\end{aligned}$$
and
$$\begin{aligned}
y(t) &= (2)\frac{t-1}{-1} + (-2)\frac{t}{1} + (6+4)\frac{(t-1)^2 t}{1} + (-9+4)\frac{t^2(t-1)}{1} \\
&= -2(t-1) - 2t + 10t(t-1)^2 - 5t^2(t-1) \\
&= -2t + 2 - 2t + 10t(t^2 - 2t + 1) - 5t^3 + 5t^2 \\
&= 2 - 4t + 10t^3 - 20t^2 + 10t - 5t^3 + 5t^2 \\
&= 5t^3 - 15t^2 + 6t + 2.
\end{aligned}$$

5.2 Splines
Osculating polynomials have limited use in applications where a curve is required to pass through a large number
of points. And large may mean only a half dozen or so. Take the following innocuous-looking set of points.

[Figure: the eight data points.]

It is easy to imagine an equally innocuous function passing through these eight points, but actually finding such a
function poses a slight challenge. The interpolating polynomial of least degree oscillates too widely.

[Figure: the interpolating polynomial of least degree through the eight points.]

This is a common problem with high-degree interpolating polynomials. There is no control over their oscillations,
and the oscillations are most often undesirable. The oscillations can be tamed to some degree by finding the
osculating polynomial through these points with, say, a first derivative of 0 at 0 and of −1/2 at the seventh point from the left (the one whose x-coordinate is between 5 and 6).

[Figure: an osculating polynomial through the eight points.]

That’s better, but still leaves something to be desired. And the business of setting the first derivatives at two of the
points strictly for the purpose of reducing the oscillations is a bit arbitrary—better to let the nature of the problem
dictate. The oscillations of the previous attempts make them far too distinctive and interesting for the vapid set of
points with which we began. A rightfully trite way to interpolate the data is by connecting consecutive points by
line segments.

[Figure: the piecewise linear interpolation of the eight points.]

This forms what is known as the piecewise linear interpolation of the data set. This type of graph is often seen in
public media. Many applications, especially those from engineering, require some smoothness, however. Connecting
sets of three consecutive points by quadratic functions helps.
[Figure: quadratics fit to consecutive triples of the points.]
That takes care of smoothness at three of the points, but still lacks differentiability at the points common to
consecutive quadratics. Moreover, using the first three points for the first quadratic (which looks linear to the
naked eye), the third through fifth points for the second quadratic, and the fifth through seventh points for the
third quadratic (which also looks linear to the naked eye) leaves only the seventh and eighth points for what would
presumably be a fourth quadratic. With only two points, however, a line segment is used instead. A smoother
solution to the problem is to make sure the first derivatives of consecutive quadratics match at their common point.
With that in mind, it makes sense to fit only two points per parabola, leaving one coefficient (of the three in any
quadratic) for matching the derivative of the neighboring quadratic.
[Figure: a piecewise quadratic with matching first derivatives at the joints.]
That’s better! This piecewise parabolic function has continuous first derivative, but there is still something arbitrary
about it. The seven parabolas have, all together, 21 coefficients. Making each parabola pass through two points
gives 14 conditions on those coefficients. Having adjacent parabolas match first derivatives at their common points
gives 6 more conditions, one at each of the 6 interior points. That leaves one “free” coefficient. Specifying one last
condition seems a bit arbitrary, and is. The graph shows the result when the derivative at 0 is set to 1. Notice
there is no control over the derivative at the right end. Besides the arbitrariness, this asymmetry is bothersome. If
only we had one more degree of freedom...

Piecewise polynomials
A piecewise-defined function whose pieces are all polynomials is called a piecewise polynomial. It takes the form
$$p(x) = \begin{cases} p_1(x), & x \in [x_0, x_1] \\ p_2(x), & x \in (x_1, x_2] \\ \quad\vdots \\ p_n(x), & x \in (x_{n-1}, x_n] \end{cases}$$
where $p_i(x)$ is a polynomial for each i = 1, 2, . . . , n and $x_0 < x_1 < \cdots < x_n$; or some variant where $p(x_j)$ is defined by exactly one of the $p_i$. If each $p_i$ is a linear function, p is called piecewise linear. If each $p_i$ is a quadratic function, p is called piecewise quadratic. If each $p_i$ is a cubic function, p is called piecewise cubic. And so on. Examples of piecewise linear and piecewise quadratic functions appear in the introduction to this section.
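Evaluating a piecewise polynomial in Octave takes only a few lines. Here is a small sketch of our own (the name piecewiseEval is ours and appears nowhere else in the text); row i of the coefficient matrix holds the coefficients of $p_i$ in the order Octave's polyval expects.

% Sketch: evaluate a piecewise polynomial at the scalar t. x holds the
% breakpoints x0 < x1 < ... < xn, and row i of C holds the coefficients
% of p_i, highest degree first.
function v = piecewiseEval(t,x,C)
  i = find(t <= x(2:end), 1);             % first piece whose right end is >= t
  if isempty(i), i = length(x)-1; end%if  % clamp to the last piece
  v = polyval(C(i,:), t);
end%function

For instance, piecewiseEval(1.5,[0 1 2],[1 0; 2 -1]) evaluates the second piece, 2x − 1, at 1.5 and returns 2.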

Splines
Nothing about the definition of piecewise polynomials requires one to be differentiable or even continuous. The
following function is a piecewise polynomial.
[Figure: a piecewise polynomial that is neither continuous nor differentiable.]

Most applications of piecewise polynomials require continuity or differentiability, however. Any piecewise polynomial
with at least one continuous derivative is called a spline. The points separating adjacent pieces, the xj , j =
1, 2, . . . , n − 1, are called knots or joints.
The last graph in the introduction to this section shows a quadratic spline. Each piece of the piecewise function
is a quadratic, and the quadratics are chosen so that their derivatives match at the joints. As pointed out there,
though, we needed to supply one unnatural condition—the derivative at the left endpoint. It could have been the
derivative at any of the points, or even the second derivative at one of the points. In a very real sense, the choice
was arbitrary. It was not governed naturally by the question at hand. Consequently, there is a family of solutions
to the problem of connecting those eight points with a continuously differentiable piecewise quadratic.

Cubic splines
The most common spline in use is the cubic spline. As with the quadratic spline, a cubic spline is computed by
matching derivatives at the joints. In fact, there are enough coefficients in the set of cubics that both first and
second derivatives are matched. Note that, according to our definition of spline, matching both first and second
derivatives at the joints is not strictly necessary, however. Other sources will give a more restrictive definition of
spline where matching both derivatives is required. As a matter of convention, we focus on such splines.
A cubic spline required to interpolate n + 1 points has n − 1 joints and n pieces. It follows that the set of cubics
has 4n coefficients. Requiring each cubic to pass through 2 points gives 2n conditions on the coefficients. Requiring
first derivative matching at the joints gives n − 1 more conditions. Requiring second derivative matching at the
joints gives an additional n − 1 conditions for a grand total of 4n − 2 conditions. That leaves 2 “free” coefficients.
Mathematically speaking, we have a family of splines with two degrees of freedom. To find any specific spline, we
need to enforce two more conditions on the coefficients. These conditions may include the first, second, or third
derivative at two of the nodes, both the first and second derivative at a single node, or some other combination of
two derivative requirements.
Guided perhaps by knowledge of draftsman’s splines, convention leads us to supply endpoint conditions. That
is, we require something of some derivative at x0 and at xn . Supplying the first derivative is akin to pointing
the draftsmen’s spline in a particular direction at its ends. Setting the second derivative equal to 0 is akin to
allowing the ends of a draftsman’s spline to freely point in whatever direction physics takes them. These models of
draftsman’s splines are not particularly accurate, but they are motivational.
A cubic spline with its first derivative specified at both endpoints is called a clamped spline. A cubic spline with
its second derivative set equal to zero at both endpoints is called a natural or free spline. A hybrid where the first
derivative is specified at one end and the second derivative is set to zero at the other has no special name. To be
precise, we have the following definitions.
Let $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$ be n + 1 points where $x_0 < x_1 < \cdots < x_n$ and let $S_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3$ for i = 1, 2, . . . , n. Then S, defined by
$$S(x) = \begin{cases} S_1(x), & x \in [x_0, x_1] \\ S_2(x), & x \in [x_1, x_2] \\ \quad\vdots \\ S_n(x), & x \in [x_{n-1}, x_n], \end{cases}$$
is a cubic spline if it satisfies the following three conditions.

1. $S_i(x_{i-1}) = y_{i-1}$ and $S_i(x_i) = y_i$ for $i = 1, 2, \ldots, n$ (interpolation)
2. $S_i'(x_i) = S_{i+1}'(x_i)$ and $S_i''(x_i) = S_{i+1}''(x_i)$ for $i = 1, 2, \ldots, n-1$ (derivative matching)
3. One of the following is satisfied (endpoint conditions)
   (a) $S_1''(x_0) = S_n''(x_n) = 0$
   (b) $S_1'(x_0) = m_0$ and $S_n'(x_n) = m_n$ for some $m_0$ and $m_n$
   (c) $S_1'(x_0) = m_0$ for some $m_0$ and $S_n''(x_n) = 0$
   (d) $S_1''(x_0) = 0$ and $S_n'(x_n) = m_n$ for some $m_n$
If endpoint condition 3a is satisfied, S is called a free spline or natural spline. If endpoint condition 3b is satisfied,
S is called a clamped spline.
The natural (cubic) spline passing through the eight points presented in the introduction to this section looks
like this.
[Figure: the natural cubic spline through the eight points.]

Finally, a function that is as unspectacular as the data set itself! How was it calculated, you ask? The short answer
is, the 28 simultaneous equations resulting from the definition of natural cubic spline were solved. The solution
provided the coefficients ai , bi , ci , di , i = 1, 2, . . . , 7.

Setting up the equations


The long answer is, well, a bit longer to tell, but really only differs from the short version in the level of detail. To begin, the requirement that $S_i(x_i) = y_i$ immediately gives us the values of n of the coefficients:
$$S_i(x_i) = a_i = y_i.$$
The requirement that $S_i(x_{i-1}) = y_{i-1}$ gives us the n equations
$$S_i(x_{i-1}) = y_i + b_i(x_{i-1} - x_i) + c_i(x_{i-1} - x_i)^2 + d_i(x_{i-1} - x_i)^3 = y_{i-1} \qquad (5.2.1)$$
for i = 1, 2, . . . , n. The derivative requirements give us n − 1 equations each:
$$\begin{aligned}
S_i'(x_i) = S_{i+1}'(x_i) &\;\Rightarrow\; b_i = b_{i+1} + 2c_{i+1}(x_i - x_{i+1}) + 3d_{i+1}(x_i - x_{i+1})^2 \qquad (5.2.2) \\
S_i''(x_i) = S_{i+1}''(x_i) &\;\Rightarrow\; 2c_i = 2c_{i+1} + 6d_{i+1}(x_i - x_{i+1}) \qquad (5.2.3)
\end{aligned}$$
for i = 1, 2, . . . , n − 1. Finally, the endpoint conditions give us the two equations
$$\begin{aligned}
S_1''(x_0) &= 2c_1 + 6d_1(x_0 - x_1) = 0 \qquad (5.2.4) \\
S_n''(x_n) &= 2c_n = 0. \qquad (5.2.5)
\end{aligned}$$

Without much ado, we have the values of the ai and of cn . The remaining 3n − 1 coefficients are found by solving
the remaining 3n − 1 simultaneous equations. Though a computer can certainly handle the solution from here,
finding a bit of the general solution by hand gives a much more efficient algorithm.

Solving the equations


Essentially, we now have three equations with three unknowns. Equations 5.2.1, 5.2.2, and 5.2.3 are written in the variables $b_i$, $c_i$, $d_i$. Equation 5.2.3 can easily be solved for $d_i$ in terms of $c_i$ and equation 5.2.1 can easily be solved for $b_i$. The resulting expressions can be substituted into equation 5.2.2 to get an equation in only $c_i$. It is a straightforward matter to complete the calculation. At this point, it becomes convenient to define $h_i = x_{i-1} - x_i$.
$$\begin{aligned}
(5.2.3) \;\Rightarrow\; d_{i+1} &= \frac{c_i - c_{i+1}}{3h_{i+1}}, \quad i = 1, 2, \ldots, n-1 \\
\Rightarrow\; d_i &= \frac{c_{i-1} - c_i}{3h_i}, \quad i = 2, 3, \ldots, n. \qquad (5.2.6) \\
(5.2.1) \;\Rightarrow\; b_i &= \frac{y_{i-1} - y_i}{h_i} - c_i h_i - d_i h_i^2, \quad i = 1, 2, \ldots, n \\
\Rightarrow\; b_i &= \frac{y_{i-1} - y_i}{h_i} - c_i h_i - \frac{(c_{i-1} - c_i)h_i}{3}, \quad i = 2, 3, \ldots, n \\
\Rightarrow\; b_i &= \frac{y_{i-1} - y_i}{h_i} - \frac{(c_{i-1} + 2c_i)h_i}{3}, \quad i = 2, 3, \ldots, n \qquad (5.2.7) \\
\Rightarrow\; b_{i+1} &= \frac{y_i - y_{i+1}}{h_{i+1}} - \frac{(c_i + 2c_{i+1})h_{i+1}}{3}, \quad i = 1, 2, \ldots, n-1.
\end{aligned}$$
Substituting into equation 5.2.2,
$$\frac{y_{i-1} - y_i}{h_i} - \frac{(c_{i-1} + 2c_i)h_i}{3} = \frac{y_i - y_{i+1}}{h_{i+1}} - \frac{(c_i + 2c_{i+1})h_{i+1}}{3} + 2c_{i+1}h_{i+1} + (c_i - c_{i+1})h_{i+1}$$
for i = 2, 3, . . . , n − 1. With a bit of simplification, this becomes
$$h_i c_{i-1} + 2(h_i + h_{i+1})c_i + h_{i+1}c_{i+1} = 3\left(\frac{y_{i-1} - y_i}{h_i} - \frac{y_i - y_{i+1}}{h_{i+1}}\right), \quad i = 2, 3, \ldots, n-1. \qquad (5.2.8)$$
We now have n − 2 equations in the n unknown $c_i$. These equations hold for any cubic spline with any endpoint conditions. But equation 5.2.2 has not been used with index i = 1. Hence, we still have to incorporate
$$b_1 = b_2 + 2c_2 h_2 + 3d_2 h_2^2 \qquad (5.2.9)$$
into the solution. It remains to replace $b_1$, $b_2$, and $d_2$ by expressions in $c_i$.

To begin, equations 5.2.7 and 5.2.6 with i = 2 give
$$b_2 = \frac{y_1 - y_2}{h_2} - \frac{(c_1 + 2c_2)h_2}{3} \qquad\qquad d_2 = \frac{c_1 - c_2}{3h_2}.$$
Making the substitutions for $b_2$ and $d_2$, equation 5.2.9 becomes
$$\begin{aligned}
b_1 &= \frac{y_1 - y_2}{h_2} - \frac{(c_1 + 2c_2)h_2}{3} + 2c_2 h_2 + (c_1 - c_2)h_2 \\
&= \frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2 c_1 + \frac{1}{3}h_2 c_2. \qquad (5.2.10)
\end{aligned}$$
We have not used the endpoint conditions yet, so this equation is good for any cubic spline. Whatever endpoint conditions are given must result in an expression for $b_1$ in terms of $c_i$ plus one other equation in the $c_i$.
In the case of the free spline, endpoint condition 5.2.5 gives $c_n = 0$. This is the first of the final two equations. Endpoint condition 5.2.4 gives $d_1 = -\frac{c_1}{3h_1}$. This relationship is not directly useful since we are looking for an expression for $b_1$. However, equation 5.2.1 with i = 1 gives $b_1 = \frac{y_0 - y_1}{h_1} - c_1 h_1 - d_1 h_1^2$ so we can use it to find
$$b_1 = \frac{y_0 - y_1}{h_1} - \frac{2}{3}c_1 h_1.$$
Finally, substituting into equation 5.2.10, the final equation in $c_i$ is $\frac{y_0 - y_1}{h_1} - \frac{2}{3}c_1 h_1 = \frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2 c_1 + \frac{1}{3}h_2 c_2$, which simplifies to
$$2(h_1 + h_2)c_1 + h_2 c_2 = 3\left(\frac{y_0 - y_1}{h_1} - \frac{y_1 - y_2}{h_2}\right). \qquad (5.2.11)$$
Equations 5.2.8, 5.2.11, and $c_n = 0$ are n equations which can be solved for the n coefficients $c_i$. Back-substitution will give the values of the $b_i$ and $d_i$.
Other endpoint conditions lead to a different pair of final equations, but the process is the same. We need to substitute an expression for $b_1$ into 5.2.10 and come up with one other equation.

Natural spline Octave code


Computing a spline for three or four points can be done by hand with a bit of patience and attention to detail, but with many more points the algebra becomes too tedious. However, each of the equations in the $c_i$ has no more than three of the $c_i$ at a time, and they appear in a regular pattern, at least for n − 2 of the equations. These characteristics make automating the solution reasonably straightforward. The following code is perhaps not the most efficient for finding a natural spline, but it is presented this way for two reasons. First, it is meant to emulate the algebraic solution outlined in the previous section closely, making it clearer to follow. Second, it is meant to be general enough that modifying it for other endpoint conditions would take minimal effort. Such modification will be requested in the exercises.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 3 June 2014           %
% Purpose: Calculation of a natural cubic       %
%          spline.                              %
% INPUT: points (x(1),y(1)), (x(2),y(2)), ...   %
%        spline must interpolate.               %
% OUTPUT: coefficients of each piece of the     %
%         piecewise cubic spline:               %
%         S(i,x) = a(i)                         %
%                + b(i)*(x-x(i+1))              %
%                + c(i)*(x-x(i+1))^2            %
%                + d(i)*(x-x(i+1))^3            %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [a,b,c,d] = naturalCubicSpline(x,y)
  n=length(x)-1;
  for i=1:n
    h(i)=x(i)-x(i+1);
  end%for
  % Left endpoint condition:
  %   m(1,1)*c(1) + m(1,2)*c(2) = m(1,n+1)
  m(1,1)=2*(h(1)+h(2)); m(1,2)=h(2);
  m(1,n+1)=3*((y(1)-y(2))/h(1)-(y(2)-y(3))/h(2));
  % Right endpoint condition:
  %   m(n,n-1)*c(n-1) + m(n,n)*c(n) = m(n,n+1)
  m(n,n-1)=0; m(n,n)=1; m(n,n+1)=0;
  % Conditions for all splines:
  for i=2:n-1
    m(i,i-1)=h(i);
    m(i,i)=2*(h(i)+h(i+1));
    m(i,i+1)=h(i+1);
    m(i,n+1)=3*((y(i)-y(i+1))/h(i)-(y(i+1)-y(i+2))/h(i+1));
  end%for
  % Solve for c(i)
  l(1)=m(1,1); u(1)=m(1,2)/l(1); z(1)=m(1,n+1)/l(1);
  for i=2:n-1
    l(i)=m(i,i)-m(i,i-1)*u(i-1);
    u(i)=m(i,i+1)/l(i);
    z(i)=(m(i,n+1)-m(i,i-1)*z(i-1))/l(i);
  end%for
  l(n)=m(n,n)-m(n,n-1)*u(n-1);
  c(n)=(m(n,n+1)-m(n,n-1)*z(n-1))/l(n);
  for i=n-1:-1:1
    c(i)=z(i)-u(i)*c(i+1);
  end%for
  % Compute a(i), b(i), d(i)
  % Endpoint conditions:
  b(1)=(y(1)-y(2))/h(1)-2*c(1)*h(1)/3;
  d(1)=-c(1)/(3*h(1));
  % Conditions for all splines:
  a(1)=y(2);
  for i=2:n
    d(i)=(c(i-1)-c(i))/(3*h(i));
    b(i)=(y(i)-y(i+1))/h(i)-(c(i-1)+2*c(i))*h(i)/3;
    a(i)=y(i+1);
  end%for
end%function

naturalCubicSpline.m may be downloaded at the companion website.
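A quick way to see the coefficients in action is to evaluate a piece directly. The following snippet is our own (the three sample points are arbitrary); it uses the form S(i,x) = a(i) + b(i)(x − x(i+1)) + c(i)(x − x(i+1))² + d(i)(x − x(i+1))³ documented in the code above.

% Sample usage (arbitrary data): the natural cubic spline through three points.
x = [0 1 3]; y = [0 1 2];
[a,b,c,d] = naturalCubicSpline(x,y);
% Evaluate piece 1 (valid on [x(1),x(2)]) at t = 0.5:
t = 0.5; i = 1;
S = a(i) + b(i)*(t-x(i+1)) + c(i)*(t-x(i+1))^2 + d(i)*(t-x(i+1))^3
% Evaluating piece 1 at t = x(2) returns a(1) = y(2), confirming interpolation.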



An application of natural cubic splines?


“For many important applications, this mathematical [cubic spline] model of the draftsman’s spline is highly real-
istic.”1 Claims such as this rely on the assumptions that a draftsman’s spline is aptly modeled by a thin beam and
that beam deflections are small. But the shapes modeled by splines often include large deflections, and unless the
draftsman's spline is damaged in some way, its shape will be an infinitely differentiable curve. Cubic splines generally lack continuity in their third derivative, hence, do not have higher order derivatives. Moreover, the endpoint conditions $S_1''(x_0) = S_n''(x_n) = 0$ do not translate well to the physical situation. These conditions imply the shape of the spline has zero curvature (concavity) at the endpoints while nothing about the physical situation points to that conclusion.
Despite the cubic spline’s ineffective use as a model for a draftsman’s spline, it can be used with great efficacy
in design applications. At Boeing, the airplane manufacturer, for example, they are used in computer-aided graphic
design, computer-aided manufacturing, engineering analysis and simulation, and as a key component in Boeing’s
Automated Flight Manual system. By 2005, it was estimated that Boeing’s use of splines involved about 500 million
spline evaluations every day!2

Exercises 9. Find the spline described in question


1. What problem with polynomial interpolation does cu- [S]
(a) 2
bic spline interpolation address? (b) 3
2. Write down the system of equations that would need (c) 5 [A]
to be solved in order to find the cubic spline through
(0, −9), (1, −13), and (2, −29) with free boundary con- (d) 6
[S]
ditions. Do not attempt to solve the system. [S] (e) 7
[A]
3. Set up but do not solve the equations which could be (f) 8
solved to find the free cubic spline through the points
(1, 1), (2, 3), and (4, 2). 10. Use the Octave code presented in this section to
check your answer to question
4. List three reasons that might make you use a cubic
spline rather than a Lagrange polynomial to model a (a) 9a [S]

certain graph.
(b) 9b
5. Write down a system of equations that could be solved (c) 9c [A]
in order to find the free cubic spline through the fol-
lowing data points. Do not solve the system. 11. Modify the Octave code presented in this section
x f (x) so that it computes the coefficients for a clamped cubic
0.1 −0.62 spline. [S]
0.2 −0.28 12. Use your code from question 11 to check your an-
0.3 0.0066 swer to question
0.4 0.24
(a) 9d
6. Write down the system of equations that would need [S]
to be solved in order to find the cubic spline through (b) 9e
[A]
(0, −9), (1, −13), and (2, −29) with clamped boundary (c) 9f
conditions S 0 (0) = 1 and S 0 (2) = −1. Do not attempt
to solve the system. 13. Modify the Octave code presented in this section so
that it computes the coefficients for a cubic spline with
7. Set up but do not solve the equations which could be
mixed endpoint conditions 3c (page 187).
solved to find the clamped cubic spline through the
points (1, 1), (2, 3), and (4, 2) with S 0 (1) = S 0 (4) = 0. 14. Use your code from question 13 to find the cu-
[S]
bic spline through (0, −9), (1, −13), and (2, −29) with
8. Write down a system of equations that could be solved mixed boundary conditions S 0 (0) = 1 and S 00 (2) = 0.
in order to find the clamped cubic spline through the
following data points with S 0 (0.1) = 0.5 and S 0 (0.4) = 15. Use your code from question 13 to find the cubic
0.1. Do not solve the system. spline through the points (1, 1), (2, 3), and (4, 2) with
S 0 (1) = S 00 (4) = 0.
x f (x) 16. Suppose n + 1 points are given (n > 1). How many
0.1 −0.62 endpoint conditions are needed to fit the points with a
0.2 −0.28
0.3 0.0066 (a) quadratic spline with first derivative matching at
0.4 0.24 each joint?
1 Ahlberg and Nilson, The Theory of Splines and their Applications, Elsevier, 1967.
2 SIAM News, volume 38, number 4, May 2005.

(b) cubic spline with first and second derivative any such endpoint conditions must be specified at x3
matching at each joint? and not x0 .
(c) quartic spline with first, second, and third deriva- 18. Let f (x) = sin x and x0 = 0, x1 = π/4, x2 = π/2,
tive matching at each joint? x3 = 3π/4, and x4 = π.
(d) a degree k spline (k > 1) with derivative matching (a) Find the cubic (clamped) spline through
up to degree k − 1 at each joint? (x0 , f (x0 )), (x1 , f (x1 )), . . . , (x4 , f (x4 )) with
S 0 (0) = f 0 (0) and S 0 (π) = f 0 (π).
17. Suppose a spline S is to be fit to the four points (xi , yi ),
(b) Approximate f (π/3) by computing S(π/3).
i = 0, 1, 2, 3 where x0 < x1 < x2 < x3 . Further sup-
pose S is to be linear on [x0 , x1 ], quadratic on [x1 , x2 ], (c) Approximate f (7π/8) by computing S(7π/8).
and cubic on [x2 , x3 ]. Finally suppose S is to have one (d) Calculate the absolute errors in the approxima-
continuous derivative. How many endpoint conditions tions.
are needed to specify the spline uniquely? Argue that
Chapter 6
Ordinary Differential Equations

The gate and key to the sciences is mathematics.


–Roger Bacon (Opus Majus)

If I were again beginning my studies, I would follow the advice of


Plato and start with mathematics.
–Galileo Galilei

6.1 The Motion of a Pendulum


A brief history
Christiaan Huygens (1629-1695) is credited with inventing the pendulum clock in 1656, and Galileo Galilei (1564-
1642) is credited with the first scientific study of the properties of pendula.[25, 33] In a famous letter to Guidobaldo
del Monte in 1602, Galileo asserts that the period of a swinging pendulum (the time it takes to swing one way and
back) is independent of the amplitude of the swing (how far it swings left and right). Del Monte famously argued
that the physical evidence did not support the claim.[20] And he was right—it does not, and Galileo’s claim is
actually false. The period of a pendulum varies with the amplitude of its swing (all else equal).
Historians are generally willing to forgive Galileo for this error, though, likely due, in part, to the fact that the
period of a pendulum is nearly constant for small amplitudes, and in part, to the fact that Galileo was the main
figure in the scientific revolution (the birth of modern science) in the 17th century. His results regarding pendular
motion account for only a small part of his total contribution to the sciences. The way he utilized idealized
mathematical models of the physical world to inform his claims and experiments, a method of scientific study that
directly contrasted with the generally held wisdom of his day, forms the basis for the scientific revolution, and as
such was at least as important to science as any of his individual scientific discoveries. As for the pendulum, he
put in motion the investigations which would one day (some years after his death) lead to a method of determining
longitude at sea, an accomplishment that would change the world! With the ability to calculate their longitude,
sailors were able to sail the seas, discover new places, and map the globe. Perhaps the biggest impact was the
European colonization of foreign lands.
The thought of a pendulum today most likely brings to mind the grandfather clock. While arguably less
important than its contribution to science and navigation, the timekeeping accuracy that pendulum clocks brought
to the world had a substantial impact on broad society. With accurate timekeeping, time-based labor, transit and
trade schedules, announced starting times for religious or other meetings, and every other clock-based phenomenon
we take for granted today became possible. In the 17th century, these things were novel. To put into some
perspective just how important the clock, and therefore the pendulum became to society, consider Mumford’s
claim: “the clock, not the steam-engine, is the key-machine of the modern industrial age.”[24]


Figure 6.1.1: Free body diagram for a pendulum.

Crumpet 32: The Pendulum Clock

Galileo never implemented the pendulum as a timekeeping mechanism. It was around 15 years after Galileo’s
death that the pendulum clock became a reality. Even though his first pendulum clock (1656) was more accurate
than any other clock at the time, Huygens strived to improve upon its design. During his quest, he built a clock
with a modified pendulum and published the classic work, Horologium Oscillatorium, where mathematical details
of the isochronism of the cycloid were laid out for the first time, in 1673.[33, 21]
Today, we take for granted that the cycloid is the path a falling object must follow in order for its travel to a
given point to happen in the same time regardless of its starting position. And we also take for granted that the
period of a simple pendulum varies with its amplitude. We have over 400 years of physical and mathematical
hindsight that tell us so!

The equation of motion


Hopefully having justified an interest in the pendulum, let us turn to a modern derivation of the motion of a
pendulum by appealing to the free body diagram, a mechanical engineering mainstay. In a free body diagram, a
body, in this case the bob of a pendulum, is isolated from everything except the forces acting on it. Those forces are
indicated by vectors, and Newton’s second law of motion (the acceleration of an object is directly proportional to the
magnitude of the net force applied to the object, in the same direction as the net force, and inversely proportional
to the mass of the object, or F = ma) is applied. Figure 6.1.1 shows the three forces acting on a pendulum—the
force of gravity; the tension in the rod or string holding the bob to the pivot; and a third force called drag, which
is due to air resistance—along with the directions normal (N ~ ) and tangential (T~ ) to the path of the pendulum.
Technically only the bob and the three forces are part of the free body diagram. Nothing else is part of the free
body diagram, but is added in dashed lines to help describe the motion. The length of the pendulum is taken to
be ℓ, and we will apply Newton's second law in the direction tangent to the motion. That is, in the direction T⃗. The speed of the bob is the product of the length of the pendulum and the angular speed, ℓθ̇. The acceleration of the bob, the derivative of speed, is $\frac{d}{dt}(\ell\dot\theta) = \ell\ddot\theta$. Therefore, the ma (mass times acceleration) term of Newton's second law for the motion of a pendulum is ℓθ̈ times m, that is, mℓθ̈.
Gravity causes a constant downward force on the bob with magnitude equal to the weight of the bob, mg. The
magnitude of this force in the T~ direction, however, is mg sin θ. It is worth taking a moment to make sure we have
the correct sign. For values of θ between 0 and π, the bob is to the right of the pivot, so the force of gravity tends
to accelerate the bob in the clockwise (negative with respect to θ) direction. Since mg sin θ is positive for values
of θ between 0 and π, the force due to gravity is actually −mg sin θ. For values of θ between −π and 0, the bob
is to the left of the pivot, so the force of gravity tends to accelerate the bob in the counterclockwise (positive with
respect to θ) direction. Since mg sin θ is negative for values of θ between −π and 0, the force due to gravity is again
−mg sin θ. Similar analysis for any other angle will lead to the same conclusion.

The damping or drag force (air resistance) is taken as a force proportional to the speed of the bob, ℓθ̇, so has magnitude cℓθ̇. Damping forces are always taken to directly oppose the motion, so the magnitude of damping in the direction of T⃗ is its entirety. It only remains to choose the right sign. Since θ̇ indicates the direction of motion, the damping force must have the opposite sign. The damping constant c is taken to be positive, and of course ℓ is positive, so the damping force must be −cℓθ̇.
The tension acting on the bob is irrelevant because it is always perpendicular to the motion. The component of
tension in the tangential direction is always zero.
Substituting the sum of these tangential forces for F, Newton's second law applied to the pendulum becomes $-mg\sin\theta - c\ell\dot\theta - 0 = m\ell\ddot\theta$, or
$$\ddot\theta + \frac{c}{m}\dot\theta + \frac{g}{\ell}\sin\theta = 0. \qquad (6.1.1)$$
Equation 6.1.1 is known as a differential equation because it is an equation that involves derivatives (or differentials).
To be more precise, it is a second degree ordinary differential equation (o.d.e.). Second degree because the highest
degree derivative is the second and ordinary because it involves only one independent variable (time t).
The simplest differential equations are considered in calculus, though the term “differential equation” is rarely
used. When first discussing the idea of antidifferentiation, the question of “What function has a derivative equal to
... ?” inevitably comes up. For example, one might be faced with the question of what function’s derivative equals
x? This question can also be asked, what function y satisfies the (differential) equation y 0 = x? The answer can be
arrived at by integrating the equation:
$$\int y'\,dx = \int x\,dx$$
$$y = \frac{1}{2}x^2 + C$$
(don’t forget the constant of integration!).

Forces in a free body diagram


The derivation of the equation of motion for the pendulum touches on three forces typically found in a free body
diagram: gravity, drag, and tension. There are several other forces that may creep into a free body diagram. Most
typical is the normal force a surface applies to a body lying upon it. In summary, here are the forces that should
be considered when constructing a free body diagram.
Gravity: always acts directly downward with magnitude equal to the weight of the body, mg.
Drag: always acts directly opposite the direction of motion with a magnitude approximated in different ways
depending on the application. This force is perhaps the most complicated to account for. It depends on
the geometry of the body, the speed of the body, and the viscosity of the fluid relative to which the body
moves. For slowly moving objects in low viscosity fluids, such as pendula in air, drag (air resistance) is taken
proportional to the speed of the object. For faster moving objects in low viscosity fluids, drag is often taken
proportional to the square of the speed of the object. In reality, drag is not exactly proportional to any
power of speed, but rather varies in a very complicated way as the body moves through the fluid. For sake of
tractability, though, it is almost always modeled as proportional to an appropriate power of speed. For our
purposes, that power will simply be given.
Tension/compression: tension is transmitted through a rope, wire, chain, or other similar object by pulling on
its ends (in opposite directions). The magnitude of the tension is constant within the object assuming, as
we often do, that the rope, wire, or chain is massless. Tension is always directed along the rope, wire, or
chain. The opposite of tension is compression. Rigid objects such as rods, dowels, or poles are capable of
transmitting compressive forces by pushing on their ends. Ropes, wires, chains, and other objects that simply
slacken when pushed are not capable of transmitting compression.
Spring: a spring exerts a force proportional to the deflection of the spring, in the direction opposite the deflection.
Normal: when a body lies atop a solid surface and the body is not floating away from the surface nor sinking into
the surface, there must be a balance between the forces perpendicular (normal) to the surface. The force that
the surface applies to a body to keep it from sinking into the surface is called the normal force and always
acts normal (perpendicular) to and away from the surface. The magnitude of the normal force is always equal
to the net magnitude of all other forces in the normal direction. Often the normal component of gravity is
the only other force acting normal to the surface.

Friction: when a body lies in contact with a surface, friction opposes motion with a magnitude proportional to the
normal force. The constant of proportionality is called the coefficient of friction and is denoted by µ. For any
body/surface combination, there are two types of friction to consider—static friction and kinetic friction. A
body at rest on a surface is capable of resisting a greater force than is the same body sliding across the same
surface (with the same normal force). You may be familiar with this phenomenon if you’ve ever tried to slide
an oven into or out of its usual position in a kitchen. It’s much harder to get it started moving than it is to
keep it moving. Whether the friction is static or kinetic, it always resists motion tangential to the surface.

Applied: a force that is applied to a body by another body, such as a person pushing a sofa or an engine accelerating
a vehicle.

Crumpet 33: Anti-lock braking systems

The anti-lock braking system (ABS) of an automobile is designed to take advantage of the fact that the static
friction between a tire and the road can stop a car more quickly than the kinetic friction between the same tire
and the same road. A tire that is not skidding is capable of applying a greater braking (frictional) force than the
same tire skidding. When the ABS senses that a wheel has locked (ceased rotation) while the car is still moving,
it forces the driver to let up on the brake enough so the wheel will start spinning again, though very briefly. If
the driver continues to hold down the brake hard enough to skid, the ABS will force the driver to let up again.
The ABS rapidly alternates between forcing the driver to let up and allowing the driver to do as (s)he will. The
quick alternation between making the driver let up and allowing the driver to brake hard is what causes the
vibration or pulsing you feel when the ABS kicks in. If the ABS is working properly, a vehicle will come to a
halt more quickly than it would have if it were allowed to skid to a stop. Also, it’s much easier to steer a car
when it is not skidding than when it is skidding!

Solutions of ordinary differential equations


The solution of a differential equation is, in one way, very much like the solution of an algebraic equation but, in
another way, entirely different. For an algebraic equation in x, for example, we say that we have a solution x = s if
substituting s for x in the equation makes the equation true. Likewise, for a differential equation in θ, for example,
we say that we have a solution θ = s if substituting s for θ in the equation makes the equation true. The difference
is s is a number in the case of an algebraic equation while s is a function in the case of a differential equation. We
would say that x = 2 is a solution of the algebraic equation 3x² − 8x + 4 = 0 since substituting 2 for x gives
$$3(2)^2 - 8(2) + 4 = 0,$$
a true statement. Analogously, we would say that θ = e^{2t} is a solution of the differential equation 3θ̈ − 8θ̇ + 4θ = 0 since substituting e^{2t} for θ gives
$$3(4e^{2t}) - 8(2e^{2t}) + 4(e^{2t}) = 0,$$
again a true statement. Notice that the derivatives θ̇ and θ̈ need to be calculated in order to complete the substitution.
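A quick numeric spot check in Octave, a sketch of our own using anonymous functions, confirms the substitution:

% Spot check that theta = e^(2t) satisfies 3*theta'' - 8*theta' + 4*theta = 0.
theta   = @(t) exp(2*t);
dtheta  = @(t) 2*exp(2*t);     % first derivative of theta
ddtheta = @(t) 4*exp(2*t);     % second derivative of theta
t = 0:0.5:2;
residual = 3*ddtheta(t) - 8*dtheta(t) + 4*theta(t)   % zero (up to rounding)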
Approximate solutions of differential equations, then, must be approximations of functions. In fact, for any
given ode, we settle for the crudest approximation, a set of points that, if our approximation is good, lie near the
graph of an exact solution. Hence the set {(0, 1), (.25, 1.5), (.5, 2.25), (.75, 3.375), (1, 5.0625)} might qualify as an
approximate solution of the equation 3θ̈ − 8θ̇ + 4θ = 0 for t ∈ [0, 1]. See figure 6.1.2. The approximation is good
for values of t near zero but not as good for values of t near 1.

Initial Value Problems


Figure 6.1.2: Approximate solution of 3θ̈ − 8θ̇ + 4θ = 0.

As with algebraic equations, differential equations may have more than one solution. We already saw that θ = e^{2t} is a solution of 3θ̈ − 8θ̇ + 4θ = 0. So are θ = 5e^{2t}, θ = −2.1e^{2t}, and θ = $\sqrt{7\pi}$e^{2t}. In fact, θ = ce^{2t} is a solution for any constant c. The ode 3θ̈ − 8θ̇ + 4θ = 0 has infinitely many solutions! It is a straightforward exercise to check. For θ = ce^{2t}, θ̇ = 2ce^{2t} and θ̈ = 4ce^{2t}, so
$$\begin{aligned}
3\ddot\theta - 8\dot\theta + 4\theta &= 3(4ce^{2t}) - 8(2ce^{2t}) + 4(ce^{2t}) \\
&= 12c(e^{2t}) - 16c(e^{2t}) + 4c(e^{2t}) \\
&= (12c - 16c + 4c)e^{2t} \\
&= 0.
\end{aligned}$$
Even more, θ = ae^{2t/3} is a solution for any constant a. This solution can be verified just as the solution θ = ce^{2t} was verified. Can you do it? Answer on page 199. Finally, θ = ce^{2t} + ae^{2t/3} is also a solution for any pair of constants c and a! Can you show it? Answer on page 200. It is not uncommon for a differential equation to have infinitely many solutions.
Another differential equation with infinitely many solutions is
$$\dot{y} = \frac{t}{y}.$$
The solutions are $y = \sqrt{t^2 + c}$ and $y = -\sqrt{t^2 + a}$, valid for any constants c and a as long as y ≠ 0. Complex solutions are valid! However, if we also require y(0) = 1, there is only one solution! $y = -\sqrt{t^2 + c}$ is no longer a solution because it gives negative values of y for all values of t. And $y = \sqrt{t^2 + c}$ is only a solution if c = 1. The one and only solution is $y = \sqrt{t^2 + 1}$.
The requirement y(0) = 1 is called an initial value, or initial condition, and the pair of equations
$$\dot{y} = \frac{t}{y}, \qquad y(0) = 1$$
is called an initial value problem. More generally, the pair of equations
$$\dot{y} = f(y, t), \qquad y(t_0) = y_0$$
forms what is known as a first order initial value problem.

Crumpet 34: There is exactly one solution of ẏ = t/y such that y(0) = 1.

Setting $y = \sqrt{t^2 + 1}$, $\dot{y} = \frac{1}{2}(t^2+1)^{-1/2}(2t) = \frac{t}{\sqrt{t^2+1}}$. Hence the equation ẏ = t/y becomes
$$\frac{t}{\sqrt{t^2+1}} = \frac{t}{\sqrt{t^2+1}},$$
an undeniably true statement. Hence $y = \sqrt{t^2+1}$ is a solution of ẏ = t/y. Moreover $y(0) = \sqrt{0^2+1} = 1$, so the particular solution $y = \sqrt{t^2+1}$ satisfies the requirement that y(0) = 1 also. Hence $y = \sqrt{t^2+1}$ is one solution, and the only solution of the form $y = \sqrt{t^2+c}$ or $y = -\sqrt{t^2+a}$. But is it the only solution of any form? Perhaps there are other functions that satisfy the differential equation. A little bit of calculus should help settle the issue. The demonstration hinges on showing that $y = \sqrt{t^2+c}$ and $y = -\sqrt{t^2+a}$ are the only solutions of ẏ = t/y. The following sequence of equations shows it. Each line implies the next.
$$\begin{aligned}
\frac{dy}{dt} &= \frac{t}{y}, \quad y \neq 0 \\
y\,dy &= t\,dt, \quad y \neq 0 \\
\int y\,dy &= C + \int t\,dt, \quad y \neq 0 \\
\tfrac{1}{2}y^2 &= C + \tfrac{1}{2}t^2, \quad y \neq 0 \\
y^2 &= 2C + t^2, \quad y \neq 0 \\
y &= \pm\sqrt{t^2 + 2C}, \quad y \neq 0.
\end{aligned}$$
Replacing the constant 2C with c or a does not change the fact that the term is an arbitrary constant, so $y = \sqrt{t^2+c}$ and $y = -\sqrt{t^2+a}$ are the only solutions of ẏ = t/y. This method of solving the differential equation is called separation of variables.

Key Concepts
Approximate solution of a differential equation: a set of points that, ideally, lie near the graph of an exact
solution.
Degree of a differential equation: equal to the highest order derivative appearing in the equation.
Differential equation: an equation with derivatives (or differentials) in it.

Free body diagram: An engineering diagram consisting of only a body and the forces acting on it.
Initial value problem: a differential equation coupled with a required value of the solution.
Newton’s second law of motion: the acceleration of an object is directly proportional to the magnitude of the
net force applied to the object, in the same direction as the net force, and inversely proportional to the mass
of the object—often summarized by the equation F = ma. This equation assumes the mass of the object is
constant.
Ordinary differential equation (o.d.e.): a differential equation with only one independent variable.
Solution of a differential equation: a function that, when substituted for the dependent variable, makes the
equation a true statement.

[A]
Exercises (a) y(t) = et ; ẏ = y
1. State the degree of the differential equation. (b) y(x) = x3 − 26.83x − sin x; y 00 = 6x + sin x
√ 
3 [A]
(a) ẏ = y [A] (c) s(t) = e−t/2 sin 2
t ; s̈ + ṡ + s = 0

(b) y 00 = 6x + sin x x3
(d) f (x) = 4
+ x4 , x > 0; f 0 + f
x
= x2 [S]

[A]
(c) s̈ + ṡ + s = 0 (e) h(x) = −2x; (2h + x)h + h = 4x 0

(d) f +
0 f
x
=x 2 [S]
(f) r(t) = t, t > 0; r̈ṙt2 = − 18 [A]
(e) (2h + x)h0 + h = 4x 3. Verify that the function is a solution of the initial value
(f) r̈ṙt2 = − 18 [A] problem.
[A]
2. Verify that the function is a solution of the differential (a) y(t) = 4et ; ẏ = y, y(0) = 4
equation. (b) y(x) = x − sin x − π ; y = 3x2 − cos x, y(π) = 0
3 3 0

 
[A]
(b) A block sliding down an inclined plane.
2
1 [A]
(c) s(t) = 2
1 + e−t ; ṡ = (1 − 2s)t, s(0) = 1
x3 (c) A block sitting on an inclined plane (not moving).
(d) f (x) = 4
+ 16
x
, x > 0; f 0 = − fx + x2 , f (4) = 20 [S]
[S]

(d) A block being pushed up an inclined plane.


(e) h(x) = −2x − 1; h0 = 1+4x−h
2h+x+1
, h(0) = −1
√ 2 (e) A sofa being pushed across a level floor where the
(f) r(t) = t − 3, t > 0; r̈ṙt = − 81 , r(9) = 0,
applied force is parallel to the floor. [A]
ṙ(9) = 16 . [A] HINT: The solution must satisfy
the o.d.e. and both conditions, r(9) = 0 and (f) A sofa being pushed across a level floor where the
ṙ(9) = 61 . applied force is not parallel to the floor. [S]

4. Solve the differential equation. (g) A sofa being pushed up an old, slanted hardwood
floor. The applied force may or may not be par-
(a) y 0 = 5x4 [A]
allel to the floor. [A]
2
(b) y 0 = 3xex (h) A sledder has reached the bottom of a hill (and is
(c) ẏ = t − sin t [S] now traveling on level snow) and is coasting to a
[A] stop. [A]
(d) ẏ = 1t , t < 0
[A]
(e) s0 = 1 − ln x (i) A sledder sledding down a hill.
[A] [A]
(f) ṡ = 3tet (j) A hockey puck sliding across an ice rink.

5. Given are an initial value problem, its exact solution, (k) A hockey puck sliding across ice at constant speed
and an approximate solution. Comment on how well (ignoring friction).
the approximate solution approximates the exact solu- (l) A sky diver falling. [A]
tion.
[S]
(m) A sky diver whose parachute just opened.
(a) ẏ = y, y(0) = 4; y(t) = 4e ; t

{(0, 4), (.25, 5), (.5, 6.3), (.75, 7.8), (1, 9.8)} [A] (n) A sky diver whose parachute just opened while a
constant breeze is blowing sideways. [A]
(b) y 0 = 3x2 − cos x, y(π) = 0; y(x) = x3 − sin x − π 3 ;
{(π, 0), ( 54 π, 30), ( 32 π, 74), ( 47 π, 135), (2π, 216)} (o) A football originally kicked at a 40 degree angle
 2
 just as it reaches its peak, ignoring drag. [A]
1
(c) ṡ = (1 − 2s)t, s(0) = 1; s(t) = 2
1 + e−t ;
(p) A football moving up and to the right approach-
[A]
{(0, 1), (.5, 1), (1, .75), (1.5, .5), (2, .5)} ing its peak, ignoring drag.
3
(d) f 0 = − fx + x2 , f (4) = 20; f (x) = x4 + 16
;
x
[S] 7. Use the free body diagram from question 6 to find the
{(4, 20), (4.25, 23), (4.5, 26), (4.75, 30), (5, 34)}
equation of motion in the tangential direction for (6a)-
(e) h0 = 1+4x−h
2h+x+1
, h(0) = −1; h(x) = −2x − 1; (6k), and in the vertical direction for (6l)-(6p). [S][A]
{(0, −1), (.25, −1.5), (.5, −2), (.75, −2.5), (1, −3)}
√ 8. How much easier is it to slide a sofa by pushing paral-
(f) r̈ṙt2 = − 81 , r(9) = 0, ṙ(9) = − 61 ; r(t) = t − 3; lel to the floor as opposed to slightly toward the floor?
{(9, 0), (10, .16), (11, .31), (12, .46), (13, .61)} [A] Compare the kinetic friction for a sofa being pushed
6. Draw a free body diagram for the situation. parallel to the floor to one being pushed at an angle of
20 degrees from parallel. Then calculate the necessary
(a) Pendular motion ignoring air resistance (no applied force to overcome kinetic friction in each case.
damping). [A] Assume the floor is level. [A]

Answers

θ = ae^{2t/3} is a solution of 3θ̈ − 8θ̇ + 4θ = 0: $\dot\theta = \frac{2}{3}ae^{2t/3}$ and $\ddot\theta = \frac{4}{9}ae^{2t/3}$ so
$$\begin{aligned}
3\ddot\theta - 8\dot\theta + 4\theta &= 3\left(\frac{4}{9}ae^{2t/3}\right) - 8\left(\frac{2}{3}ae^{2t/3}\right) + 4\left(ae^{2t/3}\right) \\
&= \frac{4}{3}a(e^{2t/3}) - \frac{16}{3}a(e^{2t/3}) + \frac{12}{3}a(e^{2t/3}) \\
&= \left(\frac{4}{3} - \frac{16}{3} + \frac{12}{3}\right)a\,e^{2t/3} \\
&= 0.
\end{aligned}$$

θ = ce^{2t} + ae^{2t/3} is a solution of 3θ̈ − 8θ̇ + 4θ = 0: $\dot\theta = 2ce^{2t} + \frac{2}{3}ae^{2t/3}$ and $\ddot\theta = 4ce^{2t} + \frac{4}{9}ae^{2t/3}$ so
$$\begin{aligned}
3\ddot\theta - 8\dot\theta + 4\theta &= 3\left(4ce^{2t} + \frac{4}{9}ae^{2t/3}\right) - 8\left(2ce^{2t} + \frac{2}{3}ae^{2t/3}\right) + 4\left(ce^{2t} + ae^{2t/3}\right) \\
&= 12c(e^{2t}) + \frac{4}{3}a(e^{2t/3}) - 16c(e^{2t}) - \frac{16}{3}a(e^{2t/3}) + 4c(e^{2t}) + \frac{12}{3}a(e^{2t/3}) \\
&= (12c - 16c + 4c)e^{2t} + \left(\frac{4}{3} - \frac{16}{3} + \frac{12}{3}\right)a\,e^{2t/3} \\
&= 0.
\end{aligned}$$

Figure 6.2.1: Beginning a numerical solution with the initial condition

6.2 Taylor Methods


The exact solution of the initial value problem
$$\begin{aligned}
\dot{y} &= -\frac{y}{t} + t^2 \\
y(4) &= 20
\end{aligned} \qquad (6.2.1)$$
is $y(t) = \frac{t^3}{4} + \frac{16}{t}$, t > 0, as verified in exercise 3d on page 199. For the time being, let us try to forget that we know
the exact solution, and study a method for approximating it. We will recall that we have the exact solution when
we are ready to check how the approximation is going. The initial condition, y(4) = 20, means that the graph of
the exact solution passes through (4, 20). What a great place to start an approximate solution—at a point that is
on the graph of the exact solution! Thus the approximation is seeded by the initial condition. There are numerous
ways to proceed from there. Perhaps the simplest way is to use the differential equation to compute the exact slope
(derivative) of y at (4, 20):
$$\dot{y}(4) = -\frac{y(4)}{4} + 4^2 = -\frac{20}{4} + 4^2 = 11.$$
You might imagine a graph like that in figure 6.2.1. The graph is that of the first order Taylor polynomial expanded about t0 = 4. According to Taylor's theorem, $y(t) = 20 + 11(t-4) + \frac{\ddot{y}(\xi)}{2}(t-4)^2$ for t near 4 and some ξ, depending on t. So, y(2) ≈ T1(2) = 20 + 11(2 − 4) = −2 and y(5) ≈ T1(5) = 20 + 11(5 − 4) = 31 (as long as y has two derivatives on an open interval containing [2, 5]), and so on. As always, there is the concern of how good these approximations are.
In section 4.4, two different approximations for the same number were used to estimate error in the adaptive
methods. A similar tack may be used here. We will compare approximations given by T1 and T2 . The differential
equation can be used to compute ÿ, in terms of y and t. Implicitly differentiating the differential equation gives
$$\ddot{y} = -\frac{\dot{y}t - y}{t^2} + 2t.$$
But $\dot{y} = -\frac{y}{t} + t^2$, so we may substitute into and simplify the expression for ÿ:
$$\begin{aligned}
\ddot{y} &= -\frac{(-\frac{y}{t} + t^2)t - y}{t^2} + 2t \\
&= -\frac{-y + t^3 - y}{t^2} + 2t \\
&= \frac{2y}{t^2} - \frac{t^3}{t^2} + 2t \\
&= \frac{2y}{t^2} + t.
\end{aligned}$$

Table 6.1: Comparing first and second order polynomial approximations


t T1 (t) T2 (t)
2 −2 11
5 31 34.25

Figure 6.2.2: A repetitive numerical calculation (truncated to 5 decimal places)

t0 y(t0 )
4 20
3.75 17.25
3.5 14.88437
3.25 12.88504
3 11.23557
2.75 9.92187
2.5 8.93323
2.25 8.26406
2 7.91666

Now we know $\ddot{y}(4) = \frac{2y(4)}{4^2} + 4 = \frac{2(20)}{16} + 4 = \frac{13}{2}$, so $T_2(t) = 20 + 11(t-4) + \frac{13}{4}(t-4)^2$. Finally, we can compare values of T1 to corresponding values of T2, as in Table 6.1. T1(2) and T2(2) disagree wildly, so we should assume that neither approximation is to be trusted. T1(5) and T2(5) differ by only around 10%, so these approximations may be reasonable. To further hone the approximation of y(2), it is possible to calculate T3(2) and again compare. Can you do it? Answer on page 206.
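The entries of Table 6.1 are easy to reproduce in Octave. Here is a short sketch of our own, using the values ẏ(4) = 11 and ÿ(4) = 13/2 just computed:

% Reproduce Table 6.1: first and second order Taylor polynomials
% for the solution of 6.2.1, expanded about t0 = 4.
T1 = @(t) 20 + 11*(t-4);
T2 = @(t) 20 + 11*(t-4) + (13/4)*(t-4).^2;
[T1(2) T2(2); T1(5) T2(5)]     % rows correspond to t = 2 and t = 5

The output, −2 and 11 in the first row and 31 and 34.25 in the second, matches the table.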
Another way to approximate y(2) is to take things a little more slowly. We could use the initial condition to
approximate y(3.75) first. Then we could use this approximation to approximate y(3.5), which we could, in turn,
use to approximate y(3.25), and so on until we ultimately use the approximation of y(2.25) to approximate y(2).
We humans may think the prospect of doing all these calculations is repugnant, but with a little Octave code, the
burden is placed on the machine. It is the ability to understand the process well enough to write that Octave code
that now becomes the focus.
We know that y(4) = 20 and we are interested in approximating y(3.75). Since the difference between 4 and 3.75
is only .25, perhaps using T1 will be sufficiently accurate. From before, we know the Taylor polynomial expanded
about t0 = 4 is T1 (t) = 20 + 11(t − 4), so T1 (3.75) = 20 + 11(−.25) = 17.25. Now we can use y(3.75) = 17.25 as a
“new” initial condition. ẏ(3.75) = −17.25/3.75 + 3.75² = 9.4625. We can use this information to approximate the Taylor
polynomial for y expanded about 3.75: T1 (t) ≈ 17.25 + 9.4625(t − 3.75), and use this expansion to approximate
y(3.5): y(3.5) ≈ T1 (3.5) ≈ 17.25 + 9.4625(3.5 − 3.75) = 14.884375. We then can use y(3.5) = 14.884375 as an initial
condition, approximating the Taylor polynomial for y expanded about 3.5. Continuing in this vein leads to the
tabular and graphical results in Figure 6.2.2. Can you reproduce these results? Details on page 206.
The method of repeated calculation leads to y(2) ≈ 7.91, but more importantly, illuminates an algorithm
for approximating solutions of differential equations. Calling the initial condition (t0 , y0 ), and succeeding points
(t1 , y1 ),(t2 , y2 ),(t3 , y3 ) . . ., the same procedure is used to calculate (t1 , y1 ) from (t0 , y0 ) as is used to calculate (t2 , y2 )
from (t1 , y1 ) as is used to calculate (t3 , y3 ) from (t2 , y2 ), and so on. It remains to capture that procedure as a
formula of some sort. To summarize, the procedure is to use a given point, call it (ti , yi ) to
1. calculate ẏ(ti , yi );
2. use the three values ti , yi , and ẏ(ti , yi ) to form T1 (t) expanded about ti ; and finally
3. set yi+1 = T1 (ti+1 ), which gives a new point, (ti+1 , yi+1 ).
But T1 (ti+1 ) = yi + ẏ(ti , yi ) · (ti+1 − ti ), so the procedure really boils down to setting

yi+1 = yi + ẏ(ti , yi ) · (ti+1 − ti ). (6.2.2)

The method of using formula (6.2.2) repeatedly to compute a sequence of points approximately on the solution of
an ordinary differential equation is most often called Euler’s method.[7] It may also be referred to as the Taylor

method of degree 1 since it uses Taylor polynomials of degree 1 at each step. The value ti+1 − ti is called the step
size and is often held constant, so you are likely to see Euler’s method written as

yi+1 = yi + h · ẏ(ti , yi ) (6.2.3)

where h = ti+1 − ti is the constant step size.

Euler’s Method (pseudo-code)


As is most common, Euler’s method will be coded for a constant step size.

Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1 .
Input: Differential equation ẏ = f (t, y); initial condition y(t0 ) = y0 ; numbers t0 and t1 ; number of steps N .
Step 1: Set t = t0 ; y = y0 ; h = (t1 − t0 )/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set y = y + hf (t, y)
Step 4: Set t = t0 + j(t1 − t0)/N
Output: Approximation y of the solution at t = t1 .
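
The pseudo-code translates nearly line for line into Octave. The sketch below is one possible implementation; the function name eulerODE and its argument order are choices made here, not notation from the text, and exercise 7 at the end of this section asks you to write your own version.

% A minimal sketch of Euler's method following the pseudo-code above.
% f is a function handle for ydot = f(t,y); the solver takes N constant
% steps from t0 to t1 starting from y(t0) = y0.
function y = eulerODE(f, t0, t1, y0, N)
  h = (t1-t0)/N;              % Step 1: constant step size
  t = t0; y = y0;
  for j = 1:N                 % Step 2
    y = y + h*f(t,y);         % Step 3: follow the tangent line
    t = t0 + j*(t1-t0)/N;     % Step 4
  end%for
end%function

For the running example, eulerODE(@(t,y) -y/t+t^2, 4, 2, 20, 8) uses h = −0.25 and reproduces the value y(2) ≈ 7.91666 of Figure 6.2.2.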

Higher Degree Taylor Methods


Taylor methods of higher degree are rarely used in practice because they require computation of derivatives, a task
that is not always easy or even possible. Nonetheless, it is not a huge stretch from what we have already done
to consider higher degree methods. Modifying the steps outlined in the enumeration that leads to 6.2.2 (computing third-degree information in step 1 and using T3 in place of T1 in steps 2 and 3), the third degree Taylor method can be summarized by
1. calculate ẏ(ti, yi), ÿ(ti, yi), and y⃛(ti, yi);
2. use the five values ti, yi, ẏ(ti, yi), ÿ(ti, yi), and y⃛(ti, yi) to form T3(t) expanded about ti; and finally
3. set yi+1 = T3(ti+1), which gives a new point, (ti+1, yi+1).
Higher degree Taylor methods require higher derivatives in step 1 and a higher degree Taylor polynomial in steps
2 and 3. As should be expected, higher degree methods are generally more accurate than lower degree methods as
long as the formula for ẏ(t, y) is sufficiently smooth. To illustrate the point, we now compare approximate solutions
of 6.2.1.

Taylor’s Method of Degree 3 (pseudo-code)


Taylor’s method of degree 3 will be coded for a constant step size.

Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1 .
Input: Differential equation ẏ = f(t, y); formulas ÿ(t, y) and y⃛(t, y); initial condition y(t0) = y0; numbers t0 and t1; number of steps N.
Step 1: Set t = t0; y = y0; h = (t1 − t0)/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set y = y + hf(t, y) + (1/2)h²ÿ(t, y) + (1/6)h³y⃛(t, y)
Step 4: Set t = t0 + j(t1 − t0)/N
Output: Approximation y of the solution at t = t1 .
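
For a concrete (if inflexible) illustration, the following sketch hard-wires Taylor's method of degree 3 to o.d.e. 6.2.1, using ẏ = −y/t + t² and ÿ = 2y/t² + t derived above, and y⃛ = −6y/t³ + 3 derived in the answer on page 206. It is not the general-purpose function requested in exercise 9.

% Taylor's method of degree 3 applied to o.d.e. 6.2.1 (a sketch).
t = 4; y = 20;            % initial condition y(4) = 20
t1 = 2; N = 8;            % approximate y(2) in 8 steps, so h = -0.25
h = (t1-t)/N;
for j = 1:N
  yd   = -y/t + t^2;      % ydot from the differential equation
  ydd  = 2*y/t^2 + t;     % yddot as derived in this section
  yddd = -6*y/t^3 + 3;    % third derivative (page 206)
  y = y + h*yd + h^2/2*ydd + h^3/6*yddd;
  t = t + h;
end%for
disp(y)                   % about 9.9963, the h = 0.25 entry of Table 6.2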

Table 6.2: Approximate values of y(2) from solving 6.2.1


                           h = 0.5    error      h = 0.25   error      h = 0.125  error
 Euler’s method            6.1        3.9        7.91666    2.08333    8.91911    1.08088
 Taylor’s degree 3 method  9.975765   0.024234   9.996280   0.003719   9.999485   0.000514

Using Octave code based on the pseudo-code presented in this section, Table 6.2 summarizes the approximate
solution of 6.2.1 using Euler’s method and Taylor’s method of degree 3 to approximate y(2).
Now is a good time to say something about the error of Taylor methods. Remember a Taylor polynomial of
degree n has an error of order n + 1, so Euler’s method uses a Taylor polynomial with error of order 2 and Taylor’s
degree 3 method uses a Taylor polynomial with error of order 4. But how does that translate into an error term
for the Taylor method?
Though we will not answer this question completely here, we can get some idea what to expect from Table 6.2.
From the Euler’s method row, we see the error decrease from (roughly) 3.9 to 2.08 to 1.08 as the step size is reduced
by a factor of one half. Since
    2.08/3.9 ≈ 1.08/2.08 ≈ (1/2)¹,
we conclude that Euler’s method is of first order. Considering the row on Taylor’s degree 3 method, we see the
error decrease from about .024 to .0037 to .00051 as the step size is reduced by a factor of one half. Since
    .0037/.024 ≈ .00051/.0037 ≈ 1/8 = (1/2)³,
we conclude that Taylor’s degree 3 method is of order 3.
Notice the similarity between this observation and the observation we made about composite integration. In
section 4.4, we argued that the error term for a composite integration formula had order one less than that of a
single application of the underlying integration formula. The same thing happens here. When the truncation error
for the underlying Taylor polynomial has order n, the corresponding o.d.e. solver has order n − 1, an order equal
to the degree of the Taylor polynomial itself.

Reducing a second order equation to a first order system


Taylor’s methods and the upcoming Runge-Kutta methods are all designed to work on first order differential
equations. However, all the equations of motion we have developed are second order differential equations. To
resolve this disconnect, a second order o.d.e. can be reduced to a first order system. The idea is straightforward.
Suppose y is the dependent variable in a second order o.d.e. and we have an equation of the form y 00 = f (y 0 , y, x).
We introduce an auxiliary variable u and set u = y 0 . Consequently, u0 = y 00 = f (y 0 , y, x) = f (u, y, x). We thus have
the first order system

u0 = f (u, y, x)
y0 = u

which can be solved using a numerical method for first order differential equations.
For example, the equation of a pendulum (6.1.1) can be rearranged as θ̈ = −(c/m)θ̇ − (g/ℓ) sin θ. If we substitute the
auxiliary variable u = θ̇ into the equation, it becomes u̇ = −(c/m)u − (g/ℓ) sin θ, and the system

    u̇ = −(c/m)u − (g/ℓ) sin θ
    θ̇ = u

is equivalent to (6.1.1). Euler’s method, for example, can be applied to this system in the following way:
    un+1 = un + h(−(c/m)un − (g/ℓ) sin θn)
    θn+1 = θn + hun
    tn+1 = tn + h

where u0 , θ0 , and t0 are taken from the initial conditions.
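
A sketch of this computation in Octave follows. The parameter values and initial conditions here are illustrative placeholders only, not values taken from the text.

% Euler's method for the damped pendulum written as a first order system.
g = 9.81; l = 0.31;       % assumed gravitational constant and length
c = 0.5; m = 1;           % assumed damping constant and mass
theta = pi/3; u = 0;      % assumed initial angle and angular velocity
t = 0; h = 0.01; N = 500;
for n = 1:N
  unew     = u + h*(-c/m*u - g/l*sin(theta));
  thetanew = theta + h*u;         % uses the old value of u, as in the text
  u = unew; theta = thetanew; t = t + h;
end%for
printf("theta(%g) is approximately %g\n", t, theta)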



Key Concepts
Taylor method: A method for approximating the solution of a first order o.d.e. in which a Taylor polynomial of
some predetermined order is used at each step to compute the next.
Euler’s method: Another name for the first order Taylor method, having formula yi+1 = yi + h · ẏ(ti , yi ).

Exercises

1. Use Euler’s method with step size h = 0.5 to approximate y(2).
   (a) [S] dy/dx = 3x − 2y, y(1) = 1
   (b) dy/dx = 3x³ − y, y(1) = 3
   (c) [A] ẏ = ty, y(1) = 0.5
   (d) [S] cos(x)y′ + sin(x)y = 2 cos³(x) sin(x) − 1, y(1) = 0
   (e) 7ẏ + 3y = 5, y(1) = 2

2. Repeat exercise 1 using Taylor’s method of order 2. [S][A]

3. Repeat exercise 1 using Taylor’s method of order 3. [S][A]

4. Execute two steps of Euler’s method for solving ẏ = ty with y(1) = −0.5 and h = 0.25, thus approximating y(1.5). [A]

5. Write pseudo-code for Taylor’s method of order 2. [A]

6. Write pseudo-code for Taylor’s method of order 4.

7. Write an Octave function that implements Euler’s method. [S]

8. Write an Octave function that implements Taylor’s method of degree 2. [A]

9. Write an Octave function that implements Taylor’s method of degree 3.

10. Write an Octave function that implements Taylor’s method of degree 4.

11. Use your code from exercise 8 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e^{−2})/4 to verify that Taylor’s method of degree 2 is an order 2 numerical method. [A]

12. Use your code from exercise 9 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e^{−2})/4 to verify that Taylor’s method of degree 3 is an order 3 numerical method.

13. Use your code from exercise 10 to calculate y(2) for the o.d.e. in 1a using h = 0.5, 0.25, 0.125, and 0.0625. Use your calculations and the fact that the exact value of y(2) is (9 + e^{−2})/4 to verify that Taylor’s method of degree 4 is an order 4 numerical method.

14. Write the equation of motion you derived in exercise 7 on page 199 as a first order system. [S][A]

15. Given the following parameter values and initial conditions for the referenced system, use Euler’s method with a step size h = 0.25 to compute s(0.5) or θ(0.5) as appropriate.
    14a: g = 9.81 m/s²; ℓ = .31 m; θ(0) = π/3; θ̇(0) = 0 [A]
    14b: g = 32.2 ft/s²; µ = .21; α = .25 rad; s(0) = 0; ṡ(0) = .3 ft/s [A]
    14c: g = 32.2 ft/s²; µ = .21; α = .25 rad; s(0) = 0; ṡ(0) = 0 [S]
    14d: g = 32.2 ft/s²; µ = .21; α = .25 rad; m = .19 lbm; Fapplied = 15 lb; s(0) = 0; ṡ(0) = 1 ft/s
    14e: g = 9.81 m/s²; µ = .15; m = 35 kg; Fapplied = 75 N; s(0) = 0; ṡ(0) = .03 m/s [A]
    14f: g = 9.81 m/s²; µ = .15; β = π/10 rad; m = 35 kg; Fapplied = 75 N; s(0) = 0; ṡ(0) = .03 m/s [S]
    14g: g = 9.81 m/s²; µ = .15; α = .05 rad; β = π/10 rad; m = 35 kg; Fapplied = 90 N; s(0) = 0; ṡ(0) = .03 m/s [A]
    14h: g = 32.2 ft/s²; µ = .01; s(0) = 0; ṡ(0) = 30 ft/s [A]
    14i: g = 32.2 ft/s²; µ = .01; α = π/6 rad; s(0) = 0; ṡ(0) = 10 ft/s [A]
    14j: g = 32.2 ft/s²; µ = .003; s(0) = 0; ṡ(0) = 88 ft/s [A]
    14k: g = 32.2 ft/s²; µ = 0; s(0) = 0; ṡ(0) = 88 ft/s
    14l: g = 9.81 m/s²; c = 4.5; m = 70 kg; s(0) = 10000; ṡ(0) = −10 m/s [A]
    14m: g = 9.81 m/s²; c = 26; m = 70 kg; s(0) = 2000; ṡ(0) = −55 m/s [S]

16. Find a formula for the angle at which a stationary block on an inclined plane (whose angle of inclination is increasing) will start moving.

17. Find a formula for the angle at which a block moving down an inclined plane (whose angle of inclination is decreasing) will stop moving.

18. Undetermined Coefficients. For each differential equation, a solution with undetermined coefficients is suggested. Find values for the coefficients that make the suggested solution an actual solution.
    (a) [S] y″ + 5y′ − 8y = 3x²; y(x) = Ax² + Bx + C
    (b) [S] 2y‴ − 5y″ + 3y′ + 5y = x + 1; y(x) = Ax + B
    (c) [A] 3y′ + 2y = 3x + 2; y(x) = Ax + B
    (d) [A] y″ − 14y′ + 7y = 2x² + 3x − 1; y(x) = Ax² + Bx + C
    (e) [A] 2ẏ + y = t⁴ + 1; y(t) = A + Bt + Ct² + Dt³ + Et⁴
    (f) ẍ + 2ẋ − x = 1 + te^t; x(t) = Ate^t + Be^t + C
    (g) [A] θ̇ − θ = e^{−t} sin t; θ(t) = Ae^{−t} sin t + Be^{−t} cos t
    (h) θ̈ + (1/10)θ̇ + θ = t cos t; θ(t) = At cos t + Bt sin t + C cos t + D sin t
    (i) [A] ẍ − 2ẋ − 35x = e^{7t} + 1; x(t) = Ate^{7t} + Be^{7t} + C

Answers
...
T3(2): Begin by calculating y⃛ = (d/dt)ÿ.

    y⃛ = (d/dt)(2y/t² + t)
       = (2ẏt² − 4ty)/t⁴ + 1
       = (2(−y/t + t²)t² − 4ty)/t⁴ + 1
       = (−2ty + 2t⁴ − 4ty)/t⁴ + 1
       = −6y/t³ + 3

so y⃛(4) = −6(20)/4³ + 3 = 3 − 120/64 = 9/8. Therefore, T3(t) = 20 + 11(t − 4) + (13/4)(t − 4)² + (3/16)(t − 4)³, and T3(2) = 9.5
so it is close to T2 (2) = 11. We can start to believe that y(2) is somewhere around 9.5 or 11.

Details:

t0 y(t0 ) ẏ(t0 ) T1 expanded about t0 T1 (t0 − .25)


4 20 11 20 + 11(t − 4) 17.25
3.75 17.25 9.4625 17.25 + 9.4625(t − 3.75) 14.88437
3.5 14.88437 7.99732 14.88437 + 7.99732(t − 3.5) 12.88504
3.25 12.88504 6.59787 12.88504 + 6.59787(t − 3.25) 11.23557
3 11.23557 5.25480 11.23557 + 5.25480(t − 3) 9.92187
2.75 9.92187 3.95454 9.92187 + 3.95454(t − 2.75) 8.93323
2.5 8.93323 2.67670 8.93323 + 2.67670(t − 2.5) 8.26406
2.25 8.26406 1.38958 8.26406 + 1.38958(t − 2.25) 7.91666
2 7.91666

6.3 Foundations for Runge-Kutta Methods


In section 6.2, derivatives were used to generate approximate solutions of ordinary differential equations. However,
approximate solutions can also be generated by integrating, a much more stable numerical process. An o.d.e. of
the form

ẏ = f (t, y)
y(t0 ) = y0

has an exact solution that can be written in terms of an integral. For any value t̃, and assuming existence of a
solution over the interval from t0 to t̃, we can find a value for y(t̃) by integrating both sides of ẏ = f (t, y) with
respect to t:
    ∫_{t0}^{t̃} ẏ dt = ∫_{t0}^{t̃} f(t, y) dt
    y(t̃) − y(t0) = ∫_{t0}^{t̃} f(t, y) dt
    y(t̃) = y(t0) + ∫_{t0}^{t̃} f(t, y) dt.                        (6.3.1)

When t0 and t̃ are not close to one another, which is what we normally assume, we need to proceed in small steps
as done in section 6.2.
Substituting t1 for t̃ in equation 6.3.1, y(t1) = y(t0) + ∫_{t0}^{t1} f(t, y) dt, so we can add ∫_{t0}^{t1} f(t, y) dt to the known
value y(t0) to get y(t1), our first small step on the way to approximating y(t̃). Now substituting t1 for t0 and t2
for t̃ in equation 6.3.1, y(t2) = y(t1) + ∫_{t1}^{t2} f(t, y) dt. So, we can compute y(t2) from knowledge of y(t1). Similarly
we can compute y(t3) from knowledge of y(t2), y(t4) from knowledge of y(t3), and so on, eventually computing
y(tn) = y(t̃). With this in mind, we rewrite the integral representation in terms of ti and ti+1 instead of t0 and t̃:

    y(ti+1) = y(ti) + ∫_{ti}^{ti+1} f(t, y) dt.                    (6.3.2)

This formula suggests that finding one approximation, y(ti+1 ), from the previous, y(ti ), boils down to approximating
´ ti+1
ti
f (t, y) dt. That should not be too challenging at this point. About half of chapter 4 is dedicated to exactly
this task! Every numerical integration formula is a candidate for use here, but let’s start simple. We know y(ti ),
the value of the function at the left endpoint of integration, at least approximately, so it makes sense to use a stencil
that includes the left endpoint of integration as one of the nodes. And to make our first stab as easy as possible,
let’s let that node be the only one! That is, let’s find an integration formula for the stencil

Using the method of undetermined coefficients, we calculate the left hand side of system 4.2.4 (which for us will
only be one equation since we only have one node):
    ∫_a^b p0(x)dx = ∫_{x0}^{x0+h} p0(x)dx = ∫_{x0}^{x0+h} 1 dx = (x − x0)|_{x0}^{x0+h} = h

and the right hand side:

    Σ_{i=0}^{0} (θi h)⁰ ai = a0.

So a0 = h and we get the formula

    ∫_{x0}^{x0+h} f(x)dx ≈ hf(x0).
Consequently, ∫_{ti}^{ti+1} f(t, y) dt ≈ (ti+1 − ti)f(ti, y(ti)), and equation 6.3.2 becomes

y(ti+1 ) = y(ti ) + f (ti , y(ti )) · (ti+1 − ti ).



Adopting the notation yi = y(ti ) and f = ẏ from section 6.2, this formula becomes

yi+1 = yi + ẏ(ti , yi ) · (ti+1 − ti ).

Wait a minute! We’ve seen this before. This is exactly equation 6.2.2.
The search for new methods of approximating solutions of o.d.e.s by integrating has not yielded anything new
yet. It has to be different, however. Integration formulas include evaluation of the integrand at various points
while Taylor methods involve evaluation of derivatives at a single point. Let’s push on. Perhaps the next simplest
integration formula that includes the left endpoint of integration is the trapezoidal rule (see section 4.3),
    ∫_{x0}^{x0+h} f(x)dx = (h/2)[f(x0) + f(x0 + h)] + O(h³f″(ξh))

over the stencil

Translating the trapezoidal rule to the current notation,


    ∫_{ti}^{ti+1} f(t, y) dt = ((ti+1 − ti)/2)[f(ti, yi) + f(ti+1, yi+1)] + O((ti+1 − ti)³).

Therefore our new approximation formula is

    yi+1 = yi + ((ti+1 − ti)/2)[f(ti, yi) + f(ti+1, yi+1)].
This equation is great except the right hand side includes yi+1 , the quantity we are trying to approximate! One
theory is to leave it at that. The equation for yi+1 is implicit in nature and that’s alright. Some root finding
method could be used to determine yi+1 for each step of the method. While this path is not impossible, it is also
not the simplest solution. Since the step size (ti+1 − ti ) is likely to be small, perhaps using Euler’s method to
approximate yi+1 on the right side will not cause irreparable harm to the overall approximation. Giving it a shot,
we let yi+1 = yi + (ti+1 − ti ) · f (ti , yi ) on the right hand side to get the new formula

    yi+1 = yi + ((ti+1 − ti)/2)[f(ti, yi) + f(ti+1, yi + (ti+1 − ti)·f(ti, yi))].
Pausing for a moment to consider what we have, we might conclude the formula is getting a little unwieldy. Let’s
see if we can tidy it up a bit. First, substituting h for ti+1 − ti makes it a little nicer:

    yi+1 = yi + (h/2)[f(ti, yi) + f(ti+1, yi + h·f(ti, yi))].

Second, letting k1 = f (ti , yi ) and k2 = f (ti+1 , yi + h · f (ti , yi )) = f (ti+1 , yi + h · k1 ), we get a nice, neat, three-step
computation:

    k1 = f(ti, yi)
    k2 = f(ti+1, yi + hk1)
    yi+1 = yi + (h/2)(k1 + k2).                                    (6.3.3)
But before getting too carried away with the clean formulation, it would be nice to have some evidence that this
“advanced” method gives a reasonable approximation of the solution to an o.d.e. as expected. Let’s have Octave
compute approximate solutions of o.d.e. 6.2.1 using both Euler’s method and this method based on the trapezoidal
rule, and compare them to the exact solution, y(t) = t4 + 16
3

t . The following code snippet, while specific to this one


task can be generalized to find approximate solutions of other o.d.e.s as well.

O.D.E. solver test code


t=4;
h=-1/4;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
euler=20;
trap=20;
disp(’ Euler Trapezoid Exact Euler err Trap err’)
disp(’ -------------------------------------------------------------’)
for i=1:8
euler=euler+h*f(t,euler);
k1=f(t,trap);
k2=f(t+h,trap+h*k1);
trap=trap+h/2*(k1+k2);
t=t+h;
x=exact(t);
sprintf(’%12.5g%12.5g%12.5g%12.5g%12.5g’,euler,trap,x,abs(euler-x),abs(trap-x))
end%for

This test code may be downloaded at the companion website (rungeKuttaDemo.m). The only part of this code that
may appear unfamiliar to you at this point is the sprintf() command. The first argument,

’%12.5g%12.5g%12.5g%12.5g%12.5g’,

is the formatting string. This particular string means to string together 5 floating point numbers using 12 spaces
each and displaying 5 significant digits. In the sprintf command, %12.5g means “general” formatting of a floating
point number with 12 spaces and 5 significant figures. The computer will decide whether to use scientific notation
in the output. Since it is repeated 5 times, this particular command will format five such floating point values.
The rest of the arguments are the five numbers to print. The command sprintf should not be read as “sprint-eff”
but rather “ess-print-eff” or “string print formatted”. The s is for string and the f is for formatted. If you’re
thinking this command seems a bit arcane, you’re right. This type of print formatting command originated in the
C programming language during the 1970s!1 The output of running this Octave code is

Euler Trapezoid Exact Euler err Trap err


-------------------------------------------------------------
ans = 17.25 17.442 17.45 0.20026 0.0080729
ans = 14.884 15.273 15.29 0.4058 0.016741
ans = 12.885 13.479 13.505 0.62006 0.026142
ans = 11.236 12.047 12.083 0.84776 0.036458
ans = 9.9219 10.969 11.017 1.0955 0.04794
ans = 8.9332 10.245 10.306 1.373 0.060938
ans = 8.2641 9.8828 9.9588 1.6947 0.075955
ans = 7.9167 9.9062 10 2.0833 0.09375

Our method based on the trapezoidal rule, which we will call trapezoidal-ode for now, seems to do a better job
of approximating the solution of this o.d.e. than does Euler’s method. The last two columns contain the absolute
errors for each approximation. The errors in trapeziodal-ode are roughly 0.01 to 0.1 while the errors for Euler’s
method are roughly 0.2 to 2. All of the errors in trapezoidal-ode are smaller than all the errors in Euler’s method.
Of course trapezoidal-ode requires two evaluations of f per step, so it better deliver better results for the extra
work if it is to be useful at all.
Buoyed by this success, perhaps it is worth investing some time in other integration formulas, like Simpson’s
rule, for example. Recall from section 4.3, Simpson’s rule states
    ∫_{x0}^{x0+2h} f(x)dx = (h/3)[f(x0) + 4f(x0 + h) + f(x0 + 2h)] + O(h⁵f⁽⁴⁾(ξh)),
1 See https://en.wikipedia.org/wiki/Printf_format_string for some details.

which in the notation of this section we might write as


    ∫_{ti}^{ti+1} f(t, y) dt = (h/6)[f(ti, yi) + 4f(ti+1/2, yi+1/2) + f(ti+1, yi+1)],

ignoring the error term, and using the notation ti+1/2 to mean ti + (1/2)h and yi+1/2 to mean y(ti + (1/2)h). So an o.d.e. solver based on Simpson’s rule might look like

    yi+1 = yi + (h/6)[f(ti, yi) + 4f(ti+1/2, yi+1/2) + f(ti+1, yi+1)].
Again, this is an implicit formula. Again, we can use Euler’s method to estimate yi+1 , and, in fact, we can use
Euler’s method to estimate yi+1/2 too! Since ti+1/2 is closer to ti than is ti+1 , we estimate yi+1/2 first. That is, we
replace yi+1/2 by yi + (h/2)f(ti, yi). Using a multiple-step calculation as before, that gives us

    k1 = f(ti, yi)
    k2 = f(ti + h/2, yi + (h/2)k1)
so far. This takes care of the first two terms in brackets. Now we estimate yi+1 by approximating f (ti+1 , yi+1 ).
But we now have an estimate of f at ti + h/2, and ti + h/2 is closer to ti+1 than is ti. So, even though we could use
yi + hf (ti , yi ) = yi + hk1 to approximate yi+1 (as done before), we might expect yi + hk2 to be a better estimate.
With this hope in hand, we complete the method by calculating as follows:

    k1 = f(ti, yi)
    k2 = f(ti + h/2, yi + (h/2)k1)
    k3 = f(ti+1, yi + hk2)
    yi+1 = yi + (h/6)[k1 + 4k2 + k3].
For now, we will refer to this method as Simpson’s-ode.
Before trying to assess whether this new method is better than the previous ones, let’s derive a couple more,
and compare them all together. The formula
    ∫_{x0}^{x0+3h} f(x)dx = (3h/2)[f(x0 + h) + f(x0 + 2h)] + O(h³f″(ξh))

(an open Newton-Cotes formula from section 4.3) leads to the method

    k1 = f(ti, yi)
    k2 = f(ti + h/3, yi + (h/3)k1)
    k3 = f(ti + 2h/3, yi + (2h/3)k2)
    yi+1 = yi + (h/2)[k2 + k3].
Can you fill in the steps to derive this method? Answer on page 213. We will call this method open-ode. Finally,
we use the stencil

to derive yet another integration formula. This is not an open Newton-Cotes formula nor is it a closed Newton-Cotes
formula. It is not one that was covered in section 4.3. Perhaps it might be called a “clopen” (half closed and half
open) Newton-Cotes formula. Can you derive the corresponding integration method? Details on page 214. The
result is

    ∫_{x0}^{x0+3h} f(x)dx ≈ (3h/4)[f(x0) + 3f(x0 + 2h)],

disregarding the error term. This leads to the o.d.e. solver

    k1 = f(ti, yi)
    k2 = f(ti + h/3, yi + (h/3)k1)
    k3 = f(ti + 2h/3, yi + (2h/3)k2)
    yi+1 = yi + (h/4)[k1 + 3k3].
We will call this method clopen-ode. Notice two things. First, even though k2 is not used in the final line, it is still
computed since it is used to compute k3 . Second, the calculations of k1 , k2 , and k3 are identical to those in the
open-ode method. The only difference is how the kj are combined. The integration methods combine the values of
the function at the nodes differently. This idea of using the same kj for different purposes will come up again!.
So now we have three new methods to test out—one based on Simpson’s rule (Simpson’s-ode), one based on an
open Newton-Cotes formula (open-ode), and a third based on a “clopen” Newton-Cotes formula (clopen-ode). Can
you write test code for comparing the three new formulas (similar to the code used to compare Euler’s method with
trapezoidal-ode)? Answer on page 215. Results are summarized in the following Octave output:

Simpsons Open Clopen Simp err Open err Clop err


------------------------------------------------------------
ans = 17.44806 17.44999 17.45022 0.00220 0.00028 0.00004
ans = 15.28557 15.28953 15.29008 0.00461 0.00065 0.00010
ans = 13.49781 13.50395 13.50494 0.00730 0.00116 0.00017
ans = 12.07297 12.08146 12.08307 0.01036 0.00187 0.00027
ans = 11.00347 11.01450 11.01700 0.01393 0.00290 0.00040
ans = 10.28804 10.30185 10.30566 0.01821 0.00440 0.00059
ans = 9.93523 9.95208 9.95789 0.02354 0.00669 0.00088
ans = 9.96952 9.98969 9.99866 0.03048 0.01031 0.00134

Simpson’s-ode does the poorest job of finding an approximate solution and clopen-ode does the best. But why?
We’ve done a pretty thorough job of sweeping error analysis under the rug up until now. The bulk of that
investigation will happen in the next section, but we can do a quick analysis here. From section 4.3, we know
that the trapezoidal rule and the open Newton-Cotes formula we used here both have error terms of O(h3 ), while
Simpson’s rule has error term O(h5 ). The integration methods based on the stencils

(which led to Euler’s method and the clopen method) have yet undetermined error terms. Can you show that
their error terms are O(h2 ) and O(h4 ), respectively? Answer on page 215. Based on the error terms of the
underlying integration methods, we should expect these o.d.e. solvers to be, in order from least accurate to most
accurate, Euler’s method (based on a O(h2 ) integration formula), open-ode (based on a O(h3 ) integration formula),
clopen-ode (based on a O(h4 ) integration formula), and Simpson’s-ode (based on a O(h5 ) integration formula); with
trapezoidal-ode to be on par with open-ode. Table 6.3 shows the errors in calculating y(2) for 6.2.1 for the five
methods of this section using various values of h. Since the value of h in each row is half that of the previous row,
we would expect the ratio of the errors in consecutive rows to be approximately (1/2)^ℓ where the rate of convergence
for the method is O(h^ℓ). For Euler’s method, dividing the error in row 3 by that of row 2, we get (1/2)^ℓ ≈ .55114/1.0809 ≈ 1/2,
and dividing the error in row 6 by that in row 5, we get (1/2)^ℓ ≈ .07013/.1399 ≈ 1/2, for example. This evidence suggests that
ℓ = 1 for Euler’s method, and therefore, Euler’s method has an O(h) convergence. Repeating the same calculation
for the other methods yields Table 6.4.
With the exception of Simpson’s-ode, Table 6.4 suggests that o.d.e. solvers have an error term of one less degree
than their underlying (single step) integration formula. In section 4.4 we noted that composite integration formulas
also have error terms of one less degree than their corresponding single-step integration formulas (and we made a

Table 6.3: A comparison of absolute errors for five o.d.e. solvers


 h        Euler’s    Trap-ode      Open-ode       Clopen-ode     Simpson’s-ode
 −1/4     2.0833     0.09375       0.010311       0.0013444      0.030482
 −1/8     1.0809     0.023437      0.0025929      0.00017446     0.0077168
 −1/16    0.55114    0.0058594     0.00064977     2.2207(10)⁻⁵   0.0019412
 −1/32    0.27837    0.0014648     0.00016261     2.8008(10)⁻⁶   0.00048679
 −1/64    0.1399     0.00036621    4.0672(10)⁻⁵   3.5166(10)⁻⁷   0.00012188
 −1/128   0.07013    9.1553(10)⁻⁵  1.017(10)⁻⁵    4.4055(10)⁻⁸   3.0494(10)⁻⁵

Table 6.4: The error terms of five o.d.e. solvers and their underlying integration methods
                      Euler’s   Trap-ode  Open-ode  Clopen-ode  Simpson’s-ode
 Integration method   O(h²)     O(h³)     O(h³)     O(h⁴)       O(h⁵)
 O.D.E. solver        O(h)      O(h²)     O(h²)     O(h³)       O(h²)

similar observation about Taylor methods in section 6.2). There is reason to believe in this parallel as the methods
proposed in this section are essentially composite integration techniques. So, it should be a little troubling that
Simpson’s-ode does not fit the pattern. A deeper exploration of the error term is needed to explain this anomaly.
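
The entries in the second row of Table 6.4 can be estimated directly from Table 6.3. Since halving h should scale the error by about (1/2)^ℓ, the base-2 logarithm of the ratio of consecutive errors estimates ℓ. The following quick check is not from the text; the vector err below is the Euler column of Table 6.3.

% Estimate the observed order of convergence from successive errors,
% assuming the step size is halved from one row to the next.
err = [2.0833 1.0809 0.55114 0.27837 0.1399 0.07013];
order = log2(err(1:end-1) ./ err(2:end));
disp(order)    % each entry should be close to 1 for a first order method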

Exercises
1. Derive an o.d.e. solver based on the stencil and corresponding integration formula.

(a) [S]
    (h/4)[f(x0) + 3f(x0 + (2/3)h)] + O(h⁴)

(b) [A]
    hf(x0 + (1/2)h) + O(h³)

(c) [A]
    (h/2)[3f(x0 + (1/3)h) − f(x0)] + O(h³)

(d)
    hf(x0 + (1/3)h) + O(h²)

(e) [S]
    (h/4)[3f(x0 + (1/3)h) + f(x0 + h)] + O(h⁴)

(f)
    hf(x0 + (2/3)h) + O(h²)

(g) [A]
    (h/2)[3f(x0 + (1/3)h) − 4f(x0 + (1/2)h) + 3f(x0 + (2/3)h)] + O(h⁵)

(h)
    (h/4)[3f(x0 + (1/3)h) + f(x0 + h)] + O(h⁴)

(i)
    (h/2)[f(x0 + ((√3 − 1)/(2√3))h) + f(x0 + ((√3 + 1)/(2√3))h)] + O(h⁵)

(j) [A]
    (h/18)[5f(x0 + ((√5 − √3)/(2√5))h) + 8f(x0 + (1/2)h) + 5f(x0 + ((√5 + √3)/(2√5))h)] + O(h⁷)

2. Conduct a numerical experiment on test o.d.e. 6.2.1 to determine the rate of convergence of the method derived in
question 1. Based on the error term of the integration formula, is the rate of convergence of the o.d.e. solver as
expected?
[A]
3. Write an Octave function that implements Euler’s method.

4. Write an Octave function that implements trapezoidal-ode.

5. Write an Octave function that implements clopen-ode.

6. Write an Octave function that implements the solver you derived in exercise 1b. This is called the midpoint method
or the modified Euler method. It is based on the midpoint rule for integration. [A]

7. Write an Octave function that implements the solver you derived in exercise 1a. This is called Ralston’s method.
[A]

8. Use your code from exercise 3 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]

9. Use your code from exercise 4 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]

10. Use your code from exercise 5 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]

11. Use your code from exercise 6 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]

12. Use your code from exercise 7 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05.
[S][A]

Answers
Filling in the gaps: Beginning with the integration formula
    ∫_{x0}^{x0+3h} f(x)dx = (3h/2)[f(x0 + h) + f(x0 + 2h)] + O(h³f″(ξh)),

we “shrink” the interval of integration to [x0, x0 + s] by making the substitution s = 3h:

    ∫_{x0}^{x0+s} f(x)dx = (s/2)[f(x0 + (1/3)s) + f(x0 + (2/3)s)] + O(s³f″(ξs)).

With the integration formula rephrased in terms of step size s, the o.d.e. solving method is

    yi+1 = yi + (h/2)[f(ti+1/3, yi+1/3) + f(ti+2/3, yi+2/3)],

where we revert to using h for step size. We then use Euler’s method to estimate yi+1/3 and yi+2/3 , starting
with yi+1/3 . That is, we replace yi+1/3 by yi + h3 f (ti , yi ). Then we estimate yi+2/3 . Using a multiple-step
calculation as before, that gives us
    k1 = f(ti, yi)
    k2 = f(ti + h/3, yi + (h/3)k1),

taking care of the first term in brackets. It remains to estimate f(ti+2/3, yi+2/3). But we now have an estimate of f (the derivative of y) at ti + h/3, and ti + h/3 is closer to ti+2/3 than is ti. So, we approximate yi+2/3 by yi + (2/3)hk2:

    k1 = f(ti, yi)
    k2 = f(ti + h/3, yi + (h/3)k1)
    k3 = f(ti + 2h/3, yi + (2h/3)k2)
    yi+1 = yi + (h/2)[k2 + k3].

Clopen Newton-Cotes:
For this stencil, a = x0 , b = x0 + 3h, and θi = ih, i = 0, 1, 2. Therefore, we will have a system of three
equations in the three unknowns. First, the left-hand sides:

    ∫_a^b p0(x)dx = ∫_{x0}^{x0+3h} p0(x)dx = ∫_{x0}^{x0+3h} 1 dx = (x − x0)|_{x0}^{x0+3h} = 3h
    ∫_a^b p1(x)dx = ∫_{x0}^{x0+3h} p1(x)dx = ∫_{x0}^{x0+3h} (x − x0) dx = (1/2)(x − x0)²|_{x0}^{x0+3h} = (9/2)h²
    ∫_a^b p2(x)dx = ∫_{x0}^{x0+3h} p2(x)dx = ∫_{x0}^{x0+3h} (x − x0)² dx = (1/3)(x − x0)³|_{x0}^{x0+3h} = 9h³

Now putting them together with the right-hand sides (and swapping sides):
    Σ_{i=0}^{2} (θi h)⁰ ai = a0 + a1 + a2 = 3h
    Σ_{i=0}^{2} (θi h)¹ ai = ha1 + 2ha2 = (9/2)h²
    Σ_{i=0}^{2} (θi h)² ai = h²a1 + 4h²a2 = 9h³

This system is small enough to solve by hand (without the use of a computer algebra system):

      h²a1 + 4h²a2 = 9h³
    −(h²a1 + 2h²a2 = (9/2)h³)
          2h²a2 = (9/2)h³        ⇒  a2 = (9/4)h.

Substituting a2 = (9/4)h into ha1 + 2ha2 = (9/2)h², we can solve for a1:

    ha1 + 2h · (9/4)h = (9/2)h²
    ha1 + (9/2)h² = (9/2)h²
    ha1 = 0                      ⇒  a1 = 0.

Substituting a1 = 0 and a2 = (9/4)h into a0 + a1 + a2 = 3h, we can solve for a0:

    a0 + 0 + (9/4)h = 3h
    a0 = 3h − (9/4)h    ⇒  a0 = (3/4)h.

Therefore, Σ_{i=0}^{2} ai f(x0 + θi h) = (3/4)h · f(x0) + 0 · f(x0 + h) + (9/4)h · f(x0 + 2h) and the integration formula is

    ∫_{x0}^{x0+3h} f(x)dx ≈ (3h/4)[f(x0) + 3f(x0 + 2h)].

Test code: Comparing Simpson’s, open, and clopen methods:

t=4;
h=-1/4;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
simp=20;
open=20;
clop=20;
disp(’ Simpsons Open Clopen Simp err Open err Clop err’)
disp(’ ------------------------------------------------------------’)
for i=1:8
k1simp=f(t,simp);
k1open=f(t,open);
k1clop=f(t,clop);
k2simp=f(t+h/2,simp+h/2*k1simp);
k2open=f(t+h/3,open+h/3*k1open);
k2clop=f(t+h/3,clop+h/3*k1clop);
k3simp=f(t+h,simp+h*k2simp);
k3open=f(t+2*h/3,open+2*h/3*k2open);
k3clop=f(t+2*h/3,clop+2*h/3*k2clop);
simp=simp+h/6*(k1simp+4*k2simp+k3simp);
open=open+h/2*(k2open+k3open);
clop=clop+h/4*(k1clop+3*k3clop);
t=t+h;
x=exact(t);
sierr=abs(simp-x);
operr=abs(open-x);
clerr=abs(clop-x);
sprintf(’%12.5g%12.5g%12.5g%12.5g%12.5g%12.5g’,simp,open,clop,sierr,operr,clerr)
end%for

This test code may be downloaded at the companion website (rungeKuttaDemo2.m).


Error terms: The error term for
    ∫_{x0}^{x0+3h} f(x)dx ≈ (3h/4)[f(x0) + 3f(x0 + 2h)]

is derived in the section 4.3 solutions. See page 273. The error term for

    ∫_{x0}^{x0+h} f(x)dx ≈ hf(x0)

is derived similarly. We are given that the error is O(h2 ), so we can skip the discovery. Expanding f (x) in a
Taylor polynomial with error term,

f (x) = f (x0 ) + (x − x0 )f 0 (ξx ).



So
    ∫_{x0}^{x0+h} f(x)dx − hf(x0) = ∫_{x0}^{x0+h} (f(x0) + (x − x0)f′(ξx)) dx − hf(x0)
                                  = xf(x0)|_{x0}^{x0+h} + ∫_{x0}^{x0+h} (x − x0)f′(ξx) dx − hf(x0)
                                  = hf(x0) + ∫_{x0}^{x0+h} (x − x0)f′(ξx) dx − hf(x0)
                                  = ∫_{x0}^{x0+h} (x − x0)f′(ξx) dx.

By the weighted mean value theorem, there exists c ∈ (x0, x0 + h) such that ∫_{x0}^{x0+h} (x − x0)f′(ξx) dx = f′(c) ∫_{x0}^{x0+h} (x − x0) dx = (1/2)f′(c)h². Hence

    ∫_{x0}^{x0+h} f(x)dx − hf(x0) = (1/2)f′(c)h² = O(h²f′(ξh)),

where we have replaced c by ξh .



6.4 Error Analysis


Section 6.3 ended with the mysterious (and unsettling?) observation that Simpson’s-ode did not live up to expec-
tations. Based on other o.d.e. solvers, we would expect the rate of convergence of Simpson’s-ode to be O(h4 ) since
Simpson’s rule, on which Simpson’s-ode is based, has local truncation error O(h5 ).
The explanation is rooted in the fact that we are solving an o.d.e. of the form ẏ = f (t, y), in which the derivative
is a function of two variables, t and y. To understand the error analysis, heavy use of partial derivatives and the
chain rule are required. As ever, we consult Taylor’s theorem and write

    y(t0 + h) = y(t0) + hẏ(t0) + (1/2)h²ÿ(t0) + (1/6)h³y⃛(t0) + · · · .
Each derivative of y can be replaced by some function of f and its partial derivatives, starting with ẏ, which is
given by the o.d.e. we are trying to solve.

    ẏ = f(t, y)
    ÿ = (d/dt)ẏ = (d/dt)f(t, y) = ft(t, y) + fy(t, y)ẏ = ft(t, y) + fy(t, y) · f(t, y)
    ⋮

Eliminating the explicit use of arguments t and y,

    ẏ = f
    ÿ = ft + fy f
    y⃛ = ftt + fty f + (fyt + fyy f)f + fy(ft + fy f)
       = ftt + 2fty f + fyy f² + ft fy + fy² f
    ⋮
so y(t0 + h) = y(t0) + hẏ(t0) + (1/2)h²ÿ(t0) + (1/6)h³y⃛(t0) + · · · in terms of f is

    y(t0 + h) = y(t0) + hf + (1/2)h²(ft + fy f) + (1/6)h³(ftt + 2fty f + fyy f² + ft fy + fy² f) + · · · ,

and as an o.d.e. solver (replacing y(t0) by yi and y(t0 + h) by yi+1),

    yi+1 = yi + hf + (1/2)h²(ft + fy f) + (1/6)h³(ftt + 2fty f + fyy f² + ft fy + fy² f) + · · · .      (6.4.1)
Rewriting high degree Taylor polynomials in terms of f quickly becomes complicated. We will focus on analysis requiring only ẏ, ÿ, and y⃛.
The o.d.e. solvers of section 6.3 have the form

k1 = f (ti , yi )
k2 = f (ti + β2 h, yi + β2 hk1 )
k3 = f (ti + β3 h, yi + β3 hk2 )
    ⋮
ks = f (ti + βs h, yi + βs hks−1 )
yi+1 = yi + h [α1 k1 + α2 k2 + α3 k3 + · · · + αs ks ] . (6.4.2)

We did not actually see any o.d.e. solvers with s > 3 in section 6.3, but the process we followed would clearly
require it should there be more than three nodes in the underlying integration formula.
The difference between y(t0 + h) from (6.4.1) and yi+1 from (6.4.2) is the local truncation error of the o.d.e.
solver (the error in taking a single step). In order to write this truncation error in the form O(h` ), though, we need
to expand each kj in its Taylor polynomial. Taylor’s theorem in two variables is needed.

Theorem 8. Suppose f (t, y) and all its partial derivatives of order n + 1 and lower are continuous on the rectangle
D = {(t, y) : a ≤ t ≤ b, c ≤ y ≤ d}, and let (t0 , y0 ) ∈ D. Then for every (t, y) ∈ D, there exist ξ ∈ (a, b) and
µ ∈ (c, d) such that
    f(t, y) = f(t0, y0) + [(t − t0)·ft(t0, y0) + (y − y0)·fy(t0, y0)]
            + (1/2)[(t − t0)²ftt(t0, y0) + 2(t − t0)(y − y0)·fty(t0, y0) + (y − y0)²fyy(t0, y0)]
            + · · ·
            + (1/n!) Σ_{j=0}^{n} (n choose j) (t − t0)^{n−j}(y − y0)^{j} ∂ⁿf/(∂t^{n−j}∂y^{j}) (t0, y0)
            + (1/(n+1)!) Σ_{j=0}^{n+1} (n+1 choose j) (t − t0)^{n+1−j}(y − y0)^{j} ∂^{n+1}f/(∂t^{n+1−j}∂y^{j}) (ξ, µ).

As with Taylor’s theorem (of one variable), the first n + 1 terms form the Taylor polynomial and the last term is
the remainder term.
To illustrate, we let f(t, y) = −y/t + t² and compute its second Taylor polynomial with remainder term expanded
about (t0 , y0 ) = (1, 1). For this, we will need all partial derivatives of f up to and including order 3.
    ft = y/t² + 2t
    fy = −1/t
    ftt = −2y/t³ + 2
    fty = fyt = 1/t²
    fyy = 0
    fttt = 6y/t⁴
    ftty = ftyt = fytt = −2/t³
    ftyy = fyty = fyyt = 0
    fyyy = 0.
It follows that
    f(1, 1) = 0
    ft(1, 1) = 3
    fy(1, 1) = −1
    ftt(1, 1) = 0
    fty(1, 1) = 1
    fyy(1, 1) = 0
    fttt(ξ, µ) = 6µ/ξ⁴
    ftty(ξ, µ) = −2/ξ³
    ftyy(ξ, µ) = 0
    fyyy(ξ, µ) = 0.
Therefore, the second Taylor polynomial for f (t, y) is
    T2(t, y) = f(1, 1) + [(t − 1)·ft(1, 1) + (y − 1)·fy(1, 1)]
             + (1/2)[(t − 1)²ftt(1, 1) + 2(t − 1)(y − 1)·fty(1, 1) + (y − 1)²fyy(1, 1)]
             = 0 + 3(t − 1) − (y − 1) + 0(t − 1)² + (t − 1)(y − 1) + 0(y − 1)²
             = 3(t − 1) − (y − 1) + (t − 1)(y − 1)

with remainder term


    R2(t, y) = (1/6)[(t − 1)³fttt(ξ, µ) + 3(t − 1)²(y − 1)ftty(ξ, µ) + 3(t − 1)(y − 1)²ftyy(ξ, µ) + (y − 1)³fyyy(ξ, µ)]
             = (1/6)[(t − 1)³ · 6µ/ξ⁴ − 3(t − 1)²(y − 1) · 2/ξ³ + 3(t − 1)(y − 1)² · 0 + (y − 1)³ · 0]
             = (t − 1)³µ/ξ⁴ − (t − 1)²(y − 1)/ξ³.
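
As a quick numerical sanity check (computed here, not taken from the text), T2 should agree closely with f near (1, 1):

% Compare f(t,y) = -y/t + t^2 with its second Taylor polynomial about (1,1).
f  = @(t,y) -y./t + t.^2;
T2 = @(t,y) 3*(t-1) - (y-1) + (t-1).*(y-1);
printf("f  = %.6f\nT2 = %.6f\n", f(1.1,0.9), T2(1.1,0.9))
% The small difference is governed by the remainder term R2 above.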

More generally, suppose we are interested in Taylor polynomial expansions of expressions like f (ti + βj h, yi +
βj hkj−1 ), as we have in our o.d.e. solvers. Expanding about (ti , yi ), we let t0 = ti , y0 = yi , t = ti + βj h, and
y = yi + βj hkj−1 . Thus t − t0 = βj h and y − y0 = βj hkj−1 , and the second Taylor polynomial without explicit
listing of the arguments ti and yi on the right-hand side is
    f(ti + βj h, yi + βj hkj−1) = f + hβj[ft + kj−1 fy] + (1/2)h²βj²[ftt + 2kj−1 fty + kj−1² fyy]
with remainder term O(h3 ).
In particular, when we set j = 1, βj = β1 = 0, we get

k1 = f (ti , yi ) = f.

When we set j = 2,

    k2 = f(ti + β2h, yi + β2hk1)
       = f + hβ2[ft + f fy] + (1/2)h²β2²[ftt + 2f fty + f²fyy] + O(h³).
The calculation of k3 is a little bit messier since it involves k22 . Before diving in headlong, though, consider what
we will do with k3 first. After computing k1 , k2 , and k3 , we will substitute each into the formula

yi+1 = yi + h [α1 k1 + α2 k2 + α3 k3 ] (6.4.3)

and subtract the result from (6.4.1). For purposes of this discussion, we seek a method with local truncation error
O(h4 ). Therefore, we need only retain constant terms and terms containing a factor of h3 , h2 , or h in equation
(6.4.3). Terms with higher powers of h are irrelevant. They will be assumed (or should I say consumed?) by the
O(h4 ). Since the sum α1 k1 + α2 k2 + α3 k3 is multiplied by h, we need only retain terms with factors of up to h2 in
k1 , k2 , and k3 . Taking a look at the expansion of k3 :

    k3 = f(ti + β3h, yi + β3hk2)
       = f + hβ3[ft + k2 fy] + (1/2)h²β3²[ftt + 2k2 fty + k2²fyy]

we see only the term (1/2)h²β3² · k2²fyy contains k2², and it already has a factor of h². Consequently, we only need to
include the constant term of k22 . The rest of the terms of k22 become part of the O(h4 ). That’s not so bad!

    k2² = f² + O(h).

Similarly, when we substitute expressions for k2 into k3 , we will be careful to avoid any terms that would give a
factor of h to any power greater than 2:

    k3 = f + hβ3[ft + (f + hβ2[ft + f fy])fy]
         + (1/2)h²β3²[ftt + 2f fty + f²fyy] + O(h³)
       = f + hβ3 ft + hβ3 f fy + h²β2β3(ft fy + f fy²)
         + (1/2)h²β3²[ftt + 2f fty + f²fyy] + O(h³).
After all that detailed computation, now is a good time to lean back and take a look at what we have so far.
We have expanded all the terms of (6.4.2) for s = 3 and are ready to compare the result to the Taylor expansion

of the o.d.e. in (6.4.1). The difference of the two is the local truncation error, so we will be interested in the least
power of h that remains after subtraction. Copying the two equations here for convenience, we are subtracting

    yi+1 = yi + hf + (1/2)h²(ft + fy f) + (1/6)h³(ftt + 2fty f + fyy f² + ft fy + fy² f) + O(h⁴)

from

    yi+1 = yi + h[α1k1 + α2k2 + α3k3]
         = yi + hα1k1 + hα2k2 + hα3k3
         = yi + hα1 f
           + hα2 (f + hβ2[ft + f fy] + (1/2)h²β2²[ftt + 2f fty + f²fyy] + O(h³))
           + hα3 (f + hβ3 ft + hβ3 f fy + h²β2β3(ft fy + f fy²) + (1/2)h²β3²[ftt + 2f fty + f²fyy] + O(h³)).

The constant term (term containing no factor of h) for each equation is simply yi, so no constant will remain after
subtraction. The difference of the terms involving h is hf − (hα1f + hα2f + hα3f) = hf(1 − (α1 + α2 + α3)), so if
there is to be no h left in the difference, we must have

    α1 + α2 + α3 = 1.

The difference of the terms involving h²ft is (1/2)h²ft − (h²α2β2ft + h²α3β3ft) = h²ft(1/2 − (α2β2 + α3β3)), so if there
is to be no h²ft left in the difference, we must have

    α2β2 + α3β3 = 1/2.

Similarly, we consider the differences of the rest of the terms to get the following conditions on the αj and βj .

    term          leads to condition
    h²fy f        α2β2 + α3β3 = 1/2
    h³ftt         α2β2² + α3β3² = 1/3
    h³fty f       α2β2² + α3β3² = 1/3
    h³fyy f²      α2β2² + α3β3² = 1/3
    h³ft fy       α3β2β3 = 1/6
    h³fy² f       α3β2β3 = 1/6

We have considered all 8 different terms, but have only arrived at 4 distinct conditions:

    α1 + α2 + α3 = 1
    α2β2 + α3β3 = 1/2
    α2β2² + α3β3² = 1/3
    α3β2β3 = 1/6.                                                 (6.4.4)

Since we have 5 variables and only 4 conditions, we should think that there are multiple o.d.e. solvers of the form
(6.4.2) with s = 3 and local truncation error O(h4 ).
Evidence from section 6.3 suggests that clopen-ode should have local truncation error O(h4 ). Let’s check. For
that method, we have α1 = 1/4, α2 = 0, α3 = 3/4, β2 = 1/3, β3 = 2/3, so

    α1 + α2 + α3 = 1/4 + 0 + 3/4 = 1
    α2β2 + α3β3 = 0 · (1/3) + (3/4) · (2/3) = 1/2
    α2β2² + α3β3² = 0(1/3)² + (3/4)(2/3)² = 1/3
    α3β2β3 = (3/4) · (1/3) · (2/3) = 1/6.
Indeed, clopen-ode satisfies all the conditions of an o.d.e. solver with local truncation error (at least) O(h4 ). We
would actually have to show that at least one term containing an h4 remains in the difference to prove that the
local truncation error is not of greater degree.
Before finally answering the question of what happened to Simpson’s-ode, our hard work so far is sufficient
to check that trapezoidal-ode and open-ode have local truncation error O(h3 ) and that Euler’s method has local
truncation error O(h2 ). For trapezoidal-ode, we have α1 = 21 , α2 = 12 , α3 = 0, β2 = 1, and β3 undefined (we may
assign any particular number we choose since having α3 = 0 makes β3 irrelevant to the method), which gives us
1 1
α1 + α2 + α3 = + +0=1
2 2
1 1
α2 β2 + α3 β3 = ·1+0=
2 2
 2
1 1 1 1
α2 β22 + α3 β32 = +0= 6 =
2 3 18 3
1
α3 β2 β3 = 0 6= .
6
The first two conditions are satisfied, but the last two are not. Recall, though, that the first two conditions were
derived from the h and h2 terms while the last two conditions were derived from the h3 terms. So, for trapezoidal-
ode, the local truncation error is O(h3 ).
For Euler’s method, we have α1 = 1, α2 = α3 = 0, and β2 and β3 undefined (or whatever we choose), which
gives us

    α1 + α2 + α3 = 1 + 0 + 0 = 1
    α2β2 + α3β3 = 0 + 0 = 0 ≠ 1/2
    α2β2² + α3β3² = 0 + 0 = 0 ≠ 1/3
    α3β2β3 = 0 ≠ 1/6.
The second equation, which was derived from terms involving h2 , is not satisfied but the first equation, which was
derived from terms involving h, is, so the local truncation error for Euler’s method is O(h2 ).
Finally, for Simpson’s-ode, we have α1 = 1/6, α2 = 2/3, α3 = 1/6, β2 = 1/2, and β3 = 1, which gives us

    α1 + α2 + α3 = 1/6 + 2/3 + 1/6 = 1
    α2β2 + α3β3 = (2/3)(1/2) + (1/6) · 1 = 1/2
    α2β2² + α3β3² = (2/3)(1/2)² + (1/6)(1)² = 1/3
    α3β2β3 = (1/6)(1/2) · 1 ≠ 1/6.
The first two equations are satisfied, so the local truncation error is (at least) O(h³), but the last equation is
not satisfied, so the local truncation error is no more than O(h³). No terms containing factors of h or h² (that
don’t also contain higher powers of h) appear in the local truncation error, but since h³α3β2β3(ft fy + f fy²) ≠ (1/6)h³(ft fy + f fy²), an h³ term does, so it is O(h³).

A Note About Convention and Practice


We have derived five o.d.e. solvers so far with little nod to established practice. It’s time to fix that. What we have
been calling trapezoidal-ode (since it was derived from the trapezoidal rule) is better known as the improved Euler
method, though some will refer to it as the explicit trapezoidal method. What we have been calling clopen-ode
is better known as Heun’s third order method. These methods can easily be found in the literature. They are
prototypical examples of efficient methods. The improved Euler method requires two function evaluations per step
and gives a local truncation error O(h3 ). Heun’s third order method requires three function evaluations per step
and gives a local truncation error O(h4 ).
What we have been calling open-ode has not been named as it would never be used in practice. It is not an effi-
cient method, requiring three function evaluations but having a local truncation error of only O(h3 ). Consequently,
you are not likely to see it appear in the literature as it is not a useful method in practice. Heun’s third order
method or the improved Euler method would both be preferable to open-ode. Heun’s third order method gives a
smaller truncation error for the same amount of computation (three function evaluations) and the improved Euler’s
method gives the same truncation error for less computation (two function evaluations). Simpson’s-ode has the
same shortcomings as open-ode, and thus you are not likely to see it in the literature either. It is also an inefficient
method.
Methods of the form (6.4.2) are part of a class of methods called Runge-Kutta methods, named after the German
mathematicians Carl Runge and Martin Kutta. The basic idea for such methods was laid out by Runge in a paper
published in 1895, where Runge introduced the improved Euler method and others. His work was continued by Heun,
whose paper of 1900 brought us Heun’s third order method and others. In 1901, Kutta derived the most famous
Runge-Kutta method, what is sometimes now referred to as the classic Runge-Kutta method or the Runge-Kutta
method of order 4, RK4. We will see shortly that it is a modification of Simpson’s-ode.[7]

Higher Order Methods


Higher order Runge-Kutta methods can be derived by considering methods of the form (6.4.2) with a number of
stages, s > 3. Of course higher order methods must satisfy more conditions. In fact, the number of conditions
grows faster as the desired order increases than does the number of variables as the number of stages increases. In
other words, there is a point where the number of stages to achieve order p exceeds p. Order 1 methods can be
derived with one stage (Euler’s method) and no less. Order 2 methods can be derived with two stages (improved
Euler’s method) and no less. Order 3 methods can be derived with three stages (Heun’s third order method) and no
less. Order 4 methods can be derived with four stages (example upcoming) and no less. However, order p methods
with p > 4 require a number of stages s > p, which, in turn means more than p function evaluations. So, the most
efficient methods are to be found with order 4 or less.
Simpson’s-ode failed to live up to its potential because it did not have enough stages, not because there is no
Simpson’s-rule-derived formula with local truncation error O(h5 ). The classic Runge-Kutta method of order 4 (local
truncation error O(h5 )) has four stages and is given by
    k1 = f(ti, yi)
    k2 = f(ti + h/2, yi + (h/2)k1)
    k3 = f(ti + h/2, yi + (h/2)k2)
    k4 = f(ti + h, yi + hk3)
    yi+1 = yi + (h/6)[k1 + 2k2 + 2k3 + k4].
Compare this to Simpson’s-ode:
    k1 = f(ti, yi)
    k2 = f(ti + h/2, yi + (h/2)k1)
    k3 = f(ti+1, yi + hk2)
    yi+1 = yi + (h/6)[k1 + 4k2 + k3].
They are very similar. If we separate the second stage of Simpson’s-ode into two stages, we get Runge-Kutta’s order
4 method. That is the difference. Two stages are used to approximate ẏ(ti + h/2) instead of one!
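
A sketch of the classic method in Octave might look like the following; the function name rk4 and its argument order are choices made here, and exercise 6 at the end of this section asks for your own implementation.

% A minimal sketch of the classic Runge-Kutta method of order 4.
% f is a function handle for ydot = f(t,y); N constant steps from t0 to t1.
function y = rk4(f, t0, t1, y0, N)
  h = (t1-t0)/N;
  t = t0; y = y0;
  for j = 1:N
    k1 = f(t, y);
    k2 = f(t + h/2, y + h/2*k1);
    k3 = f(t + h/2, y + h/2*k2);
    k4 = f(t + h, y + h*k3);
    y = y + h/6*(k1 + 2*k2 + 2*k3 + k4);
    t = t0 + j*h;
  end%for
end%function

For the running example it can be called as rk4(@(t,y) -y/t+t^2, 4, 2, 20, 8).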

Crumpet 35: Derivation of The (Classic) Runge-Kutta Order 4

To derive any Runge-Kutta method of order 4, the stages of the computation must be expanded in a third Taylor
polynomial:

1 2 2 2
f (ti + βj h, yi + βj hkj−1 ) = f + hβj [ft + kj−1 fy ] + h βj ftt + 2kj−1 fty + kj−1

fyy
2
1
+ h3 βj3 fttt + 3kj−1 ftty + 3kj−1
2 3
ftyy + kj−1 fyyy + O(h4 )
 
6

and f (t0 , y0 ) must be expanded in a fourth Taylor polynomial:

1 2 1 ... 1 4 ....
y(t0 + h) = y(t0 ) + hẏ(t0 ) + h ÿ(t0 ) + h3 y (t0 ) + h y (t0 ) + O(h5 ).
2 6 24
....
But y , in terms of f , is

d ... d
(y) = ftt + 2fty f + fyy f 2 + ft fy + fy2 f

dt dt
= fyyy f 3 + 3ftyy f 2 + 4fy fyy f 2 + 3ftty f + 5fty fy f + fy3 f
+3ft fyy f + ft fy2 + ftt fy + fttt + 3ft fty

so
1 2 1
yi+1 = yi + hf + h (ft + fy f ) + h3 (ftt + 2fty f + fyy f 2 + ft fy + fy2 f )
2 6
1 4
+ h fyyy f 3 + 3ftyy f 2 + 4fy fyy f 2 + 3ftty f + 5fty fy f + fy3 f
24
+3ft fyy f + ft fy2 + ftt fy + fttt + 3ft fty + O(h5 ).


Furthermore,
k1 = f (ti , yi ) = f
and

k2 = f (ti + β2 h, yi + β2 hk1 )
1
= f + hβ2 [ft + f fy ] + h2 β22 ftt + 2f fty + f 2 fyy
 
2
1
+ h3 β23 fttt + 3f ftty + 3f 2 ftyy + f 3 fyyy + O(h4 ).
 
6

Consequently, k22 = f 2 + 2hβ2 [ft + f fy ] f + O(h2 ) and k23 = f 3 + O(h). Therefore

1 2 2
= f + hβ3 [ft + k2 fy ] + h β3 ftt + 2k2 fty + k22 fyy

k3
2
1
+ h3 β33 fttt + 3k2 ftty + 3k22 ftyy + k23 fyyy
 
6
1
h   i
= f + hβ3 ft + f + hβ2 [ft + f fy ] + h2 β22 ftt + 2f fty + f 2 fyy fy

2
1 2 2
+ h β3 ftt + 2 (f + hβ2 [ft + f fy ]) fty + f 2 + 2hβ2 [ft + f fy ] f fyy
 
2
1
+ h3 β33 fttt + 3f ftty + 3f 2 ftyy + f 3 fyyy + O(h4 )
 
6
1
= f + hβ3 [ft + f fy ] + h2 β2 β3 [ft + f fy ] fy + h2 β32 ftt + 2f fty + f 2 fyy
 
2
1
+ h3 β3 β22 ftt + 2f fty + f 2 fyy fy + h3 β32 β2 [ft + f fy ] [fty + f fyy ]
 
2
1
+ h3 β33 fttt + 3f ftty + 3f 2 ftyy + f 3 fyyy + O(h4 ).
 
6

So, k32 = f 2 + 2hβ3 [ft + f fy ] f + O(h2 ) and k33 = f 3 + O(h). Therefore


1 2 2
= f + hβ4 [ft + k3 fy ] + h β4 ftt + 2k3 fty + k32 fyy

k4
2
1
+ h3 β43 fttt + 3k3 ftty + 3k32 ftyy + k33 fyyy + O(h4 )
 
6
1 2 2
h   i
= f + hβ4 ft + f + hβ3 [ft + f fy ] + h2 β2 β3 [ft + f fy ] fy + h β3 ftt + 2f fty + f 2 fyy fy

2
1
+ h2 β42 ftt + 2 (f + hβ3 [ft + f fy ]) fty + f 2 + 2hβ3 [ft + f fy ] f fyy
  
2
1
+ h3 β43 fttt + 3f ftty + 3f 2 ftyy + f 3 fyyy + O(h4 )
 
6
1
= f + hβ4 [ft + f fy ] + h2 β3 β4 [ft + f fy ] fy + h2 β42 ftt + 2f fty + f 2 fyy
 
2
1
+h3 β2 β3 β4 [ft + f fy ] fy2 + h3 β4 β32 ftt + 2f fty + f 2 fyy fy
 
2
1
+h3 β42 β3 [ft + f fy ] [fty + f fyy ] + h3 β43 fttt + 3f ftty + 3f 2 ftyy + f 3 fyyy + O(h4 ).
 
6
Matching coefficients in
1 2 1
yi+1 = yi + hf + h (ft + fy f ) + h3 (ftt + 2fty f + fyy f 2 + ft fy + fy2 f )
2 6
1 4
+ h fyyy f 3 + 3ftyy f 2 + 4fy fyy f 2 + 3ftty f + 5fty fy f + fy3 f
24
+3ft fyy f + ft fy2 + ftt fy + fttt + 3ft fty + O(h5 ).


with coefficients in
yi+1 = yi + h [α1 k1 + α2 k2 + α3 k3 + α4 k4 ]
up to order 4 yields the conditions

    α1 + α2 + α3 + α4 = 1                                         (6.4.5)
    α2β2 + α3β3 + α4β4 = 1/2                                      (6.4.6)
    α2β2² + α3β3² + α4β4² = 1/3                                   (6.4.7)
    α3β2β3 + α4β3β4 = 1/6                                         (6.4.8)
    α2β2³ + α3β3³ + α4β4³ = 1/4                                   (6.4.9)
    α3β3²β2 + α4β4²β3 = 1/8                                       (6.4.10)
    2α3β3²β2 + 2α4β4²β3 + α3β3β2² + α4β4β3² = 1/3                 (6.4.11)
    α3β3²β2 + α4β4²β3 + α3β3β2² + α4β4β3² = 5/24                  (6.4.12)
    α3β3β2² + α4β4β3² = 1/12                                      (6.4.13)
    α4β2β3β4 = 1/24.                                              (6.4.14)
Any four-stage (s = 4) fourth order Runge-Kutta method of the form (6.4.2) will have to satisfy these 10 equations
with only 7 degrees of freedom (7 variables). Either the equations form a dependent set or solutions will be rare.
In an attempt to solve the system, we solve (6.4.14) for α4 :
    α4 = 1/(24β2β3β4).

Substituting our formula for α4 into (6.4.8) and solving for α3:

    α3 = (4β2 − 1)/(24β2²β3).

Substituting our formulas for α3 and α4 into (6.4.13) and solving for β3:

    β3 = −4β2² + 3β2.

Substituting our formulas for α3 , α4 and β3 into (6.4.10) and solving for β4 :

β4 = (6 − 16β2 + 16β22 )β2 .

Substituting our formulas for α3 , α4 , β3 and β4 into (6.4.6) and solving for α2 :

2 − 16β2 + 52β22 − 48β23


α2 = .
24β23 (3 − 4β2 )

Substituting our formulas for α2 , α3 , α4 , β3 and β4 into (6.4.7) and simplifying:

16β23 − 12β22 + 4β2 − 1 = 0.



The roots of this last equation are β2 = 1/2, (1 ± i√7)/8, so we conclude that β2 = 1/2. Back substituting, we find

    β2 = 1/2,  α2 = 1/3,  β4 = 1,  β3 = 1/2,  α3 = 1/3,  α4 = 1/6.

Substituting these values of α2, α3, and α4 into (6.4.5), we find

    α1 = 1/6.
These seven values are the unique simultaneous real solution of the equations (6.4.14), (6.4.8), (6.4.13), (6.4.10),
(6.4.6), (6.4.7), and (6.4.5). So the seven parameters are determined by 7 of the ten conditions. It remains to
show that these seven values also satisfy (6.4.9), (6.4.11), and (6.4.12), which they do. Finally, note that these
are the values of the parameters for the (classic) Runge-Kutta method of order 4.
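
A quick numerical check (written here, not part of the text) confirms that these seven values satisfy all ten conditions:

% Verify that the classic RK4 coefficients satisfy (6.4.5) through (6.4.14).
a = [1/6 1/3 1/3 1/6];   % alpha1..alpha4
b = [0 1/2 1/2 1];       % beta1..beta4 (beta1 plays no role)
lhs = [sum(a);
       a(2)*b(2)+a(3)*b(3)+a(4)*b(4);
       a(2)*b(2)^2+a(3)*b(3)^2+a(4)*b(4)^2;
       a(3)*b(2)*b(3)+a(4)*b(3)*b(4);
       a(2)*b(2)^3+a(3)*b(3)^3+a(4)*b(4)^3;
       a(3)*b(3)^2*b(2)+a(4)*b(4)^2*b(3);
       2*a(3)*b(3)^2*b(2)+2*a(4)*b(4)^2*b(3)+a(3)*b(3)*b(2)^2+a(4)*b(4)*b(3)^2;
       a(3)*b(3)^2*b(2)+a(4)*b(4)^2*b(3)+a(3)*b(3)*b(2)^2+a(4)*b(4)*b(3)^2;
       a(3)*b(3)*b(2)^2+a(4)*b(4)*b(3)^2;
       a(4)*b(2)*b(3)*b(4)];
rhs = [1; 1/2; 1/3; 1/6; 1/4; 1/8; 1/3; 5/24; 1/12; 1/24];
disp(max(abs(lhs - rhs)))   % displays 0 (up to rounding)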

Key Concepts
Taylor’s theorem in two variables: Suppose f (t, y) and all its partial derivatives of order n + 1 and lower are
continuous on the rectangle D = {(t, y) : a ≤ t ≤ b, c ≤ y ≤ d}, and let (t0 , y0 ) ∈ D. Then for every (t, y) ∈ D,
there exist ξ ∈ (a, b) and µ ∈ (c, d) such that

    f(t, y) = f(t0, y0) + [(t − t0)·ft(t0, y0) + (y − y0)·fy(t0, y0)]
            + (1/2)[(t − t0)²ftt(t0, y0) + 2(t − t0)(y − y0)·fty(t0, y0) + (y − y0)²fyy(t0, y0)]
            + · · ·
            + (1/n!) Σ_{j=0}^{n} (n choose j) (t − t0)^{n−j}(y − y0)^{j} ∂ⁿf/(∂t^{n−j}∂y^{j}) (t0, y0)
            + (1/(n+1)!) Σ_{j=0}^{n+1} (n+1 choose j) (t − t0)^{n+1−j}(y − y0)^{j} ∂^{n+1}f/(∂t^{n+1−j}∂y^{j}) (ξ, µ).

Exercises

1. Determine analytically the local truncation error for the o.d.e. solver derived in exercise 1 on page 212. Compare it to the local truncation error of the underlying integration formula. Are they the same? Also compare it to the experimentally determined rate of convergence (see exercise 2 on page 213). Is it one degree higher, as should be expected? [S][A]

2. Execute one step of Runge-Kutta order four for solving ẏ = ty with y(1) = 0.5 and h = 1, thus approximating y(2). Compare your answer to that of section 6.2 exercise 1c on page 205 in which you used Euler’s method with two steps. The exact solution is y(2) = e^{3/2}/2 ≈ 2.240844535169032. [S]

3. Explain geometrically, and in your own words, improved Euler’s method.

4. Write an Octave function that implements improved Euler’s method (same as exercise 4 on page 213 except this time the method has a proper name). [A]

5. Write an Octave function that implements Heun’s third order method (same as exercise 5 on page 213 except this time the method has a proper name). [A]

6. Write an Octave function that implements RK4. [A]

7. Use your code from exercise 6 to compute y(2) for the o.d.e. in exercise 1 on page 205 using step size h = 0.05. [S][A]

6.5 Adaptive Runge-Kutta Methods


Two of the o.d.e. solvers derived in section 6.3 used the exact same set of calculations for k1 , k2 , and k3 , but
combined the results differently to compute yi+1 . At the time, these were called open-ode and clopen-ode. In the
analysis of section 6.4 it was noted that open-ode was not an efficient method while clopen-ode was, at which point
we began referring to clopen-ode by its proper name, Heun’s third order method.

Crumpet 36: Heun’s third order method

In this article from 1900 [16], Karl Heun puts forth the third order method that bears his name. Even if you cannot read the German, his formula VI) is clear!

Due to its inefficiency, open-ode should never be used in practice by itself, but combined with Heun's third order method, it has some potential usefulness.


According to Heun’s third order method

k_1 = f(t_i, y_i)
k_2 = f(t_i + h/3, y_i + (h/3) k_1)
k_3 = f(t_i + 2h/3, y_i + (2h/3) k_2)
y_{i+1} = y_i + (h/4)[k_1 + 3 k_3] + O(h^4).

Using the same k_1, k_2, and k_3, the open-ode method is calculated as

y_{i+1} = y_i + (h/2)[k_2 + k_3] + O(h^3).

The difference between these estimates is

(h/4)[k_1 − 2 k_2 + k_3] = M h^3 + O(h^4)                              (6.5.1)
for some constant M, and represents the local truncation error of the lower order method, open-ode. This error estimate can be used to adapt the size of h from one step to the next, decreasing the step size when the local truncation error is bigger than some tolerance and increasing the step size when it is smaller.

To illustrate the algorithm and the benefits of adaptive routines, let's return to o.d.e. 6.2.1, ẏ = −y/t + t^2, which we have generously leaned upon already. As before, we will estimate y(2) given initial condition y(4) = 20. This time the number of steps to compute will be determined by the algorithm, not by us, at least after the first step. Unfortunately, there is no standard or fool-proof way to choose the size of the first step. Because we are looking for a computation that can be done by hand, let's try h = −1 to begin, half the width of the interval [2, 4] over which we will integrate.

As was needed for adaptive quadrature, a desired level of accuracy, or tolerance, is needed here too. Again because we are looking for a computation that can be done by hand, let's try 0.1, a pretty modest accuracy.

Finally, we are ready to compute:

k_1 = f(4, 20) = 11
k_2 = f(4 − 1/3, 20 − (1/3)(11)) ≈ 8.98989898989899
k_3 = f(4 − 2/3, 20 − (2/3)(8.9898...)) ≈ 6.90909090909091.

Before computing y_1 from these values, we need to check that the expected accuracy of the calculation would not violate the 0.1 requirement:

|(h/4)[k_1 − 2k_2 + k_3]| ≈ 0.017.

The approximate error in stepping to t_1 = 3 is about 0.02, well below the desired threshold. We are clear to proceed:

y_1 = y_0 + (h/4)[k_1 + 3k_3] ≈ 12.06818181818182
t_1 = t_0 + h = 3.

Hence we have y(3) ≈ 12.07. Continuing with h = −1,

k_1 = f(3, 12.068...) ≈ 4.977272727272728
k_2 = f(3 − 1/3, 12.068... − (1/3)(4.9773...)) ≈ 3.20770202020202
k_3 = f(3 − 2/3, 12.068... − (2/3)(3.2077...)) ≈ 1.188852813852814.

Before computing y_2 from these values, we need to check that the expected accuracy of the calculation would not violate the 0.1 requirement:

|(h/4)[k_1 − 2k_2 + k_3]| ≈ 0.062.

The approximate error in stepping to t_2 = 2 is about 0.06, well below the desired threshold. We are clear to proceed:

y_2 = y_1 + (h/4)[k_1 + 3k_3] ≈ 9.932224025974026
t_2 = t_1 + h = 2.

Hence we have y(2) ≈ 9.932. After two steps, the actual error is about |10 − 9.932| = 0.068. Of course, we could
have simply executed Heun’s third order method with step size h = 1 (and no error checking) and gotten the same
answer. The difference is we would not have had any idea what to expect for an error! With the adaptive method,
you can be reasonably sure each step incurs only the error you request. At the risk of belaboring the point, consider
redoing the calculation with step size h = −2:

k_1 = f(4, 20) = 11
k_2 = f(4 − 2/3, 20 − (2/3)(11)) ≈ 7.311111111111111
k_3 = f(4 − 4/3, 20 − (4/3)(7.3111...)) ≈ 3.266666666666667.

If we proceed with Heun’s third order method (and no error checking), we get

y_1 = y_0 + (h/4)[k_1 + 3k_3] ≈ 9.6
t_1 = t_0 + h = 2.

However, without the exact answer, which will be the usual when using a numerical method, we have no way to
know how accurate this estimate is! In that regard, the value 9.6 is a somewhat useless estimate.
On the other hand, since we know the exact value of y(2) is 10, we know the error is 0.4, larger than the desired
0.1. The adaptive Heun should catch this and arrive at a more accurate estimate:

|(h/4)[k_1 − 2k_2 + k_3]| ≈ 0.177.

The adaptive method would reject this step because the approximate error is greater than the desired accuracy,
without calculating y1 ! So what should it do instead? The adaptive method will try again with a smaller step size.
Since

(h/4)[k_1 − 2k_2 + k_3] ≈ M h^3,

we have M h^3 ≈ 0.177 for any step size close to the one just attempted. If we scale the step size by a factor of q, say, we should expect the new error to be approximately M(qh)^3, or q^3 M h^3 ≈ 0.177 q^3. Since we would like that error to be no more than 0.1, we should choose q so that 0.177 q^3 < 0.1, or q^3 < 0.1/0.177, which implies q < (0.1/0.177)^{1/3} ≈ 0.8254. But it would slow down the algorithm immensely if the step size were too large very often, so instead, we will take a somewhat conservative next step of 0.9qh ≈ 0.9(0.8254)(−2) ≈ −1.485. Recalculating with the new step size:

k_1 = f(4, 20) = 11
k_2 = f(4 − 1.485/3, 20 − (1.485/3)(11)) ≈ 8.130924301356263
k_3 = f(4 − 2(1.485)/3, 20 − (2(1.485)/3)(8.1309...)) ≈ 5.087191526760124.

and
|(h/4)[k_1 − 2k_2 + k_3]| ≈ 0.06487930780869297,

so this step is accepted:

y_1 = y_0 + (h/4)[k_1 + 3k_3] ≈ 10.24469652063055
t_1 = t_0 + h = 2.514132737997418.

Now we keep the new step size until it proves to be inappropriate. In this case, that happens right away. Another
step of −1.485 would take the solution to t2 ≈ 1.028, well past the desired t = 2. So, we shorten the step size to
2 − t1 = −0.514132737997418. There is no worry about shortening the step size as that is expected to reduce the
error! Finally, with h = −0.514132737997418:

k_1 = f(2.514..., 10.244...) ≈ 2.246020292164824
k_2 = f(2.514... − 0.5141.../3, 10.244... − (0.5141.../3)(2.246...)) ≈ 1.279876276642283
k_3 = f(2.514... − 2(0.5141...)/3, 10.244... − (2(0.5141...)/3)(1.279...)) ≈ 0.1988478127940674.

and
|(h/4)[k_1 − 2k_2 + k_3]| ≈ 0.01476646399275057,

this step is accepted:

y_2 = y_1 + (h/4)[k_1 + 3k_3] ≈ 9.879332752200975
t_2 = t_1 + h = 2.

We have y(2) ≈ 9.879332752200975 with some confidence that the error will not be terribly much more than about
0.2, since we took two steps each of which may have incurred an error of about 0.1. There is no guarantee the error
will be less than 0.2, but at least we have some confidence that it’s not drastically greater. And because we used
a conservative estimate for step size, the actual error is probably a bit smaller (as it turns out, the error is about
0.12).
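
The arithmetic above is easily reproduced in Octave. The following lines, offered only as an illustration, carry out the rejected trial step of size h = −2, form the error estimate (6.5.1), and rescale the step by the conservative factor 0.9(tol/err)^{1/3}.

% Rejected trial step of size h = -2 for y' = -y/t + t^2, y(4) = 20 (illustrative).
f   = @(t,y) -y./t + t.^2;
t0  = 4; y0 = 20; h = -2; tol = 0.1;
k1  = f(t0, y0);
k2  = f(t0 + h/3, y0 + h/3*k1);
k3  = f(t0 + 2*h/3, y0 + 2*h/3*k2);
err = abs(h/4*(k1 - 2*k2 + k3))     % about 0.1778, greater than tol, so the step is rejected
h   = 0.9*(tol/err)^(1/3)*h         % about -1.486, the step size (-1.485...) used in the text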

Adaptive Runge-Kutta (pseudo-code)


There are many different adaptive Runge-Kutta schemes, but the one discussed here uses second and third order
methods, so might be called RK2(3). Technically, it is an order 2 method since the error estimate is for the lower
order method. In practice, however, it is often the higher order method that is used for the o.d.e. solution. While
there is never any guarantee the higher order method is more accurate than the lower order method, it rarely causes
any adverse problems. Besides hedging our bets with the 0.9 safety factor when adjusting the step size, we also
disallow any scaling of h by any factor less than 0.1 or any factor greater than 5. These extra safeties are not
terribly restrictive since they allow for exponential growth or decay of h, but they can help avoid problems when
the error estimates are simply bad. Moreover, the estimates are only good for a small range since the constant of
proportionality may change dramatically for large changes in h. A more detailed discussion of the algorithm can
be found in [26] Section 16.2.
Assumptions: ẏ = f (t, y), y(a) = y0 has a unique solution over the interval from a to b.
Input: Initial value (a, y0 ); function f (t, y); interval endpoints, a and b; initial step size h; desired accuracy
tol; maximum number of iterations N .
Step 1: Set i = 1; t = a; y = y0 ; done = f alse;
Step 2: While not done and i ≤ N do Steps 3-6:
Step 3: If ((b − (t + h)) · (b − a) ≤ 0) then set h = b − t; done = true;
Step 4: Set k1 = f(t, y); k2 = f(t + h/3, y + (h/3)k1); k3 = f(t + 2h/3, y + (2h/3)k2); err = |(h/4)(k1 − 2k2 + k3)|;
Step 5: If done or err ≤ tol then set y = y + (h/4)(k1 + 3k3); temp = t; t = t + h;


Step 6: If temp = t then do Steps 7-8:
Step 7: Print “Method failed. Step size reached zero.”
Step 8: Return

Step 9: Set i = i + 1;
Step 10: If err < tol/5 or err > tol then do Steps 11-14:
Step 11: Set q = 0.9 (tol/err)^{1/3}
Step 12: If q < 1/10 then set q = 1/10
Step 13: If q > 5 then set q = 5
Step 14: Set h = qh
Step 15: If not done then Print “Method failed. Maximum iterations exceeded.”
Output: Approximation y(b) or message of failure.

The formulas for ki and err will need to be changed for different adaptive Runge-Kutta schemes, as will the
recalculation of h in Steps 11-14, but the basic algorithm does not require modification for other embedded methods.
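
For readers who would like to see the pseudo-code in runnable form before attempting exercise 1, here is a minimal Octave sketch of RK2(3). It follows the steps above directly and returns the arrays of accepted points; the name rk23 and the particular argument list are choices made for this illustration, not requirements of the text.

% rk23: adaptive RK2(3) solver following the pseudo-code above (a sketch).
% f is a function handle f(t,y); a and b are the interval endpoints; y0 = y(a);
% h is the initial step size; tol the desired accuracy; N the maximum iterations.
function [y, t] = rk23(f, a, y0, b, h, tol, N)
  i = 1; t = a; y = y0; done = false;
  while (~done && i <= N)
    if ((b - (t(end) + h))*(b - a) <= 0)          % next step would pass b
      h = b - t(end); done = true;
    end%if
    k1 = f(t(end), y(end));
    k2 = f(t(end) + h/3, y(end) + h/3*k1);
    k3 = f(t(end) + 2*h/3, y(end) + 2*h/3*k2);
    err = abs(h/4*(k1 - 2*k2 + k3));
    if (done || err <= tol)                        % accept the step
      y(end+1) = y(end) + h/4*(k1 + 3*k3);
      temp = t(end);
      t(end+1) = t(end) + h;
      if (t(end) == temp)
        error('Method failed. Step size reached zero.');
      end%if
    end%if
    i = i + 1;
    if (err < tol/5 || err > tol)                  % rescale the step size
      q = 0.9*(tol/err)^(1/3);
      q = max(min(q, 5), 1/10);
      h = q*h;
    end%if
  end%while
  if (~done)
    disp('Method failed. Maximum iterations exceeded.');
  end%if
end%function

For instance, [y, t] = rk23(@(t,y) -y./t + t.^2, 4, 20, 2, -1, 0.1, 100) applies the method to the example of this section, and y(end) approximates y(2).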

General Runge-Kutta Schemes


Up to now, we have considered Runge-Kutta methods of the form (6.4.2), copied here for convenience:

k_1 = f(t_i, y_i)
k_2 = f(t_i + β_2 h, y_i + β_2 h k_1)
k_3 = f(t_i + β_3 h, y_i + β_3 h k_2)
⋮
k_s = f(t_i + β_s h, y_i + β_s h k_{s−1})
y_{i+1} = y_i + h [α_1 k_1 + α_2 k_2 + α_3 k_3 + ⋯ + α_s k_s].

In methods of this type, k1 is used in the computation of k2 ; k2 is used in the computation of k3 ; k3 is used in the
computation of k4 ; and so on. However, there is nothing preventing one from deriving a method where both k1
and k2 are used in the computation of k3 ; all of k1 , k2 , and k3 are used in the computation of k4 ; and in general
allowing all of k1 , k2 , . . . , kj−1 to be used in computing kj . Doing so gives more degrees of freedom for satisfying
the error analysis equations, lending hope that there are many more Runge-Kutta methods possible. Any method
of this more general form is called an explicit Runge-Kutta method and can be formulated as

k_1 = f(t_i, y_i)
k_2 = f(t_i + δ_2 h, y_i + β_21 h k_1)
k_3 = f(t_i + δ_3 h, y_i + β_31 h k_1 + β_32 h k_2)
⋮
k_s = f(t_i + δ_s h, y_i + \sum_{j=1}^{s−1} β_sj h k_j)
y_{i+1} = y_i + h [α_1 k_1 + α_2 k_2 + α_3 k_3 + ⋯ + α_s k_s].         (6.5.2)

Methods of this form are often summarized in a Butcher tableau,

0
δ_2   β_21
δ_3   β_31   β_32
⋮     ⋮      ⋮      ⋱
δ_s   β_s1   β_s2   ⋯    β_s(s−1)
      α_1    α_2    ⋯    α_{s−1}   α_s

much like the coefficients of a system of linear equations might be summarized in a matrix. The Butcher tableau
for any of the Runge-Kutta methods we have considered so far will take the form

0
δ_2   β_21
δ_3   0      β_32
δ_4   0      0      β_43
⋮     ⋮      ⋮      ⋮     ⋱
δ_s   0      0      ⋯     0     β_s(s−1)
      α_1    α_2    α_3   ⋯     α_{s−1}   α_s

For example, Heun’s third order method would be summarized in a Butcher tableau as

0
1/3   1/3
2/3   0     2/3
      1/4   0     3/4

For our purposes, adaptive Runge-Kutta schemes, also called embedded methods, will be coded in a Butcher tableau
by adding one more line for the coefficients αj of the lower order method. For example the Butcher tableau for
RK2(3) as presented above would be

0
1/3   1/3
2/3   0     2/3
      1/4   0     3/4
      0     1/2   1/2
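
Because a Butcher tableau completely determines an explicit method, one short Octave routine can take a step of any such method once the tableau is stored in arrays. In the sketch below, d holds the δ's, B the strictly lower triangular β's, and alpha the weights; these variable names and the function name rkstep are assumptions made for this illustration only.

% rkstep: one explicit Runge-Kutta step driven by a Butcher tableau (illustrative).
% d(j) are the nodes, B(j,1:j-1) the coefficients of k_1,...,k_{j-1}, alpha the weights (a row vector).
function ynew = rkstep(f, t, y, h, d, B, alpha)
  s = length(alpha);
  k = zeros(s, 1);
  k(1) = f(t, y);
  for j = 2:s
    k(j) = f(t + d(j)*h, y + h*(B(j,1:j-1)*k(1:j-1)));
  end%for
  ynew = y + h*(alpha*k);
end%function

With d = [0 1/3 2/3], B = [0 0 0; 1/3 0 0; 0 2/3 0], and alpha = [1/4 0 3/4], rkstep reproduces one step of Heun's third order method.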

The most general Butcher tableaux for non-embedded methods take the form

0     β_11   β_12   ⋯   β_1s
δ_2   β_21   β_22   ⋯   β_2s
⋮     ⋮      ⋮           ⋮
δ_s   β_s1   β_s2   ⋯   β_ss
      α_1    α_2    ⋯   α_s

If any of the β_ij with j ≥ i are nonzero, the associated Runge-Kutta scheme is an implicit method. Each step of the method will require solving a system of equations. Implicit Runge-Kutta methods can be considered for approximating the solutions of stiff o.d.e.s since explicit methods are often exceedingly bad at it.

Crumpet 37: A Stiff Ordinary Differential Equation

The ordinary differential equation

ẋ = x^2 − x^3
x(0) = δ                                                               (6.5.3)

has no closed form solution. The best one can do is derive an implicit solution, so a numerical solution is necessary
to approximate values of the function. Some basic analysis can give an idea what the solution is like, however. It
has an equilibrium at x = 0, which means if x(t0 ) = 0 for some t0 , then x(t) = 0 for all t. The function remains
constant for all time. It is in equilibrium. It does not change. This follows from the fact that when x = 0,
ẋ = 0^2 − 0^3 = 0. Similarly, the o.d.e. has an equilibrium at x = 1 (because 1 is another root of the polynomial x^2 − x^3), and it has no others. However, the two equilibria are very different from one another. The equilibrium
at x = 0 is unstable while the equilibrium at x = 1 is stable. If x(t0 ) is near enough to 1 (|x(t0 ) − 1| < 1 will do),
then x will tend toward 1 as t → ∞. However, there is no such condition near x = 0. No matter how close x(t0 )
is to zero, if it is positive, x will still tend to the other equilibrium, 1, as t → ∞. More to the point, though, is
how the values of x approach 1 as t → ∞.
The hope for an adaptive o.d.e. solver is that it will take large steps where the function is not varying quickly
(has a small first derivative) and will be more careful by taking small steps where the function is varying quickly
(has a large first derivative). More often than not, this is exactly what happens. Stiff o.d.e.s are an exception to
the rule where an adaptive method takes many small steps even in a region where the function has a small first
derivative. The following figures show the solution of (6.5.3) using RK2(3) with tolerance 10−6 , δ = 10−3 , and
initial step size 3 over the interval [0, 2/δ]. First, the solution over [0, 980] acts as we would hope. The solver takes
large steps, including one step from t ≈ 93 to t ≈ 210, a step size h > 117 at the beginning where the function
changes very slowly.

[Figure: the numerical solution x(t) over [0, 980].]

In the middle, the solution over [980, 1020] continues to act as we would hope. The solution begins to vary more
quickly here and, consequently, the solver takes a number of smaller steps.

[Figure: the numerical solution x(t) over [980, 1020].]

Toward the end, the solution over [1020, 2000] demonstrates the consequence of stiffness. The exact solution is
very nearly constant over this region, gradually approaching 1 from below. A good solver would again take large
steps across this region, but adaptive explicit Runge-Kutta schemes do not. The numerical solution oscillates
within tolerance about 1, so it does what it is supposed to do, but it takes many short steps to do so.

[Figure: the numerical solution x(t) over [1020, 2000].]
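
The experiment reported in this crumpet can be repeated with a few lines of Octave. The call below assumes an RK2(3) implementation with the argument list of the sketch given after the pseudo-code above (the reader's own solution to exercise 1 may order its arguments differently); the step counts, not the exact numbers, are the point.

% Count the steps RK2(3) takes on the stiff o.d.e. (6.5.3) (illustrative).
f     = @(t,x) x.^2 - x.^3;
delta = 1e-3;
[x, t] = rk23(f, 0, delta, 2/delta, 3, 1e-6, 100000);
printf('accepted points with t <= 980:  %d\n', sum(t <= 980));
printf('accepted points with t >= 1020: %d\n', sum(t >= 1020));  % many short steps here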

Key Concepts
Embedded Runge-Kutta method: A Runge-Kutta method in which there are two schemes of different orders
derived from the same set of function evaluations.
Adaptive Runge-Kutta method: A Runge-Kutta method that takes advantage of an embedded Runge-Kutta
scheme to automatically adapt the step size as it estimates the solution of an o.d.e.
Butcher tableau: A tabular representation of a Runge-Kutta method.
RKm(n): Shorthand for an embedded Runge-Kutta method containing schemes with rates of convergence (com-
monly called orders) m and n.

Exercises 0

1. Write an Octave function that implements RK2(3) 1 1


as presented in pseudo-code. [A] 1
2
1
2

2. Which are the Butcher tableaux of implicit methods? 5. Show that the method given by the Butcher tableau
[A]
has order 2 for any δ ∈ [ 12 , 1].
0
0
1 1 1
4 8 8 δ δ
1 1
0 1 1
(a) 2 2 1− 2δ 2δ
3 3 9
4 16
0 16

1 − 73 2 − 12
7
8
7
6. Demonstrate numerically that the method sug-
7 32 12 32 7 gested by the Butcher tableau has rate of convergence
90 90 90 90 90 O(h3 ).
0
0
1 1
4 4 1 1
3 3 3
− 49 3 2 2
(b) 4 (a) 3
0 3
1 1 5 1
2 18 12 36 1 0 0 1
7
1 9
− 35 − 19 2 3 1
0 4
0 4
1 2 1
6
0 0 3 6
0
0 2 2
7 7
1 1
4 8 4 [S]
2 2 (b) 7
− 35 5
1 1
(c) 0 6 29
2 2
7 42
− 23 5
6
1 0 0 1 1 1 5 1
6 6 12 4
1 1 1 1
6 3 3 6
√ √
0
1
0 12
− 125 12
5 1
12
1 1
2 2

5− 5 1 1

10−7 5

5 (c)
3 3
10

12 4

60 60
√ 4
0 4
5+ 5 1 10+7 5 1
(d) 10 12 60 4
− 605 2 1 4
9 3 9
1 5 5 1
1 12 12 12 12
1 5 5 1
7. Euler’s method and the improved Euler method use the
12 12 12 12 same function evaluations. Thus, they can be combined
into an embedded, and therefore adaptive, method.
3. Show that this is the Butcher tableau for Euler’s Write the Butcher tableau for the Euler/improved Eu-
method. ler embedded method.

0 0 8. Write an Octave function that implements the


adaptive method suggested in exercise 7.
1
3
9. 8
-rule Runge-Kutta method. Demonstrate nu-
4. Show that this is the Butcher tableau for the improved merically that the 83 -rule method, given by the Butcher
Euler method. [S] tableau, has rate of convergence O(h4 ).

0 (a) The method of exercise 6b and the following.


1 1
3 3 0
2
3
− 13 1 2 2
7 7
1 1 −1 1 4 8 4
7
− 35 5
1 3 3 1
6 29
8 8 8 8
7 42
− 23 5
6
1 1 5 1
1
10. Write an Octave function that implements the 6 6 12 4
11 7 35 7 1
RK3(4) adaptive method ([6] page 301) given by the 96 24 96 48 12
Butcher tableau. [S]
(b) Bogacki–Shampine rk2(3). The method of ex-
0 ercise 6c and the following. [S]
1 1
4 4 0
3
4
− 49 3 1 1
2 2
1 1 5 1
3 3
2 18 12 36
4
0 4
7
1 9
− 53 − 19 2 1 2 1 4
9 3 9
1 2 1
6
0 0 3 6 7 1 1 1
24 4 3 8
7
9
− 53 − 19 2 0
13. Butcher [6] credits Merson (1957) with the earliest
11. Cash-Karp RK4(5). Write an Octave function proposed embedded Runge-Kutta method, given by the
that implements the Cash-Karp adaptive method given Butcher tableau. What are the orders of the two meth-
by the Butcher tableau. [A] ods?

0 0
1 1
1 1 3 3
5 5
1 1 1
3 3 9 3 6 6
10 40 40
1 1 3
3 3 9
− 10 6 2 8
0 8
5 10 5
1
1 − 11 5
− 70 35 1 2
0 − 32 2
54 2 27 27
1 2 1
7 1631 175 575 44275 253 6
0 0 3 6
8 55296 512 13824 110592 4096
1 3 2 1
37
0 250 125
0 512 10
0 10 5 5
378 621 594 1771
2825 18575 13525 277 1
27648
0 48384 55296 14336 4 14. Merson (1957). Write an Octave function that
12. The following pairs of Runge-Kutta methods use the implements the adaptive method of exercise 13. [A]
same function evaluations, but have different rates of 15. The initial value problem
convergence. They can each therefore be paired to form
an embedded Runge-Kutta scheme. Write the Butcher x + 2ey cos(ex )
y0 =
tableau for the embedded method. 1 + ey
y(0) = 2 (6.5.4)
(a) The method of exercise 6a and open-ode.
3 [A] can not be solved analytically. The solution must be
(b) The -rule (exercise 9) and the following.
8 approximated. Use your code from the given exercise to
approximate y(4) with an error of no more than 10−4 .
0
[S]
1 1 (a) 1
3 3
2 (b) 8
3
− 13 1
1 1 (c) 10
0 2 2 [A]
(d) 11
3
(c) The 8
-rule (exercise 9) and the following. (e) 12a
[A]
(f) 12b
0
(g) 12c
1 1
3 3
(h) 12a
2
− 13 1
3 (i) 12b
1 1 −1 1 (j) 13
3
2
− 32 0 1 (k) 14


16. The initial value problem due to their failure to proceed beyond x = e.
They get “stuck” taking tinier and tinier steps
x2 + y √
y0 = near x = e, as they should since the solution
x − y2 does not exist beyond that point.
y(0) = 5 (6.5.5)
18. Attempt to approximate y(4) for the initial value
can not be solved analytically. The solution must be
problem in exercise 16. Use a variety of adaptive and
approximated. Use your code from the given exercise to
non-adaptive methods with a variety of tolerances. You
approximate y(3) with an error of no more than 10−4 .
should find that you can not obtain dependable results.
(a) 1 [S] Can you explain why not? HINT: You may wish to plot
(b) 8 the approximate solutions. If your solvers are written
so as to store the points in arrays, it is a simple mat-
(c) 10 ter to plot the solutions, as demonstrated for RK2(3),
[A]
(d) 11 using the code from the solution of exercise 1.
(e) 12a
[y,x]=rk23(f,0,5,4,.0001,1000);
[A]
(f) 12b plot(x,y)
(g) 12c
(h) 12a 19. The initial value problem
(i) 12b y0 = ln(x + y)
(j) 13 1
y(0) =
(k) 14 2
can not be solved analytically. The solution must be
17. Consider the initial value problem approximated. Apply the indicated method to com-
pute y(5) using tolerance 10−4 and an initial step
2
+ y2 1
y0 = −x size 10 . Is the global error (the error in approximat-
2xy ing y(5)) around 10−4 ? significantly smaller? sig-
y(1) = 1. nificantly larger? Accurate to 10 significant digits,
(a) Use your code from exercise 5 on page 226 (Heun’s y(5) = 6.409445034. [A]
third order method) to estimate y(2) with step (a) Cash-Karp (exercise 11)
size 0.01.
(b) Bogacki-Shampine (exercise 12b)
(b) Use your code from exercise 6 on page 226 (RK4)
to estimate y(2) with step size 0.01. (c) Merson (exercise 14)
(c) Compare the results of parts (a) and (b). You (d) RK2(3) (exercise 1)
should notice that they are rather different. The
rest of this exercise explores the reason for the 20. Modify the code you used in exercise 19 to count
discrepancy. the number of function evaluations performed. Which
method was most efficient? The method with the
(d) Use your code from exercise 1 (rk2(3)) to estimate
fewest evaluations was the most efficient. [A]
y(2) with tolerance 0.001 and maximum number
of steps 1000. 21. There are many embedded methods not mentioned
(e) Use your code from any of the parts of exercise 12 in this text, mostly of high order. Look some of
to estimate y(2) with tolerance 0.001 and maxi- them up, write code to implement them, and test your
mum number of steps 1000. code. In particular, you may look for the methods of
Fehlberg, Verner, or Dormand & Prince.
(f) You should have found that the method fails in
both parts (d) and (e). However, if you look at the 22. The Cash-Karp RK4(5) method [8] was designed to
last calculated values of x and y anyway (x(1001) contain embedded methods of all orders from 1 through
and y(1001)), you should find that in both cases, 5, not just orders 4 and 5. Show that the three em-
x ≈ 1.648 and y ≈ 0. The failure to approxi- bedded methods given in the Butcher tableau have the
mate y(2) is not a shortcoming of the numerical indicated orders.
method. The solution of the initial value problem
√ 0
only exists over the interval [1, e) ≈ [1, 1.648).
1 1
For dependable results, care must be taken that 5 5
the solution of the o.d.e. exists and is unique over 3 3 9
10 40 40
the entire interval from a to b. That said, the ba- 3 3 9 6
sic (non-adaptive) solvers plow right along and 5 10
− 10 5
give an approximation for y(2) that is entirely in- 19
54
0 − 10
27
55
54
Order 3
correct. Without some further analysis, you may
− 32 5
0 0 Order 2
not notice that the basic solvers are producing 2

bogus information. On the other hand, the adap- 1 0 0 0 Order 1


tive solvers give some clue as to what is going on
Solutions to Selected Exercises

Section 1.1
3a: |p̃ − p| = |1106/9 − 123| = 1/9 ≈ 0.111

3c: |p̃ − p| = |1000 − 2^10| = |1000 − 1024| = 24

3e: |p̃ − p| = |10^−4 − π^−7| = |0.0001 − 1/π^7| ≈ 2.3109(10)^−4, using the Octave command
abs(10^-4-pi^-7).

4a: |p̃ − p|/|p| = |1106/9 − 123|/123 = 1/1107 ≈ 9.03(10)^−4

4c: |p̃ − p|/|p| = |1000 − 2^10|/2^10 = 3/128 ≈ 0.0234

4e: |p̃ − p|/|p| = |10^−4 − π^−7|/π^−7 = 1 − π^7/10000 ≈ 0.69797, using the Octave command
abs(10^-4-pi^-7)/pi^-7.

5a: log|p/(p̃ − p)| = log(123/|1106/9 − 123|) ≈ 3.0

5c: log|p/(p̃ − p)| = log(2^10/|1000 − 2^10|) ≈ 1.6

5e: log|p/(p̃ − p)| = log(π^−7/|10^−4 − π^−7|) ≈ 0.15616, using the Octave command
log(pi^-7/abs(10^-4-pi^-7))/log(10).

10a: f (2) = esin(2) . In Octave: exp(sin(2)), which gives 2.4826.


10c: f (2) = tan−1 (2 − 0.429). In Octave: atan(2-0.429), which gives 1.0039.
12a: We need to find p̃ such that |p̃ − π| = 0.001, so p̃ − π = ±0.001, so p̃ = π ± 0.001. There are two possible
solutions, π − 0.001 ≈ 3.14059 and π + 0.001 ≈ 3.14259.
12c: We need to find p̃ such that |p̃ − ln(3)| = 0.001, so p̃ − ln(3) = ±0.001, so p̃ = ln(3) ± 0.001. There are two
possible solutions, ln(3) − 0.001 ≈ 1.09761 and ln(3) + 0.001 ≈ 1.09961.

12e: We need to find p̃ such that |p̃ − 10/ln(1.1)| = 0.001, so p̃ − 10/ln(1.1) = ±0.001, so p̃ = 10/ln(1.1) ± 0.001. There are two possible solutions, 10/ln(1.1) − 0.001 ≈ 104.91958 and 10/ln(1.1) + 0.001 ≈ 104.92158.


13a: We need to find p̃ such that |p̃ − π|/π = 0.001, so p̃ − π = ±0.001π, so p̃ = π(1 ± 0.001). There are two possible solutions, π(0.999) ≈ 3.13845 and π(1.001) ≈ 3.14473.

13c: We need to find p̃ such that |p̃ − ln(3)|/ln(3) = 0.001, so p̃ − ln(3) = ±0.001 ln(3), so p̃ = ln(3)(1 ± 0.001). There are two possible solutions, ln(3)(0.999) ≈ 1.09751 and ln(3)(1.001) ≈ 1.09971.

13e: We need to find p̃ such that |p̃ − 10/ln(1.1)|/(10/ln(1.1)) = 0.001, so p̃ − 10/ln(1.1) = ±0.001 · 10/ln(1.1), so p̃ = (10/ln(1.1))(1 ± 0.001). There are two possible solutions, (10/ln(1.1))(0.999) ≈ 104.81566 and (10/ln(1.1))(1.001) ≈ 105.02550.

Section 1.2
1a: From Taylor's theorem, T_3(x) = \sum_{k=0}^{3} f^{(k)}(x_0)/k! (x − x_0)^k = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)/2! (x − x_0)^2 + f'''(x_0)/3! (x − x_0)^3 for any function f with enough derivatives. So to find T_3(x), we need to evaluate f, f', f'', f''' at x_0 = 0. To that end, f(x) = sin(x), so f'(x) = cos(x), f''(x) = − sin(x), and f'''(x) = − cos(x). Therefore, f(x_0) = sin(0) = 0, f'(x_0) = cos(0) = 1, f''(x_0) = − sin(0) = 0, and f'''(x_0) = − cos(0) = −1. Substituting this information into the formula for T_3(x), we have

T_3(x) = 0 + 1 · (x − 0) + (0/2!)(x − 0)^2 + (−1/3!)(x − 0)^3
       = x − x^3/6.

Also from Taylor's Theorem, we know R_3(x) = f^{(4)}(ξ)/4! (x − x_0)^4 for any function f with enough derivatives. So we need to evaluate f^{(4)}(x) at x = ξ. To that end, f^{(4)}(x) = sin(x) so f^{(4)}(ξ) = sin(ξ). Hence,

R_3(x) = (sin(ξ)/24) x^4.
1c: From Taylor's theorem, T_3(x) = \sum_{k=0}^{3} f^{(k)}(x_0)/k! (x − x_0)^k = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)/2! (x − x_0)^2 + f'''(x_0)/3! (x − x_0)^3 for any function f with enough derivatives. So to find T_3(x), we need to evaluate f, f', f'', f''' at x_0 = π. To that end, f(x) = sin(x), so f'(x) = cos(x), f''(x) = − sin(x), and f'''(x) = − cos(x). Therefore, f(x_0) = sin(π) = 0, f'(x_0) = cos(π) = −1, f''(x_0) = − sin(π) = 0, and f'''(x_0) = − cos(π) = 1. Substituting this information into the formula for T_3(x), we have

T_3(x) = 0 + (−1) · (x − π) + (0/2!)(x − π)^2 + (1/3!)(x − π)^3
       = π − x + (x − π)^3/6.

Also from Taylor's Theorem, we know R_3(x) = f^{(4)}(ξ)/4! (x − x_0)^4 for any function f with enough derivatives. So we need to evaluate f^{(4)}(x) at x = ξ. To that end, f^{(4)}(x) = sin(x) so f^{(4)}(ξ) = sin(ξ). Hence,

R_3(x) = (sin(ξ)/24) (x − π)^4.

8: (a) 1 (b) 0.87760 (c) 0.54167 (d) 0.12391

octave:1> f=inline(’1-x^2/2+x^4/24’)
f = f(x) = 1-x^2/2+x^4/24
octave:2> f(0)
ans = 1
octave:3> f(1/2)
ans = 0.87760
octave:4> f(1)
ans = 0.54167
octave:5> f(pi)
ans = 0.12391

10: taylorExercise.m:

f=inline(’1-x^2/2+x^4/24’);
f(0)
f(1/2)
f(1)
f(pi)

Running taylorExercise.m:

octave:1> taylorExercise
ans = 1
ans = 0.87760
ans = 0.54167
ans = 0.12391

P2 00
26: (a) From Taylor’s theorem, T2 (x) = k=0 f k!(x0 ) (x − x0 )k = f (x0 ) + f 0 (x0 ) · (x − x0 ) + f 2! (x0 )
(k)
· (x − x0 )2 for
any function f with enough derivatives. So to find T2 (x), we need to evaluate f , f , and f at x0 = 5. To
0 00

that end, f (x) = x1 , so f 0 (x) = − x12 , and f 00 (x) = x23 . Therefore, f (x0 ) = 15 , f 0 (x0 ) = − 25
1
, and f 00 (x0 ) = 1252
.
Substituting this information into the formula for T2 (x), we have

1 1 2/125
 
T2 (x) = + − · (x − 5) + · (x − 5)2
5 25 2!
1 x − 5 (x − 5)2
= − + .
5 25 125

(b) From Taylor’s Theorem, R2 (x) = f 3!(ξ) (x − x0 )3 for any function f with enough derivatives. So we need
(3)

to evaluate f (x) at x = ξ. To that end, f (x) = − x64 so f (ξ) = − ξ64 . Hence,


000 000 000

−6/ξ 4
R2 (x) = (x − 5)3
6
(x − 5)3
= − .
ξ4

(1−5) 2
(9−5) 2
(c) f (1) ≈ T2 (1) = 51 − 1−5 1 4 16
25 + 125 = 5 + 25 + 125 =
61
125 and f (9) ≈ T2 (9) = 15 − 9−5 1 4 16
25 + 125 = 5 − 25 + 125 =
21
125
64
(d) The bounds are 64 and 625 respectively. According to Taylor’s Theorem, the absolute error |f (x)−T2 (x)| =
|R2 (ξ)| for some ξ strictly between x and x0 . So we can obtain a theoretical bound by bounding |R2 (x)| over
64
all values of ξ between x and x0 . For x = 1, R2 (x) = − (1−5)
3

ξ4 = − 64ξ 4 . Hence, |f (1) − T2 (1)| ≤ ξ∈[1,5]


max 4 .
ξ
Since 64ξ4 is a decreasing function of ξ over the interval from 1 to 5, its maximum value is obtained at ξ = 1.
64
Finally, we can conclude |f (1) − T2 (1)| ≤ 64. Similarly, |f (9) − T2 (9)| ≤ max 4 . We get a much smaller
ξ∈[5,9] ξ
bound, though, since we are finding our bound over the interval from 5 to 9. |f (9) − T2 (9)| ≤ 64 64
54 = 625 .
64 64
(e) The bounds are 625 and 6561 respectively. Just as we can find an upper bound on the absolute error, we
can find a lower bound. The same analysis applies up to the point where we maximized the remainder term
over an interval of ξ values. The only change is that we now must minimize this function over the interval.
64 64
So |f (1) − T2 (1)| ≥ min 4 and |f (9) − T2 (9)| ≥ min 4 . Since 64 ξ 4 is a decreasing function of ξ over the
ξ∈[1,5] ξ ξ∈[5,9] ξ
interval from 1 to 5 (and over the interval from 5 to 9), its minimum value is obtained at the right endpoint.
So |f (1) − T2 (1)| ≥ 64 64 64 64
5 4 = 625 and |f (9) − T2 (9)| ≥ 94 = 6561 .
1 61 64 64 64
(f) |f (1) − T2 (1)| = 1 − 125 = 125 = 0.5120. Indeed 625 ≤ 125 ≤ 64. |f (9) − T2 (9)| = 19 − 125 21 64
= 1125


64 64 64
.0568. Indeed 625 ≤ 1125 ≤ 6561 .
(g)

[Figure: graphs of T2(x) and f(x) for 1 ≤ x ≤ 9.]

30b: Perhaps it may initially come as a surprise, but we do not need to find T4 (x) in order to answer this question.
The matter of error is entirely taken up by the remainder term. So we need only calculate R4 (x). This does,
however, require us to find the first 5 derivatives of f (x):
2
f (x) = e−x
2
f 0 (x) = −2xe−x
2 2
f 00 (x) = −2e−x + (−2x)(−2xe−x )
2(2x2 − 1)e−x
2
=
2[4xe−x + (2x2 − 1)(−2xe−x )]
2 2
f 000 (x) =
−4(2x3 − 3x)e−x
2
=
f (4) (x) −4[(6x2 − 3)e−x + (2x3 − 3x)(−2xe−x )]
2 2
=
= −4(−4x4 + 12x2 − 3)e−x
2

f (5) (x) −4[(−16x3 + 24x)e−x + (−4x4 + 12x2 − 3)(−2xe−x )]


2 2
=
= −8(4x5 − 20x3 + 15x)e−x
2

2
+15ξ)e−ξ
Now, R4 (x) = f 5!(ξ) x5 = −8(4ξ −20ξ
(5) 5 3 5
x5 = x15 4ξ 5 − 20ξ 3 + 15ξ e−ξ . For any given value of x, we
 2

120
are faced with maximizing the absolute value of this expression over all ξ between 0 and x. We may ignore the
x5 5 3 −ξ 2
15 factor which is independent of ξ, and focus on finding extrema of 4ξ − 20ξ + 15ξ e . Sometimes, at
this point, the expression requiring optimization is easy enough to handle using standard calculus techniques—
finding critical points and evaluating. However, in this case, that would involve finding the roots of a sixth
degree polynomial. Ironically, techniques we will learn later in this course would be helpful right now, but as
it is, we have no way to do that in  general. The best we can do is have a look at a graph and hope it helps.
Letting g(ξ) = (4ξ^5 − 20ξ^3 + 15ξ) e^{−ξ^2}, we proceed by graphing g(ξ):

[Figure: graph of g(ξ) for −4 ≤ ξ ≤ 4.]

With the goal of maximization in mind, it makes sense to take note of the relative extrema. The function
appears to have 6 relative extrema and seems to approach zero as ξ approaches ±∞. To confirm that these
observations are facts, we start by calculating g 0 (ξ) = −(8ξ 6 − 60ξ 4 + 90ξ 2 − 15). Since a sixth degree
polynomial has at most 6 distinct roots, g has at most 6 relative extrema. Since we can see 6 relative extrema
on the graph, there are no others. Also,

lim −(8ξ 6 − 60ξ 4 + 90ξ 2 − 15) = 0


ξ→±∞

since the exponential factor dominates the polynomial factor. We would possibly not have thought to consider
these two facts if it were not for the graph. But there’s more. The graph appears to be odd. Again, we can
verify that this is indeed the case:

4(−ξ)5 − 20(−ξ)3 + 15(−ξ) e−(−ξ)


2
=

g(−ξ)
−(4ξ 5 − 20ξ 3 + 15ξ)e−ξ
2
=
= −g(ξ).

Due to this symmetry, we may focus on finding extrema for positive values of ξ. And since we are ultimately
interested in maximizing |g|, it is a good time to consider the graph of |g(ξ)| over ξ ∈ [0, 4]:

4.5
|g|
4

3.5

2.5

1.5

0.5

0
0 0.5 1 1.5 2 2.5 3 3.5 4
.

Finally, we can tackle the maximization. The relative maximum, marked with a red plus, will be the key to
ˆ g(ξ)).
the answer. Let the coordinates of this point be (ξ, ˆ Then, since |g(ξ)| is increasing on the interval from
ˆ we can conclude that
0 to ξ,
max |g(ξ)| = |g(x)| = g(x)
ξ∈[0,x]

ˆ Moreover,
for all x between 0 and ξ.

ˆ = g(ξ)
max |g(ξ)| = |g(ξ)| ˆ
ξ∈[0,x]

ˆ By symmetry, we can conclude that max |g(ξ)| = g(x) for x between −ξˆ and 0, and
for all x ≥ ξ.
ξ∈[x,0]
ˆ for all x ≤ −ξ.
max |g(ξ)| = g(ξ) ˆ Putting it all together,
ξ∈[x,0]

if |x| < ξˆ
(
x5
|T4 (x) − f (x)| = |R4 (x)| ≤ 15 g(x) .
x5 ˆ if |x| ≥ ξˆ
15 g(ξ)

Granted, we do not know the values of ξˆ or g(ξ),


ˆ but we can approximate them using a graphing calculator:
ˆ g(ξ))
(ξ, ˆ ≈ (.43607, 4.0892).

Section 1.3
n+1 n+1
1/3e 1/3e
1b: We need to find α such that limn→∞ (1/3en )α
= λ for some λ 6= 0. So, taking a close look at (1/3en )α
should
help:
n
3e α
n+1
1/3e
α =
(1/3en ) 3en+1
n
3αe
=
3en+1
n
3αe
= .
3e·en
en+1 en+1
1/3 1/3
Consequently, if α = e, then (1/3 en )α = 1, from which if follows that limn→∞ (1/3en )α = 1. Therefore, the

order of convergence is α = e.
n+1 n+1
22 −2 −1 22 −2 −1
22n+1 +3 22n+1 +3
1c: We need to find α such that limn→∞ 22n −2 α = λ for some λ 6= 0. So, taking a close look at 22n −2 α
n −1 n −1
22 +3 22 +3
should help:
2n+1 2n+1 n+1

2 2 22 +3
22n+1 −2 − 1 −2


n+1 n+1
+3 22 +3 22 +3

n α = n n
α
22 −2 22 −2 22 +3
22n +3 − 1 22n +3 − 22n +3


−5
22n+1 +3
=
−5 α

22n +3
2n

1−α 2 +3
= 5 .
22n+1 + 3
If α = 2, the leading terms in both numerator and denominator of the resulting fraction will match. This is
strong evidence that α = 2 is the right choice. Let’s try it:
2n
2
1−2 2 +3 1 22·2 + 6 · 22 + 3
n n

5 = ·
22n+1 + 3 5 22n+1 + 3
1 22 + 6 · 22 + 3
n+1 n

= ·
5 22n+1 + 3
n n+1
1 1 + 6 · 2−2 + 3 · 2−2
= · .
5 1 + 3 · 2−2n+1
In the last step, we have divided both numerator and denominator by 22
n+1
to make taking the limit as n
approaches ∞ simple:
2n+1
2
22n+1 −2 − 1 n n+1

+3
1 1 + 6 · 2−2 + 3 · 2−2
lim 2 = lim ·
n→∞ 22n −2 n→∞ 5 1 + 3 · 2−2n+1
22n +3 − 1

1
= .
5
So, the order of convergence is α = 2.

6c: To begin, we are looking for a function of the form C
that will be at least as great as sin
or the form K
for
√ n
np an n
large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution
is to note that | sin n| ≤ 1 for all n:
sin n | sin n| 1 1

√ = √ ≤ √ = 1/2 .
n n n n
1
Since this inequality will not hold for any higher power of n, the rate of convergence is O n1/2 .



4
6d: To begin, we are looking for a function of the form C
or the form K
that will be at least as great as 10n +35n+9

np an
for large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution
is to note that 10n + 35n + 9 > 10n for all n:
4 4 4

10n + 35n + 9 = 10n + 35n + 9 ≤ 10n .

Since this inequality will not hold for any base greater than 10, the rate of convergence is O 101n .


4
6e: To begin, we are looking for a function of the form nCp or the form aKn that will be at least as great as 10n −35n−9


for large n. In the end, though, we want the smallest such function (up to a constant). The key to the solution
is dealing with the fact that 10n − 35n − 9 < 10n for all n:
4 8

10n − 35n − 9 = 2 · 10n − 70n − 18

8
=
+ 10n (10n − 70n − 18)
8

10n
for sufficiently large n since 10 − 70n − 18 ≥ 0 for all large n. Since no similar inequality will hold for any
n

base greater than 10, the rate of convergence is O 101n . Notice we have the same rate of convergence as in
question 6d even though we ended up with a larger constant. The rate of convergence is not dependent on
the constant needed in the inequality.
2
6k: To begin, we are looking for a function of the form nCp or the form aKn that will be at least as great as 2nn

for large n. In the end, though, we want the smallest such function (up to a constant). Let 2 > ε > 0 be
2
1 n2 1
arbitrary. Notice that 2nn ≤ (2−ε) n for large n by rearranging the inequality like so: 2n ≤ (2−ε)n if and only
  n
2n 2
if n2 ≤ (2−ε) n if and only if n
2
≤ 2−ε . We know this last inequality to be true for sufficiently large n
2
because > 1, and exponential functions dominate polynomial functions. Hence, we can use any rate of
2−ε  
1
convergence of the form O (2−ε)n , but there is no smallest such function. Hence, we are left simply using
 2
O 2n as the rate of convergence.
n

13: One possible .m file is:


for j=0:9
disp(7^j)
end%for
15: One possible .m file is:
f=inline(’(2^(2^x)-2)/(2^(2^x)+3)’);
n=[0,1,2,4,6,10];
for i=1:6
disp(f(n(i)))
end%for
19b: For a sequence with linear order of convergence, we know the number of significant digits increases by approx-
imately − log λ with each iteration, so we need to find the smallest k such that 1 − k log(0.5) ≥ 12. Solving
the equation 1 − k log(0.5) = 12 for k:
1 − k log(0.5) = 12
−k log(0.5) = 11
k = 11/(− log(0.5)) ≈ 36.54.

Therefore, it will take 37 iterations, using the rule of thumb. Remember, this estimate is only good as long as |p_{n+1} − p|/|p_n − p| ≈ λ. So, if the actual value of the ratio is significantly different from λ, the estimate of 37 iterations could be significantly off.
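
A quick Octave check of this rule-of-thumb count (an illustrative one-liner, not part of the original solution):

ceil(11/(-log10(0.5)))   % 37 iterations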

Section 1.4
8: (a) trominos.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% trominos() written by Leon Q. Brin 14 February 2013 %
% is a recursively defined function for %
% calculating the number of trominos needed to %
% cover an n X n grid of squares, save one corner %
% INPUT: nonnegative integer n. %
% OUTPUT: T(n) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = trominos(n)
if (n==0)
ans = 0;
else
ans = 1+4*trominos(n-1);
end%if
end%function

(b)

octave:1> trominos(10)
ans = 349525

9: (a) 7. Follow this sequence of moves:

(b) i. Consider the following set of moves.

This demonstrates that the 4-disk game can be completed by completing the 3-disk game twice (the first and
last moves) plus one extra move (moving the bottom disk). There is no quicker way to do it because the top 3
disks must be moved off the bottom one before the bottom one can move. Then the bottom one must move,
and must take at least one move. Then the three top disks must be put back on top of the bottom disk. Since
we already know the minimum number of moves to move a stack of 3 disks, this diagram shows a minimum
number of moves to complete the 4-disk game.
ii. It takes a minimum of 2 · 7 + 1, or 15, moves to complete the 4-disk game.

10: (a) One—just move the disk to another peg.


(b) hanoi.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% hanoi() written by Leon Q. Brin 14 February 2013 %
% is a recursively defined function for %
% calculating the number of moves needed to %
% complete the Tower of Hanoi with n disks. %
% INPUT: positive integer n. %
% OUTPUT: H(n) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ans = hanoi(n)
if (n==1)
ans = 1;
else
ans = 1+2*hanoi(n-1);
end%if
end%function

(c)

octave:1> hanoi(10)
ans = 1023

12a: This is asking for the number of ways to partition a set of 10 elements into a single nonempty subset. There
is only one way since there is only one subset allowed. That is, the “partition” contains just the set itself. So,
S(10, 1) = 1.

12d: This question is asking for the number of ways to partition a set of 4 elements into two nonempty subsets.
As implied by the question, the actual elements of the set are immaterial, so we can work with any set of
four elements and arrive at the correct answer. Consider the set {α, β, γ, δ}. The list of all partitions can be
categorized into those where one of the subsets has 1 element, one of the sets has 2 elements, or one of the
sets has 3 elements. One does not have a partition of nonempty subsets if one of the sets contains 0 or 4
elements. Here is the list of partitions where one of the sets has exactly one element:

{{α}, {β, γ, δ}}, {{β}, {α, γ, δ}}, {{γ}, {α, β, δ}}, {{δ}, {α, β, γ}}

Note that this is also the list of all partitions where one of the sets has exactly three elements. Here is the
list of partitions where one of the sets has exactly two elements (and, therefore, the other set also has two
elements):
{{α, β}, {γ, δ}}, {{α, γ}, {β, δ}}, {{α, δ}, {β, γ}}
There are no other partitions. Since we have listed 7 partitions, S(4, 2) = 7.

13: (a) S(n, 1) is the number of ways to partition a set of n elements into 1 nonempty subset. Of course, this is 1.
The only such partition contains the set itself.
(b) S(n, n) is the number of ways to partition a set of n elements into n nonempty subsets. Since the set
contains only n elements and we need to divide them among n subsets, each subset of the partition must
contain exactly one element, thus forming a partition of singleton sets. Since order does not matter in a
partition, there is only one way to do this. Thus, S(n, n) = 1.

16: 987. If we take a stack that is n − 1 inches high and add a block that is 1 inch high, we have a stack that is
n inches high with the top block being 1 inch tall. If we take a stack that is n − 2 inches high and add a
block that is 2 inches high, we have a stack that is n inches high with the top block being 2 inches tall. Any
stack created by adding a 1-inch block to a stack that is n − 1 inches tall is necessarily different from a stack
created by adding a 2-inch block to a stack that is n − 2 inches tall since the top blocks are different. Now, if
we take all the stacks that are n − 1 inches high and add 1-inch blocks to them, we have all the stacks that
are n inches high and have a 1-inch block on top. And if we take all the stacks that are n − 2 inches high
and add 2-inch blocks to them, we have all the stacks that are n inches high and have a 2-inch block on top.

There are no other n-inch high stacks since any such stack will either have a 1-inch block or a 2-inch block on
top. Therefore, the number of n-inch high stacks is just the number of (n − 1)-inch stacks plus the number of
(n − 2)-inch stacks. Of course, this doesn’t make sense for n = 1 or n = 2, so we need to specify that there
is exactly 1 way to create a stack of blocks 1 inch high (one 1-inch block), and there are exactly two ways to
create a stack of blocks 2 inches high (two 1-inch blocks or one 2-inch block). Now we can use the recursive
answer to find out how many ways of building taller stacks. The number of 3-inch stacks is the number of
2-inch stacks plus the number of 1-inch stacks, or 2 + 1 = 3. The number of 4-inch stacks is the number of
3-inch stacks plus the number of 2-inch stacks, or 3 + 2 = 5. The number of 5-inch stacks is the number of
4-inch stacks plus the number of 3-inch stacks, or 5 + 3 = 8. Continuing this way reveals the following table:

n                         6    7    8    9    10   11   12   13   14   15
number of n-inch stacks   13   21   34   55   89   144  233  377  610  987
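
The table can be regenerated with a few lines of Octave using the recursion just described (an illustrative snippet, not part of the original solution):

% s(n) = number of n-inch stacks: s(1) = 1, s(2) = 2, s(n) = s(n-1) + s(n-2).
s = [1 2];
for n = 3:15
  s(n) = s(n-1) + s(n-2);
end%for
disp(s(6:15))   % 13 21 34 55 89 144 233 377 610 987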

Section 2.1
2c: Since g is a polynomial, it is continuous on [0, 0.9]. g(0) = 2 and g(0.9) = −.1897 so g has opposite signs on
the endpoints of [0, 0.9]. Therefore, the Intermediate Value Theorem guarantees a root on the interval [0, 0.9].

2f: The discontinuities of g are at ±1, due to the (1 − t^2) factor in the denominator, and at odd multiples of π/2, due to the (tan t) factor in the numerator. None of these discontinuities occurs in the interval [21.5, 22.5], so g is continuous on it. g(21.5) ≈ 1.6 > 0 and g(22.5) ≈ −1.6 < 0 so g has opposite signs on the endpoints of [21.5, 22.5]. Therefore, the Intermediate Value Theorem guarantees a root on the interval [21.5, 22.5]. Incidentally, the discontinuities closest to [21.5, 22.5] are 13π/2 ≈ 20.42 and 15π/2 ≈ 23.56.

3: There is no single correct table for executing the bisection method. Anything that shows successive choices of
interval and accompanying computations will do.
For g(x) = 3x4 − 2x3 − 3x + 2 on [0, 0.9]:

a g(a) b g(b) m g(m)


0 2 .9 −.1897 .45 .5907
.45 .9 .675 −.01731
.45 .675 .5625

The third iteration of the bisection method is 0.5625.


For g(t) = 3t^2 tan(t)/(1 − t^2) on [21.5, 22.5]:

a g(a) b g(b) m g(m)


21.5 1.608 22.5 −1.676 22 −.02660
21.5 22 21.75 .7393
21.75 22 21.875

The third iteration of the bisection method is 21.875.

11: The error |m_j − p| ≤ (b − a)/2^j, and we need this quantity to be less than or equal to 10^−3. So we need to solve the inequality (b − a)/2^j ≤ 10^−3 for j. b − a = 4 − 1 = 3, so we need to find j such that 3/2^j ≤ 10^−3:

ln(3/2^j) ≤ ln(10^−3)
ln(3) − ln(2^j) ≤ −3 ln(10)
ln(3) + 3 ln(10) ≤ j ln(2)
(ln(3) + 3 ln(10))/ln(2) ≤ j

So we need j ≥ (ln(3) + 3 ln(10))/ln(2) ≈ 11.55. The least integer satisfying this inequality is 12. We need 12 iterations.
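
The count is quick to confirm in Octave (illustrative only):

ceil((log(3) + 3*log(10))/log(2))   % 12 iterations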

21: sin(42 ) = sin(16) < 0 and sin(52 ) = sin(25) < 0 so the assumptions of the bisection are not met on [4, 5] as
stated. However, if the bisection method is run anyway, the first iteration will be 4.5 and sin(4.52 ) > 0. No
matter which endpoint (left or right) becomes 4.5, the assumptions of the bisection method will be met from
here on. It will work as prescribed starting with the second iteration, and, therefore, will return a root.

Section 2.2
2c: (i) g does satisfy the hypotheses of the Mean Value Theorem on [0, 0.9]. The hypotheses of the Mean Value
Theorem require a function to be continuous on the closed interval [a, b] and have a derivative on the open
interval (a, b). In this question, a = 0 and b = 0.9. Since g is a polynomial, it is continuous over all real
numbers. Therefore, g is continuous over [0, 0.9] = [a, b]. Furthermore, g 0 is a polynomial and exists over all
real numbers, so g has a derivative on (0, 0.9) = (a, b). Remark: g actually satisfies the hypotheses of the
Mean Value Theorem on any closed interval, as do all polynomials.
(ii) We need to find c such that g'(c) = (g(b) − g(a))/(b − a). To begin, g'(x) = 12x^3 − 6x^2 − 3, g(0) = 2, and g(0.9) = 3(.9)^4 − 2(.9)^3 − 3(.9) + 2 = −.1897. So we need to solve 12c^3 − 6c^2 − 3 = (−.1897 − 2)/(.9 − 0) for c:

12c^3 − 6c^2 − 3 = −2433/1000
12c^3 − 6c^2 − 567/1000 = 0.

We can not solve this equation using basic techniques of algebra since the cubic does not factor. However, we know the solution is between 0 and 0.9, so we can apply the bisection method to get an answer! Using Octave with a tolerance of 10^−10, we get

ans = 0.622093084518565.

2f: g does not satisfy the hypotheses of the Mean Value Theorem on [20, 23]. The discontinuities of g are at ±1 due
to the (1 − t2 ) factor in the denominator and at odd multiples of π2 due to the (tan t) factor in the numerator.
The discontinuity at 13π 2 ≈ 20.42 is in the interval [20, 23], so g is not continuous over the given interval.

3h: We are asked to find the fixed points of h. By definition, a fixed point of h satisfies the equation h(x) = x, so we are looking for all such values. h(x) = x − 10 + 3^x + 25 · 3^−x so we need to solve x − 10 + 3^x + 25 · 3^−x = x:

x − 10 + 3^x + 25 · 3^−x = x
−10 + 3^x + 25 · 3^−x = 0
3^x − 10 + 25 · 3^−x = 0
3^x · 3^x − 3^x · 10 + 3^x · 25 · 3^−x = 0
(3^x)^2 − 10 · 3^x + 25 = 0.

(3^x)^2 − 10 · 3^x + 25 is quadratic in 3^x so we can try to factor. This quadratic does factor:

(3^x − 5)^2 = 0
3^x − 5 = 0
3^x = 5
log_3 3^x = log_3 5
x = log_3 5.

Therefore, there is one fixed point of h, x = log_3 5 ≈ 1.465.
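
In Octave the fixed point can be evaluated directly (an illustrative one-liner):

log(5)/log(3)   % log base 3 of 5, approximately 1.4650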


4d: We are looking for roots of g(x) = x2 − e3x+4 , so we need to solve the equation x2 − e3x+4 = 0 for x. In
order to do so with a fixed point method, we need to manipulate this equation into one of the form f (x) = x
using algebra. The simplest way is to add x to both sides. This gives us x + x2 − e3x+4 = x, so we may take
f1 (x) = x + x2 − e3x+4 . Another way to transform the equation x2 − e3x+4 = 0 is to “solve” for the x in the
x2 term. Adding
√ e3x+4 to both sides, we have x2 = e3x+4 and now applying the square root to both sides we
have |x| = e 3x+4 or x = ±e(3x+4)/2 . We may now set f2 (x) = e(3x+4)/2 or f2 (x) = −e(3x+4)/2 .

Remark: We can also “solve” for the x in the exponential:

x2 − e3x+4 = 0
2
x = e3x+4
ln x2 = ln(e3x+4 )
2 ln x = 3x + 4
2 ln x − 4 = 3x
2 ln x − 4
= x.
3
2 ln x−4
This gives another candidate function, f3 (x) = 3 .
Remark: There are always infinitely many ways to turn the equation g(x) = 0 into an equation of the form
f (x) = x. We can multiply both sides by any nonzero real number, c, and then add x to both sides.
This gives the infinitely many candidates fc (x) = x + cg(x).
Remark: See question 20 for another infinite set of candidates.

5b: We are asked to calculate the first 5 iterations of the fixed point iteration method applied to g(x) = 10 + x −
cosh(x) beginning with (initial value) x0 = −3. We have to apply g to x0 , then apply g to the result to get a
new result, then apply g to the new result to get a newer result, then apply g to the newer result to get yet
another result, and so on, until we have 5 results:

x0 = −3
x1 = g(x0 ) = 10 − 3 − cosh(−3) ≈ −3.067661995777765
x2 = g(x1 ) = 10 + x1 − cosh(x1 ) ≈ −3.836725126419593
x3 = g(x2 ) = 10 + x2 − cosh(x2 ) ≈ −17.03418648356706
x4 = g(x3 ) = 10 + x3 − cosh(x3 ) ≈ −12497508.54310043
x5 = g(x4 ) = 10 + x4 − cosh(x4 ) ≈ ’floating point overflow’

So the first 5 iterations are (approximately) −3.067, −3.836, −17.03, −1.249(10)7 , and a floating point error.
It does not look like fixed point iteration is converging on a fixed point. The numbers are getting larger in
magnitude with each iteration.

Remark: Calculators and computers using standard floating point arithmetic will not be able to calculate
cosh(−12497508.54310043) because it is too big! Thus the overflow. It does not mean it can not be
calculated. It’s just too large for a floating point calculator. Using a computer algebra system with
capability to handle such numbers, we find that

x5 ≈ −4.97(10)5427598 .

x5 has over 5 million digits to the left of the decimal point! Indeed, the magnitude of each iteration is
greater than the last.

6b: Using Octave with a properly programmed fixed point iteration function, we get the following:

fixedPointIteration(inline(’10+x-cosh(x)’),-3,1e-10,100)
ans = Method failed---maximum number of iterations reached

The method does not converge in 100 iterations.

Remark: As we find out in question 5b, this iteration causes an overflow in just 5 iterations.

7b: The web diagram will look something like this:



Remark: The line y = x is not set at a 45◦ angle because the aspect ratio of the graph is not 1 : 1. The
y-axis covers a length of 20, from −20 to 0 while the x-axis covers a length of only 3, from −5 to −2.

10: (a) To establish that f has a unique fixed point on [−4, −.9], we will show that f is continuous on [−4, −.9],
f ([−4, −.9]) ⊆ [−4, −.9] and |f 0 (x)| ≤ 1 for all x ∈ (−4, −.9). Proposition 3 gives us the result.

(i) f is continuous on [−4, −.9] because its only discontinuity is at x = − 32 , where the denominator, 6x + 4,
is zero, and − 23 ≈ −.6666 is not in [−4, −.9].

18x +24x+6
2 3(x+1)(3x+1)
(ii) We find the absolute extrema of f over [−4, −.9]. f 0 (x) = 36x 2 +48x+16 = 2(3x+2)2 has zeroes at
x = −1 and x = − 13 and is undefined at x = − 32 . The only relevant critical value is −1, so we check
f (−4) = − 47 143
20 = −2.35, f (−1) = −1, and f (−.9) = − 140 ≈ −1.021. Hence, f ([−4, −.9]) ⊆ [−2.35, −1] ⊆
[−4, −0.9]. Remark: For many functions, we can be happy enough with visual evidence or at least use
the graph to verify our conclusions. In this question, the graph of f for both x and y values from −4 to
−.9 looks like

The graph of the function does not leave the view through the top (no values greater than −.9) or the
bottom (no values less than −4), so f ([−4, −.9]) ⊆ [−4, −.9].

(iii) We find the absolute extrema of f 0 over [−4, −.9]. f 00 (x) = 27x3 +54x32 +36x+8 = (3x+2)
3
3 has no zeroes and
2 99
is undefined only at x = − 3 . There are no relevant critical values, so we check f (−4) = 200
0
= 0.495 and
f (−.9) = − 98 ≈ −.5204. Hence, − 98 ≤ f (x) ≤ 200 for all x ∈ (−4, −.9), which means |f 0 (x)| ≤ 51
0 51 51 0 99
98 < 1
for all x ∈ (−4, −.9). Remark: As with check (ii), we can be happy enough with visual evidence or
at least use the graph to verify our conclusions. In this question, the graph of f 0 for x ∈ [−4, −.9] and
y ∈ [−1, 1] looks like

.
The graph of the function does not leave the view through the top (no values greater than 1) or the
bottom (no values less than −1), so |f 0 (x)| < 1 for all x ∈ (−4, −.9).

(b) Using the fixed point iteration method as described in the text with tolerance 10−2 and x0 = −4, we get
x6 = −1.00000176319, and we presume this is accurate to within 10−2 of the actual fixed point. Remark:
Since we don’t have a dependable way to calculate the error, it is possible that the final answer will not be
within tolerance of the actual root. In this case, though, the actual fixed point is −1, so we are well within
bounds.

12: First, f (x) = 3 8 − 4x = x =⇒ 8 − 4x = x3 =⇒ x3 + 4x − 8 = 0, so any fixed point of f is a root of g. It
remains to show that the fixed point iteration method will converge to a fixed point of f for any initial value
x0 ∈ [1.2, 1.5]. According to the Fixed Point Convergence Theorem, we need to establish that [1.2, 1.5] is a
neighborhood of a fixed point in which the magnitude of the derivative is less than 1.

(i) To
qestablish that there is a fixed point in [1.2, 1.5], note that f is continuous and that f (1.2) − 1.2 =
3 16

5 − 1.2 ≈ .27 > 0 and f (1.5) − 1.5 = 2 − 1.5 ≈ −.24 < 0. The Intermediate Value Theorem
3

guarantees there will be a value c ∈ (1.2, 1.5) such that f (c) − c = 0, or f (c) = c.
(ii) We need to establish that the magnitude of the derivative of f is less than 1 for all x ∈ (1.2, 1.5).
4 32
f 0 (x) = − 3(8−4x) 2/3 and f (x) = − 9(8−4x)5/3 . Since f (x) < 0 for all x ∈ (1.2, 1.5), we know f
00 00 0
is
decreasing over this interval. For this reason and the fact that f 0
(x) < 0 for all x ∈ (1.2, 1.5), we know
√ 3

|f 0 (x)| is bounded by |f 0 (1.5)| = − 2 3 2 ≈ .84 < 1.

This completes the proof.

Section 2.3
5: Because there is no particular pattern to the values n is to take, we will store the six values in an array. Then
we will loop over the array to get the values of f .

n=[0,1,2,4,6,10];
f=inline(’(2^(2^x)-2)/(2^(2^x)+3)’);
i=1;
while (i<7)
disp(f(n(i)));
i=i+1;
end%while

produces the following output:

0
0.285714285714286
0.736842105263158

0.999923709546987
1
NaN

Remark: We can avoid the NaN, read “Not a Number”, on the sixth value by rewriting the function as the
algebraically equivalent f=inline(’(1-2*2^-(2^x))/(1+3*2^-(2^x))’);. With this one change to the
above program, the following output is produced:
0
0.285714285714286
0.736842105263158
0.999923709546987
1
1
This works because 2^(2^10), which equals 21024 , produces an overflow while 2^-(2^10), which equals
2−1024 , evaluates to 0. 21024 ≈ 1.8(10)308 is too big to be represented as a standard floating point value.

11:

(a) Proceeding according to proposition 5, we will need an initial error and a bound on the magnitude of the
derivative of f .
(i) All we know about the initial value, x0 , and the fixed point, x̂, is that they both lie in [−4, −.9], so
the best we can do for an initial error is the width of the interval. Thus we take |x0 − x̂| = 3.1.
(ii) In 10 of section 2.2, we established the fact that |f′(x)| ≤ 51/98 < 1. Hence, we have M = 51/98.
Therefore, we know |xk − x̂| ≤ 3.1 · (51/98)^k, and we need this quantity to be less than 10^−11:

3.1 · (51/98)^k < 10^−11
(51/98)^k < 1/(3.1(10)^11)
k ln(51/98) < ln(1/(3.1(10)^11))
k > −ln(3.1(10)^11)/ln(51/98) ≈ 40.51.

Hence, 41 iterations will suffice for any initial value in [−4, −.9].
Remark: The inequality must switch from < to > in the last step because we are dividing by ln(51/98),
which is negative.
(b) x0 = −4,
x1 = f (x0 ) = −2.35,
x2 = f (x1 ) ≈ −1.541336633663366,
x3 = f (x2 ) ≈ −1.167517670666227,
x4 = f (x3 ) ≈ −1.028014489100897,
x5 = f (x4 ) ≈ −1.001085950365354,
x6 = f (x5 ) ≈ −1.00000176318809, and
x7 = f (x6 ) ≈ −1.000000000004663.
It takes 7 iterations to come up with an estimate within 10−11 of the actual fixed point, −1.
(c) The theoretical bound is 41 while the actual number of iterations is 7. The bound is nearly six times the
actual! This is not a very tight bound.
Remark: The reason the bound is so loose is because the derivative at the fixed point is zero. The
estimate of proposition 5 does not account for this case where we know the convergence is quadratic
or better.
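Remark: A fixed point iteration like the one in part (b) is easy to script. The following is a minimal Octave sketch, not the text's implementation; the name fixedpoint and its argument list are choices made here. f is a handle (or inline function) for the iteration function, and the loop stops when successive iterates agree to within tol.

function [x,k] = fixedpoint(f,x0,tol,maxits)
  x=x0;
  for k=1:maxits
    xnew=f(x);
    if (abs(xnew-x)<tol)   % successive iterates agree to within tol
      x=xnew;
      return
    end%if
    x=xnew;
  end%for
end%function

Saved as fixedpoint.m, a call such as [xhat,k]=fixedpoint(f,-4,1e-11,100) carries out a computation like the one in part (b) for any iteration function f.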

16a:

n pn an
0 0.5 0.2586844276
1 0.2004262431 0.2576132107
2 0.2727490651 0.2575358323
3 0.2536071566 0.2575306600
4 0.2585503763 0.2575303107
5 0.2572656363
6 0.2575989852

20: The tenth iteration of Steffensen’s method is 0.01462973293 while the eleventh is 0.009752946539, so it takes
but 11 iterations to reach a number below 0.01. This is an incredible acceleration of convergence—from 29,992
iterations to 11.

Section 2.4
8: Newton's (fixed point iteration) method requires iteration of the function f(x) = x − g(x)/g′(x), so we need to know g′(x). The derivative of g is

g′(x) = −200 sin(10/x)/x³ − 1000 cos(10/x)/x⁴.

Therefore,

x1 = 1.25 − g(1.25)/g′(1.25) = 1.25 − [100 sin(10/1.25)/1.25²] / [−200 sin(10/1.25)/1.25³ − 1000 cos(10/1.25)/1.25⁴] ≈ 2.76794916279264

and

x2 = x1 − g(x1)/g′(x1) = x1 − [100 sin(10/x1)/x1²] / [−200 sin(10/x1)/x1³ − 1000 cos(10/x1)/x1⁴] ≈ 3.07240930016243.

Remark: Though it is not strictly needed, in its simplified form,

f(x) = x − g(x)/g′(x) = [3x² sin(10/x) + 10x cos(10/x)] / [2x sin(10/x) + 10 cos(10/x)].

Therefore, x1 = [3(1.25)² sin(10/1.25) + 10(1.25) cos(10/1.25)] / [2(1.25) sin(10/1.25) + 10 cos(10/1.25)] ≈ 2.76794916279264, and x2 may be computed using this expression as well.
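Remark: The two iterations can be checked with a few lines of Octave. This sketch assumes g(x) = 100 sin(10/x)/x², which is consistent with the expressions for g and g′ used above.

g=inline('100*sin(10./x)./x.^2');
gp=inline('-200*sin(10./x)./x.^3-1000*cos(10./x)./x.^4');
x0=1.25;
x1=x0-g(x0)/gp(x0)   % roughly 2.7679
x2=x1-g(x1)/gp(x1)   % roughly 3.0724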
10a: The formula for the secant method is

x_{n+1} = x_n − g(x_n) · (x_n − x_{n−1})/(g(x_n) − g(x_{n−1})).

When n = 1, we get x2 = x1 − g(x1)(x1 − x0)/(g(x1) − g(x0)), so in this example,

x2 = 6 − g(6) · (6 − 5)/(g(6) − g(5)) ≈ 10.15086029699136.

When n = 2, we get x3 = x2 − g(x2)(x2 − x1)/(g(x2) − g(x1)), so in this example,

x3 = x2 − g(x2) · (x2 − 6)/(g(x2) − g(6)) ≈ 8.43462052844703.

18: Since Newton’s method is a fixed point iteration method, we may use the fixed point convergence theorem to
find such an interval. As indicated in exercise 26 on page 55, though, we are guaranteed convergence over any
neighborhood of the root where the iterated function f has a derivative with magnitude less than 1. To that
end, f(x) = x − g(x)/g′(x) = x − (x⁴ + 2x³ − x − 3)/(4x³ + 6x² − 1). Hence,

f′(x) = 1 − [(4x³ + 6x² − 1)² − (12x² + 12x)(x⁴ + 2x³ − x − 3)]/(4x³ + 6x² − 1)²
      = (12x² + 12x)(x⁴ + 2x³ − x − 3)/(4x³ + 6x² − 1)².

A graph of f′ in the neighborhood of 1.097740792,

seems to indicate that |f′(x)| < 1 for all x from just about 0.9 to ∞. This is an acceptable answer, but if we would like to be more precise about the lower bound and prove our assertion, there is considerable work to do. First, the roots of 4x³ + 6x² − 1 are around −1.4, −0.5, and 0.4, so there are no asymptotes in the interval under consideration. f′ is continuous there. To locate the lower end of this interval, we solve the equation f′(x) = −1:

(12x² + 12x)(x⁴ + 2x³ − x − 3)/(4x³ + 6x² − 1)² = −1
(12x² + 12x)(x⁴ + 2x³ − x − 3) = −(4x³ + 6x² − 1)²
12x⁶ + 36x⁵ + 24x⁴ − 12x³ − 48x² − 36x = −16x⁶ − 48x⁵ − 36x⁴ + 8x³ + 12x² − 1
28x⁶ + 84x⁵ + 60x⁴ − 20x³ − 60x² − 36x + 1 = 0.
The real solutions of this equation are, in decreasing order, approximately 0.871748, 0.026590, −1.026590,
and −1.871748. A graph of 28x6 + 84x5 + 60x4 − 20x3 − 60x2 − 36x + 1 will point you in the right direction,
and Newton’s method can be used to find these roots. The one we seek is 0.871748. This value marks the
lower end of the desired interval. To verify that the interval is unbounded above, we solve f′(x) = 1:

(12x² + 12x)(x⁴ + 2x³ − x − 3)/(4x³ + 6x² − 1)² = 1
(12x² + 12x)(x⁴ + 2x³ − x − 3) = (4x³ + 6x² − 1)²
12x⁶ + 36x⁵ + 24x⁴ − 12x³ − 48x² − 36x = 16x⁶ + 48x⁵ + 36x⁴ − 8x³ − 12x² + 1
0 = 4x⁶ + 12x⁵ + 12x⁴ + 4x³ + 36x² + 36x + 1.

The real solutions of this equation are, in decreasing order, approximately −0.028593 and −0.971407. Again, a graph will point you in the right direction, and Newton's method can be used to find these roots. There are no solutions of f′(x) = ±1 greater than the root 1.097740792. We conclude that |f′(x)| < 1 for all x ∈ (0.87175, ∞), so Newton's method will converge to x̂ ≈ 1.097740792 for any initial value in (0.87175, ∞).
Finally, by looking at the graph of f (x),


we see that the interval from the asymptote around 0.4 to the root maps into the interval from the root to
infinity. Therefore, Newton’s method converges to 1.097740792 for all initial values between the asymptote
near 0.4 to 0.87175 as well. Finally, we use Newton’s method to get a more accurate value for the asymptote
near 0.4. It turns out to be 0.366025403784439, so we conclude that Newton’s method will converge to the
root x̂ ≈ 1.097740792 for any initial value in (0.36602540378444, ∞).

Remark: Depending on how rigorously you want your answer shown, you may start with the graph of f as
above, approximate the asymptote near 0.4, and proceed straight to the final answer. This conclusion
can be justified (graphically) by assuming that the graph of f is more or less linear to the right of the
part shown and imagining the web diagram for any value in this interval. To make this argument slightly
more rigorous, note that f has a slant asymptote, y = (3/4)x, as x approaches ∞, so the assumption that
the graph of f is more or less a straight line to the right of the part shown is valid.

21:

26: The sum of two numbers, call them x and y, is 20, so x + y = 20. If each number is added to its square root, the product of the two sums is 172.2, so (x + √x)(y + √y) = 172.2. Hence, we need to solve the system

x + y = 20
(x + √x)(y + √y) = 172.2

of two equations with two unknowns. Since this system is not linear, our best hope is to use substitution. The first equation gives us y = 20 − x. Substituting this value of y in the second equation gives us

(x + √x)(20 − x + √(20 − x)) = 172.2

or (x + √x)(20 − x + √(20 − x)) − 172.2 = 0. It is a solution of this last equation we seek. Without having any idea what the roots might be besides the reasonable assumption that they are between 0 and 20, it is not clear what initial values to use. With a few different attempts, you are likely to find some that work. For example, applying the secant method to g(x) = (x + √x)(20 − x + √(20 − x)) − 172.2 with x0 = 9 and x1 = 10 gives 9.149620618, which is accurate to all digits shown, in just 9 iterations. The other number is 20 − 9.149620618 = 10.850379382. We can verify this is a solution by calculating

(9.149620618 + √9.149620618)(10.850379382 + √10.850379382)

which is very nearly 172.2.
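Remark: The secant iteration just described can be scripted in a few lines of Octave. This is only a sketch; the tolerance and the cap of 20 iterations are arbitrary choices, not part of the question.

g=inline('(x+sqrt(x)).*(20-x+sqrt(20-x))-172.2');
x0=9; x1=10;
for k=1:20
  x2=x1-g(x1)*(x1-x0)/(g(x1)-g(x0));   % secant step
  if (abs(x2-x1)<1e-9)
    break
  end%if
  x0=x1; x1=x2;
end%for
x2        % settles near 9.149620618
20-x2     % the other number, near 10.850379382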

27: Newton's method will fail to find a root of g on the second iteration if g′(x1) = 0. For example, let g(x) = x³ − 3x + 3. Then g′(x) = 3x² − 3 has zeroes when x = ±1. So we need a value x0 such that x1 = 1 or

x1 = −1. We need to find any solution of x1 = x0 − g(x0)/g′(x0) = x − (x³ − 3x + 3)/(3x² − 3) = ±1. One such solution follows.

x − (x³ − 3x + 3)/(3x² − 3) = 1
(2x³ − 3)/(3x² − 3) = 1
2x³ − 3 = 3x² − 3
2x³ − 3x² = 0
x²(2x − 3) = 0

so either of the initial values x0 = 0 or x0 = 3/2 will produce the desired result.

Remark: The equation x − (x³ − 3x + 3)/(3x² − 3) = −1 has only one real solution, but it is irrational. It is, accurate to 20
significant digits, 1.0786168885087585968. Setting x0 = 1.078616888508759 as in the following Octave
code does not fail, however! There is enough round-off error that x1 is not exactly −1 and g 0 (x1 ) is not
exactly zero, so the method proceeds to find the result. It takes 99 iterations to settle in on the solution,
but it gets there. x1 displays as -0.999999999999999 and x2 displays as 7.50599937895082e+14.

format('long')
f=inline('x^3-3*x+3')
fp=inline('3*x^2-3')
x0=1.0786168885087585968
c=1;
for i=1:120
  x=x0-f(x0)/fp(x0)
  if (abs(x-x0)<1e-15)
    c
    return
  end%if
  x0=x;
  c=c+1;
end%for

Section 2.5

1: Before trying to match any functions with their diagrams, we take stock of the functions available. f and h
are polynomials of degree 5 and, therefore, have at most 5 distinct roots. l is the product of the natural
logarithm with a third degree polynomial. The polynomial has three roots and the logarithm has one distinct
from those of the polynomial, so l has four roots. Now looking at the diagrams, we can match two functions
with their diagrams. Diagram (d) has patches of nine different colors, indicating nine roots within the area
shown. Since functions f , h, and l have fewer than 9 roots, function g must match with diagram (d). Along
the same lines, diagrams (a) and (b) both show 5 roots, so l can not match either of those. l has only four
roots. By process of elimination, function l matches with diagram (c). That leaves (a) and (b) to match with
f and h. Both diagrams show 5 roots, but there is a fundamental difference between the two. The real axis
passes horizontally through the center of each diagram. Diagram (a) has one patch covering the entire real
axis, indicating only one real root while diagram (b) has three patches covering the real axis, indicating three
real roots. The graph of f ,

clearly shows that f has three roots, so f matches with (b) and h matches with (a). To recap,
f ↔ (b)
g ↔ (d)
h ↔ (a)
l ↔ (c).

3c: For each root r, the polynomial must have a factor of (x − r) and no other factors. This polynomial must have
factors of (x−(−4)), (x−(−1)), (x−2), (x−2i), and (x−(−2i)), making p(x) = (x+4)(x+1)(x−2)(x−2i)(x+2i)
one solution.
Remark: q(x) = a(x + 4)(x + 1)(x − 2)(x − 2i)(x + 2i) where a is any nonzero complex number is another
solution.
Remark: Though it is not necessary to multiply the factors, p(x) = x⁵ + 3x⁴ − 2x³ + 4x² − 24x − 32.
3d: For each root r, the polynomial must have a factor of (x − r) and no other factors. This polynomial must have
factors of (x − (−4)), (x − (−1)), (x − 2), and (x − 2i), making p(x) = (x + 4)(x + 1)(x − 2)(x − 2i) one solution.
Remark: q(x) = a(x + 4)(x + 1)(x − 2)(x − 2i) where a is any nonzero complex number is another solution.
Remark: Though it is not necessary to multiply the factors, p(x) = x⁴ + (3 − 2i)x³ − (6 + 6i)x² − (8 − 12i)x + 16i.
Notice that not all the coefficients are real numbers. This is consistent with the conjugate roots theorem
stating that if a polynomial with real coefficients has complex roots, they must come in conjugate pairs.
7: f is periodic and has infinitely many roots regularly spread across the real axis. The only diagram showing roots
of this nature is (a) so f matches with (a). g and f differ only by a small amount for large real values so we
should expect to see infinitely many more or less regularly spaced roots on the positive real axis. The only
diagram with roots of this nature is (d) so g matches with (d). l is a fifth degree polynomial so has at most
5 roots. Diagram (b) shows 8 colors so 8 roots. Therefore, h matches with (b) and l matches with (c). To
recap,
f ↔ (a)
g ↔ (d)
h ↔ (b)
l ↔ (c).

Section 2.6
6a: g(2) = 38 and g 0 (2) = 71:

2 | 3   12   −13    −8
  |      6    36    46
2 | 3   18    23    38
  |      6    48
  | 3   24    71
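Remark: The same synthetic division is easily scripted. The sketch below stores the coefficients in ascending order (as the text's horner routine does) and accumulates the value and the derivative in one pass; it is only a check that g(2) = 38 and g′(2) = 71, not the text's implementation.

c=[-8,-13,12,3];   % 3x^3+12x^2-13x-8, ascending order
x=2;
b=c(4); d=0;       % b will hold g(x), d will hold g'(x)
for k=3:-1:1
  d=d*x+b;         % derivative row of the synthetic division
  b=b*x+c(k);      % value row of the synthetic division
end%for
b                  % displays 38
d                  % displays 71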

8a: From 6a, g(2) = 38 and g′(2) = 71, so x1 = 2 − 38/71 = 104/71. g(104/71) = 2911104/357911 and g′(104/71) = 209027/5041:

104/71 | 3   12        −13           −8
       |     312/71    121056/5041   5774392/357911
104/71 | 3   1164/71   55523/5041    2911104/357911
       |     312/71    153504/5041
       | 3   1476/71   209027/5041

so x2 = 104/71 − (2911104/357911)/(209027/5041) = 2689672/2120131 ≈ 1.268634815490175.

14a: newtonhorner([-144,144,-59,6,1],1,1e-5,100) returns ans = 3.

3 | 1   6    −59   144   −144
  |     3     27   −96    144
  | 1   9    −32    48      0

so the deflated polynomial is x³ + 9x² − 32x + 48. newtonhorner([48,-32,9,1],3,1e-5,100) returns ans = -12.

−12 | 1    9    −32    48
    |     −12    36   −48
    | 1   −3      4     0

The deflated polynomial is x² − 3x + 4, which is quadratic. The quadratic formula gives the remaining roots, (3 ± √(9 − 4(4)))/2 = (3 + i√7)/2 and (3 − i√7)/2. To recap, the four roots are 3, −12, (3 + i√7)/2, and (3 − i√7)/2.
15a: format('long'); c=[-40,16,-12,-2,1]; newtonhorner(c,1,1e-5,100) returns

ans = -3.54823289798023

so −3.54823289798023 is one root. c=deflate(c,ans) returns

c =

-11.27321716194279 7.68642249426964 -5.54823289798023 1.00000000000000

so the deflated polynomial is approximately x³ − 5.5482x² + 7.6864x − 11.2732 and the coefficients of this poly-
nomial are now contained in array c. newtonhorner(c,-3.5,1e-5,100) returns ans = 4.38111344099655
so 4.38111344099655 is another root. c=deflate(c,ans) returns

c =

2.57313975402986 -1.16711945698368 1.00000000000000

so the deflated polynomial is approximately x² − 1.1671x + 2.5731 and the coefficients of this polynomial are
now contained in array c. Since we have deflated the polynomial to a quadratic, we find the last two roots
using the quadratic formula. [s,t]=quadraticRoots(c(3),c(2),c(1)) returns

s = 0.583559728491838 + 1.494188006012761i
t = 0.583559728491838 - 1.494188006012761i.

To recap, the roots are


−3.54823289798023
4.38111344099655
0.583559728491838 + 1.494188006012761i
0.583559728491838 − 1.494188006012761i.

16a: c=[-40,16,-12,-2,1]; newtonhorner(c,-3.54823289798023,1e-5,100)

returns

ans = -3.54823289797970.
c=[-40,16,-12,-2,1]; newtonhorner(c,4.38111344099655,1e-5,100)

returns

ans = 4.38111344099594.
c=[-40,16,-12,-2,1]; newtonhorner(c,0.583559728491838+1.494188006012761i,1e-5,100)

returns

ans = 0.583559728491879 + 1.494188006011256i.


c=[-40,16,-12,-2,1]; newtonhorner(c,0.583559728491838-1.494188006012761i,1e-5,100)

returns

ans = 0.583559728491879 - 1.494188006011256i.


Each attempt to refine the roots returns a slightly different answer, but none change within the first five
decimal places. The approximate roots of the approximate deflated polynomials are all within 10−5 of the
exact roots of the original polynomial without refinement.
19a: (i) format(’long’); horner(sqrt(3),[-40,16,-12,-2,1]) returns ans = -49.6794919243112. Notice we
only get the value of the polynomial, the first entry of the array of return values. This is the default behavior
if the function is not set equal to an array.
(ii) p=inline('x^4-2*x^3-12*x^2+16*x-40'); p(sqrt(3)) returns ans = -49.6794919243112 so they cer-
tainly look like they are returning the same value.
(iii) horner(sqrt(3),[-40,16,-12,-2,1]) == p(sqrt(3)) returns ans = 0, however, so internally, the re-
sults are not exactly the same! We can conclude that the inline function evaluation is not done by nesting
(synthetic division).
Remark: horner(3,[-40,16,-12,-2,1]) == p(3) returns ans = 1, so for the integer input 3, the two
methods do result in exactly the same value.

Section 3.2
3c: We begin by constructing three polynomials—the first with roots at the second two data points and a value of 1 at the first, the second polynomial with roots at the first and third data points and a value of 1 at the second, the third polynomial with roots at the first and second data points and a value of 1 at the third. Those polynomials are

l1(x) = (x − 20)(x − 1019) / [(−10 − 20)(−10 − 1019)]
l2(x) = (x + 10)(x − 1019) / [(20 + 10)(20 − 1019)]
l3(x) = (x + 10)(x − 20) / [(1019 + 10)(1019 − 20)].

We then multiply li by yi and sum the products:

P2(x) = (x − 20)(x − 1019)/[(−10 − 20)(−10 − 1019)] · (10) + (x + 10)(x − 1019)/[(20 + 10)(20 − 1019)] · (58)
        + (x + 10)(x − 20)/[(1019 + 10)(1019 − 20)] · (−32).

4c: Estimating (or approximating) the value of a function f using an interpolating polynomial means to evaluate the polynomial there instead.

f(1.3) ≈ P2(1.3) = (1.3 − 20)(1.3 − 1019)/[(−10 − 20)(−10 − 1019)] · (10) + (1.3 + 10)(1.3 − 1019)/[(20 + 10)(20 − 1019)] · (58)
                   + (1.3 + 10)(1.3 − 20)/[(1019 + 10)(1019 − 20)] · (−32)
       ≈ 28.427

5c: Neville's method is best executed on a computer or in a tabular format. f(1.3) ≈ P0,2(1.3). The tabular format is shown here:

xi     Pi,0 = yi   Pi,1     Pi,2
−10    10          28.08    28.427
20     58          59.684
1019   −32

P0,1 = [(1.3 − 20)(10) − (1.3 + 10)(58)] / (−10 − 20) = 28.08
P1,1 = [(1.3 − 1019)(58) − (1.3 − 20)(−32)] / (20 − 1019) = 59.684
P0,2 = [(1.3 − 1019)P0,1 − (1.3 + 10)P1,1] / (−10 − 1019) ≈ 28.427

7: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1
polynomial, we will use the data at 2 and 3.5 since these are the two abscissas nearest 3. For the degree at
most 2 polynomial, we will use the data at 2, 3.5, and 4 since these are the three abscissas nearest 3. For the
degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s method
shines! The first estimate uses the first two data points. The second estimate uses these same two plus a
third. The last estimate uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data in the table in the order in which we would
like to use them, we get

xi    Pi,0 = yi   Pi,1   Pi,2    Pi,3
2     .8          .73    .6916   .638
3.5   .7          .65    .53
4     .75         1
5     .5

P0,1 gives the at most degree 1 estimate. P0,2 gives the at most degree 2 estimate, and P0,3 gives the at most degree 3 estimate.
(a) P0,1(3) = [(3 − 3.5)(.8) − (3 − 2)(.7)] / (2 − 3.5) = 0.73.
(b) P1,1(3) = [(3 − 4)(.7) − (3 − 3.5)(.75)] / (3.5 − 4) = 0.65; P0,2(3) = [(3 − 4)(.73) − (3 − 2)(.65)] / (2 − 4) = .6916
(c) P2,1(3) = [(3 − 5)(.75) − (3 − 4)(.5)] / (4 − 5) = 1; P1,2(3) = [(3 − 5)(.65) − (3 − 3.5)(1)] / (3.5 − 5) = .53; P0,3(3) = [(3 − 5)(.6916) − (3 − 2)(.53)] / (2 − 5) = .638

8b: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1

polynomial, we will use the data at .1 and .2 since these are the two abscissas nearest .18. For the degree at
most 2 polynomial, we will use the data at .1, .2, and .3 since these are the three abscissas nearest .18. For
the degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s method
shines! The first estimate uses the first two data points. The second estimate uses these same two plus a
third. The last estimate uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data listed in the Octave function in the order in
which we would like to use them, we get

>> nevilles(.18,[.1,.2,.3,.4],[-.29004986,-.56079734,-.81401972,-1.0526302])
ans =

-0.290049860000000 -0.506647844000000 -0.508049852000000 -0.508143074400000


-0.560797340000000 -0.510152864000000 -0.508399436000000 0.000000000000000
-0.814019720000000 -0.527687144000000 0.000000000000000 0.000000000000000
-1.052630200000000 0.000000000000000 0.000000000000000 0.000000000000000

For the interpolating polynomial of degree at most one, f (.18) ≈ P0,1 (.18) = −.506647844. For the interpo-
lating polynomial of degree at most two, f (.18) ≈ P0,2 (.18) = −.508049852. For the interpolating polynomial
of degree at most three, f (.18) ≈ P0,3 (.18) = −.5081430744.
8c: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is
likely to be the best approximation possible with the given data. It does not always work this way (perhaps
it would make a good exercise to find an example where using the data nearest the point of estimation does
not give the best estimate) but we have the best chance of good results this way. For the degree at most 1
polynomial, we will use the data at 2 and 2.5 since these are the two abscissas nearest 2.26. For the degree
at most 2 polynomial, we will use the data at 2, 2.5, and 1.5 since these are the three abscissas nearest 2.26.
For the degree at most 3 polynomial we have no choice but to use all of the data. Here is where Neville’s
method shines! The first estimate uses the last two data points. The second estimate uses these same two
plus a third. The final uses these three plus a fourth. We can reuse each of the first two calculations in the
next by creating a single Neville’s method table. With the data listed in the Octave function in the order in
which we would like to use them, we get

>> nevilles(2.26,[2,2.5,1.5,1],[-1.329,1.776,-2.569,1.654])
ans =

-1.32900 0.28560 0.05285 0.28036


1.77600 0.73320 -0.82219 0.00000
-2.56900 -8.98796 0.00000 0.00000
1.65400 0.00000 0.00000 0.00000

For the interpolating polynomial of degree at most one, f(2.26) ≈ P0,1(2.26) = .28560. For the interpolating
polynomial of degree at most two, f (2.26) ≈ P0,2 (2.26) = .05285. For the interpolating polynomial of degree
at most three, f (2.26) ≈ P0,3 (2.26) = .28036.
9a: Since the interpolating polynomial error term contains the product (x − x0 )(x − x1 ) · · · (x − xn ), we should
choose data near the point of estimation x. This way, the product is minimized and we arrive at what is likely
to be the best approximation possible with the given data. It does not always work this way (perhaps it would
make a good exercise to find an example where using the data nearest the point of estimation does not give
the best estimate) but we have the best chance of good results this way. For the degree at most 1 polynomial,
we will use the data at 1.25 and 1.6 since these are the two abscissas nearest 1.4. For the degree at most
2 polynomial, we have no choice but to use all of the data. We can use Neville’s method or the Langrange
form in this case. Neither method provides obvious advantage over the other. To begin, f (1) = sin π = 0;
f (1.25) = sin 1.25π ≈ −.70711; f (1.6) = sin(1.6π) ≈ −.95106.
Lagrange form: (degree at most 1) L1(x) = (x − 1.6)/(1.25 − 1.6) · (−.70711) + (x − 1.25)/(1.6 − 1.25) · (−.95106) so f(1.4) ≈ L1(1.4) = (1.4 − 1.6)/(1.25 − 1.6) · (−.70711) + (1.4 − 1.25)/(1.6 − 1.25) · (−.95106) = −.81166.
(degree at most 2) L2(x) = (x − 1.25)(x − 1.6)/[(1 − 1.25)(1 − 1.6)] · (0) + (x − 1)(x − 1.6)/[(1.25 − 1)(1.25 − 1.6)] · (−.70711) + (x − 1)(x − 1.25)/[(1.6 − 1)(1.6 − 1.25)] · (−.95106) so f(1.4) ≈ L2(1.4) = (1.4 − 1)(1.4 − 1.6)/[(1.25 − 1)(1.25 − 1.6)] · (−.70711) + (1.4 − 1)(1.4 − 1.25)/[(1.6 − 1)(1.6 − 1.25)] · (−.95106) = −.918232.

Neville's Method: We use the same table for both the degree at most 1 and degree at most 2 polynomials:

xi     Pi,0 = yi   Pi,1               Pi,2
1.25   −.70711     .16414 − .697x     3.5524x² − 10.82134x + 7.26894
1.6    −.95106     1.5851 − 1.5851x
1      0

P0,1(x) = [(x − 1.6)(−.70711) − (x − 1.25)(−.95106)] / (1.25 − 1.6) = .16414 − .697x
P1,1(x) = (x − 1)(−.95106) / (1.6 − 1) = 1.5851 − 1.5851x
P0,2(x) = [(x − 1)P0,1(x) − (x − 1.25)P1,1(x)] / (1.25 − 1) = 3.5524x² − 10.82134x + 7.26894

(degree at most 1) P0,1(1.4) = .16414 − .697(1.4) = −.81166
(degree at most 2) P0,2(1.4) = 3.5524(1.4)² − 10.82134(1.4) + 7.26894 = −.918232
10a: (degree at most 1) f(1.4) − P1(1.4) = [f″(ξ1.4)/2!] (1.4 − 1.25)(1.4 − 1.6) so our bound is

|f(1.4) − P1(1.4)| ≤ .015 max_{x∈[1.25,1.6]} |π² sin πx| = .015π² |sin(1.5π)| < .149

The actual absolute error is |f(1.4) − P1(1.4)| = |sin(1.4π) + .81166| ≈ .139, which is rather near the bound.

(degree at most 2) f(1.4) − P2(1.4) = [f‴(ξ1.4)/3!] (1.4 − 1.25)(1.4 − 1.6)(1.4 − 1) so our bound is

|f(1.4) − P2(1.4)| ≤ .002 max_{x∈[1,1.6]} |π³ cos πx| = .002π³ < .0620

The actual absolute error is |f(1.4) − P2(1.4)| = |sin(1.4π) + .918232| ≈ .0328, which is of the same order of magnitude as the bound.

Section 3.3
4: The Newton form of an interpolating polynomial follows from a table of divided differences. Recursion 3.3.3 is
used to compute the entries in the table, as in Table 3.3. Answers will depend on the order in which the data
are listed in the table and on how the data are read from the table. Placing the data in the table in the order
given in the question, we have:

xi   fi,0 = f(xi)   fi,1   fi,2   fi,3
1    2               0     −1     2/3
2    2              −2      1
3    0               0
4    0

Reading the coefficients across the first row, we use f0,0 , f0,1 , f0,2 , and f0,3 . This is a valid sequence to read
from the table since each coefficient depends on the same data as the previous plus one point. f0,0 depends
on x0 ; f0,1 depends on x0 and x1 ; f0,2 depends on x0 , x1 , and x2 ; and f0,3 depends on x0 , x1 , x2 , and x3 .
Therefore, one answer is
2
P0,3 (x) = 2 + 0(x − 1) − 1(x − 1)(x − 2) + (x − 1)(x − 2)(x − 3)
3
2
= 2 − (x − 1)(x − 2) + (x − 1)(x − 2)(x − 3).
3

The sequence of coefficients f1,0 , f2,1 , f1,2 , f0,3 is not a valid sequence to choose. f1,0 depends on x1 but
f2,1 depends on x2 and x3 , two completely different data values from the first. With some study, you might
be able to draw the conclusion, and maybe even prove, that any sequence of coefficients starting in the first
column and progressing to the right one column at a time and either jumping up one row or remaining in
the same row with each change of column forms a valid sequence. For example, we can use coefficients f2,0 ,
f1,1 , f1,2 , f0,3 because f2,0 depends on x2 ; f1,1 depends on x2 and x1 ; f1,2 depends on x2 , x1 , and x3 ; and
f0,3 depends on x2 , x1 , x3 , and x0 . And the order in which new dependencies are encountered matters. The
(x − xi ) monomials must appear in the same order. Therefore, another answer is
P0,3(x) = 0 − 2(x − 3) + 1(x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 4)
        = −2(x − 3) + (x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 4).
Other possible answers garnered from this same divided difference table are

P0,3(x) = (x − 4)(x − 3) + (2/3)(x − 4)(x − 3)(x − 2)
P0,3(x) = −2(x − 3) − (x − 3)(x − 2) + (2/3)(x − 3)(x − 2)(x − 1).

With some algebra and a bit of patience, each of the four forms above can be reduced to

P0,3(x) = (2/3)x³ − 5x² + (31/3)x − 4.
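Remark: The divided difference table itself can be generated with a short loop in Octave. The sketch below is not the text's dividedDiffs routine, just the recursion written out directly; its first row holds the coefficients 2, 0, −1, 2/3 used above.

xs=[1,2,3,4]; ys=[2,2,0,0];
n=length(xs);
F=zeros(n,n); F(:,1)=ys(:);
for j=2:n
  for i=1:n-j+1
    % f[x_i,...,x_{i+j-1}] = (f[x_{i+1},...] - f[x_i,...])/(x_{i+j-1} - x_i)
    F(i,j)=(F(i+1,j-1)-F(i,j-1))/(xs(i+j-1)-xs(i));
  end%for
end%for
F(1,:)   % displays 2  0  -1  0.66667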
3 3

6: Recursion 3.3.3 is used to compute the entries in the table, as in Table 3.3. Answers will depend on the order in
which the data are listed in the table and on how the data are read from the table. Placing the data in the
table in the order given in the question, we have

xi    fi,0 = f(xi)   fi,1     fi,2
1     .987           −.925   .809375
2.2   −.123          .69375
3     .432

Reading the coefficients across the first row, we use f0,0 , f0,1 , and f0,2 . This is a valid sequence to read from
the table since each coefficient depends on the same data as the previous, plus one point. f0,0 depends on x0 ;
f0,1 depends on x0 and x1 ; and f0,2 depends on x0 , x1 , and x2 . Therefore, one answer is

P0,2 (x) = .987 − .925(x − 1) + .809375(x − 1)(x − 2.2).

The sequence of coefficients f0,0, f1,1, f1,2 is not a valid sequence to choose. f0,0 depends on x0 but f1,1 depends on x1 and x2, two completely different data values from the first. Not to mention f1,2, which is not even part of the table. With some study, you might be able to draw the conclusion, and maybe even prove, that any sequence of coefficients starting in the first column and progressing to the right one column at a time and either jumping up one row or remaining in the same row with each change of column forms a valid sequence. For example, we can use coefficients f1,0, f0,1, f0,2 because f1,0 depends on x1; f0,1 depends on x1 and x0; and f0,2 depends on x1, x0, and x2. And the order in which new dependencies are encountered matters. The (x − xi) monomials must appear in the same order. Therefore, another answer is
matters. The (x − xi ) monomials must appear in the same order. Therefore, another answer is

P0,2 (x) = −.123 − .925(x − 2.2) + .809375(x − 2.2)(x − 1)

The other two possible answers garnered from this same divided difference table are

P0,2 (x) = −.123 + .69375(x − 2.2) + .809375(x − 2.2)(x − 3)


P0,2 (x) = .432 + .69375(x − 3) + .809375(x − 3)(x − 2.2).

With some algebra and a bit of patience, each of the four forms above can be reduced to

P0,2(x) = .809375x² − 3.515x + 3.692625.



10: Answers will depend on the order in which the data are listed in the Octave call and on how the data are read
from the table. Placing the data in the Octave command in the same order they are listed in the question,
your Octave code should produce something like

dividedDiffs([0,.1,.3,.6,1],[-6,-5.89483,-5.65014,-5.17788,-4.28172])
ans =

-6.00000 1.05170 0.57250 0.21500 0.06302


-5.89483 1.22345 0.70150 0.27802 0.00000
-5.65014 1.57420 0.95171 0.00000 0.00000
-5.17788 2.24040 0.00000 0.00000 0.00000
-4.28172 0.00000 0.00000 0.00000 0.00000

One possibility for the interpolating polynomial of degree (at most) four is
P0,4 (x) = −6 + 1.05170x + .5725x(x − .1) + .215x(x − .1)(x − .3)
+.06302x(x − .1)(x − .3)(x − .6).
See discussion of question 4 above for other possibilities. Adding the point (1.1, −3.9958) to the table, we get
(accurate to 5 decimal places)
f5,0 = −3.9958
f4,1 = (−4.28172 + 3.9958)/(1 − 1.1) = 2.8592
f3,2 = (2.2404 − 2.8592)/(.6 − 1.1) = 1.2376
f2,3 = (.95171 − 1.2376)/(.3 − 1.1) = .35736
f1,4 = (.27802 − .35736)/(.1 − 1.1) = .07934
f0,5 = (.06302 − .07934)/(0 − 1.1) = .01484.
Now we can add one more term to P0,4 to get (one possible representation of) P0,5 :
P0,5 (x) = −6 + 1.05170x + .5725x(x − .1) + .215x(x − .1)(x − .3)
+.06302x(x − .1)(x − .3)(x − .6) + .01484x(x − .1)(x − .3)(x − .6)(x − 1).

12: Since Nn, Ln, P0,n, and Pn are all the same polynomial except possibly the form in which they are written, the error term for a Newton polynomial is the same as that for a Lagrange polynomial:

f(x) − Pn(x) = [f^(n+1)(ξx)/(n + 1)!] (x − x0)(x − x1) ··· (x − xn).

In this particular case, we have

f(x) − Pn(x) = [f‴(ξ2)/3!] (2 − 1)(2 − 2.2)(2 − 3) = (1/30) f‴(ξ2).

Since all derivatives are bounded between −2 and 1 over the interval [1, 3], |f‴(ξ2)| ≤ 2 and, therefore, the error has bound

|f(x) − Pn(x)| ≤ 2/30 = 1/15 ≈ .067.
17: Since 0.75 is one of the nodes (it is x3), N3 and f agree there. That is what it means for N3 to interpolate the data at x0, x1, x2, x3. Hence,

f(.75) = N3(.75) = 1 + 4(.75) + 4(.75)(.75 − .25) + (16/3)(.75)(.75 − .25)(.75 − .5) = 6.

18: f is periodic and has infinitely many roots regularly spread across the real axis. The only diagram showing
roots of this nature is (d) so f matches with (d). g and f differ only by a small amount for large real values
so we should expect to see infinitely many more or less regularly spaced roots on the positive real axis. The
only diagram with roots of this nature is (a) so g matches with (a). l is a fifth degree polynomial so has at
most 5 roots. Diagram (b) shows 8 colors so 8 roots. Therefore, h matches with (b) and l matches with (c).
To recap,
f ↔ (d)
g ↔ (a)
h ↔ (b)
l ↔ (c).

Section 4.1
1: (a) L1(x) = [(x − x1)/(x0 − x1)] f(x0) + [(x − x0)/(x1 − x0)] f(x1)
(b) L1′(x) = f(x0)/(x0 − x1) + f(x1)/(x1 − x0) = [f(x1) − f(x0)]/(x1 − x0)
(c) L1′(x0 + h/2) = [f(x0 + h) − f(x0)]/(x0 + h − x0) = [f(x0 + h) − f(x0)]/h so

f′(x0 + h/2) ≈ [f(x0 + h) − f(x0)]/h.

4: (a) The Newton form of an interpolating polynomial derives from a table of divided differences whether it is a single value or a formula for a general case. The divided differences table for this case is

x0        f(x0)        [f(x0 + h) − f(x0)]/h         [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)
x0 + h    f(x0 + h)    [f(x0 + 2h) − f(x0 + h)]/h
x0 + 2h   f(x0 + 2h)

f0,1 = [f(x0 + h) − f(x0)]/[(x0 + h) − x0] = [f(x0 + h) − f(x0)]/h
f1,1 = [f(x0 + 2h) − f(x0 + h)]/[(x0 + 2h) − (x0 + h)] = [f(x0 + 2h) − f(x0 + h)]/h
f0,2 = (f1,1 − f0,1)/[(x0 + 2h) − x0] = {[f(x0 + 2h) − f(x0 + h)]/h − [f(x0 + h) − f(x0)]/h}/(2h)
     = [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)

Therefore, one possibility for the Newton form is

N2(x) = f(x0) + {[f(x0 + h) − f(x0)]/h}(x − x0) + {[f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)}(x − x0)(x − (x0 + h)).

Making the substitution x0 + θh for x,

N2(x0 + θh) = f(x0) + [f(x0 + h) − f(x0)] θ + {[f(x0 + 2h) − 2f(x0 + h) + f(x0)]/2} θ(θ − 1).

(b) dx/dθ = h and (d/dθ)N2(x(θ)) = (d/dx)N2(x) · dx/dθ, so (d/dx)N2(x) = [(d/dθ)N2(x(θ))] ÷ (dx/dθ) = [(d/dθ)N2(x(θ))]/h. Similarly, we get (d²/dx²)N2(x) = [(d²/dθ²)N2(x(θ))]/h²:

(d/dx)N2(x) = {[f(x0 + h) − f(x0)] + [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/2 · (2θ − 1)}/h
(d²/dx²)N2(x) = {[f(x0 + 2h) − 2f(x0 + h) + f(x0)]/2 · (2)}/(h · h)
             = [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/h².

(c) N2″(x0 + (1/2)h) = [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/h² so

f″(x0 + (1/2)h) ≈ [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/h².

6c: To use this formula, we need x0 − h = 10 and x0 + 6h = 17, a system of two equations with two unknowns whose solution is x0 = 11 and h = 1. Plugging these values into formula 4.1.6:

∫_{10}^{17} 1/(x − 5) dx ≈ (1/8640)[5257f(17) − 5880f(16) + 59829f(15) − 81536f(14) + 102459f(13) − 50568f(12) + 30919f(11)]
  = (1/8640)[5257 · (1/12) − 5880 · (1/11) + 59829 · (1/10) − 81536 · (1/9) + 102459 · (1/8) − 50568 · (1/7) + 30919 · (1/6)]
  ≈ 0.8753962951271979.

7c: (i) ∫_{10}^{17} 1/(x − 5) dx = ln|x − 5| evaluated from 10 to 17 = ln(12) − ln(5) = ln(12/5) ≈ 0.8754687373539001. (ii) The absolute error is the absolute value of the difference between the approximation and the exact value: |ln(12/5) − 0.8753962951271979| ≈ 7.24(10)^−5.

11d: To approximate some quantity in regard to a non-polynomial function, we simply evaluate the corresponding quantity for the interpolating polynomial. That means in this case, f′(2) ≈ p′(2). But p′(x) = 12x³ − 4x + 1 so f′(2) ≈ 12 · 2³ − 4 · 2 + 1 = 89.

12e: To approximate some quantity in regard to a non-polynomial function, we simply evaluate the corresponding quantity for the interpolating polynomial. That means in this case, ∫_0^1 g(x)dx ≈ ∫_0^1 q(x)dx:

∫_0^1 g(x)dx ≈ ∫_0^1 (−7x⁴ + 3x² − x + 4)dx
            = [−(7/5)x⁵ + x³ − (1/2)x² + 4x] evaluated from 0 to 1
            = −7/5 + 1 − 1/2 + 4
            = 31/10 = 3.1

13d: To use this formula, we need only to substitute proper values for θ and the θi . θ must be 0 since the point of
evaluation is at x0 (which equals x0 + 0h). It does not matter which stencil point gives which θi , but the θi
come from the fact that the nodes are x0 − h, x0 + 2h, and x0 + 3h. That gives us −1, 2, and 3 for the θi .
Setting θ0 = −1, θ1 = 2, and θ2 = 3:

f′(x) ≈ P2′(x) = [(0 − 2) + (0 − 3)]/[h(−1 − 2)(−1 − 3)] f(x0 − h)
               + [(0 − (−1)) + (0 − 3)]/[h(2 − (−1))(2 − 3)] f(x0 + 2h)
               + [(0 − (−1)) + (0 − 2)]/[h(3 − (−1))(3 − 2)] f(x0 + 3h)
             = −5/(12h) f(x0 − h) + 2/(3h) f(x0 + 2h) − 1/(4h) f(x0 + 3h)
             = [−5f(x0 − h) + 8f(x0 + 2h) − 3f(x0 + 3h)]/(12h).

15c: The integral over this stencil is from x0 to x0 + 2h so θ0 = 0 and θ1 = 2. The nodes are x0 + (1/3)h and x0 + (4/3)h so θ2 and θ3 are 1/3 and 4/3. It does not matter which is which. Setting θ2 = 1/3 and θ3 = 4/3, the formula from question 14c becomes −(h/2) · [(2 − 0)/(4/3 − 1/3)] · [(2 · (1/3) − 2 − 0)f(x0 + (4/3)h) − (2 · (4/3) − 2 − 0)f(x0 + (1/3)h)], which simplifies to

∫_{x0}^{x0+2h} f(x)dx ≈ (2h/3)[f(x0 + (1/3)h) + 2f(x0 + (4/3)h)]

Section 4.2
1d: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The stencil of this question has 2 nodes, x0 and x0 + h, and point of evaluation x0 + (3/4)h, so in system 4.2.2 we have n = 1, θ0 = 0 and θ1 = 1, and θ = 3/4. Because we are deriving a first derivative formula, we also have k = 1. Therefore, the system we need to solve is

p0′(x0 + (3/4)h) = a0 p0(x0) + a1 p0(x0 + h)
p1′(x0 + (3/4)h) = a0 p1(x0) + a1 p1(x0 + h).

Now, p0(x) = 1 so p0′(x0 + (3/4)h) = 0; and p1(x) = x − x0 so p1′(x0 + (3/4)h) = 1. Substituting this information into the system,

0 = a0 + a1
1 = a1 h.

From the second equation, a1 = 1/h. Substituting into the first equation, 0 = a0 + 1/h so a0 = −1/h. Our approximation, formula 4.2.1, becomes

f′(x0 + (3/4)h) ≈ −(1/h) f(x0) + (1/h) f(x0 + h) = [f(x0 + h) − f(x0)]/h.
That formula should look familiar!
1j: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The stencil of this question has 4 nodes, x0, x0 + h, x0 + (3/2)h, and x0 + 2h with point of evaluation x0 + (1/2)h, so in system 4.2.2 we have n = 3, θ0 = 0, θ1 = 1, θ2 = 3/2, θ3 = 2, and θ = 1/2. Because we are deriving a first derivative formula, we also have k = 1. Therefore, the system we need to solve is

p0′(x0 + (1/2)h) = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + (3/2)h) + a3 p0(x0 + 2h)
p1′(x0 + (1/2)h) = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + (3/2)h) + a3 p1(x0 + 2h)
p2′(x0 + (1/2)h) = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + (3/2)h) + a3 p2(x0 + 2h)
p3′(x0 + (1/2)h) = a0 p3(x0) + a1 p3(x0 + h) + a2 p3(x0 + (3/2)h) + a3 p3(x0 + 2h)

Now, p0(x) = 1 so p0′(x0 + (1/2)h) = 0; p1(x) = x − x0 so p1′(x0 + (1/2)h) = 1; p2(x) = (x − x0)² so p2′(x0 + (1/2)h) = h; and p3(x) = (x − x0)³ so p3′(x0 + (1/2)h) = (3/4)h². Substituting this information into the system,

0 = a0 + a1 + a2 + a3
1 = a1 h + a2 · (3/2)h + a3 · 2h
h = a1 h² + a2 · (9/4)h² + a3 · 4h²
(3/4)h² = a1 h³ + a2 · (27/8)h³ + a3 · 8h³.

The first equation is the only one in which a0 appears so we concentrate on solving the last three equations, which simplify to:

2/h = 2a1 + 3a2 + 4a3
4/h = 4a1 + 9a2 + 16a3
6/h = 8a1 + 27a2 + 64a3.

From the first equation, 2a1 = 2/h − 3a2 − 4a3 so 4a1 = 4/h − 6a2 − 8a3 and 8a1 = 8/h − 12a2 − 16a3. Substituting into the second and third equations, respectively,

4/h = 4/h − 6a2 − 8a3 + 9a2 + 16a3
6/h = 8/h − 12a2 − 16a3 + 27a2 + 64a3

which simplifies to

0 = 3a2 + 8a3
−2/h = 15a2 + 48a3.

From the first equation, a3 = −(3/8)a2. Substituting into the last equation, −2/h = 15a2 + 48(−(3/8)a2), which simplifies to −2/h = −3a2 so

a2 = 2/(3h).

Back-substituting, a3 = −(3/8)a2 = −(3/8)(2/(3h)) so

a3 = −1/(4h).

Continuing the back-substitution, 2a1 = 2/h − 3a2 − 4a3 = 2/h − 3(2/(3h)) − 4(−1/(4h)), which simplifies to 2a1 = 1/h so

a1 = 1/(2h).

Finally, a0 = −a1 − a2 − a3 = −1/(2h) − 2/(3h) + 1/(4h) so

a0 = −11/(12h).

Our approximation, formula 4.2.1, thus becomes

f′(x0 + (1/2)h) ≈ −(11/(12h)) f(x0) + (1/(2h)) f(x0 + h) + (2/(3h)) f(x0 + (3/2)h) − (1/(4h)) f(x0 + 2h)
              = [−11f(x0) + 6f(x0 + h) + 8f(x0 + (3/2)h) − 3f(x0 + 2h)]/(12h).
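Remark: The back-substitution can be double checked by letting Octave solve the same system. The sketch below takes x0 = 0 and h = 1 for simplicity (the coefficients scale as 1/h), builds one equation per polynomial p_j(x) = x^j, and should return −11/12, 1/2, 2/3, −1/4.

h=1;
nodes=[0,1,3/2,2]*h;     % the four stencil nodes
t=h/2;                   % point of evaluation
A=zeros(4,4); b=zeros(4,1);
for j=0:3
  A(j+1,:)=nodes.^j;     % p_j evaluated at each node
  if (j==0)
    b(j+1)=0;            % derivative of a constant is 0
  else
    b(j+1)=j*t^(j-1);    % p_j'(t)
  end%if
end%for
a=A\b                    % a(1),...,a(4) are a0,...,a3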
2f: We are trying to find the undetermined coefficients ai of formula 4.2.1. We solve system 4.2.2 to do so. The stencil of this question has 4 nodes, x0, x0 + h, x0 + (3/2)h, and x0 + 2h with point of evaluation x0 + (1/2)h, so in system 4.2.2 we have n = 3, θ0 = 0, θ1 = 1, θ2 = 3/2, θ3 = 2, and θ = 1/2. Because we are deriving a second derivative formula, we also have k = 2. Therefore, the system we need to solve is

p0″(x0 + (1/2)h) = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + (3/2)h) + a3 p0(x0 + 2h)
p1″(x0 + (1/2)h) = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + (3/2)h) + a3 p1(x0 + 2h)
p2″(x0 + (1/2)h) = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + (3/2)h) + a3 p2(x0 + 2h)
p3″(x0 + (1/2)h) = a0 p3(x0) + a1 p3(x0 + h) + a2 p3(x0 + (3/2)h) + a3 p3(x0 + 2h)

Now, p0(x) = 1 so p0″(x0 + (1/2)h) = 0; p1(x) = x − x0 so p1″(x0 + (1/2)h) = 0; p2(x) = (x − x0)² so p2″(x0 + (1/2)h) = 2; and p3(x) = (x − x0)³ so p3″(x0 + (1/2)h) = 3h. Substituting this information into the system,

0 = a0 + a1 + a2 + a3
0 = a1 h + a2 · (3/2)h + a3 · 2h
2 = a1 h² + a2 · (9/4)h² + a3 · 4h²
3h = a1 h³ + a2 · (27/8)h³ + a3 · 8h³.

The first equation is the only one in which a0 appears so we concentrate on solving the last three equations, which simplify to:

0 = 2a1 + 3a2 + 4a3
8/h² = 4a1 + 9a2 + 16a3
24/h² = 8a1 + 27a2 + 64a3.

From the first equation, 2a1 = −3a2 − 4a3 so 4a1 = −6a2 − 8a3 and 8a1 = −12a2 − 16a3. Substituting into the second and third equations, respectively,

8/h² = −6a2 − 8a3 + 9a2 + 16a3
24/h² = −12a2 − 16a3 + 27a2 + 64a3

which simplifies to

8/h² = 3a2 + 8a3
24/h² = 15a2 + 48a3.

Five times the first equation minus the second equation yields 16/h² = −8a3 so

a3 = −2/h².

Back-substituting, 8/h² = 3a2 + 8a3 = 3a2 + 8(−2/h²) so

a2 = 8/h².

Continuing the back-substitution, 2a1 = −3a2 − 4a3 = −3(8/h²) − 4(−2/h²), which simplifies to 2a1 = −16/h² so

a1 = −8/h².

Finally, a0 = −a1 − a2 − a3 = 8/h² − 8/h² + 2/h² so

a0 = 2/h².

Our approximation, formula 4.2.1, thus becomes

f″(x0 + (1/2)h) ≈ (2/h²) f(x0) − (8/h²) f(x0 + h) + (8/h²) f(x0 + (3/2)h) − (2/h²) f(x0 + 2h)
             = [2f(x0) − 8f(x0 + h) + 8f(x0 + (3/2)h) − 2f(x0 + 2h)]/h².

4b: We are trying to find the undetermined coefficients ai of formula 4.2.3. We solve system 4.2.4 to do so. The stencil of this question has 1 node, x0 + (2/3)h, and endpoints of integration x0 and x0 + 2h, so in system 4.2.4 we have n = 0, a = x0 and b = x0 + 2h. Therefore, the “system” we need to solve is

∫_{x0}^{x0+2h} p0(x)dx = a0 p0(x0 + (2/3)h).

Now, p0(x) = 1 so ∫_{x0}^{x0+2h} p0(x)dx = ∫_{x0}^{x0+2h} dx = 2h. Substituting this information into the system,

2h = a0.

Our approximation, formula 4.2.3, becomes

∫_{x0}^{x0+2h} f(x)dx ≈ 2h f(x0 + (2/3)h).
4l: We are trying to find the undetermined coefficients ai of formula 4.2.3. We solve system 4.2.4 to do so. The stencil of this question has 3 nodes, x0, x0 + h, and x0 + 2h with endpoints of integration x0 and x0 + 2h, so in system 4.2.4 we have n = 2, a = x0 and b = x0 + 2h. Therefore, the system we need to solve is

∫_{x0}^{x0+2h} p0(x)dx = a0 p0(x0) + a1 p0(x0 + h) + a2 p0(x0 + 2h)
∫_{x0}^{x0+2h} p1(x)dx = a0 p1(x0) + a1 p1(x0 + h) + a2 p1(x0 + 2h)
∫_{x0}^{x0+2h} p2(x)dx = a0 p2(x0) + a1 p2(x0 + h) + a2 p2(x0 + 2h)

Now, p0(x) = 1 so ∫_{x0}^{x0+2h} p0(x)dx = ∫_{x0}^{x0+2h} dx = 2h; p1(x) = x − x0 so ∫_{x0}^{x0+2h} p1(x)dx = ∫_{x0}^{x0+2h} (x − x0)dx = (1/2)(x − x0)² evaluated from x0 to x0 + 2h = 2h²; and p2(x) = (x − x0)² so ∫_{x0}^{x0+2h} p2(x)dx = ∫_{x0}^{x0+2h} (x − x0)²dx = (1/3)(x − x0)³ evaluated from x0 to x0 + 2h = (8/3)h³. Substituting this information into the system,

2h = a0 + a1 + a2
2h² = a1 h + a2(2h)
(8/3)h³ = a1 h² + a2(4h²).

The first equation is the only one in which a0 appears so we concentrate on the last two equations, which simplify to:

2h = a1 + 2a2
(8/3)h = a1 + 4a2.

From the first equation, a1 = 2h − 2a2. Substituting into the second equation, (8/3)h = 2h − 2a2 + 4a2, which simplifies to (2/3)h = 2a2, so

a2 = (1/3)h.

Back-substituting, a1 = 2h − 2a2 = 2h − 2((1/3)h) so

a1 = (4/3)h.

Finally, a0 = 2h − a1 − a2 = 2h − (4/3)h − (1/3)h so

a0 = (1/3)h.

Our approximation, formula 4.2.3, thus becomes

∫_{x0}^{x0+2h} f(x)dx ≈ (1/3)h f(x0) + (4/3)h f(x0 + h) + (1/3)h f(x0 + 2h)
                    = (h/3)[f(x0) + 4f(x0 + h) + f(x0 + 2h)].

You may recognize this formula as Simpson's rule!
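Remark: Octave will solve the same moment equations directly. With x0 = 0 and h = 1, the short sketch below should return the weights 1/3, 4/3, 1/3, which are Simpson's rule coefficients.

h=1;
nodes=[0,h,2*h];
A=zeros(3,3); b=zeros(3,1);
for j=0:2
  A(j+1,:)=nodes.^j;          % p_j(x) = (x-x0)^j at the nodes, with x0 = 0
  b(j+1)=(2*h)^(j+1)/(j+1);   % integral of p_j over [x0, x0+2h]
end%for
a=A\b                         % displays 0.33333  1.33333  0.33333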

Section 4.3
3a: Simpson's rule for integral approximation is ∫_{x0}^{x0+2h} f(x)dx = (h/3)[f(x0) + 4f(x0 + h) + f(x0 + 2h)]. To apply it to the integral ∫_{−0.5}^{0} x ln(x + 1)dx we need to identify f, x0, and h. In the formula, x0 is the lower limit of integration, so we have x0 = −0.5 in this question. In the formula, the length of the interval of integration is 2h, so we have 2h = 0.5 in this question, or h = 0.25. In the formula, f is the integrand, so we have f(x) = x ln(x + 1). With the parameters identified, we plug them into the right side of Simpson's rule and we have our estimate:

∫_{−0.5}^{0} x ln(x + 1)dx ≈ (.25/3)[−0.5 ln(0.5) + 4(−0.25) ln(.75) + 0 ln(1)]
                          ≈ 0.05285463856097945.
4a: The trapezoidal rule for integral approximation is ∫_{x0}^{x0+h} f(x)dx = (h/2)[f(x0) + f(x0 + h)]. To apply it to the integral ∫_{−0.5}^{0} x ln(x + 1)dx we need to identify f, x0, and h. In the formula, x0 is the lower limit of integration, so we have x0 = −0.5 in this question. In the formula, the length of the interval of integration is h, so we have h = 0.5 in this question. In the formula, f is the integrand, so we have f(x) = x ln(x + 1). With the parameters identified, we plug them into the right side of the trapezoidal rule and we have our estimate:

∫_{−0.5}^{0} x ln(x + 1)dx ≈ (.5/2)[−0.5 ln(0.5) + 0 ln(1)]
                          ≈ 0.08664339756999316.
5a: The midpoint rule for integral approximation is ∫_{x0}^{x0+2h} f(x)dx = 2h f(x0 + h). To apply it to the integral ∫_{−0.5}^{0} x ln(x + 1)dx, we need to identify f, x0, and h. In the formula, x0 is the lower limit of integration, so we have x0 = −0.5 in this question. In the formula, the length of the interval of integration is 2h, so we have 2h = 0.5 in this question, or h = 0.25. In the formula, f is the integrand, so we have f(x) = x ln(x + 1). With the parameters identified, we plug them into the right side of the midpoint rule and we have our estimate:

∫_{−0.5}^{0} x ln(x + 1)dx ≈ 2(.25)(−0.25 ln(0.75))
                          ≈ 0.03596025905647261.
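Remark: All three estimates (and the exact value found by integration by parts in 6a below) can be reproduced in a few lines of Octave:

f=inline('x.*log(x+1)');
a=-0.5; b=0; h=(b-a)/2;            % h for Simpson's and midpoint rules
simpson=h/3*(f(a)+4*f(a+h)+f(b))
trapezoid=(b-a)/2*(f(a)+f(b))
midpoint=(b-a)*f(a+h)
exact=.3125+.375*log(.5)
abs([simpson,trapezoid,midpoint]-exact)   % the errors found in 6a, 7a, and 8a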

6a: Using integration by parts,

∫_{−0.5}^{0} x ln(x + 1)dx = (x²/2) ln(x + 1) evaluated from −0.5 to 0 − (1/2)∫_{−0.5}^{0} x²/(x + 1) dx
  = −((−0.5)²/2) ln(0.5) − (1/2)∫_{−0.5}^{0} (x − 1 + 1/(x + 1)) dx
  = −.125 ln(.5) − (1/2)[x²/2 − x + ln|x + 1|] evaluated from −0.5 to 0
  = −.125 ln(.5) + (1/2)(.25/2 + .5 + ln(.5))
  = .3125 + .375 ln(.5)
  ≈ 0.05256980729002053

so the error is |0.05285463856097945 − 0.05256980729002053| ≈ 2.8483(10)^−4.



7a: See above for the exact evaluation of the integral. The error follows as
|0.08664339756999316 − 0.05256980729002053| ≈ 0.034073.
8a: See above for the exact evaluation of the integral. The error follows as
|0.03596025905647261 − 0.05256980729002053| ≈ 0.016609.
11a: −(h²/6) f‴(ξh) is the error term for this approximation formula. The remainder of the equation is the approximation. We simply plug the given information into the approximation formula:

f′(x0) ≈ [f(x0 + h) − f(x0 − h)]/(2h) = (e^2.1 − e^1.9)/(2(.1)) ≈ 7.401377351441916.
12a: The error term, −(h²/6)f‴(ξh), dictates the error. As in Taylor's Theorem, this error term is exact for some value of ξh. Finding a bound on the error means minimizing or maximizing |(h²/6)f‴(ξh)| over all possible values of ξh. The possible values of ξh are all values between the least node and the greatest node, a fact that follows from Taylor's Theorem. For this question, h = .1 and f‴(ξ) = e^ξ, so a lower bound for the error is

(.1²/6) min_{ξ∈[1.9,2.1]} e^ξ

and an upper bound is

(.1²/6) max_{ξ∈[1.9,2.1]} e^ξ.

But e^ξ is an increasing function, so its minimum value over [1.9, 2.1] occurs at 1.9 and its maximum at 2.1. Hence, we have the error between (.01/6)e^1.9 and (.01/6)e^2.1, or as floating point approximations, 0.01114315740379878 and 0.01361028318761275. f′(x) = e^x so f′(2) = e² exactly. The actual error is thus |e² − 7.401377351441916| ≈ 0.01232125251126526, which is between the bounds.
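Remark: The approximation and the two bounds are quickly verified numerically:

f=inline('exp(x)');
x0=2; h=.1;
approx=(f(x0+h)-f(x0-h))/(2*h)   % about 7.401377
err=abs(exp(2)-approx)           % about 0.012321
lowerbound=h^2/6*exp(1.9)        % about 0.011143
upperbound=h^2/6*exp(2.1)        % about 0.013610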


13a: The full details of the formula include the implied qualification “for some ξh ∈ (x0 − h, x0 + h)”, the interval being decided by the least and greatest nodes. So we search for a value of ξh so that

f′(x0) = [f(x0 + h) − f(x0 − h)]/(2h) − (h²/6)f‴(ξh)

and ξh ∈ (x0 − h, x0 + h). f, x0, and h are given, so we substitute them into this equation and solve. But first, note f′(x) = e^x and f‴(x) = e^x:

e² = (e^2.1 − e^1.9)/.2 − (.1²/6)e^ξh
(.01/6)e^ξh = (e^2.1 − e^1.9 − .2e²)/.2
e^ξh = (6/.002)(e^2.1 − e^1.9 − .2e²)
ξh = ln(3000(e^2.1 − e^1.9 − .2e²)) ≈ 2.00049999404725,

and ξh ∈ (1.9, 2.1) as required.


15: The degree of precision is 4 since the error term involves the fifth derivative of f . The fifth derivative of any
polynomial of degree 4 or less is identically zero, so if f is any polynomial of degree 4 or less, the error in
using the approximation formula is zero.
17c: The error in any approximation formula is the difference between the two sides. One side holds the exact
quantity and the other holds the approximation. To find the error, we subtract the two sides from one another,
expand each appearance of f in a Taylor series about x0 and simplify. The term of least degree remaining
determines the error term.

The left side of this approximation is ∫_{x0}^{x0+h} f(x)dx, so replace f(x) by f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(x0) + ···:

∫_{x0}^{x0+h} f(x)dx = ∫_{x0}^{x0+h} [f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(x0) + ···] dx
  = [x f(x0) + (1/2)(x − x0)²f′(x0) + (1/6)(x − x0)³f″(x0) + (1/24)(x − x0)⁴f‴(x0) + ···] evaluated from x0 to x0 + h
  = h f(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···.

The right side of the approximation includes f(x0 + (2/3)h), so this expression is also expanded in a Taylor series:

f(x0 + (2/3)h) = f(x0) + (2/3)h f′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(x0) + ···.

Substitute these expansions into the difference of the two sides and simplify. The error is

[h f(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···]
  − (h/4)[3(f(x0) + (2/3)h f′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(x0) + ···) + f(x0)]
= [h f(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/24)h⁴f‴(x0) + ···]
  − [h f(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + (1/27)h⁴f‴(x0) + ···]
= (1/216)h⁴f‴(x0) + ···.

Work done heretofore is informal evidence that the error term is O(h⁴f‴(ξh)). To formalize, we truncate the Taylor series, making them Taylor polynomials of convenient degree, with error terms! The error terms from the Taylor polynomials become the error term for the approximation formula. Beginning with the left side of the formula, the exact value:

∫_{x0}^{x0+h} f(x)dx = ∫_{x0}^{x0+h} [f(x0) + (x − x0)f′(x0) + (1/2)(x − x0)²f″(x0) + (1/6)(x − x0)³f‴(ξx)] dx
  = [x f(x0) + (1/2)(x − x0)²f′(x0) + (1/6)(x − x0)³f″(x0)] evaluated from x0 to x0 + h + ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx
  = h f(x0) + (1/2)h²f′(x0) + (1/6)h³f″(x0) + ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx

for some unknown function ξx of x. Now, the f(x0 + (2/3)h) term from the right side of the formula, the approximate value:

f(x0 + (2/3)h) = f(x0) + (2/3)h f′(x0) + (2/9)h²f″(x0) + (4/81)h³f‴(ξ1)

for some ξ1 ∈ (x0, x0 + h). Subtracting the two sides, we know all terms with derivative lower than the third will drop out since none of those terms have changed since our discovery. The error is, therefore,

∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx − (h/4) · 3 · (4/81)h³f‴(ξ1).

The Weighted Mean Value Theorem allows us to replace ∫_{x0}^{x0+h} (1/6)(x − x0)³f‴(ξx)dx by (1/6)f‴(c) ∫_{x0}^{x0+h} (x − x0)³dx = (1/24)h⁴f‴(c) for some c ∈ (x0, x0 + h). The error term thus becomes

(1/24)h⁴f‴(c) − (1/27)h⁴f‴(ξ1)

for some c ∈ (x0, x0 + h) and some ξ1 ∈ (x0, x0 + h). The final formality is to replace this term with big-O notation:

|(1/24)h⁴f‴(c) − (1/27)h⁴f‴(ξ1)| ≤ h⁴[(1/24)|f‴(c)| + (1/27)|f‴(ξ1)|]
                                ≤ h⁴(1/24 + 1/27) max{|f‴(c)|, |f‴(ξ1)|}
                                = M h⁴ |f‴(ξh)|

for some ξh ∈ (x0, x0 + h) and M = 1/24 + 1/27 = 17/216 (the value of ξh is either c or ξ1). Hence, the error is O(h⁴f‴(ξh)).
18c: The error in any approximation formula is the difference between the two sides. One side holds the exact quantity and the other holds the approximation. To find the error, we subtract the two sides from one another, expand each appearance of f in a Taylor series about x0 and simplify. The term of least degree remaining determines the error term.

f′(x0) ≈ [−3f(x0) + 4f(x0 + h/2) − f(x0 + h)]/h

The left side of this approximation is f′(x0), so its Taylor expansion is itself! The right side of the approximation includes f(x0 + (1/2)h) and f(x0 + h), so these expressions are expanded in Taylor series:

f(x0 + (1/2)h) = f(x0) + (1/2)h f′(x0) + (1/8)h²f″(x0) + (1/48)h³f‴(x0) + ···
f(x0 + h) = f(x0) + h f′(x0) + (1/2)h²f″(x0) + (1/6)h³f‴(x0) + ···.

To simplify the display of the algebra, we begin by summing −3f(x0) + 4f(x0 + h/2) − f(x0 + h):

−3f(x0) = −3f(x0)
4f(x0 + (1/2)h) = 4f(x0) + 2h f′(x0) + (1/2)h²f″(x0) + (1/12)h³f‴(x0) + ···
−f(x0 + h) = −f(x0) − h f′(x0) − (1/2)h²f″(x0) − (1/6)h³f‴(x0) + ···
−3f(x0) + 4f(x0 + h/2) − f(x0 + h) = h f′(x0) − (1/12)h³f‴(x0) + ···.

The difference of the two sides is then

f′(x0) − [h f′(x0) − (1/12)h³f‴(x0) + ···]/h = (1/12)h²f‴(x0) + ···.

Work done heretofore is informal evidence that the error term is O(h²f‴(ξh)). To formalize, we truncate the Taylor series, making them Taylor polynomials of convenient degree, with error terms! The error terms from the Taylor polynomials become the error term for the approximation formula. The left side, again, is a Taylor expansion! Now, the f(x0 + (1/2)h) and f(x0 + h) terms from the right side of the formula:

f(x0 + (1/2)h) = f(x0) + (1/2)h f′(x0) + (1/8)h²f″(x0) + (1/48)h³f‴(ξ1)
f(x0 + h) = f(x0) + h f′(x0) + (1/2)h²f″(x0) + (1/6)h³f‴(ξ2)

for some ξ1, ξ2 ∈ (x0, x0 + h). Subtracting the two sides, we know all terms with derivative lower than the third will drop out since none of those terms have changed since our discovery. The remaining terms, those with the third derivative in them, form the error, which is

[−4 · (1/48)h³f‴(ξ1) + (1/6)h³f‴(ξ2)]/h = h²[(1/6)f‴(ξ2) − (1/12)f‴(ξ1)]

for some ξ1, ξ2 ∈ (x0, x0 + h). The final formality is to replace this term with big-O notation:

|h²[(1/6)f‴(ξ2) − (1/12)f‴(ξ1)]| ≤ h²[(1/6)|f‴(ξ2)| + (1/12)|f‴(ξ1)|]
                                 ≤ h²(1/6 + 1/12) max{|f‴(ξ2)|, |f‴(ξ1)|}
                                 = M h² |f‴(ξh)|

for some ξh ∈ (x0, x0 + h) and M = 1/6 + 1/12 = 1/4 (the value of ξh is either ξ2 or ξ1). Hence, the error is O(h²f‴(ξh)).
19: Diffy Rence is using a second derivative formula with x0 = 3 since the left side is f″(3.0). On the right side, we see a term with sin(3) in it. This is likely sin(x0) from one of the second derivative formulas. We also see sin(2.8) and sin(3.2) which look likely to play the roles of sin(x0 − h) and sin(x0 + h) in the approximation formula used. Looking at table 4.3 for a formula with f(x0 − h), f(x0), and f(x0 + h) in it, we find f″(x0) = [f(x0 − h) − 2f(x0) + f(x0 + h)]/h² + O(h²f^(4)(ξh)). Continuing with the hypothesis that we have f(x) = sin(x), x0 = 3, and h = .2, we plug into the formula to find

f″(3) ≈ [sin(2.8) − 2 sin(3) + sin(3.2)]/.2²
      = 25[sin(2.8) − 2 sin(3) + sin(3.2)].

We conclude that f(x) = sin x.


23c: First, we need to identify the formula being used. Since this is a third derivative formula with x0 = 3 and evaluations of f at 3, 3.01, 3.02, 3.03, 3.04, this is a five-point formula with h = .01. The formula used is this one from table 4.4:

f‴(x0) = [−5f(x0) + 18f(x0 + h) − 24f(x0 + 2h) + 14f(x0 + 3h) − 3f(x0 + 4h)]/(2h³) + O(h²f^(5)(ξh))

so the error term is O(h²f^(5)(ξh)). The error is, therefore, bounded by

k(.01)² max_{x∈[3,3.04]} |f^(5)(x)|

for some constant k dependent on the method, not the function f or the nodes used. Now,

max_{x∈[3,3.04]} |f^(5)(x)| = max_{x∈[3,3.04]} |cos(x)| = |cos(3.04)|.

A bound on the error is, therefore, 0.0001k |cos(3.04)| or 9.9485(10)^−5 k for some k dependent on the method.
23f: First, we need to identify the formula being used. The unusual points of evaluation in the approximation identify it quickly as

∫_{x0−h}^{x0+h} f(x)dx = h[f(x0 − (1/√3)h) + f(x0 + (1/√3)h)] + O(h⁵f^(4)(ξh))

with x0 = 3.5, h = 0.5, and error term O(h⁵f^(4)(ξh)). The error is, therefore, bounded by

k(.5)⁵ max_{x∈[3,4]} |f^(4)(x)|

for some constant k dependent on the method, not the function f or the nodes used. Now,

max_{x∈[3,4]} |f^(4)(x)| = max_{x∈[3,4]} |sin(x)| = |sin(4)|.

A bound on the error is, therefore, 0.03125k |sin(4)| or 0.023651k for some k dependent on the method.

24: (a) We are given only 5 nodes, so we must use them all for each approximation. The nodes are (thankfully)
evenly spaced so we can use one of the formulas in table 4.2. There are two nodes to the left of 2 and two
to the right, so we need to use the five-point formula with nodes $x_0-2h$, $x_0-h$, $x_0$, $x_0+h$, and $x_0+2h$
to approximate $f'(2)$. All four of the nodes other than 4 are to the left of 4 so we need to use the five-point
formula with nodes $x_0-4h$, $x_0-3h$, $x_0-2h$, $x_0-h$, and $x_0$ to approximate $f'(4)$. Hence,
$$f'(2) \approx \frac{-.2381 - 8(-.3125) + 8(-.8333) - (-5)}{12(1)} = 0.049625$$
$$f'(4) \approx \frac{3(-.2381) - 16(-.3125) + 36(-.4545) - 48(-.8333) + 25(-5)}{12(1)} = -8.089825.$$

(b) We should expect the approximation of $f'(2)$ to be better because the error term for the formula used is
$\frac{h^4}{30}f^{(5)}(\xi_h)$ whereas the error term for the formula used in approximating $f'(4)$ is $\frac{h^4}{5}f^{(5)}(\xi_h)$, six times greater.
Another reason we should expect the $f'(2)$ approximation to be better is that 2 is centrally located amongst
the nodes whereas 4 is as far from centrally located as possible!
(c) $f'(x) = -\frac{1}{(x-4.2)^2}$ so $f'(2) = -\frac{25}{121}$ and $f'(4) = -25$. The absolute errors are
$$\left|f'(2) - 0.049625\right| \approx 0.2562365702479338$$
$$\left|f'(4) - (-8.089825)\right| \approx 16.910175$$
and the relative errors are
$$\frac{0.2562365702479338}{\left|f'(2)\right|} \approx 1.240185$$
$$\frac{16.910175}{\left|f'(4)\right|} \approx 0.6764070000000001.$$
So, as expected, the absolute error in the approximation of $f'(2)$ is smaller than that of $f'(4)$, but the relative
errors, which are perhaps more important, compare in exactly the opposite way!

33: The function shown below (λ = 2.584739179873929) is one example.

The area of trapezoid CDEF represents the approximation by the trapezoidal rule (which is where it gets
its name). The function f (x) was chosen so that the two brownish areas are (very nearly) equal, one above
line segment CD and one below. This means the trapezoidal rule approximation will be (very nearly) exact.
Moreover, since the point A is not on line segment CD, the approximation by Simpson’s rule will not be (very
nearly) exact. Other examples can be created similarly. To summarize, any example of a smooth function
where the following occur will work.

• The areas above and below the line segment from (0, f (0)) to (1, f (1)) are equal.
• (.5, f (.5)) does not lie on the line segment from (0, f (0)) to (1, f (1)).

REMARK: Non-smooth functions with the two properties above also provide examples. The reason we chose
to give a smooth example is because the errors for non-smooth functions are completely unpredictable
(since they don’t possess the required number of derivatives), and, hence, it is not as surprising in that
case that we can find examples where the trapezoidal rule outdoes Simpson’s rule. The trapezoidal rule
and Simpson’s rule can not be applied reliably to functions without sufficient derivatives.
REMARK: The question did not request a formula, so any hand-sketched graph with the two properties
above would suffice. Since we have a formula, however, we can demonstrate numerically the result. For
the function f pictured above,
$$\int_0^1 f(x)\,dx \approx 3.443097449311693$$
$$\text{Trapezoidal Rule} = \frac{f(0) + f(1)}{2} \approx 3.443097449311694$$
$$\text{Simpson's Rule} = \frac{f(0) + 4f(.5) + f(1)}{6} \approx 3.632535470843161.$$
6

34: Five-point formulas for the 2nd derivative have error term $O(h^3 f^{(5)}(\xi_h))$ or $O(h^4 f^{(6)}(\xi_h))$, so $E_{.1} = k(.1)^3 f^{(5)}(\xi_{.1})$
or $E_{.1} = k(.1)^4 f^{(6)}(\xi_{.1})$ and $E_{.02} = k(.02)^3 f^{(5)}(\xi_{.02})$ or $E_{.02} = k(.02)^4 f^{(6)}(\xi_{.02})$. Assuming $f^{(5)}(\xi_{.1}) \approx
f^{(5)}(\xi_{.02})$ if the error term is $O(h^3 f^{(5)}(\xi_h))$ or that $f^{(6)}(\xi_{.1}) \approx f^{(6)}(\xi_{.02})$ if the error term is $O(h^4 f^{(6)}(\xi_h))$, we
should expect
$$\frac{E_{.1}}{E_{.02}} = \frac{k(.1)^3 f^{(5)}(\xi_{.1})}{k(.02)^3 f^{(5)}(\xi_{.02})} \approx \left(\frac{.1}{.02}\right)^3 = 125$$
or
$$\frac{E_{.1}}{E_{.02}} = \frac{k(.1)^4 f^{(6)}(\xi_{.1})}{k(.02)^4 f^{(6)}(\xi_{.02})} \approx \left(\frac{.1}{.02}\right)^4 = 625.$$

Section 4.4
1a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply the midpoint rule to each
of the subintervals. The sum of the three estimates is the answer.

interval            midpoint rule
[1, 1 + 2/3]        (2/3) ln(sin(1 + 1/3)) ≈ −0.0189755760325961
[1 + 2/3, 2 + 1/3]  (2/3) ln(sin(2)) ≈ −0.06338869073010707
[2 + 1/3, 3]        (2/3) ln(sin(2 + 2/3)) ≈ −0.5216503391783174

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.6040146059410205$$
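REMARK: The computation is easy to check in Octave. A minimal sketch (the anonymous function and vectorized evaluation are our own conveniences, not code from the text):

f = @(x) log(sin(x));          % integrand
a = 1; b = 3; n = 3;
h = (b-a)/n;                   % length of each subinterval
mid = a+h/2 : h : b;           % midpoints of the three subintervals
approx = h*sum(f(mid))         % approximately -0.6040146059410205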

2a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply the trapezoidal rule to
each of the subintervals. The sum of the three estimates is the answer.

interval            trapezoidal rule
[1, 1 + 2/3]        (1/3) [ln(sin(1)) + ln(sin(1 + 2/3))] ≈ −0.05906878811071457
[1 + 2/3, 2 + 1/3]  (1/3) [ln(sin(1 + 2/3)) + ln(sin(2 + 1/3))] ≈ −0.1096099655624244
[2 + 1/3, 3]        (1/3) [ln(sin(2 + 1/3)) + ln(sin(3))] ≈ −0.7607906360781023

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.9294693897512412$$

3a: Divide the interval of integration, [1, 3] into 3 subintervals of equal length and apply Simpson’s rule to each of
the subintervals. The sum of the three estimates is the answer. Let f (x) = ln(sin(x)).

interval            Simpson's rule
[1, 1 + 2/3]        (1/9) [f(1) + 4f(1 + 1/3) + f(1 + 2/3)] ≈ −0.03233998005863559
[1 + 2/3, 2 + 1/3]  (1/9) [f(1 + 2/3) + 4f(2) + f(2 + 1/3)] ≈ −0.0787957823408795
[2 + 1/3, 3]        (1/9) [f(2 + 1/3) + 4f(2 + 2/3) + f(3)] ≈ −0.6013637714782457

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.7124995338777608$$
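REMARK: A quick Octave check of this computation, as a minimal sketch (the anonymous function and vectorized evaluation are our own conveniences):

f = @(x) log(sin(x));                                  % integrand
a = 1; b = 3; n = 3; h = (b-a)/n;
x0 = a : h : b-h;                                      % left endpoint of each subinterval
approx = sum( h/6*(f(x0) + 4*f(x0+h/2) + f(x0+h)) )    % approximately -0.7124995338777608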

4a: Divide the interval of integration, [1, 3], into 3 subintervals of equal length and apply Simpson's 3/8 rule to each
of the subintervals. The sum of the three estimates is the answer. Let f(x) = ln(sin(x)).

interval            Simpson's 3/8 rule
[1, 1 + 2/3]        (1/12) [f(1) + 3f(1 + 2/9) + 3f(1 + 4/9) + f(1 + 2/3)] ≈ −0.03227403251196553
[1 + 2/3, 2 + 1/3]  (1/12) [f(1 + 2/3) + 3f(1 + 8/9) + 3f(2 + 1/9) + f(2 + 1/3)] ≈ −0.07868946204953159
[2 + 1/3, 3]        (1/12) [f(2 + 1/3) + 3f(2 + 5/9) + 3f(2 + 7/9) + f(3)] ≈ −0.5965852934114506

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.7075487879729477$$

5a: Divide the interval of integration, [1, 3], into 3 subintervals of equal length and apply the quadrature rule to
each of the subintervals. The sum of the three estimates is the answer. Let f(x) = ln(sin(x)).

interval            quadrature rule
[1, 1 + 2/3]        (1/3) [f(1 + 2/9) + f(1 + 4/9)] ≈ −0.02334244731238252
[1 + 2/3, 2 + 1/3]  (1/3) [f(1 + 8/9) + f(2 + 1/9)] ≈ −0.068382627545234
[2 + 1/3, 3]        (1/3) [f(2 + 5/9) + f(2 + 7/9)] ≈ −0.5418501791892335

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.63357525404685$$
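REMARK: The quadrature rule used here evaluates f one third and two thirds of the way across each subinterval and weights each value by half the subinterval length. A minimal Octave sketch of the computation (the vectorized anonymous function is our own convenience):

f = @(x) log(sin(x));                              % integrand
a = 1; b = 3; n = 3; L = (b-a)/n;                  % three subintervals of length L
x0 = a : L : b-L;                                  % left endpoint of each subinterval
approx = sum( L/2*(f(x0+L/3) + f(x0+2*L/3)) )      % approximately -0.63357525404685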

7: The trapezoidal rule applied to $\int_0^\pi \sin^4 x\,dx$ gives
$$\frac{\pi}{2}\left[\sin^4(0) + \sin^4(\pi)\right] = 0,$$
which has absolute error $\frac{3}{8}\pi$. Since the trapezoidal rule has error term $O\left(\frac{1}{n^2}\right)$, dividing the interval of
integration into $n$ subintervals should decrease the error by a factor of about $\frac{1}{n^2}$. Therefore, we need to solve
the equation $\frac{3\pi/8}{n^2} = 10^{-4}$:
$$\frac{3\pi/8}{n^2} = 10^{-4}$$
$$\frac{3\pi/8}{10^{-4}} = n^2$$
$$\sqrt{\frac{3\pi/8}{10^{-4}}} = n$$
$$n \approx 108.5.$$

Increasing the number of intervals by a factor of 109 should do the trick. Since our initial estimate used but
one interval, we need to use 109 intervals to achieve 10−4 accuracy.

15: Let Sk (a, b) mean applying composite Simpson’s rule to the interval [a, b] with k subintervals and ek mean the
error in Sk (a, b). We now repeat the analysis we did in deriving the adaptive trapezoidal rule but applied to
Simpson’s rule:
$$e_n \approx M\left(\frac{1}{n}\right)^4 \quad\text{and}\quad e_{2n} \approx M\left(\frac{1}{2n}\right)^4$$
so
$$\frac{e_n}{e_{2n}} \approx \frac{M\left(\frac{1}{n}\right)^4}{M\left(\frac{1}{2n}\right)^4} = 16, \quad\text{which implies}\quad e_n \approx 16e_{2n}.$$
Because $\int_a^b f(x)\,dx = S_2(a,b) + e_2 = S_1(a,b) + e_1$,
$$S_2(a,b) - S_1(a,b) = e_1 - e_2 \approx 16e_2 - e_2 = 15e_2$$
so $e_2 \approx \frac{1}{15}(S_2(a,b) - S_1(a,b))$. Explicitly,
$$\int_a^b f(x)\,dx - S_2(a,b) \approx \frac{1}{15}(S_2(a,b) - S_1(a,b)).$$
Now we know what quantity to use in order to estimate the error. We tabulate the necessary computations:

a     b     S1(a,b)      S2(a,b)      (1/15)|S2(a,b) − S1(a,b)|   tol
1     3     −0.837026    −0.730741    0.00708                     .002
1     2     −0.046286    −0.045560    4.8(10)−5                   .001
2     3     −0.684454    −0.661383    0.00153                     .001
2     2.5   −0.134349    −0.134243    7.0(10)−6                   .0005
2.5   3     −0.527034    −0.523129    0.00026                     .0005

$$\int_1^3 \ln(\sin(x))\,dx \approx -0.045560 - 0.134243 - 0.523129 = -0.702932$$
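REMARK: The first row of the table is easy to reproduce in Octave. A minimal sketch (the helper simpson is our own shorthand for Simpson's rule applied to a single subinterval):

f = @(x) log(sin(x));
simpson = @(a,b) (b-a)/6*(f(a) + 4*f((a+b)/2) + f(b));  % Simpson's rule on [a,b]
S1 = simpson(1,3)                    % approximately -0.837026
S2 = simpson(1,2) + simpson(2,3)     % approximately -0.730741
errorEstimate = abs(S2 - S1)/15      % approximately 0.00708, which exceeds tol = .002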

23: First,
$$\int_0^1 \ln(x+1)\,dx = \Big[(x+1)\ln(x+1) - x - 1\Big]_0^1 = 2\ln 2 - 2 - (-1) = 2\ln 2 - 1 \approx 0.3862943611198906.$$

Now we need to get an estimate using the composite trapezoidal rule with a small number of intervals, say 10
or 20. This part of the computation is mere speculation. Really, any number of intervals that will not give
the desired accuracy will suffice:
T10 (0, 1) = 0.385877936745754.
The error with 10 subintervals is

|0.3862943611198906 − 0.385877936745754| ≈ 4.16424374136581(10)−4 .

Since the error term for the composite trapezoidal rule (assuming $f''(\xi_h)$ is constant, as we do in deriving the
adaptive method) is $O\left(\frac{1}{n^2}\right)$, we expect the error to decrease by a factor of $n^2$ as the number of intervals is
increased by a factor of $n$. The needed factor of decrease is
$$\frac{10^{-6}}{4.16424374136581(10)^{-4}} \approx 0.00240139641699267.$$
Therefore, the necessary factor of increase is $\sqrt{\frac{1}{0.00240139641699267}} \approx 20.406$. Our “test” calculation used 10
intervals, so we need to use $10 \cdot 20.406 = 204.06$, or rounding up, 205 intervals to achieve $10^{-6}$ accuracy.

REMARK: Another way to find the necessary factor of increase is to solve the equation
$$\frac{4.16424374136581(10)^{-4}}{n^2} = 10^{-6}.$$
This comes from the fact that increasing the number of intervals by a factor of n decreases the error by
a factor of n2 . Thus we take the known error (of T10 (0, 1)), divide by n2 and set it equal to the desired
accuracy, 10−6 . The solution, of course, is n ≈ 20.406, the factor of increase.
REMARK: We have used the Octave code
####################################################
# Written by Dr. Len Brin 2 April 2012 #
# MAT 322 Numerical Analysis I #
# Purpose: Implementation of composite Trapezoidal #
# rule #
# INPUT: function f, interval endpoints a and b, #
# number of subintervals n #
# OUTPUT: approximate integral of f(x) from a to b #
####################################################
function integral = compositeTrapezoidal(f,a,b,n)
h = (b-a)/n;
s = 0;
for i = 1:n-1
s = s + f(a+i*h);
end#for
integral = h*(f(a)+2*s+f(b))/2;
end#function
to calculate T10 (0, 1):
>> f=inline(’log(x+1)’);
>> compositeTrapezoidal(f,0,1,10)
ans = 0.385877936745754
compositeTrapezoidal.m may be downloaded at the companion website.
REMARK: Using the code above to calculate the approximation with 205 subintervals:
>> compositeTrapezoidal(f,0,1,205)
ans = 0.386293369647938
and it has error
>> 0.3862943611198906-ans
ans = 9.91471952871414e-07
just less than 10−6 .

Section 4.5
7: We need to combine $N(h)$, $N(\frac{h}{2})$, and $N(\frac{h}{3})$ so that terms involving $h$ and $h^2$ vanish, leaving $h^3$ as the lowest
order term.
$$N(h) = M - K_1h - K_2h^2 - K_3h^3 - \cdots$$
$$N\left(\frac{h}{2}\right) = M - \frac{1}{2}K_1h - \frac{1}{4}K_2h^2 - \frac{1}{8}K_3h^3 - \cdots$$
$$N\left(\frac{h}{3}\right) = M - \frac{1}{3}K_1h - \frac{1}{9}K_2h^2 - \frac{1}{27}K_3h^3 - \cdots$$
so $N(h) + aN(\frac{h}{2}) + bN(\frac{h}{3})$ is
$$(1+a+b)M - \left(1 + \frac{a}{2} + \frac{b}{3}\right)K_1h - \left(1 + \frac{a}{4} + \frac{b}{9}\right)K_2h^2 - \left(1 + \frac{a}{8} + \frac{b}{27}\right)K_3h^3 - \cdots.$$

Therefore, we need to find $a$ and $b$ such that
$$1 + \frac{a}{2} + \frac{b}{3} = 0$$
$$1 + \frac{a}{4} + \frac{b}{9} = 0.$$
The solution of the system is $a = -8$ and $b = 9$. Calculating,
$$N(h) - 8N\left(\frac{h}{2}\right) + 9N\left(\frac{h}{3}\right) = 2M + O(h^3)$$
so our $O(h^3)$ estimate for $M$ is
$$\frac{N(h) - 8N(\frac{h}{2}) + 9N(\frac{h}{3})}{2}.$$
REMARK: We can work directly by Richardson's extrapolation (at least to begin) as well. Using Richardson's
extrapolation with $\alpha = \frac{1}{2}$ and $m_1 = 1$, we can combine $N(h)$ and $N(\frac{h}{2})$ to get an $O(h^2)$ approximation:
$$N_1(h) = 2N\left(\frac{h}{2}\right) - N(h).$$
Using Richardson's extrapolation with $\alpha = \frac{2}{3}$ and $m_1 = 1$, we can combine $N(\frac{h}{2})$ and $N(\frac{h}{3})$ to get
another $O(h^2)$ approximation:
$$\hat{N}_1\left(\frac{h}{2}\right) = \frac{N\left(\frac{h}{3}\right) - \frac{2}{3}N\left(\frac{h}{2}\right)}{1 - \frac{2}{3}} = 3N\left(\frac{h}{3}\right) - 2N\left(\frac{h}{2}\right).$$
Both $N_1$ and $\hat{N}_1$ are $O(h^2)$ approximations, so we can combine them to get the $O(h^3)$ approximation.
Unfortunately, the Richardson's extrapolation formula does not apply. It assumes the same constants in
each approximation. But the general idea does. We need to combine these approximations
$$N_1(h) = M + \frac{1}{2}K_2h^2 + \frac{3}{4}K_3h^3 + \cdots$$
$$\hat{N}_1\left(\frac{h}{2}\right) = M + \frac{1}{6}K_2h^2 + \frac{5}{36}K_3h^3 + \cdots$$
to eliminate the $h^2$ term. By inspection, we need $3\hat{N}_1(\frac{h}{2}) - N_1(h)$:
$$3\hat{N}_1\left(\frac{h}{2}\right) - N_1(h) = 2M - \frac{1}{3}K_3h^3 - \cdots.$$
Therefore, the $O(h^3)$ approximation for $M$ we are looking for is
$$N_2(h) = \frac{3\hat{N}_1(\frac{h}{2}) - N_1(h)}{2} = \frac{3\left(3N(\frac{h}{3}) - 2N(\frac{h}{2})\right) - \left(2N(\frac{h}{2}) - N(h)\right)}{2} = \frac{N(h) - 8N(\frac{h}{2}) + 9N(\frac{h}{3})}{2}.$$
8: For the first extrapolation, we use formula 4.5.4 with $\alpha = \frac{1}{2}$ and $m_1 = 2$:
$$N_1(h) = \frac{4N(\frac{h}{2}) - N(h)}{3},$$
which leaves $N_1(h) = M + l_2h^4 + l_3h^6 + \cdots$. We get a second round of refinements from formula 4.5.4 with
$\alpha = \frac{1}{2}$ and $m_1 = 4$:
$$N_2(h) = \frac{16N_1(\frac{h}{2}) - N_1(h)}{15},$$
which leaves $N_2(h) = M + c_3h^6 + \cdots$. We get a third round of refinements from formula 4.5.4 with $\alpha = \frac{1}{2}$
and $m_1 = 6$:
$$N_3(h) = \frac{64N_2(\frac{h}{2}) - N_2(h)}{63}.$$
Tabulating the computation, it goes something like this:

N N1 N2 N3
2.356194
−0.4879837 −1.436042
−0.8815732 −1.012769 −0.9845514
−0.9709157 −1.000696 −0.9998916 −1.000135

The third Richardson extrapolation is −1.000135. Not bad considering the exact value of the integral is −1.
10: To summarize the method, let N0 (k) = Tk (1, 3), the trapezoidal rule itself applied with k subintervals. Then
since the error of the trapezoidal rule only contains even powers,

$$N_j(k) = \frac{4^j N_{j-1}(2k) - N_{j-1}(k)}{4^j - 1}$$
for j = 1, 2, . . .. To six significant figures, the following table summarizes the process.

k N0 (k) N1 (k) N2 (k) N3 (k) N4 (k) N5 (k) N6 (k)


1 −2.13074 −0.837026 −0.723655 −0.705067 −0.702555 −0.702340 −0.702330
2 −1.16045 −0.730741 −0.705358 −0.702564 −0.702340 −0.702330
4 −0.838170 −0.706944 −0.702608 −0.702341 −0.702330
8 −0.739751 −0.702879 −0.702345 −0.702330
16 −0.712097 −0.702378 −0.702330
32 −0.704808 −0.702333
64 −0.702952
To (Octave) machine accuracy,
$$\int_1^3 \ln(\sin(x))\,dx \approx -0.702330215031025$$
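REMARK: The whole table can be generated with a few lines of Octave. A minimal sketch that stores the values in a lower-triangular array R (the layout is our own choice; the table's row k = 1 appears along the diagonal of R):

f = @(x) log(sin(x));
a = 1; b = 3; levels = 7;
for i = 1:levels
  n = 2^(i-1); h = (b-a)/n; x = a:h:b;
  R(i,1) = h*(f(a)/2 + sum(f(x(2:end-1))) + f(b)/2);       % composite trapezoidal rule
  for j = 2:i
    R(i,j) = (4^(j-1)*R(i,j-1) - R(i-1,j-1))/(4^(j-1)-1);  % Richardson refinements
  end%for
end%for
R(levels,levels)                                            % approximately -0.702330215031025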

Section 5.2
2: Since there are three points given, the spline consists of two cubic pieces. Each cubic piece has 4 coefficients, so
we will need to construct a system of 8 equations in the 8 unknowns. The spline S takes the form
$$S(x) = \begin{cases} S_1(x) = a_1 + b_1(x-1) + c_1(x-1)^2 + d_1(x-1)^3, & x \in [0,1] \\ S_2(x) = a_2 + b_2(x-2) + c_2(x-2)^2 + d_2(x-2)^3, & x \in [1,2]. \end{cases}$$

The 8 equations come from the three sets of requirements on any free cubic spline.
Interpolation:

• $S_1(0) = -9 \Rightarrow a_1 - b_1 + c_1 - d_1 = -9$
• $S_1(1) = -13 \Rightarrow a_1 = -13$
• $S_2(1) = -13 \Rightarrow a_2 - b_2 + c_2 - d_2 = -13$
• $S_2(2) = -29 \Rightarrow a_2 = -29$

Derivative matching:
• $S_1'(1) = S_2'(1) \Rightarrow b_1 = b_2 - 2c_2 + 3d_2$
• $S_1''(1) = S_2''(1) \Rightarrow 2c_1 = 2c_2 - 6d_2$
Endpoint conditions:
• $S_1''(0) = 0 \Rightarrow 2c_1 - 6d_1 = 0$
• $S_2''(2) = 0 \Rightarrow 2c_2 = 0$
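REMARK: Although the following exercises solve such systems by hand (and exercise 10 calls naturalCubicSpline), the 8 × 8 system above can also be handed directly to Octave. A minimal sketch, ordering the unknowns $a_1, b_1, c_1, d_1, a_2, b_2, c_2, d_2$ (the ordering is our own choice):

% one row per equation listed above, in the same order
M = [ 1 -1  1 -1  0  0  0  0;     % a1 - b1 + c1 - d1 = -9
      1  0  0  0  0  0  0  0;     % a1 = -13
      0  0  0  0  1 -1  1 -1;     % a2 - b2 + c2 - d2 = -13
      0  0  0  0  1  0  0  0;     % a2 = -29
      0  1  0  0  0 -1  2 -3;     % b1 = b2 - 2c2 + 3d2
      0  0  2  0  0  0 -2  6;     % 2c1 = 2c2 - 6d2
      0  0  2 -6  0  0  0  0;     % 2c1 - 6d1 = 0
      0  0  0  0  0  0  2  0 ];   % 2c2 = 0
rhs = [-9; -13; -13; -29; 0; 0; 0; 0];
M\rhs        % coefficients in the order a1 b1 c1 d1 a2 b2 c2 d2 (compare exercise 10a)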

7: Since there are three points given, the spline consists of two cubic pieces. Each cubic piece has 4 coefficients, so
we will need to construct a system of 8 equations in the 8 unknowns. The spline S takes the form
$$S(x) = \begin{cases} S_1(x) = a_1 + b_1(x-2) + c_1(x-2)^2 + d_1(x-2)^3, & x \in [1,2] \\ S_2(x) = a_2 + b_2(x-4) + c_2(x-4)^2 + d_2(x-4)^3, & x \in [2,4]. \end{cases}$$

The 8 equations come from the three sets of requirements on any clamped cubic spline.
Interpolation:

• $S_1(1) = 1 \Rightarrow a_1 - b_1 + c_1 - d_1 = 1$
• $S_1(2) = 3 \Rightarrow a_1 = 3$
• $S_2(2) = 3 \Rightarrow a_2 - 2b_2 + 4c_2 - 8d_2 = 3$
• $S_2(4) = 2 \Rightarrow a_2 = 2$

Derivative matching:

• $S_1'(2) = S_2'(2) \Rightarrow b_1 = b_2 - 4c_2 + 12d_2$
• $S_1''(2) = S_2''(2) \Rightarrow 2c_1 = 2c_2 - 12d_2$

Endpoint conditions:

• $S_1'(1) = 0 \Rightarrow b_1 - 2c_1 + 3d_1 = 0$
• $S_2'(4) = 0 \Rightarrow b_2 = 0$

9a: Following the solution outlined in the text, equation 5.2.8 gives $n - 2 = 0$ equations in the $c_i$. Equation 5.2.11
gives $-4c_1 - 2c_2 = 3\left(\frac{4}{-1} - \frac{16}{-1}\right)$, which simplifies to
$$-4c_1 - 2c_2 = 36.$$
Combined with the equation $c_2 = 0$, we find $c_1 = -9$. Now we have the $a_i$ and $c_i$. The rest of the
solution amounts to back-substitution. From the left endpoint condition, $d_1 = \frac{1}{3}c_1 = -3$. From second
derivative matching, $d_2 = \frac{c_2 - c_1}{3} = \frac{0 - (-9)}{3} = 3$. Now we have the $d_i$. From the interpolation requirements,
$b_1 = a_1 + c_1 - d_1 + 9$ and $b_2 = a_2 + c_2 - d_2 + 13$, so
$$b_1 = -13 - 9 + 3 + 9 = -10$$
$$b_2 = -29 + 0 - 3 + 13 = -19.$$
The spline is, therefore,
$$S(x) = \begin{cases} -13 - 10(x-1) - 9(x-1)^2 - 3(x-1)^3, & x \in [0,1] \\ -29 - 19(x-2) + 3(x-2)^3, & x \in [1,2]. \end{cases}$$

REMARK: The solution outlined in the text is not the only way to get the solution. Any method of solving
the six equations involving bi , ci , and di can be used.

9e: Following the solution outlined in the text, equation 5.2.8 gives $n - 2 = 0$ equations in the $c_i$. We can not
use equation 5.2.11 since it was derived from free endpoint conditions. Instead, we need to use the clamped
endpoint conditions to come up with two equations in the $c_i$. Equation 5.2.10 gives us $b_1 = \frac{1}{-2} + \frac{-4}{3}c_1 + \frac{-2}{3}c_2$.
Solving the second derivative matching equation for $d_2$, we have $d_2 = \frac{c_2 - c_1}{6}$. Substituting expressions for $b_1$,
$b_2$, and $d_2$ into the first derivative matching equation, $-\frac{1}{2} = -\frac{2}{3}c_1 - \frac{4}{3}c_2$, which simplifies to $4c_1 + 8c_2 = 3$. This
is our first equation in the $c_i$. Now solving the left endpoint condition for $d_1$, we have $d_1 = \frac{2c_1 - b_1}{3}$. Substituting
expressions for $a_1$, $b_1$, and $d_1$ into the first interpolation equation, we have
$$3 - \left(\frac{1}{-2} + \frac{-4}{3}c_1 + \frac{-2}{3}c_2\right) + c_1 - \frac{2c_1 - \left(\frac{1}{-2} + \frac{-4}{3}c_1 + \frac{-2}{3}c_2\right)}{3} = 1,$$
which simplifies to $11c_1 + 4c_2 = -21$. The two equations in the $c_i$ can now be solved to
find $c_1 = -\frac{5}{2}$ and $c_2 = \frac{13}{8}$. As with the free spline, the rest of the solution amounts to back-substitution:
$$b_1 = \frac{1}{-2} + \left(\frac{-4}{3}\right)\left(-\frac{5}{2}\right) + \left(\frac{-2}{3}\right)\left(\frac{13}{8}\right) = \frac{7}{4}$$
$$d_1 = \frac{2(-\frac{5}{2}) - \frac{7}{4}}{3} = -\frac{9}{4}$$
$$d_2 = \frac{\frac{13}{8} - \left(-\frac{5}{2}\right)}{6} = \frac{11}{16}.$$
The spline is, therefore,
$$S(x) = \begin{cases} 3 + \frac{7}{4}(x-2) - \frac{5}{2}(x-2)^2 - \frac{9}{4}(x-2)^3, & x \in [1,2] \\ 2 + \frac{13}{8}(x-4)^2 + \frac{11}{16}(x-4)^3, & x \in [2,4]. \end{cases}$$

REMARK: The solution outlined in the text is not the only way to get the solution. Any method of solving
the six equations involving bi , ci , and di can be used.

10a: >> [a,b,c,d]=naturalCubicSpline([0,1,2],[-9,-13,-29])


a =
-13 -29

b =
-10 -19

c =
-9 0

d =
-3 3

11: First, the declaration of the function must be changed. Left and right endpoint derivatives, m0 and mn , will
be specified, so there must be additional arguments to the function. Also, the name of the function should be
changed:

function [a,b,c,d] = naturalCubicSpline(x,y)

should become

function [a,b,c,d] = clampedCubicSpline(x,y,m0,mn)

The rest of the modifications involve the endpoint conditions and their effect on the equations within the
function. We begin by solving the left endpoint condition for $d_1$: $b_1 + 2c_1h_1 + 3d_1h_1^2 = m_0 \Rightarrow$
$$d_1 = \frac{m_0 - b_1 - 2c_1h_1}{3h_1^2}. \qquad (6.5.6)$$

Substituting this equation, $a_i = y_i$, and equation 5.2.10 into 5.2.1 with $i = 1$ gives
$$y_1 + \left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right)h_1 + c_1h_1^2 + \frac{m_0 - \left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right) - 2c_1h_1}{3h_1^2}\,h_1^3 = y_0,$$
which simplifies as follows.
$$\left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right)h_1 + c_1h_1^2 + \frac{m_0 - \left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right) - 2c_1h_1}{3}\,h_1 = y_0 - y_1$$
$$\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2 + c_1h_1 + \frac{m_0 - \left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right) - 2c_1h_1}{3} = \frac{y_0 - y_1}{h_1}$$
$$3\frac{y_1 - y_2}{h_2} + 2h_2c_1 + h_2c_2 + 3c_1h_1 + m_0 - \left(\frac{y_1 - y_2}{h_2} + \frac{2}{3}h_2c_1 + \frac{1}{3}h_2c_2\right) - 2c_1h_1 = 3\frac{y_0 - y_1}{h_1}$$
$$2\frac{y_1 - y_2}{h_2} + \frac{4}{3}h_2c_1 + \frac{2}{3}h_2c_2 + c_1h_1 + m_0 = 3\frac{y_0 - y_1}{h_1}$$
$$6\frac{y_1 - y_2}{h_2} + 4h_2c_1 + 2h_2c_2 + 3c_1h_1 + 3m_0 = 9\frac{y_0 - y_1}{h_1},$$
and finally
$$(4h_2 + 3h_1)c_1 + 2h_2c_2 = 9\frac{y_0 - y_1}{h_1} - 6\frac{y_1 - y_2}{h_2} - 3m_0. \qquad (6.5.7)$$
The right endpoint condition, $S_n'(x_n) = m_n$, gives $b_n = m_n$. Substituting this information into 5.2.7 with
$i = n$ gives $m_n = \frac{y_{n-1} - y_n}{h_n} - \frac{(c_{n-1} + 2c_n)h_n}{3}$, which simplifies to
$$h_nc_{n-1} + 2h_nc_n = 3\left(\frac{y_{n-1} - y_n}{h_n} - m_n\right). \qquad (6.5.8)$$
hn

Equation 6.5.7 should be reflected in the modified code on lines 21 and 22:

m(1,1)=2*(h(1)+h(2)); m(1,2)=h(2);
m(1,n+1)=3*((y(1)-y(2))/h(1)-(y(2)-y(3))/h(2));

becomes

m(1,1)=3*h(1)+4*h(2); m(1,2)=2*h(2);
m(1,n+1)=9*(y(1)-y(2))/h(1)-6*(y(2)-y(3))/h(2)-3*m0;

Equation 6.5.8 should be reflected in the modified code on line 25:

m(n,n-1)=0; m(n,n)=1; m(n,n+1)=0;

becomes

m(n,n-1)=h(n); m(n,n)=2*h(n); m(n,n+1)=3*((y(n)-y(n+1))/h(n)-mn);

The solution for the ci remains unchanged. We have only left to modify the computation of b1 and d1 on lines
47 and 48. b1 now comes from 5.2.10, so

b(1)=(y(1)-y(2))/h(1)-2*c(1)*h(1)/3;

becomes

b(1)=(y(2)-y(3))/h(2)+2*c(1)*h(2)/3+h(2)*c(2)/3;

d1 now comes from 6.5.6, so

d(1)=-c(1)/(3*h(1));

becomes

d(1)=(m0-b(1)-2*c(1)*h(1))/(3*h(1)^2);

Of course, the comments at the beginning of the function should be updated as well. The modified code,
then, should look something like this:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 3 June 2014 %
% Purpose: Calculation of a clamped cubic %
% spline. %
% INPUT: points (x(1),y(1)), (x(2),y(2)), ... %
% spline must interpolate; first %
% derivative at left endpoint, m0; first %
% derivative at right endpoint, mn. %
% OUTPUT: coefficients of each piece of the %
% piecewise cubic spline: %
% S(i,x) = a(i) %
% + b(i)*(x-x(i+1)) %
% + c(i)*(x-x(i+1))^2 %
% + d(i)*(x-x(i+1))^3 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [a,b,c,d] = clampedCubicSpline(x,y,m0,mn)
n=length(x)-1;
for i=1:n
h(i)=x(i)-x(i+1);
end%for
% Left endpoint condition:
% m(1,1)*c(1) + m(1,2)*c(2) = m(1,n+1)
m(1,1)=3*h(1)+4*h(2); m(1,2)=2*h(2);
m(1,n+1)=9*(y(1)-y(2))/h(1)-6*(y(2)-y(3))/h(2)-3*m0;
% Right endpoint condition:
% m(n,n-1)*c(n-1) + m(n,n)*c(n) = m(n,n+1)
m(n,n-1)=h(n); m(n,n)=2*h(n); m(n,n+1)=3*((y(n)-y(n+1))/h(n)-mn);
% Conditions for all splines:
for i=2:n-1
m(i,i-1)=h(i);
m(i,i)=2*(h(i)+h(i+1));
m(i,i+1)=h(i+1);
m(i,n+1)=3*((y(i)-y(i+1))/h(i)-(y(i+1)-y(i+2))/h(i+1));
end%for
% Solve for c(i)
l(1)=m(1,1); u(1)=m(1,2)/l(1); z(1)=m(1,n+1)/l(1);
for i=2:n-1
l(i)=m(i,i)-m(i,i-1)*u(i-1);
u(i)=m(i,i+1)/l(i);
z(i)=(m(i,n+1)-m(i,i-1)*z(i-1))/l(i);
end%for
l(n)=m(n,n)-m(n,n-1)*u(n-1);
c(n)=(m(n,n+1)-m(n,n-1)*z(n-1))/l(n);
for i=n-1:-1:1
c(i)=z(i)-u(i)*c(i+1);
end%for
% Compute a(i), b(i), d(i)
% Endpoint conditions:
b(1)=(y(2)-y(3))/h(2)+2*c(1)*h(2)/3+h(2)*c(2)/3;
d(1)=(m0-b(1)-2*c(1)*h(1))/(3*h(1)^2);
% Conditions for all splines:
a(1)=y(2);

for i=2:n
d(i)=(c(i-1)-c(i))/(3*h(i));
b(i)=(y(i)-y(i+1))/h(i)-(c(i-1)+2*c(i))*h(i)/3;
a(i)=y(i+1);
end%for
b(n)=mn;
end%function

Notice the addition of the final computation, b(n)=mn. The value of b(n) from the loop is subject to
floating point error. Setting bn equal to mn at the end of the program eliminates this potential variation.
clampedCubicSpline.m may be downloaded at the companion website.
12b: >> [a,b,c,d]=clampedCubicSpline([1,2,4],[1,3,2],0,0)
a =
3 2

b =
1.75000 0.00000

c =
-2.5000 1.6250

d =
-2.25000 0.68750

Section 2.7
1: (c) g(a) = g(0) = 2 and g(b) = g(.9) = −.1897 so the bracket is good. Moreover, we now know that if the
value of the function is positive at any given iteration, that iteration becomes the left endpoint. Otherwise it
becomes the right endpoint. Recall, the secant method when applied to a proper bracket will always produce
an iteration inside the bracket, so bisection is never needed.

a b candidate x x g(x) x becomes


0 0.9 0.82203 0.82203 −0.207 b
0 0.82203 0.74486 0.74486 −0.137 b
0 0.74486 0.69690 0.69690 −0.060 b
0 0.69690 0.67660 0.67660 −0.020 b
0 0.67660 0.66971

|.66971 − .67660| = .00689 < .01 so we stop with x5 = .66971.


(h) f (a) = f (−20) ≈ 20 and f (b) = f (20) ≈ −17 so the bracket is good. Moreover, we now know that if the
value of the function is positive at any given iteration, that iteration becomes the left endpoint. Otherwise it
becomes the right endpoint. Recall, the secant method when applied to a proper bracket will always produce
an iteration inside the bracket, so bisection is never needed.

a b candidate x x g(x) x becomes


20 −20 1.5262 1.5262 1.18 a
1.5262 −20 2.7013 2.7013 −1.16 b
1.5262 2.7013 2.1186 2.1186 0.229 a
2.1186 2.7013 2.2142 2.2142 0.011 a
2.2142 2.7013 2.2189

|2.2189 − 2.2142| = .0047 < .01 so we stop with x5 = 2.2189.


2: (c) g(a) = g(0) = 2 and g(b) = g(.9) = −.1897 so the bracket is good. Moreover, we now know that if the
value of the function is positive at any given iteration, that iteration becomes the left endpoint. Otherwise it
becomes the right endpoint. An ∗ indicates that the bisection method was used due to the candidate landing
outside the bracket.

a b candidate x x g(x) x becomes


0 0.9 1.1136 0.45∗ 0.59 a
0.45 0.9 0.63925 0.63925 0.060 a
0.63925 0.9 0.66547 0.66547 0.0025 a
0.66547 0.9 0.66666

|.66666 − .66547| = .00119 < .01 so we stop with x4 = .66666.


(h) f (a) = f (−20) ≈ 20 and f (b) = f (20) ≈ −17 so the bracket is good. Moreover, we now know that if the
value of the function is positive at any given iteration, that iteration becomes the left endpoint. Otherwise it
becomes the right endpoint. An ∗ indicates that the bisection method was used due to the candidate landing
outside the bracket.

a b candidate x x g(x) x becomes


20 −20 1062.3 0∗ 1 a
0 −20 undef ined

The method is undefined beyond this point due to division by zero. The method fails.

REMARK: We will see later (question 6h) that Octave is able to handle the division by zero well enough
that the method does continue, and eventually arrives at a solution!

3: (c) The secant method produces the sequence of approximations

0, .9, .82203, 1.7456, .83551, .84905, −1.6288, .83478,


.82068, , .14336, .74168, .69475, .66071, .66700

at which point it stops since |.66700 − .66071| = .00629 < .01. The (pure) secant method takes significantly
longer to converge than does its bracketed cousin. This is largely due to the fact that in the secant method,
the third iteration comes from the secant method applied to .9 and .82203, the last two iterations (which do
not comprise a proper bracket), whereas the third iteration in false position comes from the secant method
applied to 0 and .82203 (a proper bracket).
(h) The secant method produces the sequence of approximations

−20, 20, 1.5262, 2.7013, 2.1186, 2.2142, 2.2192

at which point it stops since |2.2192 − 2.2142| = .005 < .01. The (pure) secant method and its bracketed
cousin produce the exact same sequence of iterations. It just happens that, at each step, the secant method
produces an approximation, which when paired with the previous iteration forms a proper bracket!

4: (c) Newton’s method produces the sequence of approximations

.9, 1.1136, 1.0302, 1.0030, 1.0000

at which point it stops since |1 − 1.003| = .003 < .01. The (pure) Newton’s method converges to a different
root, one outside the bracket! It is quick, but it fails to produce a root between 0 and .9, something that
should not be surprising from an un-safeguarded method.
(h) Newton’s method produces the sequence of approximations

20, 1062.3, 3803.0, 971.14, 377.14, 2880.5, 1606.3, 330.83, 66.635, 20.301,
−5.5823, −21.983, −10.454, −4.6688, 1.9357, 2.2550, 2.2193, 2.2191

at which point it stops since |2.2191 − 2.2193| = .0002 < .01. The (pure) Newton’s method takes significantly
longer to converge than does its bracketed cousin! Newton’s method is allowed to wander in a seemingly
random pattern before it comes close enough to the root to converge. Bracketing forces the iterations to
approach much more quickly the interval in which Newton’s method will converge.


5: (c)

>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=falsePosition(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.744866113620209
b = 0.696903242045358
b = 0.676602659540989
b = 0.669712929388636
b = 0.667578776723430
b = 0.666937771712738
b = 0.666747069128180
b = 0.666690496216585
b = 0.666673727853602
b = 0.666668758921090
b = 0.666667286598371

res = 0.666666850350527
i = 13

so x13 = 0.666666850350527 is expected to be within 10−6 of the actual root.


(h)

>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=falsePosition(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347853
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21893051185485
b = 2.21910087293432
b = 2.21910692606145

res = 2.21910714100071
i = 8

so x8 = 2.21910714100071 is expected to be within 10−6 of the actual root.

6: (c)

>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> f2p=inline(’12*x^3-6*x^2-3’);
>> [res,i]=bracketedNewton(f2,f2p,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.450000000000000
b = 0.639257968925196
b = 0.665474256136936
b = 0.666663994320019
b = 0.666666666653136

res = 0.666666666666667
i = 6

so x6 = 0.666666666666667 is expected to be within 10−6 of the actual root.


(h)

>> f7=inline(’exp(sin(r))-r’);
>> f7p=inline(’exp(sin(r))*cos(r)-1’);
>> [res,i]=bracketedNewton(f7,f7p,-20,20,10^-6,100)
b = 20
b = 0
warning: division by zero
b = 10
b = 3.66539525575696
b = 1.65966535497164
b = 2.50454805267468
b = 2.22298743934113
b = 2.21911019802387
b = 2.21910714891565

res = 2.21910714891375
i = 9

so x9 = 2.21910714891375 is expected to be within 10−6 of the actual root.

REMARK: When we tried to compute this solution by hand (question 2h), we quit after the first iteration
due to the division by zero. However, Octave continues, treating the undefined estimate as one that
lands outside the bracket. Thus the second iteration is 10 (the bisection method applied to [0, 20]).

7: (c)

>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=bracketedInverseQuadratic(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.411015207562680
b = 0.729556813485380
b = 0.629464108906733
b = 0.671561434924253
b = 0.666977335665865
b = 0.666666168461076

res = 0.666666666960237
i = 8

so x8 = 0.666666666960237 is expected to be within 10−6 of the actual root.


(h)

>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=bracketedInverseQuadratic(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347854
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21917736990638
b = 2.21910707796098

res = 2.21910714891272
i = 7

so x7 = 2.21910714891272 is expected to be within 10−6 of the actual root.


10: (c)

>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> g2=inline(’f2(x)+x’);
>> [res,i]=bracketedSteffensens(g2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.559577120523157
b = 0.707986331365555
b = 0.669737865924576
b = 0.666686284030401
b = 0.666666667476795

res = 0.666666667666825
i = 6

so x6 = 0.666666667666825 is expected to be within 10−6 of the actual root.


(h)

>> f7=inline(’exp(sin(r))-r’);
>> g7=inline(’f7(x)+x’);
>> [res,i]=bracketedSteffensens(g7,-20,20,10^-6,100)
b = 20
b = 1.80564417969925
b = 2.18151287547235
b = 2.21873144340028
b = 2.21910711013891

res = 2.21910707929096
i = 5

so x5 = 2.21910707929096 is expected to be within 10−6 of the actual root.


13: (c)

>> f2=inline(’3*x^4-2*x^3-3*x+2’);
>> [res,i]=bracketedInverseQuadraticRE(f2,0,.9,10^-6,100)
b = 0.900000000000000
b = 0.822030415125360
b = 0.411015207562680
b = 0.729556813485380
b = 0.629464108906733
b = 0.671561434924253
b = 0.666977335665865

b = 0.666666168461076

res = 0.666666666960237
i = 8

so x8 = 0.666666666960237 is expected to be within 10−6 of the actual root.


(h)

>> f7=inline(’exp(sin(r))-r’);
>> [res,i]=bracketedInverseQuadraticRE(f7,-20,20,10^-6,100)
b = 20
b = 1.52625394347854
b = 2.70134274226916
b = 2.11862078217644
b = 2.21421804475756
b = 2.21917736990638
b = 2.21910707796098

res = 2.21910714891272
i = 7

so x7 = 2.21910714891272 is expected to be within 10−6 of the actual root.

Section 6.1
1d: The degree of the differential equation equals the degree of the highest degree derivative in the equation. The
only appearance of a derivative in the equation is the $f'$ term. That makes the highest degree derivative 1,
so the degree of the differential equation is 1.
2d: In the differential equation $f' + \frac{f}{x} = x^2$, both $f$ and $f'$ appear. To verify that a given function $f$ is a solution,
we need to substitute both $f$ and $f'$ into the equation. $f'$ is not given, so we calculate it:
$$f'(x) = \frac{3x^2}{4} - \frac{4}{x^2}.$$
Now that we have everything needed, we substitute $f$ and $f'$ into the differential equation and verify that the
equation is true. Substituting:
$$\left(\frac{3x^2}{4} - \frac{4}{x^2}\right) + \frac{\frac{x^3}{4} + \frac{4}{x}}{x} = x^2.$$
It is not obvious that this equation is true, so we need to do a little work. To finish the verification, we must
show that the two sides are equal using algebra. Adding or subtracting or doing anything else to both sides
simultaneously supposes that the two sides are equal, so these things are not allowed! Instead, we need to
manipulate the two sides separately. Working with the left side only:
$$\left(\frac{3x^4}{4x^2} - \frac{16}{4x^2}\right) + \frac{\frac{x^3}{4} + \frac{4}{x}}{x} = x^2$$
$$\frac{3x^4}{4x^2} - \frac{16}{4x^2} + \frac{x^4}{4x^2} + \frac{16}{4x^2} = x^2$$
$$\frac{4x^4}{4x^2} = x^2.$$
Almost done, but technically, this equation is not true! It is false when $x = 0$ because the left side is undefined
for $x = 0$. Luckily we do not have to worry about that case. It was given that $x > 0$, so we know $x \ne 0$ and
we can reduce $\frac{4x^4}{4x^2}$ to $x^2$, which finishes the verification.

3d: In order to verify that a function is a solution of an initial value problem, we need to verify that it solves the
differential equation and satisfies the initial value requirement.

• Showing that $f(x) = \frac{x^3}{4} + \frac{16}{x}$, $x > 0$, is a solution of $f' = -\frac{f}{x} + x^2$: In the differential equation $f' + \frac{f}{x} = x^2$,
both $f$ and $f'$ appear. To verify that a given function $f$ is a solution, we need to substitute both $f$ and
$f'$ into the equation. $f'$ is not given, so we calculate it:
$$f'(x) = \frac{3x^2}{4} - \frac{16}{x^2}.$$
Now that we have everything needed, we substitute $f$ and $f'$ into the differential equation and verify
that the equation is true. Substituting:
$$\left(\frac{3x^2}{4} - \frac{16}{x^2}\right) + \frac{\frac{x^3}{4} + \frac{16}{x}}{x} = x^2.$$
It is not obvious that this equation is true, so we need to do a little work. To finish the verification, we
must show that the two sides are equal using algebra. Adding or subtracting or doing anything else to
both sides simultaneously supposes that the two sides are equal, so these things are not allowed! Instead,
we need to manipulate the two sides separately. Working with the left side only:
$$\left(\frac{3x^4}{4x^2} - \frac{64}{4x^2}\right) + \frac{\frac{x^3}{4} + \frac{16}{x}}{x} = x^2$$
$$\frac{3x^4}{4x^2} - \frac{64}{4x^2} + \frac{x^4}{4x^2} + \frac{64}{4x^2} = x^2$$
$$\frac{4x^4}{4x^2} = x^2.$$
Almost done, but technically, this equation is not true! It is false when $x = 0$ because the left side is
undefined for $x = 0$. Luckily we do not have to worry about that case. It was given that $x > 0$, so we
know $x \ne 0$ and we can reduce $\frac{4x^4}{4x^2}$ to $x^2$, which finishes the verification.

• Showing that $f(4) = 20$: To show that $f$ satisfies the initial value requirement, we simply compute $f(4)$
and show that it is 20 as required. $f(4) = \frac{4^3}{4} + \frac{16}{4} = \frac{64}{4} + \frac{16}{4} = \frac{80}{4} = 20.$

4c: The given ẏ = t − sin t can be restated as y 0 (t) = t − sin t. In other words, we are given the derivative of y as
a function of t. The fundamental theorem of calculus tells us that y must be the integral (antiderivative) of
the given function. That is,
$$y(t) = \int (t - \sin t)\,dt = \frac{1}{2}t^2 + \cos t + C.$$
So the (infinitely many) solutions of the o.d.e. are $y(t) = \frac{1}{2}t^2 + \cos t + C$.
5d: Though we could give them, this question is not asking for exact measurements of the error. It is simply
requesting a comment on the accuracy of the approximate solution. It will suffice to compare the graphs of
the exact solution and approximate solution over the interval covered by the approximate solution, [4, 5], and
do a calculation or two. The graph of the exact solution is a graph of the function $f(x) = \frac{x^3}{4} + \frac{16}{x}$ and the
graph of the approximate solution is a graph of the set {(4, 20), (4.25, 23), (4.5, 26), (4.75, 30), (5, 34)}:

From the graphs, the only point in the approximation that is visually separate from the graph of the exact
solution is the point (5, 34). And it only misses by a small relative amount. To be more precise, the relative
error there is $\frac{|f(5) - 34|}{|f(5)|} = \frac{9}{689} \approx 0.013$. Any general comment on the accuracy of an approximation should
take into account the requirements of the situation. In this case, there is no context to say whether we should
hope for 10%, 1%, .1%, or smaller relative error or whether we should be more concerned about absolute
error. Without any such context, we will simply use the visual representation, which shows the points of
the approximation very close to the graph of the exact solution, and conclude the approximation is a good
representation of the exact solution.
6c: The forces acting on a stationary block on an inclined plane are gravity, friction, and the normal force of the
surface on which it is lying. Gravity acts vertically downward. Friction acts parallel to the surface and up the
slope since it is resisting gravity which pulls the block down the slope. The normal force acts perpendicular to
the surface. Representing the block as a rectangle and each force by a vector, the free body diagram should
look something like this:

Note that the line representing the surface is NOT part of the free body diagram, so it is dashed. It is only
there to show the (potential) direction of motion.
6f: The forces acting on a sofa being pushed across a level floor are gravity, friction, the normal force of the floor,
and the applied force. Gravity acts vertically downward. Friction acts parallel to the floor opposing the
applied force. The normal force acts perpendicular to the floor. And the applied force acts in an unspecified
direction not parallel to the floor. Representing the sofa as a rectangle and each force by a vector, the free
body diagram should look something like this:

Note that the line representing the floor is NOT part of the free body diagram, so it is dashed. It is only
there to show the direction of motion.
6m: The forces acting on a sky diver—whether his parachute is open, closed, or in the process of opening does
not matter—are gravity and drag (air resistance). Gravity acts vertically downward and drag acts vertically
upward. Representing the sky diver as a rectangle and each force by a vector, the free body diagram should
look something like this:

7c: (See solution of 6c for free body diagram) Since the block is not moving, the net force in any direction must
be zero! That makes the equation of motion s(t) = 0. The end. This answers the question asked.
In a situation where the block is moving, however, it is necessary to consider the magnitudes of the forces
acting in the direction of motion, friction and gravity. For sake of discussion, here is how they may be resolved.
The normal force acts normal to the motion so has zero tangential component. Friction is proportional to
the normal force, and by convention we use µ for the constant of proportionality, so the magnitude of friction
is µN . Adding an auxiliary line perpendicular to the surface, we see that the component of gravity in the
tangential direction is mg sin α.

Taking the positive direction to be down the slope, the forces acting tangential (parallel) to the surface are
mg sin α − µN . To complete the equation of motion, we need to compute N . Since the block does not move
in the normal direction, the net force in that direction must be zero. The only forces acting in the normal
direction are the normal force itself and a component of gravity. Therefore, N must equal the magnitude
of gravity in the normal direction. Again using the auxiliary line, the component of gravity in the normal
direction is mg cos α. Hence N = mg cos α. Substituting this expression into the tangential forces, we have
mg sin α − µmg cos α acting tangential to the surface. By Newton’s Second Law, this force must equal ma, so
the equation of motion is ms̈ = mg sin α − µmg cos α, which simplifies to

s̈ = g(sin α − µ cos α).

This equation can be used for a block in motion down an inclined plane.
7f: (See solution of 6f for free body diagram) Both gravity and the normal force act normal to the motion, so have
zero tangential components. The only forces that act (with nonzero component) in the direction of motion
are friction and the applied force. Friction is proportional to the normal force, and by convention we use µ
for the constant of proportionality, so the magnitude of friction is µN . Adding an auxiliary line parallel to
the surface, we mark the angle of the applied force and see that the component of the applied force in the
tangential direction is Fapplied cos β.

Taking the positive direction to be left, the forces acting tangential (parallel) to the surface are Fapplied cos β −
µN . To complete the equation of motion, we need to compute N . Since the block does not move in the
normal direction, the net force in the normal direction must be zero. The forces acting in that direction are
N itself, gravity, and a component of the applied force. Therefore, in the normal direction, we must have
N + Fapplied sin β = mg or N = mg − Fapplied sin β. Substituting this expression into the tangential forces,
we have Fapplied cos β − µ(mg − Fapplied sin β) acting tangential to the surface. By Newton’s Second Law, this
force must equal ma, so the equation of motion is ms̈ = Fapplied cos β − µ(mg − Fapplied sin β), which simplifies
to
$$\ddot{s} = \frac{F_{applied}}{m}(\cos\beta + \mu\sin\beta) - \mu g.$$

7m: (See solution of 6m for free body diagram) Both forces in the free body diagram act in the vertical direction, so
the equation of motion is particularly simple in this case. No trigonometry is needed. F = ma simply becomes
Fdrag − mg = ms̈, taking upward to be the positive direction. The drag force is taken to be proportional to
speed but in the opposite direction, so Fdrag may be replaced by −cṡ (for some positive constant c) and the
equation of motion becomes, more precisely, −cṡ − mg = ms̈. With a little bit of algebra, this equation can
be rewritten as
$$\ddot{s} + \frac{c}{m}\dot{s} + g = 0.$$

Section 6.2
1a: Replacing the t in Euler’s Method (6.2.3) by x, Euler’s Method applied to this problem has the form yi+1 =
yi + h · y 0 (xi , yi ). Because the initial condition is y(1) = 1, we begin with x0 = 1 and y0 = 1. Then

y1 = y0 + 0.5(3x0 − 2y0 )
= 1 + 0.5(3(1) − 2(1))
= 1.5
x1 = x0 + h = 1 + 0.5 = 1.5

Now x0 and y0 can be forgotten as we compute x2 and y2 :

y2 = y1 + 0.5(3x1 − 2y1 )
= 1.5 + 0.5(3(1.5) − 2(1.5))
= 2.25
x2 = x1 + h = 1.5 + 0.5 = 2.0

Therefore, we have y(2) ≈ 2.25.

1d: Because the o.d.e. is not written in the form y 0 = f (t, y), it is our job to rewrite it in that form, taking what
is given and solving for y 0 :

cos(x)y 0 + sin(x)y = 2 cos3 (x) sin(x) − 1


cos(x)y 0 = 2 cos3 (x) sin(x) − 1 − sin(x)y
2 cos3 (x) sin(x) − 1 − sin(x)y
y0 =
cos(x)
= 2 cos2 (x) sin(x) − sec(x) − y tan(x)

So we have f (x, y) = 2 cos2 (x) sin(x) − sec(x) − y tan(x). Now replacing the t in Euler’s Method (6.2.3) by x,
Euler’s Method applied to this problem has the form yi+1 = yi + h · y 0 (xi , yi ). Because the initial condition
is y(1) = 0, we begin with x0 = 1 and y0 = 0. Then

y1 = y0 + 0.5f (x0 , y0 )
= 0 + 0.5f (1, 0)
= 0.5(2 cos2 (1) sin(1) − sec(1))
≈ −0.67976011062352
x1 = x0 + h = 1 + 0.5 = 1.5

Now x0 and y0 can be forgotten as we compute x2 and y2 :

y2 = y1 + 0.5f (x1 , y1 )
≈ −0.67976 + 0.5f (1.5, −0.67976)
≈ −2.9503939532546
x2 = x1 + h = 1.5 + 0.5 = 2.0

Therefore, we have y(2) ≈ −2.9503939532546.



2a: For Taylor's Method of degree 2, we will need the second derivative of $y$. The only thing we have to work with
is the o.d.e. itself, $\frac{dy}{dx} = 3x - 2y$. By implicit differentiation,
$$\frac{d^2y}{dx^2} = 3 - 2\frac{dy}{dx}.$$
However, this does not give us $y''$ in terms of $x$ and $y$. We must substitute $\frac{dy}{dx}$ in terms of $x$ and $y$. But that's
exactly what the o.d.e. tells us! Substituting $\frac{dy}{dx} = 3x - 2y$ into the expression for $\frac{d^2y}{dx^2}$ yields
$$\frac{d^2y}{dx^2} = 3 - 2(3x - 2y) = 3 - 6x + 4y.$$

Now we are ready. Symbolically, Taylor’s Method of degree 2 is

1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi )
2
xi+1 = xi + h

Beginning with the initial conditions, x0 = y0 = 1,

1
y1 = y0 + h · y 0 (x0 , y0 ) + h2 · y 00 (x0 , y0 )
2
1
= 1 + 0.5(3 · 1 − 2 · 1) + (0.5)2 · (3 − 6 · 1 + 4 · 1)
2
= 1.625
x1 = x0 + h = 1 + 0.5 = 1.5

Now x0 and y0 can be forgotten as we compute x2 and y2 :

1
y2 = y1 + h · y 0 (x1 , y1 ) + h2 · y 00 (x1 , y1 )
2
1
= 1.625 + 0.5(3 · 1.5 − 2 · 1.625) + (0.5)2 · (3 − 6 · 1.5 + 4 · 1.625)
2
= 2.3125
x2 = x1 + h = 1.5 + 0.5 = 2.0

Therefore, we have y(2) = 2.3125.

2d: For Taylor's Method of degree 2, we will need the second derivative of $y$. The only thing we have to work
with is the o.d.e. itself (after it's been solved for $\frac{dy}{dx}$): $\frac{dy}{dx} = 2\cos^2(x)\sin(x) - \sec(x) - y\tan(x)$. By implicit
differentiation,
$$\frac{d^2y}{dx^2} = -\tan(x)\cdot\frac{dy}{dx} - \sec(x)\tan(x) - 4\cos(x)\sin^2(x) - y\sec^2(x) + 2\cos^3(x).$$
However, this does not give us $y''$ in terms of $x$ and $y$. We must substitute $\frac{dy}{dx}$ in terms of $x$ and $y$. But that's
exactly what the o.d.e. tells us! Substituting $\frac{dy}{dx} = 2\cos^2(x)\sin(x) - \sec(x) - y\tan(x)$ into the expression for
$\frac{d^2y}{dx^2}$ (and simplifying a lot!) yields
$$\frac{d^2y}{dx^2} = -y + 8\cos^3(x) - 6\cos(x)$$
Now we are ready. Symbolically, Taylor’s Method of degree 2 is

1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi )
2
xi+1 = xi + h

Beginning with the initial conditions, x0 = 1, y0 = 0,


1
y1 = y0 + h · y 0 (x0 , y0 ) + h2 · y 00 (x0 , y0 )
2
= 0 + 0.5(2 cos2 (1) sin(1) − sec(1))
1
+ (0.5)2 · (8 cos3 (1) − 6 cos(1))
2
≈ −0.92725823477363
x1 = x0 + h = 1 + 0.5 = 1.5

Now x0 and y0 can be forgotten as we compute x2 and y2 :


1
y2 = y1 + h · y 0 (x1 , y1 ) + h2 · y 00 (x1 , y1 )
2
1
≈ −0.9272 + 0.5f (1.5, −0.9272) + (0.5)2 · y 00 (1.5, −0.9272)
2
≈ −1.3896462555267
x2 = x1 + h = 1.5 + 0.5 = 2.0

Therefore, we have y(2) = −1.3896462555267. If this exercise does not convince you that Taylor’s Methods
of degree higher than 2 are not particularly user-friendly, just wait until you try Taylor’s Method of degree 3
on this problem.
3a: For Taylor's Method of degree 3, we will need the second and third derivatives of $y$. The only thing we have
to work with is the o.d.e. itself, $\frac{dy}{dx} = 3x - 2y$. By implicit differentiation,
$$\frac{d^2y}{dx^2} = 3 - 2\frac{dy}{dx}.$$
However, this does not give us $y''$ in terms of $x$ and $y$. We must substitute $\frac{dy}{dx}$ in terms of $x$ and $y$. But that's
exactly what the o.d.e. tells us! Substituting $\frac{dy}{dx} = 3x - 2y$ into the expression for $\frac{d^2y}{dx^2}$ yields
$$\frac{d^2y}{dx^2} = 3 - 2(3x - 2y) = 3 - 6x + 4y.$$
Implicitly differentiating the equation for $\frac{d^2y}{dx^2}$ gives
$$\frac{d^3y}{dx^3} = -6 + 4\cdot\frac{dy}{dx} = -6 + 4(3x - 2y) = 12x - 8y - 6.$$

Now we are ready. Symbolically, Taylor’s Method of degree 3 is


1 1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi ) + h3 · y 000 (xi , yi )
2 6
xi+1 = xi + h

Beginning with the initial conditions, x0 = y0 = 1,


1 1
y1 = y0 + h · y 0 (x0 , y0 ) + h2 · y 00 (x0 , y0 ) + h3 · y 000 (x0 , y0 )
2 6
1 2
= 1 + 0.5(3 · 1 − 2 · 1) + (0.5) · (3 − 6 · 1 + 4 · 1)
2
1 3
+ (0.5) (12 · 1 − 8 · 1 − 6)
6
≈ 1.5833333333333
x1 = x0 + h = 1 + 0.5 = 1.5

Now x0 and y0 can be forgotten as we compute x2 and y2 :


1 1
y2 = y1 + h · y 0 (x1 , y1 ) + h2 · y 00 (x1 , y1 ) + h3 · y 000 (x1 , y1 )
2 6
1
≈ 1.583 + 0.5(3 · 1.5 − 2 · 1.583) + (0.5)2 · (3 − 6 · 1.5 + 4 · 1.583)
2
1 3
+ (0.5) (12 · 1.5 − 8 · 1.583 − 6)
6
≈ 2.2777777777777
x2 = x1 + h = 1.5 + 0.5 = 2.0
Therefore, we have y(2) = 2.2777777777777.
3d: For Taylor's Method of degree 3, we will need the second and third derivatives of $y$. The only thing we have
to work with is the o.d.e. itself (after it's been solved for $\frac{dy}{dx}$): $\frac{dy}{dx} = 2\cos^2(x)\sin(x) - \sec(x) - y\tan(x)$. By
implicit differentiation,
$$\frac{d^2y}{dx^2} = -\tan(x)\cdot\frac{dy}{dx} - \sec(x)\tan(x) - 4\cos(x)\sin^2(x) - y\sec^2(x) + 2\cos^3(x).$$
However, this does not give us $y''$ in terms of $x$ and $y$. We must substitute $\frac{dy}{dx}$ in terms of $x$ and $y$. But that's
exactly what the o.d.e. tells us! Substituting $\frac{dy}{dx} = 2\cos^2(x)\sin(x) - \sec(x) - y\tan(x)$ into the expression for
$\frac{d^2y}{dx^2}$ (and simplifying a lot!) yields
$$\frac{d^2y}{dx^2} = -y + 8\cos^3(x) - 6\cos(x).$$
Implicitly differentiating the equation for $\frac{d^2y}{dx^2}$ gives
$$\frac{d^3y}{dx^3} = -\frac{dy}{dx} - 24\cos^2(x)\sin(x) + 6\sin(x) = y\tan(x) + (6 - 26\cos^2(x))\sin(x) + \sec(x).$$
Now we are ready. Symbolically, Taylor’s Method of degree 3 is
1 1
yi+1 = yi + h · y 0 (xi , yi ) + h2 · y 00 (xi , yi ) + h3 · y 000 (xi , yi )
2 6
xi+1 = xi + h
Beginning with the initial conditions, x0 = 1, y0 = 0,
1 1
y1 = y0 + h · y 0 (x0 , y0 ) + h2 · y 00 (x0 , y0 ) + h3 · y 000 (x0 , y0 )
2 6
= 0 + 0.5(2 cos2 (1) sin(1) − sec(1))
1
+ (0.5)2 · (8 cos3 (1) − 6 cos(1))
2
1
+ (0.5)3 · (sec(1) + (6 − 26 cos2 (1)) sin(1))
6
≈ −0.91657489783846
x1 = x0 + h = 1 + 0.5 = 1.5
Now x0 and y0 can be forgotten as we compute x2 and y2 :
1 1
y2 = y1 + h · y 0 (x1 , y1 ) + h2 · y 00 (x1 , y1 ) + h3 · y 000 (x1 , y1 )
2 6
1
≈ −0.9166 + 0.5f (1.5, −0.9166) + (0.5)2 · y 00 (1.5, −0.9166)
2
1
+ (0.5)3 · y 000 (1.5, −0.9166)
6
≈ −1.3083937870918
x2 = x1 + h = 1.5 + 0.5 = 2.0
Therefore, we have y(2) = −1.3083937870918. If this exercise does not convince you that Taylor’s Methods
of degree higher than 2 are not particularly user-friendly, nothing will!

7: Remember to document your code! In fact, the documentation for a function should almost always be written
before the function itself. Putting down in print exactly what the intended inputs and outputs of the function
will be should help guide how it is written. From the pseudo-code for Euler’s Method, the inputs are the
differential equation ẏ = f (t, y); initial condition y(t0 ) = y0 ; numbers t0 and t1 ; and the number of steps N .
A reasonable comment for the beginning of the function would list all of these inputs and the output, plus
document who wrote it when and for what reason:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 29 January 2012 %
% Purpose: This function implements Euler’s method where the %
% step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The declaration of the function has to have the five inputs as arguments and the output as a return value.
Something like function [y,x] = eulerode(f,a,ya,b,n) should do, where ya of course is the input y(a).
The rest of the function should follow almost verbatim the pseudo-code. I’ve used x instead of t for the
independent variable. eulerode.m may be downloaded at the companion website.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 29 January 2012 %
% Purpose: This function implements Euler’s method where the %
% step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerode(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
y(i+1) = y(i) + h*f(x(i),y(i));
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
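For example, the hand computation of exercise 1a can be reproduced with a short session (a sketch; only the call is shown):

>> f=inline('3*x-2*y');
>> [y,x]=eulerode(f,1,1,2,2)

which should return y = [1 1.5 2.25] and x = [1 1.5 2], agreeing with the values computed by hand in exercise 1a.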
14c: The equation of motion is s̈ = g(sin α − µ cos α). It is a second order differential equation with dependent
variable s and independent variable t. The g, α, and µ appearing in the equation are constants. We let u = ṡ
so u̇ = s̈ = g(sin α − µ cos α), and the first order system becomes
u̇ = g(sin α − µ cos α)
ṡ = u
Fapplied
14f: The equation of motion is s̈ = m (cos β
+ µ sin β) − µg. It is a second order differential equation with
dependent variable s and independent variable t. The β, m, Fapplied , and µ appearing in the equation are
F
constants. We let u = ṡ so u̇ = s̈ = applied
m (cos β + µ sin β) − µg, and the first order system becomes
Fapplied
u̇ = (cos β + µ sin β) − µg
m
ṡ = u

14m: The equation of motion is s̈ + m c


ṡ + g = 0. It is a second order differential equation with dependent variable
s and independent variable t. The c, m, and g appearing in the equation are constants. We let u = ṡ so
u̇ = s̈ = − m
c
ṡ − g, and the first order system becomes
c
u̇ = − u − g
m
ṡ = u

15c: The system we are solving is

u̇ = g(sin α − µ cos α)
ṡ = u

with initial conditions s(0) = 0, ṡ(0) = 0 and parameter values g = 32.2 ft/s2 , µ = .21, α = .25 rad. No
conversion of units is needed. We plug the parameter values into the system to get the initial value problem

u̇ = 1.41462169238826
ṡ = u
u0 = ṡ(0) = 0
s0 = s(0) = 0

Applying Euler’s method to this system means iterating

un+1 = un + hu̇(un , sn ) = un + 0.25(1.41462169238826)


sn+1 = sn + hṡ(un , sn ) = sn + 0.25un
tn+1 = tn + h

In particular,

u1 = u0 + 0.25u̇(u0 , s0 )
= 0 + 0.25(1.41462169238826) ≈ 0.353655423097065
s1 = s0 + 0.25u0
= 0 + 0.25(0) = 0
t1 = t0 + 0.25 = .25

and

u2 = u1 + 0.25u̇(u1 , s1 )
≈ 0.3536 + 0.25(1.414) ≈ 0.7073108461941298
s2 = s1 + 0.25u1
≈ 0 + 0.25(0.3536) ≈ 0.08841385577426622
t2 = t1 + 0.25 = .5

Therefore, s(0.5) ≈ 0.08841385577426622.


15f: The system we are solving is
$$\dot{u} = \frac{F_{applied}}{m}(\cos\beta + \mu\sin\beta) - \mu g$$
$$\dot{s} = u$$
with initial conditions $s(0) = 0$, $\dot{s}(0) = .03$ and parameter values $g = 9.81$ m/s², $\mu = .15$, $\beta = \frac{\pi}{10}$ rad, $m = 35$
kg, and $F_{applied} = 75$ N. No conversion of units is needed. We plug the parameter values into the system to
get the initial value problem
$$\dot{u} = \frac{75}{35}\left(\cos\frac{\pi}{10} + .15\sin\frac{\pi}{10}\right) - .15(9.81) \approx 0.6658051402529905$$
$$\dot{s} = u$$
$$u_0 = \dot{s}(0) = .03$$
$$s_0 = s(0) = 0$$

Applying Euler’s method to this system means iterating

un+1 = un + hu̇(un , sn ) = un + 0.25(0.6658051402529905)


sn+1 = sn + hṡ(un , sn ) = sn + 0.25un
tn+1 = tn + h

In particular,
u1 = u0 + 0.25u̇(u0 , s0 )
= .03 + 0.25(0.6658051402529905) ≈ 0.1964512850632476
s1 = s0 + 0.25u0
= 0 + 0.25(.03) = 0.0075
t1 = t0 + 0.25 = .25
and
u2 = u1 + 0.25u̇(u1 , s1 )
≈ 0.1964 + 0.25(0.6658) ≈ 0.3629025701264953
s2 = s1 + 0.25u1
≈ 0.0075 + 0.25(0.1964) ≈ 0.05661282126581191
t2 = t1 + 0.25 = .5
Therefore, s(0.5) ≈ 0.05661282126581191.
15m: The system we are solving is
$$\dot{u} = -\frac{c}{m}u - g$$
$$\dot{s} = u$$
with initial conditions $s(0) = 2000$, $\dot{s}(0) = -55$ and parameter values $g = 9.81$ m/s², $c = 26$, and $m = 70$
kg. No conversion of units is needed. We plug the parameter values into the system to get the initial value
problem
$$\dot{u} = -\frac{26}{70}u - 9.81 = -\frac{13}{35}u - 9.81$$
$$\dot{s} = u$$
$$u_0 = \dot{s}(0) = -55$$
$$s_0 = s(0) = 2000$$
Applying Euler’s method to this system means iterating
13
 
un+1 = un + hu̇(un , sn ) = un + 0.25 − un − 9.81
35
sn+1 = sn + hṡ(un , sn ) = sn + 0.25un
tn+1 = tn + h
In particular,
u1 = u0 + 0.25u̇(u0 , s0 )
13
 
= −55 + 0.25 − (−55) − 9.81 ≈ −52.34535714285715
35
s1 = s0 + 0.25u0
= 2000 + 0.25(−55) = 1986.25
t1 = t0 + 0.25 = .25
and
u2 = u1 + 0.25u̇(u1 , s1 )
13
 
≈ −52.34 + 0.25 − (−52.34) − 9.81 ≈ −49.9372168367347
35
s2 = s1 + 0.25u1
≈ 1986.25 + 0.25(−52.34) ≈ 1973.163660714286
t2 = t1 + 0.25 = .5
Therefore, s(0.5) ≈ 1973.163660714286.
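REMARK: The two Euler steps above are easy to automate. A minimal Octave sketch for this part (the variable names are our own choices):

g = 9.81; c = 26; m = 70;          % parameter values
h = 0.25;
u = -55; s = 2000; t = 0;          % initial conditions u = s'(0), s = s(0)
for i = 1:2
  unew = u + h*(-c/m*u - g);       % Euler step for u' = -(c/m)u - g
  s = s + h*u;                     % Euler step for s' = u (uses the old u)
  u = unew;
  t = t + h;
end%for
s                                   % approximately 1973.16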

18a: A number of differential equations solution techniques require you to have some idea what the solution will
be before you know exactly what it is. You then take this “rough guess” and refine it by forcing it to solve the
given differential equation. The method of undetermined coefficients is an example of such a technique. We
know the solution will be a linear combination of certain functions, but we don’t know the right coefficients
to use. To find the coefficients, we plug the solution with unknown (undetermined) coefficients into the
differential equation and match the coefficients of like terms. This process leaves us with a linear system of
equations to solve for the unknowns. In this particular example, we are given that y(x) = Ax2 + Bx + C is a
solution of y 00 + 5y 0 − 8y = 3x2 , and it is our job to figure out the values of A, B, and C. We will find y 0 and
y 00 and substitute them into the o.d.e.:
y 0 (x) = 2Ax + B
y (x)
00
= 2A
Therefore
y 00 + 5y 0 − 8y = 2A + 5(2Ax + B) − 8(Ax2 + Bx + C).
Thus, if we are to have a solution of the o.d.e., we will need
2A + 5(2Ax + B) − 8(Ax2 + Bx + C) = 3x2
Simplifying, that is
−8Ax2 + (10A − 8B)x + (2A + 5B − 8C) = 3x2 .
Matching the coefficients of like terms on the left and the right, we have
−8A = 3
10A − 8B = 0
2A + 5B − 8C = 0.
The solution of this system is $A = -\frac{3}{8}$, $B = -\frac{15}{32}$, $C = -\frac{99}{256}$. Hence the solution of the o.d.e. is
$y(x) = -\frac{3}{8}x^2 - \frac{15}{32}x - \frac{99}{256}$.
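REMARK: The linear system for A, B, and C can also be handed to Octave. A minimal sketch:

M = [-8 0 0; 10 -8 0; 2 5 -8];     % coefficients of A, B, C from matching like terms
rhs = [3; 0; 0];
M\rhs                               % approximately -0.37500, -0.46875, -0.38672, i.e. -3/8, -15/32, -99/256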
18h: A number of differential equations solution techniques require you to have some idea what the solution will
be before you know exactly what it is. You then take this “rough guess” and refine it by forcing it to solve the
given differential equation. The method of undetermined coefficients is an example of such a technique. We
know the solution will be a linear combination of certain functions, but we don’t know the right coefficients to
use. To find the coefficients, we plug the solution with unknown (undetermined) coefficients into the differential
equation and match the coefficients of like terms. This process leaves us with a linear system of equations to
solve for the unknowns. In this particular example, we are given that θ(t) = At cos t+Bt sin t+C cos t+D sin t
is a solution of θ̈ + (1/10)θ̇ + θ = t cos t, and it is our job to figure out the values of A, B, C, and D. We will find
θ̇ and θ̈ and substitute them into the o.d.e.:

θ̇(t) = (D + A) cos(t) + (B − C) sin(t) + Bt cos(t) − At sin(t)
θ̈(t) = (2B − C) cos(t) + (−D − 2A) sin(t) − At cos(t) − Bt sin(t)
Therefore
θ̈ + (1/10)θ̇ + θ = (2B − C) cos(t) + (−D − 2A) sin(t) − At cos(t) − Bt sin(t)
                 + (1/10)[(D + A) cos(t) + (B − C) sin(t) + Bt cos(t) − At sin(t)]
                 + At cos t + Bt sin t + C cos t + D sin t
Simplifying, that is
θ̈ + (1/10)θ̇ + θ = ((1/10)D + 2B + (1/10)A) cos(t) + (B − C − 2A) sin(t) + (1/10)Bt cos(t) − (1/10)At sin(t)
Thus, if we are to have a solution of the o.d.e., we will need


((1/10)D + 2B + (1/10)A) cos(t) + (B − C − 2A) sin(t) + (1/10)Bt cos(t) − (1/10)At sin(t) = t cos t
Matching the coefficients of like terms on the left and the right, we have
(1/10)D + 2B + (1/10)A = 0
B − C − 2A = 0
(1/10)B = 1
−(1/10)A = 0
The solution of this system is A = 0, B = 10, C = 10, D = −200. Hence the solution of the o.d.e. is
θ(t) = 10t sin t − 200 sin t + 10 cos t.

Section 6.3
1a: Each o.d.e. solver has the form

yi+1 = yi + h(weighted average of evaluations of f ).

It is the integration formula that gives us the weighted average. In this case, the formula

(h/4)[f(x0) + 3f(x0 + (2/3)h)]

tells us to average f(x0), the value of f at the first node, with f(x0 + (2/3)h) in a 1 : 3 ratio. That is, we sum one
f(x0) with three f(x0 + (2/3)h) and divide by 4. Unfortunately, we are using f here in two different settings. The
f in an o.d.e. solver is not the same f used in deriving the integration formulas. The f from the integration
formulas is a function of one variable, x. The f we need in an o.d.e. solver is a function of two variables, t
and y. Nevertheless, they play the same role. They each hold the values of the function we are integrating. If
we need to sum one f(x0) with three f(x0 + (2/3)h) in the integration formula, then we need to sum one f(ti, yi)
with three f(ti+2/3, yi+2/3) in the o.d.e. solver. Generally, f(x0 + αh) in an integration formula translates to
f(ti+α, yi+α) in the o.d.e. solver as long as the integration formula is written for an interval of length h.
Each o.d.e. solver begins with k1 = f(ti, yi) where (ti, yi) is the last point approximated. Each successive
value in the o.d.e. solver is obtained by using Euler’s method with initial condition (starting point) (ti, yi).
For this particular integration formula, there is only one node other than x0, so we will need only one more
stage. We approximate yi+2/3 by yi + (2h/3)k1 (Euler’s method using starting point (ti, yi) and approximate
slope k1). This makes k2 = f(ti + 2h/3, yi + (2h/3)k1). The final step is to compute the weighted average. As
discussed, we need to sum one k1 with three k2 and divide by 4. In summary, the o.d.e. solver suggested by
this integration formula is

k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2].

1e: Each o.d.e. solver has the form

yi+1 = yi + h(weighted average of evaluations of f ).



It is the integration formula that gives us the weighted average. In this case, the formula

(h/4)[3f(x0 + (1/3)h) + f(x0 + h)]

tells us to average f(x0 + (1/3)h), the value of f at the first node, with f(x0 + h) in a 3 : 1 ratio. That is, we
sum three f(x0 + (1/3)h) with one f(x0 + h) and divide by 4. Unfortunately, we are using f here in two different
settings. The f in an o.d.e. solver is not the same f used in deriving the integration formulas. The f from
the integration formulas is a function of one variable, x. The f we need in an o.d.e. solver is a function of
two variables, t and y. Nevertheless, they play the same role. They each hold the values of the function we
are integrating. If we need to sum three f(x0 + (1/3)h) with one f(x0 + h) in the integration formula, then we
need to sum three f(ti+1/3, yi+1/3) with one f(ti+1, yi+1) in the o.d.e. solver. Generally, f(x0 + αh) in an
integration formula translates to f(ti+α, yi+α) in the o.d.e. solver as long as the integration formula is written
for an interval of length h.
Each o.d.e. solver begins with k1 = f(ti, yi) where (ti, yi) is the last point approximated. Each successive
value in the o.d.e. solver is obtained by using Euler’s method with initial condition (starting point) (ti, yi).
For this particular integration formula, there are two nodes other than x0, so we will need two more stages.
We approximate yi+1/3 by yi + (h/3)k1 (Euler’s method using starting point (ti, yi) and approximate slope k1).
This makes k2 = f(ti + h/3, yi + (h/3)k1). We then approximate yi+1 by yi + hk2 (Euler’s method using starting
point (ti, yi) and approximate slope k2). The final step is to compute the weighted average. As discussed, we
need to sum three k2 with one k3 and divide by 4. In summary, the o.d.e. solver suggested by this integration
formula is

k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h, yi + hk2)
yi+1 = yi + (h/4)[3k2 + k3].

2a: We will modify the test code from the text in two essential ways.

1. It will be adapted for the o.d.e. solver

k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2]

2. An extra loop will be added so it approximates y(2) for a number of step sizes.

These modifications will make it a simple matter to determine the rate of convergence.

t0=4;
h=-1/4;
n=8;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
y0=20;
disp(’ h y Error’)
disp(’ ------------------------------------’)
for j=1:6
t=t0;
y=y0;
for i=1:n
k1=f(t,y);
k2=f(t+2*h/3,y+2*h/3*k1);
y=y+h/4*(k1+3*k2);
t=t+h;
end%for
x=exact(t);
sprintf(’%12.5g%12.5g%12.5g’,h,y,abs(y-x))
n=n*2;
h=h/2;
end%for

The output from this code is

h y Error
------------------------------------
ans = -0.25 9.9391 0.060922
ans = -0.125 9.9846 0.015433
ans = -0.0625 9.9961 0.0038827
ans = -0.03125 9.999 0.00097364
ans = -0.015625 9.9998 0.00024378
ans = -0.0078125 9.9999 6.099e-05
The ratio of the step size on one line to the next is 1/2, and the ratio of consecutive errors is about 1/4 = (1/2)²,
so it appears the o.d.e. solver has rate of convergence O(h²). The integration method has rate of convergence
O(h⁴) so we would expect the o.d.e. solver to be O(h³). Our experiment does not show the expected rate of
convergence.
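
One way to read the rate of convergence off such a table is to divide each error by the previous one. The
following snippet, with the errors above copied into a vector (an illustrative sketch only), prints ratios near
1/4 = (1/2)², confirming the O(h²) behavior.

err = [0.060922 0.015433 0.0038827 0.00097364 0.00024378 6.099e-05];
err(2:end)./err(1:end-1)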

2e: An extra loop will be added so it approximates y(2) for a number of step sizes.
These modifications will make it a simple matter to determine the rate of convergence.

t0=4;
h=-1/4;
n=8;
f=inline("-y/t+t^2");
exact=inline("t^3/4+16/t");
y0=20;
disp(’ h y Error’)
disp(’ ------------------------------------’)
for j=1:6
t=t0;
y=y0;
for i=1:n
k1=f(t,y);
k2=f(t+h/3,y+h/3*k1);
k3=f(t+h,y+h*k2);
y=y+h/4*(3*k2+k3);
t=t+h;
end%for
x=exact(t);
sprintf(’%12.5g%12.5g%12.5g’,h,y,abs(y-x))
n=n*2;
h=h/2;
end%for

The output from this code is

h y Error
------------------------------------
ans = -0.25 9.9697 0.03027
ans = -0.125 9.9923 0.0076889
ans = -0.0625 9.9981 0.0019376
ans = -0.03125 9.9995 0.00048634
ans = -0.015625 9.9999 0.00012183
ans = -0.0078125 10 3.0487e-05
The ratio of the step size on one line to the next is 1/2, and the ratio of consecutive errors is about 1/4 = (1/2)²,
so it appears the o.d.e. solver has rate of convergence O(h²). The integration method has rate of convergence
O(h⁴) so we would expect the o.d.e. solver to be O(h³). Our experiment does not show the expected rate of
convergence.

8a: The Octave function we wrote to implement Euler’s method takes 5 arguments. As explained in the comment
preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerode(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> eulerode(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05000000000000 1.10250000000000 1.15725000000000
Columns 5 through 8:
1.21402500000000 1.27262250000000 1.33286025000000 1.39457422500000
Columns 9 through 12:
1.45761680250000 1.52185512225000 1.58716961002500 1.65345264902250
Columns 13 through 16:
1.72060738412025 1.78854664570823 1.85719198113740 1.92647278302366
Columns 17 through 20:
1.99632550472130 2.06669295424917 2.13752365882425 2.20877129294183
Column 21:
2.28039416364764

The value in Column 21 is the desired result, so y(2) ≈ 2.28039416364764. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.20877129294183. Use [y,x]=eulerode(f,1,1,2,20)
to see all the corresponding x-coordinates.
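
The body of eulerode is not reprinted in this answer. A minimal sketch consistent with the comment block and
the call above (assuming the standard Euler update and the same output convention as the text’s other solvers)
would be

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y'=f(x,y) %
function [y,x] = eulerode(f,a,ya,b,n)
  x(1) = a;
  y(1) = ya;
  h = (b-a)/n;
  for i = 1:n
    y(i+1) = y(i) + h*f(x(i),y(i)); % Euler step
    x(i+1) = a + (b-a)*i/n;
  end%for
end%function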

8d: The Octave function we wrote to implement Euler’s method takes 5 arguments. As explained in the comment
preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerode(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> eulerode(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.063348127711403 -0.133556806761731
Columns 4 through 6:
-0.210091730766547 -0.292335849279218 -0.379594108676440
Columns 7 through 9:
-0.471098428249811 -0.566012332190405 -0.663433947473280
Columns 10 through 12:
-0.762393924730387 -0.861836463006993 -0.960521838453174
Columns 13 through 15:
-1.055901027787366 -1.150767311038156 -1.243138035592362
Columns 16 through 18:
-1.331810188637979 -1.415726818259857 -1.493905125626401
Columns 19 through 21:
-1.565422860316011 -1.629418404020635 -1.685095172485204

The value in Column 21 is the desired result, so y(2) ≈ −1.685095172485. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.629418404020. Use [y,x]=eulerode(f,1,0,2,20)
to see all the corresponding x-coordinates.

9a: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for trapezoidal-ode has been written and looks like

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = trapode(f,a,ya,b,n)

The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:

>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> trapode(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002583 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552

The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=trapode(f,1,1,2,20)
to see all the corresponding x-coordinates.
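
A minimal sketch of what trapode might contain, assuming the same interface and the improved Euler (trapezoidal)
update of equation 6.3.3, is

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y'=f(x,y) %
function [y,x] = trapode(f,a,ya,b,n)
  x(1) = a;
  y(1) = ya;
  h = (b-a)/n;
  for i = 1:n
    k1 = f(x(i),y(i));
    k2 = f(x(i)+h,y(i)+h*k1);
    y(i+1) = y(i) + h/2*(k1+k2); % trapezoidal (improved Euler) step
    x(i+1) = a + (b-a)*i/n;
  end%for
end%function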

9d: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for trapezoidal-ode has been written and looks like

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = trapode(f,a,ya,b,n)

The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:

>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> trapode(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066778403380866 -0.139846898631295
Columns 4 through 6:
-0.218610595984683 -0.302399307505556 -0.390473688925680
Columns 7 through 9:
-0.482031924143591 -0.576216643912361 -0.672121275727591
Columns 10 through 12:
-0.768792826665983 -0.865212265743696 -0.959857757799220
Columns 13 through 15:
-1.056576584732967 -1.151350240932434 -1.242238115924874
Columns 16 through 18:
-1.328187356783625 -1.408239476567505 -1.481492346014993
Columns 19 through 21:
-1.547099820528092 -1.604277373646634 -1.652308958787397

The value in Column 21 is the desired result, so y(2) ≈ −1.652308958787. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.604277373646. Use [y,x]=trapode(f,1,0,2,20)
to see all the corresponding x-coordinates.
10a: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for clopen-ode has been written and looks like

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = clopen(f,a,ya,b,n)

The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:

>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> clopen(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05120833333333 1.10468084027778 1.16020204697801
Columns 5 through 8:
1.21757698550727 1.27662924238649 1.33719919281938 1.39914240296940
Columns 9 through 12:
1.46232818428681 1.52663828541552 1.59196570858681 1.65821363865296
Columns 13 through 16:
1.72529447404116 1.79312894992824 1.86164534486007 1.93077876287422
Columns 17 through 20:
2.00047048394069 2.07066737621900 2.14132136424883 2.21238894775115
Column 21:
2.28383076622349

The value in Column 21 is the desired result, so y(2) ≈ 2.28383076622349. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21238894775115. Use [y,x]=clopen(f,1,1,2,20)
to see all the corresponding x-coordinates.

10d: The Octave functions we wrote to implement other methods take 5 arguments. Here, we imagine a similar
function for clopen-ode has been written and looks like

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = clopen(f,a,ya,b,n)

The arguments are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-
coordinate of the initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the
desired solution, and (n) the number of steps that should be taken. From the Octave command line, the
solution can be found this way:

>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> clopen(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066674788135152 -0.139650010793905
Columns 4 through 6:
-0.218333343735571 -0.302057681694326 -0.390087513340042
Columns 7 through 9:
-0.481626032825074 -0.575822930559361 -0.671782830658695
Columns 10 through 12:
-0.768574489070735 -0.865241984556076 -0.960839121780159
Columns 13 through 15:
-1.051332254162207 -1.136768664871208 -1.218181121459446
Columns 16 through 18:
-1.294632701999881 -1.365219285669536 -1.429077386836689
Columns 19 through 21:
-1.485393339498179 -1.533411938658838 -1.572444496803329

The value in Column 21 is the desired result, so y(2) ≈ −1.572444496803329. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.533411938658838. Use [y,x]=clopen(f,1,0,2,20)
to see all the corresponding x-coordinates.

11a: The Octave function we wrote to implement the midpoint method takes 5 arguments. As explained in the
comment preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = midpoint(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> midpoint(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002582 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552

The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=midpoint(f,1,1,2,20)
to see all the corresponding x-coordinates.
11d: The Octave function we wrote to implement the midpoint method takes 5 arguments. As explained in the
comment preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = midpoint(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> midpoint(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066766774094073 -0.139831999606821
Columns 4 through 6:
-0.218600428030388 -0.302401486830318 -0.390495486389841
Columns 7 through 9:
-0.482080439082276 -0.576299298636036 -0.672247230148908
Columns 10 through 12:
-0.768977840728485 -0.865503930033315 -0.960754716787988
Columns 13 through 15:
-1.057757600117324 -1.154510687305015 -1.247336119828964
Columns 16 through 18:
-1.335197000042218 -1.417135309027307 -1.492245593754752
Columns 19 through 21:
-1.559677244661507 -1.618640905170988 -1.668415622421331

The value in Column 21 is the desired result, so y(2) ≈ −1.668415622421. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.618640905170. Use [y,x]=midpoint(f,1,0,2,20)
to see all the corresponding x-coordinates.

12a: The Octave function we wrote to implement Ralston’s method takes 5 arguments. As explained in the
comment preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = ralston(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’3*x-2*y’)
f = f(x, y) = 3*x-2*y
>> ralston(f,1,1,2,20)
ans =
Columns 1 through 4:
1.00000000000000 1.05125000000000 1.10475625000000 1.16030440625000
Columns 5 through 8:
1.21770048765625 1.27676894132891 1.33735089190266 1.39930255717191
Columns 9 through 12:
1.46249381424058 1.52680690188772 1.59213524620839 1.65838239781859
Columns 13 through 16:
1.72546107002583 1.79329226837337 1.86180450287790 1.93093307510450
Columns 17 through 20:
2.00061943296957 2.07081058683746 2.14145858108790 2.21252001588455
Column 21:
2.28395561437552

The value in Column 21 is the desired result, so y(2) ≈ 2.28395561437552. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ 2.21252001588455. Use [y,x]=ralston(f,1,1,2,20)
to see all the corresponding x-coordinates.

12d: The Octave function we wrote to implement Ralston’s method takes 5 arguments. As explained in the
comment preceding the function declaration,

% INPUT: function f(x,y); interval [a,b]; y(a); steps n %


% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = ralston(f,a,ya,b,n)

they are, in order, (f) the function f (x, y) appearing on the right side of the o.d.e., (a) the x-coordinate of the
initial condition, (ya) the y-coordinate of the initial condition, (b) the x-coordinate of the desired solution,
and (n) the number of steps that should be taken. From the Octave command line, the solution can be found
this way:

>> format(’long’)
>> f=inline(’(2*cos(x)^3-1-y*sin(x))/cos(x)’)
f = f(x, y) = (2*cos(x)^3-1-y*sin(x))/cos(x)
>> ralston(f,1,0,2,20)
ans =
Columns 1 through 3:
0.000000000000000 -0.066770300283373 -0.139836235303672
Columns 4 through 6:
-0.218602682516778 -0.302399209595394 -0.390486264578185
Columns 7 through 9:
-0.482061961403970 -0.576269242338981 -0.672202937226713
Columns 10 through 12:
-0.768915255605024 -0.865412688887274 -0.960565629810385
Columns 13 through 15:
-1.056164950925061 -1.150218643526626 -1.240368616917767
Columns 16 through 18:
-1.325575733886901 -1.404886704290576 -1.477402196316258
Columns 19 through 21:
-1.542278061151791 -1.598731451393269 -1.646047861531770

The value in Column 21 is the desired result, so y(2) ≈ −1.6460478615317. The rest of the output gives approxima-
tions for the solution at other points. For example, y(1.95) ≈ −1.598731451393. Use [y,x]=ralston(f,1,0,2,20)
to see all the corresponding x-coordinates.

Section 6.4
1a: The o.d.e. solver previously derived is

k1 = f(ti, yi)
k2 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/4)[k1 + 3k2],

making β2 = 2/3, α1 = 1/4, and α2 = 3/4. Plugging these values (plus β3 = α3 = 0) into equations 6.4.4,

1/4 + 3/4 + 0 = 1
(3/4)(2/3) + 0 · 0 = 1/2
(3/4)(2/3)² + 0 · 0² = 1/3
0 · (2/3) · 0 ≠ 1/6.

Since the only unsatisfied equation was derived from h3 terms, we conclude that this method has local
truncation error O(h3 ). The integration formula from which it was derived has local truncation error O(h4 ),
so it is not quite as accurate as an o.d.e. solver. However, local truncation error O(h3 ) is consistent with the
experimentally determined O(h2 ) rate of convergence. In fact, it is this local truncation error that leads to
the O(h2 ) rate of convergence.

1e: The o.d.e. solver previously derived is

k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h, yi + hk2)
yi+1 = yi + (h/4)[3k2 + k3],

making β2 = 1/3, β3 = 1, α1 = 0, α2 = 3/4, and α3 = 1/4. Plugging these values into equations 6.4.4,

0 + 3/4 + 1/4 = 1
(3/4)(1/3) + (1/4)(1) = 1/2
(3/4)(1/3)² + (1/4)(1)² = 1/3
(1/4)(1/3)(1) ≠ 1/6.

Since the only unsatisfied equation was derived from h3 terms, we conclude that this method has local
truncation error O(h3 ). The integration formula from which it was derived has local truncation error O(h4 ),
so it is not quite as accurate as an o.d.e. solver. However, local truncation error O(h3 ) is consistent with the
experimentally determined O(h2 ) rate of convergence. In fact, it is this local truncation error that leads to
the O(h2 ) rate of convergence.

2: From the initial value problem, f(t, y) = ty and y(1) = 1/2. For the o.d.e. solver, this means t0 = 1 and y0 = 1/2.
To compute y(2) in one step, h = 1 and

k1 = f(t0, y0) = 1 · (1/2) = 1/2
k2 = f(t0 + h/2, y0 + (h/2)k1) = (1 + 1/2)(1/2 + (1/2)(1)(1/2)) = 9/8
k3 = f(t0 + h/2, y0 + (h/2)k2) = (1 + 1/2)(1/2 + (1/2)(1)(9/8)) = 51/32
k4 = f(t0 + h, y0 + hk3) = (1 + 1)(1/2 + 1 · (51/32)) = 67/16
y1 = y0 + (h/6)(k1 + 2k2 + 2k3 + k4)
   = 1/2 + (1/6)(1/2 + 2 · (9/8) + 2 · (51/32) + 67/16)
   = 35/16 = 2.1875
t1 = t0 + h = 1 + 1 = 2

Thus y(2) ≈ 2.1875. Euler’s method with two steps yielded y(2) ≈ 1.3125. Since the exact solution is
y(2) = e^(3/2)/2 ≈ 2.240844535169032, RK4 did a much better job in one step than did Euler’s method in two
steps. Incidentally, even four steps of Euler’s method (which means 4 function evaluations—just as many as
one step of RK4), yields y(2) ≈ 1.621398925781250.
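
The single RK4 step is easy to check numerically (a sketch, assuming the classical RK4 stages used above):

f = @(t,y) t*y;
t0 = 1; y0 = 1/2; h = 1;
k1 = f(t0,y0);
k2 = f(t0+h/2, y0+h/2*k1);
k3 = f(t0+h/2, y0+h/2*k2);
k4 = f(t0+h,   y0+h*k3);
y1 = y0 + h/6*(k1+2*k2+2*k3+k4) % 2.1875
abs(exp(3/2)/2 - y1)            % error of the one-step RK4 approximation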

Section 6.5
4: The blanks in the table are to be read as zeros, so β11 = β12 = 0, for example. The only non-zero value for the
βij is β21 = 1. The values in the left column are the δi , so δ2 = 1. The values in the bottom row are the αi ,
so α1 = α2 = 1/2. In summary,

δ2 = 1, β21 = 1, α1 = α2 = 1/2.
Because the tableau has two rows above the row of αi , it is a two-stage method. Therefore, the method takes
the form

k1 = f (ti , yi )
k2 = f (ti + δ2 h, yi + β21 hk1 )
yi+1 = yi + h[α1 k1 + α2 k2 ].

See equation 6.5.2. Plugging in the parameter values, this tableau represents the method

k1 = f(ti, yi)
k2 = f(ti + h, yi + hk1)
yi+1 = yi + h[(1/2)k1 + (1/2)k2].

This last equation simplifies to yi+1 = yi + (h/2)[k1 + k2]. These equations are exactly those in equation 6.3.3,
trapezoidal-ode, or the improved Euler method.

6b: First, decoding the table into the form 6.5.2, we see this is a 4-stage method with formula

k1 = f(ti, yi)
k2 = f(ti + (2/7)h, yi + (2/7)hk1)
k3 = f(ti + (4/7)h, yi − (8/35)hk1 + (4/5)hk2)
k4 = f(ti + (6/7)h, yi + (29/42)hk1 − (2/3)hk2 + (5/6)hk3)
yi+1 = yi + h[(1/6)k1 + (1/6)k2 + (5/12)k3 + (1/4)k4].

Code similar to the samples in sections 6.3 and 6.4 might look like thirdOrder.m, which may be downloaded
at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements a 3rd order Runge-Kutta %
% method where the step size is calculated and held %
% constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = thirdOrder(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+2*h/7, y(i)+2*h/7*k1);
k3 = f(x(i)+4*h/7, y(i)+h/35*(-8*k1+28*k2));
k4 = f(x(i)+6*h/7, y(i)+h/42*(29*k1-28*k2+35*k3));
y(i+1) = y(i) + h/12*(2*k1+2*k2+5*k3+3*k4);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function

Applying this code to the test o.d.e. used in section 6.3,


ẏ = −y/t + t²
y(4) = 20,

to approximate y(2), which we know has exact value 10, with various step sizes yields

>> format(’long’)
>> f=inline(’-y/t+t^2’)
f = f(t, y) = -y/t+t^2
>> [y,x]=thirdOrder(f,4,20,2,5);
>> abs(10-y(length(y)))
ans = 4.14600417808941e-04
>> [y,x]=thirdOrder(f,4,20,2,10);
>> abs(10-y(length(y)))
ans = 5.20403883292886e-05
>> [y,x]=thirdOrder(f,4,20,2,20);
>> abs(10-y(length(y)))
ans = 6.48395888624975e-06
>> [y,x]=thirdOrder(f,4,20,2,40);
>> abs(10-y(length(y)))
ans = 8.08029787080500e-07

Since the number of steps is doubling from one call of thirdOrder to the next, the step size is halving. As
the step size is halved, the error is decreasing by a factor of 8, or by (1/2)³, lending numerical evidence that
the rate of convergence is O(h³).

10: First, decoding the table into the form 6.5.2, we see the embedded methods have 5 and 4 stages with formulas

k1 = f(ti, yi)
k2 = f(ti + (1/4)h, yi + (1/4)hk1)
k3 = f(ti + (3/4)h, yi − (9/4)hk1 + 3hk2)
k4 = f(ti + (1/2)h, yi + (1/18)hk1 + (5/12)hk2 + (1/36)hk3)
k5 = f(ti + h, yi + (7/9)hk1 − (5/3)hk2 − (1/9)hk3 + 2hk4)
{first method} yi+1 = yi + h[(1/6)k1 + (2/3)k4 + (1/6)k5]
{second method} yi+1 = yi + h[(7/9)k1 − (5/3)k2 − (1/9)k3 + 2k4].

The difference of the two methods will be used as an error estimate:


error ≈ h[(1/6)k1 + (2/3)k4 + (1/6)k5] − h[(7/9)k1 − (5/3)k2 − (1/9)k3 + 2k4]
      = (h/18)[−11k1 + 30k2 + 2k3 − 24k4 + 3k5].
Since we are told this is an RK3(4) method, it has rate of convergence (order) 3 and therefore has local
truncation error O(h4 ). This means the error will scale with the fourth power of h. This is important when
adjusting the step size. We will need to use a fourth root, not a third root as in RK2(3). Besides this change
and the formula changes, the code of RK2(3) can be shared. rk34butcher.m may be downloaded at the
companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements an adaptive rk3(4) method of %
% Butcher where the step size is controlled by the routine. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,t] = rk34butcher(f,a,ya,b,h,eps,N)


i = 1;
t(i) = a;
y(i) = ya;
done = 0;
while (!done && i<=N)
if ((b-t(i)-h)*(b-a)<=0)
h=b-t(i);
done = 1;
endif
k1 = f(t(i), y(i));
k2 = f(t(i)+h/4, y(i)+h/4*k1);
k3 = f(t(i)+3*h/4, y(i)+h/4*(-9*k1+12*k2));
k4 = f(t(i)+h/2, y(i)+h/36*(2*k1+15*k2+k3));
k5 = f(t(i)+h, y(i)+h/9*(7*k1-15*k2-k3+18*k4));
err = abs(h/18*(-11*k1+30*k2+2*k3-24*k4+3*k5));
if (done || err<=eps)
y(i+1) = y(i) + h/6*(k1+4*k4+k5);
t(i+1) = t(i) + h;
if (t(i+1) == t(i))
disp("Procedure failed. Step size reached zero.")
return
endif
i = i+1;
endif
q = 0.9*realpow(eps/err,1/4);
q = max(q,0.1);
q = min(5.0,q);
h = q*h;
end%while
if (!done)
disp("Procedure failed. Maximum number of iterations reached.")
endif
end%function

12b: The method of exercise 6c shares the first three stages with this method. All we need to do is append the
line of αi values from that table to this one, noting that we need to add a zero at the end:

0
1/2    1/2
3/4    0      3/4
1      2/9    1/3    4/9
       7/24   1/4    1/3    1/8
       2/9    1/3    4/9    0

15a: There are two difficulties with this problem. The more straightforward of the two is knowing what the error
of the approximation really is. This o.d.e. is not solvable exactly, so we can’t compute the exact solution. We
can certainly run the method with a tolerance of 10−4 , but this is only a local truncation error. It does not
necessarily translate into any estimate of the global error (the total accumulated error at the last step). Often
times, they will be similar in magnitude, but there is far from any guarantee of it. In any case, here are the
results of running the method with initial step size 1/10 and tolerance 10−4 :

>> f=inline(’(x+2*exp(y)*cos(exp(x)))/(1+exp(y))’)
f = f(x, y) = (x+2*exp(y)*cos(exp(x)))/(1+exp(y))
>> [y,x]=rk23(f,0,2,4,1/10,1e-4,100000);
>> y(length(y))
ans = 2.37564101044550
Figure 6.5.1: log-log plot of tolerance versus global error (RK2(3); tolerance on the horizontal axis from 1e-5 to 1e-3, global error on the vertical axis from 1e-7 to 1e-3)

>> length(y)
ans = 152

suggesting that y(4) ≈ 2.37564. Though we should have some confidence that this is a reasonable estimate
(say with error no more than 10−2 ), we should certainly not claim that the error is less than, or really all that
close to 10−4 . The algorithm took 152 steps to arrive at the result, so the error had a chance to accumulate.
If it is extremely important to know that the estimate is accurate to the nearest 10−4 or better, it could be
compared to a second run with a smaller tolerance:

>> [y,x]=rk23(f,0,2,4,1/10,1e-5,100000);
>> y(length(y))
ans = 2.37616344347848

The difference between the estimates is about 5.22(10)−4 . This would suggest that the error in the first
estimate is likely a bit more than 10−4 . But even this evidence is far from iron-clad. The second difficulty
is that small adjustments in the tolerance can lead to large changes in the global error. Global error as a
function of tolerance is very rough and discontinuous (see Figure 6.5.1). The oscillatory nature of the solution
exacerbates this problem with adaptive Runge-Kutta methods. If the global error scaled perfectly with the
truncation error, Figure 6.5.1 would show a perfectly straight line parallel to the line y = x, shown in red.
This figure shows that most tolerances between 10−5 and 10−3 would suffice to give a global error of 10−4 or
less, though there are some exceptions, most notably one right around 10−4 . Figure 6.5.2 shows the solution
over the interval [0, 4], illustrating its oscillations. Generally speaking, comparing multiple approximations
using different tolerances is not how global error is controlled. Global error can be reasonably well controlled
by scaling the tolerance relative to the step size as the solution progresses or using relative errors instead of
absolute. Either way, this concern adds another layer of complexity to the method.

16a: There are two difficulties with this problem. The more straightforward of the two is knowing what the error
of the approximation really is. This o.d.e. is not solvable exactly, so we can’t compute the exact solution. We
can certainly run the method with a tolerance of 10−4 , but this is only a local truncation error. It does not
necessarily translate into any estimate of the global error (the total accumulated error at the last step). Often
times, they will be similar in magnitude, but there is far from any guarantee of it.

Figure 6.5.2: Solution of equation 6.5.4 (y versus x over the interval [0, 4])

In any case, here are the results of running the method with initial step size 1/10 and tolerance 10−4 :

>> f=inline(’(x^2+y)/(x-y^2)’)
f = f(x, y) = (x^2+y)/(x-y^2)
>> [y,x]=rk23(f,0,5,3,1/10,1e-4,100000);
>> y(length(y))
ans = 3.66765768487404
>> length(y)
ans = 17

suggesting that y(3) ≈ 3.66765. Though we should have some confidence that this is a reasonable estimate
(say with error no more than 10−2 ), we should certainly not claim that the error is less than, or really all
that close to 10−4 . The algorithm took 17 steps to arrive at the result, so the error had a small chance to
accumulate. If it is extremely important to know that the estimate is accurate to the nearest 10−4 or better,
it could be compared to a second run with a smaller tolerance:

>> [y,x]=rk23(f,0,5,3,1/10,1e-5,100000);
>> y(length(y))
ans = 3.66757804370410

The difference between the estimates is about 7.96(10)−5 . This would suggest that the error in the first
estimate is likely right around 10−4 . But even this evidence is far from iron-clad. The second difficulty
is that small adjustments in the tolerance can lead to large changes in the global error. Global error as a
function of tolerance is rough and discontinuous (see Figure 6.5.3). If the global error scaled perfectly with
the truncation error, Figure 6.5.3 would show a perfectly straight line parallel to the line y = x, shown in red.
This figure shows that most tolerances between 10−5 and 10−3 would suffice to give a global error of 10−4 or
less, though there may be some exceptions not plotted. Figure 6.5.4 shows the solution over the interval [0, 3].
Generally speaking, comparing multiple approximations using different tolerances is not how global error is
controlled. Global error can be reasonably well controlled by scaling the tolerance relative to the step size
as the solution progresses or using relative errors instead of absolute. Either way, this concern adds another
layer of complexity to the method.
Figure 6.5.3: log-log plot of tolerance versus global error (RK2(3); tolerance on the horizontal axis from 1e-5 to 1e-3, global error on the vertical axis from 1e-6 to 1e-3)

Figure 6.5.4: Solution of equation 6.5.5 (y versus x over the interval [0, 3])
Answers to Selected Exercises

Section 1.1
10e: 0.83333

14a: .2353263818643 and .2343263818643

15a: .2349438537911 and .2347090273506


16: (p, p̃) ∈ {(1/3, 97/300), (−1/3, −97/300), (1/3, 103/300), (−1/3, −103/300)}

21: p = ±1 and p̃ is anything; or p = p̃ ≠ 0.

24a: (i) 8.99999974990351 (ii) 2.5009649(10)−7 (iii) 2.7788499(10)−8 (iv) (10)−14 (v) 2.5009647(10)−7

Section 1.2
1f: T3(x) = x². R3(x) = [ξ sin(ξ) − 4 cos(ξ)]x⁴/24.

9d: 10.760
 
12a: ξ(π) = cos⁻¹((12π² − 48)/π⁴) ≈ 0.7625.

Section 1.3
1d: α = 1
6f: O(1/n)

6h: O(1/√n)

6n: O(1/n)

19e: 4 iterations

Section 1.4
7: (a) 1 more than 4 times the number required for the 2^(n−1) × 2^(n−1) grid. (b) 0 (c) 0

14: (a) S(n − 1, k − 1) (b) k · S(n − 1, k)


Section 2.1
4c: In 27 iterations, we get 0.666666664928, which is within 10−8 of an actual root.

4f: In 27 iterations, we get 21.9911485687, which is within 10−8 of an actual root.

5: (a) 0.625 (b) 1.09375

10: 37π/2

16: 33

23: One possible collatz.m file is

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by on %
% Purpose: implementation of the collatz function %
% INPUT: integer n %
% OUTPUT: n/2 or 3n+1 depending on whether n is %
% even or odd %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res=collatz(n)
if (ceil(n/2)==n/2)
res=n/2
else
res=3*n+1
end%if
end%function


25: (a) 20π

Section 2.2
2d: (i) The hypotheses of the MVT are met. (ii) c ≈ −2.540793513382845.

2g: (i) The hypotheses of the MVT are not met.

2h: (i) The hypotheses of the MVT are met. (ii) c ≈ 17.41987374102208.

3c: −2 and 5

3d: −1 and −1/3
4c: f1(x) = ((4 − 3x²)/2)^(1/5) and f2(x) = √((4 − 3x⁵)/6). There are many others.

4f: f1(x) = [(x² − 5x + 1)(log₂ 3) − x² − 1]/5 and f2(x) = √((log₂ 3)(x² − 5x + 1) − 5x − 1). There are many others.

5c: 1.326008542399018, 1.598751095046933, 1.737721941251104, 1.779703798972744, 1.788512049183622; the
sequence seems to be converging.

6c: 1.79047660196506

7c: The web diagram over [.8, 2] is:



18: (a) 15 (b) The equations g(x) = x and f (x) = x are equivalent.
23: −1/4

Section 2.3
10: (a) 15. HINT: It is valid to bound the derivative over the interval [1.618033988749895, 2.5] instead of the entire
interval [.5, 3.5]. Why? On the other hand, if you do consider the whole interval [.5, 3.5], you get a bound of
43. (b) It actually takes 15 iterations.
13: a1 ≈ 1.942415717 and a2 ≈ 1.623271404

14: 2.732050809. HINT: use f(x) = (2x³ + 4x² − 4x − 4)^(1/4). Why?
15: a0 = 3, a1 = 3/2, and a2 = 4/3

18: No. Aitken’s delta-squared method is designed to speed up linearly convergent sequences, not superlinearly
convergent sequences.
21: a1 ≈ 2.152904629 and a2 ≈ 1.873464044
23: 3/2 or 0
24: x̂ ≈ 5.259185715

Section 2.4
4c: Using x0 = 2 and x1 = 3, we find x8 = 1.47883214766643.
4d: Using x0 = 3 and x1 = 4, we find x10 = 0.948434069243393.
5c: Using x0 = 2.5, we find x6 = 1.47883214733021.
5d: Using x0 = 3.5, we find x7 = 0.948434069919634.
6c: Using x0 = 2.5, we find x18 = 0.948434068437721.
6d: Using x0 = 3.5, we find x15 = 0.948434069313413.
7c: Using x0 = 2 and x1 = 3, we find x10 = 1.47883214733021. The difference between x10 and x8 is about
3.3(10)−10 , so x8 was indeed accurate to within 10−5 .
7d: Using x0 = 3 and x1 = 4, we find x12 = 0.948434069919636. The difference between x12 and x10 is about
6.7(10)−10 , so x10 was indeed accurate to within 10−5 .
9b: x14 = 0.580055888962675.
15b: x14 = 0.580055888962675. This is different from 0. Why?

16: x10 = 3.739599358563032.


20: x16 = 3.7201766622615984(10)−4 , x17 = 2.4933434933779863(10)−4 , and x18 = 1.6752024023472534(10)−4 .
a16 = 3.7404947721983783(10)−6 so Aitken’s delta-squared method DOES speed up convergence.
23: (a) π (b) Newton’s method will fail because g′(0) = 0. (c) 6 (d) Something near −7.5 will do.
25c: In 18 iterations, we get 0.666666668082383, which is within 10−8 of an actual root. This is the same root
found by the bisection method, but the bisection method took longer, 27 iterations.
25f: In 9 iterations, we get 21.9911485743912, which is within 10−8 of an actual root. This is the same root found
by the bisection method, but the bisection method took longer, 27 iterations.
31: 3.555963292212723

Section 2.5
2: f and (a), g and (d), h and (b), l and (c)
8: f and (b), g and (c), h and (d), l and (a)

Section 2.6
6b: g(2) = 5 and g′(2) = −8

8b: x1 = 21/8 and x2 = 241003/100544

14b: −8, −2.33333, 0.33333, 2 + i, 2 − i


15b: −2, 0.76393, 5.23607, 0.66667 + 0.57735i, 0.66667 − 0.57735i
16b: They do change, but not within the first five decimal places.
19b: (i) -109.372462336481 (ii) -109.372462336481 (iii) ans = 0
19c: (i) 948.990683139955 (ii) 948.990683139955 (iii) ans = 1
20:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 15 January 2014 %
% Purpose: Implementation of Newton’s Method %
% for polynomials of the form %
% p(x) = c1 + c2*x + c3*x^2 + ... + c(n+1)*x^n %
% using Horner’s Method, n > 1. %
% INPUT: coefficients c; tolerance tol; maximum %
% number of iterations N %
% OUTPUT: approximations to all roots, roots %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function roots = newthornall(c,tol,N,x0)
n=length(c)-1;
for i=1:n-2
res=newtonhorner(c,x0,tol,N)
roots(i)=res;
x0=roots(i);
c=deflate(c,x0);
end%for
[roots(n-1),roots(n)]=quadraticRoots(c(3),c(2),c(1));
end%function

Remark: This code is often successful, but can easily come up empty. For example,

newthornall([56,-152,140,-17,-48,9],1e-5,100,2)

returns
res = 0.763932022500484
res = 5.23606797749979
res = Method failed---maximum number of iterations reached
error: newthornall: A(I) = X: X must have the same size as I
error: called from:
error: .../newthornall.m at line 16, column 13
It fails to come up with the third real root, −2. After finding the first two roots, the polynomial has
been deflated to
14.00000000000065 − 16.99999999999987x+
6.00000000000002x2 + 9.00000000000000x3 .
With this cubic and initial value 5.23606797749979, Newton’s method does not converge to −2. On the
other hand, newthornall([56,-152,140,-17,-48,9],1e-5,100,-2) returns
res = -2
res = 0.763932022500211
res = 5.23606797749979
ans =

Columns 1 and 2:

-2.000000000000000 + 0.000000000000000i 0.763932022500211 + 0.000000000000000i

Columns 3 and 4:

5.236067977499789 + 0.000000000000000i 0.666666666666667 + 0.577350269189623i

Column 5:

0.666666666666667 - 0.577350269189623i
Having found −2 first, it has no problem finding the other roots.

21: (a)
1.5858
−13
4.4142
−2 + 2.2361i
−2 − 2.2361i
(b)
3 − 1.4142i
−2.6
−2 + 2.2361i
−2 − 2.2361i
3 + 1.4142i

Section 2.7
1: (a) x4 = 2.1806 (e) x10 = −502.19 (j) x3 = 1.0079

2: (a) x5 = 2.1798 (e) x9 = −502.19 (j) x6 = 1.0079

3: (a) x7 = 2.1798 (e) x6 = −499.98 (j) x5 = 1.0080



4: (a) x7 = 2.1798 (e) x2 = −499.98 (j) x3 = 4.1495

5: (a) x9 = 2.17975713685875 (e) x18 = −502.188059117229 (j) x4 = 1.00794427892360

6: (a) x6 = 2.17975706647997 (e) x10 = −502.188059235320 (j) x8 = 1.00794427848101

7: (a) x6 = 2.17975706647996 (e) x9 = −502.188059235320 (j) x4 = 1.00794427848094

8: (a), (e), and (j): Bracketed inverse quadratic interpolation is at least as fast or faster than false position or
bracketed Newton’s method.

9: bracketedSteffensens.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Dr. Len Brin 15 January 2014 %
% Purpose: Implementation of Steffensen’s method %
% INPUT: function f; initial value x0; tolerance %
% TOL; maximum iterations N0 %
% OUTPUT: approximation x and number of %
% iterations i; or message of failure %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [x,i] = bracketedSteffensens(f,a,b,TOL,N0)
i=1;
A=f(a);
B=f(b);
while (i<=N0)
b
x0=b;
x1=B;
x2=f(x1);
if (abs(x2-x1)<TOL)
x=x2;
disp(" ");
return
end%if
x=x0-(x1-x0)^2/(x2-2*x1+x0);
if (x<min([a,b]) || x>max([a,b]))
x=a+(b-a)/2;
end%if
if (abs(x-x2)<TOL)
disp(" ");
return
end%if
X=f(x);
if ((B<b && X>x) || (B>b && X<x))
a=b; A=B;
end%if
b=x; B=X;
i=i+1;
end%while
x="Method failed---maximum number of iterations reached";
end%function

10: (a) x6 = 2.17975706643814 (e) x11 = −502.188059386686 (j) x9 = 1.00794427672537

11: (a), (e), and (j): Bracketed inverse quadratic interpolation is at least as fast or faster than bracketed Steffensen’s
method, counting only number of iterations. However, bracketed Steffensen’s requires two function evaluations
per iteration, so for all practical purposes requires more than twice the computational power of bracketed
inverse quadratic interpolation.

13: (a) x6 = 2.17975706647996 (e) x8 = −502.188059227438 (j) x4 = 1.00794427848094


14: (a) and (j): Since the root is on the order of 1, there is no difference between testing absolute and relative
errors. (e) Since the root is around five hundred, the method stops when the absolute error is only about
10−6 · 500 = 5(10)−4 . Consequently, the method stops one iteration earlier when checking relative error than
it does checking absolute error.

Section 3.2
15: 4
16: 3, (18 + √142)/7, or (18 − √142)/7

21: 8

Section 3.3
5: P2 (x) = −0.001642458785316x2 + 1.64927376355948x + 10
8: P3 (x) = 2x − 1. Is degree 1 what you expected?
14: (a) (3/40000)f⁽⁴⁾(ξ₈.₄) (b) 8.7364(10)−5 · max f⁽⁴⁾(x) over x ∈ [8.1, 8.7] (c) .52501

15: 0.5x3 + 1.5x2 + 0.335x + 0.951


19: f and (b), g and (c), h and (d), l and (a)

Section 4.1
2cc: f′(x0 + h/2) ≈ [f(x0 + h) − f(x0)]/h

3cc: f′(x0 + h/2) ≈ [f(x0 + h) − f(x0)]/h

6: (a) 20.32712878304436 (e) 0.6321205268681437 (g) 0.2325441461772505


7: (a) (i) e³ − e⁻⁴ (ii) 0.2599074987454273 (e) (i) 1 − e⁻¹ (ii) 3.196041398201288(10)−8
(g) (i) ln 2 (ii) 1.2110575916990385(10)−5
8b: 1.19336533331362
9b: .19336533331362
11b: 35
28
11f: 15

12a: −1
12c: −23
13b: f′(x0 + 3h) ≈ [7f(x0 + 2h) − 15f(x0) + 8f(x0 − h)]/(6h)

13f: f′(x0 − h) ≈ [−f(x0 + 2h) + 9f(x0) − 8f(x0 − h)]/(6h)

13h: f′(x0) ≈ [−f(x0 + 3h) + 9f(x0 + h) − 8f(x0)]/(6h)


14c: ∫_{x0+θ0h}^{x0+θ1h} f(x)dx ≈ −(h/2) · [(θ1 − θ0)/(θ3 − θ2)] · [(2θ2 − θ1 − θ0)f(x0 + θ3h) − (2θ3 − θ1 − θ0)f(x0 + θ2h)]

15a: ∫_{x0}^{x0+2h} f(x)dx ≈ (h/2)[f(x0) + 3f(x0 + (4/3)h)]

15e: ∫_{x0}^{x0+h} f(x)dx ≈ (h/2)[f(x0) + f(x0 + h)]

Section 4.2
1: (b) f′(x0 + h/4) ≈ [f(x0 + h) − f(x0)]/h
   (f) f′(x0 − h) ≈ [−3f(x0 − h) + 4f(x0) − f(x0 + h)]/(2h)
   (h) f′(x0 − h) ≈ [−7f(x0 − h) + 16f(x0 + 2h) − 9f(x0 + 3h)]/(12h)
   (l) f′(x0) ≈ [−3f(x0 − h) − 10f(x0) + 18f(x0 + h) − 6f(x0 + 2h) + f(x0 + 3h)]/(12h)

2: (b) f″(x0 − h) ≈ [f(x0 − h) − 2f(x0) + f(x0 + h)]/h²
   (d) f″(x0 − h) ≈ [f(x0 − h) − 4f(x0 + 2h) + 3f(x0 + 3h)]/(6h²)
   (h) f″(x0) ≈ [11f(x0 − h) − 20f(x0) + 6f(x0 + h) + 4f(x0 + 2h) − f(x0 + 3h)]/(12h²)

4: (d) ∫_{x0}^{x0+h} f(x)dx ≈ hf(x0)
   (f) ∫_{x0}^{x0+2h} f(x)dx ≈ h[f(x0 + (2/3)h) + f(x0 + (4/3)h)]
   (h) ∫_{x0}^{x0+h} f(x)dx ≈ (h/2)(f(x0) + f(x0 + h))
   (j) ∫_{x0}^{x0+2h} f(x)dx ≈ (h/2)[3f(x0 + (2/3)h) + f(x0 + 2h)]

Section 4.3
2: f′(−2.7) ≈ −0.9151775; f′(−2.5) ≈ 1.5014075; f′(−2.3) ≈ 2.17825; f′(−2.1) ≈ 1.11535
3c: 0.4897985468241977
3e: 149/24 = 6.2083
4: (c) 0.4693956404725931 (e) 17/2 = 8.5
5: (c) 0.5 (e) 81/16 = 5.0625
6: (c) 8.57775220962087(10)−5 (e) 0.0083
7: (c) 0.02031712882950837 (e) 2.3
8: (c) 0.0102872306978985 (e) 1.1375
10: 0
11b: 288666.8155482048
12b: lower: 1565.147456974753 upper: 2334.925631788689 actual: 1915.502415038936
13b: 3.142092629759007
17a: error term: O(h²f′(ξ)) degree of precision: 0
17e: error term: O(h⁴f‴(ξ)) degree of precision: 2
17g: error term: O(h⁴f‴(ξ)) degree of precision: 2
17i: error term: O(h⁵f⁽⁴⁾(ξ)) degree of precision: 3
18a: O(hf″(ξ))
18e: O(h⁴f⁽⁵⁾(ξ))
18g: O(hf‴(ξ))

18i: O(h³f⁽⁵⁾(ξ))

23a: 0.0134k for some constant k depending on the approximation formula, not the function sin x.

25: (a) O(h³) (b) 1 (c) (√3/2)π ≈ 2.720699046351327 (d) π³/36 ≈ 0.8612854633416616 (e) actual absolute error:
0.7206990463513265

27: − 12

28: O(h2 )

30: 0

31: 10506.03569166666

36: approximation 1: [−3(1.22140) + 4(1.10517) − 1]/(−.2) = 1.2176; approximation 2: [1.34986 − 1.10517]/.2 = 1.22345;
approximation 3: [−3(1.22140) + 4(1.34986) − 1.49182]/.2 = 1.2171; rank: 3,1,2; Other answers are acceptable.

38: 2.58629507364657; h = .0000474853515625

Section 4.4
1: (c) 17.52961733248352 (e) 1.560867019857898

2: (c) 19.3773960369059 (e) 1.569045013890161

3: (c) 18.14554356729098 (e) 1.563593017868653

4: (c) 18.14441877898906 (e) 1.563592239944993

5: (c) 17.73342635968343 (e) 1.561774648629937

8: 141

11b:
∫_{x0}^{x0+2h} f(x)dx ≈ (h/(3n)) [ f(x0) + f(x0 + 2h) + 2 Σ_{i=1}^{n−1} f(x0 + 2ih/n) + 4 Σ_{i=1}^{n} f(x0 + (2i − 1)h/n) ]

11c:
∫_{x0}^{x0+3h} f(x)dx ≈ (3h/(8n)) [ f(x0) + f(x0 + 3h) + 2 Σ_{i=1}^{n−1} f(x0 + 3ih/n) + 3 Σ_{i=1}^{n} ( f(x0 + (3i − 2)h/n) + f(x0 + (3i − 1)h/n) ) ]
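
For reference, the formula in 11b translates directly into a short Octave function (a sketch under the notation
above; the name compSimp11b is not from the text):

function I = compSimp11b(f,x0,h,n)
  % composite Simpson's rule for the integral of f from x0 to x0+2h, subinterval width h/n
  I = f(x0) + f(x0+2*h);
  for i = 1:n-1
    I = I + 2*f(x0+2*i*h/n);
  end%for
  for i = 1:n
    I = I + 4*f(x0+(2*i-1)*h/n);
  end%for
  I = h/(3*n)*I;
end%function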

16: 0.386259562814567

19: (a) 1.386294361119891 (b) 132

21: 3.109198655184147; yes

26: 0.3862939349171364; 5

27: A straightforward implementation, adaptSimp():



####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Implementation of adaptive Simpson’s #
# rule #
# INPUT: function f, interval endpoints a and b, #
# desired accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################

function res = adaptSimp(f,a,b,TOL)


h = (b-a)/4;
f0 = f(a);
f1 = f(a+h);
f2 = f(a+2*h);
f3 = f(a+3*h);
f4 = f(b);
error = abs(h*(f0-4*(f1+f3)+6*f2+f4))/45;
if (error <= TOL)
res = h/3*(f0+4*(f1+f3)+2*f2+f4);
else
res = adaptSimp(f,a,a+2*h,TOL/2) + adaptSimp(f,a+2*h,b,TOL/2);
endif
endfunction

A pair of functions that minimizes the number of evaluations of f , aSimp() and adaptiveSimpsons():

####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Wrapper for aSimp() #
# INPUT: function f, interval endpoints a and b, #
# desired accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################
function res = adaptiveSimpsons(f,a,b,TOL)
res = aSimp(f,a,b,f(a),f((a+b)/2),f(b),TOL);
end#function

####################################################
# Written by Leon Brin 15 May 2014 #
# Purpose: Implementation of adaptive Simpson’s #
# rule #
# INPUT: function f, interval endpoints a and b, #
# f0=f(a), f2=f((a+b)/2), f4=f(b), desired #
# accuracy TOL. #
# OUTPUT: approximate integral of f(x) from a to b #
# within TOL of actual. #
####################################################
function res = aSimp(f,a,b,f0,f2,f4,TOL)
h = (b-a)/4;
f1 = f(a+h);
f3 = f(a+3*h);
error = abs(h*(f0-4*(f1+f3)+6*f2+f4))/45;
if (error <= TOL)
res = h/3*(f0+4*(f1+f3)+2*f2+f4);
else

res = aSimp(f,a,a+2*h,f0,f1,f2,TOL/2) + aSimp(f,a+2*h,b,f2,f3,f4,TOL/2);


end#if
end#function

REMARK: aSimp() , adaptSimp(), and adaptiveSimpsons() must be contained in separate .m files.


adaptiveSimpsons() is the only one that should be used directly. The others are called by it. Code may be
downloaded at the companion website.
28: >> f=inline(’log(sin(x))’);
>> adaptiveSimpsons(f,1,3,.002)
ans = -0.70293
30a: (a) (i)

>> format(’long’);
>> f=inline(’x*sin(x^2)’);
>> adaptiveSimpsons(f,0,2*pi,10^-5)
ans = 0.603500307287469

(ii) (1 − cos(4π²))/2 − 0.603500307287469 ≈ 6.175(10)−7 (iii) yes

Section 4.5
1: [8 sin(πh/2) − sin(πh)]/(3h)

3: O(h⁹)

4: 16/9

5: (a) N(1.0) ≈ 0.4596976941318602 and N(0.5) ≈ 0.489669752438509
(b) (i) N1(1.0) ≈ 0.5196418107451577 (ii) N1(1.0) ≈ 0.4996604385407252
(c) assumption (i) because it yields an approximation with error about half that of N(1.0), just what would
be expected if assumption (i) were true.
REMARK: lim_{h→0} (1 − cos h)/h² = 1/2.

9: [N(h) − 12N(h/3) + 27N(h/9)]/16
10: (c) 18.1436194387278 (e) 1.56359161739838 (g) 3.10928992861842
11: The following code works, but is not very efficient and depends on a working compositeTrapezoidal() function.
In fact, the inefficiency is due to calling the compositeTrapezoidal() function. Each time compositeTrape-
zoidal() is called, it recalculates values of f that it already calculated last time it was called. Avoiding this
repetition of work would make the routine much more efficient. Can you think of a way to accomplish this?
romberg.m may be downloaded at the companion website.

####################################################
# Written by Dr. Len Brin 16 May 2014 #
# Purpose: Implementation of Romberg integration #
# INPUT: function f, interval endpoints a and b, #
# tolerance tol #
# OUTPUT: approximate integral of f(x) from a to b #
####################################################
function integral = romberg(f,a,b,tol)
N(1,1)=compositeTrapezoidal(f,a,b,1);
N(2,1)=compositeTrapezoidal(f,a,b,2);
N(2,2)=(4*N(2,1)-N(1,1))/3;
i=2;

while (abs(N(i,i)-N(i,i-1))>tol || abs(N(i,i)-N(i-1,i-1))>tol)


i=i+1;
N(i,1)=compositeTrapezoidal(f,a,b,2^(i-1));
for j=2:i
m=4^(j-1);
N(i,j)=(m*N(i,j-1)-N(i-1,j-1))/(m-1);
end#for
end#while
integral=N(i,i);
end#function

12a: (i)

>> romberg(inline(’x*sin(x^2)’),0,2*pi,10^-5)
ans = 0.603500924593406

(ii) (1 − cos(4π²))/2 − 0.603500924593406 ≈ 2.34(10)−10 (iii) yes, and not just barely

Section 5.2
9c:
S(x) = −.28 + 3.1861(x − .2) − 3.208(x − .2)² − 10.693333(x − .2)³,   x ∈ [.1, .2]
       .0066 + 2.5465(x − .3) − 3.188(x − .3)² + .066667(x − .3)³,    x ∈ [.2, .3]
       .24 + 2.2277(x − .4) + 10.626667(x − .4)³,                     x ∈ [.3, .4]

9f:
S(x) = −.28 + 3.84613(x − .2) − 20.0773(x − .2)² − 245.387(x − .2)³,  x ∈ [.1, .2]
       .0066 + 2.91347(x − .3) + 10.7507(x − .3)² + 102.76(x − .3)³,  x ∈ [.2, .3]
       .24 + 0.1(x − .4) − 38.8853(x − .4)² − 165.453(x − .4)³,       x ∈ [.3, .4]

10c: >> [a,b,c,d]=naturalCubicSpline([.1,.2,.3,.4],[-.62,-.28,.0066,.24])


a =
-0.2800000 0.0066000 0.2400000

b =
3.1861 2.5465 2.2277

c =
-3.20800 -3.18800 0.00000

d =
-10.693333 0.066667 10.626667

12c: >> [a,b,c,d]=clampedCubicSpline([.1,.2,.3,.4],[-.62,-.28,.0066,.24],0.5,0.1)


a =
-0.2800000 0.0066000 0.2400000

b =
3.84613 2.91347 0.10000

c =
-20.077 10.751 -38.885

d =
-245.39 102.76 -165.45

Section 6.1
1a: one

1c: two

1f: two

2a: ẏ(t) = e^t. Substituting into ẏ = y yields e^t = e^t, a true statement.

2c:
ṡ(t) = (1/2)e^(−t/2)[√3 cos((√3/2)t) − sin((√3/2)t)]
s̈(t) = −(1/2)e^(−t/2)[√3 cos((√3/2)t) + sin((√3/2)t)]

Substituting into s̈ + ṡ + s = 0 yields

−(1/2)e^(−t/2)[√3 cos((√3/2)t) + sin((√3/2)t)] + (1/2)e^(−t/2)[√3 cos((√3/2)t) − sin((√3/2)t)] + e^(−t/2) sin((√3/2)t) = 0,

a true statement.
  
2f: ṙ(t) = 1/(2√t) and r̈(t) = −1/(4t√t). Substituting into r̈·ṙ·t² = −1/8 yields [−1/(4t√t)][1/(2√t)]t² = −1/8, a true
statement for t > 0.

3a: ẏ(t) = 4e^t. Substituting into ẏ = y yields 4e^t = 4e^t, a true statement. Furthermore, y(0) = 4e^0 = 4 as required.
  
3c: ṡ(t) = −te^(−t^2). Substituting into ṡ = (1 − 2s)t yields −te^(−t^2) = (1 − 2 · (1/2)(1 + e^(−t^2)))t, a true statement. Furthermore, s(0) = (1/2)(1 + e^0) = 1 as required.
  
3f: ṙ(t) = 1/(2√t) and r̈(t) = −1/(4t√t). Substituting into r̈ ṙ t^2 = −1/8 yields (−1/(4t√t))(1/(2√t))t^2 = −1/8, a true statement for t > 0. Furthermore, r(9) = √9 − 3 = 0 and ṙ(9) = 1/(2√9) = 1/6 as required.

4a: y(x) = x^5 + C

4d: y(t) = ln |t| + C, t < 0

4f: s(t) = 3(t − 1)e^t + C

5a: From the graphs of the exact and approximate solutions, it appears the approximation is reasonable, but gets progressively worse as t increases. The greatest error occurs at t = 1 and, to be more precise, the relative error there is about 0.099, less than 10%.

5c: From the graphs of the exact and approximate solutions, it appears the approximation is very good at t = 0
and t = 2, but is not particularly accurate between. To be more precise, the relative errors at t = 0.5, 1, 1.5
are about .124, .097, and .095. At three of the five points, the relative error is 9.5% or more.

5f: From the graphs of the exact and approximate solutions, the approximation looks very good for all values of t.
The greatest errors seem to occur at t = 11 and t = 13. To get an idea of just how good the approximation
is, the absolute errors at t = 11 and t = 13 are about .0066 and .0044, respectively. The relative errors are
about .021 and .0073, respectively. All small errors.

6a: (diagram)
6b: (diagram)
6e: Fapplied and Ffriction may be swapped.
6g: (diagram)
6h: (diagram)
6i: (diagram)
6j: (diagram)
6l: (diagram)
6n: (diagram)
6o: (diagram)

7: (6a) θ̈ + (g/ℓ) sin θ = 0; (6b) with downhill as the positive direction: s̈ = g(sin α − µ cos α); (6e) s̈ = (1/m)Fapplied − µg; (6g) with uphill as the positive direction: s̈ = (1/m)Fapplied cos(β − α) − g(sin α + µ cos α); (6h) with the direction of the sled's motion as the positive direction: s̈ = −µg; (6i) with downhill as the positive direction: s̈ = g(sin α − µ cos α); (6j) with the direction of the puck's motion as the positive direction: s̈ = −µg; (6l) with up as the positive direction: s̈ = (c/m)ṡ − g; (6n) with up as the positive direction: s̈ = (c/m)ṡ − g; (6o) with up as the positive direction: s̈ = −g
8: Kinetic friction: µmg versus µ(mg + Fapplied sin 20°). Necessary applied force to overcome friction: µmg versus µmg/(cos 20° − µ sin 20°). The applied force pushing parallel to the floor will need to be only (cos 20° − µ sin 20°) times as great as when pushing at 20° from parallel. For example, when µ = .3, cos 20° − µ sin 20° ≈ .837, so the necessary force pushing parallel to the floor is only 83.7% of that needed pushing at 20° from parallel.

Section 6.2
1c: y(2) ≈ 1.3125
2c: y(2) ≈ 1.88671875

3c: y(2) ≈ 2.126953125


4: y(1.5) ≈ 0.8203125
5:

Assumptions: The solution of the o.d.e. exists and is unique on the interval from t0 to t1 .
Input: Differential equation ẏ = f (t, y); formula ÿ(t, y); initial condition y(t0 ) = y0 ; numbers t0 and t1 ;
number of steps N .
Step 1: Set t = t0 ; y = y0 ; h = (t1 − t0 )/N
Step 2: For j = 1 . . . N do Steps 3-4:
Step 3: Set y = y + h f(t, y) + (1/2)h^2 ÿ(t, y)
Step 4: Set t = t0 + (j/N)(t1 − t0)
Output: Approximation y of the solution at t = t1 .

8: taylor2ode.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 13 November 2015 %
% Purpose: This function implements Taylor’s method of order 2 %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); function (df/dx)(x,y); interval [a,b]; %
% y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = taylor2ode(f,ft,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
y(i+1) = y(i) + h*(f(x(i),y(i)) + 0.5*h*ft(x(i),y(i)));
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
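For example (a test problem added for illustration, not one of the book's exercises), for ẏ = y with y(0) = 1 the second argument must supply ÿ, which for this equation is again y:

>> f = @(x,y) y;             % dy/dx = y
>> ft = @(x,y) y;            % second derivative of the solution is also y
>> [y,x] = taylor2ode(f,ft,0,1,1,10);
>> y(end)                    % approximates y(1) = e, about 2.71828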

11:
y(2) ≈ 2.3125, 2.28814697265625, 2.28469446951954, 2.28402793464698
absolute errors are approximately
0.02866617919084, 0.004313151847096, 8.606487103870(10)^-4, 1.941138378267(10)^-4
error ratios are approximately 6.6, 5.0, 4.4.
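(Observation added for context: assuming the number of steps doubles from one run to the next, error ratios approaching 4 are exactly what an O(h^2) method should produce; a first-order method such as Euler's would give ratios approaching 2.)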
14a:
u̇ = −(g/ℓ) sin θ
θ̇ = u

14b:
u̇ = g(sin α − µ cos α)
ṡ = u

14e:
u̇ = (1/m)Fapplied − µg
ṡ = u

14g:
u̇ = (1/m)Fapplied cos(β − α) − g(sin α + µ cos α)
ṡ = u

14h:

u̇ = −µg
ṡ = u

14i:

u̇ = g(sin α − µ cos α)
ṡ = u

14j:

u̇ = −µg
ṡ = u

14l:
u̇ = (c/m)u − g
ṡ = u

15: (a) −0.6656470478206087 (b) 0.2384138557742662 (e) 0.05695982142857142 (g) 0.2313498206324268 (h) 14.979875
(i) 5.988821238748838 (j) 43.9939625 (l) 4.387767857142857

18c: y(x) = (2/3)x − 5/4

18d: y(x) = (7/2)x^2 + (11/7)x + 143/49

18e: y(t) = t^4 − 8t^3 + 48t^2 − 192t + 385

18g: θ(t) = −(2/5)e^(−t) sin t − (1/5)e^(−t) cos t


18i: x(t) = (1/12)te^(7t) − 1/35

Section 6.3
1b:

k1 = f(ti, yi)
k2 = f(ti + h/2, yi + (h/2)k1)
yi+1 = yi + h k2

1c:

k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
yi+1 = yi + (h/2)[3k2 − k1]

1g:

k1 = f(ti, yi)
k2 = f(ti + h/3, yi + (h/3)k1)
k3 = f(ti + h/2, yi + (h/2)k2)
k4 = f(ti + 2h/3, yi + (2h/3)k1)
yi+1 = yi + (h/2)[3k2 − 4k3 + 3k4]

1j:

k1 = f(ti, yi)
k2 = f(ti + ((√5 − √3)/(2√5))h, yi + ((√5 − √3)/(2√5))h k1)
k3 = f(ti + h/2, yi + (h/2)k2)
k4 = f(ti + ((√5 + √3)/(2√5))h, yi + ((√5 + √3)/(2√5))h k1)
yi+1 = yi + (h/18)[5k2 + 8k3 + 5k4]

2b: O(h^2); Yes

2c: O(h^2); Yes
2g: O(h^3); No
2j: O(h^2); No
6: on page 301
3:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 28 May 2016 %
% Purpose: This function implements the Midpoint method where %
% the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = midpoint(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+h/2,y(i)+h/2*k1);
y(i+1) = y(i) + h*k2;
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
This code may be downloaded at the companion website.
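A sample call (an illustrative test problem, not from the text): for y' = −2xy, y(0) = 1 on [0, 1], whose exact solution is e^(−x^2),

>> f = inline('-2*x*y','x','y');
>> [y,x] = midpoint(f,0,1,1,20);
>> y(end)                    % should be close to exp(-1), about 0.36788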

7:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 28 May 2016 %
% Purpose: This function implements Ralston’s method where %
% the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = ralston(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+2*h/3,y(i)+2*h/3*k1);
y(i+1) = y(i) + h/4*(k1+3*k2);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function
This code may be downloaded at the companion website.

8c: 2.071336302192492
9c: 2.237523715781341
10c: 2.240722979472185
11c: 2.235615854209425
12c: 2.236251636584492

Section 6.4
1b: O(h^3); equal to that of underlying integration formula; yes, one degree higher than rate of convergence.
1c: O(h^3); equal to that of underlying integration formula; yes, one degree higher than rate of convergence.
1g: NOTE: Since this is a four-stage method, equations 6.4.5-6.4.14 must be used to determine the rate of convergence. O(h^4); less than that of underlying integration formula; yes, one degree higher than rate of convergence.
1j: NOTE: Since this is a four-stage method, equations 6.4.5-6.4.14 must be used to determine the rate of convergence. O(h^3); less than that of underlying integration formula; yes, one degree higher than rate of convergence.
4: eulerimp.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements improved Euler’s method %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = eulerimp(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i),y(i));
k2 = f(x(i)+h,y(i) + h*k1);
y(i+1) = y(i) + h/2*(k1+k2);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function

5: heun.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements Heun’s third order method %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = heun(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+h/3, y(i)+h/3*k1);
k3 = f(x(i)+2*h/3, y(i)+2*h/3*k2);
y(i+1) = y(i) + h/4*(k1+3*k3);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function

6: rk4.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 1 June 2016 %
% Purpose: This function implements Runge-Kutta 4th order (RK4) %
% where the step size is calculated and held constant. %
% INPUT: function f(x,y); interval [a,b]; y(a); steps n %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,x] = rk4(f,a,ya,b,n)
i = 1;
x(i) = a;
y(i) = ya;
h = (b-a)/n;
while (i<=n)
k1 = f(x(i), y(i));
k2 = f(x(i)+h/2, y(i)+h/2*k1);
k3 = f(x(i)+h/2, y(i)+h/2*k2);
k4 = f(x(i)+h, y(i)+h*k3);
y(i+1) = y(i) + h/6*(k1+2*k2+2*k3+k4);
x(i+1) = a + (b-a)*i/n;
i = i+1;
end%while
end%function

Section 6.5
1: One way to code it would be the following. rk23.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 31 May 2016 %
% Purpose: This function implements an adaptive rk2(3) method %
% where the step size is controlled by the routine. %
% Heun’s third order method is combined with open-ode. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,t] = rk23(f,a,ya,b,h,eps,N)
i = 1;
t(i) = a;
y(i) = ya;
done = 0;
while (!done && i<=N)
if ((b-t(i)-h)*(b-a)<=0)
h=b-t(i);
done = 1;
endif
k1 = f(t(i), y(i));
k2 = f(t(i)+h/3, y(i)+h/3*k1);
k3 = f(t(i)+2*h/3, y(i)+2*h/3*k2);
err = abs(h/4*(k1-2*k2+k3));
if (done || err<=eps)
y(i+1) = y(i) + h/4*(k1+3*k3);
t(i+1) = t(i) + h;
if (t(i+1) == t(i))
disp("Procedure failed. Step size reached zero.")
return
endif
i = i+1;
endif
q = 0.9*realpow(eps/err,1/3);
q = max(q,0.1);
q = min(5.0,q);
h = q*h;
end%while
if (!done)
disp("Procedure failed. Maximum number of iterations reached.")
endif
end%function
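A sample call (illustration only, not from the text), again using the test problem y' = y, y(0) = 1:

>> f = @(t,y) y;
>> [y,t] = rk23(f,0,1,2,0.1,1e-6,1000);
>> y(end)                    % approximates y(2) = e^2, about 7.3891
>> length(t)-1               % number of steps the routine actually took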

2: (a) and (d).

12b: The Butcher tableau is

0    |
1/3  | 1/3
2/3  | -1/3   1
1    | 1     -1     1
-----+------------------------
     | 1/8    3/8    3/8    1/8
     | 0      1/2    1/2    0
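Read as an embedded pair (an observation added here, not part of the printed answer), the first row of weights is the 3/8-rule fourth order method and the second row is a lower order companion, so a step size controller built from this tableau would estimate the local error by the difference of the two rows applied to the stages, err = h|(1/8)k1 − (1/8)k2 − (1/8)k3 + (1/8)k4|, just as the rk23 code above uses err = h|(1/4)(k1 − 2k2 + k3)|.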

14: One way to code it would be the following. merson.m may be downloaded at the companion website.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Written by Leon Brin 9 June 2016 %
% Purpose: This function implements the method of Merson (1957) %
% where the step size is controlled by the routine. %
% INPUT: function f(x,y); interval [a,b]; y(a); initial step %
% size h; tolerance eps; maximum steps N; %
% OUTPUT: approximation (x(i),y(i)) of the solution of y’=f(x,y) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [y,t] = merson(f,a,ya,b,h,eps,N)
i = 1;
t(i) = a;
y(i) = ya;
done = 0;
while (!done && i<=N)
if ((b-t(i)-h)*(b-a)<=0)
h=b-t(i);
done = 1;
endif
k1 = f(t(i), y(i));
k2 = f(t(i)+h/3, y(i)+h/3*k1);
k3 = f(t(i)+h/3, y(i)+h/6*(k1+k2));
k4 = f(t(i)+h/2, y(i)+h/8*(k1+3*k3));
k5 = f(t(i)+h, y(i)+h/2*(k1-3*k3+4*k4));
err = abs(h/30*(2*k1-9*k3+8*k4-k5));
if (done || err<=eps)
y(i+1) = y(i) + h/6*(k1+4*k4+k5);
t(i+1) = t(i) + h;
if (t(i+1) == t(i))
disp("Procedure failed. Step size reached zero.")
return
endif
i = i+1;
endif
q = 0.9*realpow(eps/err,1/4);
q = max(q,0.1);
q = min(5.0,q);
h = q*h;
end%while
if (!done)
disp("Procedure failed. Maximum number of iterations reached.")
endif
end%function

15d: As can be seen from the diagram, most tolerances greater than 10^-4 do not produce a global error of 10^-4 or less, though there are exceptions. If just guessing and checking, likely you will end up with a tolerance of 5(10)^-5 or less.

[Figure: Cash-Karp, global error versus tolerance (log-log axes); tolerance from 1e-5 to 1e-3, global error from 1e-8 to 1e+0.]

15f: As can be seen from the diagram, most tolerances less than 10^-3 produce a global error of 10^-4 or less, as do some greater tolerances.

[Figure: RK2(4), global error versus tolerance (log-log axes); tolerance from 1e-4 to 1e-2, global error from 1e-9 to 1e-1.]

16d: As can be seen from the diagram, tolerances less than 10^-4 produce a global error of 10^-4 or less, as do some slightly higher tolerances.

[Figure: Cash-Karp, global error versus tolerance (log-log axes); tolerance from 1e-5 to 1e-3, global error from 1e-6 to 1e-3.]

16f: As can be seen from the diagram, most tolerances less than about 5(10)^-3 produce a global error of 10^-4 or less, as do some slightly greater tolerances.

[Figure: RK2(4), global error versus tolerance (log-log axes); tolerance from 1e-4 to 1e-2, global error from 1e-7 to 1e-2.]

19: (a) y(5) ≈ 6.40926980783945; error ≈ 1.75(10)^-4, 75% greater than the tolerance. (b) y(5) ≈ 6.40708478227220; error ≈ 2.36(10)^-3, nearly 24 times the tolerance. (c) y(5) ≈ 6.40937679658180; error ≈ 6.82(10)^-5, about 68% of the tolerance. (d) y(5) ≈ 6.40885618182156; error ≈ 5.88(10)^-4, nearly 6 times the tolerance.
20: In order from most to least efficient: Cash-Karp, Merson, RK2(3), Bogacki-Shampine, with evaluations 42, 50,
69, and 138, respectively.
Bibliography

[1] Robert E. Barnhill and Richard F. Riesenfeld, editors. Computer Aided Geometric Design : Proceedings of
a conference held at the University of Utah, Salt Lake City, Utah, March 18-21, 1974. Academic Press, New
York, 1974.

[2] Michael F. Barnsley. Fractals Everywhere. Academic Press, Boston, 1988.

[3] R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal,
14(4):422–425, 1971.

[4] John Briggs and F. David Peat. Turbulent Mirror, page 69. Harper & Row Publishers, New York, 1989.

[5] Richard L. Burden and J. Douglas Faires. Numerical Analysis. Thomson Brooks/Cole, 8th edition, 2005.

[6] J.C. Butcher. The Numerical Analysis of Ordinary Differential Equations : Runge-Kutta and General Linear
Methods. John Wiley & Sons, 1987.

[7] J.C. Butcher. A history of Runge-Kutta methods. Applied Numerical Mathematics, 20:247–260, 1996.

[8] J.R. Cash and Alan H. Karp. A variable order Runge-Kutta method for initial value problems with rapidly varying right-hand sides. ACM Transactions on Mathematical Software, 16(3):201–222, September 1990.

[9] Bill Casselman. From Bézier to Bernstein. http://www.ams.org/samplings/feature-column/fcarc-bezier, June 2014.

[10] Paul de Faget de Casteljau. De Casteljau’s autobiography : My time at Citroën. Computer Aided Geometric
Design, 16(7):583–586, August 1999.

[11] David Goldberg. What every computer scientist should know about floating-point arithmetic. http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html. Accessed June 2014.

[12] S. W. Golomb. Checker boards and polyominoes. Amer. Math. Monthly, 61:675–682, 1954.

[13] Richard Guichard. Calculus : Early transcendentals. http://www.whitman.edu/mathematics/multivariable/, January 2014.

[14] Denny Gulick. Encounters with Chaos, page 2. McGraw-Hill, New York, 1992.

[15] Bryce Harrington and Johan Engelen. Inkscape. Software available at http://www.inkscape.org/.

[16] K. Heun. Neue Methode zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Zeitschrift für Mathematik und Physik, 45:23–38, 1900.

[17] Jeffery J. Leader. Numerical Analysis and Scientific Computing. Pearson, 2004.

[18] Eugene Loh and G. William Walster. Rump’s example revisited. Reliable Computing, 8(3):245–248, 2002.

[19] Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, March
1963.


[20] Michael R. Matthews. Time for science education : how teaching the history and philosophy of pendulum
motion can contribute to science literacy. Kluwer Academic/Plenum Publishers, New York, 2000.
[21] Michael R. Matthews, Michael P. Clough, and Craig Ogilvie. Pendulum motion: The value of idealization in
science. http://www.storybehindthescience.org/pdf/pendulum.pdf.

[22] Cleve Moler. Numerical Computing with MATLAB, chapter 4. The MathWorks, Natick, MA, 2004. https://www.mathworks.com/moler/index_ncm.html.
[23] David E. Müller. A method for solving algebraic equations using an automatic computer. Mathematical Tables
and Other Aids to Computation, 10(56):208–215, October 1956.

[24] L. Mumford. Technics and Civilization. Harcourt Brace Jovanovich, New York, 1934.
[25] Ron Naylor. Galileo, copernicanism and the origins of the new science of motion. The British Journal for the
History of Science, 36(2):151–181, June 2003.
[26] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C :
The Art of Scientific Computing. Cambridge University Press, New York, 2nd edition, 1999.

[27] The GNOME Project. Dia. Software available at http://live.gnome.org/Dia.


[28] Siegfried M. Rump. Algorithms for verified inclusions: Theory and practice. In R. E. Moore, editor, Reliability
in Computing: The Role of Interval Methods in Scientific Computing, pages 109–126, Boston, 1988. Academic
Press.

[29] J. R. Sharma. A family of methods for solving nonlinear equations using quadratic interpolation. Computers
and Mathematics with Applications, 48(5-6):709–714, September 2004.
[30] Avram Sidi. Generalization of the secant method for nonlinear equations. Applied Mathematics E-Notes,
8:115–123, 1999. Available free at mirror sites of http://www.math.nthu.edu.tw/~amen/.

[31] Gilbert Strang. Calculus. http://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf. Accessed June 2014.
[32] Ruedeger Timm et al. LibreOffice. Software available at http://www.libreoffice.org/.
[33] Unknown. Huygens’ clocks. http://www.sciencemuseum.org.uk/onlinestuff/stories/huygens_clocks.
aspx.

[34] Charles F. Van Loan. Introduction to Scientific Computing : A Matrix Vector Approach Using MATLAB.
Prentice-Hall, Upper Saddle River, NJ, 2nd edition, 2000.
[35] Christopher Vickery. IEEE-754 analysis. http://babbage.cs.qc.cuny.edu/IEEE-754/. Accessed June 2013.
Index

3/8-rule Runge-Kutta method, 234 second order, 204


solution, 196
accuracy, 1, 6 stiff, 232
significant digits of, 24 divided difference, 118, 119, 121, 123
adaptive quadrature, see numerical integration, adaptive division
adaptive Runge-Kutta method, 227, 234 synthetic, 71, 82
pseudo-code, 230
adaptive Simpson’s rule embedded Runge-Kutta method, 234
code, 331 error, 1
Aitken’s delta-squared method, 58, 59 absolute, 1, 5
algorithmic, 1, 3, 6
Bernstein polynomial, 110 floating-point, 1, 3, 6
bisection method, 37 relative, 1, 5
analysis, 40 round-off, 6
pseudo-code, 39 truncation, 6
Bode’s rule, 156 error checking, 63
Bogacki-Shampine method, 235 Euler’s method, 202, 205, 234
bracketed inverse quadratic interpolation, 98 code, 301
Octave code, 96 pseudo-code, 203
bracketed Newton’s method, 91, 98 explicit trapezoidal method, 222
Octave code, 92
bracketed secant method, 91, 98 false position, see bracketed secant method
Octave code, 92 fixed point, 46
bracketing, 91, 98 attractive, 53
pseudo-code, 93 repulsive, 53
Brent’s method, 94 fixed point iteration method, 46, 53
Butcher tableau, 231, 234 analysis, 56
pseudo-code, 53
Cardano floating-point arithmetic, 2, 6
cubic formula of, 69 force
Cash-Karp method, 235, 236 applied, 196
composite trapezoidal rule compression, 195
code, 281 drag, 194, 195
convergence frictional, 196
order of, 19–21, 24 gravitational, 194, 195
rate of, 22–24 normal, 195
superlinear, 56, 61 spring, 195
superquadratic, 61 tension, 194, 195
convergence diagram, 60 free body diagram, 194
cubic formula, 68
Galileo, 193, 194
deflation, 83, 88 Golomb
differential equation, 195 Solomon, 31
approximate solution, 196
degree, 195 Heun
ordinary, 195 Karl, 222, 227


Heun’s third order method, 222, 227 modified Euler, see modified Euler method
code, 343 Neville’s, see Neville’s method
Horner’s method, 82, 88 Newton’s, see Newton’s method
code, 326 Ralston’s, see Ralston’s method
pseudo-code, 84 regula falsi, see bracketed secant method
Huygens, Christiaan, 193, 194 RK4, see RK4 method
Runge-Kutta, see Runge-Kutta method
implicit Runge-Kutta method, 232 secant, see secant method
improved Euler method, 222, 234 seeded secant, see seeded secant method
code, 342 Sidi’s, see Sidi’s method
initial value problem, 196, 197 Steffensen’s, see Steffensen’s method
interpolating function, 106, 114 Taylor’s, see Taylor’s method
interpolating polynomial, 114 midpoint method, 213
inverse quadratic interpolation method, 94, 98 code, 341
order of convergence, 95 midpoint rule, 156
iteration, 46 modified Euler method, 213

Kutta Neville’s method, 111, 115


Martin, 222 Octave code, 114
pseudo-code, 113
Lagrange form, 107, 114 Newton
Lorenz, Edward, 4 second law of motion, 194
Newton form, 117, 118, 123
Müller’s method, 86, 88 Newton’s method, 65, 66, 71, 86
order of convergence, 87 pseudo-code, 66
Maclaurin polynomial, 13 node, 128, 134
Maxima, 141 numerical differentiation, 132, 137
Merson method, 235 numerical integration, see also quadrature, 133, 139
code, 344 adaptive, 162, 164
method composite, 161, 164
3/8-rule Runge-Kutta, see 3/8-rule Runge-Kutta method Romberg, 172
adaptive Runge-Kutta, see adaptive Runge-Kutta
method o.d.e., see differential equation, ordinary
Aitken’s delta-squared, see Aitken’s delta-squared Octave
method %, 27
bisection, see bisection method .m file, 15, 32
Bogacki-Shampine, see Bogacki-Shampine method arithmetic operations, 6
bracketed inverse quadratic interpolation, see brack- array, 25
eted inverse quadratic interpolation boolean operators, 41
bracketed Newton’s, see bracketed Newton’s method comments, 27
bracketed secant method, see bracketed secant method comparison, 41
Brent’s, see Brent’s method constants, 8
Cash-Karp, see Cash-Karp method custom functions, 31
embedded Runge-Kutta, see embedded Runge-Kutta disp, 26
method end, 25
Euler’s, see Euler’s method for loop, 24
explicit trapezoidal, see explicit trapezoidal method format, 6
false position, see bracketed secant method if then [else], 40
fixed point iteration, see fixed point iteration method inline function, 15
Heun’s third order, see Heun’s third order method length of an array, 25
Horner’s, see Horner’s method recursive function, 33
implicit Runge-Kutta, see implicit Runge-Kutta method sprintf, 209
improved Euler, see improved Euler method standard functions, 6
inverse quadratic interpolation, see inverse quadratic while loop, 61
interpolation method
Müller’s, see Müller’s method pendulum, 193–195
Merson, see Merson method π
midpoint, see midpoint method approximation, 23

polynomial Taylor’s method of degree 3


finding all roots, 83 pseudo-code, 203
Maclaurin, 13 Taylor’s method of order 2
Taylor, 10, 13 code, 339
polynomial approximation, 134 pseudo-code, 339
potential leading coefficient, 117, 123 Theorem
precision Fixed Point Convergence, 48, 53
degree of, 149, 152 Fixed Point Error Bound, 56, 60
Fundamental Factorization, 83
quadratic formula Generalized Rolle’s, 114
alternate, 85 Intermediate Value, 40
quadrature, see also numerical integration, 152 Mean Value, 53
Gaussian, 149, 152 of Algebra, Fundamental, 83
Rational Roots, 71
Ralston’s method, 213 Rolle’s, 13
code, 342 Taylor’s, 10, 13, 14
Ramanujan Taylor’s two variable, 217, 225
Srinivasa, 23 Weighted Mean Value, 152, 275
recursion, 29 trapezoidal rule, 156
regula falsi, see bracketed secant method adaptive, 162
Richardson’s extrapolation, 168 adaptive, pseudo-code, 164
RK2(3) method, 230 composite, 161
code, 344 composite, pseudo-code, 162
RK3(4) method trominos, 30
code, 317
RK4 method, 222, 223 undetermined coefficients, 137, 144, 206
code, 343
Romberg integration, 172 validation, 63
code, 333
Runge web diagram, 47
Carl, 222 wxMaxima, 141
Runge-Kutta method, 207, 217, 227

secant method, 67, 71


analysis, 67
pseudo-code, 70
seeded secant method, 70, 71
pseudo-code, 70
separation of variables, 198
Sidi’s method, 111, 115, 119
Octave code, 121
pseudo-code, 119
Simpson’s rule, 156
Simpson’s 83 rule, 156
Steffensen’s method, 59, 61
code, 328
pseudo-code, 60
stencil, 131, 134
stopping criterion
for root finding, 97
synthetic division, 82, 88

Taylor
Brook, 14
error term, 11
polynomial, 10, 13
remainder term, 10
Taylor’s method, 201, 205
