Lecture Notes On Numerical Methods For Engineering (?) : Pedro Fortuny Ayuso
CC BY: Copyright © 2011-2015 Pedro Fortuny Ayuso
Contents

Introduction
1. Some minor comments

Chapter 4. Interpolation
1. Linear (piecewise) interpolation
2. Can parabolas be used for this?
3. Cubic Splines: Continuous Curvature
4. The Lagrange Interpolating Polynomial
5. Approximate Interpolation
6. Code for some of the Algorithms

Chapter ??
2. The Basics
3. Discretization
4. Sources of error: truncation and rounding
5. Quadratures and Integration
6. Euler's Method: Integrate Using the Left Endpoint
7. Modified Euler: the Midpoint Rule
8. Heun's Method: the Trapezoidal Rule
Introduction
These notes cover what is taught in the classes of Numerical Methods for Engineering in the School at Mieres. One should not expect more than that: a revision of what has been or will be explained during the course. For a more correct, deep and thorough explanation, one should go to the material referenced in the bibliography¹.

¹Simply stated: there is none.
CHAPTER 1
An expression of one of the forms

    ±A.B × 10^C,    ±A.BeC,

where A, B and C are natural numbers (possibly 0) and ± is a sign (which may be elided if +), refers to the real number ±(A + 0.B) × 10^C (where 0.B is "nought dot B...").
For example:
The number 3.123 is the same as 3.123e0.
The number 0.01e-7 is 0.000000001 (eight zeroes after the dot and before the 1).
1. EXPONENTIAL NOTATION
the 11 bits of e, but shifted to the right. The exponent e = 01010101011, which in decimal is 683, represents, in double precision format, the number 2^(683-1023) = 2^(-340). Those e with a starting 0 bit correspond to negative powers of 2, and those having a starting 1 bit to positive powers (recall that 2^10 = 1024).
If e = 0 and m ≠ 0 (if there is a mantissa), then
    N = (-1)^s 2^(-1023) × 0.m,
where 0.m means "zero dot m" in binary.
If e = 0 and m = 0, the number is either +0 or -0, depending on the sign s. This means that 0 has a sign in double precision arithmetic (which makes sense for technical reasons).
The case e = 2047 (when all the eleven digits of e are 1) is reserved for encoding exceptions like ±∞ and NaN ("not-a-number"s). We shall not enter into details.
As a matter of fact, the standard is much longer and more thorough, and includes a long list of requisites for electronic devices working in floating point (for example, it specifies how truncations of the main mathematical operations have to be performed to ensure exactness of the results whenever possible).
The main advantage of floating point (and of double precision specifically) is, apart from standardization, that it enables computing with very small and very large numbers in a single format (the smallest storable number is 2^(-1023) ≃ 10^(-308) and the largest one is 2^1023 ≃ 10^308). The trade-off is that if one works simultaneously with both types of quantities, the small ones lose precision and tend to disappear (a truncation error takes place). However, used carefully, it is a hugely useful and versatile format.
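The bit layout just described can be inspected directly. The following sketch (in Python, for easy access to the raw bytes; it is not part of the original notes) splits a double into its sign, exponent and mantissa fields:

```python
import struct

def double_bits(x):
    """Split the 64 bits of a double into (sign, exponent, mantissa).

    IEEE 754 double precision layout: 1 sign bit, 11 exponent bits
    (biased by 1023), 52 mantissa bits.
    """
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = b >> 63
    exponent = (b >> 52) & 0x7FF      # the 11 bits of e
    mantissa = b & ((1 << 52) - 1)    # the 52 bits of m
    return sign, exponent, mantissa

# 1.0 is stored with e = 1023 (i.e. power 2^0) and empty mantissa
print(double_bits(1.0))           # (0, 1023, 0)
# 2^-340 has biased exponent e = -340 + 1023 = 683, as in the text
print(double_bits(2.0 ** -340))   # (0, 683, 0)
```

The second line reproduces exactly the exponent e = 683 discussed above.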
1.2. Binary to decimal (and back) conversion. In this course, it is essential to be able to convert a number from binary to decimal representation and back.
In these notes, we shall use the following notation: the expression x ← a means that x is a variable, a is a value (so that it can be a number or another variable) and that x gets assigned the value of a. The expression u = c is the conditional statement "the value designated by u is the same as that designated by c", which can be either true or false. We also denote by m//n the quotient of dividing the natural number m > 0 by the natural number n > 0, and by m%n the remainder of that division.

Given a real number x and an approximation to it, denoted x̃, the absolute error is
    |x − x̃|
and the relative error is
    |x − x̃| / |x|
(which is always positive). We are not going to use a specific notation for them (some people use Δ and δ, respectively).
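The quotient-and-remainder notation is exactly what base conversion needs: the binary digits of a natural number m are the successive remainders m%2 of the quotients m//2. A short sketch (in Python; the function names are ours):

```python
def to_binary(m):
    """Binary digits of the natural number m > 0, collecting the
    remainders m % 2 of the successive quotients m // 2."""
    digits = []
    while m > 0:
        digits.append(m % 2)
        m = m // 2
    # the first remainder is the LAST binary digit, so reverse
    return "".join(str(d) for d in reversed(digits))

def from_binary(s):
    """Back from binary to decimal, by Horner-style accumulation."""
    m = 0
    for d in s:
        m = 2 * m + int(d)
    return m

print(to_binary(683))             # 1010101011 (the exponent of Section 1)
print(from_binary("1010101011"))  # 683
```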
EXAMPLE 2. The constant π, which is the ratio between the length of the circumference and its diameter, is, approximately, 3.1415926534+ (the trailing + indicates that the real number is larger than the representation). Assume one uses the approximation π ≃ 3.14. Then
This last statement means that one is incurring an error of 5 ten-thousandths per unit (approx. 1/2000) each time 3.14 is used for π. Thus, if one adds 3.14 two thousand times, the error incurred when using this quantity instead of 2000π will be approximately π. This is the meaning of the relative error: its reciprocal is the number of times one has to add x̃ in order to get an accrued error equal to the number being approximated.
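This can be checked numerically; a quick Python sketch of the computation just described:

```python
import math

pi_approx = 3.14
abs_err = abs(math.pi - pi_approx)
rel_err = abs_err / abs(math.pi)
print(abs_err)   # about 0.0016
print(rel_err)   # about 5.07e-4, roughly 1/2000

# Adding 3.14 two thousand times accrues an error of roughly pi:
accrued = abs(2000 * math.pi - 2000 * pi_approx)
print(accrued)   # about 3.19, close to pi
```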
Before proceeding with the analysis of errors, let us define the two most common ways of approximating numbers using a finite quantity of digits: truncation and rounding. The precise definitions are somewhat involved; the idea is what matters. Start with a real number (with possibly an infinite quantity of digits):
    x = a_1 a_2 … a_r . a_{r+1} … a_n …
(notice that there are r digits to the left of the decimal point) and define:
DEFINITION 6. The truncation of x to k (significant) digits is the number obtained by keeping the first k significant digits of x and discarding the rest.
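Both operations can be sketched in a few lines (Python; note that the sketch operates on the decimal rendering of binary doubles, so the last digit may occasionally differ from the exact decimal result):

```python
import math

def truncate(x, k):
    """Truncation of x to k significant digits: keep the first k
    significant digits and discard the rest (no rounding)."""
    if x == 0:
        return 0.0
    d = math.floor(math.log10(abs(x)))   # position of the leading digit
    factor = 10.0 ** (k - 1 - d)
    return math.trunc(x * factor) / factor

def round_sig(x, k):
    """Rounding of x to k significant digits (nearest value)."""
    if x == 0:
        return 0.0
    d = math.floor(math.log10(abs(x)))
    return round(x, k - 1 - d)

print(truncate(math.pi, 3))     # 3.14
print(round_sig(math.pi, 3))    # 3.14
print(truncate(0.0234199, 3))   # 0.0234
```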
http://www.cs.utexas.edu/downing/papers/PatriotA1992.pdf
and the official report is also available:
http://www.fas.org/spp/starwars/gao/im92026.htm.
CHAPTER 2
1. Introduction
Computing roots of functions, and especially of polynomials, is
one of the classical problems of Mathematics. It used to be believed
that any polynomial could be explicitly solved like the quadratic
equation, via a formula involving radicals (roots of numbers). Galois
Theory, developed in the beginning of the XIX Century, proved that
this is not the case at all and that, as a matter of fact, most polynomials of degree greater than 4 are not solvable using radicals.
However, the search for a closed formula for solving equations is
just a way of putting off the real problem. In the end, and this is
what matters:
The only computations that can always be performed exactly are addition,
subtraction and multiplication.
|f(c)| < ε, then c is accepted as an approximate root) and a maximum number of iterations N > 0 of the process. With all these data, one proceeds taking advantage of Bolzano's Theorem, as follows:
If f(a) f(b) < 0, then there is a root in (a, b). Take c as the midpoint² of the interval, that is, c = (a + b)/2. There are three possibilities now:
Either |f(c)| < ε and one takes c as an approximate root and finishes the algorithm.
Or f(a) f(c) < 0 and one substitutes c for b and repeats.
Or f(c) f(b) < 0 and one substitutes c for a and repeats.
The iteration is done at most N times (so that one obviously has to keep track of their number). If after N iterations no approximate root has been found, the process ends with an error message.
The formal statement of the process just described is Algorithm ??. Notice that the expression a ← b is used, meaning that a takes (or is given) the value b.
EXAMPLE 7. Take the function f(x) = cos(eˣ), which is continuous in the interval [0, 1]. Obviously, f(0) f(1) < 0 (why is this obvious?), so that Bolzano's Theorem applies. Setting a = 0, b = 1, Algorithm ?? gives the following sequence: each assignment to a or b means that the corresponding endpoint of the interval is substituted for that value.
octave > Bisec(@(x) cos(exp(x)), 0, 1, .001, 10)
c = 0.50000
b = 0.50000
% new interval: [0, 0.50000]
c = 0.25000
a = 0.25000
% new interval: [0.25000, 0.50000]
c = 0.37500
a = 0.37500
% new interval: [0.37500, 0.50000]
c = 0.43750
a = 0.43750
% new interval: [0.43750, 0.50000]
c = 0.46875
b = 0.46875
% new interval: [0.43750, 0.46875]
c = 0.45312
b = 0.45312
% new interval: [0.43750, 0.45312]
c = 0.44531
a = 0.44531
% new interval: [0.44531, 0.45312]
c = 0.44922
a = 0.44922
% new interval: [0.44922, 0.45312]
c = 0.45117
octave> f(c)
ans = 6.4520e-04
2This is what gives the algorithm the alternative name midpoint rule.
3. NEWTON-RAPHSON'S ALGORITHM
point of this line with the OX axis. Obviously, this will most likely not be a root of f but, in sufficiently general conditions, it is expected to approach one. If the process is repeated enough times, it gets nearer a root of f. This is the idea of Newton-Raphson's method.
Recall that the equation of the line passing through (x₀, y₀) with slope b is:
    Y = b(X − x₀) + y₀,
so that the equation of the tangent line to the graph of f at (x₀, f(x₀)) is (assuming f has a derivative at x₀):
    Y = f′(x₀)(X − x₀) + f(x₀).
The meeting point between this line and OX is
    (x₁, y₁) = ( x₀ − f(x₀)/f′(x₀), 0 ),
assuming it exists (i.e. f′(x₀) ≠ 0).
If x₁ is not an approximate root of f with the desired precision, one proceeds in the same way at the point (x₁, f(x₁)). After having performed n steps, the next point x_{n+1} takes the form:
    x_{n+1} = x_n − f(x_n)/f′(x_n).
Algorithm 4 Newton-Raphson.
Input: A differentiable function f(x), a seed x₀ ∈ R, a tolerance ε > 0 and a limit for the number of iterations N > 0
Output: Either an error message or a real number c such that |f(c)| < ε (i.e. an approximate root)
START
i ← 0
while |f(x_i)| ≥ ε and i ≤ N do
  x_{i+1} ← x_i − f(x_i)/f′(x_i)
  i ← i + 1
end while
if i > N then
  return ERROR
end if
c ← x_i
return c
The convergence speed (which will be studied in detail in ??) is clear in the sequence of x_n, as is the very good approximation of f(c) to 0 for the approximate root after just 5 iterations.
EXAMPLE 9. However, the students are suggested to try Newton-Raphson's algorithm for the function cos(eˣ) in Example ??. A strange phenomenon takes place: the algorithm is almost never convergent and no approximation to any root happens. Is there a simple explanation for this fact?
EXAMPLE 10. Newton-Raphson's method can behave quite strangely when f′(c) = 0, if c is a root of f. We shall develop some examples in the exercises and in the practicals.
4. The Secant Algorithm
Recall that the derivative of f at c is the limit
    f′(c) = lim_{h→0} (f(c + h) − f(c)) / h.
In the secant method, instead of the derivative one uses the approximation
    f′(x_n) ≃ (f(x_n) − f(x_{n−1})) / (x_n − x_{n−1}),
so that the formula for computing x_{n+1} becomes, using this approximation,
    x_{n+1} = x_n − f(x_n) (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})),
and the Secant Algorithm is obtained. Notice that, in order to start it, two seeds are required, instead of one. The coefficient of f(x_n) in the iteration is just the reciprocal of approximation (??).
Algorithm 5 The Secant Algorithm.
Input: A function f(x), a tolerance ε > 0, a bound for the number of iterations N > 0 and two seeds x₋₁, x₀ ∈ R
Output: Either a real number c ∈ R with |f(c)| < ε or an error message
START
i ← 0
while |f(x_i)| ≥ ε and i ≤ N do
  x_{i+1} ← x_i − f(x_i) (x_i − x_{i−1}) / (f(x_i) − f(x_{i−1}))
  i ← i + 1
end while
if i > N then
  return ERROR
end if
c ← x_i
return c
It is worthwhile, when implementing this method, to keep in memory not just x_n and x_{n−1} but also the values f(x_n) and f(x_{n−1}), so as not to recompute them.
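This remark translates into code directly; a Python sketch of the secant iteration which keeps the last two values of f in memory (the function name secant is ours):

```python
import math

def secant(f, x_prev, x_cur, eps=1e-10, N=50):
    """Secant method: f(x_{n-1}) and f(x_n) are kept in memory, so
    each iteration costs a single new evaluation of f."""
    f_prev, f_cur = f(x_prev), f(x_cur)
    for _ in range(N):
        if abs(f_cur) < eps:
            return x_cur
        x_next = x_cur - f_cur * (x_cur - x_prev) / (f_cur - f_prev)
        x_prev, f_prev = x_cur, f_cur
        x_cur, f_cur = x_next, f(x_next)
    raise RuntimeError("tolerance not reached")

# the function of Example 7: the root is log(pi/2), about 0.4516
root = secant(lambda x: math.cos(math.exp(x)), 0.4, 0.5)
print(root)
```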
5. FIXED POINTS
|g(x₁) − g(x₂)| ≤ λ |x₁ − x₂|,
which means: the distance between the images of two arbitrary points is less than λ times the original distance between the points and, as λ < 1, what really happens is that the distance between the images is always less than the original distance. That is, the map g is shrinking the interval [a, b] everywhere. It is as if [a, b] were a piece of cotton cloth being washed with warm water: it shrinks everywhere.
The last paragraph contains much emphasis, but it is necessary for understanding the final result: the existence of a fixed point, a point which gets mapped to itself.
From all the explanation above, one infers that the width of g([a, b]) is less than or equal to λ(b − a). If moreover g([a, b]) ⊂ [a, b] (that is, if g mapped [a, b] into itself), then one could also compute g(g([a, b])), which would be at most of width λ · λ(b − a) = λ²(b − a). Now one could repeat the iteration indefinitely and g ∘ g ∘ ⋯ ∘ g ([a, b]) = gⁿ([a, b]) would have width less than λⁿ(b − a). As λ < 1, this width tends to 0 and, using the Nested Intervals Principle, there must be a point α ∈ [a, b] for which g(α) = α. This is, for obvious reasons, a fixed point. Even more, the fact that λ < 1 implies that α is unique. There is one and only one fixed point for g in [a, b]. We have just given a plausible argument for the next result:
THEOREM 1. Let g : [a, b] → [a, b] be a map of [a, b] into itself, which is continuous and differentiable on [a, b]. If there is a positive λ < 1 such that for any x ∈ [a, b], |g′(x)| ≤ λ, then there exists one and only one α ∈ [a, b] for which g(α) = α. Even more, for any x₀ ∈ [a, b], if one defines
    x_n = g(x_{n−1}) for n > 0
then
    lim_{n→∞} x_n = α.
Assuming both checks have been performed, the method for finding a fixed point of g : [a, b] → [a, b] is stated in Algorithm ??:
Algorithm 6 Fixed Point.
Input: A function g (contractive, etc.), a seed x₀ ∈ [a, b], a tolerance ε > 0 and a maximum number of iterations N > 0
Output: either c ∈ [a, b] such that |c − g(c)| < ε or an error message
START
i ← 0, c ← x₀
while |c − g(c)| ≥ ε and i ≤ N do
  c ← g(c)
  i ← i + 1
end while
if i > N then
  return ERROR
end if
return c
EXAMPLE 13. For the fixed point algorithm we shall use the same function as for the bisection method, that is, f(x) = cos(eˣ). In order to find a root of this function, we need to turn the equation
    cos(eˣ) = 0
into a fixed-point problem. This is always done in the same (or a similar) way: the above equation is obviously equivalent to
    cos(eˣ) + x = x,
which is a fixed-point problem. Let us call g(x) = cos(eˣ) + x, whose fixed point we shall try to find.
In order to apply the fixed-point theorem, one needs an interval [a, b] which is mapped into itself. It is easily shown that g(x) decreases near x = 0.5 and, as a matter of fact, that it does so in the interval I = [0.4, 0.5]. Moreover, g(0.4) ≃ 0.478+ while g(0.5) ≃ 0.4221+, which implies that the interval I is mapped into itself by g (this is probably the most difficult part of a fixed-point problem: finding an appropriate interval which gets mapped into itself). The derivative of g is g′(x) = −eˣ sin(eˣ) + 1, whose absolute value is, in that interval, less than 0.8 (it is less than 1 in the whole interval [0, 1], as a matter of fact). This is the second condition to be verified in order to apply the theorem.
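The iteration for this example can be sketched as follows (in Python; the scheme is that of Algorithm ??, with the tolerance measured as |c − g(c)|):

```python
import math

def fixed_point(g, x0, eps=1e-8, N=200):
    """Iterate c <- g(c) until |c - g(c)| < eps or N iterations."""
    c = x0
    for _ in range(N):
        if abs(c - g(c)) < eps:
            return c
        c = g(c)
    raise RuntimeError("tolerance not reached")

# a fixed point of g is a root of cos(e^x), namely log(pi/2)
g = lambda x: math.cos(math.exp(x)) + x
c = fixed_point(g, 0.45)
print(c)   # about 0.45158
```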
The convergence speed does not look too good, which is usual in
fixed-point problems.
REMARK (Skip on first reading). The fixed point algorithm can be used, as explained in ??, for finding roots of general equations using a suitable factor; to this end, if the equation is f(x) = 0, one can use any function g(x) of the form
    g(x) = x − k f(x),
where k ∈ R is an adequate number. This transformation is performed so that the derivative of g is around 0 near the root, so that if c is the (unknown) root, then g defines a contractive map on an interval of the form [c − δ, c + δ] (which will be the [a, b] used in the algorithm).
The hard part is to check that g is a contractive map of [a, b] into itself.
5.4. Convergence Speed of the Fixed Point Method. If the absolute value of the derivative is bounded by λ < 1, the convergence speed of the fixed point algorithm is easily bounded, because [a, b] is mapped onto a sub-interval of width at most λ(b − a) and this contraction is repeated at each step. So, after i iterations, the width of the image set is at most λⁱ(b − a). If λ < 10⁻¹ and (b − a) < 1, for example, then one can guarantee that each iteration gives an extra digit of precision in the root. However, λ tends to be quite large (like 0.9) and the process is usually slow.
EXAMPLE 14. If [a, b] = [0, 1] and |g′(x)| < 0.1, then after each iteration there is one more exact decimal digit in x as an approximation to the fixed point c, regardless of the initial value of the seed x.
Newton-Raphson's method is, in fact, a fixed point iteration for the function
    g(x) = x − f(x)/f′(x),
which, at a root c of f (that is, f(c) = 0), gives g′(c) = 0. That is, the derivative of g is 0 at the fixed point of g. This makes convergence very fast when x is near c (and other conditions hold).
In fact, the following strong result (which is usually stated as "Newton-Raphson has quadratic convergence") can be proved:
THEOREM 2. Assume f is a twice-differentiable function on an interval [r − ε, r + ε], that r is a root of f and that:
The second derivative of f is bounded from above: |f″(x)| < K for x ∈ [r − ε, r + ε],
The first derivative of f is bounded from below: |f′(x)| > L > 0 for x ∈ [r − ε, r + ε].
Then, if x_k ∈ [r − ε, r + ε], the following term in Newton-Raphson's iteration is also in that interval and
    |r − x_{k+1}| < (K / 2L) |r − x_k|².
And, as a corollary:
COROLLARY 1 (Duplication of exact digits in Newton-Raphson's method). Under the conditions of Theorem ??, if ε < 0.1 and K < 2L, then, for n ≥ k, each iteration x_{n+1} of Newton-Raphson approximates r with twice the number of exact digits as x_n.
PROOF. This happens because if k = 0 then x₀ has at least one exact digit. By the Theorem, |x₁ − r| is less than 0.1² = 0.01. And from this point on, the number of zeroes in the decimal expansion of |x_n − r| gets doubled each time.
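The doubling of exact digits is easy to observe numerically; a Python sketch using f(x) = x² − 2, whose root is √2 (this example is ours, not from the notes):

```python
import math

# f(x) = x^2 - 2, f'(x) = 2x, root r = sqrt(2); Newton's iteration
# roughly squares the error at every step (here K/2L is about 0.35)
r = math.sqrt(2)
x = 1.4                      # one exact digit to start with
errors = []
for _ in range(4):
    x = x - (x * x - 2) / (2 * x)
    errors.append(abs(x - r))
print(errors)  # roughly 7e-5, 2e-9, then rounding noise
```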
% The function header was lost in the transcription; it is
% reconstructed here from the call in Example 7.
function [c, N] = Bisec(f, a, b, epsilon, n)
N = 0;
fa = f(a);
fb = f(b);
if(fa == 0)
c = a;
return
end
if(fb == 0)
c = b;
return
end
c = (a+b)/2;
fc = f(c);
while(abs(fc) >= epsilon & N < n)
N = N + 1;
% multiply SIGNS, not values
if(sign(fc)*sign(fa) < 0)
b = c;
fb = fc;
else
a = c;
fa = fc;
end
% An error might happen here
c = (a+b)/2;
fc = f(c);
end
if(N >= n)
warning("Tolerance not reached.")
end
end
% Newton-Raphson implementation
function [z n] = NewtonF(f, fp, x0, epsilon = eps, N = 50)
n = 0;
xn = x0;
% Both f and fp are anonymous functions
fn = f(xn);
while(abs(fn) >= epsilon & n <= N)
n = n + 1;
fn = f(xn); % memorize to prevent recomputing
% next iteration
xn = xn - fn/fp(xn); % an exception might take place here
end
z = xn;
if(n >= N)
warning('Tolerance not reached.');
end
end
CHAPTER 3
Ax = b
the new system be the same as that of the original1. To this end, only
the following operation is permitted:
For example, if

A = [ 3   2   1   4   1
      0   1   4   2   3
      6  −1  −2   5   0
      1   4   3   2   4 ]

and one combines row 3 with row 1 times −2 (in order to make a zero at the 6), then one has to multiply by L31(−2):

[ 1  0  0  0     [ 3   2   1   4   1     [ 3   2   1   4   1
  0  1  0  0       0   1   4   2   3       0   1   4   2   3
 −2  0  1  0   ·   6  −1  −2   5   0   =   0  −5  −4  −3  −2
  0  0  0  1 ]     1   4   3   2   4 ]     1   4   3   2   4 ]
Algorithm ?? is a simplified statement of Gauss's reduction method. The line with the comment [*] is precisely the multiplication of A on the left by L_ji(−m_ji). In the end, Ã, which is upper triangular, is a product of these matrices and A, and the matrix collecting all the multipliers is lower triangular:

(3)
L = [ 1     0     0    …  0
      m21   1     0    …  0
      m31   m32   1    …  0
      ⋮     ⋮     ⋮    ⋱  ⋮
      mn1   mn2   mn3  …  1 ]

¹…find the exact solutions to any equation or system of equations. So this sameness of the solutions relates only to the theoretical algorithm.
Algorithm 7 Gauss's Algorithm for linear systems
Input: A square matrix A and a vector b, of order n
Output: Either an error message or a matrix Ã and a vector b̃ such that Ã is upper triangular and the system Ãx = b̃ has the same solutions as Ax = b
START
Ã ← A, b̃ ← b, i ← 1
while i < n do
  if Ã_ii = 0 then
    return ERROR [division by zero]
  end if
  [combine rows underneath i with row i]
  j ← i + 1
  while j ≤ n do
    m_ji ← Ã_ji / Ã_ii
    [Next line is an operation on a whole row]
    Ã_j ← Ã_j − m_ji Ã_i   [*]
    b̃_j ← b̃_j − m_ji b̃_i
    j ← j + 1
  end while
  i ← i + 1
end while
return Ã, b̃
We have thus proved the following result:
THEOREM 3. If in the Gauss reduction process no element of the diagonal is 0 at any step, then there is a lower triangular matrix L, whose elements are the corresponding coefficients m_ji at their specific places, and an upper triangular matrix U such that
    A = LU
and such that the system Ux = b̃ = L⁻¹b is equivalent to the initial one Ax = b.
With this result, a factorization of A is obtained which simplifies the resolution of the original system, as Ax = b can be rewritten LUx = b and one can proceed step by step:
First, the system Ly = b is solved by direct substitution, that is, from top to bottom, without even dividing.
Then the system Ux = y is solved by regressive substitution, from bottom to top; divisions will be needed at this point.
This solution method only requires storing the matrices L and U in memory and is very fast.
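The two triangular sweeps can be sketched as follows (in Python, with plain lists; the small 2 × 2 factorization in the usage example is ours):

```python
def lu_solve(L, U, b):
    """Solve LUx = b in two triangular sweeps: Ly = b from top to
    bottom (L has 1 on the diagonal, so no divisions are needed),
    then Ux = y from bottom to top (dividing by the pivots U[i][i])."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = [[2, 1], [4, 5]] factors as L = [[1,0],[2,1]], U = [[2,1],[0,3]]
x = lu_solve([[1, 0], [2, 1]], [[2, 1], [0, 3]], [3, 9])
print(x)   # [1.0, 1.0], the solution of Ax = (3, 9)
```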
1.1. Pivoting Strategies and the LUP Factorization. If during Gaussian reduction a pivot (the element by which one divides to obtain the multipliers) of value approximately 0 appears, either the process cannot be continued or one should expect large errors to take place, due to rounding and truncation². This problem can be tackled by means of pivoting strategies, either swapping rows or both rows and columns. If only rows are swapped, the operation is called partial pivoting. If swapping both rows and columns is allowed, total pivoting takes place.
DEFINITION 9. A permutation matrix is a square matrix whose entries are all 0, except for exactly one 1 in each row and each column.
Permutation matrices can be built from the identity matrix by
permuting rows (or columns).
A permutation matrix P has, on each row, a single 1 and the rest
entries are 0. If on row i the only 1 is on column j, the multiplication
PA means swap rows j and i of A.
Obviously, the determinant of a permutation matrix is not zero (it is either 1 or −1). It is not so easy to show that the inverse of a permutation matrix is its transpose: P⁻¹ = Pᵗ.
²Recall Example ?? of Chapter ??, where a tiny truncation error in a denominator...
EXAMPLE 15. If

A = [ 1   2   3   4
      1   2  −5  −6
      1   2   3  −7
      0  12   7   8 ]

then

L = [ 1  0  0  0
      0  1  0  0
      1  0  1  0
      1  0  0  1 ]

U = [ 1   2   3    4
      0  12   7    8
      0   0  −8  −10
      0   0   0  −11 ]

P = [ 1  0  0  0
      0  0  0  1
      0  1  0  0
      0  0  1  0 ]

and LU = PA.
swap is needed on A at step i, then the rows i and j of L are swapped too, but only the columns 1 to i − 1 are involved). Also, the P computed up to step i has to be multiplied on the left by P_ij (the permutation matrix for i and j).
This can be stated as Algorithm ??.
Algorithm 8 LUP Factorization for a matrix A
Input: A matrix A of order n
Output: Either an error or three matrices: L lower triangular with 1 on the diagonal, U upper triangular and a permutation matrix P such that LU = PA
Comment: P_ip is the permutation matrix permuting rows i and p
START
L ← Id_n, U ← A, P ← Id_n, i ← 1
while i < n do
  p ← row index such that |U_pi| is maximum, with p ≥ i
  if U_pi = 0 then
    return ERROR [division by zero]
  end if
  [swap rows i and p]
  P ← P_ip P
  U ← P_ip U
  [on L swap only rows i and p of the n × (i − 1) submatrix at the left, see the text]
  L ← L̃
  [combine rows on U and keep track on L]
  j ← i + 1
  while j ≤ n do
    m_ji ← U_ji / U_ii
    U_j ← U_j − m_ji U_i
    L_ji ← m_ji
    j ← j + 1
  end while
  i ← i + 1
end while
return L, U, P
of starting with a vector b (which might be called the initial condition), one starts with a slight modification b̃: how much does the solution to the new system differ from the original one? The proper way to ask this question uses the relative error, not the absolute one.
Instead of system (??), consider the modified one
    Ay = b + δb,
where δb is a small vector. The new solution will have the form x + δx, for some δx (which one expects to be small as well).
EXAMPLE 16. Consider the system of linear equations
    ( 2  3 ) ( x1 )   ( 2 )
    ( 4  1 ) ( x2 ) = ( 1 )
whose solution is (x1, x2) = (0.1, 0.6). If we take b̃ = (2.1, 1.05), then δb = (0.1, 0.05) and the solution of the new system
    ( 2  3 ) ( x1 )   ( 2.1  )
    ( 4  1 ) ( x2 ) = ( 1.05 )
is (x1, x2) = (0.105, 0.63), so that δx = (0.005, 0.03). In this case, a small increment δb gives rise to a small increment δx, but this need not be so.
However, see Example ?? for a system with a very different behaviour.
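The numbers in Example 16 are easy to verify; a Python sketch using Cramer's rule for the 2 × 2 system:

```python
def solve2(a, b, c, d, u, v):
    """Solve the 2x2 system (a b; c d)(x1; x2) = (u; v) by
    Cramer's rule (enough for this small example)."""
    det = a * d - b * c
    return ((u * d - b * v) / det, (a * v - u * c) / det)

x1, x2 = solve2(2, 3, 4, 1, 2.0, 1.0)
print((x1, x2))            # approximately (0.1, 0.6)
y1, y2 = solve2(2, 3, 4, 1, 2.1, 1.05)
print((y1, y2))            # approximately (0.105, 0.63)
print((y1 - x1, y2 - x2))  # the displacement, about (0.005, 0.03)
```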
The size of vectors is measured using norms (the most common one being the length in Euclidean space, but we shall not make use of this one). As x is a solution, one has
    A(x + δx) = b + δb,
so that
    A δx = δb,
but, as we want to compare δx with δb, we can write
    δx = A⁻¹ δb,
so that, taking sizes (i.e. norms, which are denoted with ‖·‖), we get
    ‖δx‖ = ‖A⁻¹ δb‖.
Recall that we are trying to assess the relative displacement, not the absolute one. To this end we need to include ‖x‖ in the left hand
side. Dividing by ‖Ax‖ = ‖b‖,
    ‖δx‖ / ‖Ax‖ = ‖A⁻¹ δb‖ / ‖b‖.
Since ‖Ax‖ ≤ ‖A‖ ‖x‖ for the matrix norm associated to the vector norm,
    ‖δx‖ / (‖A‖ ‖x‖) ≤ ‖δx‖ / ‖Ax‖ = ‖A⁻¹ δb‖ / ‖b‖
and, applying the same reasoning to the right hand side of (??), one gets
    ‖A⁻¹ δb‖ / ‖b‖ ≤ ‖A⁻¹‖ ‖δb‖ / ‖b‖,
and combining everything,
    ‖δx‖ / (‖A‖ ‖x‖) ≤ ‖A⁻¹‖ ‖δb‖ / ‖b‖,
so that, finally, one gets an upper bound for the relative displacement of x:
    ‖δx‖ / ‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖δb‖ / ‖b‖.
The matrix norm which was mentioned above does exist. In fact, there are many of them. We shall only make use of the following one in these notes:
DEFINITION 11. The infinity norm of a square matrix A = (a_ij) is the number
    ‖A‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |a_ij|,
that is, the maximum of the sums of the absolute values of each row.
The following result relates the infinity norms of matrices and vectors:
LEMMA 4. The infinity norm is such that, for any vector x, ‖Ax‖∞ ≤ ‖A‖∞ ‖x‖∞, where ‖x‖∞ is the norm given by the maximum of the absolute values of the coordinates of x.
This means that, if one measures the size of a vector by its largest coordinate (in absolute value), and one calls it ‖x‖∞, then
    ‖δx‖∞ / ‖x‖∞ ≤ ‖A‖∞ ‖A⁻¹‖∞ ‖δb‖∞ / ‖b‖∞.
The product ‖A‖∞ ‖A⁻¹‖∞ is called the condition number of A for the infinity norm, is denoted κ∞(A), and is a bound for the maximum possible displacement of the solution when the initial vector gets displaced. The greater the condition number, the greater (it is to be expected) the displacement of the solution when the initial condition (independent term) changes a little.
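For a small matrix the condition number can be computed by hand or with a few lines of code; a Python sketch using the matrix of Example 16 (the computation, which is ours, gives a small κ∞, consistent with the mild behaviour observed there):

```python
def inf_norm(M):
    """Infinity norm: maximum sum of absolute values per row."""
    return max(sum(abs(e) for e in row) for row in M)

def inverse2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 3], [4, 1]]   # the matrix of Example 16
kappa = inf_norm(A) * inf_norm(inverse2(A))
print(inf_norm(A))     # 5
print(kappa)           # about 3: a well-conditioned matrix
```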
The condition number also bounds from below the relative displacement:
LEMMA 5. Let A be a nonsingular matrix of order n and x a solution of Ax = b. Let δb be a displacement of the initial conditions and δx the corresponding displacement in the solution. Then:
    (1/κ∞(A)) ‖δb‖/‖b‖ ≤ ‖δx‖/‖x‖ ≤ κ∞(A) ‖δb‖/‖b‖.
So the relative displacement (or error) can be bounded using the relative residue (the number ‖δb‖/‖b‖).
The following example shows how large condition numbers are usually an indicator that solutions may be strongly dependent on the initial values.
EXAMPLE 17. Consider the system
(5)
whose condition number for the infinity norm is 376.59, so that a relative change of a thousandth of a unit in the initial conditions (vector b) is expected to give rise to a relative change of more than 37% in the solution. The exact solution of (??) is x = 0.055+, y = 0.182+.
However, the size of the condition number is a tell-tale sign that a small perturbation of the system will modify the solutions greatly. If, instead of b = (0.169, 0.067), one uses b̃ = (0.167, 0.067) (which is a relative displacement of just 1.1% in the first coordinate), the new solution is x = 0.0557+, y = 0.321+, for which x does not even have the same sign as in the original problem and y is displaced 76% from its original value. This is clearly unacceptable. If the equations describe a static system, for example, and the coefficients have been measured
(N − P)x = b ⇔ Nx = b + Px ⇔ x = N⁻¹b + N⁻¹Px.
If one calls c = N⁻¹b and M = N⁻¹P, then one obtains the following fixed point problem:
    x = Mx + c,
which can be solved (if at all) in the very same way as in Chapter ??: start with a seed x₀ and iterate
    x_n = M x_{n−1} + c
until a sufficient precision is reached.
In what follows, the infinity norm ‖·‖∞ is assumed whenever the concept of convergence appears³.
One needs the following results:
THEOREM 4. Assume M is a matrix of order n and that ‖M‖ < 1. Then the equation x = Mx + c has a unique solution for any c and the iteration x_n = M x_{n−1} + c converges to it for any initial value x₀.
3However, all the results below apply to any matrix norm.
Moreover, if s denotes the solution, the following a priori bound holds:
    ‖x_n − s‖ ≤ (‖M‖ⁿ / (1 − ‖M‖)) ‖x₁ − x₀‖.
Recall that for vectors, the infinity norm ‖x‖∞ is the maximum of the absolute values of the coordinates of x.
A matrix A = (a_ij) is strictly diagonally dominant by rows if
    |a_ii| > Σ_{j≠i} |a_ij|
for any i from 1 to n. That is, if the elements on the diagonal are greater in absolute value than the sum of the rest of the elements of their rows in absolute value.
For these matrices, the convergence of both Jacobi's and Gauss-Seidel's methods is guaranteed:
LEMMA 6. If A is a strictly diagonally dominant matrix by rows, then both Jacobi's and Gauss-Seidel's methods converge for any system of the form Ax = b and any seed.
For Gauss-Seidel, the following also holds:
LEMMA 7. If A is a positive definite symmetric matrix, then Gauss-Seidel's method converges for any system of the form Ax = b.
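A minimal sketch of Jacobi's method (in Python; the choice N = diagonal of A, P = N − A is the one that defines Jacobi's iteration, and the test matrix is ours, strictly diagonally dominant by rows):

```python
def jacobi(A, b, x0=None, eps=1e-10, N=500):
    """Jacobi's iteration x <- N^{-1}(b + Px) with N the diagonal
    of A: each equation is solved for its diagonal unknown, using
    the previous iterate for all the other unknowns."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(N):
        new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
               / A[i][i] for i in range(n)]
        if max(abs(new[i] - x[i]) for i in range(n)) < eps:
            return new
        x = new
    raise RuntimeError("tolerance not reached")

# strictly diagonally dominant by rows, so convergence is guaranteed
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [0.0, 1.0, 3.0]]
b = [6.0, 8.0, 4.0]
x = jacobi(A, b)
print(x)   # approximately (1, 1, 1), the exact solution
```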
4. Annex: Matlab/Octave Code
Code for some of the algorithms explained is provided; it should
work both in Matlab and Octave.
4.1. Gauss Algorithm without Pivoting. Listing ?? implements Gauss's reduction algorithm for a system Ax = b, returning L, U and the new b vector, assuming the multipliers are never 0; if one is 0, then the process terminates with a message.
Input:
A: a square matrix (if it is not square, the output gives the triangulation under the principal diagonal),
b: a vector with as many rows as A.
The output is a trio [L, At, bt], as follows:
L: the lower triangular matrix (with the multipliers),
At: the transformed matrix (which is U in the LU-factorization)
and which is upper triangular.
% The function header and the dimension check were lost in the
% transcription; they are reconstructed here (the name GaussPivot
% is ours).
function [L, At, bt, P] = GaussPivot(A, b)
n = size(A, 1);
if(size(b, 1) ~= n)
warning('Dimension mismatch');
return;
end
At=A;
bt=b;
L=eye(n);
P=eye(n);
i=1;
while (i<n)
j=i+1;
% beware nomenclature:
% L(j,i) is ROW j, COLUMN i
% the pivot with greatest absolute value is sought
p = abs(At(i,i));
pos = i;
for c=j:n
u = abs(At(c,i));
if(u>p)
pos = c;
p = u;
end
end
if(p == 0)
warning('Singular system');
return;
end
% Swap rows i and pos if i != pos
% in At and swap the left part of L
% This is quite easy in Matlab, there is no need
% for temporary storage
P([i pos],:) = P([pos i], :);
if(i ~= pos)
At([i pos], :) = At([pos i], :);
L([i pos], 1:i-1) = L([pos i], 1:i-1);
bt([i pos], :) = bt([pos i], :);
end
while(j<=n)
L(j,i)=At(j,i)/At(i,i);
% Combining these rows is easy
% They are 0 up to column i
% And combining rows is easy as above
At(j,i:n) = [0 At(j,i+1:n) - L(j,i)*At(i,i+1:n)];
bt(j)=bt(j)-bt(i)*L(j,i);
j=j+1;
end
i=i+1;
end
end
CHAPTER 4
Interpolation
Given a set of data (generally a cloud of points on a plane), the human temptation is to use it as a source of knowledge and forecasting. Specifically, given a list of coordinates (x_i, y_i) associated to some kind of event (say, an experiment or a collection of measurements), the natural thing to do is to use it for deducing or predicting the value y would take if x took some other value not in the list. This is the interpolating and extrapolating tendency of humans. There is no helping it. The most that can be done is to study the most reasonable ways to perform those forecasts.
We shall use, along this whole chapter, a list of n + 1 pairs (x_i, y_i):

(6)
    x | x₀  x₁  …  x_{n−1}  x_n
    y | y₀  y₁  …  y_{n−1}  y_n

which is assumed ordered on the x coordinates, which are all different as well: x_i < x_{i+1}. The aim is to find functions which somehow have a relation (a kind of proximity) to that cloud of points.
1. Linear (piecewise) interpolation
The first, simplest and most useful idea is to use a piecewise defined function from x₀ to x_n consisting in joining each point to the next one by a straight line. This is called piecewise linear interpolation or linear spline (we shall define spline in general later on).
DEFINITION 13. The piecewise linear interpolating function for list (??) is the function f : [x₀, x_n] → R defined as follows:
    f(x) = ((y_i − y_{i−1}) / (x_i − x_{i−1})) (x − x_{i−1}) + y_{i−1}   if x ∈ [x_{i−1}, x_i],
that is, the piecewise defined function whose graph is the union of the linear segments joining (x_{i−1}, y_{i−1}) to (x_i, y_i), for i = 1, …, n.
Piecewise linear interpolation has a set of properties which make it quite interesting:
It is very easy to compute.
It passes through all the data points.
It is continuous.

[Figure: piecewise linear interpolation of a data cloud]
That is why it is frequently used for plotting functions (it is what Matlab does by default): if the data cloud is dense, the segments are short and the corners are not noticeable on a plot.
The main drawback of this technique is, precisely, those corners, which appear wherever the cloud of points does not lie on a straight line. Notice also that this method (and the splines which we shall explain next) is an interpolation method only, not suitable for extrapolation: it is used to approximate values between the endpoints x_0, x_n, never outside that interval.
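Definition 13 translates almost literally into code. The course material uses Matlab/Octave; the following is an illustrative Python sketch with hypothetical data:

```python
def linear_interp(xs, ys, x):
    # Piecewise linear interpolation (Definition 13): locate the interval
    # [x_{i-1}, x_i] containing x and evaluate the segment through
    # (x_{i-1}, y_{i-1}) and (x_i, y_i).
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("interpolation only, not extrapolation")
    for i in range(1, len(xs)):
        if x <= xs[i]:
            slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + slope * (x - xs[i - 1])

xs = [0.0, 1.0, 2.0, 4.0]
ys = [1.0, 3.0, 2.0, 2.0]
print(linear_interp(xs, ys, 0.5))  # halfway up the first segment: 2.0
```

Note that the function refuses points outside [x_0, x_n], in agreement with the remark above about extrapolation.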
2. Can parabolas be used for this?
Linear interpolation is bound to produce corners whenever the data cloud does not lie on a straight line. In general, interpolation requirements include not only that the graph pass through the points of the cloud, but also that it be reasonably smooth (not just for aesthetic reasons, but also because reality usually behaves this way). One could try to improve linear interpolation with higher degree functions, imposing that the tangents of consecutive pieces match at the intersection points. For example, one could try parabolic segments (polynomials of degree two). As they have three degrees of freedom, one could make each of them not only pass through (x_{i−1}, y_{i−1}) and (x_i, y_i), but also match the derivative of the adjacent segment. This may seem reasonable but has an inherent undesirable property: it is intrinsically asymmetric (we shall not explain why).
[Figure: f(x) and a quadratic spline.]

3. Cubic Splines: Continuous Curvature

[Figure: f(x) and a cubic spline.]
We shall denote by P_i the polynomial corresponding to the interval [x_{i−1}, x_i], and we shall always write it relative to x_{i−1}, as follows:

(7)    P_i(x) = y_{i−1} + b_i (x − x_{i−1}) + c_i (x − x_{i−1})² + d_i (x − x_{i−1})³.

Writing h_i = x_i − x_{i−1}, the condition P_i(x_i) = y_i reads

(8)    b_i h_i + c_i h_i² + d_i h_i³ = y_i − y_{i−1}

for i = 1, . . . , n, which gives n linear equations.
The condition on the continuity of the first derivative is P_i′(x_i) = P_{i+1}′(x_i), so that

(9)    b_i + 2 c_i h_i + 3 d_i h_i² = b_{i+1},

and the continuity of the second derivative, P_i″(x_i) = P_{i+1}″(x_i), gives

(10)    c_i + 3 d_i h_i = c_{i+1}.

From (10) one gets

(11)    d_i = (c_{i+1} − c_i) / (3 h_i),

and substituting this into (8),

(12)    b_i = (y_i − y_{i−1}) / h_i − h_i (c_{i+1} + 2 c_i) / 3.

Substituting (11) into (9) gives, after simplification,

(13)    b_i = b_{i−1} + h_{i−1} (c_i + c_{i−1})
for i = 2, . . . , n. Now one only needs to write Equation (12) for i and i − 1 and introduce both into (13), so that only the c's remain. After some elementary calculations, one gets

(14)    h_{i−1} c_{i−1} + (2 h_{i−1} + 2 h_i) c_i + h_i c_{i+1} = 3 ((y_i − y_{i−1}) / h_i − (y_{i−1} − y_{i−2}) / h_{i−1})

for i = 2, . . . , n.
This is a system of the form A c = δ, where A is the (n − 1) × (n + 1) matrix

(15)
    | h_1   2(h_1 + h_2)   h_2              0       . . .        0                     0   |
    |  0        h_2        2(h_2 + h_3)    h_3      . . .        0                     0   |
    |  .         .             .            .         .          .                     .   |
    |  0         0             0            0       . . .   h_{n−1}   2(h_{n−1} + h_n)   h_n |

c is the vector (c_1, . . . , c_{n+1}), and δ = (δ_2, . . . , δ_n) with

    δ_i = 3 ((y_i − y_{i−1}) / h_i − (y_{i−1} − y_{i−2}) / h_{i−1}).
It is easy to see that this is a consistent system (with infinitely many solutions, because two equations are missing). The equations which are usually added were explained above. If one sets, for example, c_1 = 0 and c_{n+1} = 0, then A is completed above with a row (1 0 . . . 0) and below with (0 0 . . . 0 1), whereas δ gets a top and a bottom 0. From this (now square) system one computes all the c_i and, using (12) and (11), one gets all the b's and d's.
The above system (15), which has nonzero elements only on the diagonal and the two adjacent diagonals, is called tridiagonal. Such systems are easily solved using LU factorization, and one can even compute the solution directly, solving the c's in terms of the δ's and the h's. One might as well use iterative methods but, for these very simple systems, LU factorization is fast enough.
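The direct solution just mentioned — forward elimination followed by back substitution, essentially the LU factorization specialized to tridiagonal matrices (often called the Thomas algorithm) — can be sketched as follows. The function name and argument layout are illustrative, not the text's:

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    # sub: subdiagonal (length n-1), diag: diagonal (length n),
    # sup: superdiagonal (length n-1), rhs: right-hand side (length n).
    n = len(diag)
    cp = [0.0] * n   # modified superdiagonal after elimination
    dp = [0.0] * n   # modified right-hand side
    cp[0] = sup[0] / diag[0] if n > 1 else 0.0
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        m = diag[i] - sub[i - 1] * cp[i - 1]
        cp[i] = sup[i] / m if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i - 1] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

print(solve_tridiagonal([1.0, 1.0], [2.0, 3.0, 2.0], [1.0, 1.0], [3.0, 5.0, 3.0]))
# this small system has solution (1, 1, 1)
```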
3.2. The Algorithm. We can now state the algorithm for computing the interpolating cubic spline for a data list x, y of length
n + 1, x = ( x0 , . . . , xn ), y = (y0 , . . . , yn ), in which xi < xi+1 (so
that all the values in x are different). This is Algorithm ??.
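The whole construction — system (14) with the natural conditions (the c's at the two ends set to 0), then (12) and (11) for the b's and d's — can be sketched compactly in Python with 0-based indices. Names are illustrative:

```python
def natural_cubic_spline(xs, ys):
    # Natural cubic spline: c[0] = c[n] = 0, interior c's from the
    # tridiagonal system (14), then b and d from (12) and (11).
    n = len(xs) - 1
    h = [xs[j + 1] - xs[j] for j in range(n)]
    c = [0.0] * (n + 1)
    if n >= 2:
        diag = [2.0 * (h[j - 1] + h[j]) for j in range(1, n)]
        rhs = [3.0 * ((ys[j + 1] - ys[j]) / h[j] - (ys[j] - ys[j - 1]) / h[j - 1])
               for j in range(1, n)]
        for k in range(1, n - 1):              # forward elimination
            m = h[k] / diag[k - 1]
            diag[k] -= m * h[k]
            rhs[k] -= m * rhs[k - 1]
        for k in range(n - 2, -1, -1):         # back substitution
            upper = h[k + 1] * c[k + 2] if k < n - 2 else 0.0
            c[k + 1] = (rhs[k] - upper) / diag[k]
    b = [(ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
         for j in range(n)]
    d = [(c[j + 1] - c[j]) / (3.0 * h[j]) for j in range(n)]

    def s(x):
        j = n - 1                              # locate the interval of x
        for k in range(n):
            if x <= xs[k + 1]:
                j = k
                break
        t = x - xs[j]
        return ys[j] + b[j] * t + c[j] * t * t + d[j] * t ** 3
    return s

s = natural_cubic_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(s(0.5))  # 0.6875 for this symmetric three-point example
```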
3.3. Bounding the Error. The fact that the cubic spline is graphically satisfying does not mean that it is technically useful. As a matter of fact, it is much more useful than it might seem. If a function has a well behaved fourth derivative, then the cubic spline is a very good approximation to it (and the smaller the intervals, the better the approximation). Specifically, for clamped cubic splines, we have:
THEOREM 6. Let f : [a, b] → R be a 4 times differentiable function with |f⁽⁴⁾(x)| ≤ M for x ∈ [a, b]. Let h be the maximum of x_i − x_{i−1} for i = 1, . . . , n. If s(x) is the clamped cubic spline for (x_i, f(x_i)), then

    |s(x) − f(x)| ≤ (5M/384) h⁴.
This result can be most useful for computing integrals and bounding the error or for bounding the error when interpolating values
of solutions of differential equations. Notice that the clamped cubic spline for a function f is such that s0 ( x0 ) = f 0 ( x0 ) and s0 ( xn ) =
f 0 ( xn ), that is, the first derivative at the endpoints is given by the
first derivative of the interpolated function.
Notice that this implies, for example, that if h = 0.1 and M < 60 (which is rather common: a fourth derivative greater than 60 is huge), then the distance at any point between the original function f and the interpolating cubic spline is less than 10⁻⁴.
3.4. General Definition of Spline. We promised to give the general definition of spline:

DEFINITION 15. Given a data list as (6), an interpolating spline of degree m for it (for m > 0) is a function f : [x_0, x_n] → R which passes through all the points (x_i, y_i), coincides on each interval [x_{i−1}, x_i] with a polynomial of degree at most m, and has continuous derivatives up to order m − 1 on [x_0, x_n].
4. The Lagrange Interpolating Polynomial
Consider the polynomials

    p_i(x) = ∏_{j≠i} (x − x_j) / ∏_{j≠i} (x_i − x_j).

These polynomials p_0(x), . . . , p_n(x) are called the Lagrange basis polynomials (there are n + 1 of them, one for each i = 0, . . . , n). The collection {p_0(x), . . . , p_n(x)} is a basis of the vector space of polynomials of degree at most n, which can be identified with R^{n+1}. From this point of view, the interpolating polynomial P(x) is the one with coordinates (y_0, y_1, . . . , y_n) in this basis.
[Figure: a cubic spline and the Lagrange interpolating polynomial for the same data.]
(17)    P(x) = y_0 p_0(x) + y_1 p_1(x) + · · · + y_n p_n(x) = Σ_{i=0}^{n} y_i p_i(x) = Σ_{i=0}^{n} y_i ∏_{j≠i} (x − x_j) / (x_i − x_j).
One verifies easily that this P(x) passes through all the points (x_i, y_i) for i = 0, . . . , n. The fact that there is only one polynomial of degree at most n satisfying that condition can be proved as follows: if there existed another polynomial Q(x) of degree less than or equal to n passing through all those points, the difference P(x) − Q(x) would be a polynomial of degree at most n with n + 1 zeros and hence would be identically 0, which implies that Q(x) equals P(x).
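Formula (17) can be evaluated directly, without ever expanding the polynomial. A small Python sketch (names illustrative):

```python
def lagrange(xs, ys, x):
    # Evaluate P(x) = sum_i y_i * prod_{j != i} (x - x_j) / (x_i - x_j),
    # i.e. formula (17), directly at the point x.
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

# the parabola through (0, 1), (1, 3), (2, 9) is 2x^2 + 1
print(lagrange([0.0, 1.0, 2.0], [1.0, 3.0, 9.0], 3.0))  # 19.0
```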
Compare the cubic spline interpolation with the Lagrange interpolating polynomial for the same function as before in figure ??. The main drawbacks of Lagrange's interpolating polynomial are:
FIGURE 5. Runge's phenomenon: the Lagrange interpolating polynomial of f(x) = 1/(1 + 12x²) takes values very far away from the original function if the nodes are evenly distributed. There is a thin blue line corresponding to the cubic spline, which is indistinguishable from the original function.
The first problem is intrinsic to the way we have computed it. The second one depends on the distribution of the x_i in the segment [x_0, x_n]. A remarkable example is given by Runge's phenomenon: whenever a function with large derivatives is approximated by a Lagrange interpolating polynomial with the x coordinates evenly distributed, the polynomial takes values which differ greatly from those of the original function (even though it passes through the interpolation points). This phenomenon is much less frequent in splines.
One may try to avoid this issue (which is essentially one of curvature) using techniques which try to minimize the maximum absolute value taken by the polynomial

    R(x) = (x − x_0)(x − x_1) · · · (x − x_n),

that is, looking for a distribution of the x_i which solves a minimax problem. We shall not get into details. Essentially, one has:
[Figure: f(x) = 1/(1 + 12x²) and a Lagrange interpolating polynomial.]
    x_i = ((b − a) x̃_i + (a + b)) / 2,

where x̃_0, . . . , x̃_n are the corresponding nodes for [−1, 1]. The points x_i in the lemma are called Chebyshev's nodes for the interval [a, b]. They are the ones to be used if one wishes to interpolate a function using Lagrange's polynomial and get a good approximation.
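The standard Chebyshev nodes on [−1, 1] are x̃_i = cos((2i + 1)π/(2n + 2)), i = 0, . . . , n; combined with the affine change above, one gets, as a sketch (function name illustrative):

```python
import math

def chebyshev_nodes(a, b, n):
    # n + 1 Chebyshev nodes for [a, b]: the nodes cos((2i+1)*pi/(2n+2))
    # on [-1, 1], mapped by x_i = ((b - a)*t + (a + b)) / 2.
    nodes = []
    for i in range(n + 1):
        t = math.cos((2 * i + 1) * math.pi / (2 * n + 2))
        nodes.append(((b - a) * t + (a + b)) / 2)
    return nodes

print(chebyshev_nodes(-1.0, 1.0, 4))  # five nodes, clustered near the ends
```

Note how the nodes accumulate near the endpoints of the interval; this is precisely what tames Runge's phenomenon.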
5. Approximate Interpolation

In experimental and statistical problems, data are always considered inexact, so trying to find a curve fitting them all exactly makes no sense at all. This gives rise to a different interpolation concept: one is no longer interested in making a curve pass through each point of a cloud, but in studying which curve in a given family best resembles the cloud. In our case, when one has a data table like (6), this resemblance is measured using the quadratic distance from f(x_i) to y_i, where f is the interpolating function; this is not the only way to measure nearness, but it is the one we shall use.
[Figure: a data cloud and an approximating function.]

The total quadratic error of f on the cloud is

    Σ_{i=1}^{N} (f(x_i) − y_i)².
The least squares linear interpolation problem consists in, given the cloud of points (x_i, y_i) and a family of functions, finding the function in that family which minimizes the total quadratic error. This family is assumed to be a finite-dimensional vector space (whence the term linear interpolation). We shall also give an example of a nonlinear interpolation problem, just to show the difficulties inherent in the lack of a vector space structure.
5.2. Linear Least Squares. Assume one has to approximate a
cloud of points of size N with a function f belonging to a vector
space V of dimension n and that a basis of V is known: f 1 , . . . , f n .
We always assume N > n (it is actually much greater, in general).
One seeks the f ∈ V for which

    Σ_{i=1}^{N} (f(x_i) − y_i)²

is minimal. As V is a vector space, this f must be a linear combination of the elements of the basis:

    f = a_1 f_1 + a_2 f_2 + · · · + a_n f_n,

and the problem consists in finding the coefficients a_1, . . . , a_n minimizing the value of the n-variable function

    F(a_1, . . . , a_n) = Σ_{i=1}^{N} (a_1 f_1(x_i) + · · · + a_n f_n(x_i) − y_i)².
Writing

    ỹ_j = Σ_{i=1}^{N} f_j(x_i) y_i,
and stating the equations in matrix form, one gets the system

(18)    X Xᵀ a = X y = ỹ,

written out,

    | f_1(x_1)  f_1(x_2)  . . .  f_1(x_N) |   | f_1(x_1)  f_2(x_1)  . . .  f_n(x_1) |   | a_1 |   | ỹ_1 |
    | f_2(x_1)  f_2(x_2)  . . .  f_2(x_N) |   | f_1(x_2)  f_2(x_2)  . . .  f_n(x_2) |   | a_2 |   | ỹ_2 |
    |    .         .        .       .     | · |    .         .        .       .     | · |  .  | = |  .  |
    | f_n(x_1)  f_n(x_2)  . . .  f_n(x_N) |   | f_1(x_N)  f_2(x_N)  . . .  f_n(x_N) |   | a_n |   | ỹ_n |

where

    X = | f_1(x_1)  f_1(x_2)  . . .  f_1(x_N) |         | y_1 |
        | f_2(x_1)  f_2(x_2)  . . .  f_2(x_N) |         | y_2 |
        |    .         .        .       .     | ,   y = |  .  | .
        | f_n(x_1)  f_n(x_2)  . . .  f_n(x_N) |         | y_N |
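System (18) can be assembled and solved in a few lines. The following Python sketch (illustrative names, naive Gaussian elimination) fits a basis f_1, . . . , f_n to a cloud:

```python
def least_squares_fit(F, xs, ys):
    # Assemble and solve X X^T a = X y (system (18)); F is a list of
    # basis functions, (xs, ys) the data cloud.
    n, N = len(F), len(xs)
    X = [[f(x) for x in xs] for f in F]                              # n x N
    A = [[sum(X[r][k] * X[c][k] for k in range(N)) for c in range(n)]
         for r in range(n)]                                          # X X^T
    t = [sum(X[r][k] * ys[k] for k in range(N)) for r in range(n)]   # X y
    for col in range(n):                       # Gaussian elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        t[col], t[piv] = t[piv], t[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= m * A[col][c]
            t[r] -= m * t[col]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):             # back substitution
        a[r] = (t[r] - sum(A[r][c] * a[c] for c in range(r + 1, n))) / A[r][r]
    return a

# fit a straight line a1 + a2*x to points lying exactly on y = 2x + 1
coeffs = least_squares_fit([lambda x: 1.0, lambda x: x],
                           [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(coeffs)  # close to [1.0, 2.0]
```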
Consider now the family of functions

    f(x) = a e^{−b x²}.

These are Gaussian bells for a, b > 0. They obviously do not form a vector space, so the techniques of the last section do not apply.
One could try to transform the problem into a linear one, perform the least squares approximation on the transformed problem, and revert the result. This may be catastrophic. Taking logarithms on both sides of (??), one gets:

    log(f(x)) = log(a) − b x² = a′ − b x²,
[Figure: a data cloud, the function 3e^{−2x²}, and the curve obtained by the linearized least squares interpolation.]
6. Code for some of the Algorithms

% interpol.m
% Linear interpolation.
% Given a cloud (x,y) and a Cell Array of functions F,
% return the coefficients of the least squares linear
% interpolation of (x,y) with the base F.
#
% Input:
%   x: vector of scalars
%   y: vector of scalars
%   F: Cell array of anonymous functions
#
% Output:
%   c: coefficients such that
%      c(1)*F{1,1} + c(2)*F{1,2} + ... + c(n)*F{1,n}
%      is the LSI function in the linear space <F>.
CHAPTER 5
For example, with f(x) = 1/x and x = 2,

    f′(x) ≃ (f(x + .01) − f(x)) / .01 = (1/2.01 − 1/2) / .01 = −0.248756 . . . ,
[Figure: the values f(x_0 − h), f(x_0) and f(x_0 + h) on the graph of f, used in the finite difference formulas.]
while using the backward difference,

    (f(x) − f(x − .01)) / .01 = (1/2 − 1/1.99) / .01 = −0.251256 . . . .
By Taylor's theorem, f(x + h) = f(x) + f′(x) h + (f″(ξ)/2) h², so that

    (f(x + h) − f(x)) / h − f′(x) = (f″(ξ)/2) h,

which is exactly the meaning of having order 1. Notice that the rightmost term cannot be eliminated.
However, for the symmetric formula one can use the Taylor polynomial of degree 2, twice:

    f(x + h) = f(x) + f′(x) h + (f″(x)/2) h² + (f⁽³⁾(ξ₁)/6) h³,
    f(x − h) = f(x) − f′(x) h + (f″(x)/2) h² − (f⁽³⁾(ξ₂)/6) h³,

so that, subtracting and dividing by 2h,

    f′(x) = (f(x + h) − f(x − h)) / (2h) − ((f⁽³⁾(ξ₁) + f⁽³⁾(ξ₂))/12) h².
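The numbers at the beginning of the chapter, and the different orders of the two formulas, can be reproduced with a few lines of Python (f(x) = 1/x at x = 2, where f′(2) = −0.25):

```python
f = lambda u: 1.0 / u
x, h = 2.0, 0.01

forward = (f(x + h) - f(x)) / h              # order 1: error ~ (f''/2) h
central = (f(x + h) - f(x - h)) / (2 * h)    # order 2: error ~ h^2

print(forward)   # about -0.2487562, off by roughly 1.2e-3
print(central)   # about -0.2500006, off by roughly 6e-6
```

Halving h roughly halves the forward error but divides the central error by four, which is the practical meaning of the orders computed above.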
    ∫_a^b P(x) dx = a_1 P(x_1) + a_2 P(x_2) + · · · + a_{n+1} P(x_{n+1}).
2.1. The Midpoint Rule. A coarse but quite natural way to approximate an integral is to multiply the value of the function at the midpoint by the width of the interval. This is the midpoint formula:

DEFINITION 18. The midpoint quadrature formula corresponds to x_1 = (a + b)/2 and a_1 = (b − a). That is, the approximation

    ∫_a^b f(x) dx ≃ (b − a) f((a + b)/2),

given by the area of the rectangle having a horizontal side at height f((a + b)/2).
One checks easily that the midpoint rule is of order 1: it is exact
for linear polynomials but not for quadratic ones.
2.2. The Trapezoidal Rule. The next natural approximation (which is not necessarily better) is to use two points. As two are already given (a and b), the naive idea is to use them. Given a and b, one has the values of f at them. One could interpolate f linearly (using a line) and approximate the value of the integral by the area under that line. Or one could take the mean value of two rectangles (one with height f(a) and the other with height f(b)). The fact is that both methods give the same value.
DEFINITION 19. The trapezoidal rule for [a, b] corresponds to x_1 = a, x_2 = b and weights a_1 = a_2 = (b − a)/2. That is, the approximation

    ∫_a^b f(x) dx ≃ ((b − a)/2) (f(a) + f(b))

for the integral of f, using the trapeze with a side on [a, b], joining (a, f(a)) with (b, f(b)), and parallel to OY.
Even though it uses one point more than the midpoint rule, the
trapezoidal rule is also of order 1.
2.3. Simpson's Rule. The next natural step involves 3 points instead of 2, using a parabola instead of a straight line. This method is remarkably precise (it has order 3) and is widely used. It is called Simpson's Rule.
DEFINITION 20. Simpson's rule is the quadrature formula corresponding to the nodes x_1 = a, x_2 = (a + b)/2 and x_3 = b, and the weights corresponding to the exact integration of polynomials of degree 2: a_1 = a_3 = (b − a)/6 and a_2 = 4(b − a)/6. That is, the approximation

    ∫_a^b f(x) dx ≃ ((b − a)/6) (f(a) + 4 f((a + b)/2) + f(b)).
[FIGURE 3. Trapezoidal rule: approximate the area under f by that of the trapeze.]
[FIGURE 4. Simpson's rule: approximating the area under f by the parabola passing through the endpoints and the midpoint.]
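The three rules, and their different orders, fit in a few lines of Python. On ∫₀¹ x³ dx = 1/4, an integrand of degree 3, only Simpson's rule is exact:

```python
def midpoint(f, a, b):
    return (b - a) * f((a + b) / 2)

def trapezoid(f, a, b):
    return (b - a) / 2 * (f(a) + f(b))

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

cube = lambda x: x ** 3
print(midpoint(cube, 0.0, 1.0))   # 0.125  (too small)
print(trapezoid(cube, 0.0, 1.0))  # 0.5    (too large)
print(simpson(cube, 0.0, 1.0))    # 0.25   (exact: Simpson has order 3)
```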
If the interval [a, c] is divided into the two subintervals [a, b] and [b, c] and the trapezoidal rule is applied on both in order to approximate the integral of f on [a, c], one gets:

    ∫_a^c f(x) dx ≃ ((b − a)/2) (f(b) + f(a)) + ((c − b)/2) (f(c) + f(b)) = (h/2) (f(a) + 2 f(b) + f(c)),

where h = b − a is the width of each subinterval, either [a, b] or [b, c]: the formula is just a weighted mean of the values of f at the endpoints and the midpoint. In the general case, when there are more than two subintervals, one gets the analogous formula

    ∫_a^b f(x) dx ≃ (h/2) (f(a) + 2 f(x_2) + 2 f(x_3) + · · · + 2 f(x_n) + f(b)).

That is: half the width of the subintervals times the sum of the values at the endpoints plus double the values at the interior points.
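The composite formula is a one-loop function; a Python sketch (illustrative name):

```python
def composite_trapezoid(f, a, b, n):
    # (h/2) * (f(a) + 2*f(x_2) + ... + 2*f(x_n) + f(b)), n subintervals of width h
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += 2 * f(a + k * h)
    return h / 2 * total

print(composite_trapezoid(lambda x: x * x, 0.0, 1.0, 100))
# about 0.33335 for the exact value 1/3: the composite error is O(h^2)
```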
CHAPTER 6
Differential Equations
There is little doubt that the main application of numerical methods is for integrating differential equations (which is the technical term
for solving them numerically).
1. Introduction
A differential equation is a special kind of equation: one in which
one of the unknowns is a function. We have already studied some:
any integral is a differential equation (it is the simplest kind). For
example,

    y′ = x

is an equation in which one seeks a function y(x) whose derivative is x. It is well known that the solution is not unique: there is an integration constant and the general solution is written as

    y = x²/2 + C.
This integration constant can be better understood graphically. When computing the primitive of a function, one is trying to find a function whose derivative is known. A function can be thought of as a graph on the X, Y plane. The integration constant specifies at which height the graph is; this does not change the derivative, obviously. On the other hand, if one is given the specific problem
    y′ = x,    y(3) = 28,

then one is trying to find a function whose derivative is x, with a condition at a point: that its value at 3 be 28. Once this value is fixed, there is only one graph having that shape and passing through (3, 28). The condition y(3) = 28 is called an initial condition. One imposes that the graph of y pass through a point, and then there is only one function solving the integration problem, which means there is only one suitable constant C. As a matter of fact, C can be computed by substitution:

    28 = 3²/2 + C  ⟹  C = 47/2.
The same idea gives rise to the term initial condition for a differential
equation.
Consider the equation

    y′ = y

(whose general solution should be known). This equation means: find a function y(x) whose derivative is equal, at every point, to the value of the function itself. One tends to think of the solution y(x) = e^x but. . . is this the only possible solution? A geometrical approach may be more useful: the equation asks for a function y(x) whose derivative equals the height y(x) at each point. From this point of view it seems obvious that there must be more than one solution to the problem: at each point one should be able to draw the corresponding tangent, move a little to the right and do the same. There is nothing special about the points (x, e^x) for them to give the only solution to the problem. Certainly, the general solution to the equation is

    y(x) = C e^x,

where C is an integration constant. If one also specifies an initial condition, say y(x_0) = y_0, then necessarily

    y_0 = C e^{x_0},

so that C = y_0 / e^{x_0} gives the solution to the initial value problem (notice that the denominator is not 0).
This way, in order to find a unique solution to a differential equation, one needs to specify at least one initial condition. As a matter
of fact, the number of these must be the same as the order of the
equation.
Consider

    y″ = −y,

whose solutions should also be known: the functions of the form y(x) = a sin(x) + b cos(x), for constants a, b ∈ R. In order for the solution to be unique, two initial conditions must be specified. They are usually stated as the value of y at some point and the value of the derivative at that same point. For example, y(0) = 1 and y′(0) = 1. In this case, the solution is y(x) = sin(x) + cos(x).
This chapter deals with approximate solutions to differential equations. Specifically, ordinary differential equations (i.e. with functions
y( x ) of a single variable x).
2. The Basics
The first definition is that of differential equation:

DEFINITION 23. An ordinary differential equation is an equality A = B in which the only unknown is a function of one variable, and in which a derivative of some order of that function appears.

The adjective ordinary refers to the unknown being a function of a single variable (there are no partial derivatives).
EXAMPLE 18. We have shown some examples above. Differential equations can take many forms:

    y′ = sin(x)
    xy = y′ − 1
    (y′)² − 2y″ + x² y = 0
    y′/y − xy = cos(y)

etc.
In this chapter, the unknown in the equation will always be denoted with the letter y. The variable on which it depends will usually
be either x or t.
D EFINITION 24. A differential equation is of order n if n is the
highest order derivative of y appearing in it.
The specific kind of equations we shall study in this chapter are the solved ones (which does not mean that they are already solved, but that they are written in a specific way):

    y′ = f(x, y).
D EFINITION 25. An initial value problem is a differential equation
together with an initial condition of the form y( x0 ) = y0 , where
x0 , y0 R.
D EFINITION 26. The general solution to a differential equation E
is a family of functions f ( x, c), where c is one (or several) constants
such that:
Any solution of E has the form f ( x, c) for some c.
Any expression f ( x, c) is a solution of E,
except for possibly a finite number of values of c.
    |f(x_1) − f(x_2)| ≤ K |x_1 − x_2|

for any x_1, x_2 ∈ X, where | · | denotes the absolute value of a number.
This is a kind of strong continuity condition (it is easier for a function to be continuous than to be Lipschitz). What matters is that this condition has a very important consequence for differential equations: it guarantees the uniqueness of the solution. Let X be a set [x_0, x_n] × [a, b] (a strip, or a rectangle) and f(x, y) : X → R a function on X which satisfies Lipschitz's condition. Then:

THEOREM 9 (Cauchy-Kovalevsky). Under the conditions above, any differential equation y′ = f(x, y) with an initial condition y(x_0) = y_0 for y_0 ∈ (a, b) has a unique solution y = y(x) defined on [x_0, x_0 + t] for some t ∈ R greater than 0.
Lipschitz's condition is not that strange. As a matter of fact, polynomials and all the usual analytic functions (exponentials, logarithms, trigonometric functions, etc.), as well as their inverses (where they are defined and continuous), satisfy it. An example which does not is f(x) = √x on an interval containing 0, because f has a vertical tangent line at that point. The reader should not worry about this condition unless he sees a derivative becoming infinite or a point of discontinuity (but we shall not discuss these in these notes). We give just an example:
3. Discretization
    y(x) = y_0 + ∫_{x_0}^{x} f(t, y(t)) dt,

which is not a primitive but looks like one. In this case, there is no way to approximate the integral using the values of f at intermediate points, because one does not know the value of y(t). But one can take a similar approach.
6. Eulers Method: Integrate Using the Left Endpoint
One can state Eulers method as in Algorithm ??. If one reads
carefully, the gist of each step is to approximate each value yi+1 as the
previous one yi plus f evaluated at ( xi , yi ) times the interval width.
That is:

    ∫_{x_i}^{x_{i+1}} f(t) dt ≃ (x_{i+1} − x_i) f(x_i),

which, for lack of a better name, could be called the left endpoint rule: the integral on an interval [a, b] is approximated by the area of the rectangle of height f(a) and width (b − a).
Algorithm 10 Eulers algorithm assuming exact arithmetic.
Input: A function f(x, y), an initial condition (x_0, y_0), an interval [a, b] = [x_0, x_n] and a step h = (x_n − x_0)/n
Output: A family of values y_0, y_1, . . . , y_n (which approximate the solution to y′ = f(x, y) on the net x_0, . . . , x_n)
? S TART
i0
while i < n do
y_{i+1} ← y_i + h f(x_i, y_i)
i ← i + 1
end while
return (y0 , . . . , yn )
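Algorithm 10 in Python (names illustrative), applied to y′ = y, y(0) = 1, whose exact solution at x = 1 is e ≈ 2.71828:

```python
def euler(f, x0, y0, xn, n):
    # Algorithm 10: y_{i+1} = y_i + h * f(x_i, y_i)
    h = (xn - x0) / n
    xs, ys = [x0], [y0]
    for i in range(n):
        ys.append(ys[-1] + h * f(xs[-1], ys[-1]))
        xs.append(xs[-1] + h)
    return xs, ys

xs, ys = euler(lambda x, y: y, 0.0, 1.0, 1.0, 1000)
print(ys[-1])  # about 2.7169: the order-1 error of Euler's method is visible
```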
One might try, as an exercise, to solve the problem using the right endpoint: this gives rise to what are called implicit methods, which we shall not study (but which usually perform better than the explicit ones we are going to explain).
7. Modified Euler: the Midpoint Rule

Instead of using the left endpoint of [x_i, x_{i+1}] for integrating and computing y_{i+1}, one might use (and this would be better, as the reader should verify) the midpoint rule somehow. As there is no way to know the intermediate values of y(t) beyond x_0, some kind of guesswork has to be done. The method goes as follows:

- Use a point near P_i = (x_i, y_i) whose x coordinate is the midpoint of [x_i, x_{i+1}].
- For lack of a better point, this first approximation is done using Euler's algorithm: one takes the point Q_i = (x_i + h/2, y_i + (h/2) f(x_i, y_i)).
The quadrature formula behind this is the midpoint rule,

    ∫_a^b f(x) dx ≃ (b − a) f((a + b)/2).

[Figure: construction of y_{i+1} from y_i using an intermediate point at x_i + h/2.]
8. Heun's Method: the Trapezoidal Rule

The method goes as follows:

- Compute k_1 = f(x_i, y_i).
- Compute z_2 = y_i + h k_1. This is the value y_{i+1} in Euler's method.
- Compute k_2 = f(x_{i+1}, z_2). This would be the slope at (x_{i+1}, y_{i+1}) with Euler's method.
- Compute the mean of k_1 and k_2, (k_1 + k_2)/2, and use it as the slope. That is, set y_{i+1} = y_i + (h/2)(k_1 + k_2).
Figure ?? shows a graphical representation of Heuns method.
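The four steps above, in Python (illustrative names), on the same test problem y′ = y, y(0) = 1 used for Euler's method:

```python
def heun(f, x0, y0, xn, n):
    h = (xn - x0) / n
    xs, ys = [x0], [y0]
    for i in range(n):
        x, y = xs[-1], ys[-1]
        k1 = f(x, y)                  # slope at the left endpoint
        z2 = y + h * k1               # Euler predictor
        k2 = f(x + h, z2)             # slope at the predicted right endpoint
        ys.append(y + h / 2 * (k1 + k2))
        xs.append(x + h)
    return xs, ys

xs, ys = heun(lambda x, y: y, 0.0, 1.0, 1.0, 100)
print(ys[-1])  # about 2.71824: closer to e than Euler with ten times the steps
```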
[Figure: graphical representation of one step of Heun's method.]
FIGURE 5. Comparison between Euler's, Modified Euler's and Heun's methods, and the true solution to the ODE y′ = y + cos(x) for y(0) = 1.
CHAPTER 7
which reflects the idea that what we are doing is just the same as in Chapter ??, only with more coordinates. This has to be clear from the beginning: the only added difficulty is the number of computations to be carried out.
Notice that in the example above, F depends only on x and y
not on t. However, it might as well depend on t (because the behaviour of the system may depend on time), so that in general, we
should write
v(t) = F (t, B(t)).
Just for completeness: if F does not depend on t, the system is called autonomous, whereas if it does depend on t, it is called non-autonomous.
1.1. A specific example. Consider, to be more precise, the same autonomous problem as above but with F(x, y) = (−y, x). This gives the following equations:

(22)
    x′(t) = −y(t)
    y′(t) = x(t).
It is clear that, in order to have an initial value problem, we need an initial condition. Because the equation is of order one, we need just the starting point of the motion, that is: two coordinates. Let us take x(0) = 1 and y(0) = 0 as the initial position for time t = 0. The initial value problem is, then,

(23)
    x′(t) = −y(t)
    y′(t) = x(t),    with (x(0), y(0)) = (1, 0),
which, after some scaling (so that the plot looks more or less nice), describes the vector field depicted in Figure ??. It is easy to guess that a body whose velocity is described by the arrows in the diagram follows a circular motion. This is, actually, what we are going to prove.
The initial value problem (??) means, from a symbolic point of
view, that the functions x (t) and y(t) satisfy the following conditions:
(25)
    x′_1 = f_1(t, x_1, . . . , x_n),    x_1(0) = a_1
    x′_2 = f_2(t, x_1, . . . , x_n),    x_2(0) = a_2
    . . .
    x′_n = f_n(t, x_1, . . . , x_n),    x_n(0) = a_n
Algorithm 13 Euler's algorithm for n coordinates. Notice that boldface elements denote vectors.
Input: n functions of n + 1 variables f_1(t, x_1, . . . , x_n), . . . , f_n(t, x_1, . . . , x_n), an initial condition (a_1, a_2, . . . , a_n), an interval [A, B] and a step h = (B − A)/N
Output: A family of vector values x0 , . . . , xN (which approximate the solution to the corresponding initial value problem for
t [ A, B])
? S TART
i 0, t0 A, x0 ( a1 , . . . , an )
while i < N do
v ← (f_1(t_i, x_i), . . . , f_n(t_i, x_i))
x_{i+1} ← x_i + h v
t_{i+1} ← t_i + h
i ← i + 1
end while
return x0 , . . . , x N
As we did in Chapter ??, we could state a modified version of Euler's method, but we prefer to go straight to Heun's. It is stated precisely in Algorithm ??.
Algorithm 14 Heun's algorithm for n coordinates. Notice that boldface elements denote vectors.
Input: n functions of n + 1 variables f_1(t, x_1, . . . , x_n), . . . , f_n(t, x_1, . . . , x_n), an initial condition (a_1, a_2, . . . , a_n), an interval [A, B] and a step h = (B − A)/N
Output: A family of vector values x0 , . . . , xN (which approximate the solution to the corresponding initial value problem for
t [ A, B])
? S TART
i 0, t0 A, x0 ( a1 , . . . , an )
while i < N do
v ← (f_1(t_i, x_i), . . . , f_n(t_i, x_i))
z ← x_i + h v
t_{i+1} ← t_i + h
w ← (f_1(t_{i+1}, z), . . . , f_n(t_{i+1}, z))
x_{i+1} ← x_i + (h/2) (v + w)
i ← i + 1
end while
return x0 , . . . , x N
In Figure ?? a plot of the approximate solution to the initial value problem of Equation ?? is shown. Notice that the approximate trajectory is, on the printed page, indistinguishable from a true circle (we have plotted more than a single loop to show how the approximate solution overlaps itself to the naked eye). This shows the improved accuracy of Heun's algorithm, even though it does not behave so nicely in the general case.
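Algorithm 14 can be sketched in Python and applied to a circular vector field like the one of this section (here x′ = −y, y′ = x, starting at (1, 0)); after integrating over [0, 2π] the point should come back very near (1, 0). Names are illustrative:

```python
import math

def heun_system(F, t0, x0, T, n):
    # Algorithm 14 with plain lists as vectors: slope v, predictor z,
    # corrected slope w, then the mean (v + w)/2.
    h = (T - t0) / n
    t, x = t0, list(x0)
    for i in range(n):
        v = F(t, x)
        z = [xj + h * vj for xj, vj in zip(x, v)]
        w = F(t + h, z)
        x = [xj + h / 2 * (vj + wj) for xj, vj, wj in zip(x, v, w)]
        t += h
    return x

F = lambda t, x: (-x[1], x[0])                  # circular field
p = heun_system(F, 0.0, (1.0, 0.0), 2 * math.pi, 1000)
print(p)  # very close to (1, 0): one full turn of an almost perfect circle
```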
And now Newton's law can be expressed as a true ordinary differential equation of order one:

(26)
    x′_1(t) = v_1(t),    v′_1(t) = F_1(x_1(t), . . . , x_n(t))
    x′_2(t) = v_2(t),    v′_2(t) = F_2(x_1(t), . . . , x_n(t))
    . . .
    x′_n(t) = v_n(t),    v′_n(t) = F_n(x_1(t), . . . , x_n(t))
(27)
    d^k x_1 / dt^k = F_1(t, x, x′, . . . , x^{(k−1)})
    d^k x_2 / dt^k = F_2(t, x, x′, . . . , x^{(k−1)})
    . . .
    d^k x_n / dt^k = F_n(t, x, x′, . . . , x^{(k−1)})

where the functions F_1, . . . , F_n depend on t and the variables x_1, . . . , x_n and their derivatives with respect to t up to order k − 1. It is easier to understand than to write down.
In order to turn this into a system of equations of order one, one just defines k − 1 new systems of variables, which we enumerate with superindices:
    x_1 = u_1^1,    (u_1^1)′ = u_1^2,    . . . ,    (u_1^{k−2})′ = u_1^{k−1}
    . . .
    x_n = u_n^1,    (u_n^1)′ = u_n^2,    . . . ,    (u_n^{k−2})′ = u_n^{k−1}
(28)
    (x_1, y_1)″ = G m_2 (x_2 − x_1, y_2 − y_1) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
    (x_2, y_2)″ = G m_1 (x_1 − x_2, y_1 − y_2) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
Introducing the velocities (u_x, u_y) = (x_1′, y_1′) and (v_x, v_y) = (x_2′, y_2′), and taking G = 1, one gets:

(29)
    x_1′ = u_x,    u_x′ = m_2 (x_2 − x_1) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
    y_1′ = u_y,    u_y′ = m_2 (y_2 − y_1) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
    x_2′ = v_x,    v_x′ = m_1 (x_1 − x_2) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
    y_2′ = v_y,    v_y′ = m_1 (y_1 − y_2) / ((x_2 − x_1)² + (y_2 − y_1)²)^{3/2}
which is a standard ordinary differential equation of the first order. An initial value problem requires both a pair of initial positions (for (x_1, y_1) and (x_2, y_2)) and a pair of initial velocities (for (u_x, u_y) and (v_x, v_y)): this gives eight values, apart from the masses.
In Listing ?? we show an implementation of Heuns method for
the two-body problem (with G = 1). In order to use it, the function
twobody must receive the values of the masses, a vector of initial
conditions (in the order ( x1 (0), y1 (0), x2 (0), y2 (0), u x (0), uy (0), v x (0), vy (0))),
a step size h and the number of steps to compute.
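The right-hand side of (29) and one Heun step for it can be sketched in Python as follows (G = 1; the names twobody_rhs and heun_step are illustrative, not the course's twobody):

```python
def twobody_rhs(m1, m2, s):
    # s = (x1, y1, x2, y2, ux, uy, vx, vy), the ordering used in the text; G = 1.
    x1, y1, x2, y2, ux, uy, vx, vy = s
    dx, dy = x2 - x1, y2 - y1
    r3 = (dx * dx + dy * dy) ** 1.5
    return [ux, uy, vx, vy,
            m2 * dx / r3, m2 * dy / r3,      # acceleration of body 1
            -m1 * dx / r3, -m1 * dy / r3]    # acceleration of body 2

def heun_step(f, s, h):
    v = f(s)
    z = [sj + h * vj for sj, vj in zip(s, v)]
    w = f(z)
    return [sj + h / 2 * (vj + wj) for sj, vj, wj in zip(s, v, w)]

s = [-1.0, 0.0, 1.0, 0.0, 0.0, 0.3, 0.0, -0.5]   # one of the "interesting values"
for step in range(200):
    s = heun_step(lambda state: twobody_rhs(1.0, 1.0, state), s, 0.01)
print(s[4] + s[6], s[5] + s[7])  # total momentum stays (0, -0.2): it is conserved
```

Since the forces on the two bodies are equal and opposite, the total momentum m_1(u_x, u_y) + m_2(v_x, v_y) is conserved by the scheme, which gives a cheap sanity check on any implementation.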
FIGURE 4. Approximation to the solution of the two-body problem with initial values as in the text. One of the bodies (in red) is more massive than the other.
%plot(x1, y1, 'r', x2, y2, 'b');
%axis([-2 2 -2 2]);
%filename = sprintf('pngs/%05d.png', k);
%print(filename);
%clf;
end
endfunction
% interesting values:
% r = twobody(4, 400, [-1,0,1,0,0,14,0,-0.1], .01, 40000);   (pretty singular)
% r = twobody(1, 1, [-1,0,1,0,0,0.3,0,-0.5], .01, 10000);    (strange -> loss of precision!)
% r = twobody(1, 900, [-30,0,1,0,0,2.25,0,0], .01, 10000);   (like sun-earth)
% r = twobody(1, 333000, [149600,0,0,0,0,1,0,0], 100, 3650); (earth-sun)...
FIGURE 5. Absolute and relative motion of two particles. On the left, the absolute trajectories; on the right, the relative ones. The elliptical shape and the numerical error are noticeable on the right (the blue trajectory should be a perfect ellipse).