
Numerical and

Computational Methods
with Python
This course covers numerical methods for solving various
mathematical problems and introduces the basics of pro-
gramming. The methods are demonstrated using Python.
A short introduction to Python is part of the course. No
knowledge of Python is expected.

updated October 9, 2024 by N.N. and S.P.

Notes by
Profs. N. Gromov, B. Doyon
Drs. N. Nüsken and S. Pougkakiotis

Mathematics Department, King’s College London


Contents

1 Introduction 5
1.1 What are Numerical Methods? . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 How to start writing Python . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 iPython/Jupyter Interface . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 First Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Introduction to Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 The First Cell in an iPython Notebook . . . . . . . . . . . . . . . . . 16
1.4 Floating Point Arithmetic and Numerical Errors . . . . . . . . . . . . . . . . 16
1.4.1 Real Numbers in a Computer . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Round-off Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Arithmetic Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2 Solution of a Single Non-linear Equation 28


2.1 Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.1 Python Code for the Newton Method . . . . . . . . . . . . . . . . . . 29
2.1.2 Newton Fractals†1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 Rate and Order of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Convergence of the Newton method . . . . . . . . . . . . . . . . . . . 33
2.3 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Python Code for the Secant Method . . . . . . . . . . . . . . . . . . 36
2.4 Bisection Method (a.k.a Lion Hunting) . . . . . . . . . . . . . . . . . . . . . 36
2.4.1 Bisection Method Implementation . . . . . . . . . . . . . . . . . . . . 37
2.5 Numerically Estimating Convergence Orders . . . . . . . . . . . . . . . . . . 38

3 Approximation of Functions by Polynomials 41


3.1 Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.1 Python Implementation of the Lagrange Interpolation . . . . . . . . . 42
3.1.2 Limitations of Polynomial Interpolation . . . . . . . . . . . . . . . . . 44
3.2 Fitting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Least Squares Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Python Code for Polynomial LS Fit . . . . . . . . . . . . . . . . . . . 46
3.3 Interpolation Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Error of the Polynomial Interpolation . . . . . . . . . . . . . . . . . . 48
3.3.2 Estimating the Error on a Mesh: The Order of the Error . . . . . . . 49
3.3.3 Optimising the Error Bound: Chebyshev nodes . . . . . . . . . . . . 50

4 Numerical Differentiation 52
4.1 Finite Difference Approximations . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Error Estimation: Round-off (Machine Precision) and Approximation Errors 53
4.3 Python experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Richardson Extrapolation: An Example . . . . . . . . . . . . . . . . 57

5 Numerical Integration 58
5.1 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.1 Approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.2 Roundoff error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.3 Python implementation for the Trapezoidal Rule . . . . . . . . . . . 61
5.2 Midpoint rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Simpson’s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Error estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Gaussian Integration Method . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Python implementation of Gaussian Integration . . . . . . . . . . . . 67
5.5 Monte-Carlo Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.1 Estimating Statistical Error . . . . . . . . . . . . . . . . . . . . . . . 70

6 Numerical solution of ordinary differential equations 75


6.1 Initial value problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.1 Euler’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1.2 Euler’s Method: Python implementation . . . . . . . . . . . . . . . . 77
6.1.3 Euler Method: error estimation . . . . . . . . . . . . . . . . . . . . . 79
6.1.4 Runge-Kutta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.1.5 Runge-Kutta Method: Python implementation . . . . . . . . . . . . . 81
6.1.6 4th order Runge-Kutta Method . . . . . . . . . . . . . . . . . . . . . 81
6.2 Boundary problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1 Shooting method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.2 Python implementation of the Shooting method . . . . . . . . . . . . 83

7 Stochastic differential equations 85


7.1 Euler-Maruyama method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

8 Partial Differential Equations 88


8.1 Relaxation Method for the Laplace equation . . . . . . . . . . . . . . . . . . 88
8.1.1 Python Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 88
9 Eigenvalue problem 91
9.1 Direct method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.2 Jacobi Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2.1 Python Implementation of the Jacobi Method . . . . . . . . . . . . . 94

10 Optimisation problem 96
10.1 Golden Section Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.2 Powell’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.3 Down-Hill method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.3.1 Python implementation of the Down-Hill method . . . . . . . . . . . 100

11 Solving Linear Systems of Algebraic Equations 103


11.1 Gaussian Elimination Method . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.1.1 Python implementation of Gaussian Elimination Method . . . . . . . 105
11.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 1
Introduction

1.1 What are Numerical Methods?


As much as you may think differently, the analytical¹ ability of humans is in fact very limited.
We can only find exact solutions to very simple problems (even with Wolfram Alpha). For
example, consider the relatively simple-looking integral:
∫ sin(x)/x dx .
For limits (−∞, ∞), one can show that this integral is equal to π (even this requires complex
analysis). Despite numerous attempts by various people, no analytic solution was found for
finite limits. Later it was shown that this integral does not have a solution in terms of
elementary functions! This is where numerical methods become essential. In this course we
will introduce several methods for evaluating this integral numerically for any limits you like.
You will be able to check that
∫_0^1 sin(x)/x dx = 0.946083070367183014941353313823... ,

or even verify, without any knowledge of complex analysis, that

∫_{-∞}^{∞} sin(x)/x dx = 3.14159265358979323423816268637... ,    (1.1)

which is, of course, π!² Amazingly, the methods we will discuss will work equally well for
much more complicated functions without any modifications. Due to this power, numerical
methods are crucial to physics, biology, chemistry, and also mathematics and theoretical
physics!

¹ ‘Numerical’ methods usually produce approximate solutions. ‘Analytical’ methods use algebraic manipulation to arrive at an exact answer (which could be arbitrarily complicated and thus may not be useful in practice).

² Formally, one may have doubts as to whether this is really π, as we can only see the first 30 digits. The 31st digit could indeed deviate from π. However, the probability of that happening (roughly 10^{-30}) is much less than the probability of a huge asteroid hitting Earth in the next half an hour and destroying the planet, so in many situations it is acceptable to assume that it is indeed π. There are many other theories that are more likely than the theory that the number on the r.h.s. of (1.1) is not π. For example, it is also much more likely that we all live in a computer simulation (in which case we should not make a big distinction between exact numbers and approximate numerical values, as all computer calculations carry some small numerical error, as we discuss in this course too).
Numerical methods are not limited to solving integration problems. In this course we
will cover numerical methods for: linear equations; eigenvalue problems; nonlinear equations;
differential equations; optimization problems; and stochastic differential equations.
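As a small preview (a sketch that treats scipy's built-in integrator as a black box; we will construct our own integration methods in Chapter 5), the first integral above can be evaluated in a couple of lines:

import numpy as np
from scipy.integrate import quad

# sin(x)/x written via np.sinc (np.sinc(t) = sin(pi*t)/(pi*t)),
# which is well-defined at x = 0, so there is no 0/0 issue at the endpoint
integrand = lambda x: np.sinc(x / np.pi)

value, error_estimate = quad(integrand, 0, 1)
print(value)   # approximately 0.946083070367183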

1.2 Why Python?


There are many reasons to study and use Python for numerical methods. Firstly, it is cur-
rently the most popular programming language and is used in areas such as web development,
weather forecasting, and computer games. Even though C or Java are more powerful, they
are a lot more complicated to study and less suitable for our work. It is much faster to write
a working Python program than a C program. Also, Python is a cross-platform language, and
the same program will work immediately on almost any computer system, even on Android
phones!
Finally, it is completely free!

1.2.1 How to start writing Python


For this module, you are not required to install Python on your computer, since we will work
with JupyterLite (which is a web-based code editor, on which you can write Python code).
To access JupyterLite, simply use the following link:

https://jupyter.org/try-jupyter/lab/

Nonetheless, if you are interested in delving (independently of this module) more deeply into
Python, you could install it on your laptop. The installation is quite straightforward. To
avoid compatibility problems it is recommended to use the anaconda3 distribution:

https://repo.continuum.io/archive/
or directly at
https://www.anaconda.com/products/individual

Make sure you download anaconda3 (as opposed to anaconda2). The difference is that
anaconda3 comes with Python 3, which we will use for this course. In the repository (first
link above), the name of the file indicates the operating system; e.g., for 64-bit Windows
you need to download

Anaconda3-2021.05-Windows-x86_64.exe

and
Anaconda3-2021.05-MacOSX-x86_64.pkg
for macOS. After installing anaconda3, make sure you also install the main packages as de-
scribed below.

Packages we need
If you decide to work with a version of Python installed locally in your computer, please
note that you need to independently install packages that we will be using in this module
(all these are automatically installed in JupyterLite). Specifically, we will make use of the
following packages:
numpy, scipy, matplotlib, ipython, notebook
The step-by-step installation process (which is very straightforward) is described below.

For Windows: open the “command prompt”³ (or, even better, the “anaconda prompt”, which is
installed if you completed the anaconda installation) and run the following commands
one by one (answer yes to all questions):
conda install matplotlib
conda install ipython
conda install notebook
conda install numpy
conda install scipy
conda install mpmath

If the conda command does not work, it probably means that you have forgotten to “Add
anaconda to PATH” during the installation. Install it again with the correct settings.
To check that ipython installed correctly, run from the “command prompt”
jupyter notebook

This will open your browser with Jupyter interface inside.

1.2.2 iPython/Jupyter Interface


iPython/Jupyter (included in JupyterLite) allows your code to be typed using a nice web-
based code editor, and be run interactively. If you install Python locally, you can also run
web-based Jupyter notebooks. Specifically, to run Jupyter on your laptop, you have to follow
the installation steps from the previous section and then run the command
jupyter notebook

Nonetheless, please note that during the class test, you will be asked to use JupyterLite.
Make sure that you are comfortable using it (it is really easy!).

1.2.3 First Program


Open a jupyter notebook. Create a cell and type into it
2+2

and then evaluate the cell by pressing “Shift+Enter”. As a result we get 4, as expected!
³ You may need to set your current directory to be inside the anaconda installation directory.
1.3 Introduction to Python
1.3.1 Basic Operations
Assignment and Comparison. If you are familiar with the basics of programming you can
skip this section. As in any programming language, our main friends are variables, whose
names consist of one or more Latin letters and can also contain digits (but not at the
beginning of the name):
a
xXx
variable26

The variables could be assigned various values which can change afterwards. To assign a
value to a variable we can use =. For example, we can assign the value 13 to the variable a
using
a=13

Note that the statement above means “set variable a to the value 13”. It should not be
confused with the mathematical equation a = 13. In particular, the statement
13=a

will result in an error and does not make sense, as we cannot assign a value to “13”, since
“13” is not a variable. What is more similar to the pure-math a = 13 is
a==13
True

(with two equals symbols in a row). The == is a comparison operation which returns True
when the value of the l.h.s. and that of the r.h.s. are the same. One can equally well swap
the r.h.s. with the l.h.s. for this operation:
13==a
True

More advanced examples of the assignment operation are statements containing a nontrivial
expression on the r.h.s.:
a = 13     # set 13 as the value of a
a = a+1    # first a+1 will be computed, giving 14,
           # and then the assignment operation will be
           # applied, setting a to 14

To check the current value of a variable one can use the print function:
a=13 # a becomes 13
a=a+10 # a becomes 23
print(a)   # print the current value of a
23

Note that when using iPython you don’t need to write print, you can simply type

a
23

Control Flow. Comparison operations are especially useful when it comes to control flow
(i.e., controlling which command of a Python program is going to be executed next). A very
useful tool for control flow is the if statement, which determines whether a command will
be executed, depending on some appropriate condition.
x = 10

if x > 5:
print ("x is greater than 5")
x is greater than 5

Note that the indentation after the if statement is important (i.e., the commands after the
if-statement must be indented consistently, using a tab or a fixed number of spaces),
since it signifies which commands have to be executed if the associated condition is true. In
general, we can create an if−elif−else statement, as shown in the following example:
x = 5

if x > 5:
print ("x is greater than 5")
elif x == 5:
print ("x is equal to 5")
else:
print ("x is smaller than 5")
x is equal to 5

On an unrelated note, observe that we have used a string within the print command. A
string is a sequence of characters enclosed within quotation marks (single or double).

Basic Operations. Basic algebraic operations


2∗4
8

7/2
3.5

One can use variables involved into these operations:


a = 13           # set a to 13
b = 3*(a+1)/2    # evaluate the r.h.s. and assign the result to b
print (a,b)
13 21.0

Lists. Lists can be used to store any kind of data, but they should not be confused with
vectors and matrices (for those, use np.array, which supports the corresponding mathematical
operations; see below!). The built-in list type works as follows:
a = [1,2,3,4]
a[0]    # gives the first element (Python indexing starts at 0)
a[1]    # gives the second element
a[-1]   # gives the last element
a[2:]   # part of the list from the 3rd element to the end
a[:]    # returns an exact copy of the list
1
2
4
[3, 4]
[1, 2, 3, 4]

Note that a list is not really a vector. In particular, multiplication by an integer results in
a longer list:
a=[1 ,2 ,3 ,4];
a∗3
[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]

Lists can contain lists as their elements:


a=[1 ,[2 ,3] ,3 ,4];
a[1]
[2, 3]

To add an element to an already existing list:


a=[1 ,2 ,3];
a. append (4)
print (a)
[1, 2, 3, 4]

Range. To create a list of several consecutive numbers we use


list(range (10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note that this gives a list of all non-negative integers up to, but not including, 10. For the
list of numbers from 5 to 10 we have to type
list(range (5 ,11))
[5, 6, 7, 8, 9, 10]

Note that, again, the last argument is not included in the range (but the first is). That is
a typical source of confusion. Always remember that Python indexing starts at zero.
One can also specify the step when using range, so that all even numbers in the range
0 . . . 10 can be obtained by
list(range (0 ,11 ,2))
[0, 2, 4, 6, 8, 10]

The function range is commonly used in the context of loops:

Loops. We will use for loops intensively:


for i in range (4):
print (i)
0
1
2
3

As with if statements, the second line in the above code must be indented. This is generally
the way in which blocks of code are indicated in Python. Note
that using range(4) we make the variable i attain consecutive integer values from 0 to 3.
You can also use range with two arguments
for i in range(4, 6):    # to run from 4 to 5
print (i)
4
5

We will also use while loops:


i = 0
while i < 10:
    i = i+1
    print(i)
1
2
3
4

Remark
Loops are generally very important in the context of programming, and allow us to
automate repeated operations. Nonetheless, linear algebra computations in Python (which
would naturally involve many repeated elementwise operations) are highly inefficient when
written with explicit loops. Instead, for all linear algebra purposes we will be using a highly
optimised, dedicated package called numpy, as discussed below.

Arrays, Vectors, Matrices. Arrays are much closer to the vectors we are used to; we
can add and/or multiply them by constants. To define an array we will have to use the
numpy package:
import numpy as np    # load the numpy package (needed only once)
a = np. array ([1 ,2 ,3])
b = np. array ([1 ,0 ,0])

For example:
a ∗0.5
a+b
[0.5 1. 1.5]
[2 2 3]

To create a 0-vector or a 0-matrix use the zeros command:


np. zeros (5)
np.zeros((3,3))    # note the double brackets here!
[ 0. 0. 0. 0. 0.]
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]

All the standard matrix operations can be performed very easily. For the matrix-matrix
product use np.dot:
a = np. array ([[1 , 0], [0, 1]])
b = np. array ([[4 , 1], [2, 2]])
np.dot(a,b)
[[4 1]
[2 2]]

Compare this to the element-wise product:


a = np. array ([[1 , 0], [0, 1]])
b = np. array ([[4 , 1], [2, 2]])
a∗b
[[4 0]
[0 2]]

Typing a∗b when in fact you want matrix multiplication is a typical hard-to-find mistake!
To compute the trace of a matrix:
a = np. array ([[1 , 0], [4, 7]])
np. trace (a)
8

Similarly, for the matrix-vector product A·v, we perform:

A = np. array ([[1 , 2], [3, 1]])
v = np. array ([1 ,4])
np.dot(A,v)
[9 7]

Again, compare this to


A = np. array ([[1 , 2], [3, 1]])
v = np. array ([1 ,4])
A∗v
([[1 , 8],
[3, 4]])

Be careful so as to not confuse np.dot and ∗!


For computing the determinant of a matrix we perform:
A = np. array ([[1 , 2], [3, 1]])
np. linalg .det(A)
−5.0

Similarly, to compute the inverse of a matrix:


A = np. array ([[1 , 2], [3, 1]])
np. linalg .inv(A)
[[−0.2 0.4]
[ 0.6 −0.2]]

In general, you can get an exhaustive description of the numpy package by typing
import numpy as np
help(np)

Creating Grids. The functions np.arange and np.linspace are to np.arrays what
range is to lists. The difference between np.arange and np.linspace is that the former
allows you to specify the step size, while the latter allows you to specify the number of
steps:
np. arange (0 ,1 ,0.25)
([0. , 0.25 , 0.5 , 0.75])

np. linspace (0 ,1 ,5)


([0. , 0.25 , 0.5 , 0.75 , 1. ])

Elementary Functions and Constants. The numpy package also contains all the ele-
mentary functions such as sin, cos, exp, tan, etc.,
[np.sin (2),np.cos (1.5) , np.abs(−1.2)]
[0.90929742682568171 , 0.070737201667702906 , 1.2]

as well as important constants such as π and e:


[np.pi , np.e]
[3.141592653589793 , 2.718281828459045]

The following can be used if functions or constants are used repeatedly:


from numpy import sin , pi
[sin (2) , pi]
[0.90929742682568171 , 3.141592653589793]

User-defined Functions. It is typically useful to define our own functions. For example,
let us define a function f (x) = x3 + 2x2 + 1. In Python this is done in the following way:
def f(x):
return x ∗∗3+2∗( x ∗∗2)+1

After that we can use this function just like any other function:
f(1)
4

There are situations where we want to define a function without giving it a specific name.
This happens, for example, when we want to use the function only once. In such cases it
is convenient to use a so-called lambda statement. This allows us to define an anonymous
function, i.e. a function that is defined without a name:
( lambda x: x ∗∗2)(3)
9

This is essentially equivalent to


def squareofx (x):
return x ∗∗2
squareofx (3)
9

except that we do not have to “waste” a name for the function x^2.

Making Plots. We will use the matplotlib.pyplot package for the plots. It has numer-
ous methods and we will only describe some of them. For example we can easily make a plot
of the sin function as shown below:
import numpy as np
import matplotlib . pyplot as plt
xs = np.linspace(0, 4*np.pi, 100)   # sampling points along x
ys = np.sin(xs)                     # function values at these points
plt.plot(xs, ys)                    # create the plot
plt.show()                          # show the plot

[Resulting figure: plot of sin(x) for x from 0 to 4π.]

Remark: Short excursion - loops and vectorised programming.


Let us unpack the line ys = np.sin(xs) in the code above. Here, the function np.sin
is (automatically) applied elementwise to the vector (array) xs. Therefore, we could
also write:
ys = np.zeros(len(xs))
for i in range(len(xs)):
    ys[i] = np.sin(xs[i])

However, the line ys = np.sin(xs) runs in around 1.77µs, while the code involving
the loop takes 173µs (on my computer, this depends a lot on the hardware, of course).
That is, it takes around 100 times longer! This is because numpy is an extremely
optimised package that can perform elementwise computations in parallel. As already
mentioned earlier in the notes, it is a good practice to avoid using loops whenever
possible. It is always preferable to call numpy routines when applicable. In this course,
such considerations will not play a significant role, as most examples can be executed
in very little time anyway. However, for more involved projects it is crucial to pay
attention to such details, as otherwise the code will be far too slow (in fact, the recent
developments in deep learning would hardly be possible without packages like numpy
tailored for ‘under the hood’ parallel computation).

Back to Plots. Plots can be modified for readability in many different ways:
ys1 = np.sin(xs)
ys2 = np.cos(xs)
plt.plot(xs, ys1, 'r')    # 'r' means red, solid line
plt.plot(xs, ys2, 'gx')   # 'gx' means green, no line, 'x' markers

plt.show ()

[Resulting figure: sin(x) as a red solid line and cos(x) as green ‘x’ markers on [0, 4π].]

See http://matplotlib.org/gallery.html for numerous types of plots and examples.

Exercise 1 Define the function f (x) = x3 sin(x2 ) and plot its values in the range (0, 10). At
first you will get a plot which is a bit ugly. To make it look nicer you will have to increase the
number of points by changing xs = np.linspace(0,10,1000). Use both types of functions:
created with def and with a lambda-statement.

Getting Help. A very important feature, which one can use even during the class test, is
the “?” symbol before the name of the function:
?range
Docstring :
range (stop) −> range object
range (start , stop[, step ]) −> range object

Return a sequence of numbers from start to stop by step.


Type: type

1.3.2 The First Cell in an iPython Notebook


It is recommended to use the first cell in a notebook to load all the necessary packages. It
could, for instance, look like this:
import numpy as np
from numpy import sin , cos , pi , e
import matplotlib . pyplot as plt

1.4 Floating Point Arithmetic and Numerical Errors


As mentioned above, one important aspect of numerical computation is the fact that the
results that are obtained are often (almost always) approximations of the true mathematical
answer to the problem we are trying to solve. There are many sources of approximations. As
we will see already with the first algorithm we will learn, the Newton method, the numerical

method itself, by construction, can only give an approximation, although sometimes the
approximation it can give is very precise. We’ll discuss more at length the approximations
due to the numerical method and how to estimate the numerical errors they cause for the
various algorithms we’ll study. But another source of numerical error is the finite memory
of the machine itself, combined with the way it keeps in memory the numbers it calculates.
In general, there are many sources of error in numerical computations, such as:
• Round-off errors, occurring due to finite memory in computers;
• Arithmetic errors, occurring due to manipulating inexact numbers;
• Approximation errors, occurring due to the utilization of approximations (such as those
obtained via the Taylor expansion);
• Algorithmic errors, occurring due to the utilization of iterative methods for the solution
of mathematical problems;
• Stochastic errors, occurring due to random noise or measurement errors.
In the rest of this section, we will get a deeper understanding of round-off and arithmetic
errors, by discussing how numbers are stored and manipulated within a computer, by giving
some examples in Python, as well as providing some warnings on things to avoid when writing
code for numerical computations. In general, this module will mostly focus on approximation
and algorithmic errors, although we will briefly discuss (and see some examples of) the effects
of round-off, arithmetic or stochastic errors.

1.4.1 Real Numbers in a Computer


Normalised Scientific Notation. In the decimal system, any real number with a finite
number of digits can be expressed in normalised scientific notation. This means that the
decimal point is shifted and an appropriate power of 10 is supplied so that only one decimal
digit, which must be non-zero, is before the decimal point.
Example For instance:
1. 132.204 = 1.32204 × 10^2.
2. −0.0041 = −4.1 × 10^{−3}.
Note that this notation can be directly used in Python as shown below:
x = 1.23e-4
0.000123
Generalisation to the number system with base N gives the following normalised real
number:
±(a_1.a_2a_3 . . . a_m) × N^e,   with a_1 ≠ 0,
where the number a_1.a_2a_3 . . . a_m is written with respect to the number system in base N.
This is the normalised scientific notation in base N. The digits a_1, a_2, a_3, . . . , a_m are the
significant digits, e is called the exponent and N is the base. The digit string a_2a_3 . . . a_m is called
the mantissa (or the fractional part).

Computer Number Representation. Computers use the binary system to store num-
bers. A bit is one digit in the binary system. Most modern computers use number repre-
sentations from IEEE 754-2008 standard arithmetic. Here we will discuss double precision
numbers (64 bits; also known as float64 in Python). A double precision 64-bit IEEE
number x is stored in the computer in the following form:

x = (−1)^s × 1.f × 2^{e−1023},    (1.2)

where 1.f is in binary form, as is 1023, i.e., 1023 = (1111111111)_2. In (1.2),
• 1 bit is used for the sign; s = 0 or 1 (0 for +, 1 for −);
• 11 bits are used for the non-negative exponent e (the patterns of all zeros and of all ones
are reserved; to be discussed later);
• 52 bits are used for the mantissa f .
Definition: Floating point numbers

All numbers of the form of (1.2) are called floating point (or machine) numbers.

Theorem 1.4.1 (Range of machine numbers) There is a finite (closed) range for the
machine numbers x, namely,
realmin ≤ |x| ≤ realmax.
Proof To prove this result, we need to find the largest and the smallest floating point
numbers enabled by the representation in (1.2).
• What is the largest floating point number, say realmax?
To answer this question, we first note that the largest value of the (11-bit) exponent e is
(11111111110)_2 = 2046,
while the largest value of the (52-bit) mantissa f is (11···1)_2 (52 ones). Therefore,

realmax = (1.11···1)_2 × 2^{2046−1023} = (1 + 1/2 + 1/2^2 + 1/2^3 + · · · + 1/2^52) × 2^1023
        = (2 − 2^{−52}) 2^1023 ≈ 10^308.

• What is the smallest positive floating point number, say realmin?
Proceeding similarly as before we get

realmin = (1.00···0)_2 × 2^{1−1023} ≈ 10^{−308},

since the smallest positive value for the (52-bit) mantissa f in this case is (00···0)_2 and
the smallest positive value for the exponent e is (00000000001)_2 (note that e = (00000000000)_2 = 0
together with f = 0 is a special case used to represent zero).

Thus any machine number x is such that,

realmin ≤ |x| ≤ realmax,

which completes the proof.
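These limits can be checked directly in Python (a small sketch, assuming only numpy; np.finfo reports the parameters of the float64 format):

import numpy as np

# largest double: (2 - 2^-52) * 2^1023, as derived above
realmax_formula = (2 - 2**-52) * 2.0**1023
print(realmax_formula == np.finfo(np.float64).max)    # True

# smallest positive *normalised* double: 2^(1-1023) = 2^-1022
realmin_formula = 2.0**-1022
print(realmin_formula == np.finfo(np.float64).tiny)   # True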

Definition: Overflow and Underflow


We say that if the absolute value of a number is:

• Larger than realmax, overflow occurs and the computer usually returns inf.

• Smaller than realmin, underflow occurs and the computer usually returns zero.

⋆ NaN (i.e., Not-a-Number) is used to describe things like inf/inf, inf − inf, etc. in
the machine.

Remark: Denormalized and special numbers

The previous (normalized) IEEE number system fails to represent the following num-
bers:

• negative numbers less than −realmax (negative overflow );

• negative numbers greater than −realmin (negative underflow );

• the value 0;

• positive numbers less than realmin (positive underflow );

• positive numbers greater than realmax (positive overflow ).

We mentioned earlier that the IEEE 754-2008 standard arithmetic reserves the expo-
nents of all zeros (i.e., (00000000000)2 ) and of all ones. If an exponent of all zeros
appears, then the computer assumes a denormalized representation

x = (−1)^s × 0.f × 2^{−1022}.

This allows us to represent numbers smaller (in absolute value) than realmin. If, in
this case, f contains only zeros, the computer considers this number to be 0 (in fact,
there are two different zeros; if the sign is positive, we call this a positive zero, and
negative zero if the sign is negative).
If an exponent of all ones appears and f contains only zeros, then we obtain (−1)^s × ∞.
If, in this case, the mantissa f contains some nonzero bits, we consider this to be NaN.

All the above can be easily witnessed in Python as shown below:


import math
import numpy as np

positive_infinity = float('inf')
print(positive_infinity)              # Output: inf

negative_infinity = float('-inf')
print(negative_infinity)              # Output: -inf

value = float('inf')
if math.isinf(value):
    print("Infinite value!")

x = positive_infinity + negative_infinity
print(x)

realmax = np.finfo(np.float64).max
print(realmax)
print(realmax * 2)                    # create overflow

inf
-inf
Infinite value!
nan
1.7976931348623157e+308
inf
RuntimeWarning: overflow encountered: print(realmax * 2)

1.4.2 Round-off Errors


Machine epsilon, eps. Let’s go back to the floating point representation (1.2) and consider
only positive floating point numbers (i.e., s = 0) with exponent e = 1023 (so that e − 1023 =
0). Then the floating point number will be of the form:

x = 1.f.

The smallest and largest floating point numbers in this case are x = 1 and

x = (1.11···1)_2 = 2 − 2^{−52}   (52 ones in the mantissa),

respectively, i.e., the largest floating point number in this case is very close to 2. Consider
the interval [1, 2).

• How many different floating point numbers exist in the interval [1, 2)?
Since each of the 52 bits in the mantissa f can be 0 or 1, the answer is 2^52.

• What is the distance between two successive floating point numbers in [1, 2)?
The 2^52 numbers are equidistributed on the interval [1, 2) (which has length equal to
1). Thus the distance between two successive floating point numbers in [1, 2) is 2^{−52}.

Definition: Machine epsilon

The number 2^{−52}, which is the distance between two successive floating point numbers
in [1, 2), is called the machine epsilon and is denoted by eps.

Note: eps is the smallest floating point number for which 1 + eps is a floating point num-
ber not equal to 1. To see this in practice, in the following Python code, we change the
appearance of the output using format(), while working with the machine epsilon.
import numpy as np

machine_epsilon = np.finfo(np.float64).eps

print(format(1, "0.52f"))
print(format(1 + machine_epsilon/2, "0.52f"))
print(format(1 + machine_epsilon, "0.52f"))

# Alternative way of printing using format specifiers
print(f"1: {1:.52f}")
print(f"1 + eps/2: {1 + machine_epsilon/2:.52f}")
print(f"1 + eps: {1 + machine_epsilon:.52f}")

1.0000000000000000000000000000000000000000000000000000
1.0000000000000000000000000000000000000000000000000000
1.0000000000000002220446049250313080847263336181640625
1: 1.0000000000000000000000000000000000000000000000000000
1 + eps/2: 1.0000000000000000000000000000000000000000000000000000
1 + eps: 1.0000000000000002220446049250313080847263336181640625
You can see that 1 + eps/2 is identical to 1, as far as Python is concerned. Any number that
is not exactly representable as in (1.2), is rounded to the closest number that is. This is
called a round-off error. The previously presented Python code shows that 1 + eps/2 cannot
be represented exactly, and thus we incur the maximum possible (relative) round-off error,
that is eps/2. Let us briefly explain why that is.
If we take positive floating point numbers with exponent e = 1024, i.e., of the form
x = (1.f) × 2,
then the smallest and largest floating point numbers in this case are 2 and (2 − 2^{−52}) × 2,
respectively. There are again 2^52 different floating point numbers equidistributed in the
interval [2, 4), and now the distance between two successive numbers is eps × 2 = 2^{−52} × 2 =
2^{−51} > eps. We can repeat the procedure for any positive floating point number, and in
particular for any (non-reserved) exponent e, 1 ≤ e ≤ 2046; the distance between two successive
numbers in the interval [2^{e−1023}, 2^{e−1022}) is eps × 2^{e−1023} = 2^{−52} × 2^{e−1023} < eps if e − 1023 < 0, and
eps × 2^{e−1023} = 2^{−52} × 2^{e−1023} > eps if e − 1023 > 0. A sketch showing the distribution
of the machine numbers with a 3-bit mantissa in [1/8, 8] is displayed in Figure 1.1.
As expected, we can see that (positive) floating point numbers are dense close to zero and
significantly sparser as we consider intervals of large positive numbers.
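The doubling of the gap from one such interval to the next can also be checked directly (a small sketch; np.spacing returns the distance from its argument to the next representable float):

import numpy as np

print(np.spacing(1.0))   # 2.220446049250313e-16   (= eps,   gap in [1, 2))
print(np.spacing(2.0))   # 4.440892098500626e-16   (= 2*eps, gap in [2, 4))
print(np.spacing(0.5))   # 1.1102230246251565e-16  (= eps/2, gap in [1/2, 1))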

Figure 1.1: Distribution of machine numbers with a 3-bit mantissa in [1/8, 8].

Given a number to be stored in the computer, it is rounded to its closest machine number.
If this number happens to be in the interval [1, 2), then the maximum absolute error
that occurs is eps/2. Similarly, we can find the maximum absolute error that occurs when a
number is stored in the computer, for every binary interval of the form [2^{e−1023}, 2^{e−1022}), to
be (eps · 2^{e−1023})/2. Does this mean that round-off errors become worse when storing large
numbers and better when storing positive numbers close to zero? Not quite! We do not care
about absolute errors. Instead, we care about relative errors. Let us distinguish the two.

Definition: Absolute and relative errors


Let x∗ be an approximation of a number x. Then

• |x − x∗| is called the absolute error for x;

• |x − x∗| / |x|, with x ≠ 0, is called the relative error for x.

Let fl(x) be the floating point representation of x. Then the absolute and relative
(round-off) errors for x are given by

|x − fl(x)|   and   |x − fl(x)| / |x|,

respectively.
Theorem 1.4.2 Let x ∈ [realmin, realmax] and let fl(x) be the floating point representation
of x obtained by rounding. Then

|x − fl(x)| / |x| ≤ eps/2.    (1.3)

In fact, |x − fl(x)| / |x| ≤ eps/2 is equivalent to fl(x) = x(1 + ε) for some ε ∈ [−eps/2, eps/2].
Proof We will not prove this result, but it is not hard to show!
Note: From the previous theorem we observe that no matter how large a number is (as-
suming it is within the allowed bounds), the worst-case relative round-off error is always the
same!
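As a sanity check of (1.3), one can compute the exact relative round-off error made when storing 0.1, using exact rational arithmetic (a small sketch based on the standard fractions module):

from fractions import Fraction

x    = Fraction(1, 10)    # the exact real number 0.1
fl_x = Fraction(0.1)      # the double that Python actually stores for 0.1

rel_error = abs(x - fl_x) / x
print(float(rel_error))             # about 5.55e-17
print(float(rel_error) <= 2**-53)   # eps/2 = 2**-53; prints True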
To illustrate the importance of round-off errors in practice, let us consider the following
two comparisons:

1 + 2 == 3
True

0.1 + 0.2 == 0.3


False

The above example illustrates that many real numbers (such as 0.1, 0.2 and 0.3) cannot be exactly
represented in the form given in (1.2), and thus we are forced to incur round-off errors when working
with such numbers. An extremely important take-away from this example is that you should
never compare real numbers in Python using logical operators like == (for instance in an if-
then statement). Instead, we must resort to checking whether two numbers are sufficiently
“close” to each other, as shown in the following example:
import math
math. isclose (0.1+0.2 ,0.3)
True

1.4.3 Arithmetic Errors


A direct consequence of round-off errors is the presence of arithmetic errors. For example,
we will see that basic algebraic properties like the distributive property do not hold when
performing calculations in the computer. To illustrate this, we will focus on positive real
numbers.

Example Let x, y ∈ [realmin, realmax]. What is the worst-case relative error of addition
on the computer?
Solution: Let fl(x) and fl(y) be the floating point representations of x and y, respectively. Then

fl(x) = x(1 + ε1) for some ε1 ∈ [−eps/2, eps/2],
fl(y) = y(1 + ε2) for some ε2 ∈ [−eps/2, eps/2].    (1.4)

Thus, fl(fl(x) + fl(y)) is the floating point representation of x + y. Therefore

fl(fl(x) + fl(y)) = (fl(x) + fl(y))(1 + ε3) for some ε3 ∈ [−eps/2, eps/2].

Using (1.4) and doing some basic computations on the above relation we get:

fl(fl(x) + fl(y)) = (x(1 + ε1) + y(1 + ε2))(1 + ε3)
                  = (x + ε1 x + y + ε2 y)(1 + ε3)
                  = x + ε1 x + y + ε2 y + ε3 x + ε3 ε1 x + ε3 y + ε3 ε2 y
                  = (x + y) + (ε1 + ε3 + ε3 ε1) x + (ε2 + ε3 + ε3 ε2) y.

Thus we can estimate the relative error as follows:

|fl(fl(x) + fl(y)) − (x + y)| / |x + y| = |(ε1 + ε3 + ε3 ε1) x + (ε2 + ε3 + ε3 ε2) y| / |x + y|
    ≤ [ x |ε1 + ε3 + ε3 ε1| + y |ε2 + ε3 + ε3 ε2| ] / (x + y)
    ≤ [ x (|ε1| + |ε3| + |ε3||ε1|) + y (|ε2| + |ε3| + |ε3||ε2|) ] / (x + y),

where in the last two inequalities we have used the triangle inequality and the fact that both x
and y are positive. Therefore

|fl(fl(x) + fl(y)) − (x + y)| / |x + y| ≤ [ x (eps/2 + eps/2 + eps^2/4) + y (eps/2 + eps/2 + eps^2/4) ] / (x + y)
    = (x + y)(eps/2 + eps/2 + eps^2/4) / (x + y),

from where we conclude that

|fl(fl(x) + fl(y)) − (x + y)| / |x + y| ≤ eps + eps^2/4.

From the above example, we can immediately see that basic algebraic properties are not
expected to hold.

Example Let x = 1001, y = −1 and z = 1/1000. The distributive property gives

z(x + y) = zx + zy.    (1.5)

Set w = z(x + y), w1 = zx, and w2 = zy. Assume that the arithmetic calculations for the
computation of w, w1, and w2 are performed exactly. Then, assume that we store w, w1, w2, and
we perform the addition w1 + w2 in the computer. Will (1.5) hold exactly?
Solution: Since w = z(x + y) = 1 and 1 is a machine number, we have

fl(w) = fl(1) = 1.

On the other hand, since w1 = zx = 1.001 and w2 = zy = −0.001, we obtain

fl(fl(w1) + fl(w2)) = fl(1.001(1 + ε1) − 0.001(1 + ε2))
                    = (1.001 + ε1 × 1.001 − 0.001 − ε2 × 0.001) × (1 + ε3)   for some ε1, ε2, ε3 ∈ [−eps/2, eps/2].

Therefore,

fl(fl(w1) + fl(w2)) = 1 + ε3 + ε1 × 1.001 − ε2 × 0.001 + ε3(ε1 × 1.001 − ε2 × 0.001),

which, in general, is not equal to 1 because εi ≠ 0 for i = 1, 2, 3.
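Such failures are easy to observe directly in Python (a quick sketch; these particular values are just an illustration): floating point addition is, for instance, not associative.

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))     # 0.6000000000000001 0.6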

We saw that addition of positive reals in the computer results in arithmetic errors. What
about subtraction or multiplication? We can show the following results (focusing on posi-
tive numbers, although the results can be extended to negative numbers as well): Assume
that x, y ∈ [realmin, realmax] and, for any number w, fl(w) represents its floating point
representation. Then

• As we already showed, the worst-case relative error for addition is eps + eps^2/4 (note
that since eps is small, the dominating term is eps).

• We can show that the worst-case relative error for subtraction is

|(x − y) − fl(fl(x) − fl(y))| / |x − y| ≤ (eps + eps^2/4) · (x + y)/(x − y).

• The worst-case relative error for multiplication is

|xy − fl(fl(x) fl(y))| / |xy| ≤ (3/2) eps + (3/4) eps^2 + (1/8) eps^3.

From the above results, we observe that addition and multiplication arithmetic errors are
both of the order of machine precision, i.e. O(eps). However, the worst-case relative error
for subtraction depends on the particular numbers we are subtracting! This is extremely
important and should be kept in mind when subtracting numbers that are close to each
other. Indeed, if x ≈ y, then 1/|x − y| can have a very large magnitude, and then the
arithmetic error of subtraction can be detrimental. This effect has a special name: it is
called catastrophic cancellation. Let us illustrate the potential problems it leads to in
Python:

Example Be careful when subtracting numbers of nearly equal size:


import math

x = 1e16
res = math.sqrt(x + 1) - math.sqrt(x)
print(f"Result: {res:.52f}")
Result: 0.0000000000000000000000000000000000000000000000000000

Can we do anything about this? In general, it is difficult to avoid catastrophic cancellation,


but there are two general tools we can use:

• If we have two numbers x, y with x > 1, y > 1 and x ≈ y we could use

x − y = (x^2 − y^2)/(x + y) = (x^3 − y^3)/(x^2 + xy + y^2) = · · · = (x^n − y^n)/(x^{n−1} + x^{n−2} y + · · · + x y^{n−2} + y^{n−1}).

We expect that the above different ways of writing x − y will produce stable formulas.
E.g., if x = 1.1 and y = 1.05 then x − y = 0.05, while x^100 ≈ 1.3781 × 10^4 and
y^100 ≈ 1.315 × 10^2, so x^100 and y^100 are no longer close to each other.

• On the other hand, it is often possible to use mathematical approximations to avoid
using subtraction altogether. For example, say we want to compute exp(x) − 1, for
x ≈ 0. This would lead to catastrophic cancellation, and the previous trick would also
not work. Instead, we can use the Taylor expansion of exp and write:
exp(x) − 1 = x + x^2/2 + x^3/3! + x^4/4! + · · · ,
noting that the more Taylor terms we use, the better the approximation we obtain.
That way, we have avoided using subtraction!
Let us see how the first of these tools would work in practice, to improve the loss of accuracy
we observed in the last Python code. We observe that

f(x) = √(x+1) − √x = (√(x+1) − √x)(√(x+1) + √x) / (√(x+1) + √x) ≡ 1/(√(x+1) + √x).

import math

x = 1e16
res = math.sqrt(x + 1) - math.sqrt(x)
alt_res = 1 / (math.sqrt(x + 1) + math.sqrt(x))

print(f"res:     {res:.52f}")
print(f"alt_res: {alt_res:.52f}")

res:     0.0000000000000000000000000000000000000000000000000000
alt_res: 0.0000000050000000000000001046128041506423633766331704
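For the second tool, here is a small sketch for exp(x) − 1 with x close to zero (the variable names are ours; math.expm1 is the standard-library routine that computes exp(x) − 1 accurately):

import math

x = 1e-12

naive  = math.exp(x) - 1            # suffers from catastrophic cancellation
taylor = x + x**2/2 + x**3/6        # first few Taylor terms, no subtraction
print(f"naive : {naive:.20e}")
print(f"taylor: {taylor:.20e}")
print(f"expm1 : {math.expm1(x):.20e}")   # library routine for exp(x) - 1

The naive result is only accurate to a few digits, while the Taylor sum and expm1 agree to full precision.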

Propagation of Errors in Function Evaluations. Finally, let us observe that round-off
errors also propagate when evaluating functions, and this can result in very significant
errors. To see this, let x∗ be an approximation of x and assume that we only know x∗ (e.g.
x∗ = fl(x)). Let δ(x∗) be an upper bound of the absolute error |x − x∗| (e.g. if x ∈ [1, 2)
and x∗ = fl(x), then δ(x∗) = eps/2), i.e.,

|x − x∗| ≤ δ(x∗).

Assume that f is a continuously differentiable function and that we are interested in finding
an upper bound for the absolute error |f(x) − f(x∗)|. From the Mean Value Theorem we
get:

f(x) − f(x∗) = f′(ξ)(x − x∗)   with ξ ∈ [min{x, x∗}, max{x, x∗}].

Let D = max_{t ∈ [min{x, x∗}, max{x, x∗}]} |f′(t)|; then

|f(x) − f(x∗)| ≤ D |x − x∗| ≤ D δ(x∗).

Thus D δ(x∗) is an upper bound for |f(x) − f(x∗)|. In many cases this bound gives pretty
good estimates for the error in f(x). However, the above estimate can be crude, especially
in cases where |f′| varies a lot between x and x∗.
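As an illustration (a sketch with made-up numbers, not taken from the notes): take f(x) = exp(x), x = 1 and a perturbed value x∗ = 1 + 10^{−4}, so that δ(x∗) = 10^{−4}; on [x, x∗] we have D = exp(x∗), and the observed error indeed stays below D·δ(x∗):

import math

x, delta = 1.0, 1e-4
xstar = x + delta                             # perturbed input, |x - xstar| <= delta

err   = abs(math.exp(x) - math.exp(xstar))    # actual error in f
bound = math.exp(xstar) * delta               # D * delta, with D = max |f'| = exp(xstar)

print(err, bound, err <= bound)               # the last value prints True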

1.4.4 Summary
Now that we have given a short overview of sources of errors in numerical computations, it
is important to mention some notation that will be used throughout this module to specify
the “order of magnitude” of given errors.

Definition: Big-O notation

Given two positive functions T : (0, ∞) → (0, ∞) and f : (0, ∞) → (0, ∞) we say that
T (x) = O(f (x)) if and only if there exist constants c > 0 and some x0 > 0 such that

T (x) ≤ cf (x), ∀ x ≥ x0 .

The Big-O notation is very useful, because it compares two functions by only considering
their dominating terms. For example, let eps be a small number (e.g., representing the
machine precision). We saw that the worst-case relative error from addition in a computer
is T(eps) = eps + eps^2/4. Let f(eps) = eps. Then, based on the Big-O definition, and given
that eps is assumed to be small (≪ 1), we can easily show that

T(eps) = O(f(eps)),   i.e.   eps + eps^2/4 = O(eps)

(indeed, for all eps ≤ 1 we have eps + eps^2/4 ≤ (5/4) eps, so the constant c = 5/4 works).

It is important to keep this notation in mind, since it will be very useful when quantifying
errors of various numerical computations, and will be useful in comparing the accuracy of
different numerical methods (in the grand scheme of things).
In general, not all algorithms suffer significantly from proliferation of machine errors;
sometimes errors stay small throughout the execution of an algorithm. However, sometimes
they do become important, and constitute the main limitation to how precise our numerical
result can be. In every calculation that we do, it is important to keep in mind machine errors
that can be introduced, even though they often turn out to be much less important than
errors due to approximations made by the numerical method itself.

Practical take-aways:

1. Use math.isclose to compare floating point (i.e., real) numbers.

2. There are several forms of numerical errors (e.g., round-off, arithmetic, algorithmic, or
stochastic), each of which might require different treatment.

3. Be careful when subtracting numbers close to each other (i.e., beware of catastrophic
cancellation).

4. In cases where round-off or arithmetic errors can be very problematic, we can go beyond
the default 64-bit floating point accuracy of Python. We will later see the packages mpmath
and decimal, which provide better accuracy.

Chapter 2
Solution of a Single Non-linear Equation

In this chapter we study the Newton method for the solution of a single non-linear equation.
A general non-linear equation reads as
f (x) = 0, (2.1)
for some function f . For the Newton method we assume that this function is differentiable
and the derivative f ′ is known (later we discuss how to build various approximations for the
derivative numerically, but in this chapter we assume that the derivative is known analyti-
cally).

2.1 Newton Method


The idea is very simple - given some starting point x0 together with the function value and
its derivative at this point, f (x0 ) and f ′ (x0 ), we can approximate the initial function f (x)
by the first two terms of the Taylor expansion:
f(x) ≃ f(x0) + f′(x0)(x − x0) ≜ f̂(x).    (2.2)

Strictly speaking this approximation is only good for x's such that |x − x0| is small; for in-
stance, if the function is twice-differentiable, the error in this approximation is O((x − x0)^2).
The great advantage of this approximation is that the equation fˆ(x) = 0 can be immediately
solved analytically to give
0 = f̂(x) = f(x0) + f′(x0)(x − x0) ≃ f(x)   ⇒   x = x0 − f(x0)/f′(x0).    (2.3)

If the approximation (2.2) were exact (i.e., if f̂(x) ≡ f(x)), then x0 − f(x0)/f′(x0) would give us the
exact solution of the equation (2.1). However, f̂(x) is only an approximation of the nonlinear
function f(x), and so x1 = x0 − f(x0)/f′(x0) is only an approximate zero of f(x). Under favourable
conditions, x1 will be closer to the true solution of our equation (2.1), in comparison to the
starting point x0 . Then we can use x1 as a starting point for the next iteration to get an
even better approximation x2 , and so on, until we reach a satisfactory approximate solution.
In other words, we will generate the series of numbers xn defined by
xn = xn−1 − f(xn−1)/f′(xn−1),    (2.4)

assuming that f ′ (xn−1 ) ̸= 0, each time (hopefully) approaching closer and closer the true
solution of the equation (2.1) (see the figure below).

When the sequence of numbers xn is convergent, it converges very fast; the number of
exact digits doubles on each iteration (see the table below).

When the number of exact digits doubles on each iteration, then the algorithm is said to be
quadratically convergent (below we discuss more precisely the notion of rate of convergence
of an algorithm).
One should note that we are only guaranteed to get a convergent series if the starting
point x0 is already relatively close to the exact solution. One could be particularly unlucky
by picking the starting point such that f ′ (x0 ) is close to zero, in which case x1 will be huge
and the series may not converge. There is always an element of art to choosing a good
starting point x0 . Depending on that choice, you can either get the result in a few steps
with enormous precision or find a useless divergent series.

2.1.1 Python Code for the Newton Method


We begin the Python code by defining our function f(y) = y^3/3 − 2y^2 + y − 4:

def f(y):
    return y**3/3 - 2*y**2 + y - 4

and its derivative df :


def df(y):
    return y**2 - 4*y + 1

Next, we build the sequence xn (stored in the list type of Python). We begin from a trivial
sequence, containing only the starting point: xs = [x0]. Then we append new points to this
list, and we repeat up to a certain number of steps (in this case, 10).
def newton(f, df, x0):
    xs = [x0]
    for i in range(10):
        xs.append(xs[-1] - f(xs[-1])/df(xs[-1]))
    return xs[-1]

To test how this works we run


newton (f,df ,8.)

where we’ve chosen the starting point to be 8. Try it out with other starting points, and
other numbers of steps.

Exercise 2 It is also a good exercise to try it with a different condition for the stopping point:
instead of a fixed number of steps, stop the loop at the step n for which |f (xn )| < 10−10 (thus
when we are indeed very near to a zero).

2.1.2 Newton Fractals†1


We note that the derivation above for the main equation of the Newton method holds for
reals as well as complex numbers. So the recursion (2.4) is valid in the complex setting, and
we can expect that the method will be able to find complex zeros as well.
For instance, for a generic starting point x0 in the complex plane, the iteration procedure
of the Newton method should converge, for the case of f(y) = y^3/3 − 2y^2 + y − 4, to one of
the three roots

5.83821 , 0.080896 + 1.43139i , 0.080896 − 1.43139i .

We can associate a colour with each of these 3 possibilities. Then, the picture of the fractal is
obtained in the following way: for each point x0 of the complex plane we build the series xn,
and if it converges to one of the roots we paint this point in the corresponding colour.
¹ This section is not examinable, but it is recommended to study it. It's also very beautiful!
To make the picture more beautiful one can also change the intensity of the color depending
on how many iterations are needed in order to reach some fixed precision. The picture
obtained in this way is called the Newton fractal.
The code below created this picture:

Python Code for the Newton Fractal

# create an empty image
from PIL import Image
sizex = 700; sizey = 700
image = Image.new("RGB", (sizex, sizey))

# drawing area
xmin = 0.0; xmax = 6.0
ymin = -3.0; ymax = 3.0

# there are 3 possible roots
roots = [complex(5.83821, 0),
         complex(0.080896, 1.43139),
         complex(0.080896, -1.43139)]

# draw the fractal
for i in range(sizey):
    zy = i * (ymax - ymin) / (sizey - 1) + ymin
    for j in range(sizex):
        zx = j * (xmax - xmin) / (sizex - 1) + xmin
        z = complex(zx, zy)
        n = 0
        while abs(f(z)) > 1/10**10:
            n = n + 1
            z = z - f(z) / df(z)

        # choose the colour depending on the number of iterations n
        # and on the root
        if abs(z - roots[0]) < 1/10**5:
            image.putpixel((j, i), (255 - n % 32 * 8, 0, 0))
        elif abs(z - roots[1]) < 1/10**5:
            image.putpixel((j, i), (0, 255 - n % 32 * 8, 0))
        elif abs(z - roots[2]) < 1/10**5:
            image.putpixel((j, i), (0, 0, 255 - n % 32 * 8))

# save the result to a file
image.save("fractal.png", "PNG")

2.2 Rate and Order of Convergence


Usually numerical methods require several iterations to give the result with good precision.
Sometimes it can take a really long time to get to the result due to a slow convergence rate.
To quantify the notion of convergence of a sequence xk of numerical approximations of some
exact quantity x∗, we define the q-convergence rate as follows.

Definition: Rate of convergence of a sequence

Suppose that a sequence {xk}_{k=0}^∞ converges to some point x∗. The sequence is said to
converge to x∗ with order q (where q ≥ 1) if

lim_{k→∞} |xk+1 − x∗| / |xk − x∗|^q = µ

for some finite nonzero rate of convergence 0 < µ < ∞. Moreover, if the order of
convergence is q = 1, then the rate must be such that µ ∈ (0, 1).

The above definition quantifies the speed of convergence of a sequence (i.e., the larger q
is, the faster the convergence). A sequence with convergence order q = 1 is called linearly
convergent, for q = 2 quadratically convergent, for q = 3 cubically convergent, and so on.
To give some intuition, we see that if

xk = x∗ + A exp(b q^k)

for some A ∈ R, q > 1 and b < 0, then clearly lim_{k→∞} xk = x∗, and

|xk+1 − x∗| / |xk − x∗|^q = |A|^{1−q} exp(b q^{k+1} − b q·q^k) = |A|^{1−q} ≜ µ.    (2.5)

So this sequence is q-convergent. In other words, q-convergence means that log|xk − x∗|
tends to negative infinity exponentially fast in k (like q^k plus subleading terms). Since
−log10 |xk − x∗| is essentially the number of leading digits, in base 10, of xk that agree with
those of x∗, at every step k → k + 1 the number of exact leading digits is multiplied by q.
Remark: Order of convergence q = 1
As mentioned in the definition, the case where q = 1 does not guarantee that the
sequence xk converges to x∗ linearly, unless 0 < µ < 1. The condition on µ in this case
is simple to see: assume, for instance, that x∗ = 0 and xk > 0 for all k (for example,
take xk = 1/2^k). Then, if the order of convergence is q = 1, we have

lim_{k→∞} xk+1/xk = µ   ⇒   xk ∼ A µ^k as k → ∞, for some A

(for the implication: just solve the recursion xk+1/xk = µ by taking the product ∏_{k=a}^{K}
on both sides, for some fixed a). Obviously, this converges if 0 < µ < 1, and diverges
if µ > 1. Note, from this analysis, that “linear convergence” means that the speed at
which the sequence approaches the point of convergence is exponentially fast. For
example, the sequence xk = 2^{−k} is linearly convergent towards 0.

The case µ = 1 is called “1-sublinear convergence”. An example is a convergence speed
that is controlled by a power law, for instance xk = k^{−2}, in which case you can see
that lim_{k→∞} xk+1/xk = 1. The case µ = 0 is called “1-superlinear convergence”, and
lies in-between linear (q = 1) and any q-convergence with q > 1. One example of such
a sequence is xk = k^{−k}.

2.2.1 Convergence of the Newton method


To carry out a simplified convergence analysis of the Newton method, we assume that $f$ is twice differentiable. Furthermore, assuming we are already close to the actual zero of the function, which we denote as $x^*$, we can make a better approximation of the function $f$ by including the quadratic term in the Taylor series, i.e.
$$f(x) \simeq 0 + f'(x^*)(x - x^*) + \frac{1}{2}f''(x^*)(x - x^*)^2 + O\big((x - x^*)^3\big). \qquad (2.6)$$
Using this approximation we can estimate the errors $x_n - x^*$ and $x_{n-1} - x^*$ for two consecutive iterations, by taking the second-order Taylor expansion of $f(x_{n-1})$ and $f'(x_{n-1})$ at $x^*$ while truncating any third-order terms, and substituting these in the Newton update as shown below:
$$x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})} \simeq x_{n-1} - \frac{f'(x^*)\epsilon_{n-1} + \frac{1}{2}f''(x^*)\epsilon_{n-1}^2}{f'(x^*) + f''(x^*)\epsilon_{n-1}}$$
where $\epsilon_m \equiv x_m - x^*$. Using the general formula $1/(1+x) \simeq 1 - x$ for $|x|$ small, we can also write it as
$$\epsilon_n = \epsilon_{n-1} - \epsilon_{n-1}\,\frac{1 + \frac{1}{2}\frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}}{1 + \frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}} \simeq \epsilon_{n-1} - \epsilon_{n-1}\left(1 + \frac{1}{2}\frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}\right)\left(1 - \frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}\right)$$
$$= \epsilon_{n-1} - \epsilon_{n-1}\left(1 + \frac{1}{2}\frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1} - \frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}\right) \qquad (2.7)$$

where on the right-hand side the error introduced by truncating the Taylor series expansions is $O(\epsilon_{n-1}^3)$. We finally get
$$\epsilon_n = \frac{1}{2}\frac{f''(x^*)}{f'(x^*)}\epsilon_{n-1}^2 + O(\epsilon_{n-1}^3) \;\Rightarrow\; \lim_{n\to\infty}\frac{\epsilon_n}{\epsilon_{n-1}^2} = \frac{1}{2}\frac{f''(x^*)}{f'(x^*)} \triangleq \mu. \qquad (2.8)$$

If $f'(x^*) \neq 0$ and $f''(x^*) \neq 0$, this means that the method has a quadratic convergence rate. As mentioned earlier, in practice this means the following: if after $n-1$ iterations we got the result with, e.g., an error $\epsilon_{n-1} \sim 10^{-5}$, then just by doing one extra iteration we obtain the numerical error $\epsilon_n \sim 10^{-10}$, which means that we double the number of exact digits at each iteration.

The Degenerate Case. In the above convergence analysis we had to assume that $f'(x^*) \neq 0$. As the derivative appears in the denominator of (2.8), our conclusion no longer holds in the case when $f'(x^*) = 0$; the method might still work and it might indeed produce a zero of the nonlinear equation, but its convergence analysis is different. For $f'(x^*) = 0$ and $f''(x^*) \neq 0$, instead of (2.7) we get
$$\epsilon_n \simeq \epsilon_{n-1} - \epsilon_{n-1}\,\frac{\frac{1}{2}f''(x^*)}{f''(x^*)}$$
or
$$\epsilon_n \simeq \frac{1}{2}\epsilon_{n-1} \;\Rightarrow\; \lim_{n\to\infty}\frac{\epsilon_n}{\epsilon_{n-1}} = \frac{1}{2} \triangleq \mu \in (0, 1)\,.$$
This means it indeed still converges, but the convergence is very slow in this case. At each iteration we decrease the error $\epsilon_n$ by a factor of $1/2$ only, i.e. we will need at least 4 iterations just to gain one more exact digit in the estimation of $x^*$.
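As a concrete illustration of this degenerate case (an added example, easy to verify by hand): for $f(x) = x^2$, which has a double root at $x^* = 0$ so that $f'(x^*) = 0$, the Newton update gives exactly
$$x_n = x_{n-1} - \frac{x_{n-1}^2}{2x_{n-1}} = \frac{1}{2}\,x_{n-1},$$
so the iterates approach $0$ only linearly, with rate $\mu = 1/2$, in agreement with the analysis above.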

Exercise 3 Work out what happens with the convergence order and rate if f ′ (x∗ ) = f ′′ (x∗ ) =
0 and f ′′′ (x∗ ) ̸= 0, etc.

2.3 Secant Method


The Newton method described above converges extremely fast; however, it requires knowledge of the derivative of the function $f(x)$. In some cases this may not be easily available. In the secant method we again build a linear approximation for the function $f(x)$, this time using two points $x_{n-1}$ and $x_{n-2}$:
$$f(x) \simeq f(x_{n-2})\,\frac{x - x_{n-1}}{x_{n-2} - x_{n-1}} + f(x_{n-1})\,\frac{x - x_{n-2}}{x_{n-1} - x_{n-2}}\,.$$
It is easy to verify that the r.h.s. of the above expression:

1. is a linear function in x

2. takes the value f (xn−1 ) at x = xn−1

3. takes the value f (xn−2 ) at x = xn−2 .

Applying the same logic as in the Newton method, we find $x_n$ as an exact zero of this linear approximation:
$$f(x_{n-2})\,\frac{x_n - x_{n-1}}{x_{n-2} - x_{n-1}} + f(x_{n-1})\,\frac{x_n - x_{n-2}}{x_{n-1} - x_{n-2}} = 0$$
which gives
$$x_n = \frac{x_{n-1} f(x_{n-2}) - x_{n-2} f(x_{n-1})}{f(x_{n-2}) - f(x_{n-1})} \qquad (2.9)$$

One can notice that the convergence rate of this method is slightly slower than that of the Newton method. It required 11 iterations (vs. 8 for the Newton method) to get the result with 90 digits of accuracy, as we can see from the table below. One can even estimate that the number of precise digits increases roughly by a factor of 3/2 (from 12 digits to 20 digits) per iteration. For the Newton method the number of precise digits doubled.

2.3.1 Python Code for the Secant Method
Let us implement the secant method. We use the same function as before, i.e. $f(y) = y^3/3 - 2y^2 + y - 4$.
def f(y):
    return y**3/3 - 2*y**2 + y - 4

However, this time we do not need to define its derivative function.


We use basically the same code as for the Newton method, with a few modifications. First, we add the extra starting point $x_{-1}$ and remove df from the list of arguments. Second, we have to modify the recursion relation, imposing (2.9):
def secant(f, x0, xm1):
    x = [xm1, x0]
    for i in range(10):
        x.append((x[-1]*f(x[-2]) - x[-2]*f(x[-1])) /
                 (f(x[-2]) - f(x[-1])))
    return x[-1]

To test how this works we run


secant(f, 8., 7.5)

Try this on your Python notebook, and see what happens when you change the values!

Exercise 4 Again, try it with a different condition for the stopping point: instead of a fixed
number of steps, stop the loop at the step n for which |f (xn )| < 10−10 .

Exercise 5 Run the secant method code given above, but allow it to run for 100 iterations.
What do you observe? Is there an issue, and if so, how can it be fixed?
Hint: To understand the problem, it might be useful to revise the floating point arithmetic
used within a computer (and thus Python as well).

2.4 Bisection Method (a.k.a Lion Hunting)


This method is applicable for solving the equation f (x) = 0 for a real variable x for any
continuous real-valued function f (x) on an interval [a, b] when f (a)f (b) is negative (i.e. f (a)
and f (b) have opposite sign). In this case, according to the intermediate value theorem,
the continuous function takes all values (f (a), f (b)) inside the interval so in particular the
interval must contain a point x∗ where f (x∗ ) = 0. Our task is to find this point.
For Newton and secant methods it was crucial that the function was sufficiently smooth
and could be approximated well by its Taylor series in the vicinity of each of the points
considered. An example where this assumption fails and the Newton method may never
converge is the function
$$f(x) = \begin{cases} x + 1 & (x > 1/2) \\ x - 1 & (x < -1/2) \\ 3x & (\text{otherwise}). \end{cases} \qquad (2.10)$$

Figure 2.1: Plot of the function f defined in (2.10)

Indeed, with both the Newton and secant methods, if we start at $x_0 = 1$ we will get $x_1 = -1$, then $x_2 = 1$ again, and the iteration will continue to flip-flop forever.

Exercise 6 Check the above statement: calculate analytically the sequences obtained by using
the Newton and Secant methods starting at x0 = 1 for the above function.

The Bisection Method presented here will work even in this situation. Let us describe
the method: at each step we divide the interval in two by computing its midpoint, i.e.
c = (a+b)/2, and the value of the function at the midpoint f (c). Next we have 3 possibilities

1. f (c) = 0, then we have found the zero x∗ = c

2. f (c)f (a) < 0, then we can be sure that zero is between a and c, in which case we
repeat the procedure for the interval [a, c]

3. f (c)f (a) > 0 (which implies that f (c)f (b) < 0), and then zero is hidden between c and
b, in which case we repeat the procedure again for [c, b]

Note that at each iteration we reduce the search area for our zero by a factor of 2.

2.4.1 Bisection Method Implementation


def bisection(f, a, b):
    for i in range(10):
        c = (a + b)/2  # find mid point
        if abs(f(c)) < 1e-10:
            print("found zero", c)
            return c
        elif f(a)*f(c) < 0:
            b = c
        else:
            a = c
        print("a=", a, "b=", b, "f(a)=", f(a))

To use it we define some function:
def f(x):
    return x**2 - 2

and then feed it as an argument to our bisection method


bisection(f, 0, 2)

a= 1.0 b= 2 f(a)= -1.0
a= 1.0 b= 1.5 f(a)= -1.0
a= 1.25 b= 1.5 f(a)= -0.4375
a= 1.375 b= 1.5 f(a)= -0.109375
a= 1.375 b= 1.4375 f(a)= -0.109375
a= 1.40625 b= 1.4375 f(a)= -0.0224609375
a= 1.40625 b= 1.421875 f(a)= -0.0224609375
a= 1.4140625 b= 1.421875 f(a)= -0.00042724609375
a= 1.4140625 b= 1.41796875 f(a)= -0.00042724609375
a= 1.4140625 b= 1.416015625 f(a)= -0.00042724609375

We see that the convergence is very slow and requires lots of iterations to give decent precision.

Exercise 7 Derive analytically the convergence order and rate of the bisection method. More
precisely, establish an upper bound on how fast we can approach the true zero, by using your
knowledge of how the interval changes.

2.5 Numerically Estimating Convergence Orders


Sometimes it is hard to establish analytically the convergence order and rate of a given
algorithm. However, this can be estimated numerically, by testing the algorithm on various
examples, and analysing the numbers obtained. For this, it is convenient to reduce, as much
as possible, the error introduced by the finite machine precision (see (1.2)). Note, however,
that the estimated convergence rate might only be applicable to certain examples and not
others!
In order to minimize round-off errors for this purpose, in Python one can easily use
multiple precision mathematics by loading the package mpmath. Obviously, the higher the
precision we use, the more memory intensive and slow our code will be.
# load the multiple precision library
import mpmath as mp
# setting precision
mp.mp.dps = 100
print(mp.pi)
3.141592653589793238462643383279502884197169399375105820974944592

We can use this enormous precision to study experimentally the convergence rates of var-
ious algorithms. First we adjust the Newton method code to work with arbitrary precision.
Note that here and below, in order to use the high precision, we need to make sure that the variables used in the algorithm are high-precision numbers, which we do by using the mpf method from mpmath.
def newtonHP(f, df, x0, precision):
    n = 0
    x = [mp.mpf(x0)]
    while (abs(f(x[-1])) > 10**(-precision)) and (n < 100):
        # an alternative way of writing n = n + 1
        n += 1
        x.append(x[-1] - f(x[-1])/df(x[-1]))
    return x
To verify the quadratic convergence rate of the Newton method we compute the number of exact digits as $-\log_{10}|x_n - x^*|$, divide it by $2^n$, and plot the result (see Section 2.2). Here we use the function $f(x) = x^2 - 3$ as a simple example, so that we know the exact solution $x^* = \sqrt{3}$. As a starting point, we choose $x_0 = 7.5$.
import matplotlib.pyplot as plt

# the test function and its derivative for this example
def f(x):
    return x**2 - 3
def df(x):
    return 2*x

x0 = 7.5
result = newtonHP(f, df, x0, 50)

numberOfDigitsOverqSquare = []
ns = []
n = 0
for x in result:
    n += 1
    ns.append(n)
    numberOfDigitsOverqSquare.append(
        float(-mp.log(abs(x - 3**mp.mpf(0.5)))/mp.log(10))/2**n
    )
plt.plot(ns, numberOfDigitsOverqSquare, "rx",
         label="Precision Convergence")
plt.xlabel('Iteration Number')
plt.ylabel('Number of Digits of Precision / 2^n')
plt.title('Convergence rate of Newton\'s Method')
plt.legend()
plt.show()
From the associated figure, we see that $\lim_{n\to\infty} -\log_{10}|x_n - x^*|/q^n = \text{const}$ ($\approx 0.1$ here) with $q = 2$. This implies that the Newton method exhibits a local quadratic convergence rate in this case, confirming our analytical proof given in Section 2.2.1.

Repeating the same exercise for the secant method, we could confirm that in this case $q = \frac{1}{2}\left(1 + \sqrt{5}\right) \approx 1.6180$. As expected, the secant method has a slower convergence rate compared to the Newton method. In this way, we are generally able to work out how powerful our algorithms are (albeit for specific examples), and subsequently estimate the number of iterations needed to reach some desired precision!

Exercise 8 Numerically estimate the order of convergence for the secant and bisection
method by adapting the above code. Check that you get q ≈ 1.6180 for the secant method,
and that your answer for the bisection method agrees with what you obtained in Exercise 7.

Chapter 3
Approximation of Functions by Polynomials

There are a number of reasons why we would want to approximate complicated functions
using simpler functions. Solving various equations numerically can give very good results
which are frequently not accessible using other methods (such as analytical exact solutions).
But in order to make it tractable for the computer, an exact equation is usually replaced by
a “discretised” equation: the functions involved are replaced by piecewise-simpler functions
(e.g., piecewise-constant, piecewise-linear, piecewise-polynomial, etc.), or, more generally, by
a simpler function within a family of functions, with a finite number of parameters to be
determined. This discretised or simplified equation is designed to give a good approximation
but is not equivalent to the initial problem (e.g., system of nonlinear equations). It is thus
of utmost importance to have a way of deriving such approximations, while also knowing
how precise these are.
Another general way of seeing this problem is the following: given a set of data points,
perhaps coming from experimental observations, we would like to find the closest “natural”,
or smooth enough, function that these data points might represent. We will see later how
discretisation works. But in this chapter, we will instead consider approximations by a given
simple family of functions: namely, polynomials.

Figure 3.1: Two types of the polynomial approximation. Left: interpolation when the
polynomial goes exactly through the data points (red dots). Right: fitting (noisy) data with
a polynomial.

The problem of polynomial approximation can be formulated in the following way: Given

a set of data points, say
$$\begin{array}{ccccc} x_0 & x_1 & x_2 & \cdots & x_n \\ y_0 & y_1 & y_2 & \cdots & y_n \end{array}\,,$$
find the best polynomial which describes this data. Note that there is always a unique
polynomial of degree n which goes through all these points (assuming that the xi ’s are
pairwise distinct). However, sometimes the data may contain significant measurement noise
and instead one needs to find a smooth function which goes between the points, as shown
on the right-hand side of Figure 3.1.

3.1 Lagrange Interpolation


As we already mentioned, it is always possible to construct a unique polynomial of degree
n which passes through n + 1 data points (can you see why?). The Lagrange interpolation
formula gives an explicit expression for this polynomial:
$$P_n(x) = \sum_{i=0}^{n} y_i\, p_i(x)$$

where $p_i(x)$ is the basic building block for the interpolation. The $p_i(x)$'s are fully fixed by the following conditions: they should each be polynomials of degree at most $n$, and they should satisfy $p_i(x_i) = 1$ and $p_i(x_j) = 0$, for all $j \neq i$. With these conditions, $P_n(x)$ is a polynomial of degree at most $n$, and satisfies $P_n(x_i) = y_i$, for all $i$.
It is not hard to show that
$$p_i(x) = \prod_{j=0,\ j\neq i}^{n} \frac{x - x_j}{x_i - x_j}\,. \qquad (3.1)$$

It is indeed easy to see that pi (xi ) = 1 and pi (xj ) = 0, for j ̸= i, and that these are
polynomials of degree n. For example, for xi = i those pi (x) are depicted below

3.1.1 Python Implementation of the Lagrange Interpolation


First we implement the basis functions pi (x). This depends on all xi ’s, and so we must pass
these as function variables as well, which we do using the list xs= {xi }.

def pbasis(x, i, xs):
    xi = xs[i]
    res = 1
    for xj in xs:
        if xj != xi:  # xj not equal to xi
            res *= (x - xj)/(xi - xj)  # this is the same as
                                       # res = res*(x - xj)/(xi - xj)
    return res

This implements the multiplication of all factors in the product over j in (3.1) with a loop,
where the “result” res starts with value 1, and is multiplied by the correct factor for all
j. Then the Lagrange Interpolation is simply a sum over all points yi times the basis
polynomials. The function must have as input not only the independent variable x, but also
both data sets: the xi ’s and the yi ’s, stored in the lists xs and ys (note: len is a method
that gives the length of a list or array):
def lagrange(x, ys, xs):
    res = 0
    for i in range(len(ys)):
        res += ys[i]*pbasis(x, i, xs[:])
    return res

Then the usage is the following


xs = [1, 2, 3, 4, 5, 6]
ys = [0., 0.841471, 0.909297, 0.14112, -0.756802, -0.958924]
lagrange(1.5, ys, xs)

Remark: Alternative implementation of lagrange


In general, when writing Python code it is a good practice to use the built-in functions
of Python. This can lead to more efficient but also more succinct and readable code.
To showcase this, we observe that the lagrange function above can be equivalently
written by utilizing the sum function of Python:
def lagrange(x, ys, xs):
    return sum(ys[i]*pbasis(x, i, xs)
               for i in range(len(ys)))

In order to check how it works we can reproduce the left graph in Figure 3.1 above by using
the pyplot library:
import numpy as np
import matplotlib.pyplot as plt
xdense = np.arange(0, 7, 0.1)
plt.plot(xs, ys, 'ro')
plt.plot(xdense, lagrange(xdense, ys, xs))
plt.show()

3.1.2 Limitations of Polynomial Interpolation
The polynomial interpolations usually work very well in the middle of the interpolation
interval, but could produce unwanted oscillations at the ends of the interval, as shown in the
graph below:

One of the ways to reduce the oscillations is to increase the number of points:

However, this also makes the interpolation slower and requires more resources to store the
information. A more efficient way is to choose points xi so that they are more dense at the
ends of the interval. One of the possibilities is to take points $x_i$, called Chebyshev points or nodes, such that $x_i = \frac{a+b}{2} - \frac{a-b}{2}\cos\!\left(\pi\,\frac{i+1/2}{n+1}\right)$, $i = 0, \ldots, n$, for the interval $[a, b]$. These are
the roots of the Chebyshev polynomial of the first kind (rescaled to the interval [a, b] instead
of the conventional [−1, 1]). The important property of these points is that they are more
concentrated near the edge of the interval. One can think of this as projecting values of the
cos function taken at regular angles, onto the x-axis. The mathematical origin of this choice
is explained in Section 3.3. This formula already gives a much better result:
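In code, a minimal sketch for generating these nodes (an addition to the notes; it assumes the lagrange function from Section 3.1.1 and numpy imported as np) could look as follows:

import numpy as np

def chebyshev_nodes(a, b, n):
    # the n+1 Chebyshev points on [a, b] from the formula above
    i = np.arange(n + 1)
    return (a + b)/2 - (a - b)/2*np.cos(np.pi*(i + 1/2)/(n + 1))

xs = list(chebyshev_nodes(0, 6, 10))
ys = [np.sin(x) for x in xs]
# lagrange(x, ys, xs) from Section 3.1.1 can now be evaluated
# on a dense grid and plotted exactly as before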

3.2 Fitting Data
Fitting is another way of approximating a function from some data points. The curve of
best fit does not need to pass exactly through the data points and as such can be used to
smooth the data and clean it from potential errors/noise.

3.2.1 Least Squares Fit


One of the ways of making a good fit is to take a linear ansatz

f (x) = a0 f0 (x) + a1 f1 (x) + · · · + am fm (x),

where fi (x) is some predetermined basis of m+1 linearly independent functions. An example
of basis functions are the monomials fi (x) = xi , giving a polynomial fit. But other bases are
possible; in fact any basis of linearly independent functions works.
Our next task is to find the “optimal” coefficients ai so that the fit function f (x) passes
as close as possible to the data points yi . Obviously, in the extreme case when m = n (where
n + 1 is the number of the points, labelled from 0 to n inclusively) we should reproduce an
exact interpolation; for instance, with the choice of monomials fi (x) = xi , this will reproduce
the polynomial interpolation from the previous section. In general, we assume m ≤ n and so
we want to get the closest possible approximation of the data by means of a smaller number
of parameters.
A useful measure for the quality of the approximation is given by the sum of squares of deviations
$$S = \sum_{i=0}^{n}\big(y_i - f(x_i)\big)^2 = \sum_{i=0}^{n}\left(y_i - \sum_{j=0}^{m} a_j f_j(x_i)\right)^2.$$

We have to find the values of the coefficients $a_j$ for which $S$ is minimal, i.e. we have to require
$$\frac{\partial S}{\partial a_k} = 0\,, \quad k = 0, \ldots, m.$$
More explicitly, the previous equation reads as
$$\frac{\partial S}{\partial a_k} = -2\sum_{i=0}^{n}\left(y_i - \sum_{j=0}^{m} a_j f_j(x_i)\right) f_k(x_i) = 0\,, \quad k = 0, \ldots, m\,. \qquad (3.2)$$

By inspection, we see that this is a linear system of equations in the variables ai ; remember
that these are the variables we are trying to solve for. That is, we can write this system of
equations as
Aa = b .
The matrix A is obtained by inspection of (3.2): the matrix element Akj is obtained by
looking at the equation ∂S/∂ak = 0, and in it, the coefficient of aj . Similarly, the vector
b comes from the equation ∂S/∂ak = 0, by collecting the terms that do not depend on the
aj ’s. This gives
$$A_{jk} = \sum_{i=0}^{n} f_j(x_i) f_k(x_i), \qquad b_k = \sum_{i=0}^{n} f_k(x_i)\, y_i\,.$$

In the particular case of $f_j(x) = x^j$, we obtain
$$A_{kj} = \sum_{i=0}^{n} x_i^{j+k}, \qquad b_k = \sum_{i=0}^{n} x_i^{k}\, y_i\,.$$

These are the main objects of the polynomial approximation.

3.2.2 Python Code for Polynomial LS Fit


Note that in the Python code written below (specifically, the function findas), we utilize
various ways of dealing with arrays and lists. For instance, if x is a list, then x∗2 is a list
twice as long where the elements repeat. We use this below in [0.]∗(m+1), for instance, to
create a list of m + 1 0’s. We also use lists of lists; for instance [[1,2,3]]∗(m+1) makes a
list made of m + 1 elements, each of which is the list [1,2,3].
But if y is a numpy array, then y∗2 is an array of the same length, where all elements
are multiplied by 2. With arrays, there are also many mathematical operations we can do,
for instance y∗y is the array where all elements are multiplied by themselves, y∗∗3 is that
where all elements are taken to the third power, and y∗∗y is that where all elements are
exponentiated to the power of themselves. Try it on your notebook!
Note that in Python, when dealing with matrices and vectors it is useful to store them as numpy arrays, as this allows us to perform mathematical operations on them very easily.
In the example below, for instance, we use the method np.array to transform the list of
lists of zeros into a matrix the entries of which are all zeros. This command differs from
the np.asarray command used before it, which also converts a quantity into a numpy array.
The difference between the two is that the latter does not create a new numpy array if its
input is already a numpy array (i.e., it is useful for ensuring that the input provided from the
user is indeed what we expected; a numpy array. If not, we mitigate any issues that would
possibly arise in the case where the user provides an incorrect input type, such as a list).
One subtle point is worth mentioning here. In, for instance, np.array([0.]∗(m+1)),
we write 0. instead of 0. What does this mean? This simply tells Python that the array
will be filled with 0’s that are to be seen as floating point numbers, and not integers. Using
only 0, mathematical operations on the elements of the array would be confined within the
integers, in which case, for example “1/2 = 0” (that is, the computer would simply take
the integer part of the result). This could result in significant errors. In general, Python
tends to prefer storing a number as an integer, if that is sufficient for the purposes of the
variable. This is because integer operations are faster and require less memory. Nonetheless,
mixed operations between integers and floating point numbers are automatically converted into floating point operations, unless specified otherwise. However, when creating a numpy
array, we have to explicitly specify the data type (i.e., integer, floating point, etc.). This is
an important point to remember!
We also introduce the method np.sum, which sums the elements of a numpy array in
its argument (note that this is different from the built-in sum command of Python which
works in iterables or lists; indeed, there are several subtle differences between np.sum and
sum, which have inherently different performance and capabilities). Additionally, in the
code below, we make use of the method np.linalg.solve (which is also part of the numpy
package), which solves the linear system Ax = b (that is, it evaluates the vector A−1 b). The
syntax np.linalg.solve means that we utilize the solve method from the sub-package
linalg of numpy (but this fact is not particularly important).

import numpy as np

def findas(m, xs, ys):

    # ensure xs and ys are numpy arrays
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    # creating a matrix of zeros
    A = np.array([[0.]*(m+1)]*(m+1))
    # creating an empty vector
    b = np.array([0.]*(m+1))

    # filling in A and b with values
    for k in range(m+1):
        b[k] = np.sum(ys*xs**k)
        for i in range(m+1):
            A[k, i] = np.sum(xs**(k+i))
    coefs = np.linalg.solve(A, b)
    print(coefs)
    def fit(x):
        return sum(coefs[k]*(x**k) for k in range(len(coefs)))
    return fit

To check how it works we make a plot. An important remark is that the output of the function
findas is not a number, not a list, but itself a function! Indeed, within the function findas,
there is a def fit(x): which defines a new function fit (the name “fit” is a dummy name,
it is just a dummy variable used inside the function findas that takes for value a function),
and this new function returns the result of the polynomial fit. It is this function that is
the return value of findas, in the line return fit. Thus, in the code given below, the
variable ft (which is the result of calling findas), is itself a function — it is the function
that represents the fit of our data. This is a convenient way of returning a fit: just return
the function that fits.
import matplotlib.pyplot as plt

xs = np.array([1, 2, 3, 4, 5, 6])
ys = np.array([-5.21659, 2.53152, 2.05687, 14.1135, 20.9673, 33.5652])
ft = findas(2, xs, ys)

xdense = np.arange(0, 7, 0.1)

fitlist = list(map(lambda x: ft(x), xdense))

plt.plot(xs, ys, 'ro')
plt.plot(xdense, fitlist)
plt.show()

The above code should reproduce the right plot given in the beginning of the chapter.

Remark
Observe that the above code uses some interesting Python syntax: lambda x: ...
is a way of writing a function without having to attribute it to a symbol;
map(a,b) is a way of creating an ordered set of elements by mapping the
ordered set b (it can be a list for instance) via the function a; and list
transforms a set of elements into an actual list (which can be used for plot-
ting). In the above, list(map(lambda x: ft(x), xdense)) does the same as
[ft(x) for x in xdense].

Finally, let us note that the commands used to create the zero matrix A and the zero
vector b in the function findas can be written in several different ways. The most
common (and readable) way of doing this is by using the np.zeros command. Specif-
ically, we could substitute the command: A = np.array([[0.]∗(m+1)]∗(m+1))
by the command A = np.zeros((m + 1, m + 1)). Similarly, the com-
mand b = np.array([0.]∗(m+1)) could “equivalently” be written as
b = np.zeros(m + 1).
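As a quick cross-check of findas (an added sketch, not part of the notes' own code), numpy also provides a built-in polynomial least-squares fit, np.polyfit; note that it returns the coefficients ordered from the highest power down to the constant, i.e. in the reverse order of our coefs:

import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6])
ys = np.array([-5.21659, 2.53152, 2.05687, 14.1135, 20.9673, 33.5652])

# degree-2 least-squares fit; compare with findas(2, xs, ys), reversed
print(np.polyfit(xs, ys, 2))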

Exercise 9 This is a simple exercise, but it is an important one, since it logically completes
a previous exercise. In Exercise 8, evaluate numerically the convergence orders, instead of
checking if a given q works. For this purpose, extract the data for y(k) = log(| log(|xk − x∗ |)|)
as a function of k, and make sure that you have enough values of k. Then do a linear, 2nd
order polynomial, and 3rd order polynomial fit of this data. Then try to fit only the data for
k large enough. Using such an analysis, deduce that the result is linear in k at large enough
k, and extract the value of q (the convergence order).

3.3 Interpolation Errors


3.3.1 Error of the Polynomial Interpolation
Consider the difference of the exact function $f(x)$ and the interpolation polynomial $P_N(x)$:
$$f(x) - P_N(x) = r(x)\, N_{N+1}(x)$$
where $r(x)$ is the error function, which we want to estimate, and $N_{N+1}(x) = \prod_{i=0}^{N}(x - x_i)$.
This representation is natural since the difference vanishes for x = xi , i = 0, . . . , N .
Theorem 3.3.1 Let $f \in C^{N+1}[a, b]$ (i.e., $N+1$ times differentiable) with $a = x_0$, $b = x_N$. Then for any $x \in [a, b]$, there exists a point $\zeta(x) \in [a, b]$, such that
$$f(x) - P_N(x) = \frac{f^{(N+1)}(\zeta(x))}{(N+1)!}\, N_{N+1}(x).$$
Proof Let us introduce an auxiliary function $q(y)$:
$$q(y) = f(y) - P_N(y) - r(x)\, N_{N+1}(y)$$
which vanishes for N + 2 points y = x0 , . . . , xN and x. If f ∈ C N +1 [a, b], then q ∈ C N +1 [a, b].
Next, we use Rolle's theorem, which tells us that for any $C^1$ function which

vanishes at A and B, there is at least one point inside [A, B] where its derivative vanishes.
We apply this theorem N times: first we conclude that q ′ (y) vanishes at N +1 points between
y = x0 , . . . , xN and x, then q (2) (y) vanishes at N points, and so on. Thus, we finally observe
that q (N +1) (y) should vanish at some point ζ ∈ [a, b].
Noticing that the $(N+1)$-th derivative of $q$ reads
$$q^{(N+1)}(y) = f^{(N+1)}(y) - (N+1)!\, r(x)$$
we get for $y = \zeta$
$$q^{(N+1)}(\zeta) = 0 = f^{(N+1)}(\zeta) - (N+1)!\, r(x)$$
from where we see that
$$r(x) = \frac{f^{(N+1)}(\zeta)}{(N+1)!}\,.$$

Thus, we get the following estimate:
$$|f(x) - P_N(x)| \le \left[\max_{y\in[a,b]}\frac{|f^{(N+1)}(y)|}{(N+1)!}\right] |N_{N+1}(x)|\,. \qquad (3.3)$$
In practice, we will replace in this formula $|N_{N+1}(x)|$ by its maximum on the interval, $\max_{x\in[a,b]}|N_{N+1}(x)|$.

Example Estimate the interpolation error for the function $f(x) = \sqrt{x}$ for the nodes $[100, 120, 140]$. For that we use our error estimate (3.3):
$$|f(x) - P_N(x)| \le \max_y\left[\frac{3}{8\, y^{5/2}}\right]\frac{1}{3!}\,\max\big|(x-100)(x-120)(x-140)\big|.$$
By noting that¹ $\max|(x-100)(x-120)(x-140)| = 3079.2$, we obtain
$$|f(x) - P_N(x)| \le \frac{3}{8\cdot 100^{5/2}}\cdot\frac{3079.2}{6} < 2\cdot 10^{-3}.$$

3.3.2 Estimating the Error on a Mesh: The Order of the Error


One can easily observe that the formula (3.3) is not very convenient to use, as it is complicated
(or even completely impractical) to obtain a good estimate of the maximal value of NN +1 (x)
over the interval x ∈ [a, b].
In estimating errors in general, we will often discretise functions or equations on an
equally spaced mesh xi = x0 + ih (with h > 0). Then, we are interested in estimating the
error, or at least obtaining a good upper bound on the error, as h → 0. What this means is
that the approximation – here the polynomial fit – is expected to become better and better
1
This step is non-trivial. You have to find where the maximum is located by taking the derivative of
(x − 100)(x − 120)(x − 140) and solving for it to vanish. Next, you have to evaluate the initial polynomial at
the points where its derivative vanishes and pick the maximal value among them. In the following subsection,
we give a more efficient method for the error estimation. Thus, we do not go into the details of this calculation
here. Nonetheless, it is still useful for you to work it out yourself.

as the mesh size h is made smaller (assuming that we have enough data to fit), and we would
like to know in what way (in essence “how fast”) the error decreases to zero. This is called
the order of the error.
Suppose $x \in [x_{k-1}, x_k]$. Then,
$$|x_0 - x| \le kh,\quad |x_1 - x| \le (k-1)h,\ \ldots,\ |x_{k-1} - x| \le h, \qquad (3.4)$$
$$|x_k - x| \le h,\quad |x_{k+1} - x| \le 2h,\ \ldots,\ |x_N - x| \le (N-k+1)h, \qquad (3.5)$$
i.e., $|N_{N+1}(x)| \le (N-k+1)!\,k!\,h^{N+1}$, so that
$$|f(x) - P_N(x)| \le \frac{\max_{y\in[a,b]}|f^{(N+1)}(y)|}{\binom{N+1}{k}}\, h^{N+1}. \qquad (3.6)$$
That is, as $h$ is made smaller, we see that $|f(x) - P_N(x)| \le c\, h^{N+1}$ for some $c > 0$ (that depends on the function $f$ and a quantity depending on $N$ that is less than or equal to 1). In that case, we say that $P_N$ has an error of order $h^{N+1}$, or an error $O(h^{N+1})$ (assuming, of course, that the magnitude of the $(N+1)$-st derivative of $f$ is bounded by a constant independent of $N$ (or $h$), which is not always the case).

3.3.3 Optimising the Error Bound: Chebyshev nodes


Following our discussion in the beginning of this Chapter, we note that one can (provably) get
a better result using the roots of Chebyshev polynomials of the first kind (these polynomials
are denoted by TN +1 , and are of degree N + 1) as nodal points, instead of the equally spaced
mesh. For simplicity, let us focus on the case where [a, b] = [−1, 1].
We would like to find the best possible nodes, so that the worst-case bound on |NN +1 (x)|
is minimized. In other words, we want to solve the following optimisation problem:

$$\min_{x_0,\ldots,x_N}\ \max_{x\in[a,b]}\ \left|\prod_{i=0}^{N}(x - x_i)\right|. \qquad (3.7)$$
However, it can be shown that
$$N_{N+1}(x) \equiv \prod_{i=0}^{N}(x - x_i) = x^{N+1} - r(x),$$
for some polynomial $r \in \mathcal{P}_N$ (where $\mathcal{P}_N$ is the space of all polynomials of degree at most $N$), depending on $\{x_j\}_{j=0}^{N}$ (this can be shown by using standard identities of polynomials). Thus, problem (3.7) can equivalently be written as
$$\min_{r\in\mathcal{P}_N}\ \max_{x\in[a,b]}\ \left|x^{N+1} - r(x)\right| \equiv \min_{r\in\mathcal{P}_N}\ \left\|x^{N+1} - r(x)\right\|_{\infty}. \qquad (3.8)$$

In other words, we are looking for a polynomial of degree N yielding the best possible
approximation of xN +1 on [a, b] (in the sense described above).
It turns out (by using the so-called equioscillation theorem; this is non-examinable) that a solution to the optimisation problem (3.8) arises from the Chebyshev polynomial of the first kind. In our convention, these polynomials are defined by $T_{N+1}(\cos\theta) = 2^{-N}\cos((N+1)\theta)$ (where on the right-hand side, trigonometric identities have to be used to write the result as a polynomial in $\cos\theta$). Observe that for any $x \in [-1, 1]$, we can equivalently write the previous as $T_{N+1}(x) = 2^{-N}\cos\big((N+1)\cos^{-1}x\big)$. Let us note that this is a normalized

version of the standard Chebyshev polynomial of the first kind, yielding a monic polynomial
(i.e., a polynomial with leading coefficient equal to 1). As a monic polynomial, $T_{N+1}(x)$ can be written as
$$T_{N+1}(x) = x^{N+1} - q(x), \quad \text{for some } q \in \mathcal{P}_N.$$
It turns out that $q(x)$ is a solution to problem (3.8). Now that we have a solution to problem (3.8), we can find a solution to problem (3.7) by finding all the roots (or zeros) of the (normalized) Chebyshev polynomial and using them as our nodal points. On the interval $[-1, 1]$, the nodal points (zeros of $T_{N+1}$) are $x_i = \cos\!\left(\frac{\pi(i+1/2)}{N+1}\right)$, for $i = 0, 1, \ldots, N$.
This result holds for the interval $[-1, 1]$, and translation and dilation can be used to accommodate other intervals $[a, b]$ (see Section 3.1.2). Thus, the roots of $T_{N+1}$ are the nodal points to use in order to minimise the error. Specifically, one can show that for these polynomials (i.e., by using the aforementioned nodal points) we obtain
$$|N_{N+1}(x)| = |T_{N+1}(x)| \le \frac{1}{2^N}, \quad \text{for } x \in [-1, 1], \qquad (3.9)$$
and therefore,
$$|f(x) - P_N(x)| \le \frac{\max_{y\in[-1,1]}|f^{(N+1)}(y)|}{(N+1)!\, 2^N}. \qquad (3.10)$$
In general, we have
$$|f(x) - P_N(x)| \le \left(\frac{b-a}{2}\right)^{N+1} \frac{\max_{y\in[a,b]}|f^{(N+1)}(y)|}{(N+1)!\, 2^N}\,.$$

Note that this is a smaller error than that in (3.6). To prove this, you might consider using
Stirling’s approximation for (n + 1)! (assuming n is large), however this is non-examinable.
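To see the effect of the choice of nodes in practice, the following small experiment (an added sketch; it assumes the lagrange function from Section 3.1.1 and numpy imported as np) compares the maximum interpolation error for the rapidly varying function $1/(1+25x^2)$ on $[-1, 1]$ with equally spaced versus Chebyshev nodes:

import numpy as np

def g(x):
    return 1/(1 + 25*x**2)

N = 10
xs_equi = list(np.linspace(-1, 1, N + 1))
i = np.arange(N + 1)
xs_cheb = list(np.cos(np.pi*(i + 1/2)/(N + 1)))

xdense = np.linspace(-1, 1, 1001)
for xs in (xs_equi, xs_cheb):
    ys = [g(x) for x in xs]
    err = max(abs(g(x) - lagrange(x, ys, xs)) for x in xdense)
    print(err)
# the Chebyshev nodes should give a much smaller maximum error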

Chapter 4
Numerical Differentiation

The problem posed in this chapter is extremely simple: given the function $f(x)$, compute $d^n f/dx^n$ at a given $x$. In particular, assuming that we can compute the values of the function $f$ at some points $x_k$, we would like to convert this information into an approximate expression for the derivatives. Nonetheless, we will show that such an approximation, although extremely useful, always incurs a substantial error that needs to be tracked and quantified.

4.1 Finite Difference Approximations


Once again, the main tool we utilize for such derivative approximations is the Taylor series
expansion. For it to be applicable we have to assume the function is sufficiently smooth. In
this case we can write
$$f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(x) + \frac{h^3}{6}f'''(x) + O(h^4),$$
$$f(x-h) = f(x) - hf'(x) + \frac{h^2}{2}f''(x) - \frac{h^3}{6}f'''(x) + O(h^4),$$
$$f(x+2h) = f(x) + 2hf'(x) + 4\,\frac{h^2}{2}f''(x) + 8\,\frac{h^3}{6}f'''(x) + O(h^4),$$
$$f(x-2h) = f(x) - 2hf'(x) + 4\,\frac{h^2}{2}f''(x) - 8\,\frac{h^3}{6}f'''(x) + O(h^4),$$
for sufficiently small h. We will use these expressions in order to approximate the first
derivative. Similar ideas can be used to approximate higher derivatives as well.
From the above Taylor series expansions we get
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{h^2}{6}f'''(x) + O(h^4) \qquad (4.1)$$
that is,
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} + O(h^2)\,. \qquad (4.2)$$
We see that the combination $\frac{f(x+h)-f(x-h)}{2h}$, for small $h$, gives $f'(x)$ up to an error of order $O(h^2)$, that is, with order-$h^2$ precision. Yet another approximation
$$f'(x) \simeq \frac{f(x-2h) - f(x+2h) + 8f(x+h) - 8f(x-h)}{12h} + \frac{h^4}{30}f^{(5)}(x) \qquad (4.3)$$
uses 4 different points and has a better $O(h^4)$ precision. One may also use the simpler combination
$$f'(x) = \frac{f(x+h) - f(x)}{h} + O(h) \qquad (4.4)$$
or
$$f'(x) = \frac{f(x) - f(x-h)}{h} + O(h) \qquad (4.5)$$
which has a lower precision, $O(h)$ instead of $O(h^2)$ (i.e., the error $O(h)$ is greater than the error $O(h^2)$, since we assume that $h$ is very small).
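To make the comparison concrete, here is a small sketch (an addition to the notes; the test function $\sin$ and the point $x = 1$ are just illustrative choices) evaluating the forward difference (4.4) and the central difference (4.2) for a few values of $h$:

import numpy as np

def forward_diff(f, x, h):
    return (f(x + h) - f(x))/h          # eq. (4.4), error O(h)

def central_diff(f, x, h):
    return (f(x + h) - f(x - h))/(2*h)  # eq. (4.2), error O(h^2)

x = 1.0
for h in (1e-1, 1e-2, 1e-3):
    print(h,
          abs(forward_diff(np.sin, x, h) - np.cos(x)),
          abs(central_diff(np.sin, x, h) - np.cos(x)))
# the central-difference error should shrink roughly 100 times
# each time h shrinks by a factor of 10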

4.2 Error Estimation: Round-off (Machine Precision) and Approximation Errors

One could argue that the error O(h) in the simpler approximation (4.4) should not be such
a big problem, as we simply should take h as small as we like. However, the problem
here is that the error introduced by the finite machine precision (see Section 1.4.2) has a
significant effect. In fact, there are two sources of errors here, which need to be considered
and controlled: the approximation error (associated to the numerical method), and the
round-off errors (associated to finite machine precision or machine epsilon).
Consider a definition of an approximate derivative that is easily implementable on the computer:
$$f'_{\mathrm{approx}}(x) \triangleq \frac{f(x+h) - f(x)}{h}$$
for some small $h$. Then
$$f'_{\mathrm{approx}}(x) = f'(x) + O(h) \qquad (4.6)$$
and the approximation error is $O(h)$.

But $f'_{\mathrm{approx}}(x)$ needs to be evaluated on the computer. As we saw in Sections 1.4.2–1.4.3,
most operations on the computer introduce an error O(ϵ) with, say, ϵ = 10−16 . Such errors are
essentially out of our control: they come from the necessary rounding-off performed within
the computer’s architecture due to the finite memory allocated for storing real numbers, and
the way the round-off is done depends on the number itself and on the associated computer
processes. For instance, in the expression f (x + h) − f (x) both f (x + h) and f (x) have an
error O(ϵ) (since f is assumed to be continuously differentiable; see Section 1.4.3 where we
evaluate the propagation of round-off errors when evaluating functions), hence so does their
difference. In fact, as we saw in Section 1.4.3, if |f (x + h) − f (x)| is very small, then the
relative error of the subtraction can be substantially larger (due to catastrophic cancellation)!
Nonetheless, in this Section we will only consider absolute errors for simplicity1 . In that case,
the numerical evaluation of the previous approximation is
$$f'_{\mathrm{num}}(x) \triangleq \frac{f(x+h) - f(x) + O(\epsilon)}{h + O(\epsilon)}\,.$$
However, we have to keep in mind that because $f(x+h) - f(x)$ is small, the round-off error $O(\epsilon)$ has an effect on digits that can be quite significant, hence a big effect. The question we pose now is: what absolute error does this numerical computation introduce to the result?
1
We note that if we manage to find values of h that minimize the absolute errors of the corresponding
approximations, we can expect to also obtain reduced relative errors. Thus, focusing on absolute errors is
done without loss of generality.

In order to address this, consider the ratio a/b, where a ∼ b and both are rather small,
say O(h); consider a = Ah and b = Bh for some small finite h, and O(1) numbers A and B
(the exact value of which is not important here). a and b are assumed to be obtained from
some operation on the computer, so have absolute errors O(ϵ). Thus, when evaluating the
ratio a/b on a computer, we have

$$\frac{a + x\epsilon}{b + y\epsilon} = \frac{a}{b} + \epsilon\,\frac{bx - ay}{b^2} + O(\epsilon^2) \qquad (4.7)$$

for some (random) O(1) coefficients x and y (whose exact values do not matter), representing
the round-off errors of a and b, respectively. Note that in (4.7) we utilized a first-order Taylor
series expansion in ϵ evaluated at ϵ = 0. This means that the absolute error of this calculation
is
$$\epsilon\,\frac{bx - ay}{b^2} = \epsilon\,\frac{Bhx - Ahy}{B^2h^2} = \frac{Bx - Ay}{B^2}\,\frac{\epsilon}{h} = O\!\left(\frac{\epsilon}{h}\right). \qquad (4.8)$$
Recall that A, B, x, y are all O(1) numbers, and x and y are essentially random. Here we
took into account both small quantities ϵ and h. You can also verify that the higher order
terms in the ϵ Taylor series expansion will lead to terms of the form O((ϵ/h)n ) for higher
powers n (i.e., n ≥ 2).
From Eqs. (4.8) and (4.6), we conclude that
$$f'_{\mathrm{num}}(x) = f'_{\mathrm{approx}}(x) + O\!\left(\frac{\epsilon}{h}\right) = f'(x) + O(h) + O\!\left(\frac{\epsilon}{h}\right).$$
This means that we cannot really take $h$ to be very small, as this will increase the induced arithmetic error $O\!\left(\frac{\epsilon}{h}\right)$ of this numerical method, because $h$ is in the denominator. However, at the same time, we cannot take $h$ to be too large, because this will increase the approximation error $O(h)$ of the numerical method: instead, we should strike an optimal balance between the two sources of error in order to minimize the overall absolute error of our approximation of the derivative.
For this "forward difference approximation" method, the combined absolute error is
$$E \triangleq O(h) + O\!\left(\frac{\epsilon}{h}\right).$$
The important question is: what is the optimal value for $h$, which minimises the error? In order to answer this, we write the combined error as
$$E = uh + \frac{v\epsilon}{h}$$
for some $O(1)$ numbers $u$ and $v$ (whose values are not important), and find the minimum with respect to $h$. By taking the derivative of the previous expression (w.r.t. $h$) and equating it to zero, this minimum is obtained at
$$u - \frac{v\epsilon}{h^2} = 0 \;\Rightarrow\; h = \sqrt{\frac{v\epsilon}{u}} = O(\sqrt{\epsilon}). \qquad (4.9)$$
That is, the optimal choice of $h$ is $O(\sqrt{\epsilon}) = O(10^{-8})$. Thus, because of finite machine precision, we do not expect to achieve more than about 8 digits of accuracy (or precision) in this derivative approximation.

Let us also estimate the absolute error for the "central difference approximation" given in eq. (4.2). In this case, we can (similarly to the forward-difference case) show that the combined error is $O(h^2) + O(\epsilon/h)$, whose minimal value is achieved at $h = O(\epsilon^{1/3}) = O(10^{-6})$, and thus the best possible absolute approximation error is $O(10^{-12})$. We see that the central difference approximation is slightly better, but still not to the extent that is suggested in (4.2)².
Let us at this point note that in Python, we could use a multiple precision library, such as
mpmath, which allows to decrease ϵ to, say, 10−500 (see, for example, Section 2.5). Nonetheless,
we should always keep in mind that increasing the precision in Python introduces significant
memory and CPU overheads, making our code slower and less scalable. Thus, we should
always strive to optimize our code and our numerical approximations to make the most of
the standard machine precision.

Exercise 10 Show that all Taylor series terms in (4.7) have the form, under a calculation
like (4.8), O((ϵ/h)n ).

Exercise 11 Work out various expressions for the second derivative, and evaluate the opti-
mal h, taking into account round-off and approximation errors. You should get that for the
k th derivative, the roundoff error is O(ϵ/hk ), and with an approximation error O(hn ), the
1 n
optimal value is h = O(ϵ n+k ) with optimal error E = O(ϵ n+k ).

4.3 Python experimentation


The numerical method itself is trivial to implement, but it is interesting to experiment with
Python to check the two types of errors at play.
Here we write a function that evaluates the number of good digits in the symmetric approximation (4.2). In the following code, we set $\epsilon \approx 10^{-40}$. The expression
(f(1+h)-f(1-h))/(2*h)
is affected by the errors $O(h^2) + O(\epsilon/h)$, while df(1), where we analytically computed the derivative, is only affected by the smaller $O(\epsilon)$ round-off error, which can therefore be neglected.
import mpmath as mp

mp.mp.dps = 40
def f(x):
    return x**3 + 2*x
def df(x):
    return 3*(x**2) + 2
def numberOfGoodDigits(h):
    differencempf = (f(1+h) - f(1-h))/(2*h) - df(1)
    return float((-mp.log(mp.fabs(differencempf)))/mp.log(10))
2
Note that if we did not consider round-off errors, then the forward difference approximation would be $O(h)$, while the central difference would be $O(h^2)$ (i.e., in theory, the latter is significantly better). In practice, however, we see that the difference in accuracy between the two is not as large once round-off errors are also taken into account.

Next, we utilize the following piece of code to plot the number of accurate digits versus the
value of h. The resulting plot is given in Figure 4.3.
import matplotlib.pyplot as plt

xdense = mp.arange(0, 40, 0.1)

plt.plot(xdense, [numberOfGoodDigits(10**(-x)) for x in xdense])
plt.xlabel("d: for h = 10**(-d)")
plt.ylabel("Number of Good Digits")
plt.title("Number of Good Digits vs. h(=10^(-d))")
plt.show()

We observe that the plot in Fig. 4.3 verifies our estimates for the optimal value of $h$! Indeed, for the central difference approximation, we observed that an optimal value of $h$ is $O(\epsilon^{1/3})$. In the above Python example, we set $\epsilon = 10^{-40}$, and thus the optimal $h$ should be about $O(10^{-14})$. This is indeed verified in Fig. 4.3.

Remark
Observe that in the previous code we use arange, which is a method we have seen from the numpy library. But we have not imported that library! In fact, mpmath, which we have imported, has many of the mathematical methods of numpy. In particular, it contains arange, sin, cos, .... But crucially, these return multiple precision float (i.e., mpf) numbers. So, above, in writing xdense = mp.arange(0,40,0.1), we made sure that the numbers we pass to the function numberOfGoodDigits are of mpf type, hence the calculations done in numberOfGoodDigits are all on mpf numbers. Thus, we are able to work with 40 digits of accuracy; this is essential for the procedure to work.

In general, working with high-precision libraries (like mpmath) can be tricky and leads
to non-scalable code. Thus, such libraries should be used for particular calculations
that require high precision, and then mpf numbers should be (hopefully) safely cast
back to regular floats (using which we can utilize any available Python package, like
numpy).

4.4 Richardson Extrapolation


Richardson extrapolation is a simple-to-implement method that enables us to boost the accuracy of certain numerical computations. It works for a number of numerical procedures, including the finite difference approximations.
Suppose we have approximated some quantity $G$ by some expression $g(h)$, with an induced error of the form $ch^p$. Then, for two different values of $h$ we get
$$G = g(h_1) + ch_1^p, \qquad G = g(h_2) + ch_2^p.$$
We can now eliminate the error by combining those two formulae: in the following expression, the terms $ch_1^p$ and $ch_2^p$ cancel out,
$$G = \frac{(h_1/h_2)^p\, g(h_2) - g(h_1)}{(h_1/h_2)^p - 1}\,.$$
This is the Richardson extrapolation formula. It is common to take $h_2 = h_1/2$:
$$G = \frac{2^p\, g(h/2) - g(h)}{2^p - 1}\,.$$

4.4.1 Richardson Extrapolation: An Example


To exemplify the Richardson extrapolation method, we apply it to the forward difference approximation of the derivative, i.e.
$$f'(x) = \frac{f(x+h) - f(x)}{h} + ch + O(h^2).$$
We will neglect the $O(h^2)$ terms and identify
$$G = f'(x), \qquad g(h) = \frac{f(x+h) - f(x)}{h}\,.$$
Under this identification we can apply the Richardson extrapolation for $h_2 = h_1/2$ (keeping in mind that $p = 1$ for the given approximation), to obtain
$$G = \frac{2\,\frac{f(x+h/2) - f(x)}{h/2} - \frac{f(x+h) - f(x)}{h}}{2 - 1} = \frac{4f(x+h/2) - f(x+h) - 3f(x)}{h}\,,$$
which indeed gives $f'(x)$ with better precision:
$$\frac{4f(x+h/2) - f(x+h) - 3f(x)}{h} = f'(x) - \frac{1}{12}h^2 f^{(3)}(x) + O(h^3),$$
i.e., the precision is boosted to $O(h^2)$!
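A minimal Python sketch of this procedure (an addition to the notes; the test function $f(x) = x^3 + 2x$ and the point $x = 1$ are just illustrative choices) could look as follows:

def richardson(g, h, p):
    # combine g(h) and g(h/2), assuming an error of the form c*h**p
    return (2**p*g(h/2) - g(h))/(2**p - 1)

def f(x):
    return x**3 + 2*x

def g(h):
    # forward difference at x = 1, approximation error O(h), so p = 1
    return (f(1 + h) - f(1))/h

h = 1e-3
print(abs(g(h) - 5), abs(richardson(g, h, 1) - 5))  # exact value f'(1) = 5
# the extrapolated value should be several orders of magnitude more accurate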

Chapter 5
Numerical Integration

We will discuss two main classes of conceptually different methods for numerical integration.
The first class includes midpoint, trapezoidal and Simpson methods. In these methods you
only get the values of the integrand for some fixed points xi and build an approximation
valid between these points like on the picture below

Figure 5.1: Illustration for the midpoint, trapezoid and Simpson methods. In all cases a
(piecewise) polynomial interpolation of the integrand is used (constant, linear and parabolic).
For these methods the sampling points are equally separated.

Figure 5.2: For the Gaussian integration method the sampling points are chosen in a special
way to increase precision

In this first class, we will also see the Gaussian integration method. This is a more
advanced way of integrating, where we also adjust, in a smart way, the positions of the

points xi at which the “measurement” of the integrand is made, in order to reduce as much
as possible the error made in the resulting approximation of the integral.
The second class is very different: it is based on a random process, is less accurate,
but much more flexible and easier to implement in the computer. This is the Monte Carlo
method. It is extremely popular in science, thanks to its great flexibility.

5.1 Trapezoidal Rule


As illustrated above, for the trapezoidal method we do a linear interpolation between the sampling points to approximate the integrand. We assume the sampling points to be separated by a small distance $h$; then the area of the trapezoid $S_i$ (see the picture) is given by
$$S_i = \frac{h}{2}\big(f(x_i) + f(x_{i+1})\big) \qquad (5.1)$$

That is, with the trapezoidal rule, the numerical approximation of the integral on the interval $[a, b]$ is obtained by separating it into $n$ small intervals of size $h = \frac{b-a}{n}$ each, that is with $x_i = a + hi$, and evaluating the trapezoid area on each of these small intervals:
$$\int_a^b f(x)\,dx \approx I_a^b[f] := \sum_{i=0}^{n-1} S_i = h\sum_{i=0}^{n-1}\frac{f(x_i) + f(x_{i+1})}{2} = f(a)\frac{h}{2} + \sum_{i=1}^{n-1} f(a+hi)\,h + f(b)\frac{h}{2}\,. \qquad (5.2)$$

5.1.1 Approximation error


We can easily estimate the error of this approximation in the limit when $h$ is small. If $h$ is small we can represent $f(x)$ on the interval $x \in [x_i, x_i + h]$ by its Taylor series expansion
$$f(x) \simeq f(x_i) + (x - x_i)f'(x_i) + \frac{1}{2}(x - x_i)^2 f''(x_i) + O\big((x - x_i)^3\big). \qquad (5.3)$$
Its integral on $[x_i, x_i + h]$ is then given by
$$\int_{x_i}^{x_i+h} f(x)\,dx \approx \int_{x_i}^{x_i+h} \left(f(x_i) + (x - x_i)f'(x_i) + \frac{1}{2}(x - x_i)^2 f''(x_i)\right)dx \qquad (5.4)$$
$$= f(x_i)h + f'(x_i)\frac{h^2}{2} + f''(x_i)\frac{h^3}{6}\,.$$
On the other hand, the trapezoidal rule on this interval gives
$$S_i = h\,\frac{f(x_i) + f(x_{i+1})}{2} \approx \frac{h}{2}\left(f(x_i) + f(x_i) + hf'(x_i) + \frac{h^2}{2}f''(x_i)\right) \qquad (5.5)$$
and thus the error is
$$E_i = \left|S_i - \int_{x_i}^{x_i+h} f(x)\,dx\right| = \frac{1}{12}h^3|f''(x_i)| + O(h^4) \qquad (5.6)$$

which is rather small.

We then evaluate an upper bound for the total error of the full integral using this. With (5.2), the estimate for the error is the sum
$$E = \left|\int_a^b f(x)\,dx - I_a^b[f]\right| \le \sum_{i=0}^{n-1} E_i = \sum_{i=0}^{n-1}\left(\frac{h^3}{12}|f''(x_i)| + O(h^4)\right) \qquad (5.7)$$
$$\le \frac{nh^3}{12}\max_{x\in[a,b]}|f''(x)| + O(nh^4) \qquad (5.8)$$
$$= \frac{h^2}{12}(b-a)\max_{x\in[a,b]}|f''(x)| + O(h^3) \qquad (5.9)$$
where we used the fact that the sum contains $n = \frac{b-a}{h}$ terms, which, as $h$ is made small, is a large number of order $O(h^{-1})$. Thus, the error in the trapezoidal approximation is bounded by $O(h^2)$. This is unless the second derivative is zero everywhere, in which case the function is linear and the trapezoidal approximation is exact!

5.1.2 Roundoff error


As usual, the approximation made needs to be put on the computer, and then additional
roundoff errors are incurred.
We see that we have to compute a sum of n terms ai , each of order ai = O(h) = O(1/n).
Now suppose that the sum, up to a certain step where a finite fraction of the interval has
been done, is A. The next step we add ai , so the result is A + ai . But the numerical
implementation of this has roundoff errors. Since A is essentially the integral on a finite
fraction of the interval, we may assume A = O(1). This has a roundoff error O(ϵ) as usual,
say with ϵ = 10−16 for a 16-digit computation engine. The one-step ai itself has a smaller

error 1 . But the computational operation of going from A to A + ai adds an O(ϵ) error again:
numerically we do A → A + ai + O(ϵ). Thus, it is equivalent to assume each ai to have a
roundoff error O(ϵ).
Now we do the sum of a big number of such terms, hence the error accumulates. That is, the numerical evaluation of the sum $\sum_{i=0}^{n-1} a_i$ is
$$\sum_{i=0}^{n-1}(a_i + \alpha_i\epsilon) \qquad (5.10)$$
where $\alpha_i$ are (pseudo-)random $O(1)$ numbers. With $\max_i|\alpha_i| = \alpha = O(1)$, we can then bound the difference between the numerical and "mathematical" expressions as
$$\left|\sum_{i=0}^{n-1}(a_i + \alpha_i\epsilon) - \sum_{i=0}^{n-1} a_i\right| \le \alpha n\epsilon = O(\epsilon/h) \qquad (5.11)$$
using $n = O(h^{-1})$. Thus, we find the same as for the derivative!


However, this is not quite what happens. As the variables $\alpha_i$ are essentially random, taking both positive and negative values, there is some cancellation, and a "law of large numbers" applies. Thus $\sum_{i=0}^{n-1}\alpha_i$ takes its average value (0) plus fluctuations that are expected to be of variance proportional to $n$. This means that typically, we would expect
$$\sum_{i=0}^{n-1}\alpha_i = O(\sqrt{n}) \qquad (5.12)$$
and therefore
$$\left|\sum_{i=0}^{n-1}(a_i + \alpha_i\epsilon) - \sum_{i=0}^{n-1} a_i\right| = O(\epsilon/\sqrt{h}). \qquad (5.13)$$

Thus, in fact, we expect the trapezoidal rule numerical integration to be affected by smaller roundoff errors than the simplest numerical derivative method. This is typical of numerical integration: it is less sensitive to roundoff errors than numerical differentiation; the same analysis holds for the other methods of integration in the first class, so we will not repeat it.

5.1.3 Python implementation for the Trapezoidal Rule


Implementation is very simple:
from numpy import arange
def trapezoidal(f, a, b, n):
    fs = list(map(lambda x: f(x), arange(a, b + (b-a)/n, (b-a)/n)))
    return (sum(fs) - (fs[0] + fs[-1])*0.5)*(b-a)/n
Test it on $\int_{-1}^{1}\frac{2}{x^2+1}\,dx$:
1
It is the sum f (xi ) + f (xi + h) (which is O(1)) multiplied by h/2 (small). The error O(ϵ) of the sum is
multiplied by h/2 as well so its error is O(hϵ). The computer keeps the number of significant digits under
multiplications, just keeping track where the point “.” is in the floating point representation; in a subtraction
of nearby numbers, significant digits are lost, but not in a multiplication.

def f(x):
    return 2/(x**2 + 1)
trapezoidal(f, -1, 1, 1000)
What should you get?
Check that the error decreases as $1/n^2$.

5.2 Midpoint rule


In this method, we estimate the integral of the small interval [xi , xi + h] by the area of the
rectangle bounded on top by the value of the function at the midpoint of the interval:
$$\int_{x_i}^{x_i+h} f(x)\,dx \approx S_i := h\, f\!\left(x_i + \frac{h}{2}\right), \qquad (5.14)$$
as is clear from the picture.
To estimate the error we assume again that the function f (x), on that interval, can be
approximated well by the Taylor series:
$$f(x) = f(x_i) + f'(x_i)(x - x_i) + \frac{1}{2}f''(x_i)(x - x_i)^2 + \ldots \qquad (5.15)$$
so that
$$\int_{x_i}^{x_i+h} f(x)\,dx = \int_{x_i}^{x_i+h}\left(f(x_i) + f'(x_i)(x - x_i) + \frac{1}{2}f''(x_i)(x - x_i)^2 + \ldots\right)dx$$
$$= f(x_i)h + \frac{1}{2}f'(x_i)h^2 + \frac{1}{6}f''(x_i)h^3 + \ldots \qquad (5.16)$$
We compare this with the Taylor series result obtained from the midpoint rule:
$$S_i = h\,f\!\left(x_i + \frac{h}{2}\right) = h\left(f(x_i) + f'(x_i)\frac{h}{2} + \frac{1}{2}f''(x_i)\left(\frac{h}{2}\right)^2 + \ldots\right)$$
$$= f(x_i)h + f'(x_i)\frac{h^2}{2} + f''(x_i)\frac{h^3}{8} + O(h^4)\,. \qquad (5.17)$$
Thus the mismatch between the integral and the midpoint rule is
$$E_i = \left|S_i - \int_{x_i}^{x_i+h} f(x)\,dx\right| = \left|\frac{1}{8} - \frac{1}{6}\right| |f''(x_i)|h^3 + O(h^4) = \frac{1}{24}|f''(x_i)|h^3 + O(h^4)\,. \qquad (5.18)$$
Again, we approximate the full integral by the sum $\sum_i S_i$, with a small interval $h = (b-a)/n$ and $x_i = a + ih$:
$$\int_a^b f(x)\,dx \approx I_a^b[f] := \sum_{i=0}^{n-1} S_i = h\sum_{i=0}^{n-1} f\!\left(a + \left(i + \frac{1}{2}\right)h\right). \qquad (5.19)$$
The overall approximation error $E = \left|\int_a^b f(x)\,dx - I_a^b[f]\right|$ is bounded as
$$E \le \frac{h^2}{24}(b-a)\max_{x\in[a,b]}|f''(x)| + \ldots = O(h^2). \qquad (5.20)$$
This is the same order $O(h^2)$ as the trapezoidal rule, but the coefficient $1/24$ is smaller – thus this is (marginally) more precise!
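A minimal implementation sketch of the midpoint rule (an addition, mirroring the trapezoidal code of Section 5.1.3; the test integral is the same one used there) is:

def midpoint(f, a, b, n):
    h = (b - a)/n
    # sum of h*f(midpoint) over the n subintervals, cf. eq. (5.19)
    return sum(f(a + (i + 0.5)*h) for i in range(n))*h

def f(x):
    return 2/(x**2 + 1)

print(midpoint(f, -1, 1, 1000))  # should be close to pi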

5.3 Simpson’s rule
Simpson’s rule is a mixture of trapezoidal and the midpoint methods which uses the knowl-
edge of the function at the end points and also in the middle of each of the intervals. For
the small interval [a, b] = [xi , xi+1 ] Simpson’s rule states:
$$\int_a^b f(x)\,dx \approx S_i := \frac{b-a}{6}\left(f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right). \qquad (5.21)$$

Again, for not-small intervals $[a, b]$ we divide the interval into $n$ parts of length $h = (b-a)/n$ with endpoints $x_i = a + ih$ (again $x_0 = a$, $x_n = b$), and then the integral is estimated by the sum
$$\int_a^b f(x)\,dx \approx I_a^b[f] := \sum_{i=1}^{n}\frac{h}{6}\left(f(x_{i-1}) + 4f\!\left(\frac{x_{i-1}+x_i}{2}\right) + f(x_i)\right), \qquad (5.22)$$
giving
$$I_a^b[f] := \sum_{i=1}^{n}\frac{h}{6}\left(f(a + ih - h) + 4f\!\left(a + ih - \frac{h}{2}\right) + f(a + ih)\right). \qquad (5.23)$$
Where does (5.21) come from? One way to see it is by using Richardson extrapolation, Section 4.4, on the trapezoidal method. Indeed, consider the trapezoidal approximation for integration on the interval $[a, b]$, with two different choices of the small parameter $h$: one is $h_1 = b - a$ (the full interval), and the other is $h_2 = (b-a)/2 = h_1/2$ (half of the interval). In the notation of the Richardson extrapolation section we have $p = 2$, as the approximation error for the trapezoidal method is $O(h^2)$, and the results of the approximate integration on $[a, b]$ for the choices $h_1, h_2$ are
$$g(h_1) = (b-a)\left(\frac{f(a)}{2} + \frac{f(b)}{2}\right), \qquad g(h_2) = \frac{b-a}{2}\left(\frac{f(a)}{2} + f\!\left(\frac{a+b}{2}\right) + \frac{f(b)}{2}\right). \qquad (5.24)$$
Thus the improved approximation is
$$G = \frac{g(h_2)\,2^2 - g(h_1)}{2^2 - 1} = \frac{b-a}{6}\left(f(a) + f(b) + 4f\!\left(\frac{a+b}{2}\right)\right), \qquad (5.25)$$
which is nothing but Simpson's rule. Hence, according to the Richardson method, the resulting approximate integral should have an error of $O(h^3)$ or better. We see below that it is in fact better!

5.3.1 Error estimation


We can check explicitly the approximation error for Simpson's rule as given by (5.22). We use the same technique as for the other approximation methods. First we evaluate the error for a single step, by Taylor expanding $f(x)$ around $x_i$ in both the expression for $S_i$ and in $\int_{x_i}^{x_i+h} f(x)\,dx$:
$$\int_{x_i}^{x_i+h} f(x)\,dx = f(x_i)\,h + f'(x_i)\frac{h^2}{2} + f''(x_i)\frac{h^3}{3!} + f'''(x_i)\frac{h^4}{4!} + f''''(x_i)\frac{h^5}{5!} + \dots \qquad (5.26)$$

and
$$S_i = f(x_i)\,h + f'(x_i)\frac{h^2}{2} + f''(x_i)\frac{h^3}{3!} + f'''(x_i)\frac{h^4}{4!} + f''''(x_i)\frac{5 h^5}{(4!)^2} + \dots \qquad (5.27)$$
giving
$$E_i = S_i - \int_{x_i}^{x_i+h} f(x)\,dx = \frac{f''''(x_i)}{4!\,5!}\,h^5 + O(h^6). \qquad (5.28)$$
From this, using again that there are $n = (b-a)/h$ terms in the sum for the approximation $I_a^b[f]$, the overall approximation error $E = \int_a^b f(x)\,dx - I_a^b[f] \le \sum_{i=0}^{n-1} E_i$ is bounded as
$$E \le \frac{h^4}{4!\,5!}\,(b-a)\max_{x\in[a,b]} |f''''(x)| + \dots = O(h^4). \qquad (5.29)$$
This is much better than the trapezoidal and midpoint methods.
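To make the comparison concrete, here is a minimal sketch of (5.23) in Python (our own snippet, not code from the notes; the name simpson and the test integrand are our choices):

from math import exp, e

def simpson(f, a, b, n):
    # composite Simpson rule (5.23): each sub-interval contributes
    # (h/6)*( f(left) + 4*f(middle) + f(right) )
    h = (b - a)/n
    total = 0.0
    for i in range(1, n + 1):
        total += h/6*(f(a + i*h - h) + 4*f(a + i*h - h/2) + f(a + i*h))
    return total

# rough convergence check on int_0^1 exp(x) dx = e - 1
for n in (10, 20, 40):
    print(n, abs(simpson(exp, 0, 1, n) - (e - 1)))

Doubling n should shrink the error by roughly a factor of 16, consistent with (5.29).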

5.4 Gaussian Integration Method


The idea of Gaussian integration is to make an optimal choice for the nodal points $x_i$.
A first concept is that the optimal choice depends on the space of functions whose integral
we want to approximate. A natural space is that of polynomials, but it is not
so interesting to numerically approximate the integral of polynomials as they are easy to
integrate. So instead, we separate the functions we want to integrate into a “weight” w(x) >
0 or “measure function”, which can be anything, and a function f (x) in a simple space,
which we will take to be polynomials. Thus, we are looking for the best approximation I[f ]
to integrals of the form
$$\int_a^b w(x)\, f(x)\,dx \;\approx\; I[f] \qquad (5.30)$$

(here for lightness of notation we omit the explicit dependence on $a, b$ of the approximation
$I[f]$). For instance, if you want to approximate $\int_a^b g(x)\,dx$, then depending on what $g(x)$
looks like, you want to choose the right w(x), so that the other factor f (x) does not vary
too much. How to choose w(x) is a bit of an art; for now we just assume w(x) is given.
The second concept is that we have essentially two groups of parameters that we can
adjust to approximate the integral: the nodes xn,i , and what I would call the “approximation
weights” Ai , for i = 0, 1, 2, . . . , n:
$$I[f] = \sum_{i=0}^{n} A_i\, f(x_i). \qquad (5.31)$$

In fact, the number of nodes n + 1 is another parameter that we can choose – but here we fix
it, and then try to optimise on xi and Ai . So, we have in total 2n + 2 parameters to adjust.
Once the measure $w(x)$, the integration range $[a, b]$, and the number of nodes $n$
are given, the method fixes the $x_i$'s and $A_i$'s. The method attempts to minimise the
error made in the approximation. How? It follows this simple requirement: the approximate
method should give the exact value of the integral for all polynomials of degree less than or
equal to 2n + 1:
$$f_{\rm poly}(x) = \sum_{m=0}^{2n+1} a_m x^m. \qquad (5.32)$$

That is,
$$\int_a^b w(x)\, f_{\rm poly}(x)\,dx = I[f_{\rm poly}]. \qquad (5.33)$$

The space of such polynomials is $(2n+2)$-dimensional, as there are $2n+2$ arbitrary coefficients $a_m$, $m = 0, 1, \dots, 2n+1$; hence (5.33) is in fact $2n+2$ equations:
$$\int_a^b w(x)\, x^m\,dx = \sum_{i=0}^{n} A_i\, x_i^m\,, \qquad m = 0, \dots, 2n+1\,. \qquad (5.34)$$

As we have $2n+2$ free parameters, it should be possible to satisfy all these equations by an appropriate
choice of these parameters.
Note that the equations are highly nonlinear in xn,i . The problem of finding xi can be
reduced to a linear problem using the following observation. Consider a polynomial ϕn+1 (x)
of degree n + 1 such that xi are its zeros:

$$\phi_{n+1}(x_i) = 0\,, \qquad i = 0, 1, \dots, n. \qquad (5.35)$$

Then since ϕn+1 (x)xk for k = 0, . . . , n is a polynomial of degree less than or equal to 2n + 1,
we should have
$$\int_a^b w(x)\,\phi_{n+1}(x)\, x^k\,dx = \sum_{i=0}^{n} A_i\, x_i^k\, \phi_{n+1}(x_i) = 0\,, \qquad k = 0, \dots, n\,. \qquad (5.36)$$

Conceptually, this is an orthogonality condition: it says that the polynomials ϕn+1 (x) and
$x^k$, $k = 0, 1, \dots, n$, are orthogonal with respect to the weight $w(x)$. More practically, this is
a set of $n+1$ equations (for the $n+1$ values of $k$). But the polynomial $\phi_{n+1}(x)$, once we
choose its normalisation so that the coefficient of xn+1 is 1 (recall that it must have degree
n + 1, so this coefficient cannot be zero), also has exactly n + 1 free parameters:

$$\phi_{n+1}(x) = c_0 + c_1 x + \dots + c_n x^n + x^{n+1}. \qquad (5.37)$$

Thus, the equations (5.36) are now $n+1$ linear equations for these $n+1$ coefficients, so they
can be solved by linear methods (inverting a matrix). Explicitly, the orthogonality condition is
$$\sum_{i=0}^{n} c_i\, h_{i+k} = -h_{n+1+k}\,, \qquad k = 0, \dots, n \qquad (5.38)$$

where hi are the moments of the weight function:


$$h_i = \int_a^b w(x)\, x^i\,dx\,. \qquad (5.39)$$

Note that these moments hi do not depend on our choice of n (i.e. our choice of the number
of nodes n + 1). Once the weight w(x) and integral boundaries a, b, are known, we can
calculate them.
Eq. (5.38) gives a linear system of equations on the coefficients ci . Once ci ’s are found,
then ϕ(x) is known, and we just have to evaluate its zeros to find xi ’s. Thus we have
transformed the nonlinear problem on the nodes into a linear problem, and a nonlinear
problem of finding the roots of a polynomial.

Note that it is important to assume that w(x) is positive – thus w(x)dx is a measure –
otherwise it is not guaranteed that all xi are real (although we will not go into the mathe-
matical underpinning of this statement).
Once xi ’s are fixed, then one can determine Ai ’s from the linear equation (5.34). In fact,
this is an overdetermined set of linear equations, as there are n + 1 parameters Ai , but there
are 2n + 2 equations. Of course, we’ve already fixed the nodes xi in such a way that the full
set of equations can be solved, so it is sufficient to concentrate on the first n + 1 equations,
in order to determine the Ai ’s. In terms of the moments, these first n + 1 equations take the
form
$$h_m = \sum_{i=0}^{n} A_i\, x_i^m\,, \qquad m = 0, \dots, n. \qquad (5.40)$$

So, overall, the steps are:

1. Calculate the moments (5.39);

2. Solve the linear system (5.38) and construct the polynomial (5.37);

3. Find the nodes xi by solving (5.35);

4. Find the approximation weights by solving (5.40).

Example: w = 1, a = −1, b = 1
Step 1. Compute hi :
$$h_i = \int_{-1}^{1} x^i\,dx = \frac{(-1)^i + 1}{i+1} \qquad (5.41)$$
so that for instance:
$$h_0 = 2\,, \quad h_1 = 0\,, \quad h_2 = \frac{2}{3}\,, \quad h_3 = 0\,, \quad h_4 = \frac{2}{5}\,. \qquad (5.42)$$

Step 2. Form the linear system (5.38) for the coefficients $c_i$ of the polynomial (5.37). This depends on our choice of $n$, so for instance:
$$n = 0: \quad h_0 c_0 = -h_1 \;\Rightarrow\; c_0 = 0 \;\Rightarrow\; \phi_1 = 0 + x \qquad (5.43)$$
$$n = 1: \quad \begin{cases} h_0 c_0 + h_1 c_1 = -h_2 \\ h_1 c_0 + h_2 c_1 = -h_3 \end{cases} \;\Rightarrow\; \begin{cases} c_0 = -\tfrac{1}{3} \\ c_1 = 0 \end{cases} \;\Rightarrow\; \phi_2 = -\frac{1}{3} + x^2 \qquad (5.44)$$
$$n = 2: \quad \begin{cases} h_0 c_0 + h_1 c_1 + h_2 c_2 = -h_3 \\ h_1 c_0 + h_2 c_1 + h_3 c_2 = -h_4 \\ h_2 c_0 + h_3 c_1 + h_4 c_2 = -h_5 \end{cases} \;\Rightarrow\; \begin{cases} c_0 = 0 \\ c_1 = -\tfrac{3}{5} \\ c_2 = 0 \end{cases} \;\Rightarrow\; \phi_3 = -\frac{3}{5}x + x^3 \qquad (5.45)$$

One can also find
$$\phi_4 = x^4 - \frac{6}{7}x^2 + \frac{3}{35}\,, \qquad \phi_5 = x^5 - \frac{10}{9}x^3 + \frac{5}{21}x\,. \qquad (5.46)$$
Those polynomials are the Legendre polynomials (here normalised to be monic), and the general formula for them is:
$$\phi_n(x) = \frac{n!}{(2n)!} \frac{\partial^n}{\partial x^n} (x^2 - 1)^n. \qquad (5.47)$$

From this form it is easy to see that
$$\int_{-1}^{1} x^k\,\phi_n(x)\,dx = 0\,, \qquad k = 0, \dots, n-1\,; \qquad (5.48)$$

one should simply integrate by parts k + 1 times.

Step 3. Find the zeros of $\phi_{n+1}$. Again this depends on $n$, and this gives the nodes we are looking for:
$$n = 0: \quad x_0 = 0 \qquad (5.49)$$
$$n = 1: \quad x_0 = -\sqrt{1/3}\,, \quad x_1 = +\sqrt{1/3} \qquad (5.50)$$
$$n = 2: \quad x_0 = -\sqrt{3/5}\,, \quad x_1 = 0\,, \quad x_2 = +\sqrt{3/5} \qquad (5.51)$$
$$n = 3: \quad x_0 = -0.861136\,, \quad x_1 = -0.339981\,, \quad x_2 = 0.339981\,, \quad x_3 = 0.861136 \qquad (5.52)$$

Step 4. Find the weights $A_i$ from
$$h_m = \sum_{i=0}^{n} A_i\, x_i^m\,, \qquad m = 0, \dots, n \qquad (5.53)$$
$$n = 0: \quad A_0 = 2 \qquad (5.54)$$
$$n = 1: \quad \begin{cases} A_0 + A_1 = 2 \\ -\dfrac{A_0}{\sqrt 3} + \dfrac{A_1}{\sqrt 3} = 0 \end{cases} \;\Rightarrow\; A_0 = A_1 = 1 \qquad (5.55)$$
$$n = 2: \quad \begin{cases} A_0 + A_1 + A_2 = 2 \\ -\sqrt{\tfrac{3}{5}}\,A_0 + \sqrt{\tfrac{3}{5}}\,A_2 = 0 \\ \tfrac{3}{5}A_0 + \tfrac{3}{5}A_2 = \tfrac{2}{3} \end{cases} \;\Rightarrow\; A_0 = A_2 = \frac{5}{9}\,, \quad A_1 = \frac{8}{9}\,. \qquad (5.56)$$

After that we are ready to approximate the integral by
$$\int_{-1}^{1} f(x)\,dx \;\approx\; \sum_{i=0}^{n} A_i\, f(x_i)\,. \qquad (5.57)$$
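As a quick check of the n = 2 rule derived above (our own snippet, separate from the implementation in the next subsection):

from math import sqrt, exp

# 3-point Gauss-Legendre rule on [-1, 1]: nodes +-sqrt(3/5), 0 and weights 5/9, 8/9, 5/9
nodes   = [-sqrt(3/5), 0.0, sqrt(3/5)]
weights = [5/9, 8/9, 5/9]

approx = sum(w*exp(x) for w, x in zip(weights, nodes))
exact  = exp(1) - exp(-1)
print(approx, exact)   # they agree to roughly four digits with only 3 evaluations of f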

5.4.1 Python implementation of Gaussian Integration


Here we implement a numerical method for the evaluation of the weights Ai and positions of
the nodal points xi . Note that for a given measure and the integration interval this data has
to be evaluated only once. After that you can apply this information to efficiently compute
numerous integrals. First, we define some measure function w:
import math
from numpy import *

a, b = -1, +1
def w(x):
    return math.exp(-x**2)

In order to find xi we need to solve the linear system (5.38) which is built out of moments
hi . These must be evaluated by integrating – so we need to know some integrals, in order to
construct our approximate integration method! It is convenient to choose w(x) so that the
moments $h_i$ can be evaluated analytically. This is the case in the example done above. It is
also the case for the weight $w(x) = e^{-x^2}$ chosen here; on the interval $[-1, 1]$, the resulting
integrals can be expressed in terms of special functions known as “error functions” (the
theory of special functions is very useful for the Gaussian integration method). But here, to
save us the analytical work, we will instead use an integrator that is already in Python: the
integrate.quad method from the scipy package
from scipy import integrate

def h(k):
    return integrate.quad(lambda x: w(x)*x**k, a, b)[0]
(check the Python documentation if you’d like to know more about the integrate structure
and its .quad integration method!) We could have used, instead, one of our previous inte-
gration methods, such as Simpson’s method. There is still an advantage here, as once these
few moments are evaluated, to as high a precision as we require, then we can evaluate to a
very high precision many more functions – we build on other, perhaps less strong methods,
to get the very precise Gaussian method.
After that we build the linear system, solve it, and then find zeros of the function ϕn+1 (x),
all this in a single function:
def nodesAndWeights(n):
    # finding the x_i
    B = array([[None]*(n+1)]*(n+1))
    u = array([None]*(n+1))
    for k in range(n+1):
        u[k] = -h(n+1+k)
        for i in range(n+1):
            B[k,i] = h(k+i)
    cs = linalg.solve(B, u)
    cs = append(cs, [1])
    xs = roots(cs[::-1]).real
    xs.sort()
    # finding the A_i
    As = array([None]*(n+1))
    for k in range(n+1):
        u[k] = h(k)
        for i in range(n+1):
            B[k,i] = xs[i]**k
    As = linalg.solve(B, u)
    return xs, As
Some remarks:
• We use the special word None instead of 0. to fill the arrays initially – in fact this puts "nothing", just a placeholder to be filled in later on (in this way, when filled in with floating point numbers, we won't have the problem we would have had by putting 0 without the period afterwards).

• We use the linalg.solve method to solve the linear system that gives the coefficients
ci for the polynomial ϕ(x).
• We use the append function from numpy, instead of cs.append. The latter would not have worked because cs is an array object, not a list object, and arrays do not have an append method the way lists do. The numpy append is also more flexible, as it allows one to append a whole list; it takes as input lists or arrays, but returns an array, not a list; while a.append, for a list a, does not need numpy, only appends single elements, and modifies the list in place.
• We use the roots(L) method which finds the roots of the polynomial specified by
the coefficients that are in the list L. Importantly, the roots method considers the
coefficients in the opposite order from that we have chosen! That is, the first element
of the list is the coefficient of the highest-degree monomial xn+1 . So we need to invert
the list, with the Python syntax cs[::−1]. The roots method also gives the zeros
as complex numbers (as in general zeros of polynomials are complex numbers). Our
zeros are guaranteed to be real, mathematically; but numerically, as we know, there
are roundoff errors, so Python may well give us numbers with very small, but nonzero,
imaginary part, for instance of the order $10^{-16}$... In order to avoid any problem down
the line, we take the real part.
The Gaussian integration then can be done as follows
def gauss(f, A, x):
    return sum(A*array(list(map(f, x))))

The usage is the following


xs, As = nodesAndWeights(6)
gauss(lambda x: math.exp(-x), As, xs)

For smooth functions this method works very well. It has excellent precision given only a
few evaluations of the function f . Note that the various singularities of the function f can
be absorbed into the measure function w, so if you have many integrals to compute for the
functions with the same type of singularities it could make sense to use this method.

5.5 Monte-Carlo Method


This method has a number of advantages (precision, however, is not one of them!). It is very simple to implement
and it works equally well in any number of dimensions (i.e. for n-fold multiple integrals).
Recall that for the very basic trapezoidal method the error is $h^2 \sim 1/N^2$, where $N$ is
the number of probe points, i.e. how many times we had to evaluate our function $f$
at various points. So it converges to the correct integral, as $N$ becomes large, like $1/N^2$ –
this is sometimes referred to as the convergence rate. The convergence rate of the Monte-Carlo
method, as we will see, is only $1/\sqrt{N}$ (i.e. to get 4 digits of precision we will need
$1/\sqrt{N} \sim 10^{-4}$, or $N \sim 100\,000\,000$, whereas for the trapezoidal method we would need only
100 points to get the same precision). Still, despite this poor convergence rate, the method
is used widely, because it is so simple to implement and so versatile.
is used widely, because it is so simple to implement and so versatile.
The idea is very simple: assume we can randomly pick a point inside a rectangle R
of a given area S with a homogeneous distribution (meaning that all possible locations of


the point are equally probable). We assume that we can also easily test whether the point
is inside or outside some given region R1 ⊂ R of area S1 < S, which we would like to
estimate. Repeating this operation N times, and counting how many points n are inside our
domain S1 , we can estimate the area itself. Indeed, because of the assumed uniformity of
the random process, the number of points n should be proportional to the area S1 , so that,
by normalisation (when S1 = S, we have n = N )

$$\frac{S_1}{S} \approx \frac{n}{N}\,. \qquad (5.58)$$
In the picture we show the two complementary regions R1 ∪ R2 = R with areas S1 + S2 = S.
In (5.58), the left-hand side is the exact mathematical quantity we are looking for, and
the right-hand side is our numerical-method approximation. An important difference with
respect to previous integral approximations such as (5.22), (5.19) or (5.30), is that in (5.58),
the right-hand side is a random variable. That is, the numerical method does not give the
same value, for a given N , every time we repeat it. Numerical methods based on random
processes are more rare than those based on deterministic processes, but are getting more
and more popular. Because the method is based on a random process, the estimate of the
approximation error cannot be solely based on Taylor series expansions; it has to have a
probabilistic, statistical nature. We will not go into details of how to correctly define an
error, but instead just use basic aspects of statistics.
Intuitively the equation (5.58) is obvious. Below we will derive it and we will see that it
is indeed correct in the limit N → ∞. For finite (but large) N we will be able to estimate
the statistical error in this equation.

5.5.1 Estimating Statistical Error


To see more clearly that the equation (5.58) cannot be correct for finite N it is enough to
notice that the r.h.s. is a rational number whereas the l.h.s can be an arbitrary real number.
In the extreme case when N = 1 we either get n = 1 or n = 0 which will produce S1 = 0 or
S2 = 0.
The right question to ask here is: for given S1 and S2 what is the probability Pn (N ) that
we get exactly n points out of N inside the domain S1 ? To get the answer let’s first take

N = 1: then the answer is obvious, $P_{n=1}(1) = p_1$ and $P_{n=0}(1) = 1 - p_1 = p_2$, where we denote
$$p_1 = \frac{S_1}{S_1 + S_2}\,, \qquad p_2 = \frac{S_2}{S_1 + S_2}\,. \qquad (5.59)$$
Next, for N = 2 the probability of getting no points inside is Pn=0 (2) = (p2 )2 , and similarly
Pn=2 (2) = (p1 )2 . The probability of the remaining option to get just one point inside is simply
Pn=1 (2) = 1−Pn=0 (2)−Pn=2 (2) = 1−(p2 )2 −(p1 )2 = 1−(1−p1 )2 −(p1 )2 = 2p1 −2p21 = 2p1 p2 .
We can summarize that

Pn=0 (2) = (p2 )2 , Pn=1 (2) = 2p1 p2 , Pn=2 (2) = (p1 )2 . (5.60)

The factor 2 in Pn=1 (2) simply indicates that there are two possibilities to get one point
inside – either the first point is inside and the second outside, or the first point is outside and the second inside. In general, for the total number of points $N$ with $n$ points inside, the number of such combinations is given by the binomial coefficient $\binom{N}{n} = \frac{N!}{n!(N-n)!}$, which indeed gives $\binom{2}{1} = \frac{2!}{1!\,1!} = 2$. Thus the general formula is
$$P_n(N) = \binom{N}{n}\, p_1^n\, p_2^{N-n}\,. \qquad (5.61)$$
Recall that we are looking for the random variable
$$\alpha = \frac{n}{N}\,. \qquad (5.62)$$
The probability distribution (5.61) tells us how this is distributed. For our numerical method to work, we want the average value to be equal to $S_1/S$:
$$E[\alpha] = \frac{S_1}{S}\,. \qquad (5.63)$$
Then, the error in this numerical approximation will be identified with the standard deviation of $n/N$,
$$\sigma = \sqrt{E[\alpha^2] - E[\alpha]^2}\,. \qquad (5.64)$$
In fact, as we want the area $S_1$, we have to multiply by $S$, so the error estimate is
$$E_{\rm approx} = \sigma S\,. \qquad (5.65)$$

We first evaluate the average and standard deviation. The trick is to evaluate explicitly
the “generating function of cumulants”:
$$E\!\left[e^{\lambda\alpha}\right] = \sum_{n=0}^{N} P_n(N)\, e^{\lambda n/N} = \sum_{n=0}^{N} \binom{N}{n} (e^{\lambda/N} p_1)^n\, p_2^{N-n} = (e^{\lambda/N} p_1 + p_2)^N. \qquad (5.66)$$

Expanding in powers of λ, we have


$$\log E\!\left[e^{\lambda\alpha}\right] = \log\!\left(1 + \lambda E[\alpha] + \frac{\lambda^2}{2} E[\alpha^2] + \dots\right) = \lambda E[\alpha] + \frac{\lambda^2}{2}\sigma^2 + \dots \qquad (5.67)$$
Expanding the log of the rhs of (5.66) in powers of λ, we find (recall that p1 + p2 = 1)
$$N \log\!\left(1 + \frac{\lambda p_1}{N} + \frac{\lambda^2 p_1}{2N^2} + \dots\right) = \lambda p_1 + \frac{\lambda^2 p_1(1-p_1)}{2N} + \dots \qquad (5.68)$$
Figure 5.3: Probability Pn=αN (N ) given by (5.61) for p1 = 0.4 computed for N = 10 (left),
N = 100 (center), N = 1000 (right) as a function of α = n/N . For large N the probability
distribution approaches the normal distribution given by (5.74) with the maximum at n/N =
p1 .

We deduce
$$E[\alpha] = p_1\,, \qquad \sigma = \sqrt{\frac{p_1(1-p_1)}{N}} = \sqrt{\frac{p_1 p_2}{N}}\,. \qquad (5.69)$$
Therefore we find the correct average (5.63), and we find that the numerical error (the standard deviation) is
$$E_{\rm approx} = \sqrt{\frac{p_1(1-p_1)}{N}}\, S = \sqrt{\frac{p_1 p_2}{N}}\, S = \sqrt{\frac{S_1 S_2}{N}}\,. \qquad (5.70)$$

As claimed, the numerical error decreases like $1/\sqrt{N}$.
Let us try to get a bit more intuition about the probability distribution itself, when N
becomes large. From (5.66), we can take the limit $N \to \infty$ on the rhs and we find, using the
standard formula for the exponential, $\lim_{u\to 0}(1 + au)^{1/u} = e^a$,
$$\lim_{N\to\infty} E\!\left[e^{\lambda\alpha}\right] = \lim_{N\to\infty} \left(1 + (e^{\lambda/N} - 1)\, p_1\right)^N = \lim_{N\to\infty} \left(1 + \lambda p_1/N\right)^N = e^{\lambda p_1}. \qquad (5.71)$$

This means that $E[\alpha^n] = p_1^n$ for all $n$, and therefore $\alpha$ is for sure given by $p_1$, without any fluctuations. This is in agreement with the decay as $1/\sqrt{N}$ of the standard deviation: at large $N$, the numerical approximation $\alpha$ gives the correct mathematical value with probability 1.
That is, for given S1 and S2 , for any N there is a nonzero probability to be extremely
unlucky and get no points inside at all, for instance; however for large N this probability is
exponentially suppressed, and we will get α = p1 or very nearby with high probability.
Can we see more explicitly that the probability distribution becomes peaked at α = p1
when $N \to \infty$? To see this we have to use Stirling's approximation of the factorial, $m! \sim m^m e^{-m}$ (up to slowly varying factors), valid for large $m$. Using it we find:
$$P_n(N) \simeq e^{N g(\alpha)}\,, \qquad g(\alpha) \equiv -\alpha\log\alpha - (1-\alpha)\log(1-\alpha) + \alpha\log p_1 + (1-\alpha)\log p_2\,. \qquad (5.72)$$

The function g(α) is negative and has maximum at α = p1 (which can be found from
g ′ (α) = 0) see Fig.5.4. As we see from Fig.5.3 the only relevant region is around α = p1
as the probability decreases fast away from this point. Computing the Taylor series around
α = p1 we find
$$g(\alpha) = 0 + \bar\alpha \times 0 - \frac{\bar\alpha^2}{2 p_1 p_2} + O(\bar\alpha^3)\,, \qquad \bar\alpha = \alpha - p_1\,, \qquad (5.73)$$

Figure 5.4: Probability Pn=αN (N ) for large N becomes eN g(α) . The function g(α) defined in
(5.72) is maximal at α = p1 .

thus around the maximum the probability becomes
$$P_n(N) \simeq e^{-\frac{N}{2 p_1 p_2}\left(\frac{n}{N} - p_1\right)^2} \qquad (5.74)$$

which is the normal distribution for $\alpha = n/N$ with mean at $p_1$ and with variance $\sigma^2 = \frac{p_1 p_2}{N}$, in agreement with what we found above. We see that for large $N$ the variance decreases, the distribution becomes very narrow, and as a result we can be rather confident that our $n$ is near its mean value $p_1 N$. The uncertainty in $n$ is $\sim N\sigma = \sqrt{N p_1 p_2}$, i.e. we can write
$$n \simeq p_1 N \pm \sqrt{N p_1 p_2}\,. \qquad (5.75)$$
Thus, knowing $n$, we can estimate $p_1 = \frac{S_1}{S}$ by
$$p_1 \simeq \frac{n}{N} \pm \sqrt{\frac{p_1 p_2}{N}} \qquad (5.76)$$
from where we get
$$S_1 = p_1 S \simeq \frac{n}{N}\, S \pm \sqrt{\frac{p_1 p_2}{N}}\, S = S\,\frac{n}{N} \pm \sqrt{\frac{S_1 S_2}{N}} \qquad (5.77)$$
where the last term gives us the error estimate.

Monte-Carlo method implementation. In Python we generate the (pseudo-)random


numbers with homogeneous distribution in the range (a, b) by random.uniform(a,b). Using
this function it is straightforward to implement the method. Below we compute the area of
the unit circle, given by the inequality $x^2 + y^2 < 1$:
import random
from numpy import sqrt

n = 0        # set the counter of the points inside to zero
N = 10000    # total number of 'experiments'
for i in range(N):
    # x and y are uniform in the square [-1,1]^2 (area = S = 4)
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if (x**2 + y**2 < 1):   # inside
        n = n + 1           # increase the counter of the points inside
p1 = n/N                    # our estimate for the ratio of areas
S1 = p1*4                   # our estimate for the area S1 = p1*S
error = 4*sqrt(p1*(1 - p1))/sqrt(N)   # error of the approximation

which, for example, gives 3.122 and ∼ 0.01655 for the error estimate. This is indeed consis-
tent with the exact result 3.1416.
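Since the same approach works in any number of dimensions, here is a sketch of our own (not from the notes) estimating the volume of the unit ball in 3 dimensions by sampling points in the cube [-1, 1]^3 (exact value 4π/3 ≈ 4.19):

import random
from numpy import sqrt

N = 100000
n = 0
for i in range(N):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    z = random.uniform(-1, 1)
    if x**2 + y**2 + z**2 < 1:      # inside the unit ball
        n = n + 1
p1 = n/N
V  = p1*8                                # cube volume S = 2**3 = 8
error = 8*sqrt(p1*(1 - p1))/sqrt(N)      # the same error estimate (5.70), scaled by S
print(V, error)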

Chapter 6
Numerical solution of ordinary differential
equations

6.1 Initial value problems


An initial value problem is a problem where the differential equation for the unknown function
y(x), and the value(s) of the function, and possibly its derivatives, at a given point say a
(the “initial values” y(a), y ′ (a), etc.), are given. The problem is to find the full function
everywhere, or at least for all values “after” a, that is y(x) for all x > a.
We will see two types of such problems: first-order differential equations for a
vector of functions, and nth order differential equations for a single function. In fact, the
latter is just a special case of the former, so for numerical methods, we will concentrate on
the former.
The first-order problem is as follows. A vector of functions ⃗y (x) of a single variable x ∈ R
is determined by a differential equation
⃗y ′ (x) = F⃗ (x, ⃗y (x)) (6.1)
along with an initial condition
⃗y (a) = α
⃗ (6.2)
for some α⃗ . Then we must find ⃗y (x) for all greater values of x that is x > a. In (6.1), there
is a vector function F⃗ (x, ⃗y (x)) which depends in general on x and on the unknown vector
function ⃗y (x) evaluated at x. This fixes the differential equation, and is the most general
form of vector 1st order differential equations. For instance, for n-dimensional vectors,
 
$$\vec F(x, \vec y) = \begin{pmatrix} x y_0 - y_1 \\ y_0^2 \end{pmatrix} \quad\Rightarrow\quad y_0' = x y_0 - y_1\,, \quad y_1' = y_0^2 \qquad (6.3)$$
(here with indexing starting from 0).
The n-order problem for a single function y(x) is instead as follows. An ordinary differ-
ential equation of order n can be written as
y (n) = f (x, y, y ′ , . . . , y (n−1) ) (6.4)
and the initial condition must fix all derivatives up to, including, n − 1, so they are
$$y(a) = \alpha_0\,, \quad y'(a) = \alpha_1\,, \quad y''(a) = \alpha_2\,, \quad \dots, \quad y^{(n-1)}(a) = \alpha_{n-1}\,. \qquad (6.5)$$

Here f is a function that determines the differential equation. For instance, for n = 2,

f (x, y, y ′ ) = xy ′ + y ⇒ y ′′ = xy ′ + y. (6.6)

It turns out that the nth-order problem for a single function can always be transformed
into a 1st order problem for an n-dimensional vector. We do this by introducing auxiliary
functions
y0 = y , y1 = y ′ , y2 = y ′′ , . . . , yn−1 = y (n−1) (6.7)
and in these notations the 1st order differential equation becomes

$$y_0' = y_1 \qquad (6.8)$$
$$y_1' = y_2 \qquad (6.9)$$
$$\dots \qquad (6.10)$$
$$y_{n-1}' = f(x, y_0, y_1, \dots, y_{n-1})\,. \qquad (6.11)$$

That is, we can identify ⃗y := (y0 , y1 , . . . , yn−1 ) = (y, y ′ , y ′′ , . . . , y (n−1) ) as our unknown n-
dimensional vector of functions, and with the choice

F⃗ (x, ⃗y ) = (y1 , y2 , . . . , yn−1 , f (x, y0 , y1 , . . . , yn−1 )) (6.12)

the differential equation (6.4) is equivalent to (6.1) (and the initial conditions also match).
For the numerical implementation, we will assume that this transformation has been
done, and concentrate on the vectorial first order problem.

6.1.1 Euler’s Method


The main idea to numerically solve a 1st order differential equation is to write an approxima-
tion for the derivative, thus leading to an approximation of the equation itself. For the Euler
method, we take the most direct approximation possible: we introduce a small parameter
h > 0, and discretise the x variable to approximate

⃗y (x + h) − ⃗y (x)
⃗y ′ (x) ≈ . (6.13)
h
Using this in (6.1), we get the approximate formulation of the problem, which can be put
on the computer. In order not to confuse the true mathematical solution to the original
problem, with the numerical solution, I’ll use different notations: ytrue for the solution to the
original problem, and y for the solution to the approximate problem. Thus we have

$$\vec y_{\rm true}{}'(x) = \vec F(x, \vec y_{\rm true}(x))\,, \qquad \vec y_{\rm true}(a) = \vec\alpha \qquad (6.14)$$

for the mathematical problem, and

$$\frac{\vec y(x+h) - \vec y(x)}{h} = \vec F(x, \vec y(x))\,, \qquad \vec y(a) = \vec\alpha \qquad (6.15)$$
for the approximate problem, and we expect ⃗ytrue (x) ≈ y(x) for all x; clearly we have an
exact equality for x = a, by definition.

The main observation is that eq. (6.15) can be solved iteratively for all ⃗y (a + ih), for
i = 1, 2, 3, 4, . . .. In fact, it does not fix ⃗y (x) for all x, but only on the grid

xi = a + ih. (6.16)

But this is OK: if the grid is tight enough ($h$ is small enough) it will give a good approximation,
and we can use our interpolation methods to fill in the gaps between the grid points.
In order to see this, we introduce the notation

⃗y (a + ih) ≜ ⃗yi , i = 0, 1, 2, . . . (6.17)

Note that the initial condition is


$$\vec y_0 = \vec\alpha\,. \qquad (6.18)$$
Then we can rewrite the approximate equation (6.15) as a recursion on ⃗yi

⃗yi+1 = ⃗yi + hF⃗ (a + ih, ⃗yi ) . (6.19)

This is it: the numerical algorithm simply computes ⃗y1 from the right-hand side, which only
depends on the known $\vec y_0 = \vec\alpha$; then it computes $\vec y_2$, which only requires the now-known $\vec y_1$,
etc.

6.1.2 Euler’s Method: Python implementation


In our implementation, we've decided to make it in such a way that the function $\vec F(x, \vec y)$
is globally defined, and called by the euler method which solves the vectorial 1st order
differential equation. So, to fix the problem you’d like to solve, you have to define globally
a function F⃗ (x, ⃗y ).
For instance, for the nth order problem for a single function y(x), as determined by
f (x, ⃗y ), we have to implement (6.12). This can be done as follows:
def F(x, yvec):
    Fres = zeros(yvec.size)
    for i in range(yvec.size - 1):
        Fres[i] = yvec[i+1]
    Fres[yvec.size - 1] = f(x, yvec)
    return Fres

Here, f (x, ⃗y ) must be defined globally.


We use the quantity size which is part of any array object, and which returns the total
number of elements it contains. For a vector, a.size is the same as len(a), which is what is
relevant here as yvec is a vector, so we could have used the len syntax. But as a side note,
for a matrix a.size returns the total number of elements, while len(a) just returns the
number of rows (which is the number of elements of the outer list). Other useful attributes of
an array are the number of dimensions ndim and the shape shape (the various dimensions).
We use zeros(d), which just creates a d-dimensional vector of zeros. If d is actually a
list, then it creates an array with dimensions as per the list (for a 3 by 3 matrix, d=[3,3]),
all entries being zeros. We use this below.
Now that F⃗ (x, ⃗y ) is defined, we implement the Euler method. It takes as entries the vector
α
⃗ of initial values, the starting point a, the end-point b, and the number N of intervals; each

interval has a length h = (b − a)/N . It evaluates the result ⃗yi = ⃗y (xi ) for all i = 0 (that
is x0 = a, so this is already given), 1, 2, 3, . . . , N inclusively, with xN = b the last point
given. It returns both the grid xi ’s, and the results ⃗yi ’s. In the memory, ⃗yi ’s are N + 1
vectors organised as a matrix N + 1 by n; recall that n is the size of each vector ⃗yi , and
in the algorithm, this is determined as the size of α ⃗ (so we don’t need another entry in the
function).
def euler(alpha, a, b, N):
    h = (b - a)/N
    ys = zeros((N+1, alpha.size))
    ys[0] = alpha
    xs = arange(a, b + h, h)
    for i in range(N):
        ys[i+1] = ys[i] + h*F(a + i*h, ys[i])
    return xs, ys
Usage example: We consider y ′ = x with the initial data y(0) = 1. Note how in this code
we use ys[:,0] in order to access the zeroth index of all vectors ⃗yi , that is the elements ⃗yi0 ,
for all i, in order to plot them all. This means we only plot the result for the function itself,
not its derivative (here the “vectors” ⃗yi are one-dimensional, so this is trivial, but in other
examples below this becomes important).
from numpy import *
def f(x, yvec):
    return x
alpha = array([1])
xs, ys = euler(alpha, 0, 1, 1000)

# plot of the result
%matplotlib inline
import matplotlib.pylab as plt
plt.plot(xs, ys[:,0])
plt.plot(xs, 1 + xs**2/2)
plt.show()

[Plot: the Euler-method solution ys[:,0] together with the exact solution 1 + x^2/2 on the interval [0, 1].]

Another example y ′′ = −y, and initial data y(0) = 0 and y ′ (0) = 1, whose analytical solution
is sin(x), against which we compare:
def f(x, yvec):
    return -yvec[0]
alpha = array([0, 1])

xs, ys = euler(alpha, 0, 30, 2000)

plt.plot(xs, ys[:,0])
plt.plot(xs, sin(xs))
plt.show()

[Plot: the Euler-method solution of y'' = -y together with sin(x) on [0, 30]; the amplitude of the numerical curve visibly grows with x.]

we clearly see that the error accumulates! In the next section we will see how to improve
the situation.

6.1.3 Euler Method: error estimation


Let us briefly estimate the error of the Euler Method. The error comes from (6.13), which
is accurate to O(h) only:
$$\vec y{\,}'(x) = \frac{\vec y(x+h) - \vec y(x)}{h} + O(h)\,, \qquad (6.20)$$
which then results in the correction to (6.19)

⃗yi+1 = ⃗yi + hF⃗ (a + ih, ⃗yi ) + O(h2 ). (6.21)

Thus the error for a single step of the recursion is of order h2 . Much like for integration, such
errors accumulate as we perform many steps of (6.21) – although here it is more difficult
to make an accurate bound for the error estimate. Nevertheless, we can estimate the total
error similarly. If we make $N$ steps, say in order to reach the point $b$, the error becomes
$O(Nh^2)$, and since $N = (b-a)/h$ we get an error of order $O(h)$ at the end. The choice
of b – here the end-point of the interval we used in the numerical algorithm – is arbitrary.
The important point is that for any fixed position x that is a finite distance away from a,
we will need to perform (x − a)/h steps, and as h becomes smaller, this grows like 1/h. So

for any such x, the error is expected to be O(h). For larger x’s, the coefficient of h in this
error estimate will be larger, but still going like h.
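As a quick numerical check of this O(h) scaling (our own snippet, assuming the euler and F functions of section 6.1.2 are defined and numpy is imported as above), we can redo the y' = x example and watch the error at x = 1:

def f(x, yvec):        # y' = x, exact solution y(x) = 1 + x**2/2
    return x
alpha = array([1.0])

for N in (100, 200, 400):
    xs, ys = euler(alpha, 0, 1, N)
    print(N, abs(ys[-1,0] - 1.5))   # the error at x = 1 roughly halves each time N doubles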
So, to increase the precision by one digit we will have to make 10 times more iterations
which will make the whole calculation 10 times slower. That is not great, so
we will deal with the error in a smarter way in the next section.

6.1.4 Runge-Kutta Method


The main problem with the method described above is the approximation for the derivative
– this approximation gives a relatively large error. We could try to use more precise approx-
imations of the derivative, but recall from chapter 4 that this would involve evaluating the
function not just at two points, but at more points (3,4,5,...). Thus the resulting equation
would not be a simple recursion relation where ⃗yi+1 can be evaluated from ⃗yi . We need to
do something more clever.
Instead, we re-approximate the full equation as follows. We express ⃗y (x + h) in terms of
⃗y (x) and its derivatives by Taylor series, and for the derivatives we use the exact equation
(6.1) in order to re-express them in terms of ⃗y (x). That is,
$$\vec y(x+h) = \vec y(x) + \vec y{\,}'(x)\,h + \frac{1}{2}\vec y{\,}''(x)\,h^2 + O(h^3) \qquad (6.22)$$
which gives
$$\vec y(x+h) = \vec y(x) + \vec F(x, \vec y)\,h + \frac{h^2}{2}\frac{d\vec F(x,\vec y)}{dx} + O(h^3). \qquad (6.23)$$
Note that on the right-hand side it is the total x derivative that appears; this involves the
partial x and partial ⃗y derivatives. Here we assume that we don’t have analytical information
about these partial derivatives; we just have F⃗ (x, ⃗y ) given in an algorithmic fashion. So we
need to further approximate this for numerical evaluation. Now this total x derivative can
be approximated using finite-differences:
$$\frac{d\vec F(x,\vec y)}{dx} = \frac{\vec F(x+h,\ \vec y(x+h)) - \vec F(x,\ \vec y(x))}{h} + O(h). \qquad (6.24)$$
The correction O(h) here will lead to O(h3 ) in (6.23), so this is consistent. Here, on the
right-hand side, we still have ⃗y (x + h) that appears, so this does not give a nice recursion
relation – we would have to invert in order to extract ⃗y (x + h) and put it all on the left-hand
side, but as we don’t have analytical information about F⃗ (x, ⃗y ) this inversion is difficult.
But in there, ⃗y (x + h) can be approximated like in the Euler method, as the correction O(h)
will again lead to higher order corrections in the full equation. So we have
$$\frac{h^2}{2}\frac{d\vec F(x,\vec y)}{dx} = \frac{h}{2}\left[ \vec F\big(x+h,\ \vec y(x) + h\vec F(x,\vec y(x))\big) - \vec F(x,\vec y(x)) \right] + O(h^3). \qquad (6.25)$$
Combining all the pieces together and replacing y(x) → yi and y(x + h) → yi+1 , their
evaluation on the grid x = xi = a + ih, we get the improved recursion relation
$$\vec y_{i+1} = \vec y_i + \frac{h}{2}\,\vec F(a+ih,\ \vec y_i) + \frac{h}{2}\,\vec F\big(a+ih+h,\ \vec y_i + h\vec F(a+ih, \vec y_i)\big)\,. \qquad (6.26)$$
The error is O(h3 ), and in total, after accumulation of O(1/h) of such errors to evaluate the
numerical approximation at index i ∝ 1/h corresponding to position x > a, we get an error
O(h2 ).

6.1.5 Runge-Kutta Method: Python implementation
We only have to modify one line in the Euler method
def rungekutta(alpha, a, b, N):
    h = (b - a)/N
    ys = zeros((N+1, alpha.size))
    ys[0] = alpha
    xs = arange(a, b + h, h)
    for i in range(N):
        ys[i+1] = ys[i] + h/2*(F(a + i*h, ys[i])
                  + F(a + i*h + h, ys[i] + h*F(a + i*h, ys[i])))
    return xs, ys

repeating the same calculation with the new method


from numpy import *
def f(x, yvec):
    return -yvec[0]
alpha = array([0, 1])

xs, ys = rungekutta(alpha, 0, 30, 2000)

plt.plot(xs, ys[:,0])
plt.plot(xs, sin(xs))
plt.show()

[Plot: the second order Runge-Kutta solution of y'' = -y together with sin(x) on [0, 30]; the two curves now lie on top of each other.]

we get a result which agrees excellently with the analytical solution this time!

6.1.6 4th order Runge-Kutta Method


The main idea of the Runge-Kutta method, developed around 1900, was a simple modifi-
cation of the recursion relation. It can easily be extended to higher order. Here we give,
without derivation, the 4th order formula.

It is convenient to introduce the following notation:
$$\vec K_0 = h\vec F(x, \vec y) \qquad (6.27)$$
$$\vec K_1 = h\vec F(x + h/2,\ \vec y + \vec K_0/2) \qquad (6.28)$$
$$\vec K_2 = h\vec F(x + h/2,\ \vec y + \vec K_1/2) \qquad (6.29)$$
$$\vec K_3 = h\vec F(x + h,\ \vec y + \vec K_2) \qquad (6.30)$$
Using this notation one can show that
$$\vec y(x) - \vec y(x+h) + \frac{1}{6}\left(\vec K_0 + 2\vec K_1 + 2\vec K_2 + \vec K_3\right) = O(h^5). \qquad (6.31)$$
Thus the accumulated error will be O(h4 ).
We can easily modify the previous code with a new recursion relation:
def rungekutta4(alpha, a, b, N):
    h = (b - a)/N
    n = alpha.size
    xs = arange(a, b + h/2, h)
    ys = zeros((N+1, n))
    ys[0] = alpha          # starting point for the recursion
    for i in range(0, N):
        xi = a + i*h
        K0 = h*F(xi, ys[i])
        K1 = h*F(xi + h/2, ys[i] + K0/2)
        K2 = h*F(xi + h/2, ys[i] + K1/2)
        K3 = h*F(xi + h, ys[i] + K2)
        ys[i+1] = ys[i] + 1/6*(K0 + 2*K1 + 2*K2 + K3)
    return xs, ys
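A short usage sketch of our own, repeating the y'' = -y example (it assumes f, alpha and F are defined as in the previous subsection):

xs, ys = rungekutta4(alpha, 0, 30, 2000)
print(abs(ys[-1,0] - sin(30)))   # with the same step size, the error should come out
                                 # several orders of magnitude smaller than for the
                                 # 2nd order method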

6.2 Boundary problem


In the methods above we have complete information needed to fix uniquely the solution of
an o.d.e. at one initial point. The problem becomes a bit more complicated in the case
when the data is distributed between two different points. A particular example which we
will consider here is the second order differential equation with two boundary conditions:

y ′′ = f (x, y, y ′ ) , y(a) = α , y(b) = β . (6.32)

We know that a second order equation has a two-parameter family of solutions (for a linear equation, two independent solutions). The
boundary conditions y(a) = α and y(b) = β should fix a unique solution.

6.2.1 Shooting method


The idea of the Shooting method is to use one of the methods for the initial value problem,
where one should specify y(a) = α and also fix some value for the derivative y ′ (a) = u. (The
Shooting method gets its name from the idea of 'shooting' out in different directions until we
find a trajectory that matches our boundary conditions.) Then we solve the equation and

Figure 6.1: Shooting method: different probe trajectories (dashed) try to meet the bound-
ary conditions; the solid line successfully meets the boundary conditions.

find in particular y(b), however, for some randomly picked value for u most likely one gets
y(b) ≠ β. The strategy is to tune the value of u so that we get as close as possible to the
desired value of y(b). In other words, y(b) defines a function of u: g(u) = y(b), and we
have to find u such that G(u) ≡ g(u) − β = 0. This maps our problem to the problem of
finding a zero of a function, which can be solved by some iterative method such as the Secant
method.¹
To summarize, the main steps in the method are:

• Specify the starting values u0 and u1 ; and

• Apply the Secant method to solve G(u) ≡ g(u) − β = 0, where g(u) is determined as the
value of y(b) obtained with the initial conditions y(a) = α, y′(a) = u.

6.2.2 Python implementation of the Shooting method


We will use the secant method implementation from these lecture notes and the rungekutta
from the previous section. Assuming they are copied we add the following code
alpha = 1
beta = 0
def G(u):   # defining G(u) = y(b) - beta
    xs, ys = rungekutta(array([alpha, u]), 0, 30, 2000)
    return ys[-1,0] - beta

next we find the value of u which gives G(u) = 0:


u0 = secant(G, 0, 1)
¹ The Newton method requires knowledge of the derivative G′(u), which one can in principle find using
finite differences, at the price of losing precision.

then we can plot the result to make sure we indeed found the solution interpolating between
α = 1 and β = 0:
xs, ys = rungekutta(array([alpha, u0]), 0, 30, 2000)
plt.plot(xs, ys[:,0])
plt.show()

Chapter 7
Stochastic differential equations

Stochastic differential equations are like usual differential equations with some noise taken
into account. The typical example is free particle motion in empty space, given by
Newton's 2nd law
$$m\ddot{x} = 0 \qquad (7.1)$$
which is an ordinary differential equation. Now imagine a particle going through some cloud
of dust (tiny particles) – it will suffer irregular collisions affecting the trajectory
of the particle in some unpredictable way. This can also be described by Newton’s 2nd law,
which includes some random force f (t)

mẍ = f (t) . (7.2)

By random force we understand some random variable with certain distribution (if the
particles in the cloud are small the expectation value of f (t) is small etc.).
Interestingly, very similar equations can describe the financial market, where the role of the dust cloud of particles is played by buyers and sellers. The general form of stochastic differential equation we will consider is
$$dx(t) = a(x)\,dt + b(x)\,dW(t) \qquad (7.3)$$

where a and b are some fixed functions and dW (t) is a random “force” represented by the
Wiener process. What this equation means could become clearer in the discrete version:

xn+1 = xn + a(xn )∆t + b(xn )∆Wn (7.4)

where ∆Wn = N (0, ∆t) is a random variable with normal distribution and with expectation
value 0 and variance ∆t.

7.1 Euler-Maruyama method


The Euler-Maruyama method consists of the literal implementation of the recursion relation
(7.4) where the random variable ∆Wn is replaced by a pseudo-randomly generated number
as we describe below. We will consider first the simple example of Brownian motion with
a(x) = 0 and constant b(x) = σ
def brownian(T, sigma):
    dt = 0.001
    ts = arange(0, T, dt)
    N = ts.size
    xs = zeros(N)
    xs[0] = 0
    sdt = sqrt(1*dt)
    for i in arange(1, N):
        xs[i] = xs[i-1] + sigma*random.normal(0, sdt)
    return ts, xs

Note that random.normal generates a random variable with normal distribution, where the
first argument is the average, and the second the standard deviation; here the variance is
dt, so we need to take the square root. This will generate a rather convincing picture of
Brownian motion
%matplotlib inline
import matplotlib.pyplot as plt
ts, xs = brownian(1, 1)
plt.plot(ts, xs)
plt.show()

Next we can check that the random variable x(T) follows the normal distribution
$$p(x) = \frac{e^{-\frac{x^2}{2T}}}{\sqrt{2\pi T}}\,.$$
For that we will run the simulation many times:
M = 100
ys = zeros(M)
T = 1
for i in range(M):
    ts, xs = brownian(T, 1)
    ys[i] = xs[-1]
    plt.plot(ts, xs, "b")
plt.show()

The question is how to convert the set of final points ys into a nice smooth distribution.
Fortunately in the scipy package there is a suitable function gaussian_kde which converts
a set of points into a smooth distribution as shown below
from scipy.stats import gaussian_kde
kde = gaussian_kde(ys)
x = arange(-5, 5, 0.1)
plt.plot(x, kde(x))
plt.plot(x, exp(-x*x/2/T)/sqrt(2*pi*T))
plt.show()

[Plot: the kernel density estimate of the endpoints x(T) overlaid on the Gaussian density p(x).]
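For a non-trivial drift the recursion (7.4) is just as easy to implement; here is a sketch of our own (not from the notes), with the coefficients a and b passed in as functions and numpy imported with from numpy import * as above:

def euler_maruyama(a, b, x0, T, dt=0.001):
    ts = arange(0, T, dt)
    xs = zeros(ts.size)
    xs[0] = x0
    for i in arange(1, ts.size):
        dW = random.normal(0, sqrt(dt))                  # Delta W_n ~ N(0, dt)
        xs[i] = xs[i-1] + a(xs[i-1])*dt + b(xs[i-1])*dW  # the recursion (7.4)
    return ts, xs

# example: a mean-reverting process with a(x) = -x and constant noise b(x) = 1
ts, xs = euler_maruyama(lambda x: -x, lambda x: 1.0, 0.0, 1.0)
plt.plot(ts, xs)
plt.show()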

Chapter 8
Partial Differential Equations

In this chapter we briefly discuss what can be done to numerically solve partial differential
equations.

8.1 Relaxation Method for the Laplace equation


Unfortunately the Relaxation Method does not involve getting a pastry and going for a quick
snooze. Instead it involves using iterative methods for solving systems of equations (so, not
sure where the 'relaxation' bit came from...). We mainly concentrate on the Laplace equation

∂x2 f (x, y) + ∂y2 f (x, y) = 0 . (8.1)

The type of problems we may like to solve are those with boundary conditions specified for
the function f (x, y). For simplicity we assume that we are studying the equation (8.1) in
the rectangular domain x ∈ [a, b] and y ∈ [c, d]. Using the approximation of the second
derivative:
$$\partial_x^2 f(x,y) \approx \frac{f(x+h,y) + f(x-h,y) - 2f(x,y)}{h^2}\,, \qquad \partial_y^2 f(x,y) \approx \frac{f(x,y+h) + f(x,y-h) - 2f(x,y)}{h^2}\,,$$
we get
$$f(x+h,y) + f(x-h,y) + f(x,y+h) + f(x,y-h) - 4f(x,y) = 0\,,$$
or in other words
$$f(x,y) = \frac{f(x+h,y) + f(x-h,y) + f(x,y+h) + f(x,y-h)}{4}\,.$$
So we can see that the equation is satisfied if the value f (x, y) at some point of the lattice
is equal to the average of the values at the neighboring nodes. This gives the idea for the
algorithm – we scan through the lattice, replacing the value at each node with the mean
of the values of f at the neighboring nodes from the previous iteration.

8.1.1 Python Implementation


The method itself is very easy to implement:

def relaxation(ts):
    res = ts.copy()
    for x in range(1, sizex - 1):
        for y in range(1, sizey - 1):
            res[x,y] = (ts[x+1,y] + ts[x-1,y] + ts[x,y+1] + ts[x,y-1])/4
    return res

The line res = ts.copy() copies ts into res, instead of just attributing the variable res to
the same array as ts (which is what res = ts, or even the slice res = ts[:], would have done for a
numpy array, since slicing gives a view). Here this is absolutely
essential, otherwise the iteration, in the double for loop, would act on elements of the same
matrix and there would be interference between the different iteration steps, leading to the wrong result!
We apply it to the simulation of a temperature field inside a room with one door and one
window
sizex = 14   # size of the room
sizey = 16
ts = array([[20.]*sizey]*sizex)   # setting some initial values
for i in range(1, 7):
    ts[0,i] = 10          # setting boundary values at the door
for i in range(8, 13):
    ts[sizex-1,i] = 0     # and the boundary values at the window

and we are ready to run the relaxation method:


for i in range(12):
    ts = relaxation(ts)
    plt.matshow(ts)     # show the field after each iteration (cf. Fig. 8.1)
    plt.colorbar()
plt.show()
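Instead of fixing the number of sweeps in advance, one can iterate until the update becomes small; a minimal sketch of our own (it assumes relaxation and ts as defined above, and numpy imported as *):

tol = 1e-3
while True:
    new = relaxation(ts)
    change = abs(new - ts).max()   # the largest change anywhere on the lattice
    ts = new
    if change < tol:
        break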

Figure 8.1: Iterations of the relaxation method. It is important to remem-
ber that the boundary is fixed (by definition of it being a boundary value problem), so
the outside perimeter of squares will never change colour. The above simulation computes
the temperature field inside a room (the rectangle), given the temperature of the window
(the blue bar below) and the door (the green bar on the top) and the fixed temperature of
the walls (the maroon perimeter). Iteratively we find the temperature field inside the room,
and thus the ideal placement for a bed to snooze in.

Chapter 9
Eigenvalue problem

We will mainly focus on symmetric matrices in this section.


The problem we are going to solve is formulated for some symmetric matrix A as follows

A⃗x = λ⃗x (9.1)

where ⃗x and λ are both unknown. The goal is to determine λ. ⃗x could also be of some
interest, but we will not consider it here. This is a very important problem with numerous
applications. For instance, some ODEs can be reduced to this form. Many problems in
Classical Mechanics and Quantum Mechanics can be reduced to the eigenvalue problem.

9.1 Direct method


Let us first approach this problem in the way we would solve it by hand: We rewrite it as

(A − λI)⃗x = 0 (9.2)

where I is a unit matrix. We see that in this form this is a linear homogeneous equation.
For this equation to have a nontrivial solution we should require |A − λI| = 0. Expansion of
the determinant leads to the polynomial equation, also known as the characteristic equation

a0 + a1 λ + a2 λ2 + · · · + an λn = 0 (9.3)

which has the roots λi , i = 1, 2, . . . , n called the eigenvalues of the matrix A. For n not too
large we can solve this equation numerically. Let’s consider an example
 
$$A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 4 & -2 \\ 0 & -2 & 2 \end{pmatrix} \qquad (9.4)$$

We can evaluate the function (9.3) numerically:


A = array([[1,-1,0],[-1,4,-2],[0,-2,2]])
def detl(la):   # define a function which should be zero
    return linalg.det(A - la*identity(3))

and then find its roots. To do that we first estimate roughly their positions with a plot:

xs = arange(-1, 6, 1/10)
ys = [detl(x) for x in xs]
plt.plot(xs, ys)
plt.grid()   # to add a grid
plt.show()

we see that the roots are around 0, 1 and 5. Then we can use the Newton method to find
zeros with good precision:
import scipy.optimize as sc
[sc.newton(detl, 0),
 sc.newton(detl, 1),
 sc.newton(detl, 5)]
[0.28129030960808521, 1.3160308608707258, 5.4026788295211823]

This method is easy to implement; however, it requires knowledge of approximate starting points
for the root finding, and it will not work for degenerate eigenvalues as well as it does for the
present case. Finally, this method will be hard to use for large matrices (with many eigenvalues).

9.2 Jacobi Method


A method which will work with minimal input from outside is the Jacobi Method. We start
from the equation
A⃗x = λ⃗x . (9.5)
Since we are interested in finding the λ's only, we can take any non-singular matrix P and
substitute ⃗x = P ⃗x∗:
$$P^{-1} A P\, \vec x^{\,*} = \lambda\, \vec x^{\,*} \qquad (9.6)$$

which is again an equivalent eigenvalue problem for another matrix P −1 AP = A∗ . Ideally
we would like A∗ to be diagonal – then the problem is essentially solved, as we just read
off the λ's from its diagonal elements. However, finding such a transformation P in one go is
complicated. We will do it step by step, annihilating the off-diagonal elements one by one.
At each step we will transform A with a particular matrix R of the form:
 
$$R = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & c & 0 & 0 & s & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & -s & 0 & 0 & c & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (9.7)$$

with c = cos θ and s = sin θ for some angle θ. Here $R_{kk} = R_{ll} = c$ and $R_{kl} = -R_{lk} = s$. This
is called the Jacobi rotation matrix. Note that its inverse is given by $R^T$, and as a result $A^* = R^T A R$
differs from A only in the k-th and l-th rows and columns (here an example for k = 3, l = 6):
$$A^*_{ij} = A_{ij}\,, \qquad A^*_{i3} = A^*_{3i} = c A_{i3} - s A_{i6}\,, \qquad A^*_{i6} = A^*_{6i} = s A_{i3} + c A_{i6} \qquad (i, j \neq 3, 6)\,,$$
$$A^*_{33} = c^2 A_{33} + s^2 A_{66} - 2cs\, A_{36}\,, \qquad A^*_{66} = s^2 A_{33} + c^2 A_{66} + 2cs\, A_{36}\,,$$
$$A^*_{36} = A^*_{63} = cs\,(A_{33} - A_{66}) + (c^2 - s^2)\, A_{36}\,. \qquad (9.8)$$

(You are advised to derive these formulas yourself by multiplying the matrices; consider separately the elements $A^*_{ij}$ with $i = k$ or $i \neq k$ and/or $j = l$ or $j \neq l$.) The
idea is to choose the value of θ such that the element $A^*_{kl} = 0$. This gives:

$$(c^2 - s^2)\, A_{kl} + cs\,(A_{kk} - A_{ll}) = 0\,. \qquad (9.9)$$

Note that we do not need to know the angle itself. Instead we need c and s, which are related by $s = \sqrt{1 - c^2}$. So we write
$$(2c^2 - 1)\, A_{kl} + c\sqrt{1 - c^2}\,(A_{kk} - A_{ll}) = 0 \qquad (9.10)$$

from where we get
$$s^2 = \frac{1}{2}\left(1 - \sqrt{\frac{1}{1 + \frac{4 A_{lk}^2}{(A_{kk} - A_{ll})^2}}}\,\right), \qquad c^2 = \frac{1}{2}\left(1 + \sqrt{\frac{1}{1 + \frac{4 A_{lk}^2}{(A_{kk} - A_{ll})^2}}}\,\right), \qquad (9.11)$$

where the sign is chosen so that s = 0 and c = 1 for $A_{lk} = 0$, i.e. so that R becomes almost
an identity matrix for small off-diagonal elements. In fact, these formulae for s, c still need to
be made more precise, as we need to take a square root. For c we take the positive square
root, to get c = 1 for $A_{lk} = 0$, but for s we need to take just the right branch. This branch
is determined by solving for s in (9.9) after using $s^2 = 1 - c^2$, so we get
$$s = \frac{(1 - 2c^2)\, A_{kl}}{c\,(A_{kk} - A_{ll})}\,. \qquad (9.12)$$

With all this, we now have

$$A^*_{pq} = A_{pq}\,, \qquad p, q \neq l \text{ or } k \qquad (9.13)$$
$$A^*_{pk} = c A_{pk} - s A_{pl}\,, \qquad p \neq l \text{ or } k \qquad (9.14)$$
$$A^*_{pl} = c A_{pl} + s A_{pk}\,, \qquad p \neq l \text{ or } k \qquad (9.15)$$
$$A^*_{ll} = c^2 A_{ll} + 2sc\, A_{lk} + s^2 A_{kk} \qquad (9.16)$$
$$A^*_{kk} = c^2 A_{kk} - 2sc\, A_{lk} + s^2 A_{ll}\,. \qquad (9.17)$$

After that we apply this transformation iteratively selecting k and l such that Akl is the
biggest off-diagonal element. See implementation for the details.

9.2.1 Python Implementation of the Jacobi Method


First we use (9.11) to find c and s:
# computes s, c for a given matrix A
def getcs(l, k, A):
    phi = -(2*A[k,l])/(A[k,k] - A[l,l])
    c = sqrt((1 + sqrt(1/(1 + phi**2)))/2)
    s = (1 - 2*c**2)*A[k,l]/(A[k,k] - A[l,l])/c
    return c, s

The next function computes A∗ (denoted by B) for given A and k with l:


# computes A after the transformation
def remkl(l, k, A):
    c, s = getcs(l, k, A)
    n = A[0].size
    B = A.copy()
    for i in range(n):
        B[k,i] = c*A[k,i] - s*A[l,i]
        B[i,k] = B[k,i]
        B[l,i] = c*A[l,i] + s*A[k,i]
        B[i,l] = B[l,i]
    B[k,l] = 0
    B[l,k] = 0
    B[k,k] = c**2*A[k,k] + s**2*A[l,l] - 2*c*s*A[k,l]
    B[l,l] = c**2*A[l,l] + s**2*A[k,k] + 2*c*s*A[k,l]
    return B

Finally, we have to specify which k and l to use at each step. For that we scan the upper
part of the matrix, finding the biggest element:
# finds the location of the biggest element above the main diagonal
def findmax(A):
    imax = 0
    jmax = 1
    maxval = abs(A[0,1])
    n = (A[0]).size
    for i in range(n):
        for j in range(i+1, n):
            if abs(A[i,j]) > maxval:
                imax = i
                jmax = j
                maxval = abs(A[i,j])
    return imax, jmax, maxval

Combining these functions together we get the following diagonalization procedure


def diagonalize(A):
    B = A.copy()
    maxc = 1
    while maxc > 10**(-6):
        i, j, maxc = findmax(B)
        B = remkl(i, j, B)
        plt.imshow(B, interpolation='nearest')
        plt.colorbar()
        plt.show()
    return B

To exemplify the usage we can build a random symmetric matrix:


A = random.random((4,4))
A = A + transpose(A)
diagonalize(A)
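As a sanity check (our own addition, not part of the notes' code), the diagonal of the result can be compared with numpy's built-in routine for symmetric matrices:

B = diagonalize(A)
print(sort(diag(B)))         # eigenvalues from the Jacobi iteration
print(linalg.eigvalsh(A))    # reference values from numpy, for comparison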

Figure 9.1: Iterations of diagonalization of some random matrix with Jacobi Method

Chapter 10
Optimisation problem

The problem of optimisation is the problem of finding the minimum (or maximum) of a function
F(⃗x) of several parameters ⃗x. Depending on the number of parameters and the smoothness of
the function F, the problem can be complicated and sometimes very slow, and one should
adopt a suitable method to achieve the best and fastest result. Normally, one finds only a
local minimum, which is another common problem of the existing methods.
We will describe a few methods applicable in different situations: one applicable when there is only a
single parameter; one applicable when the function is smooth; and, as an example of what
one can do in the situation when the function F is not smooth, the Down-Hill
Simplex method.

10.1 Golden Section Search


This method is applicable in the case when we have only one parameter so that F : R → R.
In a sense the golden section search is similar to the bisection method used to find zeros of
a function.
Given an interval [a, b], we evaluate our function F at two points inside, $x_1$ and $x_2$, which
are chosen in a rather special way, namely:
$$x_1 = a + \alpha(b - a)\,, \qquad x_2 = b - \alpha(b - a)$$

where 0 < α < 0.5 is a constant to be determined below.


Next we compare F (x1 ) to F (x2 ). If F (x1 ) < F (x2 ) then we denote y0 = x1 and define
b′ = x2 , a′ = a (otherwise we define a′ = x1 , b′ = b and pick y0 = x2 ). Then we again define
new points inside the interval, denoted x′1 and x′2 :

x′1 = a′ + α(b′ − a′ ) , x′2 = b′ − α(b′ − a′ )

Now we choose the value of α in such a way that one of x′i coincides with xj in the previous
step: x′1 = x2 or x′2 = x1 (the only one that makes sense). This will guarantee that as we go
through the iterations, the new yi ’s we get will always be such that the function, at these
points, gets smaller - so we should approach a minimum.
For instance let us assume that F (x1 ) ≥ F (x2 ) so that a′ = a + α(b − a). In this case we
need x′1 = x2 , so that

x2 = b − α(b − a) = x′1 = a′ + α(b′ − a′ ) = [a + α(b − a)] + α(b − [a + α(b − a)])

Figure 10.1: First several iterations of the Golden Section Search

which simplifies to the following equation:
$$(\alpha^2 - 3\alpha + 1)(a - b) = 0\,.$$
As this should work for any a, b, the solution is given by
$$\alpha = \frac{3 - \sqrt{5}}{2}\,.$$
We repeat this procedure several times, at each step decreasing the interval and generating
new values $y_n$. It is clear that at each step we have $F(y_n) \le F(y_{n-1})$, i.e. the sequence $F(y_n)$ is
decreasing, and also the sequence $y_n$ is convergent (since the size of the interval is decreasing),
which guarantees that the limit $\lim_{n\to\infty} F(y_n)$ gives at least a local minimum of the function $F(x)$.
Note that the convergence is linear. Indeed, at each step the interval shrinks by the factor
$1 - \alpha = \frac{\sqrt{5} - 1}{2} \simeq 0.618$, i.e. after n steps the length of the interval becomes
$$\epsilon_n = (b - a)\left(\frac{\sqrt{5} - 1}{2}\right)^{\!n}$$
which is a linearly convergent sequence.


This method guarantees finding a local minimum. Even though it is a one dimensional
method one can still use it in the case when the number of parameters is bigger than 1.
Namely, one can first optimize w.r.t. to the first variable, then w.r.t. to the second and so
on. This procedure has to be repeated many times and may eventually converge to a local
minimum in the multi-dimensional space. But the convergence could be very slow.
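The section above gives the full recipe but no code, so here is a minimal sketch of our own (the name golden and the tolerance tol are our choices; a more careful implementation would also reuse the function value at the point carried over between iterations):

from math import sqrt

def golden(F, a, b, tol=1e-8):
    alpha = (3 - sqrt(5))/2
    x1 = a + alpha*(b - a)
    x2 = b - alpha*(b - a)
    while b - a > tol:
        if F(x1) < F(x2):          # keep [a, x2]; the old x1 becomes the new x2
            b, x2 = x2, x1
            x1 = a + alpha*(b - a)
        else:                      # keep [x1, b]; the old x2 becomes the new x1
            a, x1 = x1, x2
            x2 = b - alpha*(b - a)
    return (a + b)/2

print(golden(lambda x: (x - 2)**2 + 1, 0, 5))   # should return roughly 2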

10.2 Powell’s Method


Now we discuss a method for optimization in n-dimensions which works well for smooth
functions. The method is essentially a generalization of the Newton method.

First consider the following simplified problem. Let’s assume that our function F (⃗x) has
the following form:
$$F(\vec x) = c + \sum_i b_i x_i + \frac{1}{2}\sum_{i,j} A_{ij}\, x_i x_j \qquad (10.1)$$
where $A^T = A$ can be assumed to be symmetric.
If the function is differentiable, we can use the fact that at the minimum of F we should have
$$\frac{\partial F}{\partial x_i} = 0$$
which becomes
$$0 = b_i + \sum_j A_{ij}\, x_j\,,$$
or, in matrix notation,
$$A\vec x = -\vec b\,,$$
which we solve as
$$\vec x = -A^{-1}\vec b\,,$$
which is an explicit expression for the values of the parameters giving the minimum of the
function F (⃗x).
Now, in general (10.1) does not hold, but we can use Taylor expansion to approximate
$$F(\vec x) \simeq F(\vec x_0) + \sum_i \partial_i F(\vec x_0)\,(x_i - x_{0,i}) + \frac{1}{2}\sum_{i,j} \partial_i\partial_j F(\vec x_0)\,(x_i - x_{0,i})(x_j - x_{0,j})\,. \qquad (10.2)$$

Dropping higher terms in the Taylor expansion we again have the same as in (10.1) which
we are ready to optimise:
$$\vec x - \vec x_0 = -A^{-1}\vec b \qquad (10.3)$$
where
$$A_{ij} = \partial_i\partial_j F(\vec x_0)\,, \qquad b_i = \partial_i F(\vec x_0)\,. \qquad (10.4)$$
This will give us some value of ⃗x, which is not an exact minimum but rather an approximation,
which we can use as a starting point for the next iteration, exactly as in the Newton method.
In this way we get the following recursion:
$$\vec x_n = \vec x_{n-1} - A^{-1}\vec b\,, \qquad A_{ij} = \partial_i\partial_j F(\vec x_{n-1})\,, \qquad b_i = \partial_i F(\vec x_{n-1})\,. \qquad (10.5)$$
The above is of course just a special case of the problem of finding the simultaneous zero
of, say, m + 1 functions:
$$f_i(\vec x) = 0\,, \qquad \vec x = (x_0, x_1, \dots, x_m)\,, \qquad i = 0, 1, \dots, m\,. \qquad (10.6)$$
Indeed, by Taylor series expansion,
$$f_i(\vec x) = f_i(\vec x_0) + \sum_j \partial_j f_i(\vec x_0)\,(x_j - x_{0,j}) + \dots \qquad (10.7)$$
and solving for $x_j$ the equation that asks the right-hand side to be zero for every i, calling
the solution $\vec x_1$, this gives
$$\vec x_1 = \vec x_0 - A^{-1}\vec b\,, \qquad A_{ij} = \partial_j f_i(\vec x_0)\,, \qquad b_i = f_i(\vec x_0)\,. \qquad (10.8)$$
The recursion relation is obtained by replacing ⃗x0 → ⃗xn and ⃗x1 → ⃗xn+1 . This is the multi-
dimensional version of the Newton method.
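No code is given in the notes for this method, so here is a sketch of ours implementing the recursion (10.5), with the gradient and Hessian approximated by central finite differences; the names newton_min, grad and hess are our own:

from numpy import array, zeros, linalg

def grad(F, x, h=1e-5):
    g = zeros(x.size)
    for i in range(x.size):
        e = zeros(x.size); e[i] = h
        g[i] = (F(x + e) - F(x - e))/(2*h)
    return g

def hess(F, x, h=1e-4):
    H = zeros((x.size, x.size))
    for i in range(x.size):
        ei = zeros(x.size); ei[i] = h
        for j in range(x.size):
            ej = zeros(x.size); ej[j] = h
            H[i,j] = (F(x+ei+ej) - F(x+ei-ej) - F(x-ei+ej) + F(x-ei-ej))/(4*h**2)
    return H

def newton_min(F, x0, steps=20):
    x = array(x0, dtype=float)
    for n in range(steps):
        x = x - linalg.solve(hess(F, x), grad(F, x))   # the recursion (10.5)
    return x

print(newton_min(lambda x: (x[0]-1)**2 + 2*(x[1]+3)**2, [0.0, 0.0]))  # -> about [1, -3]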

10.3 Down-Hill method
This multidimensional optimisation method is applicable in the case when the function is
not even differentiable, or cannot be efficiently approximated by a polynomial as in the
previous section. For simplicity we will consider the 2d case only, but the generalization is
straightforward.

Figure 10.2: Scalar function of two variables. This exemplifies a function which cannot be easily approximated by a quadratic polynomial.

The starting point of the method is a set of 3 points which form a triangle. The point with the maximal value of F is denoted Hi and the one with the minimal value is denoted Lo.
Each subsequent step is based solely on the values of the function at 3 points. Given the 3 points, we have to generate a new set of 3 points, trying to make the values smaller at each step.
There are 4 moves in the algorithm: Reflection, Expansion, Contraction and Shrinkage. The first two moves can be called “exploration” moves, as they let us probe the function outside the area surrounded by the triangle. The other two are intended to corner the minimum of the function by making the triangle smaller and smaller at each of these moves.

Reflection Reflect the triangle with respect to the side opposite to the Hi vertex.

Expansion Similar to the reflection, but it also stretches the triangle further away by a factor of 2 (see Fig. 10.3).

Contraction Opposite to the Expansion. We move the Hi vertex down towards the opposite side, decreasing the area of the triangle by a factor of 2 (see Fig. 10.3).

Shrinkage Shrinks the triangle towards the Lo vertex.


The algorithm works as follows. Let ⃗x1, ⃗x2 and ⃗x3 be the current vertices of the triangle.

1. Order the points ⃗x1, ⃗x2 and ⃗x3 so that f(⃗x1) ≤ f(⃗x2) ≤ f(⃗x3), and denote by ⃗x0 = (⃗x1 + ⃗x2)/2 the middle point of the side opposite to the worst vertex ⃗x3.

Figure 10.3: Elementary moves of the Down Hill method.

2. Try the reflection ⃗xr = ⃗x0 + (⃗x0 − ⃗x3).

3. If f(⃗x1) ≤ f(⃗xr) < f(⃗x2), then replace ⃗x3 by ⃗xr and go to 1).

4. If ⃗xr is the best point so far, i.e. f(⃗xr) < f(⃗x1), then do the expansion ⃗xe = ⃗x0 + 2(⃗x0 − ⃗x3). Then use the better of ⃗xe and ⃗xr to replace the worst point ⃗x3 and go to 1).

5. If the reflection is not successful (i.e. we reached this point instead of going to 1), which means that f(⃗xr) ≥ f(⃗x2)), compute the contracted point ⃗xc = ⃗x0 − (1/2)(⃗x0 − ⃗x3). If f(⃗xc) < f(⃗x3), then replace ⃗x3 with ⃗xc and go to 1).

6. If none of the above works, shrink the triangle: ⃗xi = ⃗x1 + (1/2)(⃗xi − ⃗x1) for i = 2, 3, and go to 1).

10.3.1 Python implementation of the Down-Hill method


def onestep(xs):
    # 1 relabel the points so that f(xs[0]) <= f(xs[1]) <= f(xs[2])
    xs = sorted(xs, key=lambda x: f(x[0], x[1]))
    fs = [f(x[0], x[1]) for x in xs]
    xo = (xs[0] + xs[1])/2            # middle point of the side opposite the worst vertex
    # 2 try reflection
    xr = xo + (xo - xs[2])
    fr = f(xr[0], xr[1])
    # 3 accept the reflection
    if fr < fs[1] and fs[0] <= fr:
        xs[2] = xr
        print('reflection')
        return xs
    # 4 expansion
    else:
        if fr < fs[0]:
            xe = xo + 2*(xo - xs[2])  # extended point
            fe = f(xe[0], xe[1])
            if fe < fr:
                xs[2] = xe
                print('expansion')
                return xs
            else:
                xs[2] = xr
                print('reflection')
                return xs
        # 5 contraction
        else:
            xc = xo - 0.5*(xo - xs[2])
            fc = f(xc[0], xc[1])
            if fc < fs[2]:
                xs[2] = xc
                print('contraction')
                return xs
    # 6 shrinkage (reached only if the contraction was not accepted)
    for i in range(1, 3):
        xs[i] = xs[i] + 0.5*(xs[i] - xs[0])
    print('shrink')
    return xs

Usage
from numpy import array

def f(x, y):    # the function we optimize (the Rosenbrock function)
    return (1 - x)**2 + 100*(y - x**2)**2

# initial triangle
xs = [array([1., 2.]), array([2., 3.]), array([3., -4.])]
# repeat the step several times
for i in range(80):
    xs = onestep(xs)
# print the result and the value of the function at the minimum
print(xs[0], f(xs[0][0], xs[0][1]))

[ 1.00000001  1.00000002] 5.14504919216e-16
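The function above is the well-known Rosenbrock function, whose minimum lies at (1, 1). As a cross-check (our own addition, assuming SciPy is available), one can compare with SciPy's built-in downhill-simplex (Nelder–Mead) routine:

from scipy.optimize import minimize

res = minimize(lambda v: (1 - v[0])**2 + 100*(v[1] - v[0]**2)**2,
               x0=[1.0, 2.0], method='Nelder-Mead')
print(res.x, res.fun)    # should land near [1, 1] with a tiny function value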

See the transformation of the triangle in the figure below.

Figure 10.4: Iterations of the Down-Hill method. We begin with a single green triangle, and are in search of the minimum (red dot). We then reflect (green triangle) and reflect again, before contracting. We continue in this way as we get closer and closer to the target. Eventually the triangles get so close that it is hard to resolve them anymore!

Chapter 11
Solving Linear Systems of Algebraic Equations

[This chapter is optional. If time permits we will cover it. The mathemati-
cal method is something that you have learned before, this is an algorithmic
implementation of it.]
Linear systems of algebraic equations play a central role in numerical methods. Many problems can be reduced to, or approximated by, a large system of linear equations. Essentially linear problems arise for structures, elastic solids, heat flow, electromagnetic fields, etc. That is why it is very important to know how to solve these equations efficiently, quickly and without a large loss of precision.
In general a system of algebraic equations has the form

    A11 x1 + A12 x2 + · · · + A1n xn = b1
    A21 x1 + A22 x2 + · · · + A2n xn = b2    (11.1)
        ⋮
    An1 x1 + An2 x2 + · · · + Ann xn = bn

where the unknowns are the xi. We will use the standard matrix notation

    Ax = b .    (11.2)
A common notation for the system is the augmented matrix

            ( A11 . . . A1n | b1 )
    [A|b] = (  ⋮         ⋮  |  ⋮ )    (11.3)
            ( An1 . . . Ann | bn )
This system of linear equations has a unique solution if and only if det A ̸= 0. If det A = 0 then, depending on b, one has either infinitely many solutions or none. As we are going to solve the system approximately, small determinants can also cause problems. We will discuss some possible issues below.
The 3 basic steps which we can use to simplify the equations are:
• Exchanging two equations
• Multiplying an equation by a nonzero constant
• Adding/subtracting two equations
In performing the above steps we do not lose any information.
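These operations are easy to carry out directly on the rows of a NumPy array. The following small sketch (using an arbitrary 2×2 system of our own choosing, not from the notes) illustrates that they do not change the solution:

import numpy as np

A = np.array([[2., 1.], [1., 3.]])
b = np.array([5., 10.])
print(np.linalg.solve(A, b))                    # original solution

A[[0, 1]] = A[[1, 0]]; b[[0, 1]] = b[[1, 0]]    # exchange two equations
A[0] *= 4.0;           b[0] *= 4.0              # multiply an equation by a nonzero constant
A[1] -= A[0];          b[1] -= b[0]             # subtract one equation from another
print(np.linalg.solve(A, b))                    # the same solution as before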

11.1 Gaussian Elimination Method
The Gaussian elimination method consists of two parts: the elimination phase and the back substitution phase. We will demonstrate them first by an example. Consider the system

−2x1 + x2 + x3 = 3
x1 + 5x2 + 4x3 = 23 (11.4)
5x1 + 4x2 + x3 = 16

Elimination phase. In the elimination phase we multiply one equation (equation j) by a constant λ and subtract it from another equation (equation i). This is very intuitive, but since we will be dealing with Python, we need to formalise it a bit. After the operation, equation i gets replaced by Eq.(i) → Eq.(i) − λ Eq.(j), whereas equation j remains unchanged. In this construction the equation being subtracted, i.e. Eq.(j), is called the pivot equation.
First we use the first equation of (11.4) as the pivot equation and tune λ to get rid of x1 in the second and the third equations:

    −2x1 + x2 + x3 = 3
    (x1 + 5x2 + 4x3) + (1/2)(−2x1 + x2 + x3) = 23 + (1/2) · 3    (11.5)
    (5x1 + 4x2 + x3) + (5/2)(−2x1 + x2 + x3) = 16 + (5/2) · 3
or

    −2x1 + x2 + x3 = 3
    (11/2) x2 + (9/2) x3 = 49/2    (11.6)
    (13/2) x2 + (7/2) x3 = 47/2
Next we use the second equation as the pivot equation to eliminate x2 from the last equation:

    −2x1 + x2 + x3 = 3
    (11/2) x2 + (9/2) x3 = 49/2    (11.7)
    ((13/2) x2 + (7/2) x3) − (13/11)((11/2) x2 + (9/2) x3) = 47/2 − (13/11)(49/2)

and at the end of the elimination phase we get the following upper triangular system of equations:

    −2x1 + x2 + x3 = 3
    (11/2) x2 + (9/2) x3 = 49/2    (11.8)
    −(20/11) x3 = −60/11

Back substitution phase. Now we can find the unknowns one by one, starting from x3. The last equation gives x3 = 3; after that we use the second equation to get x2 = 2, and the first one to get x1 = 1.
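As a small illustration of the back substitution phase, here is a sketch (our own, not the notes' code; the full implementation below uses an equivalent column-update form) that solves the upper triangular system (11.8) directly:

import numpy as np

# the upper triangular system (11.8)
U = np.array([[-2.,  1.,    1.   ],
              [ 0., 11/2,   9/2  ],
              [ 0.,  0.,  -20/11 ]])
c = np.array([3., 49/2, -60/11])

x = np.zeros(3)
for j in range(2, -1, -1):                          # work upwards from the last equation
    x[j] = (c[j] - U[j, j+1:] @ x[j+1:]) / U[j, j]
print(x)                                            # [1. 2. 3.]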

11.1.1 Python implementation of Gaussian Elimination Method
from numpy import zeros

def gauss(A, b):
    n = len(b)
    # Elimination phase
    for j in range(0, n-1):              # iterate over the pivot equation
        for i in range(j+1, n):
            lam = A[i, j]/A[j, j]
            A[i] = A[i] - lam*A[j]
            b[i] = b[i] - lam*b[j]
    # Back substitution phase
    x = zeros(n)
    for j in range(n-1, -1, -1):
        x[j] = b[j]/A[j, j]
        b = b - x[j]*A[:, j]
    return x
To check that it works we apply it to the above example:
from numpy import array
A = array([[-2., 1., 1.], [1., 5., 4.], [5., 4., 1.]])
b = array([3., 23., 16.])
print(gauss(A, b))    # expected: [1. 2. 3.]
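Since gauss modifies A and b in place, a cross-check against NumPy's built-in solver (our own addition, not part of the notes) should be run on fresh copies of the arrays:

import numpy as np

A = np.array([[-2., 1., 1.], [1., 5., 4.], [5., 4., 1.]])
b = np.array([3., 23., 16.])
print(np.linalg.solve(A, b))    # [1. 2. 3.], matching x1 = 1, x2 = 2, x3 = 3 found above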

11.2 Pivoting
The Gaussian method discussed above has one obvious problem: at some point the diagonal element A[j, j] could turn out to be zero. In this case we will divide by zero and get an error. Consider an example
−x2 + x3 = 0 (11.9)
2x1 − x2 = 1 (11.10)
−x1 + 2x2 − x3 = 0 (11.11)
the corresponding augmented coefficient matrix is
 
            (  0  −1   1 |  0 )
    [A|b] = (  2  −1   0 |  1 )  .    (11.12)
            ( −1   2  −1 |  0 )
We see that we are stuck at the very first step of the elimination. At the same time there
will be no problem if we swap the first and the last equations:
 
            ( −1   2  −1 |  0 )
    [A|b] = (  2  −1   0 |  1 )  .    (11.13)
            (  0  −1   1 |  0 )
In fact it would be even better to reorder the equations so that at each step of the elimination the pivot equation has the largest (in absolute value) coefficient in the pivot column among the remaining rows.
We will add two functions to the code above

def findbiggest(row):
    return row.tolist().index(max(row, key=abs))

This function returns the index of the biggest (by absolute value) element of the row. To swap the i-th and j-th rows of A and b simultaneously we can use
def swap(A, b, i, j):
    Atmp = array(A[i])    # copy row i before it gets overwritten
    A[i] = A[j]
    A[j] = Atmp
    btmp = b[i]
    b[i] = b[j]
    b[j] = btmp
    return A, b
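A quick sanity check of these two helpers (the numbers are arbitrary examples of our own):

from numpy import array

row = array([0., 2., -5.])
print(findbiggest(row))            # 2, since -5 has the largest absolute value
A = array([[1., 2.], [3., 4.]])
b = array([5., 6.])
A, b = swap(A, b, 0, 1)
print(A, b)                        # rows of A and entries of b exchanged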

Having these functions, the modification of gauss is simple:

from numpy import array, zeros

def gaussWithPivoting(A, b):
    n = len(b)
    # Elimination phase
    for j in range(0, n-1):
        A, b = swap(A, b, j + findbiggest(A[j:, j]), j)    # <-- new line: row pivoting
        for i in range(j+1, n):
            lam = A[i, j]/A[j, j]
            A[i] = A[i] - lam*A[j]
            b[i] = b[i] - lam*b[j]
    # Back substitution phase
    x = zeros(n)
    for j in range(n-1, -1, -1):
        x[j] = b[j]/A[j, j]
        b = b - x[j]*A[:, j]
    return x
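As a quick test (our own, not from the notes), one can apply it to the system (11.9)–(11.11), where plain gauss would divide by zero at the very first step; the expected solution x1 = x2 = x3 = 1 is easy to verify by substitution:

from numpy import array

A = array([[ 0., -1.,  1.],
           [ 2., -1.,  0.],
           [-1.,  2., -1.]])
b = array([0., 1., 0.])
print(gaussWithPivoting(A, b))    # expected: [1. 1. 1.]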

