Numerical Methods
Computational Methods
with Python
This course covers numerical methods for solving various
mathematical problems and introduces the basics of pro-
gramming. The methods are demonstrated using Python.
A short introduction to Python is part of the course. No
knowledge of Python is expected.
Notes by
Profs. N. Gromov, B. Doyon
Drs. N. Nüsken and S. Pougkakiotis
1 Introduction 5
1.1 What are Numerical Methods? . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 How to start writing Python . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 iPython/Jupyter Interface . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 First Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Introduction to Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 The First Cell in an iPython Notebook . . . . . . . . . . . . . . . . . 16
1.4 Floating Point Arithmetic and Numerical Errors . . . . . . . . . . . . . . . . 16
1.4.1 Real Numbers in a Computer . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Round-off Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Arithmetic Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Python Code for Polynomial LS Fit . . . . . . . . . . . . . . . . . . . 46
3.3 Interpolation Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Error of the Polynomial Interpolation . . . . . . . . . . . . . . . . . . 48
3.3.2 Estimating the Error on a Mesh: The Order of the Error . . . . . . . 49
3.3.3 Optimising the Error Bound: Chebyshev nodes . . . . . . . . . . . . 50
4 Numerical Differentiation 52
4.1 Finite Difference Approximations . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Error Estimation: Round-off (Machine Precision) and Approximation Errors 53
4.3 Python experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Richardson Extrapolation: An Example . . . . . . . . . . . . . . . . 57
5 Numerical Integration 58
5.1 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.1 Approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.2 Roundoff error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.3 Python implementation for the Trapezoidal Rule . . . . . . . . . . . 61
5.2 Midpoint rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Simpson’s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Error estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Gaussian Integration Method . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Python implementation of Gaussian Integration . . . . . . . . . . . . 67
5.5 Monte-Carlo Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.1 Estimating Statistical Error . . . . . . . . . . . . . . . . . . . . . . . 70
10 Optimisation problem 96
10.1 Golden Section Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.2 Powell’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.3 Down-Hill method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.3.1 Python implementation of the Down-Hill method . . . . . . . . . . . 100
which is, of course, π![2] Amazingly, the methods we will discuss will work equally well for much more complicated functions without any modifications. Due to this power, numerical
[1] ‘Numerical’ methods usually produce approximate solutions. ‘Analytical’ methods use algebraic manipulation to arrive at an exact answer (which could be arbitrarily complicated and thus may not be useful in practice).
[2] Formally, one may have doubts as to whether this is really π, as we can only see the first 30 digits. The 31st digit could indeed deviate from π. However, the probability of that happening (roughly $10^{-30}$) is much less than the probability of a huge asteroid hitting Earth in the next half an hour and destroying the planet, so in many situations it could be acceptable to assume that it is indeed π. There are many other theories which are more likely than the theory that the number on the r.h.s. of (1.1) is not π. For example, it is also much more likely that we all live in a computer simulation (in which case we should not make a big distinction between exact numbers and approximate numerical values, as all computer calculations have some small numerical error, as we discuss in this course too).
methods are crucial to physics, biology, chemistry, and also mathematics and theoretical
physics!
Numerical methods are not limited to solving integration problems. In this course we
will cover numerical methods for: linear equations; eigenvalue problems; nonlinear equations;
differential equations; optimization problems; and stochastic differential equations.
https://jupyter.org/try-jupyter/lab/
Nonetheless, if you are interested in delving (independently of this module) more deeply into
Python, you could install it on your laptop. The installation is quite straightforward. To
avoid compatibility problems it is recommended to use the anaconda3 distribution:
https://repo.continuum.io/archive/
or directly at
https://www.anaconda.com/products/individual
Make sure you download anaconda3 (as opposed to anaconda2). The difference is that
anaconda3 comes with Python 3, which we will use for this course. In the repository (first
link above), the name of the file indicates the operating system; e.g., for 64-bit Windows you need to download
Anaconda3-2021.05-Windows-x86_64.exe
and
Anaconda3-2021.05-MacOSX-x86_64.pkg
for macOS. After installing anaconda3, make sure you also include the main packages as described below.
Packages we need
If you decide to work with a version of Python installed locally on your computer, please
note that you need to independently install packages that we will be using in this module
(all these are automatically installed in JupyterLite). Specifically, we will make use of the
following packages:
numpy, scipy, matplotlib, ipython, notebook
The step-by-step installation process (which is very straightforward) is described below.
For Windows, open the “command prompt”[3] (or, even better, the “anaconda prompt”, which is installed if you completed the anaconda installation) and run the following commands one by one (answer yes to all questions):
conda install matplotlib
conda install ipython
conda install notebook
conda install numpy
conda install scipy
conda install mpmath
If the conda command does not work, it probably means that you have forgotten to “Add anaconda to PATH” during the installation. Install it again with the correct settings.
To check that ipython installed correctly, run from the “command prompt”:
jupyter notebook
Nonetheless, please note that during the class test, you will be asked to use JupyterLite.
Make sure that you are comfortable using it (it is really easy!).
and then evaluate the cell by pressing Shift+Enter (or Ctrl+Enter). As a result we get 4 as expected!
[3] You may need to set your current directory to be inside the anaconda installation directory.
1.3 Introduction to Python
1.3.1 Basic Operations
Assignment and Comparison. If you are familiar with the basics of programming you can skip this section. As in any programming language, our main friends are variables, whose names can consist of one or several Latin letters and can also contain digits (but not at the beginning of the name):
a
xXx
variable26
The variables could be assigned various values which can change afterwards. To assign a
value to a variable we can use =. For example, we can assign the value 13 to the variable a
using
a=13
Note that the statement above means “set variable a to the value 13”. It should not be
confused with the mathematical equation a = 13. In particular, the statement
13=a
will result in an error: it does not make sense, since “13” is not a variable and cannot be assigned a value. What is more similar to the pure-math a = 13 is
a==13
True
(with two equal symbols in a row). The == is a comparison operation which returns True when the value of the l.h.s. and that of the r.h.s. are the same. One can equally well swap the r.h.s. with the l.h.s. for this operation:
13==a
True
More advanced examples of the assignment operation are statements containing a nontrivial expression on the r.h.s.:
a = 13     # set 13 as the value of a
a = a + 1  # first a+1 will be computed, giving 14,
           # and then the assignment operation will be
           # applied, setting a to 14
To check the current value of a variable one can use the print function:
a=13 # a becomes 13
a=a+10 # a becomes 23
print(a)   # print the current value of a
23
Note that when using iPython you don’t need to write print, you can simply type
a
23
Control Flow. Comparison operations are especially useful when it comes to control flow
(i.e., controlling which command of a Python program is going to be executed next). A very
useful tool for control flow is the if statement, which determines whether a command will
be executed, depending on some appropriate condition.
x = 10
if x > 5:
print ("x is greater than 5")
x is greater than 5
Note that the indentation after the if statement is important (i.e., the commands belonging to the if statement must be indented consistently, for example by a tab or a fixed number of spaces), since it signifies which commands have to be executed if the associated condition is true. In general, we can create an if−elif−else statement, as shown in the following example:
x = 5
if x > 5:
print ("x is greater than 5")
elif x == 5:
print ("x is equal to 5")
else:
print ("x is smaller than 5")
x is equal to 5
On an unrelated note, observe that we have used a string within the print command. A string is a series of characters enclosed within quotation marks (single or double).
7/2
3.5
Lists. Lists can be used to store any kind of data, but they should not be confused with
vectors and matrices (for this, use np.array supporting the corresponding mathematical
operations; see below!). The built-in list type works as follows:
a = [1, 2, 3, 4]
a[0]   # gives the first element (Python indexing starts at 0)
a[1]   # gives the second element
a[-1]  # gives the last element
a[2:]  # part of the list from the 3rd element to the end
a[:]   # returns an exact copy of the list
1
2
4
[3, 4]
[1, 2, 3, 4]
Note that a list is not really a vector. In particular, multiplication by an integer results in
a longer list:
a = [1, 2, 3, 4]
a*3
[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
Note that list(range(10)) gives a list of all non-negative integers up to but not including 10. For the list of numbers from 5 to 10 we have to type
list(range (5 ,11))
[5, 6, 7, 8, 9, 10]
Note that, again, the last argument is not included in the range (but the first is). That is a typical source of confusion. Always remember that Python indexing starts at zero.
One can also specify the step when using range, so that all even numbers in the range
0 . . . 10 can be obtained by
list(range (0 ,11 ,2))
[0, 2, 4, 6, 8, 10]
As with the if statements, the second line in the above code must be indented. This is generally the way in which blocks of code are indicated in Python. Note that using range(4) we make the variable i attain consecutive integer values from 0 to 3.
You can also use range with two arguments
for i in range(4, 6):   # to run from 4 to 5
    print(i)
4
5
Remark
Loops are generally very important in the context of programming, and allow us to automate repeated operations. Nonetheless, in the context of linear algebra (whose operations are naturally expressed with loops), explicit Python loops are highly inefficient. Instead, for all linear algebra purposes we will be using a highly optimized (linear-algebra dedicated) package, called numpy, as discussed below.
Arrays, Vectors, Matrices. numpy arrays are much closer to the vectors we are used to; we can add them and/or multiply them by constants. To define an array we will have to use the numpy package:
import numpy as np   # loading the numpy package (needed only once)
a = np.array([1, 2, 3])
b = np.array([1, 0, 0])
For example:
a*0.5
a+b
[0.5 1. 1.5]
[2 2 3]
All the standard matrix operations can be performed very easily. For the matrix-matrix
product use np.dot:
a = np. array ([[1 , 0], [0, 1]])
b = np. array ([[4 , 1], [2, 2]])
np.dot(a,b)
[[4 1]
[2 2]]
Typing a*b when in fact you want matrix multiplication is a typical hard-to-find mistake!
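To see the pitfall concretely, here is a small check of our own (assuming numpy has been imported as np, as above):
a = np.array([[1, 0], [0, 1]])
b = np.array([[4, 1], [2, 2]])
print(a*b)           # elementwise product: [[4 0] [0 2]]
print(np.dot(a, b))  # matrix product:      [[4 1] [2 2]]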
To compute the trace of a matrix:
a = np. array ([[1 , 0], [4, 7]])
np. trace (a)
8
A = np. array ([[1 , 2], [3, 1]])
v = np. array ([1 ,4])
np.dot(A,v)
[9 7]
In general, you can get an exhaustive description of the numpy package by typing
import numpy as np
help(np)
Creating Grids. The functions np.arange and np.linspace are to np.arrays what
range is to lists. The difference between np.arange and np.linspace is that the former
allows you to specify the step size, while the latter allows you to specify the number of
steps:
np. arange (0 ,1 ,0.25)
([0. , 0.25 , 0.5 , 0.75])
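For comparison, here is a small example of np.linspace (our own snippet): it asks for the number of points rather than the step size, and by default it includes the right endpoint.
np.linspace(0, 1, 5)   # 5 equally spaced points, endpoint included

array([0.  , 0.25, 0.5 , 0.75, 1.  ])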
Elementary Functions and Constants. The numpy package also contains all the ele-
mentary functions such as sin, cos, exp, tan, etc.,
[np.sin(2), np.cos(1.5), np.abs(-1.2)]
[0.90929742682568171 , 0.070737201667702906 , 1.2]
User-defined Functions. It is typically useful to define our own functions. For example, let us define a function $f(x) = x^3 + 2x^2 + 1$. In Python this is done in the following way:
def f(x):
    return x**3 + 2*(x**2) + 1
After that we can use this function just like any other function:
f(1)
4
There are situations when we want to define a function without giving it a specific name. This happens for example when we want to use the function only once. In such cases it is convenient to use a so-called lambda statement. This allows us to define an anonymous function, i.e. a function that is defined without a name:
(lambda x: x**2)(3)
9
Making Plots. We will use the matplotlib.pyplot package for the plots. It has numer-
ous methods and we will only describe some of them. For example we can easily make a plot
of the sin function as shown below:
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 4*np.pi, 100)  # sampling points along x
ys = np.sin(xs)                    # function values at these points
plt.plot(xs, ys)                   # creates the plot
plt.show()                         # shows the plot
[Figure: plot of sin(x) on the interval [0, 4π] produced by the code above.]
However, the line ys = np.sin(xs) runs in around 1.77µs, while the code involving
the loop takes 173µs (on my computer, this depends a lot on the hardware, of course).
That is, it takes around 100 times longer! This is because numpy is an extremely
optimised package that can perform elementwise computations in parallel. As already
mentioned earlier in the notes, it is a good practice to avoid using loops whenever
possible. It is always preferable to call numpy routines when applicable. In this course,
such considerations will not play a significant role, as most examples can be executed
in very little time anyway. However, for more involved projects it is crucial to pay
attention to such details, as otherwise the code will be far too slow (in fact, the recent
developments in deep learning would hardly be possible without packages like numpy
tailored for ‘under the hood’ parallel computation).
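The loop-based code referred to above is not reproduced in this excerpt; a comparison along the following lines (our own sketch, using the standard timeit module) illustrates the point, with timings that will of course vary between machines:
import numpy as np
import timeit

xs = np.linspace(0, 4*np.pi, 100)

def sin_with_loop():
    ys = np.zeros(len(xs))
    for i in range(len(xs)):
        ys[i] = np.sin(xs[i])   # one call per element
    return ys

print(timeit.timeit(lambda: np.sin(xs), number=10000))  # vectorised version
print(timeit.timeit(sin_with_loop, number=10000))       # explicit loop, much slower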
Back to Plots. Plots can be modified for readability in many different ways:
ys1 = np.sin(xs)
ys2 = np.cos(xs)
plt.plot(xs, ys1, 'r')   # 'r' means red, solid line
plt.plot(xs, ys2, 'gx')  # 'gx' means green, no line, 'x' marker
plt.show()
[Figure: sin(x) drawn as a red solid line and cos(x) as green 'x' markers.]
Exercise 1 Define the function $f(x) = x^3\sin(x^2)$ and plot its values in the range (0, 10). At
first you will get a plot which is a bit ugly. To make it look nicer you will have to increase the
number of points by changing xs = np.linspace(0,10,1000). Use both types of functions:
created with def and with a lambda-statement.
Getting Help. A very important feature, which one can use even during the class test, is
the “?” symbol before the name of the function:
?range
Docstring :
range (stop) −> range object
range (start , stop[, step ]) −> range object
method itself, by construction, can only give an approximation, although sometimes the
approximation it can give is very precise. We’ll discuss more at length the approximations
due to the numerical method and how to estimate the numerical errors they cause for the
various algorithms we’ll study. But another source of numerical error is the finite memory
of the machine itself, combined with the way it keeps in memory the numbers it calculates.
In general, there are many sources of error in numerical computations, such as:
• Round-off errors, occurring due to finite memory in computers;
• Arithmetic errors, occurring due to manipulating inexact numbers;
• Approximation errors, occurring due to the utilization of approximations (such as those
obtained via the Taylor expansion);
• Algorithmic errors, occurring due to the utilization of iterative methods for the solution
of mathematical problems;
• Stochastic errors, occurring due to random noise or measurement errors.
In the rest of this section, we will get a deeper understanding of round-off and arithmetic
errors, by discussing how numbers are stored and manipulated within a computer, by giving
some examples in Python, as well as providing some warnings on things to avoid when writing
code for numerical computations. In general, this module will mostly focus on approximation
and algorithmic errors, although we will briefly discuss (and see some examples of) the effects
of round-off, arithmetic or stochastic errors.
Computer Number Representation. Computers use the binary system to store num-
bers. A bit is one digit in the binary system. Most modern computers use number repre-
sentations from the IEEE 754-2008 standard arithmetic. Here we will discuss double precision numbers (64 bits; also known as float64 in Python). A double precision 64-bit IEEE number x is stored in the computer in the following form:
$$x = (-1)^s \times 1.f \times 2^{e-1023}, \qquad (1.2)$$
where 1.f is in binary form, as is 1023, i.e., $1023 = (1111111111)_2$. In (1.2),
• 1 bit is used for the sign; s = 0 or 1 (0 for +, 1 for −);
• 11 bits are used for the non-negative exponent e (all zeros and all ones are reserved; to be discussed later);
• 52 bits are used for the mantissa f.
Definition: Floating point numbers
All numbers of the form of (1.2) are called floating point (or machine) numbers.
Theorem 1.4.1 (Range of machine numbers) There is a finite (closed) range for the
machine numbers x, namely,
realmin ≤ |x| ≤ realmax.
Proof To prove this result, we need to find the largest and the smallest floating point
numbers enabled by the representation in (1.2).
• What is the largest floating point number, say realmax?
To answer this question, we first note that the largest value of the exponent e is
$$(\underbrace{11111111110}_{11\text{ bits}})_2 = 2046.$$
• What is the smallest positive floating point number, say realmin?
In this case, the smallest positive value for the mantissa f is $(\underbrace{00\cdots0}_{52\text{ bits}})_2$ and the smallest positive value for the exponent e is $(\underbrace{00000000001}_{11\text{ bits}})_2$ (note that $e = (00000000000)_2 = 0$ and $f = 0$ is a special case used to represent zero).
Thus any machine number x is such that realmin ≤ |x| ≤ realmax. If the result of a computation is:
• larger than realmax, overflow occurs and the computer usually returns inf;
• smaller than realmin, underflow occurs and the computer usually returns zero.
⋆ NaN (i.e., Not-a-Number) is used to describe things like inf/inf, inf − inf, etc. in the machine.
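These limits can be inspected directly in Python; the following snippet is our own illustration using numpy (the printed values correspond to the 64-bit format described above):
import numpy as np

info = np.finfo(np.float64)
print(info.max)      # realmax, approximately 1.8e308
print(info.tiny)     # realmin (smallest positive normalized number), approximately 2.2e-308
print(info.eps)      # machine epsilon, 2**-52 (discussed below)
print(info.max * 2)  # overflows: numpy warns and returns inf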
The previous (normalized) IEEE number system fails to represent the following num-
bers:
• the value 0;
We mentioned earlier that the IEEE 754-2008 standard arithmetic reserves the exponents of all zeros (i.e., $(00000000000)_2$) and of all ones. If an exponent of all zeros appears, then the computer assumes a denormalized representation,
$$x = (-1)^s \times 0.f \times 2^{-1022}.$$
This allows us to represent numbers smaller (in absolute value) than realmin. If, in this case, f contains only zeros, the computer considers this number to be 0 (in fact, there are two different zeros; if the sign is positive, we call this a positive zero, and a negative zero if the sign is negative).
If an exponent of all ones appears and f contains only zeros, then we obtain $(-1)^s \times \infty$. If, in this case, the mantissa f contains some nonzero bits, we consider this to be NaN.
negative_infinity = float('-inf')
print(negative_infinity)   # Output: -inf

if math.isinf(value):      # 'value' is defined in the part of this example not shown here
    print("Infinite value!")

inf
-inf
Infinite value!
nan

1.7976931348623157e+308
inf
RuntimeWarning: overflow encountered   (raised when evaluating print(realmax*2))
$$x = (1.f)_2.$$
The smallest and largest floating point numbers in this case are x = 1 and
$$x = (1.\underbrace{11\cdots1}_{52\text{ bits}})_2 = 2 - 2^{-52},$$
respectively, i.e., the largest floating point number in this case is very close to 2. Consider the interval [1, 2).
• How many different floating point numbers exist in the interval [1, 2)?
Since each of the 52 bits in the mantissa f can be 0 or 1, the answer is $2^{52}$.
• What is the distance between two successive floating point numbers in [1, 2)?
The $2^{52}$ numbers are equidistributed on the interval [1, 2) (which has a length equal to 1). Thus the distance between two successive floating point numbers in [1, 2) is $2^{-52}$.
Definition: Machine epsilon
The number $2^{-52}$, which is the distance between two successive floating point numbers in [1, 2), is called the machine epsilon and is denoted by eps.
Note: eps is the smallest floating point number for which 1 + eps is a floating point num-
ber not equal to 1. To see this in practice, in the following Python code, we change the
appearance of the output using format(), while working with the machine epsilon.
import numpy as np

machine_epsilon = np.finfo(float).eps   # equal to 2**-52 (this definition is implicit in the notes)

# Alternative way of printing using format specifiers
print(f"1:           {1:.52f}")
print(f"1 + eps / 2: {1 + machine_epsilon / 2:.52f}")
print(f"1 + eps:     {1 + machine_epsilon:.52f}")

1:           1.0000000000000000000000000000000000000000000000000000
1 + eps / 2: 1.0000000000000000000000000000000000000000000000000000
1 + eps:     1.0000000000000002220446049250313080847263336181640625
You can see that 1 + eps/2 is identical to 1, as far as Python is concerned. Any number that
is not exactly representable as in (1.2), is rounded to the closest number that is. This is
called a round-off error. The previously presented Python code shows that 1 + eps/2 cannot
be represented exactly, and thus we incur the maximum possible (relative) round-off error,
that is eps/2. Let us briefly explain why that is.
If we take positive floating point numbers with exponent e = 1024, i.e., of the form
$$x = (1.f)_2 \times 2,$$
then the smallest and largest floating point numbers in this case are 2 and $(2 - 2^{-52}) \times 2$, respectively. There are again $2^{52}$ different floating point numbers equidistributed in the interval [2, 4), and now the distance between two successive numbers is $\text{eps} \times 2 = 2^{-52} \times 2 = 2^{-51} > \text{eps}$. We can repeat the procedure for any positive floating point number, and in particular for any exponent e, 0 ≤ e ≤ 2047; the distance between two successive numbers in the interval $[2^{e-1023}, 2^{e-1022})$ is $\text{eps} \times 2^{e-1023} = 2^{-52} \times 2^{e-1023} < \text{eps}$ if e − 1023 < 0, and $\text{eps} \times 2^{e-1023} = 2^{-52} \times 2^{e-1023} > \text{eps}$ if e − 1023 > 0. A sketch showing the form of the distribution of the machine numbers with a 3-bit mantissa in [1/8, 8] is displayed in Figure 1.1.
As expected, we can see that (positive) floating point numbers are dense close to zero and
significantly sparser as we consider intervals of large positive numbers.
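One can observe this uneven spacing directly with numpy's np.spacing, which returns the distance from a given number to the next floating point number (our own illustration):
import numpy as np

print(np.spacing(1.0))     # 2**-52, the gap just above 1
print(np.spacing(2.0))     # twice as large
print(np.spacing(1.0e6))   # much larger gap for large numbers
print(np.spacing(1.0e-6))  # much smaller gap close to zero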
Figure 1.1: Distribution of machine numbers with a 3-bit mantissa in [1/8, 8].
Given a number to be stored in the computer, it is rounded to its closest machine number. If this number happens to be in the interval [1, 2), then the maximum absolute error that occurs is $\frac{\text{eps}}{2}$. Similarly, we can find the maximum absolute error that occurs when a number is stored in the computer, for every binary interval of the form $[2^{e-1023}, 2^{e-1022})$, to be $\frac{\text{eps}\cdot 2^{e-1023}}{2}$. Does this mean that the round-off errors become worse when storing large numbers and better when storing positive numbers close to zero? Not quite! We do not care about absolute errors. Instead, we care about relative errors. Let us distinguish the two.
Let fl(x) be the floating point representation of x. Then the absolute and relative (round-off) errors for x are given by
$$|x - fl(x)| \quad \text{and} \quad \frac{|x - fl(x)|}{|x|},$$
respectively.
Theorem 1.4.2 Let x ∈ [realmin, realmax] and let fl(x) be the floating point representation of x obtained by rounding. Then
$$\frac{|x - fl(x)|}{|x|} \le \frac{\text{eps}}{2}. \qquad (1.3)$$
In fact, $\frac{|x - fl(x)|}{|x|} \le \frac{\text{eps}}{2}$ is equivalent to $fl(x) = x(1+\varepsilon)$ for some $\varepsilon \in \left[-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}\right]$.
Proof We will not prove this result, but it is not hard to show!
Note: From the previous theorem we observe that no matter how large a number is (as-
suming it is within the allowed bounds), the worst-case relative round-off error is always the
same!
To illustrate the importance of round-off errors in practice, let us consider the following
two comparisons:
1 + 2 == 3
True
0.1 + 0.2 == 0.3
False
The above example illustrates that many floating point numbers cannot be exactly represented in the form given in (1.2), and thus we are forced to incur round-off errors when working with such numbers. An extremely important take-away from this example is that you should never compare real numbers in Python using logical operators like == (for instance in an if-then statement). Instead, we must resort to checking whether two numbers are sufficiently close to each other, as shown in the following example:
import math
math. isclose (0.1+0.2 ,0.3)
True
Example Let x, y ∈ [realmin, realmax]. What is the worst-case relative error of the addition
on the computer?
Solution: Let fl(x) and fl(y) be the floating point representations of x and y, respectively. Then
$$fl(x) = x(1+\varepsilon_1) \ \text{ for some } \varepsilon_1 \in \left[-\tfrac{\text{eps}}{2}, \tfrac{\text{eps}}{2}\right], \qquad fl(y) = y(1+\varepsilon_2) \ \text{ for some } \varepsilon_2 \in \left[-\tfrac{\text{eps}}{2}, \tfrac{\text{eps}}{2}\right]. \qquad (1.4)$$
Thus, $fl(fl(x) + fl(y))$ is the floating point representation of x + y. Therefore
$$fl(fl(x) + fl(y)) = (fl(x) + fl(y))(1+\varepsilon_3) \ \text{ for some } \varepsilon_3 \in \left[-\tfrac{\text{eps}}{2}, \tfrac{\text{eps}}{2}\right].$$
Using (1.4) and by doing some basic computations on the above relation we get:
Thus we can estimate the relative error as follows:
$$\frac{|fl(fl(x)+fl(y)) - (x+y)|}{|x+y|} = \frac{|(\varepsilon_1+\varepsilon_3+\varepsilon_3\varepsilon_1)\,x + (\varepsilon_2+\varepsilon_3+\varepsilon_3\varepsilon_2)\,y|}{|x+y|} \le \frac{x\,|\varepsilon_1+\varepsilon_3+\varepsilon_3\varepsilon_1| + y\,|\varepsilon_2+\varepsilon_3+\varepsilon_3\varepsilon_2|}{x+y} \le \frac{x\,(|\varepsilon_1|+|\varepsilon_3|+|\varepsilon_3||\varepsilon_1|) + y\,(|\varepsilon_2|+|\varepsilon_3|+|\varepsilon_3||\varepsilon_2|)}{x+y},$$
where in the last two inequalities we have used the triangle inequality and the fact that both x and y are positive. Therefore
$$\frac{|fl(fl(x)+fl(y)) - (x+y)|}{|x+y|} \le \frac{x\left(\frac{\text{eps}}{2}+\frac{\text{eps}}{2}+\frac{\text{eps}^2}{4}\right) + y\left(\frac{\text{eps}}{2}+\frac{\text{eps}}{2}+\frac{\text{eps}^2}{4}\right)}{x+y} = \frac{(x+y)\left(\frac{\text{eps}}{2}+\frac{\text{eps}}{2}+\frac{\text{eps}^2}{4}\right)}{x+y},$$
from where we conclude that
$$\frac{|fl(fl(x)+fl(y)) - (x+y)|}{|x+y|} \le \text{eps} + \frac{\text{eps}^2}{4}.$$
From the above example, we can immediately see that basic algebraic properties are not
expected to hold.
Example Let x = 1001, y = −1 and z = 1/1000. The distributive property gives
z(x + y) = zx + zy. (1.5)
Set w = z(x + y), w1 = zx, and w2 = zy. Assume that the arithmetic calculations for the
computation of w, w1 , and w2 are performed exactly. Then, assume that we store w, w1 , w2 , and
we perform the addition w1 + w2 in the computer. Will (1.5) hold exactly?
Solution: Since w = z(x + y) = 1 and 1 is a machine number we have
f l(w) = f l(1) = 1.
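A quick way to experiment with this example in Python (our own check; the variable names are ours) is to evaluate both sides of (1.5) directly:
x, y, z = 1001.0, -1.0, 1/1000

lhs = z*(x + y)    # in exact arithmetic this equals 1
rhs = z*x + z*y    # in exact arithmetic this also equals 1
print(lhs, rhs, lhs == rhs)   # due to round-off, the two sides may differ in the last bits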
We saw that addition of positive reals in the computer results in arithmetic errors. What
about subtraction or multiplication? We can show the following results (focusing on posi-
tive numbers, although the results can be extended to negative numbers as well): Assume
that x, y ∈ [realmin, realmax] and for any number w, f l(w) represents the floating point
representation of it. Then
• As we already showed, the worst-case relative error for addition is $\text{eps} + \text{eps}^2/4$ (note that since eps is small, the dominating term is eps).
From the above results, we observe that addition and multiplication arithmetic errors are both of the order of machine precision, i.e. O(eps). However, the worst-case relative error for subtraction depends on the particular numbers we are subtracting! This is extremely important and should be kept in mind when subtracting numbers that are close to each other. Indeed, if x ≈ y, then 1/|x − y| can have a very large magnitude, and then the arithmetic error of subtraction can be detrimental. This effect has a special name: it is called catastrophic cancellation. Let us illustrate the potential problems it leads to in Python:
import math

x = 1e16
res = math.sqrt(x + 1) - math.sqrt(x)
print(f"Result: {res:.52f}")

Result: 0.0000000000000000000000000000000000000000000000000000
$$x - y = \frac{x^2 - y^2}{x + y} = \frac{x^3 - y^3}{x^2 + xy + y^2} = \cdots = \frac{x^n - y^n}{x^{n-1} + x^{n-2}y + \cdots + xy^{n-2} + y^{n-1}}.$$
We expect that the above different ways of writing x − y will produce stable formulas. E.g., if x = 1.1 and y = 1.05 then x − y = 0.05, while $x^{100} \approx 1.3781 \times 10^4$ and $y^{100} \approx 1.315 \times 10^2$, so $x^{100}$ and $y^{100}$ are no longer close to each other.
• On the other hand, it is often possible to use mathematical approximations to avoid
using subtraction altogether. For example, say we want to compute exp(x) − 1, for
x ≈ 0. This would lead to catastrophic cancellation, and the previous trick would also
not work. Instead, we can use the Taylor expansion of exp and write:
$$\exp(x) - 1 = x + \frac{x^2}{2} + \frac{x^3}{3!} + \frac{x^4}{4!} + \ldots,$$
noting that the more Taylor terms we use, the better the approximation we obtain.
That way, we have avoided using subtraction!
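As a sketch of this idea (our own example; math.expm1 is the standard library routine designed for exactly this purpose), compare the naive computation with the Taylor-based one for a small argument:
import math

x = 1e-12
naive = math.exp(x) - 1        # suffers from catastrophic cancellation
taylor = x + x**2/2 + x**3/6   # first few Taylor terms, no subtraction involved
print(naive)
print(taylor)
print(math.expm1(x))           # accurate reference value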
Let us see how the first of these tools would work in practice, to improve the loss of accuracy we observed in the last Python code. We have observed that
$$f(x) = \sqrt{x+1} - \sqrt{x} = \frac{(\sqrt{x+1} - \sqrt{x})(\sqrt{x+1} + \sqrt{x})}{\sqrt{x+1} + \sqrt{x}} \equiv \frac{1}{\sqrt{x+1} + \sqrt{x}}.$$
import math

x = 1e16
res = math.sqrt(x + 1) - math.sqrt(x)
alt_res = 1 / (math.sqrt(x + 1) + math.sqrt(x))
|x − x∗ | ≤ δ(x∗ ).
Assume that f is a continuously differentiable function and that we are interested in finding
an upper bound for the absolute error |f (x) − f (x∗ )|. From the Mean Value Theorem we
get:
f (x) − f (x∗ ) = f ′ (ξ)(x − x∗ ) with ξ ∈ [min{x, x∗ }, max{x, x∗ }].
Let $D = \max_{t\in[\min\{x,x^*\},\,\max\{x,x^*\}]} |f'(t)|$; then
$$|f(x) - f(x^*)| \le D\,|x - x^*| \le D\,\delta(x^*).$$
Thus Dδ(x∗ ) is an upper bound for |f (x) − f (x∗ )|. In many cases this bound gives pretty
good estimates for the error for f (x). However, the above estimates can be crude, especially
in cases where |f ′ | varies a lot between x and x∗ .
1.4.4 Summary
Now that we have given a short overview of sources of errors in numerical computations, it
is important to mention some notation that will be used throughout this module to specify
the “order of magnitude” of given errors.
Given two positive functions T : (0, ∞) → (0, ∞) and f : (0, ∞) → (0, ∞) we say that
T (x) = O(f (x)) if and only if there exist constants c > 0 and some x0 > 0 such that
T (x) ≤ cf (x), ∀ x ≥ x0 .
The Big-O notation is very useful, because it compares two functions by only considering
their dominating terms. For example, let eps be a small number (e.g., representing the
machine precision). We saw that the worst-case relative error from addition in a computer
is $T(\text{eps}) = \text{eps} + \text{eps}^2/4$. Let $f(\text{eps}) = \text{eps}$. Then, based on the Big-O definition, and given that eps is assumed to be small (≪ 1), we can easily show that $T(\text{eps}) = O(\text{eps})$.
It is important to keep this notation in mind, since it will be very useful when quantifying
errors of various numerical computations, and will be useful in comparing the accuracy of
different numerical methods (in the grand scheme of things).
In general, not all algorithms suffer significantly from proliferation of machine errors;
sometimes errors stay small throughout the execution of an algorithm. However, sometimes
they do become important, and constitute the main limitation to how precise our numerical
result can be. In every calculation that we do, it is important to keep in mind machine errors
that can be introduced, even though they often turn out to be much less important than
errors due to approximations made by the numerical method itself.
Practical take-aways:
2. There are several forms of numerical errors (e.g., round-off, arithmetic, algorithmic, or
stochastic), each of which might require different treatment.
3. Be careful when subtracting numbers close to each other (i.e., beware of catastrophic
cancellation).
4. In cases where round-off or arithmetic errors can be very problematic, we can go beyond the standard 64-bit floating point accuracy of Python. We will later see the packages mpmath and decimal that provide better accuracy.
Chapter 2
Solution of a Single Non-linear Equation
In this chapter we study the Newton method for the solution of a single non-linear equation.
A general non-linear equation reads as
f (x) = 0, (2.1)
for some function f . For the Newton method we assume that this function is differentiable
and the derivative f ′ is known (later we discuss how to build various approximations for the
derivative numerically, but in this chapter we assume that the derivative is known analyti-
cally).
If the approximation (2.2) was exact (i.e., if $\hat{f}(x) \equiv f(x)$), then $x_0 - \frac{f(x_0)}{f'(x_0)}$ would give us the exact solution of the equation (2.1). However, $\hat{f}(x)$ is only an approximation of the nonlinear function f(x), and so $x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$ is only an approximate zero of f(x). Under favourable conditions, $x_1$ will be closer to the true solution of our equation (2.1), in comparison to the starting point $x_0$. Then we can use $x_1$ as a starting point for the next iteration to get an even better approximation $x_2$, and so on, until we reach a satisfactory approximate solution. In other words, we will generate the series of numbers $x_n$ defined by
$$x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})}, \qquad (2.4)$$
assuming that $f'(x_{n-1}) \neq 0$ each time, (hopefully) approaching closer and closer to the true solution of the equation (2.1) (see the figure below).
When the sequence of numbers xn is convergent, it converges very fast; the number of
exact digits doubles on each iteration (see the table below).
When the number of exact digits doubles on each iteration, then the algorithm is said to be
quadratically convergent (below we discuss more precisely the notion of rate of convergence
of an algorithm).
One should note that we are only guaranteed to get a convergent series if the starting
point x0 is already relatively close to the exact solution. One could be particularly unlucky
by picking the starting point such that f ′ (x0 ) is close to zero, in which case x1 will be huge
and the series may not converge. There is always an element of art to choosing a good
starting point x0 . Depending on that choice, you can either get the result in a few steps
with enormous precision or find a useless divergent series.
def f(y):
    return y**3/3 - 2*y**2 + y - 4
Next, we build the sequence $x_n$ (stored in the list type of Python). We begin from a trivial sequence, containing only the starting point, xs = [x0]. Then we append new points to this list, and we repeat up to a certain number of steps (in this case, 10).
def newton(f, df, x0):
    xs = [x0]
    for i in range(10):
        xs.append(xs[-1] - f(xs[-1])/df(xs[-1]))
    return xs[-1]
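The derivative and the actual call are not reproduced above; a minimal completion consistent with the text would be (df below is the exact derivative of f, and the call itself is only indicative):
def df(y):
    return y**2 - 4*y + 1   # derivative of y**3/3 - 2*y**2 + y - 4

newton(f, df, 8)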
where we’ve chosen the starting point to be 8. Try it out with other starting points, and
other numbers of steps.
Exercise 2 It is also a good exercise to try it with a different condition for the stopping point: instead of a fixed number of steps, stop the loop at the step n for which $|f(x_n)| < 10^{-10}$ (thus when we are indeed very near to a zero).
We can associate 3 colours to each of these 3 possibilities. Then, the picture of the fractal is
obtained in the following way: for each point of the complex plane x0 we build the series xn
and if that is convergent to one of the roots we paint this point in the corresponding colour.
[1] This section is not examinable, but it is recommended to study it. It’s also very beautiful!
To make the picture more beautiful one can also change the intensity of the color depending
on how many iterations are needed in order to reach some fixed precision. The picture
obtained in this way is called the Newton fractal.
The code below creates this picture:
# create an empty image
from PIL import Image
sizex = 700; sizey = 700
image = Image.new("RGB", (sizex, sizey))
# drawing area
xmin = 0.0; xmax = 6.0
ymin = -3.0; ymax = 3.0
# there are 3 possible roots
roots = [complex(5.83821, 0),
         complex(0.080896, 1.43139),
         complex(0.080896, -1.43139)]
# draw the fractal
for i in range(sizey):
    zy = i * (ymax - ymin) / (sizey - 1) + ymin
    for j in range(sizex):
        zx = j * (xmax - xmin) / (sizex - 1) + xmin
        z = complex(zx, zy)
        n = 0
        while abs(f(z)) > 1/10**10:
            n = n + 1
            z = z - f(z) / df(z)
        # choose the color depending on the number of iterations n
        # and on the root
        if abs(z - roots[0]) < 1/10**5:
            image.putpixel((j, i), (255 - n % 32 * 8, 0, 0))
        elif abs(z - roots[1]) < 1/10**5:
            image.putpixel((j, i), (0, 255 - n % 32 * 8, 0))
        elif abs(z - roots[2]) < 1/10**5:
            image.putpixel((j, i), (0, 0, 255 - n % 32 * 8))
# saving result to a file
image.save("fractal.png", "PNG")
$$\lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|^q} = \mu$$
for some finite nonzero rate of convergence 0 < µ < ∞. Moreover, if the order of convergence is q = 1, then the rate must be such that µ ∈ (0, 1).
The above definition quantifies the speed of convergence of a sequence (i.e., the larger q is, the faster the convergence). A sequence with convergence order q = 1 is called linearly convergent, for q = 2 quadratically convergent, and for q = 3 cubically convergent, and so on.
To give some intuition, we see that if
$$x_k = x^* + A\exp(b\,q^k)$$
Remark: Order of convergence q = 1
As mentioned in the definition, the case where q = 1 does not guarantee that the
sequence xk converges to x∗ linearly, unless 0 < µ < 1. The condition on µ in this case
is simple to see: assume, for instance, that x∗ = 0 and xk > 0 for all k’s (for example,
take $x_k = 1/2^k$). Then, if the order of convergence is q = 1, we have
$$\lim_{k\to\infty}\frac{x_{k+1}}{x_k} = \mu \;\Rightarrow\; x_k \sim A\mu^k \ \text{as } k\to\infty, \ \text{for some } A$$
(for the implication: just solve the recursion $x_{k+1}/x_k = \mu$ by taking the product $\prod_{k=a}^{K}$
on both sides, for some fixed a). Obviously, this converges if 0 < µ < 1, and diverges
if µ > 1. Note, from this analysis, that “linear convergence” means that the speed at which the sequence approaches the point of convergence is exponentially fast. For example, the sequence $x_k = 2^{-k}$ is linearly convergent towards 0.
where on the right-hand side the error introduced by truncating the Taylor series expansions is $O(\epsilon_{n-1}^3)$. We finally get
$$\epsilon_n = \frac{1}{2}\frac{f''(x^*)}{f'(x^*)}\,\epsilon_{n-1}^2 + O(\epsilon_{n-1}^3) \;\Rightarrow\; \lim_{n\to\infty}\frac{\epsilon_n}{\epsilon_{n-1}^2} = \frac{1}{2}\frac{f''(x^*)}{f'(x^*)} \triangleq \mu. \qquad (2.8)$$
If $f'(x^*) \neq 0$ and $f''(x^*) \neq 0$, this means that the method has a quadratic convergence rate. As mentioned earlier, in practice this means the following: if after n − 1 iterations we got the result with, e.g., an error $\epsilon_{n-1} \sim 10^{-5}$, then just by doing one extra iteration we obtain a numerical error $\epsilon_n \sim 10^{-10}$, which means that we double the number of exact digits at each iteration.
The Degenerate Case. In the above convergence analysis we had to assume that $f'(x^*) \neq 0$. As the derivative appears in the denominator of (2.8), our conclusion no longer applies in the case when $f'(x^*) = 0$; the method might still work and it might indeed produce a zero of the nonlinear equation, but its convergence analysis is different. For $f'(x^*) = 0$ and $f''(x^*) \neq 0$, instead of (2.7) we get
$$\epsilon_n \simeq \epsilon_{n-1} - \epsilon_{n-1}\,\frac{\tfrac{1}{2}f''(x^*)}{f''(x^*)},$$
or
$$\epsilon_n \simeq \tfrac{1}{2}\epsilon_{n-1} \;\Rightarrow\; \lim_{n\to\infty}\frac{\epsilon_n}{\epsilon_{n-1}} = \tfrac{1}{2} \triangleq \mu \in (0, 1).$$
This means it indeed still converges, but that the convergence is very slow in this case. On
each iteration we decrease the error ϵn by a factor of 1/2, i.e. we will need at least 4 iterations
just to get one more exact digit for the estimation of x∗ .
Exercise 3 Work out what happens with the convergence order and rate if f ′ (x∗ ) = f ′′ (x∗ ) =
0 and f ′′′ (x∗ ) ̸= 0, etc.
1. is a linear function in x
Applying the same logic as in the Newton method we find xn as an exact zero of this linear
approximation
$$\frac{x_n - x_{n-1}}{x_{n-2} - x_{n-1}}\,f(x_{n-2}) + \frac{x_n - x_{n-2}}{x_{n-1} - x_{n-2}}\,f(x_{n-1}) = 0,$$
which gives
$$x_n = \frac{x_{n-1}\,f(x_{n-2}) - x_{n-2}\,f(x_{n-1})}{f(x_{n-2}) - f(x_{n-1})}. \qquad (2.9)$$
One can notice that the convergence rate of this method is slightly slower than that of the Newton method. It required 11 iterations (vs. 8 for the Newton method) to get the result with 90 digits of accuracy, as we can see from the table below. One can even estimate that the precision increases roughly by a factor of 3/2 (from 12 digits to 20 digits) per iteration. For the Newton method the number of precise digits doubled.
2.3.1 Python Code for the Secant Method
Let us implement the secant method. We use the same function as before, i.e. $f(y) = y^3/3 - 2y^2 + y - 4$.
def f(y):
    return y**3/3 - 2*y**2 + y - 4
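The secant iteration itself is not reproduced in this excerpt; a minimal sketch implementing formula (2.9) (our own variable names, with a fixed number of steps as in the Newton code) could look as follows:
def secant(f, x0, x1):
    xs = [x0, x1]
    for i in range(10):
        xprev2, xprev1 = xs[-2], xs[-1]
        xs.append((xprev1*f(xprev2) - xprev2*f(xprev1)) / (f(xprev2) - f(xprev1)))
    return xs[-1]

secant(f, 7.0, 8.0)   # starting points chosen for illustration only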
Try this on your Python notebook, and see what happens when you change the values!
Exercise 4 Again, try it with a different condition for the stopping point: instead of a fixed number of steps, stop the loop at the step n for which $|f(x_n)| < 10^{-10}$.
Exercise 5 Run the secant method code given above, but allow it to run for 100 iterations.
What do you observe? Is there an issue, and if so, how can it be fixed?
Hint: To understand the problem, it might be useful to revise the floating point arithmetic
used within a computer (and thus Python as well).
Figure 2.1: Plot of the function f defined in (2.10)
Indeed, with both the Newton and Secant methods, if we start at $x_0 = 1$ we will get $x_1 = -1$, then $x_2 = 1$ again, and the iteration will continue to flip-flop forever.
Exercise 6 Check the above statement: calculate analytically the sequences obtained by using
the Newton and Secant methods starting at x0 = 1 for the above function.
The Bisection Method presented here will work even in this situation. Let us describe
the method: at each step we divide the interval in two by computing its midpoint, i.e.
c = (a+b)/2, and the value of the function at the midpoint f(c). Next we have 3 possibilities:
1. f(c) = 0 (or |f(c)| is below the required tolerance), in which case we have found the zero and we can stop;
2. f(c)f(a) < 0, then we can be sure that the zero is between a and c, in which case we repeat the procedure for the interval [a, c];
3. f(c)f(a) > 0 (which implies that f(c)f(b) < 0), and then the zero is hidden between c and b, in which case we repeat the procedure again for [c, b].
Note that at each iteration we reduce the search area for our zero by a factor of 2.
To use it we define some function:
def f(x):
    return x**2 - 2
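The bisection routine used here is not shown in this excerpt; a minimal sketch following the procedure described above (our own names, with a fixed number of halvings) is:
def bisection(f, a, b, nsteps=60):
    for i in range(nsteps):
        c = (a + b)/2
        if f(c) == 0:
            return c        # we hit the zero exactly
        if f(c)*f(a) < 0:
            b = c           # the zero lies in [a, c]
        else:
            a = c           # the zero lies in [c, b]
    return (a + b)/2

bisection(f, 0, 2)   # approximates sqrt(2), the zero of x**2 - 2 in [0, 2]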
We see that the convergence is very slow and requires lots of iterations to give decent preci-
sion.
Exercise 7 Derive analytically the convergence order and rate of the bisection method. More
precisely, establish an upper bound on how fast we can approach the true zero, by using your
knowledge of how the interval changes.
We can use this enormous precision to study experimentally the convergence rates of var-
ious algorithms. First we adjust the Newton method code to work with arbitrary precision.
Note that here and below, in order to use the high precision, we need to make sure that the variables used in the algorithm are high-precision numbers, which we do by using the mpf method from mpmath.
def newtonHP(f, df, x0, precision):
    n = 0
    x = [mp.mpf(x0)]
    while (abs(f(x[-1])) > 10**(-precision)) and (n < 100):
        # an alternative way of writing n = n+1
        n += 1
        x.append(x[-1] - f(x[-1])/df(x[-1]))
    return x
To verify the quadratic convergence rate of the Newton method we compute the number of exact digits as $-\log_{10}|x_n - x^*|$, divide it by $2^n$, and plot the result (see Section 2.2). Here we use the function $f(x) = x^2 - 3$ as a simple example, so that we know the exact solution $x^* = \sqrt{3}$. As a starting point, we choose $x_0 = 7.5$.
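The imports and function definitions used in the snippet below are not reproduced in this excerpt; for this example they would be along the following lines (the value of mp.dps is our choice and simply has to exceed the requested precision of 50 digits):
from mpmath import mp
import matplotlib.pyplot as plt

mp.dps = 60   # number of significant decimal digits used by mpmath

def f(x):
    return x**2 - 3

def df(x):
    return 2*x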
x0 = 7.5
result = newtonHP(f, df, x0, 50)

numberOfDigitsOverqSquare = []
ns = []
n = 0
for x in result:
    n += 1
    ns.append(n)
    numberOfDigitsOverqSquare.append(
        float(-mp.log(abs(x - 3**mp.mpf(0.5)))/mp.log(10))/2**n
    )

plt.plot(ns, numberOfDigitsOverqSquare, "rx",
         label="Precision Convergence")
plt.xlabel('Iteration Number')
plt.ylabel('Number of Digits of Precision / 2^n')
plt.title('Convergence rate of Newton\'s Method')
plt.legend()
plt.show()
From the associated figure, we see that $\lim_{n\to\infty} -\log_{10}|x_n - x^*|/q^n = \text{const}$ ($\approx 0.1$ here) with q = 2. This implies that the Newton method exhibits a local quadratic convergence rate in this case, confirming our analytical proof given in Section 2.2.1.
Repeating the same exercise for the secant method, we could confirm that in this case $q = \frac{1}{2}\left(1+\sqrt{5}\right) \approx 1.6180$. As expected, the secant method has a slower convergence rate compared to the Newton method. In this way, we are generally able to work out how powerful our algorithms are (albeit for specific examples), and subsequently estimate the number of iterations needed to reach some desired precision!
Exercise 8 Numerically estimate the order of convergence for the secant and bisection
method by adapting the above code. Check that you get q ≈ 1.6180 for the secant method,
and that your answer for the bisection method agrees with what you obtained in Exercise 7.
Chapter 3
Approximation of Functions by Polynomials
There are a number of reasons why we would want to approximate complicated functions
using simpler functions. Solving various equations numerically can give very good results
which are frequently not accessible using other methods (such as analytical exact solutions).
But in order to make it tractable for the computer, an exact equation is usually replaced by
a “discretised” equation: the functions involved are replaced by piecewise-simpler functions
(e.g., piecewise-constant, piecewise-linear, piecewise-polynomial, etc.), or, more generally, by
a simpler function within a family of functions, with a finite number of parameters to be
determined. This discretised or simplified equation is designed to give a good approximation
but is not equivalent to the initial problem (e.g., system of nonlinear equations). It is thus
of utmost importance to have a way of deriving such approximations, while also knowing
how precise these are.
Another general way of seeing this problem is the following: given a set of data points,
perhaps coming from experimental observations, we would like to find the closest “natural”,
or smooth enough, function that these data points might represent. We will see later how
discretisation works. But in this chapter, we will instead consider approximations by a given
simple family of functions: namely, polynomials.
Figure 3.1: Two types of the polynomial approximation. Left: interpolation when the
polynomial goes exactly through the data points (red dots). Right: fitting (noisy) data with
a polynomial.
The problem of polynomial approximation can be formulated in the following way: Given a set of data points, say
$$\begin{array}{ccccc} x_0 & x_1 & x_2 & \cdots & x_n \\ y_0 & y_1 & y_2 & \cdots & y_n \end{array},$$
find the best polynomial which describes this data. Note that there is always a unique
polynomial of degree n which goes through all these points (assuming that the xi ’s are
pairwise distinct). However, sometimes the data may contain significant measurement noise
and instead one needs to find a smooth function which goes between the points, as shown
on the right-hand side of Figure 3.1.
The interpolating polynomial is written as $P_n(x) = \sum_{i=0}^{n} y_i\,p_i(x)$, where $p_i(x)$ is the basic building block for the interpolation. The $p_i(x)$'s are fully fixed by the following conditions: they should each be polynomials of degree at most n, and they should satisfy $p_i(x_i) = 1$ and $p_i(x_j) = 0$, for all $j \neq i$. With these conditions, $P_n(x)$ is a polynomial of degree at most n, and satisfies $P_n(x_i) = y_i$, for all i.
It is not hard to show that
$$p_i(x) = \prod_{j=0,\ j\neq i}^{n} \frac{x - x_j}{x_i - x_j}. \qquad (3.1)$$
It is indeed easy to see that $p_i(x_i) = 1$ and $p_i(x_j) = 0$, for $j \neq i$, and that these are polynomials of degree n. For example, for $x_i = i$ those $p_i(x)$ are depicted below
def pbasis(x, i, xs):
    xi = xs[i]
    res = 1
    for xj in xs:
        if xj != xi:                   # xj not equal to xi
            res *= (x - xj)/(xi - xj)  # this is the same as
                                       # res = res*(x - xj)/(xi - xj)
    return res
This implements the multiplication of all factors in the product over j in (3.1) with a loop,
where the “result” res starts with value 1, and is multiplied by the correct factor for all
j. Then the Lagrange Interpolation is simply a sum over all points yi times the basis
polynomials. The function must have as input not only the independent variable x, but also
both data sets: the $x_i$'s and the $y_i$'s, stored in the lists xs and ys (note: len is a built-in function that gives the length of a list or array):
def lagrange(x, ys, xs):
    res = 0
    for i in range(len(ys)):
        res += ys[i]*pbasis(x, i, xs[:])
    return res
In order to check how it works we can reproduce the left graph in Figure 3.1 above by using
the pyplot library:
import numpy as np
import matplotlib . pyplot as plt
xdense = np. arange (0 ,7 ,0.1)
plt.plot(xs , ys , ’ro’)
plt.plot(xdense , lagrange (xdense ,ys ,xs))
plt.show ()
3.1.2 Limitations of Polynomial Interpolation
The polynomial interpolations usually work very well in the middle of the interpolation
interval, but could produce unwanted oscillations at the ends of the interval, as shown in the
graph below:
One of the ways to reduce the oscillations is to increase the number of points:
However, this also makes the interpolation slower and requires more resources to store the
information. A more efficient way is to choose points xi so that they are more dense at the
ends of the interval. One of the possibilities is to take points $x_i$, called Chebyshev points or nodes, such that
$$x_i = \frac{a+b}{2} - \frac{a-b}{2}\cos\!\left(\pi\,\frac{i+1/2}{n+1}\right), \qquad i = 0,\ldots,n,$$
for the interval [a, b]. These are the roots of the Chebyshev polynomial of the first kind (rescaled to the interval [a, b] instead of the conventional [−1, 1]). The important property of these points is that they are more concentrated near the edges of the interval. One can think of this as projecting values of the cos function taken at regular angles onto the x-axis. The mathematical origin of this choice is explained in Section 3.3. This formula already gives a much better result.
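In code, Chebyshev nodes on an interval [a, b] can be generated with a small helper such as the following (our own function, written directly from the formula above):
import numpy as np

def chebyshev_nodes(a, b, n):
    i = np.arange(n + 1)
    return (a + b)/2 - (a - b)/2*np.cos(np.pi*(i + 0.5)/(n + 1))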
3.2 Fitting Data
Fitting is another way of approximating a function from some data points. The curve of
best fit does not need to pass exactly through the data points and as such can be used to
smooth the data and clean it from potential errors/noise.
The fit function is taken to be of the form $f(x) = \sum_{i=0}^{m} a_i f_i(x)$, where $f_i(x)$ is some predetermined basis of m+1 linearly independent functions. An example of basis functions are the monomials $f_i(x) = x^i$, giving a polynomial fit. But other bases are possible; in fact any basis of linearly independent functions works.
Our next task is to find the “optimal” coefficients ai so that the fit function f (x) passes
as close as possible to the data points yi . Obviously, in the extreme case when m = n (where
n + 1 is the number of the points, labelled from 0 to n inclusively) we should reproduce an
exact interpolation; for instance, with the choice of monomials $f_i(x) = x^i$, this will reproduce
the polynomial interpolation from the previous section. In general, we assume m ≤ n and so
we want to get the closest possible approximation of the data by means of a smaller number
of parameters.
A useful measure for the quality of the approximation is given by the sum of squares of deviations
$$S = \sum_{i=0}^{n}\big(y_i - f(x_i)\big)^2 = \sum_{i=0}^{n}\Big(y_i - \sum_{j=0}^{m} a_j f_j(x_i)\Big)^2.$$
We have to find the values of the coefficients $a_j$ for which S is minimal, i.e. we have to require
$$\frac{\partial S}{\partial a_k} = 0, \qquad k = 0,\ldots,m.$$
More explicitly, the previous equation reads as
$$\frac{\partial S}{\partial a_k} = -2\sum_{i=0}^{n}\Big(y_i - \sum_{j=0}^{m} a_j f_j(x_i)\Big) f_k(x_i) = 0, \qquad k = 0,\ldots,m. \qquad (3.2)$$
By inspection, we see that this is a linear system of equations in the variables ai ; remember
that these are the variables we are trying to solve for. That is, we can write this system of
equations as
Aa = b .
The matrix A is obtained by inspection of (3.2): the matrix element Akj is obtained by
looking at the equation ∂S/∂ak = 0, and in it, the coefficient of aj . Similarly, the vector
b comes from the equation ∂S/∂ak = 0, by collecting the terms that do not depend on the
aj ’s. This gives
$$A_{jk} = \sum_{i=0}^{n} f_j(x_i)\,f_k(x_i), \qquad b_k = \sum_{i=0}^{n} f_k(x_i)\,y_i.$$
In the particular case of $f_j(x) = x^j$, we obtain
$$A_{kj} = \sum_{i=0}^{n} x_i^{j+k}, \qquad b_k = \sum_{i=0}^{n} x_i^{k}\,y_i.$$
import numpy as np

def findas(xs, ys, m):   # function signature inferred from the surrounding text
    # creating a matrix of zeros
    A = np.array([[0.]*(m+1)]*(m+1))
    # creating an empty vector
    b = np.array([0.]*(m+1))
    # filling in A and b with values
    for k in range(m+1):
        b[k] = np.sum(ys*xs**k)
        for i in range(m+1):
            A[k, i] = np.sum(xs**(k+i))
    coefs = np.linalg.solve(A, b)
    print(coefs)
    def fit(x):
        return sum(coefs[k]*(x**k) for k in range(len(coefs)))
    return fit
To check how it works we make a plot. An important remark is that the output of the function findas is not a number, nor a list, but itself a function! Indeed, within the function findas, there is a def fit(x): which defines a new function fit (the name "fit" is just a name used inside findas; its value is a function), and this new function returns the result of the polynomial fit. It is this function that is the return value of findas, in the line return fit. Thus, in the code given below, the variable ft (which is the result of calling findas) is itself a function: it is the function that represents the fit of our data. This is a convenient way of returning a fit: just return the function that fits.
import matplotlib . pyplot as plt
plt.plot(xs , ys , ’ro’)
plt.plot(xdense , fitlist )
plt.show ()
The above code should reproduce the right plot given in the beginning of the chapter.
Remark
Observe that the above code uses some interesting Python syntax: lambda x: ...
is a way of writing a function without having to attribute it to a symbol;
map(a,b) is a way of creating an ordered set of elements by mapping the
ordered set b (it can be a list for instance) via the function a; and list
transforms a set of elements into an actual list (which can be used for plot-
ting). In the above, list(map(lambda x: ft(x), xdense)) does the same as
[ft(x) for x in xdense].
Finally, let us note that the commands used to create the zero matrix A and the zero
vector b in the function findas can be written in several different ways. The most
common (and readable) way of doing this is by using the np.zeros command. Specifically, we could substitute the command A = np.array([[0.]*(m+1)]*(m+1)) by the command A = np.zeros((m+1, m+1)). Similarly, the command b = np.array([0.]*(m+1)) could equivalently be written as b = np.zeros(m+1).
Exercise 9 This is a simple exercise, but it is an important one, since it logically completes
a previous exercise. In Exercise 8, evaluate numerically the convergence orders, instead of
checking if a given q works. For this purpose, extract the data for y(k) = log(| log(|xk − x∗ |)|)
as a function of k, and make sure that you have enough values of k. Then do a linear, 2nd
order polynomial, and 3rd order polynomial fit of this data. Then try to fit only the data for
k large enough. Using such an analysis, deduce that the result is linear in k at large enough
k, and extract the value of q (the convergence order).
vanishes at A and B, there is at least one point inside [A, B] where its derivative vanishes.
We apply this theorem repeatedly: first we conclude that $q'(y)$ vanishes at N+1 points between $y = x_0, \ldots, x_N$ and x, then $q^{(2)}(y)$ vanishes at N points, and so on. Thus, we finally observe that $q^{(N+1)}(y)$ should vanish at some point $\zeta \in [a, b]$.
Noticing that the (N+1)-th derivative of q reads
$$q^{(N+1)}(y) = f^{(N+1)}(y) - (N+1)!\,r(x),$$
we get for $y = \zeta$
$$q^{(N+1)}(\zeta) = 0 = f^{(N+1)}(\zeta) - (N+1)!\,r(x),$$
from where we see that
$$r(x) = \frac{f^{(N+1)}(\zeta)}{(N+1)!}.$$
In practice, we will replace in this formula $|N_{N+1}(x)|$ by its maximum on the interval, $\max_{x\in[a,b]}|N_{N+1}(x)|$.
Example Estimate the interpolation error for the function $f(x) = \sqrt{x}$ for the nodes [100, 120, 140]. For that we use our error estimate (3.3):
$$|f(x) - P_N(x)| \le \frac{3}{8}\,\max_y \frac{1}{y^{5/2}}\;\frac{1}{3!}\,\max|(x - 100)(x - 120)(x - 140)|.$$
49
as the mesh size h is made smaller (assuming that we have enough data to fit), and we would
like to know in what way (in essence “how fast”) the error decreases to zero. This is called
the order of the error.
Suppose x ∈ [xk−1 , xk ]. Then,
That is, as h is made smaller, we see that |f (x) − PN (x)| ≤ chN +1 for some c > 0 (that
depends on the function f and a quantity depending on N that is less than or equal to 1).
In that case, we say that PN has an error of order hN +1 , or an error O(hN +1 ) (assuming,
of course, that the magnitude of the (N + 1)-st derivative of f is bounded by a constant
independent of N (or h), which is not always the case).
min max ΠN
i=0 (x − xi ) . (3.7)
x0 ,...,xN x∈[a,b]
NN +1 (x) ≡ ΠN
i=0 (x − xi ) = x
N +1
− r(x),
for some polynomial r ∈ PN (where PN is the space of all polynomials of degree N ), de-
pending on {xj }N
i=0 (this can be shown by using standard identities of polynomials). Thus,
problem (3.7) can equivalently be written as
In other words, we are looking for a polynomial of degree N yielding the best possible
approximation of xN +1 on [a, b] (in the sense described above).
It turns out (by using the so-called equioscillation theorem; this in non-examinable) that
a solution to the optimisation problem (3.8) arises from the Chebyshev polynomial of the first
kind. In our convention, these polynomials are defined by TN +1 (cos θ) = 2−N cos((N + 1)θ)
(where on the right-hand side, trigonometric identities have to be used to write the result
as a polynomial in cos θ). Observe that for any x ∈ [−1, 1], we can equivalently write
the previous as TN +1 (x) = 2−N cos ((N + 1) cos−1 x). Let us note that this is a normalized
50
version of the standard Chebyshev polynomial of the first kind, yielding a monic polynomial
(i.e., a polynomial with leading coefficient equal to 1). As a monic polynomial, TN +1 (x) can
be written as
TN +1 (x) = xN +1 − q(x), for some q ∈ PN .
It turns out that q(x) is a solution to problem (3.8). Now that we have a solution to problem
(3.8), we can find a solution to problem (3.7) by finding all the roots (or zeros) of the
(normalized) Chebyshev polynomial and use them as our nodals
points. On the interval
π(i+1/2)
[−1, 1], the nodal points (zeros of TN +1 ) are xi = cos N +1
, for i = 0, 1, . . . , N .
This result holds for the interval [−1, 1], and translation and dilation can be used to
accommodate other intervals [a, b] (see Section 3.1.2). Thus, the roots of TN +1 are the
nodal points to use in order to minimise the error. Specifically, one can show that for these
polynomials (i.e., by using the aforementioned nodal points) we obtain
1
|NN +1 (x)| = |TN +1 (x)| ≤ , for x ∈ [−1, 1], (3.9)
2N
and therefore,
maxy∈[−1,1] |f (N +1) (y)|
|f (x) − PN (x)| ≤ . (3.10)
(N + 1)!2N
In general, we have
N +1
maxy∈[a,b] |f (N +1) (y)|
b−a
|f (x) − PN (x)| ≤ .
2 (N + 1)!2N
Note that this is a smaller error than that in (3.6). To prove this, you might consider using
Stirling’s approximation for (n + 1)! (assuming n is large), however this is non-examinable.
51
Chapter 4
Numerical Differentiation
The problem posed in this CHapter is extremely simple: given the function f (x), compute
dn f /dxn at given x. In particular, assuming that we can compute the values of the function
f at some points xk , we would like to convert this information into an approximate expres-
sion for the derivatives. Nonetheless, we will show that such an approximation, although
extremely useful, always incurs a substantial error that needs to be tracked and quantified.
53
In order to address this, consider the ratio a/b, where a ∼ b and both are rather small,
say O(h); consider a = Ah and b = Bh for some small finite h, and O(1) numbers A and B
(the exact value of which is not important here). a and b are assumed to be obtained from
some operation on the computer, so have absolute errors O(ϵ). Thus, when evaluating the
ratio a/b on a computer, we have
a + xϵ a (bx − ay)
= +ϵ + O(ϵ2 ) (4.7)
b + yϵ b b2
for some (random) O(1) coefficients x and y (whose exact values do not matter), representing
the round-off errors of a and b, respectively. Note that in (4.7) we utilized a first-order Taylor
series expansion in ϵ evaluated at ϵ = 0. This means that the absolute error of this calculation
is
(bx − ay) Bhx − Ahy Bx − Ay ϵ ϵ
ϵ = ϵ = = O . (4.8)
b2 B 2 h2 B2 h h
Recall that A, B, x, y are all O(1) numbers, and x and y are essentially random. Here we
took into account both small quantities ϵ and h. You can also verify that the higher order
terms in the ϵ Taylor series expansion will lead to terms of the form O((ϵ/h)n ) for higher
powers n (i.e., n ≥ 2).
From Eqs. (4.8) and (4.6), we conclude that
ϵ ϵ
′ ′
fnum (x) = fapprox (x) + O = f ′ (x) + O(h) + O .
h h
This means that we cannot
really take h to be very small, as this will increase the induced
ϵ
arithmetic error O h of this numerical method, because h is in the denominator. How-
ever at the same time, we cannot take h to be too large, because this will increase the
approximation error O(h) of the numerical method: instead, we should strike an optimal
balance between the two sources of error in order to minimize the overall absolute error of
our approximation of the derivative.
For this “forward difference approximation” method, the combined absolute error is
ϵ
E ≜ O(h) + O .
h
The important question is: what is the optimal value for h, which minimises the error? In
order to answer this, we write the combined error as
vϵ
E = uh +
h
for some O(1) numbers u and v (whose values are not important), and find the minimum
with respect to h. By taking the derivative of the previous expression (w.r.t. h) and equating
it to zero, this minimum is obtained at
√
r
vϵ vϵ
u− 2 =0 ⇒ h= = O( ϵ). (4.9)
h u
√
That is, the optimal choice of h is O( ϵ) = O(10−8 ). Thus, because of finite machine-
precision, we do not expect to achieve more than about 8 digits of accuracy (or precision) in
this derivative approximation.
54
Let us also estimate the absolute error for the “central difference approximation” given
in eq. (4.2). In this case, we can (similarly to the forward differences case) show that the
combined error is O(h2 ) + O( hϵ ) whose minimal value is achieved at h = O(ϵ1/3 ) = O(10−6 ),
and thus the best possible absolute approximation error is O(10−12 ). We see that the central
difference approximation is a slightly better, but still not to the extent that is suggested in
(4.2)2 .
Let us at this point note that in Python, we could use a multiple precision library, such as
mpmath, which allows to decrease ϵ to, say, 10−500 (see, for example, Section 2.5). Nonetheless,
we should always keep in mind that increasing the precision in Python introduces significant
memory and CPU overheads, making our code slower and less scalable. Thus, we should
always strive to optimize our code and our numerical approximations to make the most of
the standard machine precision.
Exercise 10 Show that all Taylor series terms in (4.7) have the form, under a calculation
like (4.8), O((ϵ/h)n ).
Exercise 11 Work out various expressions for the second derivative, and evaluate the opti-
mal h, taking into account round-off and approximation errors. You should get that for the
k th derivative, the roundoff error is O(ϵ/hk ), and with an approximation error O(hn ), the
1 n
optimal value is h = O(ϵ n+k ) with optimal error E = O(ϵ n+k ).
mp.mp.dps = 40
def f(x):
return x ∗∗3+2∗ x
def df(x):
return 3∗( x ∗∗2)+2
def numberOfGoodDigits (h):
differencempf = (f(1+h)−f(1−h ))/(2∗ h) − df (1)
return float((−mp.log(mp.fabs( differencempf )))/ mp.log (10))
2
Note that if we did not consider round-off errors, then the forward difference approximation would be
O(h), while the central difference would be O(h2 ) (i.e., in theory, the latter is significantly better). In practice
however, we see that the approximation power of the two is not as big, when also considering round-off errors.
55
Next, we utilize the following piece of code to plot the number of accurate digits versus the
value of h. The resulting plot is given in Figure 4.3.
import matplotlib . pyplot as plt
We observe that the plot in Fig. 4.3 verifies our estimates for the optimal value of h!
Indeed, for the central difference approximation, we observed that an optimal value of h is
O(ϵ1/3 ). In the above Python example, we set ϵ = 10−40 , and thus the optimal h should be
about O(10−14 ). This is indeed verified in Fig. 4.3.
Remark
Observe that in the previous code we use arange, which is a method we have seen
from the numpy library. But we have not imported that library! In fact, mpmath, which
we have imported, has a lot of the mathematical methods of numpy. In particular
in contains arange, sin, cos, .... But crucially, these return multiple precision
floats (i.e., mpf) numbers. So, above, in writing xdense = arange(0,40,0.1), we
made sure that the numbers we pass to the function numberOfGoodDigits are of mpf
type, hence the calculations done in numberOfGoodDigits are all on mpf numbers.
Thus, we are able to work with 40 digits of accuracy; this is essential for the procedure
to work.
56
In general, working with high-precision libraries (like mpmath) can be tricky and leads
to non-scalable code. Thus, such libraries should be used for particular calculations
that require high precision, and then mpf numbers should be (hopefully) safely cast
back to regular floats (using which we can utilize any available Python package, like
numpy).
2 f (x+h/2)−f
h/2
(x)
− f (x+h)−f (x)
h 4f (x + h/2) − f (x + h) − 3f (x)
G= = ,
2−1 h
which indeed gives f ′ (x) with better precision:
4f (x + h/2) − f (x + h) − 3f (x) 1
= f ′ (x) − h2 f (3) (x) + O(h4 ),
h 12
i.e., precision is boosted to O(h2 )!
57
Chapter 5
Numerical Integration
We will discuss two main classes of conceptually different methods for numerical integration.
The first class includes midpoint, trapezoidal and Simpson methods. In these methods you
only get the values of the integrand for some fixed points xi and build an approximation
valid between these points like on the picture below
Figure 5.1: Illustration for the midpoint, trapezoid and Simpson methods. In all cases a
(piecewise) polynomial interpolation of the integrand is used (constant, linear and parabolic).
For these methods the sampling points are equally separated.
Figure 5.2: For the Gaussian integration method the sampling points are chosen in a special
way to increase precision
In this first class, we will also see the Gaussian integration method. This is a more
advanced way of integrating, where we also adjust, in a smart way, the positions of the
58
points xi at which the “measurement” of the integrand is made, in order to reduce as much
as possible the error made in the resulting approximation of the integral.
The second class is very different: it is based on a random process, is less accurate,
but much more flexible and easier to implement in the computer. This is the Monte Carlo
method. It is extremely popular in science, thanks to its great flexibility.
That is, with the trapezoidal rule, the numerical approximation of the integral on the interval
[a, b] is obtained by separating it into n small intervals of size h = b−a
n
each, that is with
xi = a + hi, and evaluating trapezoid area on each of these small intervals:
Z b n−1 n−1 n−1
X X f (xi ) + f (xi+1 ) h X h
f (x)dx ≈ Iab [f ] := Si = h = f (a) + f (a+hi)h+f (b) . (5.2)
a i=0 i=0
2 2 i=1 2
h2 ′′
f (xi ) + f (xi+1 ) ′ h
Si = h ≈ f (xi ) + f (xi ) + hf (xi ) + f (xi ) (5.5)
2 2 2
We then evaluate an upper bound for the total error for the full integral using this. With
(5.19), the estimate for the error is the sum
Z b n−1 n−1 3
X X h ′′
E= f (x)dx − Iab [f ] ≤ Ei = 4
|f (xi )| + O(h ) (5.7)
a i=0 i=0
12
3
nh
≤ maxx∈[a,b] |f ′′ (x)| + O(nh4 ) (5.8)
12
h2
= (b − a)maxx∈[a,b] |f ′′ (x)| + O(h3 ) (5.9)
12
where we used the fact that the sum contains n = b−a h
terms, which, as h is made small, is a
−1
large number of order O(h ). Thus, the error in the trapezoidal approximation is bounded
by O(h2 ). This is unless the second derivative is zero everywhere, in which case the function
is linear and the trapezoidal approximation is exact!
60
error 1 . But the computational operation of going from A to A + ai adds an O(ϵ) error again:
numerically we do A → A + ai + O(ϵ). Thus, it is equivalent to assume each ai to have a
roundoff error O(ϵ).
Now we do the sum of a big numberPof such terms, hence the error accumulates. That
n−1
is, the numerical evaluation of the sum i=0 ai is
n−1
X
(ai + αi ϵ) (5.10)
i=0
where αi are (pseudo-)random O(1) numbers. With maxi |αi | = α = O(1), we can then
bound the difference between the numerical and “mathematical” expressions as
n
X n
X
(ai + αi ϵ) − ai ≤ αnϵ = O(ϵ/h) (5.11)
i=0 i=0
and therefore n n
X X √
(ai + αi ϵ) − ai = O(ϵ/ h). (5.13)
i=0 i=0
Thus, in fact, we expect this the trapezoidal rule numerical integration to be affected by
smaller roundoff errors than the simplest numerical derivative method. This is typical of
numerical integration: it is less sensitive to roundoff errors than numerical differentiation;
the same analysis holds for the other methods of integration in the first class, so we will not
repeat it.
61
def f(x):
return 2/(x ∗∗2+1)
trapezoidal (f,−1,1,1000)
What should you get?
Check that the error decreases as 1/n2 .
h2
E≤ (b − a)maxx∈[a,b] |f ′′ (x)| + . . . = O(h2 ). (5.20)
24
This is the same order O(h2 ) as the trapezoidal rule, but the coefficient 1/24 is smaller –
thus this is (marginally) more precise!
62
5.3 Simpson’s rule
Simpson’s rule is a mixture of trapezoidal and the midpoint methods which uses the knowl-
edge of the function at the end points and also in the middle of each of the intervals. For
the small interval [a, b] = [xi , xi+1 ] Simpson’s rule states:
Z b
b−a a+b
f (x)dx ≈ Si := f (a) + 4f + f (b) . (5.21)
a 6 2
Again for not-small intervals [a, b] we divide it into n parts of lengths h = (a − b)/n with
endpoints xi = a + ih (again x0 = a, xn = b), and then the integral is estimated by the sum
Z b n
X h xi−1 + xi
f (x)dx ≈ Iab [f ] := f (xi−1 ) + 4f + f (xi ) . (5.22)
a i=1
6 2
giving
n
X h h
Iab [f ] := f (a + ih − h) + 4f a + ih − + f (a + ih) . (5.23)
i=1
6 2
Where does (5.21) come from? One way to see it is by using Richardson extrapolation,
section 4.4, on the trapezoidal method. Indeed consider the trapezoidal approximation for
integration on the interval [a, b], with two different choices of small parameter h: one is
h1 = b − a (the full interval), and the other is h2 = (b − a)/2 = h1 /2 (half of the interval). In
the notation of the Richardson extrapolation section we have p = 2, as the approximation
error for the trapezoidal method is O(h2 ), and the results of the approximate integration on
[a, b] for the choices h1 , h1 are
f (a) f (b) b − a f (a) a+b f (b)
g(h1 ) = (b − a) + , g(h2 ) = +f + . (5.24)
2 2 2 2 2 2
which is nothing but the Simpson’s rule. Hence, according to the Richardson method, the
resulting approximate integral should have an error of O(h3 ) or better. We see below that
it is in fact better!
63
and
h2 h3 h4 5h5
Si = f (xi )h + f ′ (xi ) + f ′′ (xi ) + f ′′′ (xi ) + f ′′′′ + ... (5.27)
2 3! 4! (4!)2
giving
xi +h
h5
Z
Ei = Si − f (x)dx = f ′′′′ (xi ) + O(h6 ). (5.28)
xi 4!5!
From this, using again that there are n = (b − a)/h terms in the sum for the approximation
Rb
Iab [f ], the overall approximation error E = a f (x)dx − Iab [f ] ≤ n−1
P
i=0 Ei is bounded as
h4
E≤ (b − a)maxx∈[a,b] |f ′′′′ (x)| + . . . = O(h4 ). (5.29)
4!5!
This is much better than the trapezoidal and midpoint methods.
(here for lightness of notation we omit the explicit dependence on a, b of the approximation
Rb
I[f ]). For instance, if you want to approximate a g(x)dx, then depending on what g(x)
looks like, you want to choose the right w(x), so that the other factor f (x) does not vary
too much. How to choose w(x) is a bit of an art; for now we just assume w(x) is given.
The second concept is that we have essentially two groups of parameters that we can
adjust to approximate the integral: the nodes xn,i , and what I would call the “approximation
weights” Ai , for i = 0, 1, 2, . . . , n:
n
X
I[f ] = Ai f (xi ). (5.31)
i=0
In fact, the number of nodes n + 1 is another parameter that we can choose – but here we fix
it, and then try to optimise on xi and Ai . So, we have in total 2n + 2 parameters to adjust.
Once for the measure w(x), and the integration range [a, b], and the number of nodes n,
are given, then the method fixes the xi ’s and Ai ’s. The method attempts to minimise the
error made in the approximation. How? It follows this simple requirement: the approximate
method should give the exact value of the integral for all polynomials of degree less than or
equal to 2n + 1:
2n+1
X
fpoly (x) = am x m . (5.32)
m=0
64
That is Z b
w(x)fpoly (x)dx = I[f ]. (5.33)
a
The space of such polynomial is (2n+2)-dimensional, as there are 2n+2 arbitrary coefficients
am , m = 0, 1, . . . , 2n + 1, hence (5.32) is in fact 2n + 2 equations:
Z b n
X
w(x)xm dx = Ai xm
i , m = 0, . . . , 2n + 1 . (5.34)
a i=0
Then since ϕn+1 (x)xk for k = 0, . . . , n is a polynomial of degree less than or equal to 2n + 1,
we should have
Z b Xn
k
w(x)ϕn+1 (x)x dx = Ai xki ϕn+1 (xi ) = 0 , k = 0, . . . , n . (5.36)
a i=0
Conceptually, this is an orthogonality condition: it says that the polynomials ϕn+1 (x) and
xk , k = 0, 1, . . . , n, are orthogonal with respect to the weight w(x). More practically, this
a set of n + 1 equations (for the n + 1 values of k). But the polynomial ϕn+1 (x), once we
choose its normalisation so that the coefficient of xn+1 is 1 (recall that it must have degree
n + 1, so this coefficient cannot be zero), also has exactly n + 1 free parameters:
Thus, the equations (5.36) are now n + 1 linear equations for these n + 1 coefficients, so they
can be solved by linear method (inverting a matrix). Explicitly, the orthogonality condition
is n
X
ci hi+k = −hn+1+k , k = 0, . . . , n (5.38)
i=0
Note that these moments hi do not depend on our choice of n (i.e. our choice of the number
of nodes n + 1). Once the weight w(x) and integral boundaries a, b, are known, we can
calculate them.
Eq. (5.38) gives a linear system of equations on the coefficients ci . Once ci ’s are found,
then ϕ(x) is known, and we just have to evaluate its zeros to find xi ’s. Thus we have
transformed the nonlinear problem on the nodes into a linear problem, and a nonlinear
problem of finding the roots of a polynomial.
65
Note that it is important to assume that w(x) is positive – thus w(x)dx is a measure –
otherwise it is not guaranteed that all xi are real (although we will not go into the mathe-
matical underpinning of this statement).
Once xi ’s are fixed, then one can determine Ai ’s from the linear equation (5.34). In fact,
this is an overdetermined set of linear equations, as there are n + 1 parameters Ai , but there
are 2n + 2 equations. Of course, we’ve already fixed the nodes xi in such a way that the full
set of equations can be solved, so it is sufficient to concentrate on the first n + 1 equations,
in order to determine the Ai ’s. In terms of the moments, these first n + 1 equations take the
form n
X
hm = A i xm
i , m = 0, . . . , n. (5.40)
i=0
2. Solve the linear system (5.38) and construct the polynomial (5.37);
Example: w = 1, a = −1, b = 1
Step 1. Compute hi :
1
(−1)i + 1
Z
hi = xi dx = (5.41)
−1 i+1
so that for instance:
2 2
h0 = 2 , h1 = 0 , h2 = , h3 = 0 , h 4 = (5.42)
3 5
Step 2. Form a linear system for the coefficients ci of the polynomial (5.38). This depends
on our choice of n, so for instance:
n = 0 : h0 c0 = −h1 ⇒ c0 = 0 ⇒ ϕ1 = 0 + x (5.43)
1
h0 c0 + h1 c1 = −h2 c0 = − 3 1
n=1: ⇒ ⇒ ϕ2 = − + x2 (5.44)
h1 c0 + h2 c1 = −h3 c1 = 0 3
h0 c0 + h1 c1 + h2 c2 = −h3 c0 = 0
3
n=2: h1 c0 + h2 c1 + h3 c2 = −h4 ⇒ c1 = − 53 ⇒ ϕ3 = − x + x3(5.45)
5
h2 c0 + h3 c1 + h4 c2 = −h5 c2 = 0
66
From this form it is easy to see that
Z 1
xk ϕn (x)dx = 0 , k = 0, . . . , n − 1 (5.48)
−1
Step 3. Find zeros of ϕn . Again this depends on n, and this gives the nodes we’re looking
for:
n=0: x0 = 0 (5.49)
p p
n=1: x0 = − 1/3, x1 = + 1/3 (5.50)
p p
n=2: x0 = − 3/5, x1 = 0, x2 = + 3/5 (5.51)
n=3: x0 = −0.861136, x1 = −0.339981, x2 = 0.339981, x3 = 0.861136 (5.52)
n = 0 : A0 = 2 (5.54)
A0 + A1 = 2
n=1: A0 A1 ⇒ A0 = 1 A1 = 1 (5.55)
−√ 3
+√ 3
=0
A0q+ A1 + q
A2 = 2
5 8 5
n=2: − 35 A0 + 35 A2 = 0 ⇒ A0 = A1 = A2 = . (5.56)
3 9 9 9
A + 53 A2 = 32
5 0
67
In order to find xi we need to solve the linear system (5.38) which is built out of moments
hi . These must be evaluated by integrating – so we need to know some integrals, in order to
construct our approximate integration method! It is convenient to choose w(x) so that the
weights hi can be evaluated analytically. This is the case in the example done above. It is
2
also the case for the weight w(x) = e−x chosen here; on the interval [−1, 1], the resulting
integrals can be expressed in terms of special functions known as “error functions” (the
theory of special functions is very useful for the Gaussian integration method). But here, to
save us the analytical work, we will instead use an integrator that is already in Python: the
integrate.quad method from the scipy package
from scipy import integrate
def h(k):
return integrate .quad( lambda x: w(x)∗x ∗∗k,a,b)[0]
(check the Python documentation if you’d like to know more about the integrate structure
and its .quad integration method!) We could have used, instead, one of our previous inte-
gration methods, such as Simpson’s method. There is still an advantage here, as once these
few moments are evaluated, to as high a precision as we require, then we can evaluate to a
very high precision many more functions – we build on other, perhaps less strong methods,
to get the very precise Gaussian method.
After that we build the linear system, solve it, and then find zeros of the function ϕn+1 (x),
all this in a single function:
def nodesAndWeights (n):
# finding xi
B = array ([[ None ]∗( n +1)]∗( n+1))
u = array ([ None ]∗( n+1))
for k in range(n+1):
u[k] = − h(n+1+k)
for i in range(n+1):
B[k,i] = h(k+i)
cs = linalg .solve(B,u)
cs = append (cs ,[1])
xs = roots(cs[::−1]).real
xs.sort ()
# f i n d i n g Ai
As = array ([ None ]∗( n+1))
for k in range(n+1):
u[k] = h(k)
for i in range(n+1):
B[k,i] = xs[i ]∗∗ k
As = linalg .solve(B,u)
return xs , As
Some remarks:
• We use the special work None instead of 0. to put zeros – in fact this puts “nothing”,
just a blank space to be filled in later on (in this way, when filled in with floating point
numbers, we won’t have the problem we would have had by putting 0 without the
period afterwards).).
68
• We use the linalg.solve method to solve the linear system that gives the coefficients
ci for the polynomial ϕ(x).
• We use the append method, instead of cs.append. The latter would not have worked
because cs is an array object, not a list object, and arrays don’t have the method
append, like the type array, attached to them. The method append comes with numpy
and is more flexible as it allows one to append a whole list; it takes as input lists or
arrays, but returns an array, not a list object; while a.append, for a list a, does not
need numpy, only appends single elements and returns a list.
• We use the roots(L) method which finds the roots of the polynomial specified by
the coefficients that are in the list L. Importantly, the roots method considers the
coefficients in the opposite order from that we have chosen! That is, the first element
of the list is the coefficient of the highest-degree monomial xn+1 . So we need to invert
the list, with the Python syntax cs[::−1]. The roots method also gives the zeros
as complex numbers (as in general zeros of polynomials are complex numbers). Our
zeros are guaranteed to be real, mathematically; but numerically, as we know, there
are roundoff errors, so Python may well give us numbers with very small, but nonzero,
imaginary part, for instance of the order 10−16 ... In order to avoid any problem down
the line, we take the real part.
The Gaussian integration then can be done as follows
def gauss (f,A,x):
return sum(A∗ array(list(map(f,x))))
For smooth functions this method works very well. It has excellent precision given only a
few evaluations of the function f . Note that the various singularities of the function f can
be absorbed into the measure function w, so if you have many integrals to compute for the
functions with the same type of singularities it could make sense to use this method.
69
S2
S1
the point are equally probable). We assume that we can also easily test whether the point
is inside or outside some given region R1 ⊂ R of area S1 < S, which we would like to
estimate. Repeating this operation N times, and counting how many points n are inside our
domain S1 , we can estimate the area itself. Indeed, because of the assumed uniformity of
the random process, the number of points n should be proportional to the area S1 , so that,
by normalisation (when S1 = S, we have n = N )
S1 n
≈ (5.58)
S N
In the picture we show the two complementary regions R1 ∪ R2 = R with areas S1 + S2 = S.
In (5.58), the left-hand side is the exact mathematical quantity we are looking for, and
the right-hand side is our numerical-method approximation. An important difference with
respect to previous integral approximations such as (5.22), (5.19) or (5.30), is that in (5.58),
the right-hand side is a random variable. That is, the numerical method does not give the
same value, for a given N , every time we repeat it. Numerical methods based on random
processes are more rare than those based on deterministic processes, but are getting more
and more popular. Because the method is based on a random process, the estimate of the
approximation error cannot be solely based on Taylor series expansions; it has to have a
probabilistic, statistical nature. We will not go into details of how to correctly define an
error, but instead just use basic aspects of statistics.
Intuitively the equation (5.58) is obvious. Below we will derive it and we will see that it
is indeed correct in the limit N → ∞. For finite (but large) N we will be able to estimate
the statistical error in this equation.
70
N = 1: then the answer is obvious Pn=1 (1) = p1 and Pn=0 (1) = 1 − p1 = p2 where we denote
S1 S2
p1 = , p2 = . (5.59)
S1 + S2 S1 + S2
Next, for N = 2 the probability of getting no points inside is Pn=0 (2) = (p2 )2 , and similarly
Pn=2 (2) = (p1 )2 . The probability of the remaining option to get just one point inside is simply
Pn=1 (2) = 1−Pn=0 (2)−Pn=2 (2) = 1−(p2 )2 −(p1 )2 = 1−(1−p1 )2 −(p1 )2 = 2p1 −2p21 = 2p1 p2 .
We can summarize that
Pn=0 (2) = (p2 )2 , Pn=1 (2) = 2p1 p2 , Pn=2 (2) = (p1 )2 . (5.60)
The factor 2 in Pn=1 (2) simply indicates that there are two possibilities to get one point
inside - either first point is inside second outside or the first point is outside and the second
inside. In general, for the total number of points N with n points inside, the number of
such combinations is given by the binomial coefficient Nn = n!(NN−n)! !
which indeed gives
2 2!
1
= 1!1! = 2. Thus the general formula is
N n N −n
Pn (N ) = p p . (5.61)
n 1 2
Recall that we are looking for the random variable
n
α= . (5.62)
N
The probability distribution (5.61) will tells us about how this is distributed. For our nu-
merical method to work, we want the average value to be equal to S1 /S:
S1
E[α] = (5.63)
S
Then, the error in this numerical approximation will be identified with the standard deviation
of n/N p
σ = E[α2 ] − E[α]2 (5.64)
In fact, as we want the area S1 , we have to multiply by S so the error estimate is
Eapprox = σS (5.65)
We first evaluate the average and standard deviation. The trick is to evaluate explicitly
the “generating function of cumulants”:
N N
λα X λn/N
X N −n
E e = Pn (N )e = (eλ/N p1 )n pN
2 = (eλ/N p1 + p2 )N . (5.66)
n=0 n=0
n
We deduce r r
p1 (1 − p1 ) p1 p2
E[α] = p1 , σ= = (5.69)
N N
Therefore we find the correct average (5.63), and we find that the numerical error (the
standard deviation) is
r r r
p1 (1 − p1 ) p1 p2 S1 S2
Eapprox = S= S= . (5.70)
N N N
√
As claimed, the numerical error decreases like 1/ N .
Let us try to get a bit more intuition about the probability distribution itself, when N
becomes large. From (5.66), we can take the limit N → ∞ on the rhs and we find, using the
standard formula for the exponential limu→0 (1 + au)1/u = ea ,
This means that E[αn ] = pn1 for all n, and therefore α √is for sure given by p1 , without any
fluctuations. This is in agreement with the decay as 1/ N of the variance: at large N , the
numerical approximation α gives the correct mathematical value with probability 1.
That is, for given S1 and S2 , for any N there is a nonzero probability to be extremely
unlucky and get no points inside at all, for instance; however for large N this probability is
exponentially suppressed, and we will get α = p1 or very nearby with high probability.
Can we see more explicitly that the probability distribution becomes peaked at α = p1
when N → ∞? To see this we have to use Stirling’s approximation of the factorial m! ∼ mm ,
valid for large m. Using it we find:
The function g(α) is negative and has maximum at α = p1 (which can be found from
g ′ (α) = 0) see Fig.5.4. As we see from Fig.5.3 the only relevant region is around α = p1
as the probability decreases fast away from this point. Computing the Taylor series around
α = p1 we find
ᾱ2 1
g(α) = 0 + ᾱ × 0 − + O(ᾱ3 ), ᾱ = α − p1 (5.73)
2 p1 p2
72
Figure 5.4: Probability Pn=αN (N ) for large N becomes eN g(α) . The function g(α) defined in
(5.72) is maximal at α = p1 .
which is the normal distribution for α = n/N with mean at p1 and with variance σ 2 = p1Np2 ,
in agreement with what we found above. We see that for large N the variance decreases, the
distribution becomes very narrow, and as a result we can be
√ rather confident that our n is
near its mean value p1 N . The uncertainty in n is ∼ N σ = N p1 p2 i.e. we can write
p
n ≃ p1 N ± N p 1 p2 . (5.75)
S1
Thus knowing n we can estimate p1 = S
by
r
n p 1 p2
p1 ≃ ± (5.76)
N N
from where we get r r
n p1 p2 n S1 S2
S1 = p 1 S ≃ S ± S =S ± (5.77)
N N N N
where the last term gives us the error estimate.
73
n = n+1 # i n c r e a s e t h e c o u n t e r o f t h e p o i n t s i n s i d e
p1 = n/N # our e s t i m a t e f o r t h e r a t i o o f a r e a s
S1 = p1 ∗4 # our e s t i m a t e f o r t h e a r e a S1 = p1 S
error =4∗ sqrt(p1∗(1−p1 ))/( sqrt(N)) # e r r o r o f t h e a p p r o x i m a t i o n
which, for example, gives 3.122 and ∼ 0.01655 for the error estimate. This is indeed consis-
tent with the exact result 3.1416.
74
Chapter 6
Numerical solution of ordinary differential
equations
75
Here f is a function that determines the differential equation. For instance, for n = 2,
f (x, y, y ′ ) = xy ′ + y ⇒ y ′′ = xy ′ + y. (6.6)
It turns out that the n-order problem for a single function it can always be transformed
into 1st order problem for a n-dimensional vector. We do this by introducing auxiliary
functions
y0 = y , y1 = y ′ , y2 = y ′′ , . . . , yn−1 = y (n−1) (6.7)
and in these notations the 1st order differential equation becomes
y0′ = y1 (6.8)
y1′ = y2 (6.9)
... (6.10)
′
yn−1 = f (x, y0 , y1 , . . . , yn−1 ) . (6.11)
That is, we can identify ⃗y := (y0 , y1 , . . . , yn−1 ) = (y, y ′ , y ′′ , . . . , y (n−1) ) as our unknown n-
dimensional vector of functions, and with the choice
the differential equation (6.4) is equivalent to (6.1) (and the initial conditions also match).
For the numerical implementation, we will assume that this transformation has been
done, and concentrate on the vectorial first order problem.
⃗y (x + h) − ⃗y (x)
⃗y ′ (x) ≈ . (6.13)
h
Using this in (6.1), we get the approximate formulation of the problem, which can be put
on the computer. In order not to confuse the true mathematical solution to the original
problem, with the numerical solution, I’ll use different notations: ytrue for the solution to the
original problem, and y for the solution to the approximate problem. Thus we have
′
⃗ytrue (x) = F⃗ (x, ⃗ytrue (x)), ⃗ytrue (a) = α
⃗ (6.14)
⃗y (x + h) − ⃗y (x)
= F⃗ (x, ⃗y (x)), ⃗y (a) = α
⃗ (6.15)
h
for the approximate problem, and we expect ⃗ytrue (x) ≈ y(x) for all x; clearly we have an
exact equality for x = a, by definition.
76
The main observation is that eq. (6.15) can be solved iteratively for all ⃗y (a + ih), for
i = 1, 2, 3, 4, . . .. In fact, it does not fix ⃗y (x) for all x, but only on the grid
xi = a + ih. (6.16)
But this is ok, if the grid is tight enough (h is small enough) it will give a good approximation,
and we can use our interpolation methods to fill-in the gaps between the grid.
In oder to see this, we introduce the notation
This is it: the numerical algorithm simply computes ⃗y1 from the right-hand side, which only
depends on the known ⃗y0 = a; then it computes ⃗y2 , which only requires the now-known ⃗y1 ,
etc.
77
interval has a length h = (b − a)/N . It evaluates the result ⃗yi = ⃗y (xi ) for all i = 0 (that
is x0 = a, so this is already given), 1, 2, 3, . . . , N inclusively, with xN = b the last point
given. It returns both the grid xi ’s, and the results ⃗yi ’s. In the memory, ⃗yi ’s are N + 1
vectors organised as a matrix N + 1 by n; recall that n is the size of each vector ⃗yi , and
in the algorithm, this is determined as the size of α ⃗ (so we don’t need another entry in the
function).
def euler (alpha ,a,b,N):
h = (b−a)/N;
ys = zeros ((N+1, alpha.size ));
ys [0] = alpha;
xs = arange (a,b+h,h)
for i in range(N):
ys[i+1] = ys[i] + h∗F(a + i∗h, ys[i])
return xs ,ys
Usage example: We consider y ′ = x with the initial data y(0) = 1. Note how in this code
we use ys[:,0] in order to access the zeroth index of all vectors ⃗yi , that is the elements ⃗yi0 ,
for all i, in order to plot them all. This means we only plot the result for the function itself,
not its derivative (here the “vectors” ⃗yi are one-dimensional, so this is trivial, but in other
examples below this becomes important).
from numpy import ∗
def f(x,yvec ):
return x
alpha = array ([1])
xs ,ys= euler(alpha ,0 ,1 ,1000)
# plot of the r e s u l t
% matplotlib inline
import matplotlib .pylab as plt
plt.plot(xs ,ys [: ,0])
plt.plot(xs ,1+ xs ∗∗2/2)
plt.show ()
1.6
1.5
1.4
1.3
1.2
1.1
1.0
0.0 0.2 0.4 0.6 0.8 1.0
78
Another example y ′′ = −y, and initial data y(0) = 0 and y ′ (0) = 1, whose analytical solution
is sin(x), against which we compare:
def f(x,yvec ):
return −yvec [0]
alpha = array ([0 ,1])
1.5
1.0
0.5
0.0
−0.5
−1.0
−1.5
0 5 10 15 20 25 30
we clearly see that the error accumulates! In the next section we will see how to improve
the situation.
Thus the error for a single step of the recursion is of order h2 . Much like for integration, such
errors accumulate as we perform many steps of (6.21) – although here it is more difficult
to make an accurate bound for the error estimate. Nevertheless, we can estimate the total
error similarly. If we make N steps, say in order to reach the point b, the error will becomes
O(N h2 ), and since N = (b − a)/h we get the error of order O(h) at the end. The choice
of b – here the end-point of the interval we used in the numerical algorithm – is arbitrary.
The important point is that for any fixed position x that is a finite distance away from a,
we will need to perform (x − a)/h steps, and as h becomes smaller, this grows like 1/h. So
79
for any such x, the error is expected to be O(h). For larger x’s, the coefficient of h in this
error estimate will be larger, but still going like h.
So, to increase the precision by one digit we will have to make 10 times more iterations
which will make the whole calculation 10 times slower. That is not extremely good, but so
we will deal with the error in a smarter way in the next section.
80
6.1.5 Runge-Kutta Method: Python implementation
We only have to modify one line in the Euler method
def rungekutta (alpha ,a,b,N):
h = (b−a)/N;
ys = zeros ((N+1, alpha.size ));
ys [0] = alpha;
xs = arange (a,b+h,h)
for i in range(N):
ys[i+1] = ys[i] + h /2∗( F(a + i∗h, ys[i])
+F(a + i∗h+h, ys[i]+h∗F(a + i∗h, ys[i])))
return xs ,ys
1.5
1.0
0.5
0.0
−0.5
−1.0
−1.5
0 5 10 15 20 25 30
we get a result which agrees excellently with the analytical solution this time!
81
It is convenient to introduce the following notation:
⃗ 0 = hF⃗ (x, ⃗y )
K (6.27)
⃗ 1 = hF⃗ (x + h/2, ⃗y + K
K ⃗ 0 /2) (6.28)
⃗ 2 = hF⃗ (x + h/2, ⃗y + K
K ⃗ 1 /2) (6.29)
⃗ 3 = hF⃗ (x + h, ⃗y + K
K ⃗ 2) (6.30)
We know that the linear second order equation should have two independent solutions. The
boundary conditions y(a) = α and y(b) = β should fix a unique solution.
82
Figure 6.1: Shooting method: Different probe trajectories (dashed) trying to meet the bound-
ary conditions and the solid line successfully meets the boundary conditions.
find in particular y(b), however, for some randomly picked value for u most likely one gets
y(b) ̸= β. The strategy is to tune the value of u so that we get as close as possible to the
dedicated value of y(b). In other words, y(b) defines us a function of u: g(u) = y(b) and we
have to find u such that G(u) ≡ g(u) − β = 0. This maps our problem to the problem of
finding a zero of a function, which can be solved by some iterative method such as Secant
method1
To summarize, the main steps in the method are:
83
then we can plot the result to make sure we indeed found the solution interpolating between
α = 1 and β = 0:
xs ,ys= rungekutta (array ([ alpha ,u0 ]) ,0 ,30 ,2000)
plt.plot(xs ,ys [: ,0])
plt.show ()
84
Chapter 7
Stochastic differential equations
Stochastic differential equations are like usual differential equations with some noise taken
into account. The typical example is a free particle motion in empty space given by the
Newton’s 2nd law
mẍ = 0 (7.1)
which is an ordinary differential equation. Now imagine a particle going through some cloud
of dust (tiny small particles) - it will exhibit some irregular collisions affecting the trajectory
of the particle in some unpredictable way. This can also be described by Newton’s 2nd law,
which includes some random force f (t)
By random force we understand some random variable with certain distribution (if the
particles in the cloud are small the expectation value of f (t) is small etc.).
Interestingly, very similar equations can describe the financial market where the role of
the dust cloud of the particles is played by buyers and sellers.
where a and b are some fixed functions and dW (t) is a random “force” represented by the
Wiener process. What this equation means could become clearer in the discrete version:
where ∆Wn = N (0, ∆t) is a random variable with normal distribution and with expectation
value 0 and variance ∆t.
85
ts = arange (0,T,dt)
N = ts.size
xs= zeros(N)
xs [0]=0
sdt = sqrt (1∗ dt)
for i in arange (1,N):
xs[i] = xs[i−1] + sigma ∗ random . normal (0, sdt)
return ts , xs
Note that random.normal generates a random variable with normal distribution, where the
first argument is the average, and the second the standard deviation; here the variance is
dt, so we need to take the square root. This will generate a rather convincing picture of
Brownian motion
% matplotlib inline
import matplotlib . pyplot as plt
ts , xs = brownian (1 ,1)
plt.plot(ts , xs)
plt.show ()
Next we can check that the random variable x(T ) corresponds to the normal distribution
x2
e− 2T
p(x) = √
2πT
. For that we will run the simulation many times
ys = zeros(M)
T = 1
for i in range (100):
ts , xs = brownian (T ,1)
ys[i]=xs[−1]
plt.plot(ts , xs ,"b")
plt.show ()
86
The question is how to convert the set of final points ys into a nice smooth distribution.
Fortunately in the scipy package there is a suitable function gaussian kde which converts
a set of points into a smooth distribution as shown below
from scipy.stats.kde import gaussian kde
kde = gaussian kde (ys)
x = arange (−5,5,0.1)
plt.plot(x,kde(x))
plt.plot(x,exp(−x∗x/2/T)/ sqrt (2∗ pi ∗T))
plt.show ()
0.30
0.25
0.20
0.15
0.10
0.05
0.00
−6 −4 −2 0 2 4 6
87
Chapter 8
Partial Differential Equations
In this chapter we briefly discuss what can be done to numerically solve partial differential
equations.
The type of problems we may like to solve are those with boundary conditions specified for
the function f (x, y). For simplicity we assume that we are studying the equation (8.1) in
the rectangular domain x ∈ [a, b] and y ∈ [c, d]. Using the approximation of the second
derivative:
f (x + h, y) + f (x − h, y) − 2f (x, y) f (x, y + h) + f (x, y − h) − 2f (x, y)
∂x2 f (x, y) = 2
, ∂y2 f (x, y) =
h h2
we get for
or in other words
f (x + h, y) + f (x − h, y) + f (x, y + h) + f (x, y − h)
f (x, y) = .
4
So we can see that the equation is satisfied if the value f (x, y) at some point of the lattice
is equal to the average of the values at the neighboring nodes. This gives the idea for the
algorithm - we scan through the lattice replacing the values at the nodes with the mean
value of the f s at the neighboring nodes at the previous iteration.
88
def relaxation (ts):
res = ts [:]
for x in range (1,sizex −1):
for y in range (1,sizey −1):
res[x,y]=( ts[x+1,y]+ts[x−1,y]+ts[x,y+1]+ ts[x,y−1])/4
return res
The line res = ts[:] copies ts into res, instead of just attributing the variable res to
the same table as ts (which is what res = ts would have done). Here this is absolutely
essential, otherwise the iteration, in the double for loop, would not be the right one, it would
act on elements of the same matrix and there would be interference between the different
iteration steps, leading to the wrong result!
We apply it to the simulation of a temperature field inside a room with one door and one
window
sizex = 14 # s i z e o f t h e room
sizey = 16
ts = array ([[20.]∗ sizey ]∗ sizex) # s e t t i n g some i n i t i a l v a l u e s
for i in range (1 ,7):
ts[0,i]=10 # s e t t i n g boundary v a l u e s a t t h e door
for i in range (8 ,13):
ts[sizex −1,i]=0 # and t h e boundary v a l u e s a t t h e window
89
Figure 8.1: Iterations of diagonalization of the relaxation method. It is important to remem-
ber this that the boundary is fixed (by definition of it being a boundary value problem), so
the outside perimeter of squares will never change colour. The above simulation computes
the temperature field inside a room (the rectangle), given the temperature of the window
(the blue bar below) and the door (the green bar on the top) and the fixed temperature of
the walls (the maroon perimeter). Iteratively we find the temperature field inside the room,
and thus the ideal placement for a bed to snooze in.
90
Chapter 9
Eigenvalue problem
where ⃗x and λ are both unknown. The goal is to determine λ. ⃗x could also be of some
interest, but we will not consider it here. This is a very important problem with numerous
applications. For instance, some ODEs can be reduced to this form. Many problems in
Classical Mechanics and Quantum Mechanics can be reduced to the eigenvalue problem.
(A − λI)⃗x = 0 (9.2)
where I is a unit matrix. We see that in this form this is a linear homogeneous equation.
For this equation to have a nontrivial solution we should require |A − λI| = 0. Expansion of
the determinant leads to the polynomial equation, also known as the characteristic equation
a0 + a1 λ + a2 λ2 + · · · + an λn = 0 (9.3)
which has the roots λi , i = 1, 2, . . . , n called the eigenvalues of the matrix A. For n not too
large we can solve this equation numerically. Let’s consider an example
1 −1 0
A = −1 4 −2 (9.4)
0 −2 2
and then find its roots. To do that we first estimate roughly their positions with a plot:
91
xs = arange (−1,6,1/10)
ys = [detl(x) for x in xs]
plot(xs ,ys)
grid () # t o add a g r i d
we see that the roots are around 0, 1 and 5. Then we can use the Newton method to find
zeros with good precision:
import scipy. optimize as sc
[sc. newton (detl ,0),
sc. newton (detl ,1),
sc. newton (detl ,5)]
[0.28129030960808521 , 1.3160308608707258 , 5.4026788295211823]
This method is easy to implement, however, it requires the knowledge of the starting points
for finding the root and it will not work for the degenerate eigenvalues as well as for the
current case. Finally, this method will be hard to use for large matrices (with many eigen-
values).
92
which is again an equivalent eigenvalue problem for another matrix P −1 AP = A∗ . Ideally
we would like A∗ to be diagonal - then the problem is essentially solved as we just read
off λ′ s from its diagonal elements. However finding such transformation P immediately is
complicated. We will do that step by step annihilating the off-diagonal elements one by one.
At each step we will transform A with a particular matrix R of the form:
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 c 0 0 s 0 0
0 0 0 1 0 0 0 0
R= 0 0 0 0 1 0 0 0
(9.7)
0 0 −s 0 0 c 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
with c = cos θ and s = sin θ for some angle θ. Here Rkk = Rll = c and Rkl = −Rlk = s. This
is called the Jacobi rotation matrix. Note that its inverse is given by RT and as a result A∗
is of the form (here an example for k = 3, l = 6):
A1,1 A1,2 cA1,3−sA1,6 A1,4 A1,5 sA1,3+cA1,6 A1,7 A1,8
A1,2 A2,2 cA2,3−sA2,6 A2,4 A2,5 sA 2,3+cA
2,6 A 2,7 A 2,8
A3,3 c2+s sA6,6−2cA3,6 cA3,4−sA4,6 cA3,5−sA5,6 csA3,3+ c2−s2 A3,6−csA6,6cA3,7−sA6,7 cA3,8−sA6,8
cA1,3−sA1,6 cA2,3−sA2,6
A A cA −sA A A sA +cA A A
1,4 2,4 3,4 4,6 4,4 4,5 3,4 4,6 4,7 4,8
A1,5 A2,5 cA −sA A A sA +cA A A
3,5 5,6 4,5 5,5 3,5 5,6 5,7 5,8
sA1,3+cA1,6 sA2,3+cA2,6 csA3,3+ c2−s2 A3,6−csA6,6 sA3,4+cA4,6 sA3,5+cA5,6 2
A 3,3 s +c 2sA 3,6+cA 6,6 sA3,7+cA 6,7 sA3,8+cA 6,8
A1,7 A2,7 cA3,7−sA6,7 A4,7 A5,7 sA3,7+cA6,7 A7,7 A7,8
A1,8 A2,8 cA3,8−sA6,8 A4,8 A5,8 sA3,8+cA6,8 A7,8 A8,8
(9.8)
(It looks rather complicated but you are advised to derive it yourself by multiplying smaller
matrices. Consider separately elements of A∗ij for i = k, i ̸= k and/or j = l, j ̸= k). The
idea is to take such value of θ that the element A∗kl = 0. This gives:
Note that
√ we do not need to know the angle itself. Instead we need c and s, which are related
by s = 1 − c2 . So we write
√
(2c2 − 1)Akl + c 1 − c2 (Akk − All ) = 0 (9.10)
where the sign is chosen so that s = 0 and c = 1 for Alk = 0. I.e. so that R becomes almost
an identity matrix for small off-diagonal elements. In fact, these formula for s, c still need to
be made more precise, as we need to take the square root. For c we take the positive square
root to get c = 1 for Alk = 0, but for s we need to take just the right branch. This branch
is determined by solving for s in (9.9) after using s2 = 1 − c2 , so we get
(1 − 2c2 )Akl
s= (9.12)
c(Akk − All )
93
With all this, we now have
After that we apply this transformation iteratively selecting k and l such that Akl is the
biggest off-diagonal element. See implementation for the details.
Finally, we have to specify which k and l to use at each step. For that we scan the upper
part of the matrix, finding the biggest element:
# f i n d s l o c a t i o n o f t h e b i g g e s t e l e m e n t a b o v e t h e main d i a g o n a l
def findmax (A):
imax = 0
jmax = 1
maxval = abs(A[0 ,1])
n = (A[0]). size
94
for i in range(n):
for j in range(i+1,n):
if abs(A[i,j])>maxval :
imax = i
jmax = j
maxval = abs(A[i,j])
return imax , jmax , maxval
Figure 9.1: Iterations of diagonalization of some random matrix with Jacobi Method
95
Chapter 10
Optimisation problem
The problem of optimisation is a problem of finding the minimum (or maximum) of a function
F (⃗x) of several parameters ⃗x. Depending on the number of parameters and smoothness of
the function F the problem can be complicated and sometimes very slow and one should
adopt a suitable method to achieve the best and fastest result. Normally, one finds only a
local minimum which is another common problem of the existing methods.
We will describe a few methods applicable in different situations: those applicable for one
single parameter; those applicable when the function is smooth and as an example of what
one can do in the situation when the function F is not smooth we describe the Down-Hill
Simplex method.
x1 = a + α(b − a) , x2 = b − α(b − a)
Now we choose the value of α in such a way that one of x′i coincides with xj in the previous
step: x′1 = x2 or x′2 = x1 (the only one that makes sense). This will guarantee that as we go
through the iterations, the new yi ’s we get will always be such that the function, at these
points, gets smaller - so we should approach a minimum.
For instance let us assume that F (x1 ) ≥ F (x2 ) so that a′ = a + α(b − a). In this case we
need x′1 = x2 , so that
96
Figure 10.1: First several iterations of the Golden Section Search
(α2 − 3α + 1)(a − b) = 0
97
First consider the following simplified problem. Let’s assume that our function F (⃗x) has
the following form:
X 1X
F (⃗x) = c + bi x i + Aij xi xj (10.1)
i
2 i,j
where At = A can be assumed to be symmetric.
If the function is differentiable we can use that at the minimum of F we should have
∂F
=0
∂xi
which becomes X
0 = bi + Aij xj
j
Dropping higher terms in the Taylor expansion we again have the same as in (10.1) which
we are ready to optimise:
⃗x − ⃗x0 = −A−1 .⃗b (10.3)
where
Aij = ∂i ∂j F (⃗x0 ) , bi = ∂i F (⃗x0 ) (10.4)
this will give us some values of ⃗x, which is not an exact minimum but rather an approximation
which we can use as a starting point for the next iteration, exactly like in the Newton method.
In this way we get the following equations:
⃗xn = ⃗xn−1 − A−1⃗b , Aij = ∂i ∂j F (⃗xn−1 ) , bi = ∂i F (⃗xn−1 ) . (10.5)
The above is of course just a special case of the problem of finding the simultaneous zero
of, say, m + 1 functions:
fi (⃗x) = 0, ⃗x = (x0 , x1 , . . . , xm ), i = 0, 1, . . . , m (10.6)
Indeed by Taylor series expansion
X
fi (⃗x) = fi (⃗x0 ) + ∂j fi (⃗x0 )(xj − x0,j ) + . . . (10.7)
j
and solving for xj the equation that asks the right-hand side to be zero for every i, calling
the solution ⃗x1 , this gives
⃗x1 = ⃗x0 − A−1⃗b, Aij = ∂j fi (⃗x0 ), bi = fi (⃗x0 ). (10.8)
The recursion relation is obtained by replacing ⃗x0 → ⃗xn and ⃗x1 → ⃗xn+1 . This is the multi-
dimensional version of the Newton method.
98
10.3 Down-Hill method
This multidimensional optimisation method is applicable in the case when the function is
not even differentiable or cannot be efficiently approximated by the polynomial as in the
previous section. For simplicity we will consider 2d case only, but the generalization is
straightforward.
Figure 10.2: Scalar function of two variables. This exemplifies a function which cannot be
easily approximated by a quadratic polynomials.
The starting point of the method is a set of 3 points which form a triangle. The point
with the maximal value of F we denote as Hi and the one with the minimal value is denoted
as Lo.
Each next step is based solely on the values of the function at some 3 points. Given the
3 points we have to generate a set of new 3 points trying to make the values smaller at each
step.
There are 4 moves in the algorithm: Reflection, Expansion, Contraction and Shrinkage.
The first two moves can be called “exploration” moves as they let us probe the function
outside the area surrounded by the triangle. The other two have an intention to corner the
minimum of a function by making the triangle smaller and smaller at each of these moves.
Expansion Similar to the reflection by it also stretches the triangle further away by a
factor 2 (see Fig.10.3).
Contraction Opposite to the Expansion. We move the Hi vertex down towards the op-
posite side decreasing the area of the triangle by the factor 2 (see Fig.10.3).
99
Figure 10.3: Elementary moves of the Down Hill method.
4. If xr is the best point so far f (xr ) < f (x1 ) then do expansion xe = ⃗x0 + 2(⃗x0 − ⃗x3 ).
Then use the best point between xe and xr to replace the worst point x3 and go to 1)
5. If reflection is not successful (i.e. we reached that point instead of going to 1)) which
means that we must have f (xr ) ≥ f (x2 ) compute contracted point xc = ⃗x0 −1/2(⃗x0 −
⃗x3 ). If f (xc ) < f (x3 ) then replace x3 with xc and go to 1)
6. If none of the above work shrink the rectangle ⃗xi = ⃗x1 + 1/2(⃗xi − ⃗x1 )
100
xs [2] = xr
print (’reflected ’)
return xs
# 4
else:
if(fr<fs [0]):
xe = xo + 2∗( xo−xs [2]) # e x t e n d e d p o i n t
fe = f(xe[0],xe [1])
if (fe<fr):
xs [2] = xe
print (’extension ’)
return xs
else:
xs [2] = xr
print (’reflection ’)
return xs
# 5 contraction
else:
xc = xo −0.5∗(xo−xs [2])
fc = f(xc[0],xc [1])
if (fc<fs [2]):
xs [2] = xc
print (’contruction ’)
return xs
# 6 shinkage
for i in range (1 ,3):
xs[i] = xs[i ]+0.5∗( xs[i]−xs [0])
print (’shrink ’)
return xs
Usage
def f(x,y): # f u n c t i o n we o p t i m i z e
return (1−x )∗∗2+100∗( y−x ∗∗2)∗∗2
# initial triangle
xs = [ array ([1. ,2.]) , array ([2. ,3.]) , array ([3.,−4.])]
# repeat steps s e v e r a l times
for i in range (80):
xs ,s = onestep (xs)
# p r i n t r e s u l t and t h e v a l u e o f t h e f u n c t i o n a t t h e minimum
print (xs [0],f(xs [0][0] , xs [0][1]))
[ 1.00000001 1.00000002] 5.14504919216e−16
101
Figure 10.4: Iterations of diagonalization of the Down Hill method. We begin with a single
green triangle, and are in search of the minimum (red dot). We then reflect (green triangle)
and reflect again, before contracting. We continue in this way as we get closer and closer to
the target. Eventually the triangles get so close it is hard to resolve where are they anymore!
102
Chapter 11
Solving Linear Systems of Algebraic Equations
[This chapter is optional. If time permits we will cover it. The mathemati-
cal method is something that you have learned before, this is an algorithmic
implementation of it.]
Linear systems of algebraic equations play a central role in numerical methods. Many
problems can be reduced or approximated by a large system of linear equations. Some
essentially linear problems include structures, elastic solids, heat flow, electromagnetic fields
etc. That’s why it is very important to know how to solve these equations efficiently, fast
and without big loss of precision.
In general a system of algebraic equations has the form
A11 x1 + A12 x2 + · · · + A1n xn = b1
A21 x1 + A22 x2 + · · · + A2n xn = b2 (11.1)
..
.
An1 x1 + An2 x2 + · · · + Ann xn = bn
where the unknowns are xi . We will use the standard matrix notations
Ax = b . (11.2)
A common notation for the system is
A11 . . . A1n b1
.. .. .. .
[A| b] = . . . (11.3)
An1 . . . Ann bn
This system of linear equations has a unique solution only if det A ̸= 0. At the same time if
det A = 0, depending on b one either has many solutions or none. As we are going to solve
the system approximately, small determinants can also cause problems. We will discuss some
possible issues below.
The 3 basic steps which we can use to simplify the equations are:
• Exchanging two equations
• Multiplying an equation by a nonzero constant
• Adding/subtracting two equations
In performing the above steps we do not lose any information.
103
11.1 Gaussian Elimination Method
The Gaussian elimination method consists of two parts: elimination phase and the back
substitution phase. We will demonstrate them first by an example. Consider a system
−2x1 + x2 + x3 = 3
x1 + 5x2 + 4x3 = 23 (11.4)
5x1 + 4x2 + x3 = 16
−2x1 + x2 + x3 = 3
1 1
(x1 + 5x2 + 4x3 ) + (−2x1 + x2 + x3 ) = 23 + 3 (11.5)
2 2
5 5
(5x1 + 4x2 + x3 ) + (−2x1 + x2 + x3 ) = 16 + 3
2 2
or
−2x1 + x2 + x3 = 15
11 9 49
x2 + x3 = (11.6)
2 2 2
13 7 47
x2 + x3 =
2 2 2
Next we use the second equation as a pivot equation to eliminate x2 from the last equation
−2x1 + x2 + x3 = 3
11 9 49
x2 + x3 = (11.7)
2 2 2
13 7 13 11 9 99 13 49
x2 + x 3 − x2 + x3 = −
2 2 11 2 2 2 11 2
and at the end of the elimination phase we get the following upper triangular system of
equation
−2x1 + x2 + x3 = 3
11 9 49
x2 + x3 = (11.8)
2 2 2
20 60
− x3 = −
11 11
Back substitution phase. Now we can find the unknowns one by one starting from x3 .
The last equation gives x3 = 3, after that we use the second equation to get x2 = 2 and the
first one to get x1 = 1.
104
11.1.1 Python implementation of Gaussian Elimination Method
def gauss (A,b):
n = len(b)
# Elimination phase
for j in range (0,n−1): # i t e r a t e o v e r t h e p i v o t e q u a t i o n
for i in range(j+1,n):
lam = A[i,j]/A[j,j]
A[i] = A[i] − lam ∗A[j]
b[i] = b[i] − lam ∗b[j]
# Back s u b s t i t u t i o n p h a s e
x = zeros(n)
for j in range(n−1,−1,−1):
x[j] = b[j]/A[j,j]
b = b − x[j]∗A[:,j]
return x
to check that it works we apply it to the above example
from numpy import array
A = array ([[ −2. ,1. ,1.] ,[1. ,5. ,4.] ,[5. ,4. ,1.]])
b = array ([3. ,23. ,16.])
gauss (A,b)
11.2 Pivoting
The Gaussian method discussed above has one obvious problem - it could be that at some
point the element A[j, j] could appear to be zero. In this case we will divide by zero and get
an error. Consider an example
−x2 + x3 = 0 (11.9)
2x1 − x2 = 1 (11.10)
−x1 + 2x2 − x3 = 0 (11.11)
the corresponding augmented coefficient matrix is
0 −1 1 0
[A| b] = 2 −1 0 1 . (11.12)
−1 2 −1 0
We see that we are stuck at the very first step of the elimination. At the same time there
will be no problem if we swap the first and the last equations:
−1 2 −1 0
[A| b] = 2 −1 0 1 . (11.13)
0 −1 1 0
In fact it would be even better to change the order of the equations so that at each step
of the elimination the pivot equation had the biggest coefficient among all rows below.
We will add two functions to the code above
105
def findbiggest (row ):
return row. tolist (). index(max(row ,key=abs ))
this function will return the index of the biggest (by absolute value) element of the row. To
swap ith and j th rows of A and b simultaneously we can use
def swap(A,b,i,j):
Atmp = array(A[i])
A[i] = A[j]
A[j] = Atmp
btmp = b[i]
b[i] = b[j]
b[j] = btmp
return A, b
106