Computational Physics I
Luigi Scorzato
Lecture 2: Floating point arithmetic
Computer memories are finite:
1. How can we represent numbers (integers, reals, ...) on a computer?
2. To what extent can such representation(s) be trusted?
Representation
Usually computers assign 32 or 64 bits to each single number, but there are two main strategies to do it:
Fixed Point (used for Integers):
$$ n = (-1)^{a_0}\,\left(a_1\, 2^{0} + a_2\, 2^{1} + \dots + a_{M-1}\, 2^{M-2}\right) $$
M is the number of bits available for a single number (typical choices are M = 32 or 64) and $a_i = 0,1$. The largest representable integer is $n_{\max} \simeq 2^{M-1}$, i.e. (with 64 bits) about $9.2 \times 10^{18}$. This means that $9 \times 10^{18}$ and $9 \times 10^{18} + 1$ are both represented and distinguishable; but $10^{19}$ cannot be represented (overflow).
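A minimal sketch (MATLAB/Octave, my own illustration) of these integer limits; note that MATLAB/Octave integer arithmetic saturates at intmax rather than wrapping around.

% Fixed point (integer) limits in MATLAB/Octave.
intmax('int64')                  % 9223372036854775807, about 9.2e18
int64(9e18) + 1 > int64(9e18)    % true: neighbouring 64-bit integers remain distinguishable
intmax('int64') + 1              % exceeds the range; MATLAB/Octave saturate at intmax
                                 % (other languages wrap around or signal an error)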
Floating Point (used for Reals, ...)
$$ x = (-1)^{s}\; 1.f \;\times\; 2^{\,e - \mathrm{bias}} $$
IEEE754 standard: s is 1 bit for the sign; f is the 23-bit (single precision) or 52-bit (double precision) mantissa, which together with the implicit leading 1 gives 24 or 53 significant bits; e is the 8-bit (single) or 11-bit (double) exponent. The length of the mantissa defines roughly the relative precision, that of the exponent the range. Both are represented as integers.
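As a quick check, a minimal sketch (MATLAB/Octave; eps, realmax, realmin and typecast are standard functions) of what these field widths mean in practice:

% Precision and range implied by the IEEE754 fields.
eps                          % 2^-52 ~ 2.2e-16: relative spacing of doubles (mantissa length)
eps('single')                % 2^-23 ~ 1.2e-7 : relative spacing of singles
realmax, realmin             % ~1.8e308 and ~2.2e-308: range fixed by the (double) exponent
% Raw bit fields s|e|f of a single precision number:
dec2bin(typecast(single(-1.5), 'uint32'), 32)
% -> 1 01111111 10000000000000000000000
%    sign = 1, exponent field = 127 (the bias, so e - bias = 0), fraction = 100...0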
Limitations of Floating Point Arithmetic
Commutativity and the additive inverse are OK in IEEE arithmetic: a+b = b+a; a*b = b*a; a-a = 0 (less trivial than you may think). But these are the last good news...
Addition is not associative: (a+b)+c != a+(b+c)
The distributive law does not hold: (a+b)*c != a*c + b*c
The multiplicative inverse may not exist: a*(1/a) != 1
Most simple numbers in decimal notation are not mapped exactly.
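A minimal sketch (MATLAB/Octave, double precision) of the failures listed above; the specific values are my own choices.

a = 0.1; b = 0.2; c = 0.3;
(a + b) + c == a + (b + c)       % false: addition is not associative
(a + b) * c == a*c + b*c         % false: the distributive law fails
0.1 + 0.2 == 0.3                 % false: 0.1, 0.2, 0.3 have no exact binary representation
n = 1:100;
find(n .* (1 ./ n) ~= 1)         % integers n for which n*(1/n) is not exactly 1
                                 % (e.g. n = 49 on typical IEEE systems)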
Typical mechanism that produces errors:
shift of the mantissa when summing numbers of very different magnitude, e.g. (see the sketch below):
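A minimal sketch (MATLAB/Octave) of this mechanism; 1e16 and 1 are my own example values.

big = 1e16;  small = 1;
(big + small) - big     % gives 0, not 1: doubles near 1e16 are spaced 2 apart,
                        % so 1e16 + 1 rounds back to 1e16
eps(1e16)               % spacing of doubles around 1e16 (here 2), already larger than 'small'
% To align the exponents before the addition, the mantissa of 'small' is shifted right;
% here the shift pushes all of its bits beyond the 53 significant bits kept in double precision.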
A Model
Instead of ||float(A op B) - (A op B)|| = 0, we can only assume that ||float(A op B) - (A op B)|| <= u ||A op B|| (where op = +, -, *, / acts on individual floating point numbers and u is the machine precision, or unit roundoff). We can use this model to predict which errors to expect. For example, for the scalar product one finds (Golub & Van Loan):
$$ \left|\; \mathrm{fl}\!\left(\sum_{k=1}^{N} x_k\, y_k\right) \;-\; \sum_{k=1}^{N} x_k\, y_k \;\right| \;\le\; N\, u \sum_{k=1}^{N} |x_k\, y_k| \;+\; O(u^2) $$
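The bound can be checked empirically. A minimal sketch (MATLAB/Octave, my own construction) that accumulates the dot product in single precision and uses a double-precision evaluation of the same data as the reference:

N = 10000;
x = single(rand(N,1));  y = single(rand(N,1));
s = single(0);
for k = 1:N
  s = s + x(k)*y(k);                         % fl(...): every operation rounded to single
end
ref   = sum(double(x) .* double(y));         % reference value, computed in double precision
u     = eps('single') / 2;                   % unit roundoff of single precision
err   = abs(double(s) - ref);
bound = N * u * sum(abs(double(x) .* double(y)));
fprintf('error = %.2e   bound N*u*sum|x_k y_k| = %.2e\n', err, bound);

The observed error is usually far below the worst-case bound, since individual rounding errors partially cancel.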
Simple exercises
Exponential function [see my_exp.m], also as a limit [my_exp_seq.m]:
$$ e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!} \;, \qquad e^{x} = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^{n} $$
Accumulating sums, e.g. the harmonic series:
$$ \sum_{k=1}^{N} \frac{1}{k} \;\simeq\; \ln N + \gamma_{\rm Euler} \qquad (N \gg 1) $$
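A minimal sketch (MATLAB/Octave) in the spirit of my_exp.m; this is a hypothetical reconstruction, the actual course file may differ. It sums the Taylor series of exp(x) term by term and compares with the built-in exp().

x = -20;                         % for large |x| with x < 0 the terms cancel strongly
term = 1;  s = 1;                % n = 0 term
for n = 1:200
  term = term * x / n;           % x^n / n!, built up recursively
  s = s + term;
end
fprintf('series = %.6e   exp(x) = %.6e   relative error = %.1e\n', ...
        s, exp(x), abs(s - exp(x)) / exp(x));
% The largest terms are of order 20^20/20! ~ 4e7, so roundoff of order 4e7 * eps ~ 1e-8
% swamps the true result exp(-20) ~ 2e-9: the relative error is of order one.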
Message:
Trying to understand precisely the origin of rounding errors is often frustrating and as hard as solving analytically the problem that we want to solve numerically. What we can do is to check a posteriori:
check the correctness against known exact results which have the same numerical difficulties;
check consistency when changing conditions by negligible amounts (when you know that they should not matter: sometimes high sensitivity is physical);
check consistency when changing the numerical precision of the operations.
Compute the derivative of a function
Notation:
$$ f_n \equiv f(t_0 + n\,h) $$
Naive 1st derivative:
$$ f^{(1),\,\mathrm{order}=1} = \frac{f_1 - f_0}{h} + O(h) $$
One can do better (remember Taylor):
$$
\begin{aligned}
f_{1} &= f_0 + h f^{(1)} + \frac{h^2}{2} f^{(2)} + \frac{h^3}{3!} f^{(3)} + \frac{h^4}{4!} f^{(4)} + O(h^5) \\
f_{2} &= f_0 + 2h f^{(1)} + \frac{(2h)^2}{2} f^{(2)} + \frac{(2h)^3}{3!} f^{(3)} + \frac{(2h)^4}{4!} f^{(4)} + O((2h)^5) \\
f_{1} - f_{-1} &= 2h f^{(1)} + \frac{1}{3} h^3 f^{(3)} + O(h^5) \\
f_{2} - f_{-2} &= 4h f^{(1)} + \frac{8}{3} h^3 f^{(3)} + O(h^5) \\
f^{(1),\,o=2} &= \frac{f_1 - f_{-1}}{2h} + O(h^2) \\
f^{(1),\,o=4} &= \frac{8\,(f_1 - f_{-1}) - (f_2 - f_{-2})}{12\,h} + O(h^4)
\end{aligned}
$$
However, smaller h and higher orders are not necessarily better: see the following example.
Exercise: write a program that computes the derivative of sin(ω x) for:
different orders of approximation;
different values of h;
different values of ω.
Compare with [numdiff.m]
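A minimal sketch (MATLAB/Octave) of such a program, using the three formulas derived above; numdiff.m may be organized differently, and omega, t0 and the range of h are my own choices.

omega = 1;  t0 = 1;
f      = @(t) sin(omega * t);
dexact = omega * cos(omega * t0);            % exact derivative, for comparison
for h = 10.^(-1:-2:-13)
  d1 = (f(t0+h) - f(t0)) / h;                                          % order 1
  d2 = (f(t0+h) - f(t0-h)) / (2*h);                                    % order 2
  d4 = (8*(f(t0+h) - f(t0-h)) - (f(t0+2*h) - f(t0-2*h))) / (12*h);     % order 4
  fprintf('h = %.0e   errors: %.1e  %.1e  %.1e\n', ...
          h, abs(d1-dexact), abs(d2-dexact), abs(d4-dexact));
end
% The truncation error decreases with h, but the rounding error grows roughly like eps/h:
% below some optimal h (which depends on the order) the results get worse again.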