Floating points
IEEE Standard unifies arithmetic model
by Cleve Moler
What is the output?

a = 4/3
b = a - 1
c = b + b + b
e = 1 - c

In 1985, the IEEE Standards Board and the American National Standards Institute adopted the ANSI/IEEE Standard 754 for Binary Floating-Point Arithmetic, the culmination of work by mathematicians, computer scientists, and engineers from universities, computer manufacturers, and microprocessor companies.

All computers designed in the last 15 or so years use IEEE floating-point arithmetic. This doesn't mean that they all get exactly the same results, because there is some flexibility within the standard. But it does mean that we now have a machine-independent model of how floating-point arithmetic behaves.

MATLAB uses the IEEE double precision format. There is also a single precision format, which saves space but isn't much faster.

Most floating-point numbers are normalized. This means they can be expressed as

x = ±(1 + f) * 2^e

where f is the fraction, or mantissa, and e is the exponent. The fraction must satisfy

0 ≤ f < 1

and must be representable in binary using at most 52 bits. In other words, 2^52 * f is an integer in the interval 0 ≤ 2^52 * f < 2^52. The exponent must be an integer in the interval -1022 ≤ e ≤ 1023.

Double precision floating-point numbers are stored in a 64-bit word, with 52 bits for f, 11 bits for e, and 1 bit for the sign of the number. The two extreme values for the exponent field, 0 and 2^11 - 1, are reserved for exceptional floating-point numbers.

To get a feel for such a system, imagine a toy version with only three bits each for f and e. Between 2^e and 2^(e+1) the numbers are equally spaced, with an increment of 2^(e-3); each time e increases, the spacing doubles. The distance from 1 to the next larger floating-point number is called machine epsilon. For the toy system it is 2^(-3); for IEEE double precision it is

eps = 2^(-52)
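The spacing behavior is easy to observe directly. This is a small illustrative sketch, not part of the original column; the particular test values are chosen only to expose the gaps around 1 and 2.

format long
(1 + eps) - 1        % eps: the next number after 1 really is 1 + eps
(1 + eps/4) - 1      % 0: eps/4 is too small to move 1 at all
(2 + 2*eps) - 2      % 2*eps = 4.4409e-16: between 2 and 4 the spacing doubles
(2 + eps/2) - 2      % 0: eps/2 falls below the spacing at 2 and is lost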
The approximate decimal value of eps is 2.2204 * 10^(-16). The maximum relative error committed when an arithmetic result is rounded to the nearest floating-point number is eps/2, and the maximum relative spacing between adjacent numbers is eps. In either case, you can say that the roundoff level is about 16 decimal digits.

Now return to the question posed at the start of this column: after a = 4/3, b = a - 1, c = b + b + b, and e = 1 - c, what is e? With exact computation e would be zero, but with floating-point arithmetic it is not. In fact, e is equal to eps. Before the IEEE standard, this code was used as a quick way to estimate the roundoff level on a given machine, because different computers had quite different values of eps.
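On any IEEE double precision machine the answer can be checked in a few lines; this sketch is an illustration added here, not part of the original column:

format long
a = 4/3; b = a - 1; c = b + b + b;
e = 1 - c            % prints 2.220446049250313e-16
e == eps             % logical 1 (true): e is exactly machine epsilon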
Do not confuse eps with the smallest floating-point number; there are many floating-point numbers much smaller than eps. The smallest positive normalized number and the largest finite floating-point number have names in MATLAB, realmin and realmax. Together with eps, they characterize the standard double precision system:
Name      Binary            Decimal
eps       2^(-52)           2.2204e-16
realmin   2^(-1022)         2.2251e-308
realmax   (2-eps)*2^1023    1.7977e+308
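The two columns of the table describe the same three numbers, which is easy to confirm; here is a short verification sketch, added for illustration:

eps     == 2^(-52)             % logical 1 (true)
realmin == 2^(-1022)           % logical 1 (true)
realmax == (2 - eps)*2^1023    % logical 1 (true)
format short e
[eps realmin realmax]          % 2.2204e-16   2.2251e-308   1.7977e+308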
A frequent instance of roundoff occurs with the innocent-looking statement

t = 0.1

The value stored in t is not exactly one tenth, because expressing the decimal fraction 1/10 in binary requires an infinite series. In fact,

1/10 = 1/2^4 + 1/2^5 + 0/2^6 + 0/2^7 + 1/2^8 + 1/2^9 + 0/2^10 + 0/2^11 + 1/2^12 + ...

After the first term, the sequence of coefficients 1, 0, 0, 1 repeats forever. The stored value of t is this series rounded off after 52 bits, which makes it very slightly greater than 1/10. The MATLAB command format hex causes t to be printed as

3fb999999999999a

The first three hex characters carry the sign and the exponent; the remaining thirteen carry the fraction, and the final a shows the rounding up in the last bit.

The distinction between t and 1/10 is occasionally important. For example, the quantity

0.3/0.1

is not exactly equal to 3, because the actual numerator is a little less than 0.3 and the actual denominator is a little greater than 0.1.

Ten steps of length t are not precisely the same as one step of length 1. MATLAB is careful to arrange that the last element of the vector

0:0.1:1

is exactly equal to 1, but if you accumulate ten copies of t yourself, you will miss 1 by an amount comparable to eps.
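Each of these claims can be tried at the prompt. The following sketch is an added illustration; the loop is just one way to accumulate ten copies of 0.1.

num2hex(0.1)            % '3fb999999999999a', the same bits that format hex shows
0.3/0.1 == 3            % logical 0 (false)
3 - 0.3/0.1             % 4.4409e-16, about 2*eps
s = 0;
for k = 1:10, s = s + 0.1; end
s == 1                  % logical 0 (false): repeated addition misses 1
v = 0:0.1:1;
v(end) == 1             % logical 1 (true): the colon operator hits 1 exactly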
If a computation produces a result larger than realmax, it overflows to the exceptional value Inf. If it produces a nonzero result smaller than realmin, it underflows. Many, but not all, machines then allow exceptional subnormal (or denormal) floating-point numbers between realmin and zero. These subnormals fill in the gap in our toy system between zero and the smallest positive number. They do provide an elegant model for handling underflow, but their practical importance for MATLAB computation is slight.
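The edges of the range are easy to probe; the particular powers of 2 below are chosen for illustration and assume a machine that supports subnormals.

2*realmax            % Inf: overflow
realmin/2            % 1.1125e-308, a subnormal number
realmin/2^52         % 4.9407e-324, the smallest positive subnormal
realmin/2^54         % 0: underflow all the way to zero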
Roundoff does more than blur the last few digits of a result; it can determine the entire outcome of a computation. Consider the pair of equations

10*x1 + x2 = 11
3*x1 + 0.3*x2 = 3.3

One obvious solution is x1 = x2 = 1. But the MATLAB statements

A = [10 1; 3 0.3]
b = [11 3.3]'
x = A\b

produce

x =
   -0.5000
   16.0000

Where did this come from? The equations are singular, but consistent: the second is exactly 0.3 times the first, so there are infinitely many solutions. The floating-point version of the matrix, however, is not exactly singular. Gaussian elimination subtracts a multiple of the first equation from the second; with exact arithmetic, both sides of the modified second equation would become zero, but with roundoff they become tiny nonzero numbers. The modified right-hand side, for example, is

c(2) = 3.3 - 33*(0.1) = -4.4409e-16

and the modified coefficient U(2,2) is a similarly tiny multiple of eps. Back substitution then divides one roundoff error by another and propagates the result:

x(2) = c(2)/U(2,2) = 16
x(1) = (11 - x(2))/10 = -0.5

The details of the roundoff error, not the underlying equations, have selected this particular member of the infinite family of solutions.
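The quantity driving the example can be inspected on its own. This two-line sketch is an added illustration; the solver output itself can differ in detail from one machine or MATLAB version to another, but the nonzero residue below is the heart of the matter.

33*0.1 == 3.3        % logical 0 (false): thirty-three copies of 0.1 overshoot 3.3
3.3 - 33*0.1         % -4.4409e-16, the tiny right-hand side left after elimination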
Our final example was contrived by expanding (x - 1)^7 into powers of x and carefully choosing the range for the x-axis to lie very near x = 1. Evaluating the expanded polynomial there requires sums and differences of terms enormously larger than the final result, so subtractive cancellation leaves nothing but roundoff: the plotted curve looks like noise, not like a polynomial. If the values are computed directly from

y = (x-1).^7;

the resulting plot is smooth (and extremely flat).
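One way to see the contrast is sketched below. The coefficients come from the binomial expansion of (x - 1)^7; the x range is an assumption chosen here, close to 1, purely for illustration.

% x values clustered near the root at x = 1 (range chosen for illustration)
x = 0.988 : 0.0001 : 1.012;
% Expanded form: large terms nearly cancel, leaving roundoff noise
y1 = x.^7 - 7*x.^6 + 21*x.^5 - 35*x.^4 + 35*x.^3 - 21*x.^2 + 7*x - 1;
% Direct form: no cancellation
y2 = (x - 1).^7;
subplot(2,1,1), plot(x, y1), title('expanded polynomial')
subplot(2,1,2), plot(x, y2), title('(x-1).^7')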
Cleve Moler is chairman and co-founder of The MathWorks.