0% found this document useful (0 votes)
293 views

Numerical Methods

This document discusses computer representation of numbers and computer arithmetic. It begins by explaining binary numbers and how computers store data in binary format in memory organized by bytes. It then discusses storing different data types like characters, integers, and floating point numbers in memory. The rest of the document discusses concepts like the IEEE floating point standard, rounding and arithmetic operations for different data types, and common pitfalls in floating point programs.

Uploaded by

soccerboyred
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
293 views

Numerical Methods

This document discusses computer representation of numbers and computer arithmetic. It begins by explaining binary numbers and how computers store data in binary format in memory organized by bytes. It then discusses storing different data types like characters, integers, and floating point numbers in memory. The rest of the document discusses concepts like the IEEE floating point standard, rounding and arithmetic operations for different data types, and common pitfalls in floating point programs.

Uploaded by

soccerboyred
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 48

C

S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
Computer Representation of Numbers and
Computer Arithmetic
August 29, 2013
Contents
1 Binary numbers 3
2 Memory 5
3 Characters in Memory 5
4 The Memory Model 6
5 Storing Signed Numbers 7
6 Integers in Memory 8
7 Floating-Point Numbers 10
8 The IEEE standard 15
8.1 Floating Point Types . . . . . . . . . . . . . . . . . . . . . . . 16
8.2 Detailed IEEE representation . . . . . . . . . . . . . . . . . . 17
8.3 Number range . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8.4 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
9 The Set of FP Numbers 21
1
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 2
10 Rounding - up or down 22
11 Chopping 23
12 Rounding to nearest. 24
13 Arithmetic Operations 26
14 IEEE Arithmetic 28
15 The Guard Digit 29
16 Tipical pitfalls with oating point programs 31
16.1 Binary versus decimal . . . . . . . . . . . . . . . . . . . . . . 31
16.2 Floating point comparisons . . . . . . . . . . . . . . . . . . . . 32
16.3 Funny conversions . . . . . . . . . . . . . . . . . . . . . . . . . 32
16.4 Memory versus register operands . . . . . . . . . . . . . . . . 33
16.5 Cancellation (Loss-of Signicance) Errors . . . . . . . . . . 34
16.6 Insignicant Digits . . . . . . . . . . . . . . . . . . . . . . . . 35
16.7 Order matters . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
17 Integer Multiplication 36
18 Special Arithmetic Operations 36
18.1 Signed zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
18.2 Operations with . . . . . . . . . . . . . . . . . . . . . . . . 37
18.3 Operations with NaN . . . . . . . . . . . . . . . . . . . . . . . 37
18.4 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
19 Arithmetic Exceptions 38
19.1 Division by 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
19.2 Overow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
19.3 Underow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
19.4 Inexact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
19.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 3
20 Flags and Exception Trapping 41
21 Systems Aspects, from D. Goldberg, p. 193 43
21.1 Instruction Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 43
21.2 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
21.3 Programming Language Features . . . . . . . . . . . . . . . . 44
21.4 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
21.5 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . 45
22 Long Summations 46
23 Homework 46
1 Binary numbers
In the decimal system, the number 107.625 means
107.625 = 1 10
2
+ 7 10
0
+ 6 10
1
+ 2 10
2
+ 5 10
3
.
Such a number is the sum of terms of the form a digit times a dierent power
of 10 - we say that 10 is the basis of the decimal system. There are 10 digits
(0,...,9).
All computers today use the binary system. This has obvious hardware
advantages, since the only digits in this system are 0 and 1. In the binary
system the number is represented as the sum of terms of the form a digit
times a dierent power of 2. For example,
(107.625)
10
= 2
6
+ 2
5
+ 2
3
+ 2
1
+ 2
0
+ 2
1
+ 2
3
= (1101011.101)
2
.
Arithmetic operations in the binary system are performed similarly as in the
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 4
decimal system; since there are only 2 digits, 1+1=10.
1 1 1 1 0
+ 1 1 0 1
1 0 1 0 1 1
1 1 1
1 1 0
0 0 0
1 1 1
1 1 1
1 0 1 0 1 0
Decimal to binary conversion. For the integer part, we divide by 2
repeatedly (using integer division); the remainders are the successive digits of
the number in base 2, from least to most signicant.
Quotients 107 53 26 13 6 3 1 0
Remainders 1 1 0 1 0 1 1
For the fractional part, multiply the number by 2; take away the integer
part, and multiply the fractional part of the result by 2, and so on; the se-
quence of integer parts are the digits of the base 2 number, from most to least
signicant.
Fractional 0.625 0.25 0.5 0
Integer 1 0 1
Octal representation. A binary number can be easily represented in
base 8. Partition the number into groups of 3 binary digits (2
3
= 8), from
decimal point to the right and to the left (add zeros if needed). Then, replace
each group by its octal equivalent.
(107.625)
10
= ( 1 101 011 . 101 )
2
= (153.5)
8
Hexadecimal representation. To represent a binary number in base 16
proceed as above, but now partition the number into groups of 4 binary digits
(2
4
= 16). The base 16 digits are 0,...,9,A=10,...,F=15.
(107.625)
10
= ( 0110 1011 . 1010 )
2
= (6B.A)
16
1. Convert the following binary numbers to decimal, octal and hexa: 1001101101.0011,
11011.111001;
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 5
2. Convert the following hexa numbers to both decimal and binary: 1AD.CF,
D4E5.35A;
3. Convert the following decimal numbers to both binary and hexa: 6752.8756,
4687.4231.
2 Memory
The data and the programs are stored in binary format in computers memory.
Memory is organized in bytes, where 1 byte = 8 binary digits. In practice
we use multiples of byte.
1 KB 1,024 Bytes 2
10
bytes
1 MB 1,024 KB 2
20
bytes
1 GB 1,024 MB 2
30
bytes
There are several physical memories in a computer; they form a memory hier-
archy. Note that the physical chips for cache memory use a dierent technology
than the chips for main memory; they are faster, but smaller and more expen-
sive. Also, the disk is a magnetic storage media, a dierent technology than
the electronic main memory; the disk is larger, cheaper but slower.
Memory Type Size Access time
Registers Bytes 1 clock cycle
Cache L1 MB 4 clock cycles
Cache L2 MB 10 clock cycles
Main memory GB 100 clock cycles
Hard drive TB 10, 000 clock cycles
3 Characters in Memory
Characters are letters of the alphabet, both upper and lower case, punctua-
tion marks, and various other symbols. In the ASCII convention (American
Standard Code for Information Interchange) one character uses 7 bits. (there
are at most 2
7
= 128 dierent characters representable with this convention).
As a consequence, each character will be stored in exactly one byte of memory.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 6
Homework 1 Implement the following program
program test_char
character a, b
a=s
write(6,*) Please input b:
read*, b
write(6,*) a,b
end program test_char
Note how characters are declared and initialized. Run the program successfully.
4 The Memory Model
When programming, we think of the main memory as a long sequence of bytes.
Bytes are numbered sequentially; each byte is designated by its number, called
the address.
For example, suppose we have a main memory of 4 Gb; there are 2
32
bytes
in the memory; addresses ranging from 0...2
32
1 can be represented using 32
bits (binary digits), or (equiv.) by 8 hexa digits.
Suppose we want to store the string john. With one character per byte,
we need 4 successive memory locations (bytes) for the string. Each memory
location has an address and a content.
j
o
h
n
Content
1B56AF75
1B56AF74
1B56AF73
1B56AF72
Address
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 7
When we declare a variable, the corresponding number of bytes is reserved in
the memory; the name of the variable is just an alias for the address of the
rst byte in the storage.
5 Storing Signed Numbers
m binary digits (bits) of memory can store 2
m
dierent numbers. They can
be positive integers between 00. . . 00 = (0)
10
and 11. . . 11 = (2
m
1)
10
. For
example, using m = 3 bits, we can represent any integer between 0 and 7.
If we want to represent signed integers (i.e. both positive and negative
numbers) using m bits, we can use one of the following methods:
Sign/Magnitude representation. Reserve the rst bit for the signum (for
example, let 0 denote positive numbers, and 1 negative numbers); the
other m 1 bits will store the magnitude (the absolute value) of the
number. In this case the range of numbers represented is 2
m1
+ 1 to
+2
m1
1. With m = 3 there are 2 bits for the magnitude, dierent
possible magnitudes, between 0 and 127; each of these can have a positive
and negative sign. Note that with this representation we have both
positive and negative zero. If we make the convention that the sign bit
is 1 for negative numbers we have
Number
10
([S]M)
2
-3 [1]11
-2 [1]10
-1 [1]01
-0 [1]00
+0 [0]00
+1 [0]01
+2 [0]10
+3 [0]11
Twos complement representation. All numbers from2
m1
to +2
m1
1
are represented by the smallest positive integer with which they are
congruent modulo 2
m
. With m = 3, for example, we have
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 8
Number
10
(2C)
10
(2C)
2
-4 4 100
-3 5 101
-2 6 110
-1 7 111
0 0 000
1 1 001
2 2 010
3 3 011
Note that the rst bit is 1 for negative numbers, and 0 for nonnegative
numbers.
Biased representation. A number x [2
m1
, 2
m1
1] is represented
by the positive value x = x + 2
m1
[0, 2
m
1]. Adding the bias 2
m1
gives positive results.
Number
10
(biased)
10
(biased)
2
-4 0 000
-3 1 001
-2 2 010
-1 3 011
0 4 100
1 5 101
2 6 110
3 7 111
The rst bit is 0 for negative numbers, and 1 for nonnegative numbers.
6 Integers in Memory
One byte of memory can store 2
8
= 256 dierent numbers. They can be
positive integers between 00000000 = (0)
10
and 11111111 = (255)
10
.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted. 9
For most applications, one byte integers are too small. Standard data types
usually reserve 2, 4 or 8 successive bytes for each integer. In general, using p
bytes (p = 1, 2, 4, 8) we can represent integers in the range
Unsigned integers: 0 2
8p
1
Signed integers: 2
8p1
2
8p1
1
-
-
--
.
.
. .

.
.
. .
`
`
`
`

S
S
S
1 byte
2 bytes
4 bytes
7
15
1
1
1
31
S 1
31
15
S 1
S 1 7
Homework 2 Compute the lower and upper bounds for signed and unsigned
integers representable with p = 2 and with p = 4 bytes.
Homework 3 Write a Fortran program in which you dene two integer vari-
ables m and i. Initialize m to 2147483645. Then read i and print out the sum
m + i.
program test_int
integer m,i
m = 2147483645
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.10
write(6,*) Please input i:
read*, i
write(6,*) m+i=,m+i
end program test_int
Run the program several times, with i = 1,2,3,4,5.
1. Do you obtain correct results ? What you see here is an example of in-
teger overow. The result of the summation is larger than the maximum
representable integer.
2. What exactly happens at integer overow ? In which sense are the results
inaccurate ?
3. How many bytes does Fortran use to represent integers ?
4. Modify the program to print -m-i and repeat the procedure. What is the
minimum (negative) integer representable ?
Note. Except for the overow situation, the result of an integer addition or
multiplication is always exact (i.e. the numerical result is exactly the mathe-
matical result).
7 Floating-Point Numbers
For most applications in science and engineering integer numbers are not suf-
cient; we need to work with real numbers. Real numbers like have an
innite number of decimal digits; there is no hope to store them exactly. On a
computer, oating point convention is used to represent (approximations of)
the real numbers. The design of computer systems requires in-depth knowl-
edge about FP. Modern processors have special FP instructions, compilers
must generate such FP instructions, and the operating system must handle
the exception conditions generated by these FP instructions.
We will now illustrate the oating point representation in base 10. Any
decimal number x can be uniquely written as
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.11
x = m 10
e
+1 or -1 sign
m 1 m < 10 mantissa
e integer exponent
For example
107.625 = +1 1.07625 10
2
.
If we did not impose the condition 1 m < 10 we could have represented
the number in various dierent ways, for example
(+1) 0.107625 10
3
or (+1) 0.00107625 10
5
.
When the condition 1 m < 10 is satised, we say that the mantissa is
normalized. Normalization guarantees that
1. the FP representation is unique,
2. since m < 10 there is exactly one digit before the decimal point, and
3. since m 1 the rst digit in the mantissa is nonzero. Thus, none of the
available digits is wasted by storing leading zeros.
Suppose our storage space is limited to 6 decimal digits per FP number.
We allocate 1 decimal digit for the sign, 3 decimal digits for the mantissa and
2 decimal digits for the exponent. If the mantissa is longer we will chop it
to the most signicant 3 digits (another possibility is rounding, which we will
talk about shortly).
MMM EE
Our example number can be then represented as
+1
. .
107
. .
+2
. .
m e
A oating point number is represented as (sign, mantissa, exponent) with
a limited number of digits for the mantissa and the exponent. The parameters
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.12
of the FP system are = 10 (the basis), d
m
= 3 (the number of digits in the
mantissa) and d
e
= 2 (the number of digits for the exponent).
Most real numbers cannot be exactly represented as oating point numbers.
For example, numbers with an innite representation, like = 3.141592 . . .,
will need to be approximated by a nite-length FP number. In our FP
system, will be represented as
+ 314 00
Note that the nite representation in binary is dierent than nite represen-
tation in decimal; for example, (0.1)
10
has an innite binary representation.
In general, the FP representation f(x) is just an approximation of the
real number x. The relative error is the dierence between the two numbers,
divided by the real number
=
x f(x)
x
.
For example, if x = 107.625, and f(x) = 1.07 10
2
is its representation in
our FP system, then the relative error is
=
107.625 1.07 10
2
107.625
5.8 10
3
Another measure for the approximation error is the number of units in the
last place, or ulps. The error in ulps is computed as
err = [x f(x)[
dm1e
.
where e is the exponent of f(x) and d
m
is the number of digits in the mantissa.
For our example
err = [107.625 1.07 10
2
[ 10
312
= 0.625ulps .
The dierence between relative errors corresponding to 0.5 ulps is called
the wobble factor. If x f(x) = 0.5 ulps and f(x) = m.mmm m
e
,
then x f(x) =
_
/2
dm
_

e
, and since
e
x <
e+1
we have that
1
2

dm

x f(x)
x
= 0.5 ulps

2

dm
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.13
If the error is n ulps, the last log

n digits in the number are contaminated


by error. Similarly, if the relative error is , the last log

_
2
1dm
_
digits
are in error.
With normalized mantissas, the three digits m
1
m
2
m
3
always read m
1
.m
2
m
3
,
i.e. the decimal point has xed position inside the mantissa. For the original
number, the decimal point can be oated to any position in the bit-string we
like by changing the exponent.
We see now the origin of the term oating point: the decimal point can be
oated to any position in the bit-string we like by changing the exponent.
With 3 decimal digits, our mantissas range between 1.00, . . . , 9.99. For
exponents, two digits will provide the range 00, . . . , 99.
Consider the number 0.000123. When we represent it in our oating point
system, we lose all the signicant information:
+1
. .
000
. .
00
..
m e
In order to overcome this problem, we need to allow for negative exponents
also. We will use a biased representation: if the bits e
1
e
2
are stored in the
exponent eld, the actual exponent is e
1
e
2
49 (49 is called the exponent
bias). This implies that, instead of going from 00 to 99, our exponents will
actually range from 49 to +50. The number
0.000123 = +1 1.23 10
4
is then represented, with the biased exponent convention, as
+1
. .
123
. .
45
..
m e
What is the maximum number allowed by our toy oating point system?
If m = 9.99 and e = +99, we obtain
x = 9.99 10
50
.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.14
If m = 000 and e = 00 we obtain a representation of ZERO. Depending
on , it can be +0 or 0. Both numbers are valid, and we will consider them
equal.
What is the minimum positive number that can be represented in our toy
oating point system? The smallest mantissa value that satises the normal-
ization requirement is m = 1.00; together with e = 00 this gives the number
10
49
. If we drop the normalization requirement, we can represent smaller
numbers also. For example, m = 0.10 and e = 00 give 10
50
, while m = 0.01
and e = 00 give 10
51
.
The FP numbers with exponent equal to ZERO and the rst digit in the
mantissa also equal to ZERO are called subnormal numbers.
Allowing subnormal numbers improves the resolution of the FP system
near 0. Non-normalized mantissas will be permitted only when e = 00, to
represent ZERO or subnormal numbers, or when e = 99 to represent special
numbers.
Example (D. Goldberg, p. 185, adapted): Suppose we work with our toy
FP system and do not allow for subnormal numbers. Consider the fragment
of code
IF (x ,= y) THEN z = 1.0/(x y)
designed to guard against division by 0. Let x = 1.02 10
49
and y =
1.01 10
49
. Clearly x ,= y but, (since we do not use subnormal numbers)
xy = 0. In spite of all the trouble we are dividing by 0! If we allow subnormal
numbers, x y = 0.01 10
49
and the code behaves correctly.
Note that for the exponent bias we have chosen 49 and not 50. The reason
for this is self-consistency: the inverse of the smallest normal number does not
overow
x
min
= 1.00 10
49
,
1
x
min
= 10
+49
< 9.99 10
50
= x
max
.
(with a bias of 50 we would have had 1/x
min
= 10
50
> 9.99 10
+49
= x
max
).
Similar to the decimal case, any binary number x can be represented
x = m 2
e
+1 or -1 sign
m 1 m < 2 mantissa
e integer exponent
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.15
For example,
1101011.101 = +1 1.101011101 2
6
. (1)
With 6 binary digits available for the mantissa and 4 binary digits available
for the exponent, the oating point representation is
+1
. .
110101
. .
0110
. .
m e
(2)
When we use normalized mantissas, the rst digit is always nonzero. With
binary oating point representation, a nonzero digit is (of course) 1, hence the
rst digit in the normalized binary mantissa is always 1.
1 x < 2 (x)
2
= 1.m
1
m
2
m
3
. . .
As a consequence, it is not necessary to store it; we can store the mantissa
starting with the second digit, and store an extra, least signicant bit, in the
space we saved. This is called the hidden bit technique.
For our binary example (2) the leftmost bit (equal to 1, of course, showed
in bold) is redundant. If we do not store it any longer, we obtain the hidden
bit representation:
+1
. .
101011
. .
0110
. .
m e
(3)
We can now pack more information in the same space: the rightmost bit
of the mantissa holds now the 7
th
bit of the number (1) (equal to 1, showed
in bold). This 7
th
bit was simply omitted in the standard form (2). Question:
Why do we prefer
8 The IEEE standard
The IEEE standard regulates the representation of binary oating point num-
bers in a computer, how to perform consistently arithmetic operations and how
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.16
to handle exceptions, etc. Developed in 1980s, is now followed by virtually
all microprocessor manufacturers.
Supporting IEEE standard greatly enhances programs portability. When
a piece of code is moved from one IEEE-standard-supporting machine to an-
other IEEE-standard-supporting machine, the results of the basic arithmetic
operations (+,-,*,/) will be identical.
8.1 Floating Point Types
The standard denes the following FP types:
Single Precision. (4 consecutive bytes/ number).
[e
1
e
2
e
3
e
8
[m
1
m
2
m
3
m
23
Useful for most short calculations.
Double Precision. (8 consecutive bytes/number)
[e
1
e
2
e
3
e
11
[m
1
m
2
m
3
m
52
Most often used with scientic and engineering numerical computations.
Extended Precision. (10 consecutive bytes/number).
[e
1
e
2
e
3
e
15
[m
1
m
2
m
3
m
64
Useful for temporary storage of intermediate results in long calculations. (e.g.
compute a long inner product in extended precision then convert the result
back to double) There is a single-extended format also. The standard suggests
that implementations should support the extended format corresponding to
the widest basic format supported (since all processors today allow for dou-
ble precision, the double-extended format is the only one we discuss here).
Extended precision enables libraries to eciently compute quantities within
0.5 ulp. For example, the result of x*y is correct within 0.5 ulp, and so is
the result of log(x). Clearly, computing the logarithm is a more involved
operation than multiplication; the log library function performs all the inter-
mediate computations in extended precision, then rounds the result to single
or double precision, thus avoiding the corruption of more digits and achieving
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.17
a 0.5 ulp accuracy. From the user point of view this is transparent, the log
function returns a result correct within 0.5 ulp, the same accuracy as simple
multiplication has.
8.2 Detailed IEEE representation
(for single precision standard; double is similar)
[e
1
e
2
e
3
e
8
[m
1
m
2
m
3
m
23
Signum. bit = 0 (positive) or 1 (negative).
Exponent. Biased representation, with an exponent bias of (127)
10
.
Mantissa. Hidden bit technique.
e
1
e
2
e
3
e
8
Numerical Value
(00000000)
2
= (0)
10
(0.m
1
. . . m
23
)
2
2
126
(ZERO or subnormal)
(00000001)
2
= (1)
10
(1.m
1
. . . m
23
)
2
2
126

(01111111)
2
= (127)
10
(1.m
1
. . . m
23
)
2
2
0
(10000000)
2
= (128)
10
(1.m
1
. . . m
23
)
2
2
1

(11111110)
2
= (254)
10
(1.m
1
. . . m
23
)
2
2
+127
(11111111)
2
= (255)
10
if m
1
. . . m
23
= 0
NaN otherwise
Note that e
min
< e
max
, which implies that 1/x
min
does not overow.
8.3 Number range
The range of numbers represented in dierent IEEE formats is sum-
marized in Table 8.3.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.18
IEEE Format E
min
E
max
Single Prec. -126 +127
Double Prec. -1,022 +1,023
Extended Prec. -16,383 +16,383
Table 1: IEEE oating point number exponent ranges
8.4 Precision
To dene the precision of the FP system, let us go back to our toy
FP representation (2 decimal digits for the exponent and 3 for the
mantissa).
We want to add two numbers, e.g.
1 = 1.00 10
0
and 0.01 = 1.00 10
2
.
In order to perform the addition, we bring the smaller number to the
same exponent as the larger number by shifting right the mantissa.
For our example,
1.00 10
2
= 0.01 10
0
.
Next, we add the mantissas and normalize the result if necessary.
In our case
1.00 10
0
+ 0.01 10
0
= 1.01 10
0
.
Suppose now we want to add
1 = 1.00 10
0
and 0.001 = 1.00 10
3
.
For bringing them to the same exponent, we need to shift right the
mantissa 3 positions, and, due to our limited space (3 digits) we lose
all the signicant information. Thus
1.00 10
0
+ 0.00[1] 10
0
= 1.00 10
0
.
We can see now that this is a limitation of the FP system due to the
storage of only a nite number of digits.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.19
IEEE Format Machine precision () No. Decimal Digits
Single Prec. 2
23
1.2 10
7
7
Double Prec. 2
52
1.1 10
16
16
Extended Prec. 2
63
1.1 10
19
19
Table 2: Precision of dierent IEEE representations
The precision of the oating point system (the machine preci-
sion) is the smallest number for which 1 + > 1.
For our toy FP system, it is clear from the previous discussion
that = 0.01.
If the relative error in a computation is p, then the number of
corrupted decimal digits is log
10
p.
In (binary) IEEE arithmetic, the rst single precision number
larger than 1 is 1 + 2
23
, while the rst double precision number is
1 + 2
52
. For extended precision there is no hidden bit, so the rst
such number is 1+2
63
. You should be able to justify this yourselves.
If the relative error in a computation is p, then the number of
corrupted binary digits is log
2
p.
Remark: We can now answer the following question. Signed in-
tegers are represented in twos complement. Signed mantissas are
represented using the sign-magnitude convention. For signed expo-
nents the standard uses a biased representation. Why not represent
the exponents in twos complement, as we do for the signed inte-
gers? When we compare two oating point numbers (both positive,
for now) the exponents are looked at rst; only if they are equal we
proceed with the mantissas. The biased exponent is a much more
convenient representation for the purpose of comparison. We com-
pare two signed integers in greater than/less than/ equal to expres-
sions; such expressions appear infrequently enough in a program, so
we can live with the twos complement formulation, which has other
benets. On the other hand, any time we perform a oating point
addition/subtraction we need to compare the exponents and align
the operands. Exponent comparisons are therefore quite frequent,
and being able to do them eciently is very important. This is the
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.20
argument for preferring the biased exponent representation.
Homework 4 Consider the real number (0.1)
10
. Write its single pre-
cision, oating point representation. Does the hidden bit technique
result in a more accurate representation?
Homework 5 What is the gap between 1024 and the rst IEEE
single precision number larger than 1024?
Homework 6 Let x = m2
e
be a normalized single precision num-
ber, with 1 m < 2. Show that the gap between x and the next
largest single precision number is
2
e
.
Homework 7 The following program adds 1 +2
p
, then subtracts 1.
If 2
p
< the nal result will be zero. By providing dierent values
for the exponent, you can nd the machine precision for single
and double precision. Note the declaration for the simple pre-
cision variables (real) and the declaration for double precision
variables (double precision). The command 2.0 p calculates 2
p
( is the power operator). Also note the form of the constants in
single precision (2.e0) vs. double precision (2.d0).
program test_precision
real a
double precision b
integer p
PRINT*, Please provide exponent
read*, p
a = 1.e0 + 2.e0**(-p)
PRINT*, a-1.e0
b = 1.d0 + 2.d0**(-p)
PRINT*, b-1.d0
END
Run the program for values dierent of p ranging from 20 to 60.
Find experimentally the values of for single and for double pre-
cision.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.21
9 The Set of FP Numbers
The set of all FP numbers consists of
FP = 0, all normal, all subnormal, .
Because of the limited number of digits, the FP numbers are a -
nite set. For example, in our toy FP system, we have approximately
2 10
5
FP numbers altogether.
The FP numbers are not uniformly spread between min and max
values; they have a high density near zero, but get sparser as we
move away from zero.
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
....................
. . . . . . . . . . . . . . .
...............
. . . . . . . . . . .
`
`
`Z
Z
Z

.
.
.
.
.
-
-
-
-
-
-
-
-
--.
.
.
.
.
.
.
.
.
.
0 1 100 1000 10
lower density higher density
100fpts 100fpts
100fpts
1 apart
.1 apart .01 apart
For example, in our FP system, there are 90 points between 1 and
10 (hence, the gap between 2 successive numbers is 0.01). Between
10 and 100 there are again 90 FP numbers, now with a gap of 0.1.
The interval 100 to 1000 is covered by another 90 FP values, the
dierence between 2 successive ones being 1.0.
In general, if m 10
e
is a normalized FP number, with mantissa
1.00 m < 9.98, the very next FP number representable is (m +
) 10
e
(please give a moments thought about why this is so). In
consequence, the gap between m 10
e
and the next FP number is
10
e
. The larger the oating point numbers, the larger the gap
between them will be (the machine precision is a xed number,
while the exponent e increases with the number).
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.22
In binary format, similar expressions hold. Namely, the gap be-
tween m2
e
and its successor is 2
e
.
10 Rounding - up or down
It is often the case that we have a real number X that is not exactly a
oating point number: X falls between two consecutive FP numbers
X

and X
+
.
l h
/
/
\
\
X
X
X
-
+
In order to represent X in the computer, we need to approximate
it by a FP number. If we choose X

we say that we rounded X


down; if we choose X
+
we say that we rounded X up. We can
choose a dierent FP number also, but this makes little sense, as
the approximation error will be larger than with X

. For example,
= 3.141592 . . . is in between

= 3.14 and
+
= 3.15.

and
+
are
successive oating point numbers in our toy system.
We will denote f(X) the FP number that approximates X. Then
f(X) =
_
X

, if rounding down,
X
+
, if rounding up.
Obviously, when rounding up or down we have to make a certain
representation error; we call it the roundo (rounding) error.
The relative roundo error, , is dened as
=
f(X) X
X
.
This does not work for X = 0, so we will prefer the equivalent for-
mulation
f(X) = X (1 + ) .
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.23
What is the largest error that we can make when rounding (up
or down)? The two FP candidates can be represented as X

= m2
e
and X
+
= (m + ) 2
e
(this is correct since they are successive FP
numbers). For now suppose both numbers are positive (if negative,
a similar reasoning applies). Since
[f(X) X[ [X
+
X

[, and X X

,
we have
[[
[X
+
X

[
X

=
2
e
m2
e
.
Homework 8 Find an example of X such that, in our toy FP sys-
tem, rounding down produces a roundo error = . This shows
that, in the worst case, the upper bound can actually be attained.
Now, we need to choose which one of X
+
, X

better approxi-
mates X. There are two possible approaches.
11 Chopping
Suppose X = 107.625. We can represent it as +1 107 +2 by simply
discarding (chopping) the digits which do not t the mantissa
format (here the remaining digits are 625). We see that the FP
representation is precisely X

, and we have 0 X

< X. Now, if
X was negative, X = 107.625, the chopped representation would
be -1 107 +2 , but now this is X
+
. Note that in this situation
X < X
+
0. In consequence, with chopping, we choose X

if X > 0
and X
+
is X < 0. In both situations the oating point number
is closer to 0 than the real number X, so chopping is also called
rounding toward 0.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.24
/
/
`
`
`
`
/
/
-107.625
-107
0
107.625
107
Chopping has the advantage of being very simple to implement
in hardware. The roundo error for chopping satises
<
chopping
0 .
For example:
X = 1.00999999 . . . f(X)
chop
= 1.00
and
=
f(X) X
X
=
0.0099...
1.00999...
= 0.0099 0.01 = .
12 Rounding to nearest.
This approximation mode is used by most processors, and is called,
in short rounding. The idea is to choose the FP number (X

or
X
+
) which oers the best approximation of X:
f(X) =
_
X

, if X

X <
X
+
+X

2
,
X
+
, if
X
+
+X

2
< X X
+
.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.25

`
`
`
Z
Z
Z

X
-
+
(X +X
+
)/2 X
-
The roundo for the round to nearest approximation mode
satises

2

rounding


2
.
The worst possible error is here half (in absolute magnitude) the
worst-case error of chopping. In addition, the errors in the round to
nearest approximation have both positive and negative signs. Thus,
when performing long computations, it is likely that positive and
negative errors will cancel each other out, giving a better numerical
performance with rounding than with chopping.
There is a ne point to be made regarding round to nearest
approximation. What happens if there is a tie, i.e. if X is precisely
(X
+
+X

)/2? For example, with 6 digits mantissa, the binary number


X = 1.0000001 can be rounded to X

= 1.000000 or to X
+
= 1.000001.
In this case, the IEEE standard requires to choose the approximation
with an even last bit; that is, here choose X

. This ensures that,


when we have ties, half the roundings will be done up and half down.
The idea of rounding to even can be applied to decimal numbers
also (and, in general, to any basis). To see why rounding to even
works better, consider the following example. Let x = 5 10
2
and
compute ((((1x) x) x) x) with correct rounding. All operations
produce exact intermediate results with the fourth digit equal to 5;
when rounding this exact result, we can go to the nearest even num-
ber, or we can round up, as is customary in mathematics. Rounding
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.26
to nearest even produces the correct result (1.00), while rounding up
produces 1.02.
An alternative to rounding is interval arithmetic. The output
of an operation is an interval that contains the correct result. For
example x y [z, z], where the limits of the interval are obtain by
rounding down and up respectively. The nal result with interval
arithmetic is an interval that contains the true solution; if the inter-
val is too large to be meaningful we should repeat the calculations
with a higher precision.
Homework 9 In IEEE single precision, what are the rounded values
for 4 +2
20
, 8 +2
20
,16 +2
20
,32 +2
20
,64 +2
20
. (Here and from now
rounded means rounded to nearest.)
In conclusion, real numbers are approximated and represented
in the oating point format. The IEEE standard recognizes four
approximation modes:
1. Round Up;
2. Round Down;
3. Round Toward Zero;
4. Round to Nearest (Even).
Virtually all processors implement the (round to nearest) approx-
imation. From now on, we will call it, by default, rounding. Com-
puter numbers are therefore accurate only within a factor of (1/2).
In single precision, this gives 1 10
7
, or about 7 accurate decimal
places. In double precision, this gives 110
16
, or about 16 accurate
decimal digits.
13 Arithmetic Operations
To perform arithmetic operations, the values of the operands are
loaded into registers; the Arithmetic and Logic Unit (ALU) performs
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.27
the operation, and puts the result in a third register; the value is
then stored back in memory.
\
\
\/
/
/ /
/
/
/
/
/
/
/
\
\
\
\
\
\
\
\
/
/
`
`
/
/
`
`
/
/
`
`
`
`
/
/
.
.
.
-
-
-
`
`
/
/
ALU
REG 3
OP 1
OP 2
Result
REG 1 REG 2
The two operands are obviously oating point numbers. The
result of the operation stored in memory must also be a oating
point number.
Is there any problem here? Yes! Even if the operands are FP
numbers, the result of an arithmetic operation may not be a FP
number.
To understand this, let us add two oating point numbers, a= 9.72 01
(97.2) and b= 6.43 00 (6.43), using our toy FP system. To perform
the summation we need to align the numbers by shifting the smaller
one (6.43) to the right.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.28
9. 7 2 01
0. 6 4 3 01
10. 3 6 3 01
The result (103.63) is not a oating number. We can round the
result to obtain 1.04 02 (104).
From this example we draw a rst useful conclusion: the result
of any arithmetic operation is, in general, corrupted by roundo
errors. Thus, the arithmetic result is dierent from the mathe-
matical result.
If a, b are oating point numbers, and a+b is the result of math-
ematical addition, we will denote by a b the computed addition.
The fact that a b ,= a + b has surprising consequences. Let
c= 9.99 -1 (0.999). Then
(a b) c = 1.04 02 (104),
while
a (b c) = 1.05 02 (105)
(you can readily verify this). Unlike mathematical addition, com-
puted addition is not associative!
Homework 10 Show that computed addition is commutative, i.e.
a b=b a.
14 IEEE Arithmetic
The IEEE standard species that the result of an arithmetic op-
eration (+,-,*,/) must be computed exactly and then rounded to
nearest. In other words,
a b = f(a + b)
a b = f(a b)
a b = f(a b)
a b = f(a/b) .
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.29
The same requirement holds for square root, remainder, and con-
versions between integer and oating point formats: compute the
result exactly, then round.
This IEEE convention completely species the result of arith-
metic operations; operations performed in this manner are called
exactly, or correctly rounded. It is easy to move a program from
one machine that supports IEEE arithmetic to another. Since the
results of arithmetic operations are completely specied, all the in-
termediate results should coincide to the last bit (if this does not
happen, we should look for software errors!).
(Note that it would be nice to have the results of transcendental
functions like exp(x) computed exactly, then rounded to the desired
precision; this is however impractical, and the standard does NOT
require correctly rounded results in this situation.)
Performing only correctly rounded operations seems like a nat-
ural requirement, but it is often dicult to implement it in hard-
ware. The reason is that if we are to nd rst the exact result we
may need additional resources. Sometimes it is not at all possible
to have the exact result in hand - for example, if the exact result
is a periodic number (in our toy system, 2.0/3.0 = 0.666...).
15 The Guard Digit
Is useful when subtracting almost equal numbers. Suppose a =
(1.0)
2
2
0
and b = (1.11 . . . 1)
2
2
1
, with 23 1s after the binary
point. Both a and b are single precision oating point numbers.
The mathematical result is ab = (1.0)
2
2
24
. It is a oating point
number also, hence the numerical result should be identical to the
mathematical result, a b = f(a b) = a b.
When we subtract the numbers, we align them by shifting b one
position to the right. If computer registers are 24-bit long, then
we may have one of the following situations.
1. Shift b and chop it to single precision format (to t the
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.30
register), then subtract.
1.000 . . . 000
0.111 . . . 111
0.000 . . . 001
The result is 2
23
, twice the mathematical value.
2. Shift b and round it to single precision format (to t the
register), then subtract.
1.000 . . . 000
1.000 . . . 000
0.000 . . . 000
The result is 0, and all the meaningful information is lost.
3. Append the registers with an extra guard bit. When we
shift b, the guard bit will hold the 23
rd
1. The subtraction is then
performed in 25 bits.
1.000 . . . 000 [0]
0.111 . . . 111 [1]
0.000 . . . 000 [1]
The result is normalized, and is rounded back to 24 bits. This
result is 2
24
, precisely the mathematical value. Funny fact: Cray
supercomputers lack the guard bit. In practice, many processors
do subtractions and additions in extended precision, even if the
operands are single or double precision. This provides eectively
16 guard bits for these operations. This does not come for free:
additional hardware makes the processor more expensive; besides,
the longer the word the slower the arithmetic operation is.
The following theorem (see David Goldberg, p. 160) shows the
importance of the additional guard digit. Let x, y be FP numbers
in a FP system with , d
m
, d
e
;
if we compute x y using d
m
digits, then the relative rounding
error in the result can be as large as 1 (i.e. all the digits
are corrupted!).
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.31
if we compute xy using d
m
+1 digits, then the relative rounding
error in the result is less than 2.
Note that, although using an additional guard digit greatly im-
proves accuracy, it does not guarantee that the result will be exactly
rounded (i.e. will obey the IEEE requirement). As an example
consider x = 2.34 10
2
, y = 4.56 in our toy FP system. In exact
arithmetic, x y = 229.44, which rounds to f(x y) = 2.29 10
2
.
With the guard bit arithmetic, we rst shift y and chop it to 4
digits, y = 0.045 10
2
. Now x y = 2.295 10
2
(calculation done
with 4 mantissa digits). When we round this number to the near-
est (even) we obtain 2.30 10
2
, a value dierent from the exactly
rounded result.
However, by introducing a second guard digit and a third, sticky
bit, the result is the same as if the dierence was computed exactly
and then rounded (D.Goldberg, p. 177).
16 Tipical pitfalls with oating point programs
All numerical examples in this section were produced on an Alpha
21264 workstation. On other systems the results may vary, but in
general the highlighted problems remain the same.
16.1 Binary versus decimal
Consider the code fragment
program test
real x
data x /1.0E-4/
print*, x
end program test
We expect the answer to be 1.0E4, but in fact the program prints
9.9999997E 05. Note that we did nothing but store and print! The
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.32
anomaly comes from the fact that 0.0001 is converted (inexactly)
to binary, then the stored binary value is converted back to decimal
for printing.
16.2 Floating point comparisons
Because of the inexactities, it is best to avoid strict equality when
comparing oating point numbers. For the above example, the code
if ( 1.0E+8 * x**2 .EQ. 1.0 ) then
print*, Correct
end if
should print Correct, but does not, since the left expression is
corrupted by roundo. The right way to do oating point com-
parisons is to dene the epsilon machine, eps, and check that the
magnitude of the dierence is less than half epsilon times the sum
of the operands:
epsilon = 1.0E-7
w = 1.0E+8 * x**2
if ( abs(w-1.0) .LE. 0.5*epsilon*( abs(w)+abs(1.0) ) ) then
print*, Correct
end if
This time we allow small roundos to appear, and the program
takes the right branch.
16.3 Funny conversions
Sometimes the inexactness in oating point is uncovered by real to
integer conversion, which by Fortran default is done using trunca-
tion. For example the code
program test
real x
integer i
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.33
data x /0.0001/
i = 10000*x
print *, i
end program test
produces a stunning result: the value of i is 0, not 1!
Another problem appears when a single precision number is con-
verted to double precision. This does not increase the accuracy of
the number. For example the code
program test
real x
double precision y
data x /1.234567/
y = x
print *, X =,x, Y =,y
end program test
produces the output
X= 1.234567 Y= 1.23456704616547
The explanation is that, when converting single to double preci-
sion, register entries are padded with zeros in the binary represen-
tation. The double precision number is printed with 15 positions
and the inexactity shows up. (if we use a formatted print for x
with 15 decimal places we obtain the same result). In conclusion,
we should only print the number of digits that are signicant to
the problem.
16.4 Memory versus register operands
The code
data a /3.0/, b /10.0/
data x /3.0/, y /10.0/
z = (y/x)-(b/a)
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.34
call ratio(x,y,a1)
call ratio(a,b,a2)
call sub(a2,a1,c)
print*, z-c
may produce a nonzero result. This is so because z is computed
with register operands (and FP registers for Pentium are in ex-
tended precision, 80 bits) while for c the operands a and b are
stored in the memory. (note that the Alpha compiler produces
zero).
16.5 Cancellation (Loss-of Signicance) Errors
When subtracting numbers that are nearly equal, the most signif-
icant digits in the operands match and cancel each other. This is
no problem if the operands are exact, but in real life the operands
are corrupted by errors. In this case the cancellations may prove
catastrophic.
For example, we want to solve the quadratic equation
a x
2
+ b x + c = 0 ,
where all the coecients are FP numbers
a = 1.00 10
3
, b = 1.00 10
0
, c = 9.99 10
1
,
using our toy decimal FP system and the quadratic formula
r
1,2
=
b

b
2
4ac
2a
.
The true solutions are r
1
= 999, r
2
= 1. In our FP system
b
2
= 1.00, 4ac = 3.99 10
3
, and b
2
4ac = 1.00. It is here where
the cancellation occurs! Then r
1
= (1 1)/(2 10
3
) = 1000 and
r
2
= (1+1)/(210
3
) = 0. If the error in r
1
is acceptable, the error
in r
2
is 100%!
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.35
To overcome this, we might use the pair of mathematically
equivalent formulas
r
1,2
=
2c
b

b
2
4ac
.
With this formula, r
2
= (2c)/(2) = 9.99 10
1
, a much better
approximation.
16.6 Insignicant Digits
Consider the Fortran code
PROGRAM test
REAL :: x=100000.0, y=100000.1, z
z = y-x
PRINT*, Z=,z
END PROGRAM test
We would expect the output
Z = 0.1000000
but in fact the program prints (on Alpha ...)
Z = 0.1015625
Since single precision handles about 7 decimal digits, and the sub-
traction z = y x cancels the most signicant 6, the result contains
only one signicant digit. The appended garbage 15625 are in-
signicant digits, coming from the inexact binary representation
of x and y. Beware of convincing-looking results!
16.7 Order matters
Mathematically equivalent expressions may give dierent values in
oating point, depending on the order of evaluation. For example
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.36
PROGRAM test
REAL :: x=12345.6, y=45678.9, z=98765432.1
REAL :: w1, w2
w1 = x*y/z
w2 = 1/z; w2 = x*w2; w2 = y*w2
PRINT*, w1-w2
END PROGRAM test
Mathematically, the dierence between w1 and w2 should be zero,
but on Alpha ... it is about 4.e 7.
17 Integer Multiplication
As another example, consider the multiplication of two single-
precision, FP numbers.
(m
1
2
e
1
) (m
2
2
e
2
) = (m
1
m
2
) 2
e
1
+e
2
.
In general, the multiplication of two 24-bit binary numbers (m
1

m
2
) gives a 48-bit result. This can be easily achieved if we do
the multiplication in double precision (where the mantissa has 53
available bits), then round the result back to single precision.
However, if the two numbers to be multiplied are double-precision,
the exact result needs a 106-bit long mantissa; this is more than
even extended precision can provide. Usually, multiplications and
divisions are performed by specialized hardware, able to handle this
kind of problems.
18 Special Arithmetic Operations
18.1 Signed zeros
Recall that the binary representation 0 has all mantissa and ex-
ponent bits zero. Depending on the sign bit, we may have +0 or
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.37
0. Both are legal, and they are distinct; however, if x = +0 and
y = 0 then the comparison (x.EQ.y) returns .TRUE. for consistency.
The main reason for allowing signed zeros is to maintain con-
sistency with the two types of innity, + and . In IEEE
arithmetic, 1/(+0) = + and 1/(0) = . If we had a single,
unsigned 0, with 1/0 = +, then 1/(1/ ) = 1/0 = +, and not
as expected.
There are other good arguments in favor of signed zeros. For
example, consider the function tan(/2x), discontinuous at x = 0;
we can consistently dene the result to be based on the signum
of x = 0.
Signed zeros have disadvantages also; for example, with x = +0
and y = 0 we have that x = y but 1/x ,= 1/y!
18.2 Operations with
The following operations with innity are possible:
a/ =
_
0, a nite
NaN, a =
a =
_

_
, a > 0 ,
, a < 0 ,
NaN, a = 0 .
+ a =
_

_
, anite ,
, a = ,
NaN, a = .
18.3 Operations with NaN
Any operation involving NaN as (one of) the operand(s) produces
NaN. In addition, the following operations produce NaN: +
(), 0 , 0/0, /,
_
[x[, x modulo 0, modulo x.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.38
(a < b).OR.(a = b).OR.(a > b) True, if a, b FP numbers
False, if one of them NaN
+0 = 0 True
+= False
Table 3: IEEE results to comparisons
18.4 Comparisons
The IEEE results to comparisons are summarized in Table 18.4.
19 Arithmetic Exceptions
One of the most dicult things in programming is to treat excep-
tional situations. It is desirable that a program handles excep-
tional data in a manner consistent with the handling of normal
data. The results will then provide the user with the information
needed to debug the code, if an exception occurred. The extra FP
numbers allowed by the IEEE standard are meant to help handling
such situations.
The IEEE standard denes 5 exception types: division by 0,
overow, underow, invalid operation and inexact operation.
19.1 Division by 0
If a is a oating point number, then IEEE standard requires that
a/0.0 =
_

_
+ , if a > 0 ,
, if a < 0 ,
NaN , if a = 0.
If a > 0 or a < 0 the denitions make mathematical sense. Recall
that have special binary representations, with all exponent bits
equal to 1 and all mantissa bits equal to 0.
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.39
If a = 0, then the operation is 0/0, which makes no mathematical
sense. What we obtain is therefore invalid information. The result
is the Not a Number, in short NaN. Recall that NaN also have
a special binary representation. NaN is a red ag, which tells the
user that something wrong happened with the program. may or
may not be the result of a bug, depending on the context.
19.2 Overow
Occurs when the result of an arithmetic operation is nite, but
larger in magnitude than the largest FP number representable us-
ing the given precision. The standard IEEE response is to set the
result to (round to nearest) or to the largest representable FP
number (round toward 0). Some compilers will trap the overow
and abort execution with an error message.
Example (Demmel 1984, from D. Goldberg, p. 187, adapted):
In our toy FP system lets compute
2 10
23
+ 10
23
i
2 10
25
+ 10
25
i
whose result is 1.00 10
2
, a normal FP number. A direct use
of the formula
a + b i
c + d i
=
ac + bd
c
2
+ d
2
+
bc ad
c
2
+ d
2
i
returns the result equal to 0, since the denominators overow.
Using the scaled formulation
=
d
c
;
a + b i
c + d i
=
a + b
c + d
+
b a
c + d
i
we have = 0.5, (a + b)/(c + d) = (2.5 10
23
)/(2.5 10
25
) = 0.01 and
b a = 0.
Sometimes overow and innity arithmetic mat lead to curious
results. For example, let x = 3.16 10
25
and compute
x
2
/
(x + 1.0 10
23
)
2
= 9.93 10
1
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.40
Since the denominator overows it is set to innity; the numerator
does not overow, therefore the result is 0!. If we compute the
same quantity as
_
x
x + 1 10
23
_
=
_
3.16
3.17
_
= 0.99
we obtain a result closer to the mathematical value.
19.3 Underow
Occurs when the result of an arithmetic operation is smaller than
the smallest normalized FP number which can be stored. In IEEE
standard the result is a subnormal number (gradual underow)
or 0, if the result is small enough. Note that subnormal numbers
have fewer bits of precision than normalized ones, so using them
may lead to a loss of accuracy. For example, let
x = 1.99 10
40
, y = 1.00 10
11
, z = 1.00 10
+11
,
and compute t = (x y) z. The mathematical result is t = 1.99
10
40
. According to our roundo error analysis, we expect the cal-
culated t to satisfy

t
expected
= (1 + )t
exact
, [[ ,
where the bound on delta comes from the fact that we have two
oating point multiplications, and (with exact rounding) each of
them can introduce a roundo error as large as the half the ma-
chine precision [

[ /2:
x y = (1 +
1

)(x y)
(x y) z = (1 +
2

) [(x y) z]
= (1 +
2

)(1 +
1

) [x y z]
(1 +
1

+
2

) [x y z]
(1 + ) [x y z]
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.41
IEEE Exception Operation Result
Invalid Operation NaN
Division by 0
Overow (or FPmax)
Underow 0 or subnormal
Precision rounded value
Table 4: The IEEE Standard Response to Exceptions
Since in our toy system = 0.01, we expect the computed result to
be in the range

t
expected
[(1 2)t
exact
, (1 + 2)t
exact
] = [1.98 10
40
, 2.00 10
40
] .
However, the product x y = 1.99 10
51
underows, and has to be
represented by the subnormal number 0.01 10
49
; when multiplied
by z this gives

t = 1.00 10
40
, which means that the relative error
is almost 100 times larger than expected

t = 1.00 10
40
= (1 +

)t
exact
,

= 0.99 = 99 !
19.4 Inexact
Occurs when the result of an arithmetic operation is inexact. This
situation occurs quite often!
19.5 Summary
The IEEE standard response to exceptions is summarized in Table
19.5.
20 Flags and Exception Trapping
Each exception is signaled by setting an associate status ag; the
ag remains set until explicitly cleared. The user is able to read
C
S
3
4
1
4
.
C
S
@
V
T
,
F
a
l
l
2
0
1
3
c _Adrian Sandu, 1998-2013. Distribution outside classroom not permitted.42
and write the status ags. There are 5 status ags (one for each
possible exception type); in addition, for each ag there is a trap
enable bit (see below), and there are 4 rounding modes bits. If
the result of a computation is, say, +, the ag values help user
decide whether this is the eect of an overow or is a genuine
innity, like 1/0.
The programmer has the option to
Mask the exception. The appropriate ag is set and the pro-
gram continues with the standard response shown in the table;
Trap the exception. Occurrence of the exception triggers a
call to a special routine, the trap handler. Trap handlers
receive the values of the operands which lead to the ex-
ception, or the result;
clear or set the status ag; and
return a value that will be used as the result of the faulty
operation.
Using trap handler calls for each inexact operation is prohibitive.
For overflow/underflow, the argument to the trap handler is the
result, with a modified exponent (the wrapped-around result). In
single precision the exponent is decreased/increased by 192, and in
double precision by 1536, followed by a rounding of the number to
the corresponding precision.

Trap handlers are useful for backward compatibility, when an
old code expects to be aborted if an exception occurs. Example (from
D. Goldberg, page 189): without aborting, the sequence

    do S until (x >= 100)

will loop indefinitely if x becomes NaN.
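The same trap is easy to write in C: since every ordered comparison
involving a NaN is false, the hypothetical update loop below never
terminates once x becomes NaN (step() is our placeholder for the loop
body S):

    /* Sketch: if step() ever returns NaN, the test (x >= 100.0)
       is false forever, so the loop never exits.                */
    double run(double (*step)(double), double x) {
        do {
            x = step(x);
        } while (!(x >= 100.0));   /* NaN keeps this condition true */
        return x;
    }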
21 Systems Aspects, from D. Goldberg, p. 193

The design of computer systems requires in-depth knowledge about
FP. Modern processors have special FP instructions, compilers must
generate such FP instructions, and the operating system must handle
the exception conditions generated by these FP instructions.

21.1 Instruction Sets

It is useful to have a multiplication of single precision operands
(p mantissa digits) that returns a double precision result (2p mantissa
digits). Many calculations require occasional bursts of higher
precision.
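In C one can emulate such an instruction by widening the operands
first: two 24-bit significands produce at most a 48-bit product, which
fits exactly in a double's 53-bit significand. The sketch below is an
observation about the IEEE formats, not something the standard mandates:

    /* Exact product of two single precision numbers:
       24 + 24 = 48 significand bits fit in double's 53. */
    double exact_mul(float a, float b) {
        return (double)a * (double)b;   /* no rounding error */
    }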
21.2 Ambiguity

A language should define the semantics precisely enough to prove
statements about the programs. Common points of ambiguity:

- x = 3.0/10.0 is not exactly representable as an FP number; it is
  usually not specified whether all occurrences of 10.0*x must have
  the same value.
- What happens during exceptions.
- Interpretation of parentheses.
- Evaluation of subexpressions. If x is real and m, n are integers,
  is the division in the expression x+m/n integer or FP? (See the
  sketch after this list.) For example, we can compute all the
  operations in the highest precision present in the expression; or
  we can assign, bottom-up in the expression graph, tentative
  precisions based on the operands, and then top-down assign the
  maximum of the tentative and the expected precision.
- Defining the exponential consistently. Ex: (-3)**3 = -27, but
  (-3.0)**(3.0) is problematic, as it is defined via the logarithm.
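In C, for instance, the division in x+m/n is decided by the operand
types, which is a frequent source of surprises; a small illustration
of our own:

    #include <stdio.h>

    int main(void) {
        double x = 1.0;
        int m = 1, n = 2;

        printf("%g\n", x + m / n);          /* 1.0: m/n is integer division = 0 */
        printf("%g\n", x + (double)m / n);  /* 1.5: the cast forces FP division */
        return 0;
    }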
Goldberg proposes to consider f(x) → a, g(x) → b as x → 0. If
f(x)^g(x) → c for all such f, g, then define a^b = c. For example,
2^∞ = ∞, but 1^∞ = NaN, since 1^(1/x) → 1 while (1 − x)^(1/x) → e^−1.
21.3 Programming Language Features

The IEEE standard says nothing about how the features can be
accessed from a programming language. There is usually a mismatch
between IEEE-supporting hardware and programming languages. Some
capabilities, like exactly rounded square root, can be accessed
through a library of function calls. Others are harder:

- The standard requires extended precision, while most languages
  define only single and double.
- Languages need to provide subroutines for reading and writing
  the state (exception flags, enable bits, rounding mode bits,
  etc.).
- One cannot define −x as 0 − x, since this is not true for x = +0.
- NaNs are unordered; therefore, when comparing 2 numbers we have
  <, >, =, and unordered (see the sketch after this list).
- The precisely defined IEEE rounding modes may conflict with the
  programming language's implicitly-defined rounding modes or
  primitives.
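C99 exposes the fourth outcome through the isunordered macro of
<math.h>; a small sketch:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        volatile double zero = 0.0;
        double x = zero / zero;   /* NaN */
        double y = 1.0;

        /* every ordered comparison with a NaN is false */
        printf("%d %d %d\n", x < y, x > y, x == y);    /* 0 0 0 */
        printf("unordered: %d\n", isunordered(x, y));  /* 1     */
        return 0;
    }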
21.4 Optimizers

Consider the following code for estimating the machine precision ε:

    eps = 1.0;
    do
        eps = 0.5*eps;
    while (eps + 1 > 1);
If the compiler optimizes the test (eps+1 > 1) to (eps > 0), the loop
runs until eps underflows, and the code computes the largest number
which is rounded to 0 instead of the machine precision.
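A common defence, sketched below in C, is to force the intermediate
sum through memory with volatile, so that the test cannot be
rewritten (the exact value obtained still depends on the evaluation
precision of the platform):

    #include <stdio.h>

    int main(void) {
        double eps = 1.0;
        volatile double onePlusEps = 2.0;

        while (onePlusEps > 1.0) {     /* cannot be folded to eps > 0 */
            eps = 0.5 * eps;
            onePlusEps = eps + 1.0;
        }
        /* the last eps with eps + 1 > 1 is twice the final one */
        printf("machine eps ~ %g\n", 2.0 * eps);
        return 0;
    }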
Optimizers should be careful when applying mathematical algebraic
identities to FP variables. If, during the optimization process, the
expression x + (y + z) is changed to (x + y) + z, the meaning of the
computation is different.

Converting constants like 1.0E-40 from decimal to binary at compile
time can change the semantics (a conversion at run time obeys the
current value of the IEEE rounding modes, and may raise the inexact
and underflow flags).

Semantics can be changed during common subexpression elimination.
In the code

    C = A*B; RndMode = Up; D = A*B;

A*B is not a common subexpression, since it is computed with two
different rounding modes.
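In C99 the pseudocode corresponds to fesetround(); in our sketch we
use a division, which is inexact in every rounding mode, so the two
results genuinely differ. An optimizer that reused C for D (e.g.,
when compiled without FENV_ACCESS in effect) would change the answer:

    #include <stdio.h>
    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    int main(void) {
        volatile double A = 1.0, B = 3.0;  /* volatile discourages folding */

        fesetround(FE_DOWNWARD);
        double C = A / B;            /* 1/3 rounded down         */
        fesetround(FE_UPWARD);
        double D = A / B;            /* 1/3 rounded up           */
        fesetround(FE_TONEAREST);    /* restore the default mode */

        printf("C < D ? %d\n", C < D);   /* prints 1 */
        return 0;
    }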
21.5 Exception Handling

When an operation traps, the conceptual model is that everything
stops and the trap handler is executed; since the underlying
assumption is that of serial execution, traps are harder to implement
on machines that use pipelining or have multiple ALUs. Hardware
support for identifying exactly which operation trapped may be
needed. For example, in

    x = y*z; z = a+b;

both operations can be physically executed in parallel; if the
multiplication traps, the handler will need the values of the
arguments y and z. But the value of z is modified by the addition,
which started at the same time as the multiply and may have finished
first. IEEE-supporting systems must either avoid such a situation, or
provide a way to save z and pass the original value to the handler,
if needed.
22 Long Summations

Long summations have a problem: since each individual addition can
bring an error of 0.5 ulp into the partial result, the total result
can be quite inaccurate. Fixes:

- compute the partial sums in a higher precision;
- sort the terms first;
- use Kahan's summation formula (see the sketch after this list).
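A minimal C sketch of Kahan's formula: the running compensation c
recovers, at every step, the low-order bits that the plain addition
would drop:

    /* Kahan (compensated) summation */
    double kahan_sum(const double *x, int n) {
        double s = 0.0, c = 0.0;
        for (int i = 0; i < n; i++) {
            double y = x[i] - c;  /* corrected next term          */
            double t = s + y;     /* low bits of y are lost here  */
            c = (t - s) - y;      /* ... and recovered here       */
            s = t;
        }
        return s;
    }

Note that an optimizer applying the algebraic identities discussed in
Section 21.4 would simplify c to zero, so such code must be compiled
without "fast math" transformations.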
23 Homework

Homework 11 The following program computes a very large FP
number in double precision. When assigning a=b, the double precision
number (b) will be converted to single precision (a), and this
may result in overflow. Compile the program with both Fortran90
and Fortran77 compilers, for example f90 file.f -o a90.out and
f77 file.f -o a77.out. Run the two versions of the program for
p = 0, 0.1, 0.01. Does the Fortran90 compiler obey the IEEE standard?
For which values of p does the single precision overflow occur? How
about the Fortran77 compiler? Note that, if you do not see 'Normal
End Here !' and STOP, the program did not finish normally; trapping
was used for the FP exception. If you see them, masking was used
for the exception and the program terminated normally. Do the
compilers use masking or trapping?
PROGRAM test_overflow
REAL :: a, p
DOUBLE PRECISION :: b
PRINT*, 'Please provide p:'
read*, p
b = (1.99d0+p)*(2.d0**127)
PRINT*, b
a = b
PRINT*, a
PRINT*, 'Normal End Here !'
END PROGRAM test_overflow
Homework 12 The following program computes a small FP number
(2^−p) in single and in double precision. Then, this number is
multiplied by 2^p. The theoretical result of this computation is
1. Run the code for p = 120 and for p = 150. What do you see?
Does the Fortran90 compiler obey the IEEE standard? Repeat the
compilation with the Fortran77 compiler, and run again. Any
differences?
PROGRAM test_underflow
REAL :: a,b
DOUBLE PRECISION :: c,d
INTEGER :: p
PRINT*, 'Please provide p'
read*, p
c = 2.d0**(-p)
d = (2.d0**p)*c
PRINT*, 'Double Precision: ', d
a = 2.e0**(-p)
b = (2.d0**p)*a
PRINT*, 'Single Precision: ', b
PRINT*, 'Normal End Here !'
END PROGRAM test_underflow
Homework 13 The following program performs some messy calculations,
like division by 0, etc. Compile it with both Fortran90 and
Fortran77 compilers, and run the two executables. What do you
see? Any differences?
PROGRAM test_arit
REAL :: a, b, c, d
c = 0.0
d = -0.0
PRINT*, 'c=',c, ' d=',d
a = 1.0/c
PRINT*, a
b = 1.0/d
PRINT*, 'a=',a, ' b=',b
PRINT*, 'a+b=',a+b
PRINT*, 'a-b=',a-b
PRINT*, 'a/b=',a/b
END PROGRAM test_arit