Chapter 12

Floating Point Arithmetic

There are many data processing applications (e.g. image and voice processing),
which use a large range of values and that need a relatively high precision. In such
cases, instead of encoding the information in the form of integers or fixed-point
numbers, an alternative solution is a floating-point representation. In the first
section of this chapter, the IEEE standard for floating point is described. The next
section is devoted to the algorithms for executing the basic arithmetic operations.
The two following sections define the main rounding methods and introduce the
concept of guard digit. Finally, the last few sections propose basic implementa-
tions of the arithmetic operations, namely addition and subtraction, multiplication,
division and square root.

12.1 IEEE 754-2008 Standard

The IEEE-754 Standard is a technical standard established by the Institute of
Electrical and Electronics Engineers for floating-point operations. There are
numerous CPU, FPU and software implementations of this standard. The current
version is IEEE 754-2008 [1], which was published in August 2008. It includes
nearly all the definitions of the original IEEE 754-1985 and IEEE 854-1987
standards. The main enhancement in the new standard is the definition of decimal
floating point representations and operations. The standard defines the arithmetic
and interchange formats, rounding algorithms, arithmetic operations and exception
handling.

J.-P. Deschamps et al., Guide to FPGA Implementation of Arithmetic Functions,
Lecture Notes in Electrical Engineering 149, DOI: 10.1007/978-94-007-2987-2_12,
Springer Science+Business Media Dordrecht 2012

12.1.1 Formats

Formats in IEEE 754 describe sets of floating-point data and encodings for
interchanging them. A format allows representing a finite subset of the real
numbers. Floating-point numbers are represented using a triplet of integers.
The finite numbers may be expressed either in base 2 (binary) or in base 10
(decimal). Each finite number is described by three integers: the sign (zero or
one), the significand s (also known as coefficient or mantissa), and the
exponent e. The numerical value of the represented number is
$(-1)^{sign} \times s \times B^{e}$, where B is the base (2 or 10).
For example, if sign = 1, s = 123456, e = -3 and B = 10, then the represented
number is -123.456.
The format also allows the representation of infinite numbers ($+\infty$ and
$-\infty$), and of special values, called Not a Number (NaN), to represent
invalid values. In fact there are two kinds of NaN: qNaN (quiet) and sNaN
(signaling). The latter, used for diagnostic purposes, indicates the source of
the NaN.
The values that can be represented are determined by the base (B), the number
of digits of the significand (precision p), and the minimum and maximum values
$e_{min}$ and $e_{max}$ of e. Hence, s is an integer belonging to the range 0 to
$B^{p} - 1$, and e is an integer such that $e_{min} \le e \le e_{max}$.
For example, if B = 10 and p = 7 then s lies between 0 and 9999999. If
$e_{min} = -95$ and $e_{max} = 96$, then the smallest non-zero positive number
that can be represented is $1 \times 10^{-101}$, the largest is
$9999999 \times 10^{90}$ ($9.999999 \times 10^{96}$), and the full range of
numbers is from $-9.999999 \times 10^{96}$ to $9.999999 \times 10^{96}$. The
numbers closest to the inverse of these bounds ($-1 \times 10^{-95}$ and
$1 \times 10^{-95}$) are considered to be the smallest (in magnitude) normal
numbers. Non-zero numbers between these smallest numbers are called subnormal
(also denormalized) numbers.
Zero values are finite values whose significand is 0. The sign bit specifies
whether a zero is +0 (positive zero) or -0 (negative zero).
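
As a small illustration of the value formula above, the following Python sketch evaluates a (sign, s, e) triplet exactly with rational arithmetic. The function name fp_value is an illustrative choice, not part of the standard.

```python
from fractions import Fraction

def fp_value(sign: int, s: int, e: int, base: int) -> Fraction:
    """Exact value (-1)^sign * s * base^e of a finite floating-point triplet."""
    return (-1) ** sign * s * Fraction(base) ** e

print(fp_value(1, 123456, -3, 10))           # -15432/125, i.e. -123.456
print(float(fp_value(0, 9999999, 90, 10)))   # 9.999999e+96
```

Using exact fractions avoids introducing a second rounding error while inspecting the represented value.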

12.1.2 Arithmetic and Interchange Formats

The arithmetic format, based on the four parameters B, p, emin and emax, defines the
set of represented numbers, independently of the encoding that will be chosen for
storing and interchanging them (Table 12.1). The interchange formats define fixed-
length bit-strings intended for the exchange of floating-point data. There are some
differences between binary and decimal interchange formats. Only the binary format
will be considered in this chapter. A complete description of both the binary and the
decimal format can be found in the document of the IEEE 754-2008 Standard [1].
For the interchange of binary floating-point numbers, formats of lengths equal
to 16, 32, 64, 128, and any multiple of 32 bits for lengths bigger than 128, are
defined (Table 12.2). The 16-bit format is only for the exchange or storage of
small numbers.

Table 12.1 Binary and decimal floating point format in IEEE 754-2008

              Binary formats (B = 2)                          Decimal formats (B = 10)
Parameter     Binary16   Binary32   Binary64   Binary128      Decimal32  Decimal64  Decimal128
p, digits     10 + 1     23 + 1     52 + 1     112 + 1        7          16         34
emax          +15        +127       +1023      +16383         +96        +384       +6144
emin          -14        -126       -1022      -16382         -95        -383       -6143
Common name   Half       Single     Double     Quadruple
              precision  precision  precision  precision

Table 12.2 Binary interchange format parameters

Parameter                      Binary16  Binary32  Binary64  Binary128  Binary{k} (k >= 128)
k, storage width in bits       16        32        64        128        Multiple of 32
p, precision in bits           11        24        53        113        k - w
emax                           15        127       1023      16383      2^(k-p-1) - 1
bias, E - e                    15        127       1023      16383      emax
w, exponent field width        5         8         11        15         round(4 log2 k) - 13
t, trailing significand bits   10        23        52        112        k - w - 1

Fig. 12.1 Binary interchange format

The binary interchange encoding scheme is the same as in the IEEE 754-1985
standard. The k-bit strings are made up of three fields (Fig. 12.1):
• a 1-bit sign S,
• a w-bit biased exponent E = e + bias,
• the p - 1 trailing bits T of the significand; the missing bit is encoded in the
exponent (hidden first bit).
Each binary floating-point number has just one encoding. In the following
description, the significand s is expressed in scientific notation, with the radix point
immediately following the first digit. To make the encoding unique, the value of the
significand s is maximized by decreasing e until either $e = e_{min}$ or $s \ge 1$
(normalization). After normalization, there are two possibilities regarding the significand:
• If $s \ge 1$ and $e \ge e_{min}$ then a normalized number of the form
$1.d_1 d_2 \ldots d_{p-1}$ is obtained. The first "1" is not stored (implicit leading 1).
• If $e = e_{min}$ and $0 < s < 1$, the floating-point number is called subnormal.
Subnormal numbers (and zero) are encoded with a reserved biased exponent value.
They have an implicit leading significand bit = 0 (Table 12.2).
The minimum exponent value is $e_{min} = 1 - e_{max}$. The range of the biased
exponent E is 1 to $2^{w} - 2$ to encode normal numbers. The reserved value 0 is used
to encode ±0 and subnormal numbers. The value $2^{w} - 1$ is reserved to encode
$\pm\infty$ and NaNs.
The value of a binary floating-point datum is inferred from the constituent fields
as follows:
• If $E = 2^{w} - 1$ (all 1's in E), then the datum is a NaN or an infinity. If
$T \ne 0$, then it is a qNaN or an sNaN: if the first bit of T is 1 it is a qNaN,
otherwise it is an sNaN. If T = 0, then the value is $(-1)^{sign} \times (+\infty)$.
• If E = 0 (all 0's in E), then the datum is 0 or a subnormal number. If T = 0 it is a
signed 0. Otherwise ($T \ne 0$), the value of the corresponding floating-point
number is $(-1)^{sign} \times 2^{e_{min}} \times (0 + 2^{1-p} \times T)$.
• If $1 \le E \le 2^{w} - 2$, then the datum is
$(-1)^{sign} \times 2^{E - bias} \times (1 + 2^{1-p} \times T)$.
Remember that the significand of a normal number has an implicit leading 1.
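
The three decoding rules above can be sketched in Python for binary32 (w = 8, p = 24). This is a minimal illustration under stated assumptions: the name decode_binary32 and the string labels for the special values are choices made here, and the sign of a zero is not preserved because the result is returned as an exact fraction.

```python
from fractions import Fraction

W, P = 8, 24                  # binary32: exponent field width and precision
BIAS = 2 ** (W - 1) - 1       # 127
EMIN = 1 - BIAS               # -126

def decode_binary32(bits: int):
    """Decode a 32-bit pattern into an exact Fraction or a special-value label."""
    sign = bits >> 31
    E = (bits >> (P - 1)) & (2 ** W - 1)     # biased exponent field
    T = bits & (2 ** (P - 1) - 1)            # trailing significand field
    if E == 2 ** W - 1:                      # all 1's: infinity or NaN
        if T == 0:
            return "-inf" if sign else "+inf"
        return "qNaN" if T >> (P - 2) else "sNaN"   # first bit of T
    frac = Fraction(T, 2 ** (P - 1))         # 2^(1-p) * T
    if E == 0:                               # zero or subnormal
        return (-1) ** sign * Fraction(2) ** EMIN * frac
    return (-1) ** sign * Fraction(2) ** (E - BIAS) * (1 + frac)

print(decode_binary32(0x3FC00000))   # 3/2
print(decode_binary32(0x7FC00000))   # qNaN
print(decode_binary32(0xFF800000))   # -inf
```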

Example 12.1
Convert the decimal number -9.6875 to its binary32 representation.
• First convert the absolute value of the number to binary (Chap. 10):
$9.6875_{10} = 1001.1011_2$.
• Normalize: $1001.1011_2 = 1.0011011_2 \times 2^{3}$. Hence e = 3 and s = 1.0011011.
• Hide the first bit and complete with 0's up to 23 bits:
00110110000000000000000.
• Add the bias to the exponent. In this case, w = 8, bias = $2^{w-1} - 1 = 127$ and thus
$E = e + bias = 3 + 127 = 130_{10} = 10000010_2$.
• Compose the final 32-bit representation:
$1\ 10000010\ 00110110000000000000000_2 = C11B0000_{16}$.
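
The conversion steps can be cross-checked with a short Python sketch. The helper encode_binary32 is a hypothetical name introduced here; it handles only nonzero values that are exactly representable as normal binary32 numbers, and the result is compared with the encoding produced by Python's struct module.

```python
import struct
from fractions import Fraction

def encode_binary32(x: Fraction) -> int:
    """Encode a nonzero value that is exactly representable as a normal binary32."""
    sign = 1 if x < 0 else 0
    s, e = abs(x), 0
    while s >= 2:                 # normalize so that 1 <= s < 2
        s, e = s / 2, e + 1
    while s < 1:
        s, e = s * 2, e - 1
    T = (s - 1) * 2 ** 23         # hide the leading 1, keep the 23 trailing bits
    assert T.denominator == 1 and -126 <= e <= 127, "not an exact normal binary32"
    E = e + 127                   # biased exponent
    return (sign << 31) | (E << 23) | int(T)

bits = encode_binary32(Fraction(-155, 16))        # -9.6875
print(f"{bits:08X}")                               # C11B0000
print(struct.pack(">f", -9.6875).hex().upper())    # C11B0000
```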

Example 12.2
Convert the following binary32 numbers to their decimal representation.
• $7FC00000_{16}$: sign = 0, E = $FF_{16}$, $T \ne 0$; hence it is a NaN. Since the first bit
of T is 1, it is a quiet NaN.
• $FF800000_{16}$: sign = 1, E = $FF_{16}$, T = 0; hence it is $-\infty$.
• $6545AB78_{16}$: sign = 0, E = $CA_{16} = 202_{10}$, e = E - bias = 202 - 127 = 75,
T = $10001011010101101111000_2$,
s = $1.10001011010101101111000_2 = 1.5442953_{10}$.
The number is $1.10001011010101101111_2 \times 2^{75} = 5.8341827 \times 10^{22}$.
• $12345678_{16}$: sign = 0, E = $24_{16} = 36_{10}$, e = E - bias = 36 - 127 = -91,
T = $01101000101011001111000_2$,
s = $1.01101000101011001111000_2 = 1.408888_{10}$.
The number is $1.01101000101011001111_2 \times 2^{-91} = 5.69045661 \times 10^{-28}$.
• $80000000_{16}$: sign = 1, E = $00_{16}$, T = 0. The number is -0.0.
• $00012345_{16}$: sign = 0, E = $00_{16}$, $T \ne 0$, hence it is a subnormal number,
e = $e_{min}$ = -126, T = $00000010010001101000101_2$,
s = $0.00000010010001101000101_2 = 0.0088888_{10}$.
The number is $0.00000010010001101000101_2 \times 2^{-126} = 1.0448782 \times 10^{-40}$.

12.2 Arithmetic Operations

First analyze the main arithmetic operations and generate the corresponding
computation algorithms. In what follows it will be assumed that the significand s is
represented in base B (in binary if B = 2, in decimal if B = 10) and that it belongs
to the interval $1 \le s \le B - ulp$, where ulp stands for the unit in the last position
or unit of least precision. Thus a number is expressed in the form
$(s_0 . s_{-1} s_{-2} \ldots s_{-p}) \cdot B^{e}$ where $e_{min} \le e \le e_{max}$ and
$1 \le s_0 \le B - 1$.

Comment 12.1
The binary subnormals and the decimal floating point are not normalized numbers
and are not included in the following analysis. This situation deserves some special
treatment and is out of the scope of this section.

12.2.1 Addition of Positive Numbers

Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$, their sum
$s \cdot B^{e}$ is computed as follows: assume that $e_1$ is greater than or equal to $e_2$;
then (alignment) the sum of $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$ can be expressed in the
form $s \cdot B^{e}$ where

$s = s_1 + s_2 / B^{e_1 - e_2}$ and $e = e_1$. (12.1)

The value of s belongs to the interval

$1 \le s \le 2B - 2\,ulp$, (12.2)

so that s could be greater than or equal to B. If it is the case, that is if

$B \le s \le 2B - 2\,ulp$, (12.3)

then (normalization) substitute s by s/B, and e by e + 1, so that the value of
$s \cdot B^{e}$ is the same as before, and the new value of s satisfies

$1 \le s \le 2 - (2/B)\,ulp \le B - ulp$. (12.4)

The significands $s_1$ and $s_2$ of the operands are multiples of ulp. If $e_1$ is greater than
$e_2$, the value of s could no longer be a multiple of ulp and some rounding function
should be applied to s. Assume that s' < s < s'' = s' + ulp, s' and s'' being two
successive multiples of ulp. Then the rounding function associates to s either s' or s'',
according to some rounding strategy. According to (12.4) and to the fact that 1 and
B - ulp are multiples of ulp, it is obvious that $1 \le s' < s'' \le B - ulp$. Nevertheless, if
the condition (12.3) does not hold, that is if $1 \le s < B$, s could belong to the interval

$B - ulp < s < B$, (12.5)

so that round(s) could be equal to B; then a new normalization step would be
necessary, i.e. substitution of s = B by s = 1 and e by e + 1.

Algorithm 12.1: Sum of positive numbers
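
The steps described above (alignment, normalization, rounding, and a possible second normalization) can be sketched in Python as follows. This is a reconstruction from the prose, not the book's listing. It assumes exact rational significands, round to nearest with ties to even, and the illustrative names add_positive and round_nearest_even.

```python
from fractions import Fraction

def round_nearest_even(s: Fraction, ulp: Fraction) -> Fraction:
    """Round s to a multiple of ulp (nearest, ties to the even multiple)."""
    q, r = divmod(s, ulp)                 # s = q*ulp + r with 0 <= r < ulp
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    return q * ulp

def add_positive(s1, e1, s2, e2, B=2, ulp=Fraction(1, 32)):
    """Sum of s1*B^e1 and s2*B^e2, both positive and normalized."""
    if e1 < e2:                           # make e1 the larger exponent
        s1, e1, s2, e2 = s2, e2, s1, e1
    s = s1 + s2 / Fraction(B) ** (e1 - e2)    # alignment, Eq. (12.1)
    e = e1
    if s >= B:                            # normalization, condition (12.3)
        s, e = s / B, e + 1
    s = round_nearest_even(s, ulp)        # rounding
    if s >= B:                            # rounding may produce s = B
        s, e = s / B, e + 1
    return s, e

# 1.11010 * 2^3 + 1.00110 * 2^2
print(add_positive(Fraction(0b111010, 32), 3, Fraction(0b100110, 32), 2))
# (Fraction(19, 16), 4), i.e. 1.00110 * 2^4
```

Overflow detection (e greater than emax) is left out of the sketch.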

Examples 12.3
Assume that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in the form
$s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. For simplicity e is written in decimal (base 10).

1. Compute $z = (1.10101 \times 2^{3}) + (1.00010 \times 2^{-1})$.
Alignment: $z = (1.10101 + 0.000100010) \times 2^{3} = 1.101110010 \times 2^{3}$.
Rounding: $s \cong 1.10111$.
Final result: $z \cong 1.10111 \times 2^{3}$.

2. Compute $z = (1.11010 \times 2^{3}) + (1.00110 \times 2^{2})$.
Alignment: $z = (1.11010 + 0.100110) \times 2^{3} = 10.011010 \times 2^{3}$.
Normalization: s = 1.0011010, e = 4.
Rounding: $s \cong 1.00110$.
Final result: $z \cong 1.00110 \times 2^{4}$.

3. Compute $z = (1.10010 \times 2^{3}) + (1.10101 \times 2^{1})$.
Alignment: $z = (1.10010 + 0.0110101) \times 2^{3} = 1.1111101 \times 2^{3}$.
Rounding (rounding up, e.g. toward $+\infty$; round to nearest would give 1.11111): $s \cong 10.00000$.
Normalization: $s \cong 1.00000$, e = 4.
Final result: $z \cong 1.00000 \times 2^{4}$.

Comments 12.2
1. The addition of two positive numbers could produce an overflow as the final
value of e could be greater than emax.
2. Observe in the previous examples the lack of precision due to the small number
of bits (6 bits) used in the significand s.

12.2.2 Difference of Positive Numbers

Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$, their
difference $s \cdot B^{e}$ is computed as follows: assume that $e_1$ is greater than or equal
to $e_2$; then (alignment) the difference between $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$ can be
expressed in the form $s \cdot B^{e}$ where

$s = s_1 - s_2 / B^{e_1 - e_2}$ and $e = e_1$. (12.6)

The value of s belongs to the interval $-(B - ulp) \le s \le B - ulp$. If s is negative,
then it is substituted by -s and the sign of the final result is modified
accordingly. If s is equal to 0, then an exception equal_zero could be raised. It
remains to consider the case where $0 < s \le B - ulp$. The value of s could be smaller
than 1. In order to normalize the significand, s is substituted by $s \cdot B^{k}$ and e by e - k,
where k is the minimum exponent such that $s \cdot B^{k} \ge 1$. Thus, the relation
$1 \le s < B$ holds. It remains to round (up or down) the significand and to normalize it if
necessary.
In the following algorithm, the function leading_zeroes(s) computes the
smallest k such that $s \cdot B^{k} \ge 1$.

Algorithm 12.2: Difference of positive numbers
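
A Python sketch of the procedure described above is given below. It is a reconstruction from the prose, not the book's listing. It assumes exact rational significands and round to nearest with ties to even, it returns None instead of raising the equal_zero exception, and it swaps the operands when e1 < e2, which is an extension of the stated assumption e1 >= e2. The names sub_positive and round_nearest_even are illustrative.

```python
from fractions import Fraction

def round_nearest_even(s: Fraction, ulp: Fraction) -> Fraction:
    """Round s to a multiple of ulp (nearest, ties to the even multiple)."""
    q, r = divmod(s, ulp)
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    return q * ulp

def sub_positive(s1, e1, s2, e2, B=2, ulp=Fraction(1, 32)):
    """Return (sign, s, e) with s1*B^e1 - s2*B^e2 ~ (-1)^sign * s * B^e, or None if zero."""
    sign = 0
    if e1 < e2:                               # swap operands, remember the sign change
        s1, e1, s2, e2, sign = s2, e2, s1, e1, 1
    s = s1 - s2 / Fraction(B) ** (e1 - e2)    # alignment, Eq. (12.6)
    e = e1
    if s == 0:
        return None                           # stands in for the equal_zero exception
    if s < 0:
        s, sign = -s, 1 - sign
    while s < 1:                              # leading zeroes: smallest k with s*B^k >= 1
        s, e = s * B, e - 1
    s = round_nearest_even(s, ulp)
    if s >= B:                                # rounding may produce s = B
        s, e = s / B, e + 1
    return sign, s, e

# 1.01010 * 2^3 - 1.01001 * 2^1
print(sub_positive(Fraction(0b101010, 32), 3, Fraction(0b101001, 32), 1))
# (0, Fraction(1, 1), 3), i.e. 1.00000 * 2^3
```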

Examples 12.4
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. For computing the difference, the 2's
complement representation is used (one extra bit is used).

1. Compute $z = (1.10101 \times 2^{-2}) - (1.01010 \times 2^{1})$.
Alignment: $z = (0.00110101 - 1.01010) \times 2^{1}$.
2's complement addition: $(00.00110101 + 10.10101 + 00.00001) \times 2^{1} = 10.11100101 \times 2^{1}$.
Change of sign: s = 01.00011010 + 00.00000001 = 01.00011011.
Rounding: $s \cong 1.00011$.
Final result: $z \cong -1.00011 \times 2^{1}$.

2. Compute $z = (1.00010 \times 2^{3}) - (1.10110 \times 2^{2})$.
Alignment: $z = (1.00010 - 0.110110) \times 2^{3}$.
2's complement addition: $(01.00010 + 11.001001 + 00.000001) \times 2^{3} = 00.001110 \times 2^{3}$.
Leading zeroes: k = 3, s = 1.11000, e = 0.
Final result: $z = 1.11000 \times 2^{0}$.

3. Compute $z = (1.01010 \times 2^{3}) - (1.01001 \times 2^{1})$.
Alignment: $z = (1.01010 - 0.0101001) \times 2^{3} = 0.1111111 \times 2^{3}$.
Leading zeroes: k = 1, s = 1.111111, e = 2.
Rounding: $s \cong 10.00000$.
Normalization: $s \cong 1.00000$, e = 3.
Final result: $z \cong 1.00000 \times 2^{3}$.

Comment 12.3
The difference of two positive numbers could produce an underflow as the final
value of e could be smaller than emin.
Table 12.3 Effective operation in floating point adder–subtractor

Operation  Sign1  Sign2  Actual operation
0          0      0      s1 + s2
0          0      1      s1 - s2
0          1      0      -(s1 - s2)
0          1      1      -(s1 + s2)
1          0      0      s1 - s2
1          0      1      s1 + s2
1          1      0      -(s1 + s2)
1          1      1      -(s1 - s2)

12.2.3 Addition and Subtraction

Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, and a control variable operation, an algorithm is
defined for computing

$z = (-1)^{sign} \cdot s \cdot B^{e} = (-1)^{sign_1} \cdot s_1 \cdot B^{e_1} + (-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, if operation = 0;

$z = (-1)^{sign} \cdot s \cdot B^{e} = (-1)^{sign_1} \cdot s_1 \cdot B^{e_1} - (-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, if operation = 1.

Once the significands have been aligned, the actual operation (addition or
subtraction of the significands) depends on the values of operation, sign1 and sign2
(Table 12.3). The following algorithm computes z. The procedure swap (a, b)
interchanges a and b.
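
The selection of the actual operation can be expressed with two exclusive-ORs. The Python sketch below is one possible formulation derived from the two defining equations for z; the function name effective_operation and the returned pair are illustrative choices, not taken from the book's listing.

```python
def effective_operation(operation: int, sign1: int, sign2: int):
    """Return (subtract, negate): the result is (-1)^negate * (s1 - s2 if subtract else s1 + s2)."""
    actual_sign2 = operation ^ sign2      # a subtraction flips the sign of the second operand
    subtract = sign1 ^ actual_sign2       # different effective signs: subtract the significands
    negate = sign1                        # the sign of the first operand is factored out
    return subtract, negate

for op in (0, 1):
    for s1 in (0, 1):
        for s2 in (0, 1):
            print(op, s1, s2, effective_operation(op, s1, s2))
```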

Algorithm 12.3: Addition and subtraction


12.2.4 Multiplication

Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their product $(-1)^{sign} \cdot s \cdot B^{e}$ is computed
as follows:

$sign = sign_1 \text{ xor } sign_2$, $s = s_1 \cdot s_2$, $e = e_1 + e_2$. (12.7)

The value of s belongs to the interval $1 \le s \le (B - ulp)^{2}$, and could be greater
than or equal to B. If it is the case, that is if $B \le s \le (B - ulp)^{2}$, then
(normalization) substitute s by s/B, and e by e + 1. The new value of s satisfies

$1 \le s \le (B - ulp)^{2}/B = B - 2\,ulp + (ulp)^{2}/B < B - ulp$ (12.8)

(ulp < B so that 2 - ulp/B > 1). It remains to round the significand and to
normalize if necessary.

Algorithm 12.4: Multiplication
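
The multiplication steps can be sketched in Python with integer significands, which is closer to a hardware datapath. The sketch is a reconstruction from the prose, restricted to B = 2 with ulp = 2^-5 (as in the examples) and round to nearest, ties to even. Significands are given in units of ulp, so 1.00001 is the integer 0b100001; the name fp_mul is illustrative.

```python
P = 5   # fractional bits of the significand (ulp = 2**-5); an assumption of this sketch

def fp_mul(sign1, m1, e1, sign2, m2, e2):
    """Multiply two normalized binary numbers; m1, m2 are significands in units of ulp."""
    sign = sign1 ^ sign2
    prod, e = m1 * m2, e1 + e2            # prod carries 2*P fractional bits
    shift = P
    if prod >= 1 << (2 * P + 1):          # s >= 2: normalization (divide by B)
        shift, e = P + 1, e + 1
    q, r = divmod(prod, 1 << shift)       # q: kept bits, r: discarded bits
    half = 1 << (shift - 1)
    if r > half or (r == half and q & 1): # round to nearest, ties to even
        q += 1
    if q == 1 << (P + 1):                 # rounding produced 10.00000
        q, e = q >> 1, e + 1
    return sign, q, e

# 1.11101 * 2^3 times 1.00011 * 2^-1
print(fp_mul(0, 0b111101, 3, 0, 0b100011, -1))   # (0, 33, 3), i.e. 1.00001 * 2^3
```

Overflow and underflow detection are left out of the sketch.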

Examples 12.5
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. The exponent e is represented in decimal.

1. Compute $z = (1.10110 \times 2^{-2}) \times (1.00010 \times 2^{5})$.
Multiplication: $z = 01.1100101100 \times 2^{3}$.
Rounding: $s \cong 1.11001$.
Final result: $z \cong 1.11001 \times 2^{3}$.

2. Compute $z = (1.11101 \times 2^{3}) \times (1.00011 \times 2^{-1})$.
Multiplication: $z = 10.0001010111 \times 2^{2}$.
Normalization: s = 1.00001010111, e = 3.
Rounding: $s \cong 1.00001$.
Final result: $z \cong 1.00001 \times 2^{3}$.

3. Compute $z = (1.01000 \times 2^{1}) \times (1.10011 \times 2^{2})$.
Multiplication: $z = 01.1111111000 \times 2^{3}$.
Normalization: not necessary (s < 2), e = 3.
Rounding: $s \cong 10.00000$.
Normalization: $s \cong 1.00000$, e = 4.
Final result: $z \cong 1.00000 \times 2^{4}$.

Comment 12.4
The product of two real numbers could produce an overflow or an underflow as the
final value of e could be greater than emax or smaller than emin (addition of two
negative exponents).

12.2.5 Division

Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their quotient $(-1)^{sign} \cdot s \cdot B^{e}$ is computed
as follows:

$sign = sign_1 \text{ xor } sign_2$, $s = s_1 / s_2$, $e = e_1 - e_2$. (12.9)

The value of s belongs to the interval $1/B < s \le B - ulp$, and could be smaller
than 1. If that is the case, that is if $s = s_1/s_2 < 1$, then $s_1 < s_2$,
$s_1 \le s_2 - ulp$, $s_1/s_2 \le 1 - ulp/s_2 < 1 - ulp/B$, and $1/B < s < 1 - ulp/B$.
Then (normalization) substitute s by $s \cdot B$, and e by e - 1. The new value of
s satisfies $1 < s < B - ulp$. It remains to round the significand.

Algorithm 12.5: Division
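
A Python sketch of the division steps, again with integer significands in units of ulp, is shown below. It is a reconstruction from the prose, restricted to B = 2 with ulp = 2^-5 and round to nearest, ties to even; the name fp_div is illustrative. No second normalization is coded, in line with the bound $1 < s < B - ulp$ derived above.

```python
P = 5   # fractional bits of the significand (ulp = 2**-5); an assumption of this sketch

def fp_div(sign1, m1, e1, sign2, m2, e2):
    """Divide two normalized binary numbers; m1, m2 are significands in units of ulp."""
    sign = sign1 ^ sign2
    e = e1 - e2
    if m1 < m2:                           # s < 1: normalization (multiply by B)
        m1, e = 2 * m1, e - 1
    q, r = divmod(m1 << P, m2)            # quotient in units of ulp, exact remainder
    if 2 * r > m2 or (2 * r == m2 and q & 1):   # round to nearest, ties to even
        q += 1
    return sign, q, e

# 1.01000 * 2^1 divided by 1.10011 * 2^2
print(fp_div(0, 0b101000, 1, 0, 0b110011, 2))   # (0, 50, -2), i.e. 1.10010 * 2^-2
```

Overflow and underflow detection are left out of the sketch.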

Examples 12.6
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. The exponent e is represented in
decimal.

1. Compute $z = (1.11101 \times 2^{3}) / (1.00011 \times 2^{-1})$.
Division: $z = 1.1011111000 \times 2^{4}$.
Rounding (truncation; round to nearest would give 1.11000): $s \cong 1.10111$.
Final result: $z \cong 1.10111 \times 2^{4}$.

2. Compute $z = (1.01000 \times 2^{1}) / (1.10011 \times 2^{2})$.
Division: $z = 0.1100100011 \times 2^{-1}$.
Normalization: s = 1.100100011, e = -2.
Rounding: $s \cong 1.10010$.
Final result: $z \cong 1.10010 \times 2^{-2}$.

Comment 12.5
The quotient of two real numbers could produce an underflow or an overflow as the
final value of e could be smaller than emin or bigger than emax. Observe that, unlike
addition, subtraction and multiplication, a second normalization is not necessary.

12.2.6 Square Root

Given a positive floating-point number $s_1 \cdot B^{e_1}$, its square root $s \cdot B^{e}$ is
computed as follows:

if $e_1$ is even, $s = (s_1)^{1/2}$, $e = e_1/2$; (12.10)

if $e_1$ is odd, $s = (s_1/B)^{1/2}$, $e = (e_1 + 1)/2$. (12.11)

In the first case (12.10), $1 \le s \le (B - ulp)^{1/2} < B - ulp$. In the second case
(12.11), $(1/B)^{1/2} \le s < 1$. Hence (normalization) s must be substituted by
$s \cdot B$ and e by e - 1, so that $1 \le s < B$. It remains to round the significand and
to normalize if necessary.

Algorithm 12.6: Square root

An alternative is to replace (12.11) by:

if $e_1$ is odd, $s = (s_1 \cdot B)^{1/2}$, $e = (e_1 - 1)/2$. (12.12)

In this case $B^{1/2} \le s \le (B^{2} - ulp \cdot B)^{1/2} < B$; then the first
normalization is not necessary. Nevertheless, s could lie in the interval
B - ulp < s < B, and then, depending on the rounding strategy, normalization after
rounding could still be necessary.
Algorithm 12.7: Square root, second version
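
The second version can be sketched in Python using an integer square root. The sketch is a reconstruction from the prose, restricted to B = 2 with ulp = 2^-5 and round to nearest; the name fp_sqrt is illustrative. With an integer radicand a tie cannot occur, and round to nearest never produces 10.00000, so no normalization after rounding is coded.

```python
from math import isqrt

P = 5   # fractional bits of the significand (ulp = 2**-5); an assumption of this sketch

def fp_sqrt(m1: int, e1: int):
    """Square root of m1 * 2^-P * 2^e1, with 2^P <= m1 < 2^(P+1); returns (m, e)."""
    if e1 % 2:                   # odd exponent: Eq. (12.12), s = sqrt(2 * s1)
        m1, e1 = 2 * m1, e1 - 1
    e = e1 // 2
    x = m1 << P                  # radicand scaled so that the root is in units of ulp
    q = isqrt(x)                 # truncated root
    if x - q * q > q:            # exact root lies above q + 1/2: round up
        q += 1
    return q, e

print(fp_sqrt(0b111101, 4))      # (44, 2), i.e. 1.01100 * 2^2
print(fp_sqrt(0b111111, 3))      # (63, 1), i.e. 1.11111 * 2^1
```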

Note that the ‘‘round to nearest’’ (default rounding in IEEE 754-2008) and the
‘‘truncation’’ rounding schemes allow avoiding the second normalization.

Examples 12.7
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in the
form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. The exponent e is represented in decimal form.

1. Compute $z = (1.11101 \times 2^{4})^{1/2}$.
Square rooting: $z = 1.01100001 \times 2^{2}$.
Rounding: $s \cong 1.01100$.
Final result: $z \cong 1.01100 \times 2^{2}$.

2. Compute $z = (1.00101 \times 2^{-1})^{1/2}$.
Even exponent: s = 10.0101, e = -2.
Square rooting: $z = 1.10000101 \times 2^{-1}$.
Rounding (truncation; round to nearest would give 1.10001): $s \cong 1.10000$.
Final result: $z \cong 1.10000 \times 2^{-1}$.

3. Compute $z = (1.11111 \times 2^{3})^{1/2}$.
Even exponent: s = 11.1111, e = 2.
Square rooting: $z = 1.11111011 \times 2^{1}$.
Rounding: $s \cong 1.11111$ (round to nearest).
Final result: $z \cong 1.11111 \times 2^{1}$.

However, some rounding schemes (e.g. toward infinity) generate $s \cong 10.00000$.
Then, the result after normalization is $s \cong 1.00000$, e = 2, and the final result is
$z \cong 1.00000 \times 2^{2}$.

Comment 12.6
The square rooting of a real number could produce an underflow as the final value
of e could be smaller than emin.

12.3 Rounding Schemes

Given a real number x and a floating-point representation system, the following
situations could happen:
$|x| < s_{min} \cdot B^{e_{min}}$, that is, an underflow situation,
$|x| > s_{max} \cdot B^{e_{max}}$, that is, an overflow situation,
$|x| = s \cdot B^{e}$, where $e_{min} \le e \le e_{max}$ and $s_{min} \le s \le s_{max}$.
In the third case, either s is a multiple of ulp, in which case a rounding operation
is not necessary, or it is included between two successive multiples s' and s'' of ulp:
s' < s < s''.
The rounding operation associates to s either s' or s'', according to some
rounding strategy. The most common are the following ones:
• The truncation method (also called round toward 0 or chopping) is accomplished
by dropping the extra digits, i.e. round(s) = s' if s is positive, round(s) = s''
if s is negative.
• The round toward plus infinity is defined by round(s) = s''.
• The round toward minus infinity is defined by round(s) = s'.
• The round to nearest method associates s with the closest value, that is, if
s < s' + ulp/2, round(s) = s', and if s > s' + ulp/2, round(s) = s''.
If the distances to s' and s'' are the same, that is, if s = s' + ulp/2, there are
several options. For instance:
• round(s) = s';
• round(s) = s'';
• round(s) = s' if s is positive, round(s) = s'' if s is negative. It is the round to
nearest, ties to zero scheme.
• round(s) = s'' if s is positive, round(s) = s' if s is negative. It is the round to
nearest, ties away from zero scheme.
• round(s) = s' if s' is an even multiple of ulp, round(s) = s'' if s'' is an even
multiple of ulp. It is the default scheme in the IEEE 754 standard.
• round(s) = s' if s' is an odd multiple of ulp, round(s) = s'' if s'' is an odd
multiple of ulp.

The preceding schemes (round to the nearest) produce the smallest absolute
error, and the two last (tie to even, tie to odd) also produce the smallest average
absolute error (unbiased or 0-bias representation systems).
Assume now that the exact result of an operation, after normalization, is

$s = 1.s_{-1} s_{-2} s_{-3} \ldots s_{-p} \,|\, s_{-(p+1)} s_{-(p+2)} s_{-(p+3)} \ldots$

where ulp is equal to $B^{-p}$ (the | symbol indicates the separation between the digit
which corresponds to the ulp and the following ones). Whatever the chosen rounding
scheme, it is not necessary to have previously computed all the digits
$s_{-(p+1)} s_{-(p+2)} \ldots$; it is sufficient to know the first discarded digit and
whether all the following digits are equal to 0, or not. For example the following
algorithm computes round(s) if the round to the nearest, tie to even scheme is used.

Algorithm 12.8: Round to the nearest, tie to even
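
The decision can be sketched in Python from exactly the three pieces of information mentioned above: the retained significand, the first discarded digit, and whether the remaining digits are all zero. This is a reconstruction, not the book's listing. It assumes an even base (B = 2 or B = 10) and a retained significand given as an integer number of ulps; the name round_tie_even is illustrative.

```python
def round_tie_even(s1: int, next_digit: int, rest_nonzero: bool, B: int = 2) -> int:
    """Round to nearest, ties to even.

    s1: retained significand in units of ulp,
    next_digit: first discarded digit s_-(p+1),
    rest_nonzero: True if any later discarded digit is non-zero.
    Assumes an even base B, so that B/2 is a digit value.
    """
    half = B // 2
    if next_digit > half or (next_digit == half and rest_nonzero):
        return s1 + 1                     # discarded part above ulp/2
    if next_digit == half and s1 % 2 == 1:
        return s1 + 1                     # exact tie: choose the even multiple
    return s1                             # below ulp/2, or tie with s1 even

print(round_tie_even(0b100110, 1, False))        # 38: 1.00110|10 stays 1.00110
print(round_tie_even(0b111111, 1, False))        # 64: 1.11111|1 becomes 10.00000
print(round_tie_even(125, 5, False, B=10))       # 126: 1.25|5 becomes 1.26
```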

In order to execute the preceding algorithm it is sufficient to know the value of
$s_1 = 1.s_{-1} s_{-2} s_{-3} \ldots s_{-p}$, the value of $s_{-(p+1)}$, and whether
$s_2 = 0.00 \ldots 0 \,|\, 0\, s_{-(p+2)} s_{-(p+3)} \ldots$ is equal to 0, or not.

12.3.1 Rounding Schemes in IEEE 754

From the previous description, the IEEE 754-2008 standard defines five rounding
algorithms. The first two round to a nearest value; the others are called directed
roundings:
• Round to nearest, ties to even; this is the default rounding for binary floating-
point and the recommended default for decimal.
• Round to nearest, ties away from zero.
• Round toward 0: directed rounding towards zero.
• Round toward $+\infty$: directed rounding towards positive infinity.
• Round toward $-\infty$: directed rounding towards negative infinity.

12.4 Guard Digits

Consider the exact result r of an operation, before normalization. According to the
preceding paragraph:

$r < B^{2}$, i.e. $r = r_1 r_0 . r_{-1} r_{-2} r_{-3} \ldots r_{-p} \,|\, r_{-(p+1)} r_{-(p+2)} r_{-(p+3)} \ldots$

The normalization operation (if necessary) is accomplished by
• dividing the result by B (sum of positive numbers, multiplication),
• multiplying the result by B (division),
• multiplying the result by $B^{k}$ (difference of positive numbers).
Furthermore, if the operation is a difference of positive numbers (Algorithm
12.2), consider two cases:
• if $e_1 - e_2 \ge 2$, then $r = s_1 - s_2/B^{e_1 - e_2} > 1 - B/B^{2} = 1 - 1/B \ge 1/B$
(as $B \ge 2$), so that the number k of leading zeroes is equal to 0 or 1, and
the normalization operation (if necessary, i.e. k = 1) is accomplished by
multiplying the result by B;
• if $e_1 - e_2 \le 1$, then the result before normalization is either

$r_0 . r_{-1} r_{-2} r_{-3} \ldots r_{-p} \,|\, r_{-(p+1)}\, 0\, 0 \ldots$ ($e_1 - e_2 = 1$), or

$r_0 . r_{-1} r_{-2} r_{-3} \ldots r_{-p} \,|\, 0\, 0\, 0 \ldots$ ($e_1 - e_2 = 0$).

A consequence of the preceding analysis is that the result after normalization
can be either

$r_0 . r_{-1} r_{-2} r_{-3} \ldots r_{-p} \,|\, r_{-(p+1)} r_{-(p+2)} r_{-(p+3)} \ldots$ (no normalization operation), (12.13)

or

$r_1 . r_0 r_{-1} r_{-2} \ldots r_{-p+1} \,|\, r_{-p} r_{-(p+1)} r_{-(p+2)} \ldots$ (divide by B), (12.14)

or

$r_{-1} . r_{-2} r_{-3} r_{-4} \ldots r_{-(p+1)} \,|\, r_{-(p+2)} r_{-(p+3)} r_{-(p+4)} \ldots$ (multiply by B), (12.15)

or

$r_{-k} . r_{-(k+1)} r_{-(k+2)} \ldots r_{-p} r_{-(p+1)}\, 0 \ldots 0 \,|\, 0\, 0 \ldots$ (multiply by $B^{k}$ where k > 1). (12.16)

For executing a rounding operation, the worst case is (12.15). In particular, for
executing Algorithm 12.8, it is necessary to know
• the value of $s_1 = r_{-1} . r_{-2} r_{-3} r_{-4} \ldots r_{-(p+1)}$,
• the value of $r_{-(p+2)}$,
• whether $s_2 = 0.00 \ldots 0 \,|\, 0\, r_{-(p+3)} r_{-(p+4)} \ldots$ is equal to 0, or not.
The conclusion is that the result r of an operation, before normalization, must
be computed in the form

$r \cong r_1 r_0 . r_{-1} r_{-2} r_{-3} \ldots r_{-p} \,|\, r_{-(p+1)} r_{-(p+2)}\, T$,

that is, with two guard digits $r_{-(p+1)}$ and $r_{-(p+2)}$, and an additional sticky digit
T equal to 0 if all the other digits $r_{-(p+3)}, r_{-(p+4)}, \ldots$ are equal to 0, and equal to
any positive value otherwise.
After normalization, the significand will be obtained in the following general form:

$s \cong 1.s_{-1} s_{-2} s_{-3} \ldots s_{-p} \,|\, s_{-(p+1)} s_{-(p+2)} s_{-(p+3)}$.
The new version of Algorithm 12.8 is the following:

Algorithm 12.9: Round to the nearest, tie to even, second version

Observe that in binary representation, the following algorithm is even simpler.

Algorithm 12.10: Round to the nearest, tie to even, third version
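
In binary, the three extra positions reduce to three bits, and the rounding decision becomes a comparison of those bits with 100. The Python sketch below is a reconstruction under that reading, not the book's listing; the name round_binary_grs and the packing of the significand with its three extra bits into one integer are assumptions made here.

```python
def round_binary_grs(s_ext: int) -> int:
    """Round to nearest, ties to even, for B = 2.

    s_ext holds the normalized significand followed by three extra bits
    (two guard bits and the sticky bit) in its three least significant positions.
    Returns the rounded significand in units of ulp (it may equal 2^(p+1),
    in which case a final normalization is still required).
    """
    kept, extra = s_ext >> 3, s_ext & 0b111
    if extra > 0b100 or (extra == 0b100 and kept & 1):
        kept += 1        # above one half, or exactly one half with an odd kept value
    return kept

print(bin(round_binary_grs(0b100110_100)))   # 0b100110  (tie, kept value even)
print(bin(round_binary_grs(0b111111_100)))   # 0b1000000 (tie, kept value odd)
print(bin(round_binary_grs(0b100001_011)))   # 0b100001  (below one half)
```

Written as a Boolean function of the extra bits g, r, t and the least significant kept bit l, the round-up condition is g and (r or t or l).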

12.5 Arithmetic Circuits

This section proposes basic implementations of the arithmetic operations, namely
addition and subtraction, multiplication, division and square root. The implementation
is based on the previous sections devoted to the algorithms, rounding and guard digits.

The circuits support normalized binary IEEE 754-2008 operands. Regarding the
binary subnormals the associated hardware to manage this situation is complex.
Some floating point implementations solve operations with subnormals via soft-
ware routines. In the FPGA arena, most cores do not support denormalized
numbers. The dynamic range can be increased using fewer resources by increasing
the size of the exponent (a 1-bit increase in exponent, roughly doubles the dynamic
range) and is typically the solution adopted.

12.5.1 Adder–Subtractor

An adder-subtractor based on Algorithm 12.3 will now be synthesized. The
operands are supposed to be in IEEE 754 binary encoding. It is made up of five
parts, namely unpacking, alignment, addition, normalization and rounding, and
packing. The following implementation does not support subnormal numbers; they
are interpreted as zero.

12.5.1.1 Unpacking

The unpacking separates the constitutive parts of the floating-point operands and
additionally detects the special numbers (infinities, zeros and NaNs). The special number
detection is implemented using simple comparators. The following VHDL process
defines the unpacking of a floating point operand FP; k is the number of bits of FP,
w is the number of bits of the exponent, and p is the significand precision.

The unpacking is implemented using two w-bit comparators, one p-bit comparator
and some additional gates for the rest of the conditions.

12.5.1.2 Alignment

The alignment circuit implements the three first lines of the Algorithm 12.3, i.e.

Fig. 12.2 Alignment circuit in floating point addition/subtraction (inputs e1, e2,
sign1, sign2, operation, s1, s2; two exponent subtractors, multiplexers and a right
shifter from 0 to p+3; outputs e, sign, new_sign2, aligned_s1, aligned_s2)

An example of implementation is shown in Fig. 12.2. The principal and more
complex component is the right shifter.
Given a (p + 3)-bit input vector $[1\ a_{k-1}\ a_{k-2} \ldots a_1\ a_0\ 0\ 0\ 0]$, the shifter
generates a (p + 3)-bit output vector. The lsb (least significant bit) of the output is the
sticky bit indicating whether the shifted-out bits are all equal to zero or not. If B = 2,
the sticky-digit circuit is an OR circuit.
Observe that if $e_1 - e_2 \ge p + 3$, then the shifter output is equal to [0 0 … 0 1],
since the last bit is the sticky bit and the input number is non-zero. The operation
with zero is treated differently.
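
The behavior of the shifter can be modeled in a few lines of Python. This is a behavioral sketch, not the VHDL component of the book. It assumes that the sticky bit is obtained by ORing every shifted-out bit into the least significant output bit, and the name right_shift_sticky is illustrative.

```python
def right_shift_sticky(x: int, shift: int, width: int) -> int:
    """Shift a width-bit vector right by `shift`; the output lsb ORs in all lost bits."""
    if shift >= width:                    # everything is shifted out
        return 1 if x else 0              # [0 0 ... 0 1] for a non-zero input
    lost = x & ((1 << shift) - 1)         # bits that fall off the right end
    return (x >> shift) | (1 if lost else 0)

# p = 6 (hidden 1 plus 5 fraction bits), so the vectors are p + 3 = 9 bits wide
print(f"{right_shift_sticky(0b100010_000, 4, 9):09b}")   # 000010001
print(f"{right_shift_sticky(0b110011_000, 6, 9):09b}")   # 000000111
print(f"{right_shift_sticky(0b100000_000, 9, 9):09b}")   # 000000001
```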

12.5.1.3 Addition and Subtraction

Depending on the respective signs of the aligned operands, one of the following
operations must be executed: if they have the same sign, the sum
aligned_s1 + aligned_s2 must be computed; if they have different signs, the difference
aligned_s1 - aligned_s2 is computed. If the result of the significand difference is
negative, then aligned_s2 - aligned_s1 must be computed. Moreover, the only
situation in which the final result could be negative is when e1 = e2.
In the circuit of Fig. 12.3 two additions are performed in parallel:
result = aligned_s1 ± aligned_s2, where the actual operation is selected with the signs
of the operands, and alt_result = s2 - s1. At the end of this stage a multiplexer selects
the correct result. The operation selection is done as follows:

12.5.1.4 Normalization and Rounding

The normalization circuit executes the following part of Algorithm 12.3:

If the number of leading zeroes is greater than p + 3, i.e. $s_1 - s_2 < B^{-(p+2)}$,
then $s_2 > s_1 - B^{-(p+2)}$. If $e_1$ were greater than $e_2$ then
$s_2 \le (B - ulp)/B = 1 - B^{-(p+1)}$, so that
$1 - B^{-(p+1)} \ge s_2 > s_1 - B^{-(p+2)} \ge 1 - B^{-(p+2)}$, which is
impossible. Thus, the only case where the number of leading zeroes can be greater
than p + 3 is when e1 = e2 and s1 = s2. If more than p + 3 leading 0's are
detected in the circuit of Fig. 12.4, a zero_flag is raised.
It remains to execute the following algorithm where operation is the internal
operation computed in Fig. 12.3:

Fig. 12.3 Effective addition and subtraction (a (p+4)-bit adder/subtractor computes
result from aligned_s1 and aligned_s2, a (p+1)-bit subtractor computes alt_result from
s2 and s1, and a four-input significand selection multiplexer produces signif)


The rounding depends on the selected rounding strategy. An example of
rounding circuit implementation is shown in Fig. 12.4. If the round to the nearest,
tie to even method is used (Algorithm 12.10), the block named rounding decision
computes the following Boolean function decision:

The second rounding is avoided in the following way. The only possibility to need a
second rounding is when, as the result of an addition, the significand is 1.111…1111xx.
This situation is detected in a combinational block that generates the signal "isTwo"
and adds one to the exponent. After rounding, the resulting number is 10.000…000, but
the two most significant bits are discarded and the hidden 1 is appended.

12.5.1.5 Packing

The packing stage joins the constituent parts of the floating point result.
Additionally, for the special cases (infinities, zeros and NaNs), it generates the
corresponding encoding.
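The packing step can be illustrated with a small Python model. It assumes the binary interchange layout (1 sign bit, E exponent bits, P - 1 stored fraction bits), a significand that includes the hidden 1, and a priority NaN, then infinity, then zero among the flags. The function name, the flag names and that priority order are assumptions made for the sketch.

```python
def pack(sign, biased_e, s, is_inf=False, is_zero=False, is_nan=False, E=8, P=24):
    """Assemble a binary interchange word of K = 1 + E + (P - 1) bits.

    s is the P-bit significand including the hidden 1; only its P - 1
    fraction bits are stored. Special cases override the regular fields.
    """
    frac_bits = P - 1
    all_ones = (1 << E) - 1
    if is_nan:
        exponent, fraction = all_ones, 1 << (frac_bits - 1)  # quiet NaN
    elif is_inf:
        exponent, fraction = all_ones, 0
    elif is_zero:
        exponent, fraction = 0, 0
    else:
        exponent, fraction = biased_e, s & ((1 << frac_bits) - 1)
    return (sign << (E + frac_bits)) | (exponent << frac_bits) | fraction
```

With the default binary32 parameters, `pack(0, 127, 0x800000)` gives the word 3F800000 (the value 1.0) and `pack(1, 0, 0, is_inf=True)` gives FF800000.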
[Figure 12.4: block diagram. Inputs: signif, e, sign, operation, alt_result > 0.
Blocks: leading 0's detection (k), left shift by B^k with exponent decrement by k,
division by B with exponent increment, isTwo detection (comparison with 111...111),
sign computation, rounding decision on bits s3, s2, s1, s0, and a p-bit half adder
that adds 1 ulp to s(p+2:3). Outputs: new_s, new_e, new_sign, zero_flag.]

Fig. 12.4 Normalization and rounding

Example 12.8 (complete VHDL code available)
Generate the VHDL model of an IEEE binary floating-point adder-subtractor. It
is made up of the five previously described blocks. Fig. 12.5 summarizes the
interconnections. For clarity and reusability the code is written using parameters,
where K is the size of the floating point numbers (sign, exponent, significand), E is
the size of the exponent and P is the size of the significand (including the hidden
1). The entity declaration of the circuit is:

For simplicity the code is written as a single VHDL file, except for additional files
that describe the right shifter of Fig. 12.2 and the leading zero detection and
shifting of Fig. 12.4. The code is available at the book home page.
[Figure 12.5: block diagram. Inputs: FP1, FP2, add/sub. Blocks, from top to bottom:
Unpack (sign1, sign2, e1, e2, s1, s2, isZ1, isZ2, isInf1, isInf2, isNaN1, isNaN2),
Alignment (sign, new_sign2, e, aligned_s1, aligned_s2), Addition / Subtraction
(operation, significand), Normalization and Rounding (new_sign, new_e, new_s,
zero_flag), Conditions Detection (overflow, underflow) and Pack (isInf, isZero,
isNaN). Output: FP.]

Fig. 12.5 General structure of a floating point adder-subtractor

A two stage pipeline could be achieved by dividing the data path between the
addition/subtraction stage and the normalization and rounding stage (dotted line in
Fig. 12.5).

12.5.2 Multiplier

A basic multiplier deduced from Algorithm 12.4 is shown in Fig. 12.6. The
unpacking and packing circuits are the same as in the case of the adder-subtractor
(Fig. 12.5, Sects. 12.5.1.1 and 12.5.1.5) and, for simplicity, are not drawn. The
"normalization and rounding" block is a simplified version of Fig. 12.4, where the
part related to the subtraction is not necessary.
[Figure 12.6: block diagram. Inputs: s1, s2, e1, e2, sign1, sign2. Blocks: p-by-p-bit
multiplier (product bits 2p-1 .. 0), exponent adder, sticky digit generation from
product bits p-4 .. 0, prod(p+3 .. 0) built from product bits 2p-1 .. p-3,
Normalization (division by B, exponent increment, isTwo detection, selection on
prod(p+3)) and Rounding (rounding decision on s3, s2, s1, s0 and a p-bit half adder
that adds 1 ulp to s(p+2:3)). Outputs: s, e, sign.]

Fig. 12.6 General structure of a floating point multiplier

The obvious method of computing the sticky bit is with a large fan-in OR gate
on the low order bits of the product. Observe that, in this case, the critical path
includes the p-by-p-bit multiplication and the sticky digit generation.
An alternative method consists of determining the number of trailing zeros in
the two inputs of the multiplier. In binary, it is easy to demonstrate that the
number of trailing zeros in the product is equal to the sum of the number of
trailing zeros in each input operand. Notice that this method does not require the
actual low order product bits, just the input operands, so the computation can occur
in parallel with the actual multiply operation, removing the sticky computation from
the critical path.
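The equivalence of the two methods can be illustrated with a short Python sketch. It assumes B = 2, p-bit integer significands, and that the sticky bit is the OR of the p - 3 low order product bits (bits p-4 .. 0, as in Fig. 12.6). The function names are hypothetical.

```python
def trailing_zeros(x):
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1


def sticky_from_product(s1, s2, p):
    """Reference method: OR of the p - 3 low order bits of the product."""
    return int((s1 * s2) & ((1 << (p - 3)) - 1) != 0)


def sticky_from_operands(s1, s2, p):
    """Same bit derived from the operands only, without the product."""
    return int(trailing_zeros(s1) + trailing_zeros(s2) < p - 3)
```

The second function needs only two trailing zero counts, one small addition and one comparison, which corresponds to the priority encoders, adder and comparator of the alternative method.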
The drawback of this method is that significant extra hardware is required. This
hardware includes two long priority encoders to count the number of trailing zeros
in the input operands, a small adder, and a small comparator. On the other hand,
some hardware is eliminated, since the actual low order bits of the product are no
longer needed.

[Figure 12.7: block diagram. Inputs: s1, s2, e1, e2, sign1, sign2, isInf1, isNan1,
isZero1, isInf2, isNan2, isZero2. Blocks: p-by-p-bit multiplier (product bits
2p-1 .. 0, of which bits 2p-1 .. p-3 are kept), sticky bit generation, exponent
adder, conditions detection (isZero, isInf, isNan), Normalization and Rounding on
prod(p+3 .. 0), and conditions adjust. Outputs: s, e, sign, underflow, overflow,
isZero, isInf, isNan.]

Fig. 12.7 A better general structure of a floating point multiplier
A faster floating point multiplier architecture that computes the p-by-p
multiplication and the sticky bit in parallel is presented in Fig. 12.7. The dotted
lines suggest a three stage pipeline implementation using a two stage p-by-p
multiplication. The two extra blocks handle the detection of special conditions. In
the second block, the range of the exponent is checked to detect overflow and
underflow conditions. In this figure the packing and unpacking processes are omitted
for simplicity.

Example 12.9 (complete VHDL code available)

Generate the VHDL model of a generic floating-point multiplier. It is made up of
the blocks depicted in Fig. 12.7, described in a single VHDL file. For clarity and
reusability the code is written using parameters, where K is the size of the floating
point numbers (sign, exponent, significand), E is the size of the exponent and P is
the size of the significand (including the hidden 1). The entity declaration of the
circuit is:
[Figure 12.8: block diagram. Inputs: s1, s2, e1, e2, sign1, sign2. Blocks: p-by-p-bit
divider producing q (bits p+2 .. 0) and r (bits p-1 .. 0), exponent subtractor,
sticky digit generation, div(p+3 .. 0), Normalization (multiplication by B with
exponent decrement, selected by div(p+2)) and Rounding. Outputs: s, e, sign.]

Fig. 12.8 A general structure of a floating point divider

The code is available at the home page of this book. The combinational circuit
registers inputs and outputs to ease the synchronization. A two or three stage
pipeline is easily achievable by adding the intermediate registers as suggested in
Fig. 12.7. In order to increase the clock frequency, more pipeline registers can be
inserted into the integer multiplier.
[Figure 12.9: block diagram. Inputs: s1, s2, e1, e2, sign1, sign2. Blocks: exponent
subtractor, p-by-p-bit divider producing q and the sticky bit, div(p+3 .. 0), and
Normalization and Rounding. Outputs: s, e, sign.]

Fig. 12.9 A pipelined structure of a floating point divider

12.5.3 Divider

A basic divider, deduced from Algorithm 12.5, is shown in Fig. 12.8. The
unpacking and packing circuits are similar to those of the adder-subtractor or
multiplier. The "normalization and rounding" block is a simplified version of
Fig. 12.4, where the part related to the subtraction is not necessary.
The inputs of the p-bit divider are s1 and s2. The first operand s1 is internally
divided by B (s1/B, i.e. right shifted by one position) so that the dividend is
smaller than the divisor. The precision is chosen equal to p + 3 digits. Thus, the
outputs quotient and remainder satisfy the relation
$(s_1/B) \cdot B^{p+3} = s_2 \cdot q + r$ where $r < s_2$, that is,

$s_1/s_2 = q \cdot B^{-(p+2)} + (r/s_2) \cdot B^{-(p+2)}$ where $(r/s_2) \cdot B^{-(p+2)} < B^{-(p+2)}$.

The sticky digit is equal to 1 if r > 0 and to 0 if r = 0. The final approximation
of the exact result is

$quotient = q \cdot B^{-(p+2)} + sticky\_digit \cdot B^{-(p+3)}$.
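These relations can be tried out with a small Python model. It assumes B = 2, normalized integer significands of p + 1 bits (ulp = 2^-p) and a restoring-style division, so that the remainder is never negative. The function names are hypothetical.

```python
from fractions import Fraction


def divide_significands(s1, s2, p):
    """Quotient with p + 3 bits plus sticky bit, for B = 2.

    Returns (q, r, sticky) such that
    (s1/2) * 2**(p+3) = s2*q + r with 0 <= r < s2.
    """
    q, r = divmod(s1 << (p + 2), s2)  # (s1/2) * 2**(p+3) = s1 * 2**(p+2)
    sticky = int(r != 0)
    return q, r, sticky


def quotient_value(q, sticky, p):
    """Approximation q*2**-(p+2) + sticky*2**-(p+3) as an exact fraction."""
    return Fraction(q, 1 << (p + 2)) + Fraction(sticky, 1 << (p + 3))
```

For p = 5, dividing 1.00000 by 1.10000 (the integers 32 and 48) gives q = 85 with a non-zero remainder, so the sticky digit is 1 and the approximation is 171/256, within 2^-(p+2) of the exact value 2/3.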


If a non-restoring divider is used, a further optimization can be made in the
sticky bit computation. In a non-restoring divider the final remainder can be
negative, in which case a final correction step would normally be required. This
final operation can be avoided: the sticky bit is 0 if the remainder is equal to
zero, otherwise it is 1.
A divider is a time consuming circuit. In order to obtain circuits with
frequencies similar to those of floating point multipliers or adders, more pipeline
stages are necessary. Figure 12.9 shows a possible pipeline for a floating point
divider where the integer divider is also pipelined.

Example 12.10 (complete VHDL code available)

Generate the VHDL model of a generic floating-point divider. It is made up of the
blocks depicted in Fig. 12.9. Most of the design is described in a single VHDL file,
except for the integer divider. The integer divider is a non-restoring divider
(div_nr_wsticky.vhd) that uses a basic non-restoring cell (a_s.vhd). For clarity
and reusability the code is written using parameters, where K is the size of the
floating point numbers (sign, exponent, significand), E is the size of the exponent
and P is the size of the significand (including the hidden 1). The entity declaration
of the circuit is:

12.5.4 Square Root

A basic square rooter deduced from Algorithm 12.7 is shown in Fig. 12.10. The
unpacking and packing circuits are the same as in previous operations and, for
simplicity, are not drawn. Remember that the first normalization is not necessary,
and for most rounding strategies the second normalization is not necessary either.
The exponent is calculated as follows:
$E = e_1/2 + bias = (E_1 - bias)/2 + bias = E_1/2 + bias/2.$

• If e1 is even, then both E1 and bias are odd (bias is always odd). Thus,
$E = \lfloor E_1/2 \rfloor + \lfloor bias/2 \rfloor + 1$, where $\lfloor E_1/2 \rfloor$ and $\lfloor bias/2 \rfloor$ amount to right shifts.
• If e1 is odd, then E1 is even. The significand is multiplied by 2 and the exponent
reduced by one unit. The biased exponent is
$E = (E_1 - 1)/2 + bias/2 = E_1/2 + \lfloor bias/2 \rfloor.$
To summarize, the biased exponent is $E = \lfloor E_1/2 \rfloor + \lfloor bias/2 \rfloor + parity(E_1).$
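The summary formula can be compared with a direct evaluation in a few lines of Python. The sketch assumes that parity(E1) denotes the least significant bit of E1 and that the bias is odd, as stated above. Both function names are hypothetical.

```python
def sqrt_biased_exponent(E1, bias):
    """Biased exponent of the square root: two right shifts plus the parity of E1."""
    return (E1 >> 1) + (bias >> 1) + (E1 & 1)


def sqrt_biased_exponent_reference(E1, bias):
    """Direct evaluation: make the unbiased exponent even, halve it, add the bias."""
    e1 = E1 - bias
    if e1 % 2 != 0:  # odd exponent: significand doubled, exponent reduced by one
        e1 -= 1
    return e1 // 2 + bias
```

For binary32 (bias = 127), an input exponent E1 = 129 (e1 = 2) gives E = 64 + 63 + 1 = 128, that is e = 1, and E1 = 130 (e1 = 3) gives E = 65 + 63 + 0 = 128 as well.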
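[Figure 12.10: block diagram. Inputs: s1, e1. Blocks: exponent parity (mod 2)
selection with decrement by 1, exponent halving (/2), (p+2)-bit square rooter on
bits p+1 .. 0 producing Q and sticky, sqrt(p+1 .. 0) with sticky bit (bits
p+2 .. 0), and Normalization. Outputs: s, e.]

Fig. 12.10 Structure of a floating point squarer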

Since the integer square root is a complex operation, a pipelined floating-point
square rooter should be based on a pipelined integer square root. The dotted line of
Fig. 12.10 shows a possible five stage pipeline circuit.

Example 12.11 (complete VHDL code available)

Generate the VHDL model of a generic floating-point square rooter. It is made up
of the blocks depicted in Fig. 12.10. The design is described in a single VHDL file,
except for the integer square root. The integer square root is based on a
non-restoring algorithm (sqrt_wsticky.vhd) that uses two basic non-restoring cells
(sqrt_cell.vhd and sqrt_cell_00.vhd). For clarity and reusability, the code is
written using parameters, where K is the size of the floating point numbers (sign,
exponent, significand), E is the size of the exponent and P is the size of the
significand (including the hidden 1). The entity declaration of the circuit is:

Table 12.4 Combinational floating point operators in binary32 format

Operator        FF    LUTs   DSP48   Slices   Delay
FP_add          96     699     –       275     11.7
FP_mult         96     189     2       105      8.4
FP_mult_luts    98     802     –       234      9.7
FP_div         119     789     –       262     46.6
FP_sqrt         64     409     –       123     38.0

Table 12.5 Combinational floating point operators in binary64 format

Operator        FF    LUTs   DSP48   Slices   Delay
FP_add         192    1372     –       585     15.4
FP_mult        192     495    15       199     15.1
FP_mult_luts   192    3325     –       907     12.5
FP_div         244    3291     –       903    136.9
FP_sqrt        128    1651     –       447     97.6

Table 12.6 Pipelined floating point operators in binary32 format

Operator        FF    LUTs   DSP48   Slices   Period   Latency
FP_add         137     637     –       247      6.4       2
FP_mult        138     145     2        76      5.6       2
FP_mult_luts   142     798     –       235      7.1       2
FP_mult        144     178     2        89      3.8       3
FP_mult_luts   252     831     –       272      5.0       3
FP_sqrt        384     815     –       266      9.1       6
FP_div         212     455     –       141      9.2       5

12.5.5 Implementation Results

Several implementation results are now presented. The circuits were implemented
in a Virtex 5, speed grade 2, device, using ISE 13.1 and XST for synthesis.
Tables 12.4 and 12.5 show combinational circuit implementations for binary32
and binary64 floating point operators. The inputs and outputs are registered. When
the number of registers (FF) is greater than the number of inputs and outputs, this
is due to the register duplication performed by the synthesizer. The multiplier is
implemented using the embedded multiplier (DSP48) and general purpose logic.
Table 12.6 shows the results for pipelined versions in the case of binary32
data. The circuits include input and output registers. The adder is pipelined in two
stages (Fig. 12.5). The multiplier is segmented using three pipeline stages
(Fig. 12.7). The divider latency is equal to 6 cycles and the square root latency is
equal to five cycles (Figs. 12.9 and 12.10).

12.6 Exercises

1. How many bits are there in the exponent and the significand of a 256-bit
binary floating point number? What are the ranges of the exponent and the
bias?
2. Convert the following decimal numbers to the binary32 and binary64 floating-
point format. (a) 123.45; (b) -1.0; (c) 673.498e10; (d) qNaN; (e) -1.345e-129;
(f) ∞; (g) 0.1; (h) 5.1e5
3. Convert the following binary32 numbers (given in hexadecimal) to the
corresponding decimal numbers.
(a) 08F05352; (b) 7FC00000; (c) AAD2CBC4; (d) FF800000; (e) 484B0173;
(f) E9E55838; (g) E9E55838.
4. Add, subtract, multiply and divide the following binary floating point numbers
with B = 2 and $ulp = 2^{-5}$, so that the numbers are represented in the form
$s \cdot 2^{e}$ where $1 \le s \le 1.11111_2$. For simplicity e is written in decimal (base 10).
(a) $1.10101 \times 2^{3}$ op $1.10101 \times 2^{1}$
(b) $1.00010 \times 2^{-1}$ op $1.00010 \times 2^{-1}$
(c) $1.00010 \times 2^{-3}$ op $1.10110 \times 2^{2}$
(d) $1.10101 \times 2^{3}$ op $1.00000 \times 2^{4}$
5. Add, subtract, multiply and divide the following decimal floating point
numbers using B = 10 and $ulp = 10^{-4}$, so that the numbers are represented in
the form $s \cdot 10^{e}$ where $1 \le s \le 9.9999$ (normalized decimal numbers).
(a) $9.4375 \times 10^{3}$ op $8.6247 \times 10^{2}$
(b) $1.0014 \times 10^{3}$ op $9.9491 \times 10^{2}$
(c) $1.0714 \times 10^{4}$ op $7.1403 \times 10^{2}$
(d) $3.4518 \times 10^{-1}$ op $7.2471 \times 10^{3}$
6. Analyze the consequences and implications of supporting denormalized
(subnormal in IEEE 754-2008) numbers in the basic operations.
7. Analyze the hardware implications of dealing with non-normalized significands
(s) instead of normalized ones as in the binary standard.
8. Generate VHDL models adding a pipeline stage to the binary floating point
adder of Sect. 12.5.1.
9. Add a pipeline stage to the binary floating point multiplier of Sect. 12.5.2.
10. Generate VHDL models adding two pipeline stages to the binary floating point
multiplier of Sect. 12.5.2.
11. Generate VHDL models adding several pipeline stages to the binary floating
point divider of Sect. 12.5.3.
12. Generate VHDL models adding several pipeline stages to the binary floating
point square root of Sect. 12.5.4.

Reference

1. IEEE (2008) IEEE standard for floating-point arithmetic, 29 Aug 2008


J.-P. Deschamps et al., Guide to FPGA Implementation of Arithmetic Functions, 305
