Sutter Capitulo-12
Many data processing applications (e.g. image and voice processing) use a large
range of values and need a relatively high precision. In such cases, instead of
encoding the information as integers or fixed-point numbers, an alternative
solution is a floating-point representation. The first section of this chapter
describes the IEEE standard for floating-point arithmetic. The next section is
devoted to the algorithms that execute the basic arithmetic operations. The two
following sections define the main rounding methods and introduce the concept of
guard digit. Finally, the last sections propose basic implementations of the
arithmetic operations, namely addition and subtraction, multiplication, division
and square root.
12.1.1 Formats
Formats in IEEE 754 describe sets of floating-point data and encodings for
interchanging them. A format represents a finite subset of the real numbers.
The finite numbers may be expressed either in base 2 (binary) or in base 10
(decimal). Each finite number is described by a triplet of integers: the sign
(zero or one), the significand s (also known as coefficient or mantissa), and the
exponent e. The numerical value of the represented number is
$(-1)^{sign} \times s \times B^{e}$, where B is the base (2 or 10).
For example, if sign = 1, s = 123456, e = -3 and B = 10, then the represented
number is -123.456.
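As a quick check of the value formula, the following minimal Python sketch evaluates the triplet (sign, s, e) exactly with rational arithmetic. The function name `fp_value` is hypothetical and only serves this illustration.

```python
from fractions import Fraction

def fp_value(sign, s, e, base):
    """Return (-1)**sign * s * base**e as an exact rational number."""
    return (-1) ** sign * s * Fraction(base) ** e

# The example above: sign = 1, s = 123456, e = -3, B = 10.
x = fp_value(1, 123456, -3, 10)
print(x, float(x))
```

Using `Fraction` avoids any rounding in the check itself, which matters when the base is 10 and the result is not exactly representable in binary.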
The format also allows the representation of infinite numbers ($+\infty$ and
$-\infty$), and of special values, called Not a Number (NaN), to represent
invalid values. In fact there are two kinds of NaN: qNaN (quiet) and sNaN
(signaling). A quiet NaN propagates through most operations, whereas a signaling
NaN raises the invalid-operation exception when it is used as an operand; the
latter is intended for diagnostic purposes.
The values that can be represented are determined by the base (B), the number
of digits of the significand (precision p), and the minimum and maximum values
$e_{min}$ and $e_{max}$ of e. Hence s has at most p digits: read as an integer it
belongs to the range 0 to $B^{p} - 1$, and e, the exponent that applies when the
radix point follows the first digit of s, is an integer such that
$e_{min} \le e \le e_{max}$.
For example if B = 10 and p = 7 then s is included between 0 and 9999999. If
$e_{min} = -95$ and $e_{max} = 96$ (Table 12.1) and the significand is read as a
7-digit integer, its exponent ranges from $e_{min} - (p - 1) = -101$ to
$e_{max} - (p - 1) = 90$. The smallest non-zero positive number that can be
represented is then $1 \times 10^{-101}$, the largest is $9999999 \times 10^{90}$
($9.999999 \times 10^{96}$), and the full range of numbers is from
$-9.999999 \times 10^{96}$ to $9.999999 \times 10^{96}$. The numbers
$-1 \times 10^{-95}$ and $1 \times 10^{-95}$ are the smallest (in magnitude)
normal numbers. Non-zero numbers between these smallest numbers are called
subnormal (also denormalized) numbers.
Zero values are finite values whose significand is 0. The sign bit specifies if a
zero is +0 (positive zero) or -0 (negative zero).
The arithmetic format, based on the four parameters B, p, emin and emax, defines the
set of represented numbers, independently of the encoding that will be chosen for
storing and interchanging them (Table 12.1). The interchange formats define fixed-
length bit-strings intended for the exchange of floating-point data. There are some
differences between binary and decimal interchange formats. Only the binary format
will be considered in this chapter. A complete description of both the binary and the
decimal format can be found in the document of the IEEE 754-2008 Standard [1].
For the interchange of binary floating-point numbers, formats of lengths equal
to 16, 32, 64, 128, and any multiple of 32 bits for lengths bigger than 128, are
defined (Table 12.2). The 16-bit format is only for the exchange or storage of small
numbers.

Table 12.1 Binary and decimal floating point format in IEEE 754-2008

              Binary formats (B = 2)                        Decimal formats (B = 10)
Parameter     Binary16   Binary32   Binary64   Binary128    Decimal32  Decimal64  Decimal128
p, digits     10 + 1     23 + 1     52 + 1     112 + 1      7          16         34
emax          +15        +127       +1023      +16383       +96        +384       +6144
emin          -14        -126       -1022      -16382       -95        -383       -6143
Common name   Half       Single     Double     Quadruple
              precision  precision  precision  precision
The binary interchange encoding scheme is the same as in the IEEE 754-1985
standard. The k-bit strings are made up of three fields (Fig. 12.1):
• a 1-bit sign S,
• a w-bit biased exponent E = e + bias,
• the p - 1 trailing bits of the significand; the missing leading bit is encoded
in the exponent (hidden first bit).
Each binary floating-point number has just one encoding. In the following
description, the significand s is expressed in scientific notation, with the radix point
immediately following the first digit. To make the encoding unique, the value of the
significand s is maximized by decreasing e until either $e = e_{min}$ or $s \ge 1$
(normalization). After normalization, there are two possibilities regarding the
significand:
• If $s \ge 1$ and $e \ge e_{min}$ then a normalized number of the form
$1.d_1 d_2 \ldots d_{p-1}$ is obtained. The first "1" is not stored (implicit leading 1).
• If $e = e_{min}$ and $0 < s < 1$ then a subnormal number of the form
$0.d_1 d_2 \ldots d_{p-1}$ is obtained. It is encoded with a biased exponent E = 0 and
an implicit leading 0.
Example 12.1
Convert the decimal number -9.6875 to its binary32 representation.
• First convert the absolute value of the number to binary (Chap. 10):
$9.6875_{10} = 1001.1011_{2}$.
• Normalize: $1001.1011 = 1.0011011 \times 2^{3}$. Hence e = 3, s = 1.0011011.
• Hide the first bit and complete with 0's up to 23 bits:
00110110000000000000000.
• Add bias to the exponent. In this case, w = 8, bias = $2^{w-1} - 1 = 127$ and thus
$E = e + bias = 3 + 127 = 130_{10} = 10000010_{2}$.
• Compose the final 32-bit representation:
$1\ 10000010\ 00110110000000000000000_{2} = C11B0000_{16}$.
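The conversion steps of Example 12.1 can be cross-checked against the binary32 encoder of the Python standard library. This is only a verification sketch; the helper name `binary32_fields` is hypothetical.

```python
import struct

def binary32_fields(x):
    """Split the binary32 encoding of x into (word, sign, E, T)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31            # 1-bit sign
    E = (word >> 23) & 0xFF      # 8-bit biased exponent
    T = word & 0x7FFFFF          # 23-bit trailing significand
    return word, sign, E, T

word, sign, E, T = binary32_fields(-9.6875)
print(f"{word:08X} sign={sign} E={E:08b} T={T:023b}")
```

The printed fields are the sign, the biased exponent and the trailing significand bits, in the order used to compose the 32-bit word.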
Example 12.2
Convert the following binary32 numbers to their decimal representation (T denotes
the trailing significand field).
• $7FC00000_{16}$: sign = 0, $E = FF_{16}$, $T \ne 0$; hence it is a NaN. Since the
first bit of T is 1, it is a quiet NaN.
• $FF800000_{16}$: sign = 1, $E = FF_{16}$, T = 0; hence it is $-\infty$.
• $6545AB78_{16}$: sign = 0, $E = CA_{16} = 202_{10}$, $e = E - bias = 202 - 127 = 75_{10}$,
$T = 10001011010101101111000_{2}$,
$s = 1.10001011010101101111000_{2} = 1.5442953_{10}$.
The number is $1.10001011010101101111_{2} \times 2^{75} = 5.8341827 \times 10^{22}$.
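The decoding rules used in Example 12.2 can be sketched as a small Python function that works field by field. The name `decode_binary32` is hypothetical, and the treatment of subnormals (no hidden 1, exponent fixed at -126) is included as an assumption for completeness, since the example itself only uses NaN, infinity and a normal number.

```python
import math

def decode_binary32(word):
    """Decode a 32-bit word (given as an integer) into a Python float."""
    sign = word >> 31
    E = (word >> 23) & 0xFF
    T = word & 0x7FFFFF
    if E == 0xFF:                       # infinities and NaNs
        if T != 0:
            return math.nan
        return -math.inf if sign else math.inf
    if E == 0:                          # zeros and subnormals: no hidden 1
        magnitude = math.ldexp(T, -126 - 23)
    else:                               # normal numbers: hidden 1, e = E - 127
        magnitude = math.ldexp((1 << 23) | T, E - 127 - 23)
    return -magnitude if sign else magnitude

for word in (0x7FC00000, 0xFF800000, 0x6545AB78):
    print(f"{word:08X} -> {decode_binary32(word)}")
```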
First analyze the main arithmetic operations and generate the corresponding
computation algorithms. In what follows it will be assumed that the significand s is
represented in base B (in binary if B = 2, in decimal if B = 10) and that it belongs
to the interval $1 \le s \le B - ulp$, where ulp stands for the unit in the last position
or unit of least precision. Thus a number is expressed in the form
$(s_0 . s_{-1} s_{-2} \ldots s_{-p}) \cdot B^{e}$ where $e_{min} \le e \le e_{max}$ and
$1 \le s_0 \le B - 1$.
Comment 12.1
The binary subnormals and the decimal floating point are not normalized numbers
and are not included in the following analysis. This situation deserves some special
treatment and is out of the scope of this section.
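Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$,
their sum $s \cdot B^{e}$ is computed as follows: assume that $e_1$ is greater than or
equal to $e_2$; then (alignment) the sum of $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$
can be expressed in the form $s \cdot B^{e}$ where
$$s = s_1 + s_2 / B^{e_1 - e_2} \quad \text{and} \quad e = e_1. \qquad (12.1)$$
The value of s belongs to the interval
$$1 \le s \le 2 \cdot B - 2 \cdot ulp, \qquad (12.2)$$
so that s could be greater than or equal to B. If it is the case, that is if
$$B \le s \le 2 \cdot B - 2 \cdot ulp, \qquad (12.3)$$
then (normalization) s is substituted by s/B and e by e + 1. It remains to round
the significand and, if rounding produces s = B, to normalize once more.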
Examples 12.3
Assume that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in the form
$s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. For simplicity e is written in decimal
(base 10).
1. Compute $z = 1.10101 \times 2^{3} + (1.00010 \times 2^{-1})$.
Alignment: $z = (1.10101 + 0.000100010) \times 2^{3} = 1.101110010 \times 2^{3}$.
Rounding: $s \cong 1.10111$.
Final result: $z \cong 1.10111 \times 2^{3}$.
Comments 12.2
1. The addition of two positive numbers could produce an overflow as the final
value of e could be greater than emax.
2. Observe in the previous examples the lack of precision due to the small number
of bits (6 bits) used in the significand s.
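The alignment, normalization and rounding steps of the addition can be sketched in Python on the toy format of Examples 12.3. Significands are held as integers scaled by $2^{P}$; the names `fp_add` and `round_nearest_even` are hypothetical, and round to nearest with ties to even is assumed as the rounding rule.

```python
P = 5  # fractional bits of the significand, so ulp = 2**-P (toy format)

def round_nearest_even(n, drop):
    """Drop the `drop` low bits of n, rounding to nearest, ties to even."""
    if drop <= 0:
        return n << -drop
    q, rem = n >> drop, n & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    return q + 1 if rem > half or (rem == half and q & 1) else q

def fp_add(s1, e1, s2, e2):
    """Add two positive numbers s * 2**(e - P), with s in [2**P, 2**(P+1))."""
    if e1 < e2:                      # make e1 the larger exponent
        s1, e1, s2, e2 = s2, e2, s1, e1
    d = e1 - e2
    total = (s1 << d) + s2           # exact aligned sum, in units of 2**-(P+d)
    e, drop = e1, d
    if total >= 2 << (P + d):        # s >= B: normalize (s/B, e + 1)
        drop, e = drop + 1, e + 1
    s = round_nearest_even(total, drop)
    if s == 2 << P:                  # rounding produced s = B: normalize again
        s, e = s >> 1, e + 1
    return s, e

s, e = fp_add(0b110101, 3, 0b100010, -1)   # 1.10101 * 2^3 + 1.00010 * 2^-1
print(f"{s:06b} e={e}")
```

Keeping the aligned sum as an exact integer before rounding plays the role of an unlimited number of guard digits; a hardware implementation would keep only a few of them plus a sticky digit.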
Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$,
their difference $s \cdot B^{e}$ is computed as follows: assume that $e_1$ is greater
than or equal to $e_2$; then (for alignment) the difference between
$s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$ can be expressed in the form $s \cdot B^{e}$ where
$$s = s_1 - s_2 / B^{e_1 - e_2} \quad \text{and} \quad e = e_1. \qquad (12.6)$$
The value of s belongs to the interval $-(B - ulp) \le s \le B - ulp$. If s is
negative, then it is substituted by $-s$ and the sign of the final result will be
modified accordingly. If s is equal to 0, then an exception equal_zero could be
raised. It remains to consider the case where $0 < s \le B - ulp$. The value of s
could be smaller than 1. In order to normalize the significand, s is substituted by
$s \cdot B^{k}$ and e by e - k, where k is the minimum exponent such that
$s \cdot B^{k} \ge 1$. Thus, the relation $1 \le s < B$ holds. It remains to round (up
or down) the significand and to normalize it if necessary.
In the following algorithm, the function leading_zeroes(s) computes the
smallest k such that $s \cdot B^{k} \ge 1$.
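The subtraction steps (alignment, change of sign, leading-zero normalization, rounding) can be sketched in Python on a toy binary format with ulp = $2^{-5}$. This is an illustrative sketch, not the book's algorithm listing: `fp_sub` and `round_nearest_even` are hypothetical names, significands are integers scaled by $2^{P}$, round to nearest with ties to even is assumed, and a zero result is reported as the triple (0, 0, 0) instead of raising an exception.

```python
P = 5  # fractional bits of the significand, so ulp = 2**-P (toy format)

def round_nearest_even(n, drop):
    """Drop the `drop` low bits of n (shift left if drop <= 0), ties to even."""
    if drop <= 0:
        return n << -drop
    q, rem = n >> drop, n & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    return q + 1 if rem > half or (rem == half and q & 1) else q

def leading_zeroes(mag, width):
    """Smallest k such that mag * 2**k fills a field of `width` bits."""
    return width - mag.bit_length()

def fp_sub(s1, e1, s2, e2):
    """Return (sign, s, e) for s1*2**e1 - s2*2**e2 (significands scaled by 2**P)."""
    sign = 0
    if e1 < e2:                      # swap so that e1 >= e2, negating the result
        s1, e1, s2, e2 = s2, e2, s1, e1
        sign = 1
    d = e1 - e2
    diff = (s1 << d) - s2            # exact aligned difference
    if diff == 0:
        return 0, 0, 0               # equal_zero case
    if diff < 0:
        diff, sign = -diff, sign ^ 1
    k = leading_zeroes(diff, P + d + 1)
    e = e1 - k                       # s * B**k, e - k
    s = round_nearest_even(diff, d - k)
    if s == 2 << P:                  # rounding produced s = B: normalize again
        s, e = s >> 1, e + 1
    return sign, s, e

print(fp_sub(0b100010, 3, 0b110110, 2))    # 1.00010 * 2^3 - 1.10110 * 2^2
```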
Examples 12.4
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. For computing the difference,
the 2's complement representation is used (one extra bit is used).
1. Compute $z = (1.10101 \times 2^{-2}) - 1.01010 \times 2^{1}$.
Alignment: $z = (0.00110101 - 1.01010) \times 2^{1}$.
2's complement addition: $(00.00110101 + 10.10101 + 00.00001) \times 2^{1} = 10.11100101 \times 2^{1}$.
Change of sign: $s = 01.00011010 + 00.00000001 = 01.00011011$.
Rounding: $s \cong 1.00011$.
Final result: $z \cong -1.00011 \times 2^{1}$.
2. Compute $z = 1.00010 \times 2^{3} - 1.10110 \times 2^{2}$.
Alignment: $z = (1.00010 - 0.110110) \times 2^{3}$.
2's complement addition: $(01.00010 + 11.001001 + 00.000001) \times 2^{3} = 00.001110 \times 2^{3}$.
Leading zeroes: k = 3, s = 1.11000, e = 0.
Final result: $z = 1.11000 \times 2^{0}$.
3. Compute $z = 1.01010 \times 2^{3} - 1.01001 \times 2^{1}$.
Alignment: $z = (1.01010 - 0.0101001) \times 2^{3} = 0.1111111 \times 2^{3}$.
Leading zeroes: k = 1, s = 1.111111, e = 2.
Rounding: $s \cong 10.00000$.
Normalization: $s \cong 1.00000$, e = 3.
Final result: $z \cong 1.00000 \times 2^{3}$.
Comment 12.3
The difference of two positive numbers could produce an underflow as the final
value of e could be smaller than emin.
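Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, and a control variable operation, an algorithm
is defined for computing their sum or their difference, according to the value of
operation.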
12.2.4 Multiplication
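Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their product $(-1)^{sign} \cdot s \cdot B^{e}$ is
computed as follows:
$$sign = sign_1 \ \text{xor} \ sign_2, \quad s = s_1 \cdot s_2, \quad e = e_1 + e_2. \qquad (12.7)$$
The value of s belongs to the interval $1 \le s \le (B - ulp)^{2}$, and could be greater
than or equal to B. If it is the case, that is if $B \le s \le (B - ulp)^{2}$, then
(normalization) substitute s by s/B, and e by e + 1. The new value of s satisfies
$1 \le s \le (B - ulp)^{2}/B < B - ulp$. It remains to round the significand and to
normalize it again if rounding produces s = B.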
Examples 12.5
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. The exponent e is represented in
decimal.
1. Compute $z = (1.10110 \times 2^{-2}) \times (1.00010 \times 2^{5})$.
Multiplication: $z = 01.1100101100 \times 2^{3}$.
Rounding: $s \cong 1.11001$.
Final result: $z \cong 1.11001 \times 2^{3}$.
2. Compute $z = (1.11101 \times 2^{3}) \times (1.00011 \times 2^{-1})$.
Multiplication: $z = 10.0001010111 \times 2^{2}$.
Normalization: s = 1.00001010111, e = 3.
Rounding: $s \cong 1.00001$.
Final result: $z \cong 1.00001 \times 2^{3}$.
3. Compute $z = (1.01000 \times 2^{1}) \times (1.10011 \times 2^{2})$.
Multiplication: $z = 01.1111111000 \times 2^{3}$ (no normalization is needed, e = 3).
Rounding: $s \cong 10.00000$.
Normalization: $s \cong 1.00000$, e = 4.
Final result: $z \cong 1.00000 \times 2^{4}$.
Comment 12.4
The product of two real numbers could produce an overflow or an underflow as the
final value of e could be greater than emax or smaller than emin (addition of two
negative exponents).
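The multiplication rule (12.7), followed by normalization and rounding, can be sketched in Python for a toy binary format with ulp = $2^{-5}$. The names `fp_mul` and `round_nearest_even` are hypothetical, significands are integers scaled by $2^{P}$, and round to nearest with ties to even is an assumed rounding rule. Overflow and underflow of the exponent are not checked.

```python
P = 5  # fractional bits of the significand, so ulp = 2**-P (toy format)

def round_nearest_even(n, drop):
    """Drop the `drop` low bits of n, rounding to nearest, ties to even."""
    q, rem = n >> drop, n & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    return q + 1 if rem > half or (rem == half and q & 1) else q

def fp_mul(sign1, s1, e1, sign2, s2, e2):
    """Multiply two numbers (-1)**sign * s * 2**(e - P); return (sign, s, e)."""
    sign = sign1 ^ sign2
    prod = s1 * s2                   # exact product, scaled by 2**(2P)
    e, drop = e1 + e2, P
    if prod >= 2 << (2 * P):         # s >= B: normalize (s/B, e + 1)
        drop, e = drop + 1, e + 1
    s = round_nearest_even(prod, drop)
    if s == 2 << P:                  # rounding produced s = B: normalize again
        s, e = s >> 1, e + 1
    return sign, s, e

print(fp_mul(0, 0b111101, 3, 0, 0b100011, -1))   # 1.11101*2^3 times 1.00011*2^-1
```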
12.2.5 Division
Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and
$(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their quotient $(-1)^{sign} \cdot s \cdot B^{e}$ is
computed as follows:
$$sign = sign_1 \ \text{xor} \ sign_2, \quad s = s_1 / s_2, \quad e = e_1 - e_2. \qquad (12.9)$$
The value of s belongs to the interval $1/B < s \le B - ulp$, and could be smaller
than 1. If that is the case, that is if $s = s_1/s_2 < 1$, then $s_1 < s_2$,
$s_1 \le s_2 - ulp$, $s_1/s_2 \le 1 - ulp/s_2 < 1 - ulp/B$, and $1/B < s < 1 - ulp/B$.
Then (normalization) substitute s by $s \cdot B$, and e by e - 1. The new value of
s satisfies $1 < s < B - ulp$. It remains to round the significand.
Examples 12.6
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in
the form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. The exponent e is represented in
decimal.
1. Compute $z = (1.11101 \times 2^{3}) / (1.00011 \times 2^{-1})$.
Division: $z = 1.1011111000 \times 2^{4}$.
Rounding: $s \cong 1.11000$ (round to nearest).
Final result: $z \cong 1.11000 \times 2^{4}$.
2. Compute $z = (1.01000 \times 2^{1}) / (1.10011 \times 2^{2})$.
Division: $z = 0.1100100011 \times 2^{-1}$.
Normalization: s = 1.100100011, e = -2.
Rounding: $s \cong 1.10010$.
Final result: $z \cong 1.10010 \times 2^{-2}$.
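The division rule (12.9) can be sketched in Python with an integer quotient, two guard bits and a sticky bit taken from the remainder. The names `fp_div` and `round_nearest_even` are hypothetical, the toy format has ulp = $2^{-5}$ with significands scaled by $2^{P}$, and round to nearest with ties to even is an assumed rounding rule.

```python
P = 5  # fractional bits of the significand, so ulp = 2**-P (toy format)

def round_nearest_even(n, drop):
    """Drop the `drop` low bits of n, rounding to nearest, ties to even."""
    q, rem = n >> drop, n & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    return q + 1 if rem > half or (rem == half and q & 1) else q

def fp_div(sign1, s1, e1, sign2, s2, e2):
    """Divide two numbers (-1)**sign * s * 2**(e - P); return (sign, s, e)."""
    sign = sign1 ^ sign2
    e = e1 - e2
    if s1 < s2:                          # quotient below 1: normalize (s*B, e - 1)
        s1, e = s1 << 1, e - 1
    q, r = divmod(s1 << (P + 2), s2)     # quotient with two guard bits
    t = (q << 1) | (r != 0)              # append the sticky bit
    s = round_nearest_even(t, 3)         # drop guard bits and sticky bit
    return sign, s, e

print(fp_div(0, 0b101000, 1, 0, 0b110011, 2))   # 1.01000*2^1 / 1.10011*2^2
```

No second normalization is coded because the quotient of two normalized significands never exceeds B - ulp, so rounding cannot reach B.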
Comment 12.5
The quotient of two real numbers could produce an underflow or an overflow as the
final value of e could be smaller than emin or bigger than emax. Observe that,
unlike addition, subtraction and multiplication, division does not require a second
normalization.
In this case $B^{1/2} \le s \le (B^{2} - ulp \cdot B)^{1/2} < B$; then the first
normalization is not necessary. Nevertheless, s could satisfy $B - ulp < s < B$, and
then, depending on the rounding strategy, normalization after rounding could still
be necessary.
Algorithm 12.7: Square root, second version
Note that the ‘‘round to nearest’’ (default rounding in IEEE 754-2008) and the
‘‘truncation’’ rounding schemes allow avoiding the second normalization.
Examples 12.7
Assume again that B = 2 and ulp = $2^{-5}$, so that the numbers are represented in the
form $s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. The exponent e is represented in
decimal form.
1. Compute $z = (1.11101 \times 2^{4})^{1/2}$.
Square rooting: $z = 1.01100001 \times 2^{2}$.
Rounding: $s \cong 1.01100$.
Final result: $z \cong 1.01100 \times 2^{2}$.
2. Compute $z = (1.00101 \times 2^{-1})^{1/2}$.
Even exponent: s = 10.0101, e = -2.
Square rooting: $z = 1.10000101 \times 2^{-1}$.
Rounding: $s \cong 1.10001$ (round to nearest).
Final result: $z \cong 1.10001 \times 2^{-1}$.
3. Compute $z = (1.11111 \times 2^{3})^{1/2}$.
Even exponent: s = 11.1111, e = 2.
Square rooting: $z = 1.11111011 \times 2^{1}$.
Rounding: $s \cong 1.11111$ (round to nearest).
Final result: $z \cong 1.11111 \times 2^{1}$.
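The square root steps used in these examples (make the exponent even, take the root of the significand, round) can be sketched in Python with an integer square root. The names `fp_sqrt` and `round_nearest_even` are hypothetical, the toy format has ulp = $2^{-5}$ with significands scaled by $2^{P}$, and round to nearest with ties to even is an assumed rounding rule.

```python
from math import isqrt

P = 5  # fractional bits of the significand, so ulp = 2**-P (toy format)

def round_nearest_even(n, drop):
    """Drop the `drop` low bits of n, rounding to nearest, ties to even."""
    q, rem = n >> drop, n & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    return q + 1 if rem > half or (rem == half and q & 1) else q

def fp_sqrt(s, e):
    """Square root of s * 2**(e - P); return (s, e) of the rounded result."""
    if e % 2:                        # odd exponent: double s, decrement e
        s, e = s << 1, e - 1
    rad = s << (P + 4)               # radicand scaled so the root has 2 guard bits
    q = isqrt(rad)
    t = (q << 1) | (q * q != rad)    # append the sticky bit
    root = round_nearest_even(t, 3)
    e //= 2
    if root == 2 << P:               # kept for rounding modes that can reach B
        root, e = root >> 1, e + 1
    return root, e

print(fp_sqrt(0b111101, 4))          # (1.11101 * 2^4)^(1/2)
```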
Comment 12.6
The square rooting of a real number could produce an underflow as the final value
of e could be smaller than emin.
The preceding schemes (round to the nearest) produce the smallest absolute
error, and the two last (tie to even, tie to odd) also produce the smallest average
absolute error (unbiased or 0-bias representation systems).
Assume now that the exact result of an operation, after normalization, is
$$s = 1.s_{-1} s_{-2} s_{-3} \ldots s_{-p} \,|\, s_{-(p+1)} s_{-(p+2)} s_{-(p+3)} \ldots$$
where ulp is equal to $B^{-p}$ (the | symbol indicates the separation between the digit
which corresponds to the ulp and the following ones). Whatever the chosen rounding
scheme, it is not necessary to have previously computed all the digits
$s_{-(p+1)} s_{-(p+2)} \ldots$; it is sufficient to know the digit $s_{-(p+1)}$ and
whether all the following digits $s_{-(p+2)} s_{-(p+3)} \ldots$ are equal to 0, or not.
For example the following algorithm computes round(s) if the round to the nearest,
tie to even scheme is used.
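A minimal Python sketch of round to nearest, tie to even for B = 2 is given below. It assumes the kept digits are supplied as an integer number of ulps, together with the first discarded digit and a flag telling whether any later digit is non-zero; the function and parameter names are hypothetical.

```python
def round_nearest_even(kept, round_digit, sticky):
    """Round a binary significand to nearest, ties to even.

    kept        : digits 1.s_-1 ... s_-p as an integer number of ulps
    round_digit : first discarded digit s_-(p+1), 0 or 1
    sticky      : True if any digit after s_-(p+1) is non-zero
    """
    if round_digit == 1 and (sticky or kept & 1):
        return kept + 1      # above the midpoint, or a tie with an odd last digit
    return kept              # below the midpoint, or a tie with an even last digit

print(round_nearest_even(0b110111, 0, True))    # discarded part below half an ulp
print(round_nearest_even(0b111111, 1, False))   # tie, odd last digit: rounds up
```

The result may equal $B^{p+1}$ in ulps (here 64 for p = 5), in which case the caller still has to normalize.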
From the previous description, the IEEE 754-2008 standard defines five rounding
algorithms. The two first round to a nearest value; the others are called directed
roundings:
• Round to nearest, ties to even; this is the default rounding for binary floating-
point and the recommended default for decimal.
• Round to nearest, ties away from zero.
• Round toward 0: directed rounding towards zero.
• Round toward $+\infty$: directed rounding towards positive infinity.
• Round toward $-\infty$: directed rounding towards negative infinity.
or
$$r_1 r_0 r_{-1} r_{-2} \ldots r_{-p+1} \,|\, r_{-p} r_{-(p+1)} r_{-(p+2)} \ldots \quad \text{(divide by B)}, \qquad (12.14)$$
or
$$r_{-1} r_{-2} r_{-3} r_{-4} \ldots r_{-(p+1)} \,|\, r_{-(p+2)} r_{-(p+3)} r_{-(p+4)} \ldots \quad \text{(multiply by B)}, \qquad (12.15)$$
or
that is, with two guard digits $r_{-(p+1)}$ and $r_{-(p+2)}$, and an additional sticky digit
T equal to 0 if all the other digits $r_{-(p+3)}, r_{-(p+4)}, \ldots$ are equal to 0, and
equal to any positive value otherwise.
After normalization, the significand will be obtained in the following general form:
$$s \cong 1.s_{-1} s_{-2} s_{-3} \ldots s_{-p} \,|\, s_{-(p+1)} s_{-(p+2)} s_{-(p+3)}.$$
The circuits support normalized binary IEEE 754-2008 operands. The hardware
needed to handle binary subnormals is complex, and some floating point
implementations process operations with subnormals via software routines. In the
FPGA arena, most cores do not support denormalized numbers. Instead, the dynamic
range is typically extended by increasing the size of the exponent, which uses
fewer resources (a 1-bit increase in the exponent roughly doubles the range of
representable exponents).
12.5.1 Adder–Subtractor
12.5.1.1 Unpacking
The unpacking separates the constitutive parts of the floating-point operands and
additionally detects the special numbers (infinities, zeros and NaNs). The special
number detection is implemented using simple comparators. The following VHDL process
defines the unpacking of a floating point operand FP; k is the number of bits of FP,
w is the number of bits of the exponent, and p is the significand precision.
The previous process is implemented using two w-bit comparators, one p-bit
comparator and some additional gates for the rest of the conditions.
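As a behavioural illustration of the unpacking (written in Python, not the VHDL process of the book), the sketch below splits a k-bit word into its fields and sets the special-number flags with three comparisons. The function name `unpack` and the dictionary keys are hypothetical, and the significand is returned with the hidden 1 appended, which assumes a normalized operand.

```python
def unpack(fp, k=32, w=8):
    """Split a k-bit floating-point word into fields and special-case flags."""
    p = k - w                                  # precision, hidden bit included
    bias = (1 << (w - 1)) - 1
    sign = fp >> (k - 1)
    E = (fp >> (p - 1)) & ((1 << w) - 1)       # biased exponent
    T = fp & ((1 << (p - 1)) - 1)              # trailing significand
    exp_zero = E == 0                          # first w-bit comparator
    exp_ones = E == (1 << w) - 1               # second w-bit comparator
    frac_zero = T == 0                         # significand comparator
    return {
        "sign": sign,
        "e": E - bias,
        "s": (1 << (p - 1)) | T,               # hidden 1 appended
        "is_zero": exp_zero and frac_zero,
        "is_inf": exp_ones and frac_zero,
        "is_nan": exp_ones and not frac_zero,
    }

print(unpack(0xFF800000))
```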
12.5.1.2 Alignment
The alignment circuit implements the first three lines of Algorithm 12.3.
[Figure: alignment circuit. Two subtractors compute e1-e2 and e2-e1; multiplexers
controlled by sign(e1-e2) select the operands; signals actual_sign2, new_s2 and s.]
Depending on the respective signs of the aligned operands, one of the following
operations must be executed: if they have the same sign, the sum
aligned_s1 + aligned_s2 must be computed; if they have different signs, the
difference aligned_s1 - aligned_s2 must be computed.
[Figure: addition and subtraction of the aligned significands. A (p+4)-bit
adder/subtractor and a (p+1)-bit subtractor; signals s1, s2, aligned_s1, aligned_s2,
sign, sign2, operation, result, alt_result, e1, e2, iszero1, iszero2; a four-input
significand selection delivers signif.]
The second rounding is avoided in the following way. A second rounding can only
be needed when, as the result of an addition, the significand is 1.111…1111xx.
This situation is detected by a combinational block that generates the signal
"isTwo" and adds one to the exponent. After rounding, the resulting number is
10.000…000, but the two most significant bits are discarded and the hidden 1 is
appended.
12.5.1.5 Packing
The packing joins the constitutive parts of the floating point result. Additionally,
for the special cases (infinities, zeros and NaNs), it generates the corresponding
encoding.
[Fig. 12.4: normalization and rounding circuit. The isTwo detection (significand
equal to 111...111) and the leading zero count k control the division by B, the
multiplication by B^k and the exponent update (+1, -k); a sign computation block
delivers new_sign; the rounding block uses a p-bit half adder and a rounding
decision based on s3..s0 to deliver new_s and new_e.]
Example 12.8 (Complete VHDL code available)
Generate the VHDL model of an IEEE binary floating-point adder-subtractor. It
is made up of the five previously described blocks. Fig. 12.5 summarizes the
interconnections. For clearness and reusability the code is written using parameters,
where K is the size of the floating point numbers (sign, exponent, significand), E is
the size of the exponent and P is the size of the significand (including the hidden
1). The entity declaration of the circuit is:
For simplicity the code is written as a single VHDL file, except for additional files
that describe the right shifter of Fig. 12.2 and the leading zero detection and
shifting of Fig. 12.4. The code is available at the book home page.
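[Fig. 12.5: general structure of the floating-point adder-subtractor. The operands
FP1 and FP2 and the add/sub control are unpacked into sign1, sign2, e1, e2, s1, s2
and the flags isZ1, isZ2, isInf1, isInf2, isNaN1, isNaN2; the result fields sign, e
and s, with the flags isInf, isZero and isNaN, are packed into FP.]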
A two-stage pipeline could be achieved by dividing the data path between the
addition/subtraction stage and the normalization and rounding stage (dotted line in
Fig. 12.5).
12.5.2 Multiplier
A basic multiplier deduced from Algorithm 12.4 is shown in Fig. 12.6. The
unpacking and packing circuits are the same as in the case of the adder-subtractor
(Fig. 12.5, Sects. 12.5.1.1 and 12.5.1.5) and for simplicity, are not drawn. The
‘‘normalization and rounding’’ is a simplified version of Fig. 12.4, where the part
related to the subtraction is not necessary.
[Fig. 12.6: basic floating-point multiplier. A p-by-p-bit multiplier computes the
product of s1 and s2 and an adder computes the exponent from e1 and e2; the upper
product bits (2p-1 .. p-3) and a sticky digit generated from the lower bits
(p-4 .. 0) form prod(p+3 .. 0), which feeds the normalization (isTwo detection,
division by B, exponent +1) and rounding blocks that deliver s, e and sign.]
The obvious method of computing the sticky bit is with a large fan-in OR gate
on the low order bits of the product. Observe, in this case, that the critical path
includes the p by p bits multiplication and the sticky digit generation.
An alternative method consists of determining the number of trailing zeros in
the two inputs of the multiplier. It is easy to demonstrate that the number of
trailing zeros in the product is equal to the sum of the number of trailing zeros in
each input operand. Notice that this method does not require the actual low order
product bits, just the input operands, so the computation can occur in parallel with
the actual multiply operation, removing the sticky computation from the critical
path.
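The trailing-zeros property can be checked exhaustively on small binary significands with the Python sketch below. The function names are hypothetical; `low_bits` stands for the number of low-order product bits that the sticky bit summarizes.

```python
def trailing_zeros(n):
    """Number of trailing zero bits of a positive integer."""
    return (n & -n).bit_length() - 1

def sticky_from_product(s1, s2, low_bits):
    """OR of the low_bits least significant bits of the product."""
    return ((s1 * s2) & ((1 << low_bits) - 1)) != 0

def sticky_from_operands(s1, s2, low_bits):
    """Same sticky bit, obtained from the operands only."""
    return trailing_zeros(s1) + trailing_zeros(s2) < low_bits

# Exhaustive comparison on 6-bit normalized significands (hidden 1 included).
ok = all(
    sticky_from_product(a, b, n) == sticky_from_operands(a, b, n)
    for a in range(32, 64) for b in range(32, 64) for n in range(1, 12)
)
print(ok)
```

The second function never touches the product, which is why the sticky computation can run in parallel with the multiplier.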
The drawback of this method is that significant extra hardware is required. This
hardware includes two long priority encoders to count the number of trailing zeros
in the input operands, a short adder, and a short comparator. On the other hand,
some hardware is eliminated, since the actual low order bits of the product are no
longer needed.
[Fig. 12.7: faster floating-point multiplier. The multiplier delivers prod(p+3 .. 0)
while a conditions block processes the isZero, isInf and isNan flags; normalization
and rounding is followed by an adjust block; outputs s, e, sign, underflow, overflow,
isZero, isInf and isNan.]
A faster floating point multiplier architecture that computes the p by p
multiplication and the sticky bit in parallel is presented in Fig. 12.7. The dotted
lines suggest a three-stage pipeline implementation using a two-stage p by p
multiplication. Two extra blocks detect the special conditions. In the second block,
the range of the exponent is checked to detect overflow and underflow conditions. In
this figure the packing and unpacking processes are omitted for simplicity.
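[Figure: floating-point divider. A subtractor computes the exponent from e1 and e2
and a p-by-p-bit divider computes the quotient q (bits p+2 .. 0) and the remainder r
(bits p-1 .. 0); the quotient and a sticky digit generated from r form
div(p+3 .. 0), which feeds the normalization (multiplication by B, exponent -1,
selected by div(p+2)) and rounding blocks that deliver s, e and sign.]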
The code is available at the home page of this book. The inputs and outputs of the
combinational circuit are registered to ease the synchronization. A two- or
three-stage pipeline is easily achievable by adding the intermediate registers as
suggested in Fig. 12.7. In order to increase the clock frequency, more pipeline
registers can be inserted into the integer multiplier.
[Figure: floating-point divider, compact view. A subtractor, a p-by-p-bit divider
producing q and the sticky bit (div(p+3 .. 0)), and a normalization and rounding
block that delivers s, e and sign.]
12.5.3 Divider
A basic divider, deduced from Algorithm 12.6, is shown in Fig. 12.9. The
unpacking and packing circuits are similar to those of the adder-subtractor or
multiplier. The ‘‘normalize and rounding’’ is a simplified version of Fig. 12.4,
where the part related to the subtraction is not necessary.
The inputs of the p-bit divider are s1 and s2. The first operand s1 is internally
divided by two (s1/B, i.e. right shifted) so that the dividend is smaller than the
divisor. The precision is chosen equal to p + 3 digits. Thus, the outputs quotient
and remainder satisfy the relation $(s_1/B) \cdot B^{p+3} = s_2 \cdot q + r$ where
$r < s_2$.
A basic square rooter deduced from Algorithm 12.7 is shown in Fig. 12.10. The
unpacking and packing circuits are the same as in previous operations and, for
simplicity, are not drawn. Remember that the first normalization is not necessary,
and for most rounding strategies the second normalization is not necessary either.
The exponent is calculated as follows:
$$E = e_1/2 + bias = (E_1 - bias)/2 + bias = E_1/2 + bias/2.$$
• If $e_1$ is even, then both $E_1$ and bias are odd (bias is always odd). Thus,
$E = \lfloor E_1/2 \rfloor + \lfloor bias/2 \rfloor + 1$ where $\lfloor E_1/2 \rfloor$ and
$\lfloor bias/2 \rfloor$ amount to right shifts.
• If $e_1$ is odd, then $E_1$ is even. The significand is multiplied by 2 and the
exponent reduced by one unit. The biased exponent is
$E = (E_1 - 1)/2 + bias/2 = E_1/2 + \lfloor bias/2 \rfloor$.
To summarize, the biased exponent is
$E = \lfloor E_1/2 \rfloor + \lfloor bias/2 \rfloor + parity(E_1)$.
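The shift-based expression of the biased exponent can be checked against its definition with a short Python sketch. Both function names are hypothetical, parity($E_1$) is taken as $E_1$ mod 2, and bias = 127 (binary32) is used as the default.

```python
def sqrt_biased_exponent(E1, bias=127):
    """Biased exponent of the square root, using shifts and the parity of E1."""
    return (E1 >> 1) + (bias >> 1) + (E1 & 1)

def sqrt_biased_exponent_reference(E1, bias=127):
    """Same value from the definition: halve the (even) unbiased exponent."""
    e1 = E1 - bias
    if e1 % 2:          # odd exponent: significand doubled, exponent reduced by one
        e1 -= 1
    return e1 // 2 + bias

print(all(sqrt_biased_exponent(E1) == sqrt_biased_exponent_reference(E1)
          for E1 in range(1, 255)))
```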
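[Fig. 12.10: basic floating-point square rooter. The exponent path subtracts 1
when the exponent is odd (mod 2) and halves the result (/2); the significand,
extended to bits p+1 .. 0, feeds a (p+2)-bit square rooter whose outputs Q
(bits p+2 .. 0) and sticky go to the normalization block, which delivers s and e.]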
Several implementation results are now presented. The circuits were implemented
in a Virtex 5, speed grade 2, device, using ISE 13.1 and XST for synthesis.
Tables 12.4 and 12.5 show combinational circuit implementations for binary32
and binary64 floating point operators. The inputs and outputs are registered. When
the number of registers (FF) is greater than the number of inputs and outputs, this
is due to the register duplication made by the synthesizer. The multiplier is
implemented using the embedded multiplier (DSP48) and general purpose logic.
Table 12.6 shows the results for pipelined versions in the case of decimal32
data. The circuits include input and output registers. The adder is pipelined in two
stages (Fig. 12.5). The multiplier is segmented using three pipeline stages
(Fig. 12.7). The divider latency is equal to six cycles and the square root latency
is equal to five cycles (Figs. 12.9 and 12.10).
12.6 Exercises
1. How many bits are there in the exponent and the significand of a 256-bit
binary floating point number? What are the ranges of the exponent and the
bias?
2. Convert the following decimal numbers to the binary32 and binary64 floating-
point format. (a) 123.45; (b) -1.0; (c) 673.498e10; (d) qNaN; (e) -1.345e-129;
(f) $\infty$; (g) 0.1; (h) 5.1e5
3. Convert the following binary32 numbers to the corresponding decimal numbers.
(a) 08F05352; (b) 7FC00000; (c) AAD2CBC4; (d) FF800000; (e) 484B0173;
(f) E9E55838; (g) E9E55838.
4. Add, subtract, multiply and divide the following binary floating point numbers
with B = 2 and ulp = $2^{-5}$, so that the numbers are represented in the form
$s \cdot 2^{e}$ where $1 \le s \le 1.11111_{2}$. For simplicity e is written in decimal
(base 10).
(a) $1.10101 \times 2^{3}$ op $1.10101 \times 2^{1}$
(b) $1.00010 \times 2^{-1}$ op $1.00010 \times 2^{-1}$
(c) $1.00010 \times 2^{-3}$ op $1.10110 \times 2^{2}$
(d) $1.10101 \times 2^{3}$ op $1.00000 \times 2^{4}$
5. Add, subtract, multiply and divide the following decimal floating point
numbers using B = 10 and ulp = $10^{-4}$, so that the numbers are represented in
the form $s \cdot 10^{e}$ where $1 \le s \le 9.9999$ (normalized decimal numbers).
(a) $9.4375 \times 10^{3}$ op $8.6247 \times 10^{2}$
(b) $1.0014 \times 10^{3}$ op $9.9491 \times 10^{2}$
(c) $1.0714 \times 10^{4}$ op $7.1403 \times 10^{2}$
(d) $3.4518 \times 10^{-1}$ op $7.2471 \times 10^{3}$
6. Analyze the consequences and implications of supporting denormalized
(subnormal in IEEE 754-2008) numbers in the basic operations.
7. Analyze the hardware implications of dealing with non-normalized significands
(s) instead of normalized ones as in the binary standard.
8. Generate VHDL models adding a pipeline stage to the binary floating point
adder of Sect. 12.5.1.
9. Add a pipeline stage to the binary floating point multiplier of Sect. 12.5.2.
10. Generate VHDL models adding two pipeline stages to the binary floating point
multiplier of Sect. 12.5.2.
11. Generate VHDL models adding several pipeline stages to the binary floating
point divider of Sect. 12.5.3.
12. Generate VHDL models adding several pipeline stages to the binary floating
point square root of Sect. 12.5.4.
Reference