
Chapter 12

Floating Point Arithmetic

There are many data processing applications (e.g. image and voice processing),
which use a large range of values and that need a relatively high precision. In such
cases, instead of encoding the information in the form of integers or fixed-point
numbers, an alternative solution is a floating-point representation. In the first
section of this chapter, the IEEE standard for floating point is described. The next
section is devoted to the algorithms for executing the basic arithmetic operations.
The two following sections define the main rounding methods and introduce the
concept of guard digit. Finally, the last few sections propose basic implementa-
tions of the arithmetic operations, namely addition and subtraction, multiplication,
division and square root.

12.1 IEEE 754-2008 Standard

The IEEE-754 Standard is a technical standard established by the Institute of Electrical and Electronics Engineers for floating-point operations. There are
numerous CPU, FPU and software implementations of this standard. The current
version is IEEE 754-2008 [1], which was published in August 2008. It includes
nearly all the definitions of the original IEEE 754-1985 and IEEE 854-1987
standards. The main enhancement in the new standard is the definition of decimal
floating point representations and operations. The standard defines the arithmetic
and interchange formats, rounding algorithms, arithmetic operations and exception
handling.

J.-P. Deschamps et al., Guide to FPGA Implementation of Arithmetic Functions, Lecture Notes in Electrical Engineering 149, DOI: 10.1007/978-94-007-2987-2_12, © Springer Science+Business Media Dordrecht 2012

12.1.1 Formats

Formats in IEEE 754 describe sets of floating-point data and encodings for
interchanging them. This format allows representing a finite subset of real num-
bers. The floating-point numbers are represented using a triplet of natural numbers
(positive integers). The finite numbers may be expressed either in base 2 (binary)
or in base 10 (decimal). Each finite number is described by three integers: the sign
(zero or one), the significand s (also known as coefficient or mantissa), and the
exponent e. The numerical value of the represented number is (-1)^sign × s × B^e, where B is the base (2 or 10).
For example, if sign = 1, s = 123456, e = -3 and B = 10, then the repre-
sented number is -123.456.
The format also allows the representation of infinities (+∞ and -∞), and of special values, called Not a Number (NaN), to represent invalid values. In
fact there are two kinds of NaN: qNaN (quiet) and sNaN (signaling). The latter,
used for diagnostic purposes, indicates the source of the NaN.
The values that can be represented are determined by the base (B), the number of digits of the significand (precision p), and the maximum and minimum values emax and emin of e. Hence, s is an integer belonging to the range 0 to B^p - 1, and e is an integer such that emin ≤ e ≤ emax.
For example, if B = 10 and p = 7, then s is included between 0 and 9999999. If emin = -95 and emax = 96, then the smallest non-zero positive number that can be represented is 1 × 10^-101, the largest is 9999999 × 10^90 (9.999999 × 10^96), and the full range of numbers is from -9.999999 × 10^96 to 9.999999 × 10^96. The numbers closest to zero among the bounds, -1 × 10^-95 and 1 × 10^-95, are considered to be the smallest (in magnitude) normal numbers. Non-zero numbers between these smallest normal numbers are called subnormal (also denormalized) numbers.
Zero values are finite values whose significand is 0. The sign bit specifies if a
zero is +0 (positive zero) or -0 (negative zero).

12.1.2 Arithmetic and Interchange Formats

The arithmetic format, based on the four parameters B, p, emin and emax, defines the
set of represented numbers, independently of the encoding that will be chosen for
storing and interchanging them (Table 12.1). The interchange formats define fixed-
length bit-strings intended for the exchange of floating-point data. There are some
differences between binary and decimal interchange formats. Only the binary format
will be considered in this chapter. A complete description of both the binary and the
decimal format can be found in the document of the IEEE 754-2008 Standard [1].
For the interchange of binary floating-point numbers, formats of lengths equal to 16, 32, 64, 128, and any multiple of 32 bits for lengths greater than 128, are defined (Table 12.2). The 16-bit format is only intended for the exchange or storage of small numbers.

Table 12.1 Binary and decimal floating-point formats in IEEE 754-2008

                Binary formats (B = 2)                      Decimal formats (B = 10)
Parameter       Binary16   Binary32   Binary64   Binary128  Decimal32  Decimal64  Decimal128
p, digits       10 + 1     23 + 1     52 + 1     112 + 1    7          16         34
emax            +15        +127       +1023      +16383     +96        +384       +16383
emin            -14        -126       -1022      -16382     -95        -383       -16382
Common name     Half       Single     Double     Quadruple
                precision  precision  precision  precision

Table 12.2 Binary interchange format parameters

Parameter                      Binary16  Binary32  Binary64  Binary128  Binary{k} (k ≥ 128)
k, storage width in bits       16        32        64        128        Multiple of 32
p, precision in bits           11        24        53        113        k - w
emax                           15        127       1023      16383      2^(k-p-1) - 1
bias, E - e                    15        127       1023      16383      emax
w, exponent field width        5         8         11        15         round(4·log2(k)) - 13
t, trailing significand bits   10        23        52        112        k - w - 1

Fig. 12.1 Binary interchange format
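The relations in the last column of Table 12.2 tie all the parameters together; the following Python sketch (the function name is ours) makes them explicit:

```python
import math

def binary_format_params(k):
    """Parameters of the binary{k} interchange format (Table 12.2).
    k must be 16, 32, 64 or a multiple of 32 not smaller than 128."""
    if k in (16, 32, 64):
        w = {16: 5, 32: 8, 64: 11}[k]       # fixed widths for the short formats
    else:
        w = round(4 * math.log2(k)) - 13    # exponent field width
    p = k - w                               # precision (includes the hidden bit)
    t = p - 1                               # trailing significand field, k - w - 1
    emax = 2 ** (w - 1) - 1                 # also equal to the bias
    return {'w': w, 'p': p, 't': t, 'emax': emax, 'emin': 1 - emax}
```

For instance, binary_format_params(32) recovers the binary32 column: w = 8, p = 24, t = 23, emax = 127, emin = -126.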
The binary interchange encoding scheme is the same as in the IEEE 754-1985 standard. The k-bit strings are made up of three fields (Fig. 12.1):
• a 1-bit sign S,
• a w-bit biased exponent E = e + bias,
• the p - 1 trailing bits of the significand; the missing bit is encoded in the exponent (hidden first bit).
Each binary floating-point number has just one encoding. In the following description, the significand s is expressed in scientific notation, with the radix point immediately following the first digit. To make the encoding unique, the value of the significand s is maximized by decreasing e until either e = emin or s ≥ 1 (normalization). After normalization, there are two possibilities regarding the significand:
• If s ≥ 1 and e ≥ emin, then a normalized number of the form 1.d1 d2 … d(p-1) is obtained. The first "1" is not stored (implicit leading 1).
• If e = emin and 0 < s < 1, the floating-point number is called subnormal. Subnormal numbers (and zero) are encoded with a reserved biased exponent value. They have an implicit leading significand bit = 0 (Table 12.2).
The minimum exponent value is emin = 1 - emax. The range of the biased exponent E is 1 to 2^w - 2 to encode normal numbers. The reserved value 0 is used to encode ±0 and subnormal numbers. The value 2^w - 1 is reserved to encode ±∞ and NaNs.
The value of a binary floating-point datum is inferred from the constituent fields as follows:
• If E = 2^w - 1 (all 1's in E), then the datum is a NaN or an infinity. If T ≠ 0, then it is a qNaN or an sNaN; if the first bit of T is 1, it is a qNaN. If T = 0, then the value is (-1)^sign × (+∞).
• If E = 0 (all 0's in E), then the datum is 0 or a subnormal number. If T = 0 it is a signed 0. Otherwise (T ≠ 0), the value of the corresponding floating-point number is (-1)^sign × 2^emin × (0 + 2^(1-p) × T).
• If 1 ≤ E ≤ 2^w - 2, then the datum is (-1)^sign × 2^(E-bias) × (1 + 2^(1-p) × T).
Remember that the significand of a normal number has an implicit leading 1.
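The three cases above can be turned into a small decoder. The following Python sketch (the function name is ours) mirrors them for binary32, returning ordinary Python floats:

```python
def decode_binary32(x):
    """Interpret a 32-bit integer as an IEEE 754-2008 binary32 datum,
    following the three cases enumerated above."""
    sign = x >> 31
    E = (x >> 23) & 0xFF           # biased exponent field, w = 8
    T = x & 0x7FFFFF               # trailing significand field, t = 23
    if E == 0xFF:                  # all 1's: NaN or infinity
        if T != 0:
            return float('nan')    # qNaN if the first bit of T is 1, else sNaN
        return (-1.0) ** sign * float('inf')
    if E == 0:                     # all 0's: signed zero or subnormal
        return (-1.0) ** sign * 2.0 ** -126 * (T / 2.0 ** 23)
    return (-1.0) ** sign * 2.0 ** (E - 127) * (1 + T / 2.0 ** 23)
```

For example, decode_binary32(0x3F800000) yields 1.0 and decode_binary32(0x00000001) yields the smallest positive subnormal, 2^-149.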

Example 12.1
Convert the decimal number -9.6875 to its binary32 representation.
• First convert the absolute value of the number to binary (Chap. 10): 9.6875₁₀ = 1001.1011₂.
• Normalize: 1001.1011 = 1.0011011 × 2^3. Hence e = 3, s = 1.0011011.
• Hide the first bit and complete with 0's up to 23 bits: 00110110000000000000000.
• Add the bias to the exponent. In this case, w = 8, bias = 2^(w-1) - 1 = 127 and thus E = e + bias = 3 + 127 = 130₁₀ = 10000010₂.
• Compose the final 32-bit representation: 1 10000010 00110110000000000000000₂ = C11B0000₁₆.
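The result can be cross-checked with Python's struct module, which packs numbers in the binary32 interchange format:

```python
import struct

# Pack -9.6875 as a big-endian binary32 value and recover the bit pattern.
bits = struct.unpack('>I', struct.pack('>f', -9.6875))[0]

sign = bits >> 31          # sign bit: 1
E = (bits >> 23) & 0xFF    # biased exponent: 130 = 10000010 in binary
T = bits & 0x7FFFFF        # trailing significand: 0011011 followed by 0's
```

Printing hex(bits) gives the packed representation of -9.6875.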

Example 12.2
Convert the following binary32 numbers to their decimal representation.
• 7FC00000₁₆: sign = 0, E = FF₁₆, T ≠ 0, hence it is a NaN. Since the first bit of T is 1, it is a quiet NaN.
• FF800000₁₆: sign = 1, E = FF₁₆, T = 0, hence it is -∞.
• 6545AB78₁₆: sign = 0, E = CA₁₆ = 202₁₀, e = E - bias = 202 - 127 = 75₁₀, T = 10001011010101101111000₂, s = 1.10001011010101101111000₂ = 1.5442953₁₀. The number is 1.10001011010101101111₂ × 2^75 = 5.8341827 × 10^22.
• 12345678₁₆: sign = 0, E = 24₁₆ = 36₁₀, e = E - bias = 36 - 127 = -91₁₀, T = 01101000101011001111000₂, s = 1.01101000101011001111000₂ = 1.408888₁₀. The number is 1.01101000101011001111₂ × 2^-91 = 5.69045661 × 10^-28.
• 80000000₁₆: sign = 1, E = 00₁₆, T = 0. The number is -0.0.
• 00012345₁₆: sign = 0, E = 00₁₆, T ≠ 0, hence it is a subnormal number, e = emin = -126, T = 00000010010001101000101₂, s = 0.00000010010001101000101₂ = 0.0088889₁₀. The number is 0.00000010010001101000101₂ × 2^-126 = 1.0448782 × 10^-40.

12.2 Arithmetic Operations

We first analyze the main arithmetic operations and derive the corresponding computation algorithms. In what follows it will be assumed that the significand s is represented in base B (binary if B = 2, decimal if B = 10) and that it belongs to the interval 1 ≤ s ≤ B - ulp, where ulp stands for the unit in the last position or unit of least precision. Thus s is expressed in the form s0.s-1 s-2 … s-p, the number being s × B^e, where emin ≤ e ≤ emax and 1 ≤ s0 ≤ B - 1.

Comment 12.1
Binary subnormals and decimal floating-point numbers are not normalized numbers and are not included in the following analysis. They deserve some special treatment that is out of the scope of this section.

12.2.1 Addition of Positive Numbers

Given two positive floating-point numbers s1 × B^e1 and s2 × B^e2, their sum s × B^e is computed as follows: assume that e1 is greater than or equal to e2; then (alignment) the sum of s1 × B^e1 and s2 × B^e2 can be expressed in the form s × B^e where

s = s1 + s2 / B^(e1-e2) and e = e1.     (12.1)

The value of s belongs to the interval

1 ≤ s ≤ 2·B - 2·ulp,     (12.2)

so that s could be greater than or equal to B. If it is the case, that is if

B ≤ s ≤ 2·B - 2·ulp,     (12.3)

then (normalization) substitute s by s/B, and e by e + 1, so that the value of s × B^e is the same as before, and the new value of s satisfies

1 ≤ s ≤ 2 - (2/B)·ulp ≤ B - ulp.     (12.4)

The significands s1 and s2 of the operands are multiples of ulp. If e1 is greater than e2, the value of s could no longer be a multiple of ulp and some rounding function should be applied to s. Assume that s′ < s < s″ = s′ + ulp, s′ and s″ being two successive multiples of ulp. Then the rounding function associates to s either s′ or s″, according to some rounding strategy. According to (12.4) and to the fact that 1 and B - ulp are multiples of ulp, it is obvious that 1 ≤ s′ < s″ ≤ B - ulp. Nevertheless, if the condition (12.3) does not hold, that is if 1 ≤ s < B, s could belong to the interval

B - ulp < s < B,     (12.5)

so that rounding(s) could be equal to B; then a new normalization step would be necessary, i.e. substitution of s = B by s = 1 and e by e + 1.

Algorithm 12.1: Sum of positive numbers
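A behavioral Python sketch of the steps just described — alignment (12.1), normalization (12.3), rounding, and a possible second normalization — with significands held as exact rationals and the round-to-nearest, tie-to-even scheme of Sect. 12.3 assumed (function names are ours, not the original listing):

```python
from fractions import Fraction

def round_ne(s, ulp):
    """Round s to the nearest multiple of ulp, ties to even (Sect. 12.3)."""
    q, r = divmod(s, ulp)
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    return q * ulp

def add_positive(s1, e1, s2, e2, B=2, p=5):
    """Sum of the positive numbers s1*B**e1 and s2*B**e2, the significands
    being exact Fractions in [1, B - ulp]."""
    ulp = Fraction(1, B ** p)
    if e1 < e2:                                    # ensure e1 >= e2
        s1, e1, s2, e2 = s2, e2, s1, e1
    s, e = s1 + s2 / Fraction(B ** (e1 - e2)), e1  # alignment, (12.1)
    if s >= B:                                     # normalization, (12.3)
        s, e = s / B, e + 1
    s = round_ne(s, ulp)                           # rounding
    if s == B:                                     # normalization after rounding
        s, e = Fraction(1), e + 1
    return s, e
```

With B = 2 and p = 5 this reproduces Examples 12.3 below, e.g. add_positive(Fraction(53, 32), 3, Fraction(34, 32), -1) yields (55/32, 3), i.e. 1.10111 × 2^3.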

Examples 12.3
Assume that B = 2 and ulp = 2^-5, so that the numbers are represented in the form s × 2^e where 1 ≤ s ≤ 1.11111₂. For simplicity e is written in decimal (base 10).

1. Compute z = (1.10101 × 2^3) + (1.00010 × 2^-1).
Alignment: z = (1.10101 + 0.000100010) × 2^3 = 1.101110010 × 2^3.
Rounding: s ≈ 1.10111.
Final result: z ≈ 1.10111 × 2^3.

2. Compute z = (1.11010 × 2^3) + (1.00110 × 2^2).
Alignment: z = (1.11010 + 0.100110) × 2^3 = 10.011010 × 2^3.
Normalization: s = 1.0011010, e = 4.
Rounding: s ≈ 1.00110.
Final result: z ≈ 1.00110 × 2^4.

3. Compute z = (1.10010 × 2^3) + (1.10101 × 2^1).
Alignment: z = (1.10010 + 0.0110101) × 2^3 = 1.1111101 × 2^3.
Rounding: s ≈ 10.00000.
Normalization: s ≈ 1.00000, e = 4.
Final result: z ≈ 1.00000 × 2^4.

Comments 12.2
1. The addition of two positive numbers could produce an overflow as the final
value of e could be greater than emax.
2. Observe in the previous examples the lack of precision due to the small number
of bits (6 bits) used in the significand s.

12.2.2 Difference of Positive Numbers

Given two positive floating-point numbers s1 × B^e1 and s2 × B^e2, their difference s × B^e is computed as follows: assume that e1 is greater than or equal to e2; then (alignment) the difference between s1 × B^e1 and s2 × B^e2 can be expressed in the form s × B^e where

s = s1 - s2 / B^(e1-e2) and e = e1.     (12.6)

The value of s belongs to the interval -(B - ulp) ≤ s ≤ B - ulp. If s is negative, then it is substituted by -s and the sign of the final result will be modified accordingly. If s is equal to 0, then an exception equal_zero could be raised. It remains to consider the case where 0 < s ≤ B - ulp. The value of s could be smaller than 1. In order to normalize the significand, s is substituted by s × B^k and e by e - k, where k is the minimum exponent such that s × B^k ≥ 1. Thus, the relation 1 ≤ s < B holds. It remains to round (up or down) the significand and to normalize it if necessary.
In the following algorithm, the function leading_zeroes(s) computes the smallest k such that s × B^k ≥ 1.

Algorithm 12.2: Difference of positive numbers
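A behavioral Python sketch of these steps (alignment (12.6), sign correction, leading-zero normalization, rounding), again with exact rationals and round-to-nearest, tie-to-even; the function names are ours:

```python
from fractions import Fraction

def diff_positive(s1, e1, s2, e2, B=2, p=5):
    """Difference s1*B**e1 - s2*B**e2 of positive numbers, with e1 >= e2.
    Returns (sign, s, e); sign = 1 when the result is negative."""
    ulp = Fraction(1, B ** p)
    s, e, sign = s1 - s2 / Fraction(B ** (e1 - e2)), e1, 0  # alignment, (12.6)
    if s < 0:
        s, sign = -s, 1              # the sign of the final result is corrected
    if s == 0:
        return sign, Fraction(0), 0  # the 'equal_zero' exception in the text
    while s < 1:                     # leading_zeroes: one step per leading zero
        s, e = s * B, e - 1
    q, r = divmod(s, ulp)            # round to nearest, tie to even
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    s = q * ulp
    if s == B:                       # normalization after rounding
        s, e = Fraction(1), e + 1
    return sign, s, e
```

For instance, diff_positive(Fraction(34, 32), 3, Fraction(54, 32), 2) yields (0, 7/4, 0), i.e. 1.11000 × 2^0, as in Examples 12.4 below.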



Examples 12.4
Assume again that B = 2 and ulp = 2^-5, so that the numbers are represented in the form s × 2^e where 1 ≤ s ≤ 1.11111₂. For computing the difference, the 2's complement representation is used (one extra bit is used).

1. Compute z = (1.10101 × 2^-2) - (1.01010 × 2^1).
Alignment: z = (0.00110101 - 1.01010) × 2^1.
2's complement addition: (00.00110101 + 10.10101 + 00.00001) × 2^1 = 10.11100101 × 2^1.
Change of sign: -s = 01.00011010 + 00.00000001 = 01.00011011.
Rounding: s ≈ 1.00011.
Final result: z ≈ -1.00011 × 2^1.

2. Compute z = (1.00010 × 2^3) - (1.10110 × 2^2).
Alignment: z = (1.00010 - 0.110110) × 2^3.
2's complement addition: (01.00010 + 11.001001 + 00.000001) × 2^3 = 00.001110 × 2^3.
Leading zeroes: k = 3, s = 1.11000, e = 0.
Final result: z = 1.11000 × 2^0.

3. Compute z = (1.01010 × 2^3) - (1.01001 × 2^1).
Alignment: z = (1.01010 - 0.0101001) × 2^3 = 0.1111111 × 2^3.
Leading zeroes: k = 1, s = 1.111111, e = 2.
Rounding: s ≈ 10.00000.
Normalization: s ≈ 1.00000, e = 3.
Final result: z ≈ 1.00000 × 2^3.

Comment 12.3
The difference of two positive numbers could produce an underflow as the final
value of e could be smaller than emin.

Table 12.3 Effective operation in the floating-point adder–subtractor

Operation  Sign1  Sign2  Actual operation
0          0      0      s1 + s2
0          0      1      s1 - s2
0          1      0      -(s1 - s2)
0          1      1      -(s1 + s2)
1          0      0      s1 - s2
1          0      1      s1 + s2
1          1      0      -(s1 + s2)
1          1      1      -(s1 - s2)

12.2.3 Addition and Subtraction

Given two floating-point numbers (-1)^sign1 × s1 × B^e1 and (-1)^sign2 × s2 × B^e2, and a control variable operation, an algorithm is defined for computing

z = (-1)^sign × s × B^e = (-1)^sign1 × s1 × B^e1 + (-1)^sign2 × s2 × B^e2, if operation = 0;
z = (-1)^sign × s × B^e = (-1)^sign1 × s1 × B^e1 - (-1)^sign2 × s2 × B^e2, if operation = 1.

Once the significands have been aligned, the actual operation (addition or subtraction of the significands) depends on the values of operation, sign1 and sign2 (Table 12.3). The following algorithm computes z. The procedure swap(a, b) interchanges a and b.

Algorithm 12.3: Addition and subtraction
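The selection logic of Table 12.3 compresses into two lines: once a subtraction is folded into the sign of the second operand, the significands are added when the (effective) signs agree and subtracted otherwise, and the result carries sign1 before any correction for a negative difference. A Python sketch (the function name is ours):

```python
def effective_operation(operation, sign1, sign2):
    """Actual significand operation and result sign per Table 12.3, before
    any correction for a negative difference of significands."""
    # operation = 1 (subtraction) flips the effective sign of operand 2
    add = (sign1 == (sign2 ^ operation))
    return ('s1 + s2' if add else 's1 - s2'), sign1
```

For example, effective_operation(0, 1, 1) yields ('s1 + s2', 1), i.e. -(s1 + s2), matching the fourth row of the table.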



12.2.4 Multiplication

Given two floating-point numbers (-1)^sign1 × s1 × B^e1 and (-1)^sign2 × s2 × B^e2, their product (-1)^sign × s × B^e is computed as follows:

sign = sign1 xor sign2, s = s1 × s2, e = e1 + e2.     (12.7)

The value of s belongs to the interval 1 ≤ s ≤ (B - ulp)², and could be greater than or equal to B. If it is the case, that is if B ≤ s ≤ (B - ulp)², then (normalization) substitute s by s/B, and e by e + 1. The new value of s satisfies

1 ≤ s ≤ (B - ulp)²/B = B - 2·ulp + (ulp)²/B < B - ulp     (12.8)

(ulp < B, so that 2 - ulp/B > 1). It remains to round the significand and to normalize if necessary.

Algorithm 12.4: Multiplication
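A behavioral Python sketch of these steps ((12.7), normalization, rounding, renormalization), with exact rationals and round-to-nearest, tie-to-even assumed (the function name is ours):

```python
from fractions import Fraction

def fp_mul(sign1, s1, e1, sign2, s2, e2, B=2, p=5):
    """Product of two floating-point numbers, significands as Fractions."""
    ulp = Fraction(1, B ** p)
    sign, s, e = sign1 ^ sign2, s1 * s2, e1 + e2  # (12.7)
    if s >= B:                                    # first normalization
        s, e = s / B, e + 1
    q, r = divmod(s, ulp)                         # round to nearest, tie to even
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    s = q * ulp
    if s == B:                                    # normalization after rounding
        s, e = Fraction(1), e + 1
    return sign, s, e
```

With B = 2 and p = 5 this reproduces Examples 12.5 below, e.g. fp_mul(0, Fraction(61, 32), 3, 0, Fraction(35, 32), -1) yields (0, 33/32, 3), i.e. 1.00001 × 2^3.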

Examples 12.5
Assume again that B = 2 and ulp = 2^-5, so that the numbers are represented in the form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in decimal.

1. Compute z = (1.10110 × 2^-2) × (1.00010 × 2^5).
Multiplication: z = 01.1100101100 × 2^3.
Rounding: s ≈ 1.11001.
Final result: z ≈ 1.11001 × 2^3.

2. Compute z = (1.11101 × 2^3) × (1.00011 × 2^-1).
Multiplication: z = 10.0001010111 × 2^2.
Normalization: s = 1.00001010111, e = 3.
Rounding: s ≈ 1.00001.
Final result: z ≈ 1.00001 × 2^3.

3. Compute z = (1.01000 × 2^1) × (1.10011 × 2^2).
Multiplication: z = 01.1111111000 × 2^3.
Rounding: s ≈ 10.00000.
Normalization: s ≈ 1.00000, e = 4.
Final result: z ≈ 1.00000 × 2^4.

Comment 12.4
The product of two real numbers could produce an overflow or an underflow as the
final value of e could be greater than emax or smaller than emin (addition of two
negative exponents).

12.2.5 Division

Given two floating-point numbers (-1)^sign1 × s1 × B^e1 and (-1)^sign2 × s2 × B^e2, their quotient (-1)^sign × s × B^e is computed as follows:

sign = sign1 xor sign2, s = s1 / s2, e = e1 - e2.     (12.9)

The value of s belongs to the interval 1/B < s ≤ B - ulp, and could be smaller than 1. If that is the case, that is if s = s1/s2 < 1, then s1 < s2, s1 ≤ s2 - ulp, s1/s2 ≤ 1 - ulp/s2 < 1 - ulp/B, and 1/B < s < 1 - ulp/B. Then (normalization) substitute s by s × B, and e by e - 1. The new value of s satisfies 1 < s < B - ulp. It remains to round the significand.

Algorithm 12.5: Division
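A behavioral Python sketch of these steps ((12.9), the single normalization, rounding), with exact rationals and round-to-nearest, tie-to-even assumed (the function name is ours):

```python
from fractions import Fraction

def fp_div(sign1, s1, e1, sign2, s2, e2, B=2, p=5):
    """Quotient of two floating-point numbers, significands as Fractions."""
    ulp = Fraction(1, B ** p)
    sign, s, e = sign1 ^ sign2, s1 / s2, e1 - e2  # (12.9)
    if s < 1:                                     # at most one normalization
        s, e = s * B, e - 1
    q, r = divmod(s, ulp)                         # round to nearest, tie to even
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    return sign, q * ulp, e
```

For instance, fp_div(0, Fraction(5, 4), 1, 0, Fraction(51, 32), 2) yields (0, 25/16, -2), i.e. 1.10010 × 2^-2, as in Examples 12.6 below.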

Examples 12.6
Assume again that B = 2 and ulp = 2^-5, so that the numbers are represented in the form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in decimal.

1. Compute z = (1.11101 × 2^3) / (1.00011 × 2^-1).
Division: z = 1.1011111000 × 2^4.
Rounding: s ≈ 1.10111.
Final result: z ≈ 1.10111 × 2^4.

2. Compute z = (1.01000 × 2^1) / (1.10011 × 2^2).
Division: z = 0.1100100011 × 2^-1.
Normalization: s ≈ 1.100100011, e = -2.
Rounding: s ≈ 1.10010.
Final result: z ≈ 1.10010 × 2^-2.

Comment 12.5
The quotient of two real numbers could produce an underflow or an overflow, as the final value of e could be smaller than emin or greater than emax. Observe that a second normalization is not necessary, unlike in the case of addition, subtraction and multiplication.

12.2.6 Square Root

Given a positive floating-point number s1 × B^e1, its square root s × B^e is computed as follows:

if e1 is even: s = (s1)^(1/2), e = e1/2;     (12.10)
if e1 is odd: s = (s1/B)^(1/2), e = (e1 + 1)/2.     (12.11)

In the first case (12.10), 1 ≤ s ≤ (B - ulp)^(1/2) < B - ulp. In the second case, (1/B)^(1/2) ≤ s < 1. Hence (normalization) s must be substituted by s × B and e by e - 1, so that 1 ≤ s < B. It remains to round the significand and to normalize if necessary.

Algorithm 12.6: Square root
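A behavioral Python sketch of these steps ((12.10), (12.11), normalization, rounding); ordinary floats are precise enough here for small p, and Python's round() already implements ties-to-even (the function name is ours):

```python
import math

def fp_sqrt(s1, e1, B=2, p=5):
    """Square root of s1 * B**e1, with 1 <= s1 < B."""
    if e1 % 2 == 0:
        s, e = math.sqrt(s1), e1 // 2             # (12.10)
    else:
        s, e = math.sqrt(s1 / B), (e1 + 1) // 2   # (12.11)
        s, e = s * B, e - 1                       # normalization: s was < 1
    s = round(s * B ** p) / B ** p                # round to the nearest multiple of ulp
    if s >= B:                                    # normalization after rounding
        s, e = 1.0, e + 1
    return s, e
```

For instance, fp_sqrt(1.90625, 4) yields (1.375, 2), i.e. 1.01100 × 2^2, as in Examples 12.7 below.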

An alternative is to replace (12.11) by:

if e1 is odd: s = (s1 × B)^(1/2), e = (e1 - 1)/2.     (12.12)

In this case B^(1/2) ≤ s ≤ (B² - ulp·B)^(1/2) < B; then the first normalization is not necessary. Nevertheless, s could satisfy B - ulp < s < B, and then, depending on the rounding strategy, normalization after rounding could still be necessary.

Algorithm 12.7: Square root, second version

Note that the "round to nearest" (default rounding in IEEE 754-2008) and the "truncation" rounding schemes allow avoiding the second normalization.

Examples 12.7
Assume again that B = 2 and ulp = 2^-5, so that the numbers are represented in the form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in decimal form.

1. Compute z = (1.11101 × 2^4)^(1/2).
Square rooting: z = 1.01100001 × 2^2.
Rounding: s ≈ 1.01100.
Final result: z ≈ 1.01100 × 2^2.

2. Compute z = (1.00101 × 2^-1)^(1/2).
Odd exponent: s = 10.0101, e = -1.
Square rooting: z = 1.10000101 × 2^-1.
Rounding: s ≈ 1.10000.
Final result: z ≈ 1.10000 × 2^-1.

3. Compute z = (1.11111 × 2^3)^(1/2).
Odd exponent: s = 11.1111, e = 1.
Square rooting: z = 1.11111011 × 2^1.
Rounding: s ≈ 1.11111 (round to nearest).
Final result: z ≈ 1.11111 × 2^1.

However, some rounding schemes (e.g. toward infinity) generate s ≈ 10.00000. Then, the result after normalization is s ≈ 1.00000, e = 2, and the final result z ≈ 1.00000 × 2^2.

Comment 12.6
The square rooting of a real number could produce an underflow as the final value
of e could be smaller than emin.

12.3 Rounding Schemes

Given a real number x and a floating-point representation system, the following situations could happen:
• |x| < smin × B^emin, that is, an underflow situation;
• |x| > smax × B^emax, that is, an overflow situation;
• |x| = s × B^e, where emin ≤ e ≤ emax and smin ≤ s ≤ smax.
In the third case, either s is a multiple of ulp, in which case a rounding operation is not necessary, or it is included between two multiples s′ and s″ of ulp: s′ < s < s″.
The rounding operation associates to s either s′ or s″, according to some rounding strategy. The most common are the following ones:
• The truncation method (also called round toward 0 or chopping) is accomplished by dropping the extra digits, i.e. round(s) = s′ if s is positive, round(s) = s″ if s is negative.
• The round toward plus infinity is defined by round(s) = s″.
• The round toward minus infinity is defined by round(s) = s′.
• The round to nearest method associates s with the closest value, that is, if s < s′ + ulp/2, round(s) = s′, and if s > s′ + ulp/2, round(s) = s″.
If the distances to s′ and s″ are the same, that is, if s = s′ + ulp/2, there are several options. For instance:
• round(s) = s′;
• round(s) = s″;
• round(s) = s′ if s is positive, round(s) = s″ if s is negative. It is the round to nearest, ties to zero scheme.
• round(s) = s″ if s is positive, round(s) = s′ if s is negative. It is the round to nearest, ties away from zero scheme.
• round(s) = s′ if s′ is an even multiple of ulp, round(s) = s″ if s″ is an even multiple of ulp. It is the default scheme in the IEEE 754 standard.
• round(s) = s′ if s′ is an odd multiple of ulp, round(s) = s″ if s″ is an odd multiple of ulp.

The preceding schemes (round to the nearest) produce the smallest absolute error, and the last two (tie to even, tie to odd) also produce the smallest average absolute error (unbiased or 0-bias representation systems).
Assume now that the exact result of an operation, after normalization, is

s = 1.s-1 s-2 s-3 … s-p | s-(p+1) s-(p+2) s-(p+3) …,

where ulp is equal to B^-p (the | symbol indicates the separation between the digit which corresponds to the ulp and the following ones). Whatever the chosen rounding scheme, it is not necessary to have previously computed all the digits s-(p+1) s-(p+2) …; it is sufficient to know the digit s-(p+1) and whether all the digits beyond it are equal to 0, or not. For example, the following algorithm computes round(s) if the round to the nearest, tie to even scheme is used.

Algorithm 12.8: Round to the nearest, tie to even
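The decision can be sketched from exactly these three pieces of information; in the following Python sketch (the function name and argument encoding are ours), the truncated significand is given as an integer count of ulps:

```python
def round_ne_parts(s1_ulps, next_digit, rest_zero, B=2):
    """Round-to-nearest, tie-to-even, given the truncated significand s1
    (as an integer count of ulps), the digit s_-(p+1), and whether every
    digit beyond s_-(p+1) is zero."""
    above_half = 2 * next_digit > B                  # clearly closer to s''
    at_half = 2 * next_digit == B
    if above_half or (at_half and (not rest_zero or s1_ulps % 2 == 1)):
        s1_ulps += 1                                 # round up (or break the tie)
    return s1_ulps
```

For instance, with B = 2 a tie (next_digit = 1, rest zero) leaves an even significand unchanged, round_ne_parts(52, 1, True) = 52, but rounds an odd one up, round_ne_parts(53, 1, True) = 54.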

In order to execute the preceding algorithm it is sufficient to know the value of s1 = 1.s-1 s-2 s-3 … s-p, the value of s-(p+1), and whether s2 = 0.00…0|0 s-(p+2) s-(p+3) … is equal to 0, or not.

12.3.1 Rounding Schemes in IEEE 754

From the previous description, the IEEE 754-2008 standard defines five rounding algorithms. The first two round to a nearest value; the others are called directed roundings:
• Round to nearest, ties to even: this is the default rounding for binary floating-point and the recommended default for decimal.
• Round to nearest, ties away from zero.
• Round toward 0: directed rounding towards zero.
• Round toward +∞: directed rounding towards positive infinity.
• Round toward -∞: directed rounding towards negative infinity.

12.4 Guard Digits

Consider the exact result r of an operation, before normalization. According to the preceding paragraphs:

r < B², i.e. r = r1 r0 . r-1 r-2 r-3 … r-p | r-(p+1) r-(p+2) r-(p+3) …

The normalization operation (if necessary) is accomplished by
• dividing the result by B (sum of positive numbers, multiplication),
• multiplying the result by B (division),
• multiplying the result by B^k (difference of positive numbers).
Furthermore, if the operation is a difference of positive numbers (Algorithm 12.2), consider two cases:
• if e1 - e2 ≥ 2, then r = s1 - s2/B^(e1-e2) > 1 - B/B² = 1 - 1/B ≥ 1/B (as B ≥ 2), so that the number k of leading zeroes is equal to 0 or 1, and the normalization operation (if necessary, i.e. k = 1) is accomplished by multiplying the result by B;
• if e1 - e2 ≤ 1, then the result before normalization is either

r0 . r-1 r-2 r-3 … r-p | r-(p+1) 0 0 … (e1 - e2 = 1), or
r0 . r-1 r-2 r-3 … r-p | 0 0 0 … (e1 - e2 = 0).

A consequence of the preceding analysis is that the result after normalization can be either

r0 . r-1 r-2 r-3 … r-p | r-(p+1) r-(p+2) r-(p+3) … (no normalization operation),     (12.13)

or

r1 . r0 r-1 r-2 … r-(p-1) | r-p r-(p+1) r-(p+2) … (divide by B),     (12.14)

or

r-1 . r-2 r-3 r-4 … r-(p+1) | r-(p+2) r-(p+3) r-(p+4) … (multiply by B),     (12.15)

or

r-k . r-(k+1) r-(k+2) … r-p r-(p+1) 0 … 0 | 0 0 … (multiply by B^k, where k > 1).     (12.16)

For executing a rounding operation, the worst case is (12.15). In particular, for executing Algorithm 12.8, it is necessary to know
• the value of s1 = r-1 . r-2 r-3 r-4 … r-(p+1);
• the value of r-(p+2);
• whether s2 = 0.00…0|0 r-(p+3) r-(p+4) … is equal to 0, or not.
The conclusion is that the result r of an operation, before normalization, must be computed in the form

r ≈ r1 r0 . r-1 r-2 r-3 … r-p | r-(p+1) r-(p+2) T,

that is, with two guard digits r-(p+1) and r-(p+2), and an additional sticky digit T, equal to 0 if all the other digits r-(p+3), r-(p+4), … are equal to 0, and equal to any positive value otherwise.
After normalization, the significand will be obtained in the following general form:

s ≈ 1.s-1 s-2 s-3 … s-p | s-(p+1) s-(p+2) s-(p+3).

The new version of Algorithm 12.8 is the following:

Algorithm 12.9: Round to the nearest, tie to even, second version

Observe that in binary representation, the following algorithm is even simpler.

Algorithm 12.10: Round to the nearest, tie to even, third version
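In binary, with a guard bit g (= the digit s-(p+1)), a round bit r (= s-(p+2)) and the sticky bit T, the round-to-nearest, tie-to-even decision collapses to a single Boolean expression. A Python sketch of that decision (names are ours):

```python
def round_binary_grs(s_ulps, g, r, t):
    """Binary round-to-nearest, tie-to-even, from the truncated significand
    (an integer count of ulps), the guard bit g, the round bit r and the
    sticky bit t (the OR of all further dropped bits)."""
    # round up when strictly above the halfway point (g and (r or t)),
    # or exactly halfway (g, r = t = 0) with an odd truncated significand
    if g and (r or t or (s_ulps & 1)):
        s_ulps += 1
    return s_ulps
```

For instance, a tie (g = 1, r = t = 0) rounds 52 ulps to 52 but 53 ulps to 54.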

12.5 Arithmetic Circuits

This section proposes basic implementations of the arithmetic operations, namely addition and subtraction, multiplication, division and square root. The implementations are based on the previous sections devoted to the algorithms, rounding and guard digits.
The circuits support normalized binary IEEE 754-2008 operands. The hardware needed to manage binary subnormals is complex, and some floating-point implementations handle operations on subnormals via software routines. In the FPGA arena, most cores do not support denormalized numbers. The dynamic range can be increased using fewer resources by widening the exponent (a 1-bit increase in the exponent roughly doubles the dynamic range), and this is typically the solution adopted.

12.5.1 Adder–Subtractor

An adder–subtractor based on Algorithm 12.3 will now be synthesized. The operands are supposed to be in IEEE 754 binary encoding. It is made up of five parts, namely unpacking, alignment, addition, normalization and rounding, and packing. The following implementation does not support subnormal numbers; they are interpreted as zero.

12.5.1.1 Unpacking

The unpacking separates the constitutive parts of the floating-point operands and additionally detects the special numbers (infinities, zeros and NaNs). The special number detection is implemented using simple comparators. The following VHDL process defines the unpacking of a floating-point operand FP; k is the number of bits of FP, w is the number of bits of the exponent, and p is the significand precision.

The preceding process is implemented using two w-bit comparators, one p-bit comparator and some additional gates for the rest of the conditions.

12.5.1.2 Alignment

The alignment circuit implements the first three lines of Algorithm 12.3, i.e.

Fig. 12.2 Alignment circuit in floating point addition/subtraction

An example of implementation is shown in Fig. 12.2. The principal and most complex component is the right shifter.
Given a (p + 3)-bit input vector [1 a(k-1) a(k-2) … a1 a0 0 0 0], the shifter generates a (p + 3)-bit output vector. The lsb (least significant bit) of the output is the sticky bit, indicating whether the shifted-out bits are equal to zero or not. If B = 2, the sticky-digit circuit is an OR circuit.
Observe that if e1 - e2 ≥ p + 3, then the shifter output is equal to [0 0 … 0 1], since the last bit is the sticky bit and the input number is non-zero. The operation with zero is treated differently.

12.5.1.3 Addition and Subtraction

Depending on the respective signs of the aligned operands, one of the following operations must be executed: if they have the same sign, the sum aligned_s1 + aligned_s2 must be computed; if they have different signs, the difference aligned_s1 - aligned_s2 is computed. If the result of the significand difference is negative, then aligned_s2 - aligned_s1 must be computed. Moreover, the only situation in which the final result could be negative is when e1 = e2.
In the circuit of Fig. 12.3 two additions are performed in parallel: result = aligned_s1 ± aligned_s2, where the actual operation is selected with the signs of the operands, and alt_result = s2 - s1. At the end of this stage a multiplexer selects the correct result. The operation selection is done as follows:

12.5.1.4 Normalization and Rounding

The normalization circuit executes the following part of Algorithm 12.3:

If the number of leading zeroes is greater than p + 3, i.e. s1 - s2 < B^-(p+2), then s2 > s1 - B^-(p+2). If e1 were greater than e2, then s2 ≤ (B - ulp)/B = 1 - B^-(p+1); but s2 > s1 - B^-(p+2) ≥ 1 - B^-(p+2) > 1 - B^-(p+1), which is impossible. Thus, the only case where the number of leading zeroes can be greater than p + 3 is when e1 = e2 and s1 = s2. If more than p + 3 leading 0's are detected in the circuit of Fig. 12.4, a zero_flag is raised.
It remains to execute the following algorithm, where operation is the internal operation computed in Fig. 12.3:
12.5 Arithmetic Circuits 325

Fig. 12.3 Effective addition and subtraction

The rounding depends on the selected rounding strategy. An example of a rounding circuit implementation is shown in Fig. 12.4. If the round-to-nearest, ties-to-even method is used (Algorithm 12.10), the block named rounding decision computes the following Boolean function decision:

A second rounding is avoided in the following way. The only case in which a second rounding could be needed is when, as the result of an addition, the significand is 1.111…1111xx. This situation is detected in a combinational block that generates the signal "isTwo" and adds one to the exponent. After rounding, the resulting number is 10.000…000, but the two most significant bits are discarded and the hidden 1 is appended.
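For illustration, the ties-to-even decision can be sketched over an integer significand that carries its extra low-order digits (a Python model with names of ours, not the book's circuit):

```python
def rne_round(sig, extra):
    """Round a significand carrying 'extra' low-order guard digits to
    the nearest, ties to even (B = 2): round up when the discarded tail
    exceeds one half ulp, or equals one half ulp and the kept lsb is 1."""
    kept = sig >> extra
    tail = sig & ((1 << extra) - 1)
    half = 1 << (extra - 1)
    decision = tail > half or (tail == half and (kept & 1) == 1)
    return kept + 1 if decision else kept
```

In the adder, a result of the form 1.111…1111xx whose rounded value reaches 10.000…000 is the isTwo case handled above.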

12.5.1.5 Packing

The packing joins the constitutive parts of the floating point result. Additionally, depending on the special cases (infinities, zeros and NaNs), it generates the corresponding encoding.
Example 12.8 (Complete VHDL code available)
Generate the VHDL model of an IEEE decimal floating-point adder-subtractor. It is made up of the five previously described blocks. Figure 12.5 summarizes the interconnections. For clarity and reusability the code is written using parameters,

Fig. 12.4 Normalization and rounding

where K is the size of the floating point numbers (sign, exponent, significand), E is
the size of the exponent and P is the size of the significand (including the hidden
1). The entity declaration of the circuit is:

For simplicity the code is written in a single VHDL file, except for additional files describing the right shifter of Fig. 12.2 and the leading zero detection and shifting of Fig. 12.4. The code is available at the book home page.

Fig. 12.5 General structure of a floating point adder-subtractor

A two-stage pipeline can be obtained by dividing the data path between the addition/subtraction and the normalization and rounding stages (dotted line in Fig. 12.5).

12.5.2 Multiplier

A basic multiplier deduced from Algorithm 12.4 is shown in Fig. 12.6. The unpacking and packing circuits are the same as in the case of the adder-subtractor (Fig. 12.5, Sects. 12.5.1.1 and 12.5.1.5) and, for simplicity, are not drawn. The "normalization and rounding" block is a simplified version of Fig. 12.4, where the part related to the subtraction is not necessary.

Fig. 12.6 General structure of a floating point multiplier

The obvious method of computing the sticky bit is a large fan-in OR gate over the low-order bits of the product. Observe that in this case the critical path includes the p-by-p-bit multiplication and the sticky digit generation.
An alternative method consists of determining the number of trailing zeros in
the two inputs of the multiplier. It is easy to demonstrate that the number of
trailing zeros in the product is equal to the sum of the number of trailing zeros in
each input operand. Notice that this method does not require the actual low order
product bits, just the input operands, so the computation can occur in parallel with
the actual multiply operation, removing the sticky computation from the critical
path.
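A software sketch of this parallel sticky computation (B = 2; the helper and its interface are assumptions of ours):

```python
def mul_sticky_parallel(s1, s2, discarded):
    """Sticky bit of the product s1*s2 without the low-order product
    bits: since odd * odd is odd, the product has exactly
    tz(s1) + tz(s2) trailing zeros; the sticky is 1 iff some of the
    'discarded' low-order product digits is non-zero."""
    def tz(x):                       # trailing zeros of a non-zero integer
        return (x & -x).bit_length() - 1
    return 1 if tz(s1) + tz(s2) < discarded else 0
```

The comparison of the small sum tz(s1) + tz(s2) against the number of discarded digits corresponds to the small adder and comparator mentioned below.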
The drawback of this method is that significant extra hardware is required. This hardware includes two long priority encoders to count the number of

Fig. 12.7 A better general structure of a floating point multiplier

trailing zeros in the input operands, a small adder, and a small comparator. On the other hand, some hardware is eliminated, since the actual low-order bits of the product are no longer needed.
A faster floating point multiplier architecture that computes the p-by-p multiplication and the sticky bit in parallel is presented in Fig. 12.7. The dotted lines suggest a three-stage pipeline implementation using a two-stage p-by-p multiplication. Two extra blocks are shown to indicate the detection of special conditions. In the second block, the range of the exponent is checked to detect overflow and underflow conditions. In this figure the packing and unpacking processes are omitted for simplicity.

Example 12.9 (complete VHDL code available)


Generate the VHDL model of a generic floating-point multiplier. It is made up of the blocks depicted in Fig. 12.7, described in a single VHDL file. For clarity and reusability the code is written using parameters, where K is the size of the floating point numbers (sign, exponent, significand), E is the size of the exponent and P is the size of the significand (including the hidden 1). The entity declaration of the circuit is:

Fig. 12.8 A general structure of a floating point divider

The code is available at the home page of this book. The combinational circuit has registered inputs and outputs to ease synchronization. A two- or three-stage pipeline is easily achievable by adding the intermediate registers suggested in Fig. 12.7. In order to increase the clock frequency, more pipeline registers can be inserted into the integer multiplier.

Fig. 12.9 A pipelined structure of a floating point divider

12.5.3 Divider

A basic divider, deduced from Algorithm 12.6, is shown in Fig. 12.8. The unpacking and packing circuits are similar to those of the adder-subtractor or multiplier. The "normalization and rounding" block is a simplified version of Fig. 12.4, where the part related to the subtraction is not necessary.
The inputs of the p-digit divider are s1 and s2. The first operand s1 is internally divided by B (s1/B, i.e. right shifted one position) so that the dividend is smaller than the divisor. The precision is chosen equal to p + 3 digits. Thus, the output quotient q and remainder r satisfy the relation (s1/B)·B^(p+3) = s2·q + r with r < s2, that is,

s1/s2 = q·B^−(p+2) + (r/s2)·B^−(p+2), where (r/s2)·B^−(p+2) < B^−(p+2).


The sticky digit is equal to 1 if r ≠ 0 and to 0 if r = 0. The final approximation of the exact result is

quotient = q·B^−(p+2) + sticky_digit·B^−(p+3).

If a non-restoring divider is used, a further optimization can be made in the sticky bit computation. In a non-restoring divider the final remainder can be negative, in which case a final correction step should be performed. This final operation can be avoided: the sticky bit is 0 if the remainder is equal to zero, otherwise it is 1.
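A Python sketch of the quotient-with-sticky computation described above (B = 2; the function is ours, not the book's div_nr_wsticky entity):

```python
def divide_with_sticky(s1, s2, p):
    """Divide two p-digit significands with p + 2 fractional digits of
    quotient and a sticky digit.  The dividend is first divided by B
    so it is smaller than the divisor, giving the relation
    (s1/B) * B**(p+3) = s2*q + r; the sticky is 1 iff r is non-zero."""
    q, r = divmod(s1 << (p + 2), s2)   # (s1/2) * 2**(p+3) = s1 * 2**(p+2)
    return q, (1 if r != 0 else 0)
```

The returned pair yields the approximation quotient = q·B^−(p+2) + sticky·B^−(p+3).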
A divider is a time consuming circuit. In order to obtain circuits with frequencies similar to those of floating point multipliers or adders, more pipeline stages are necessary. Figure 12.9 shows a possible pipeline for a floating point divider where the integer divider is also pipelined.

Example 12.10 (complete VHDL code available)


Generate the VHDL model of a generic floating-point divider. It is made up of the blocks depicted in Fig. 12.9. Most of the design is described in a single VHDL file, except for the integer divider, which is a non-restoring divider (div_nr_wsticky.vhd) that uses a basic non-restoring cell (a_s.vhd). For clarity and reusability the code is written using parameters, where K is the size of the floating point numbers (sign, exponent, significand), E is the size of the exponent and P is the size of the significand (including the hidden 1). The entity declaration of the circuit is:

12.5.4 Square Root

A basic square rooter deduced from Algorithm 12.7 is shown in Fig. 12.10. The
unpacking and packing circuits are the same as in previous operations and, for
simplicity, are not drawn. Remember that the first normalization is not necessary,
and for most rounding strategies the second normalization is not necessary either.
The exponent is calculated as follows:
E = e1/2 + bias = (E1 − bias)/2 + bias = E1/2 + bias/2.

• If e1 is even, then both E1 and bias are odd (bias is always odd). Thus E = ⌊E1/2⌋ + ⌊bias/2⌋ + 1, where ⌊E1/2⌋ and ⌊bias/2⌋ amount to right shifts.
• If e1 is odd, then E1 is even. The significand is multiplied by 2 and the exponent is reduced by one unit. The biased exponent is E = (E1 − 1)/2 + bias/2 = E1/2 + ⌊bias/2⌋.

To summarize, the biased exponent is E = ⌊E1/2⌋ + ⌊bias/2⌋ + parity(E1).
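The summary formula can be checked with a small Python model (the helper name and the e_bits parameter are ours; bias = 2^(e_bits−1) − 1 as in the binary interchange formats):

```python
def sqrt_biased_exponent(E1, e_bits):
    """Biased exponent of a floating point square root:
    E = floor(E1/2) + floor(bias/2) + parity(E1), computed with only
    right shifts and the lsb of E1."""
    bias = (1 << (e_bits - 1)) - 1       # always odd
    E = (E1 >> 1) + (bias >> 1) + (E1 & 1)
    # Cross-check against the unbiased definition: the result exponent
    # is e1/2 for even e1, (e1 - 1)/2 for odd e1 (the significand
    # absorbs the remaining factor).
    e1 = E1 - bias
    assert E == (e1 if e1 % 2 == 0 else e1 - 1) // 2 + bias
    return E
```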

Fig. 12.10 Structure of a floating point square rooter

Since the integer square root is a complex operation, a pipelined floating-point


square rooter should be based on a pipelined integer square root. The dotted line of
Fig. 12.10 shows a possible five stage pipeline circuit.

Example 12.11 (complete VHDL code available)


Generate the VHDL model of a generic floating-point square rooter. It is made up
of the blocks depicted in Fig. 12.10. The design is described in a single VHDL file, except for the integer square root, which is based on a non-restoring algorithm (sqrt_wsticky.vhd) that uses two basic non-restoring cells (sqrt_cell.vhd and sqrt_cell_00.vhd). For clarity and reusability, the code is written using
parameters, where K is the size of the floating point numbers (sign, exponent,
significand), E is the size of the exponent and P is the size of the significand
(including the hidden 1). The entity declaration of the circuit is:

Table 12.4 Combinational floating point operators in binary32 format

              FF    LUTs   DSP48   Slices   Delay (ns)
FP_add        96    699    –       275      11.7
FP_mult       96    189    2       105      8.4
FP_mult_luts  98    802    –       234      9.7
FP_div        119   789    –       262      46.6
FP_sqrt       64    409    –       123      38.0

Table 12.5 Combinational floating point operators in binary64 format

              FF    LUTs   DSP48   Slices   Delay (ns)
FP_add        192   1372   –       585      15.4
FP_mult       192   495    15      199      15.1
FP_mult_luts  192   3325   –       907      12.5
FP_div        244   3291   –       903      136.9
FP_sqrt       128   1651   –       447      97.6

Table 12.6 Pipelined floating point operators in binary32 format

              FF    LUTs   DSP48   Slices   Period (ns)   Latency (cycles)
FP_add        137   637    –       247      6.4           2
FP_mult       138   145    2       76       5.6           2
FP_mult_luts  142   798    –       235      7.1           2
FP_mult       144   178    2       89       3.8           3
FP_mult_luts  252   831    –       272      5.0           3
FP_sqrt       384   815    –       266      9.1           6
FP_div        212   455    –       141      9.2           5

12.5.5 Implementation Results

Several implementation results are now presented. The circuits were implemented in a Virtex-5, speed grade 2, device, using ISE 13.1 and XST for synthesis.
Tables 12.4 and 12.5 show combinational circuit implementations of binary32 and binary64 floating point operators. The inputs and outputs are registered. When the number of registers (FF) is greater than the number of inputs and outputs, this is due to register duplication performed by the synthesizer. The multiplier is implemented both with the embedded multipliers (DSP48) and with general purpose logic.
Table 12.6 shows the results for pipelined versions in the case of binary32 data. The circuits include input and output registers. The adder is pipelined in two stages (Fig. 12.5). The multiplier is segmented using three pipeline stages (Fig. 12.7). The divider latency is equal to five cycles and the square root latency is equal to six cycles (Figs. 12.9 and 12.10).

12.6 Exercises

1. How many bits are there in the exponent and the significand of a 256-bit binary floating point number? What is the range of the exponent, and what is the value of the bias?
2. Convert the following decimal numbers to the binary32 and binary64 floating-point formats: (a) 123.45; (b) −1.0; (c) 673.498e10; (d) qNaN; (e) −1.345e−129; (f) ∞; (g) 0.1; (h) 5.1e5.
3. Convert the following binary32 numbers to the corresponding decimal numbers: (a) 08F05352; (b) 7FC00000; (c) AAD2CBC4; (d) FF800000; (e) 484B0173; (f) E9E55838; (g) E9E55838.
4. Add, subtract, multiply and divide the following binary floating point numbers with B = 2 and ulp = 2^−5, so that the numbers are represented in the form s · 2^e where 1 ≤ s ≤ 1.11111. For simplicity e is written in decimal (base 10).
(a) 1.10101 × 2^3 op 1.10101 × 2^1
(b) 1.00010 × 2^−1 op 1.00010 × 2^−1
(c) 1.00010 × 2^−3 op 1.10110 × 2^2
(d) 1.10101 × 2^3 op 1.00000 × 2^4
5. Add, subtract, multiply and divide the following decimal floating point numbers using B = 10 and ulp = 10^−4, so that the numbers are represented in the form s · 10^e where 1 ≤ s ≤ 9.9999 (normalized decimal numbers).
(a) 9.4375 × 10^3 op 8.6247 × 10^2
(b) 1.0014 × 10^3 op 9.9491 × 10^2
(c) 1.0714 × 10^4 op 7.1403 × 10^2
(d) 3.4518 × 10^−1 op 7.2471 × 10^3
6. Analyze the consequences and implications of supporting denormalized (subnormal in IEEE 754-2008) numbers in the basic operations.
7. Analyze the hardware implications of dealing with non-normalized significands (s) instead of normalized ones as in the binary standard.
8. Generate VHDL models adding a pipeline stage to the binary floating point
adder of Sect. 12.5.1.
9. Add a pipeline stage to the binary floating point multiplier of Sect. 12.5.2.
10. Generate VHDL models adding two pipeline stages to the binary floating point
multiplier of Sect. 12.5.2.
11. Generate VHDL models adding several pipeline stages to the binary floating
point divider of Sect. 12.5.3.
12. Generate VHDL models adding several pipeline stages to the binary floating
point square root of Sect. 12.5.4.

Reference

1. IEEE (2008) IEEE standard for floating-point arithmetic, 29 Aug 2008
