Lecture4 Multiplier
Design Automation
Spring 2019
Lecture 4 - Shifter and
Multiplier Design
Reading Assignment:
Weste: Chapter 8
Rabaey: Chapter 11
Note: some of the figures in this slide set are adapted from the slide set
of “Digital Integrated Circuits” by Rabaey et al., 2002.
1 EESM5020/19 Lecture 4
Shifter Design
• Shifting operations are important and are used extensively for
– arithmetic shifting, logical shifting, and rotation
– floating-point operations, scaling, and multiplication by a
constant
– data alignment
– field extraction/combination
– address generation
• Shifting a data word left or right by a constant amount is a
trivial hardware operation (it is just wiring). A programmable
shifter, however, is more complex.
• E.g. shift left or right by a variable number of bits
• Design style
– Two-dimensional arrays
– Variable size
– Rotate
– Padding with zeros/ones
A simple shifter
[Figure: bit-slice i of a simple shifter with control signals Right, nop, and Left; signals shown: Ai, Ai-1, Bi, Bi-1.]
The Barrel Shifter (2)
[Figure: barrel shifter array; labels shown: inputs A3, A2, A0, outputs B3, B2, B0, and shift control lines Sh1, Sh3.]
Logarithmic Shifter (2)
[Figure: logarithmic shifter with three stages controlled by the signal pairs Sh1, Sh2, and Sh4; inputs A3..A0, outputs B3..B0.]
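The staged structure of the logarithmic shifter can be sketched in software. This is only a behavioural sketch, not the circuit: the function name `log_shift_right` and the 8-bit default width are assumptions made for illustration. Each stage either passes the word through or shifts it by a power of two, selected by one bit of the shift amount (the Sh1, Sh2, Sh4 controls).

```python
def log_shift_right(value, amount, width=8):
    """Logical right shift built from stages that shift by 1, 2, 4, ..."""
    mask = (1 << width) - 1
    value &= mask
    stage = 0
    while (1 << stage) < width:
        # Bit `stage` of the shift amount plays the role of Sh1, Sh2, Sh4, ...
        if (amount >> stage) & 1:
            value = (value >> (1 << stage)) & mask
        stage += 1
    return value


# Shifting by 5 uses the "shift by 1" and "shift by 4" stages only.
print(bin(log_shift_right(0b10110000, 5)))
```

A width of N bits needs only about log2(N) stages, which is where the name comes from.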
The Multiplier
• Multiplication is a very important operation. Often the speed of
the multiplier limits the performance of the digital processor.
• Multiplications are used in many digital signal processing
applications:
– correlation, convolution, filtering, and frequency analysis
– vector products and matrix multiplication
– weighted sums, required in many DSP tasks such as neural
networks and filtering
• Multipliers are in fact complex adder arrays.
• The analysis of the multiplier gives us some further insight
into how to optimize the performance (or the area) of
complex circuit topologies.
Example
• Example: 10 x 5
  Multiplicand:          1 0 1 0     (10)
  Multiplier:          x 0 1 0 1     (5)
                         -------
                         1 0 1 0
                       0 0 0 0         4 partial products
                     1 0 1 0
                   0 0 0 0
                   -------------
                   0 1 1 0 0 1 0     (50)
• The multiplication process may be viewed as consisting of two steps:
– Evaluation of partial products
– Accumulation of the shifted partial products
• Partial products can be generated using an array of AND gates.
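The two steps above (partial product evaluation with AND gates, then accumulation of the shifted rows) can be sketched as follows. The function name `shift_add_multiply` is hypothetical, and the final `sum` stands in for whatever adder structure is used in hardware.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit numbers via AND-gate partial products."""
    partial_products = []
    for j in range(n):
        y_bit = (multiplier >> j) & 1
        # One row of AND gates: X_i AND Y_j for every bit i of the multiplicand.
        row = sum((((multiplicand >> i) & 1) & y_bit) << i for i in range(n))
        partial_products.append(row << j)      # shift the row by its weight
    return partial_products, sum(partial_products)


rows, product = shift_add_multiply(10, 5, 4)
print(rows, product)
```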
The Multiplier (II)
Simple multiplier
[Figure: simple sequential multiplier: partial product generation from the multiplier bits, an adder, and a shift right of the accumulated result every cycle.]
The Array Multiplier
• Consider two unsigned binary numbers X and Y that
are M and N bits wide, respectively:
\[ X = \sum_{i=0}^{M-1} X_i 2^i, \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j \]
\[ P = X \times Y = \sum_{k=0}^{M+N-1} P_k 2^k
     = \left( \sum_{i=0}^{M-1} X_i 2^i \right) \left( \sum_{j=0}^{N-1} Y_j 2^j \right)
     = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_i Y_j 2^{i+j} \]
• The partial product terms X_i Y_j are called summands. There are
M*N summands, which are generated in parallel by a set of
M*N AND gates.
The Array Multiplier (II)
• An n*n multiplier requires n(n-2) full adders, n half
adders, and n^2 AND gates. The worst-case delay is
(2n+1)tg, where tg is the worst-case adder delay.
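The gate counts quoted above can be checked with a small bit-level model. This is a sketch under assumptions: it models a ripple-carry array in which each row adds one new row of summands to the running sum, and the names `array_multiply`, `summand`, and `add_bits` are made up for the illustration. It does not model delay.

```python
def array_multiply(x, y, n):
    """Multiply two unsigned n-bit numbers (n >= 2) with a ripple-carry adder array."""
    counts = {"AND": 0, "HA": 0, "FA": 0}

    def summand(i, j):                      # one AND gate: X_i AND Y_j
        counts["AND"] += 1
        return (x >> i) & (y >> j) & 1

    def add_bits(bits):                     # half adder for 2 inputs, full adder for 3
        counts["HA" if len(bits) == 2 else "FA"] += 1
        total = sum(bits)
        return total & 1, total >> 1        # (sum, carry)

    first = [summand(i, 0) for i in range(n)]
    product_bits = [first[0]]               # lowest product bit needs no adder
    operand = first[1:] + [None]            # None: no bit enters at this position
    for j in range(1, n):
        row = [summand(i, j) for i in range(n)]
        sums, carry = [], None
        for i in range(n):
            inputs = [b for b in (operand[i], row[i], carry) if b is not None]
            s, carry = add_bits(inputs)
            sums.append(s)
        product_bits.append(sums[0])        # one product bit leaves per row
        operand = sums[1:] + [carry]
    product_bits.extend(operand)            # upper n bits come from the last row
    product = sum(bit << k for k, bit in enumerate(product_bits))
    return product, counts


print(array_multiply(13, 11, 4))
```

For n = 4 the model uses 16 AND gates, 4 half adders, and 8 full adders, in line with n^2, n, and n(n-2).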
The Array Multiplier (III)
• The following is the basic cell used in the array multiplier
[Figure: basic array-multiplier cell built around an adder (+); labels shown: X, Y, B, C, CO, PO.]
A 4*4 array multiplier
[Figure: 4*4 array multiplier. Row Y0 forms the partial products of X3..X0 and produces Z0. Row Y1 uses HA FA FA HA and produces Z1. Row Y2 uses FA FA FA HA and produces Z2. Row Y3 uses FA FA FA HA and produces Z7 Z6 Z5 Z4 Z3.]
The MxN Array Multiplier - Critical Path
[Figure: array of HA/FA cells (rows HA FA FA HA, FA FA FA HA, FA FA FA HA) with Critical Path 1 and Critical Path 2 marked.]
[Figure: transistor-level full adder schematic; labels shown: inputs A, B, Ci, propagate signal P, outputs S and Co, supply VDD.]
Issues in designing a fast multiplier
Booth Encoding
• The multipliers we studied before use radix-2
multiplication, i.e. they observe one bit of the
multiplier at a time.
• Higher-radix multipliers may be designed to reduce the
number of adders and hence the delay required to
compute the partial sums.
• Booth encoding performs two’s complement
multiplication and performs several steps of the
multiplication at once.
• It takes advantage of the fact that an adder/subtractor
is nearly as fast and small as a simple adder.
• The most common form of Booth’s algorithm looks at
three bits of the multiplier at a time to perform two
stages of multiplication.
Booth Multiplier: Example
• 2^a = 2^(a+1) - 2^a, and hence we can recode each 1 in the
multiplier as “+2 -1”
– Converts a sequence of 1’s into 1 0…0 (-1)
– Might reduce the number of 1’s
  0  0  1  1  1  1  1  1  0  0
    +1 -1
       +1 -1
          +1 -1
             +1 -1
                +1 -1
                   +1 -1
  0  1  0  0  0  0  0 -1  0  0     (fewer 1’s in this sequence)
[© K. Bazaragan]
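The recoding above can be written as one rule per bit: digit i equals y(i-1) - y(i), with y(-1) = 0. The sketch below assumes that formulation; the function name `booth_recode` is hypothetical.

```python
def booth_recode(y, width):
    """Radix-2 Booth digits (LSB first), each in {-1, 0, +1}."""
    digits = []
    prev = 0                                # y_{-1} = 0
    for i in range(width):
        bit = (y >> i) & 1
        digits.append(prev - bit)           # +1 at the start of a run of 1's (seen from the MSB), -1 at its end
        prev = bit
    return digits


digits = booth_recode(0b0011111100, 10)
print(digits[::-1])                         # MSB first, as on the slide
```

The six 1's of the example collapse into two non-zero digits, and the weighted sum of the digits still equals the original value.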
Booth Recoding: Multiplication Example
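              0 0 1 1 0        6   (multiplicand)
            x 0 1 1 1 0       14   (multiplier)
             +1 0 0 -1 0           (Booth-recoded multiplier)
      -----------------
              0 0 0 0 0            digit  0
      1 1 1 1 1 0 1 0              digit -1: (-6), with sign extension
          0 0 0 0 0                digit  0
        0 0 0 0 0                  digit  0
      0 0 1 1 0                    digit +1: (+6)
      -----------------
      0 0 1 0 1 0 1 0 0       84
Only two rows of partial sums are non-zero.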
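The 6 x 14 example can be reproduced with a short sketch. It assumes the radix-2 rule digit = y(i-1) - y(i) and that the multiplier fits in `width`-bit two's complement; the function name `booth_multiply` is made up for the illustration.

```python
def booth_multiply(x, y, width):
    """Multiply x by a `width`-bit two's complement y using radix-2 Booth digits."""
    product = 0
    nonzero_rows = 0
    prev = 0                                # y_{-1} = 0
    for i in range(width):
        bit = (y >> i) & 1                  # Python's >> keeps the sign, so this reads two's complement bits
        digit = prev - bit                  # -1, 0, or +1
        if digit:
            product += digit * (x << i)     # add or subtract the shifted multiplicand
            nonzero_rows += 1
        prev = bit
    return product, nonzero_rows


print(booth_multiply(6, 14, 5))
```

Negative multipliers need no special case, which is why Booth recoding suits two's complement multiplication.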
Modified Booth Multiplier
• We can reduce the number of partial sums by grouping more bits
• Group the recoded digits in pairs, leaving digits –2, -1, 0, 1, 2
– Grouping reduces the number of partial products by half
• Booth recoding gets rid of 3’s (sequences of 1’s in general)
   0  1    1  0    1  1    1  0    0  0    1  0      multiplier bits
  +1  0   -1 +1    0  0   -1  0    0 +1   -1  0      Booth-recoded digits
    +2      -1       0      -2      +1      -2       digits grouped in pairs
[©Hauck]
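The pairwise grouping can also be computed directly from three overlapping multiplier bits: digit k = -2*y(2k+1) + y(2k) + y(2k-1), with y(-1) = 0. The sketch below assumes that formulation and an even bit width; the function name `modified_booth_digits` is hypothetical.

```python
def modified_booth_digits(y, width):
    """Radix-4 (modified Booth) digits, LSB first, each in {-2, -1, 0, 1, 2}."""
    digits = []
    prev = 0                                # y_{-1} = 0
    for k in range(0, width, 2):            # width is assumed to be even
        b0 = (y >> k) & 1
        b1 = (y >> (k + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # looks at three bits: b1, b0, prev
        prev = b1
    return digits


digits = modified_booth_digits(0b011011100010, 12)
print(digits[::-1])                         # MSB first, as on the slide
```

Twelve multiplier bits produce six digits, so only half as many partial products need to be added.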
Modified Booth Encoding (II)
Modified Booth Multiplier
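[Figure: modified Booth multiplier datapath with two stages. Each stage encodes three multiplier bits (yi, yi+1, yi+2 and yi+2, yi+3, yi+4) into a code that drives a mux select choosing 0, x, or 2x, followed by an adder/subtractor; the partial sum goes from Pj to Pj+1.]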
Modified Booth Multiplier - Summary
• Uses a high radix to reduce the number of intermediate
addition operands
– Can go higher: radix-8, radix-16
– Radix-8 should implement *3, *-3, *4, *-4
– Recoding and partial product generation become
more complex
• Can automatically take care of signed multiplication
Multiply by a constant
Canonic Signed Digit Arithmetic
• Encoding a binary number such that it contains the
fewest non-zero bits is called canonic
signed digit (CSD).
• The following are the properties of CSD numbers:
– No 2 consecutive bits in a CSD number are non-zero.
– The CSD representation of a number contains the minimum
possible number of non-zero bits, thus the name canonic.
– The CSD representation of a number is unique.
– CSD numbers cover the range (-4/3, 4/3), out of which the
values in the range [-1, 1) are of greatest interest.
– Among the W-bit CSD numbers in the range [-1, 1), the
average number of non-zero bits is W/3 + 1/9 + O(2^(-W)).
Hence, on average, CSD numbers contain about 33% fewer
non-zero bits than two’s complement numbers.
Canonic Signed Digit Arithmetic
• Conversion of a W-bit number to CSD format:
– $A = a'_{W-1}.a'_{W-2} \cdots a'_1 a'_0$ is the 2’s complement number
– Its CSD representation is $a_{W-1}.a_{W-2} \cdots a_1 a_0$
• Algorithm to obtain the CSD representation:
  $a'_{-1} = 0;\ \gamma_{-1} = 0;\ a'_W = a'_{W-1};$
  for (i = 0 to W-1)
  {
    $\theta_i = a'_i \oplus a'_{i-1};$
    $\gamma_i = \bar{\gamma}_{i-1}\,\theta_i;$
    $a_i = (1 - 2a'_{i+1})\,\gamma_i;$
  }
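The conversion loop can be sketched in Python as follows. It treats the input as a W-bit two's complement integer rather than a fraction (an assumption that only moves the radix point), and it assumes the gamma update uses the complement of the previous gamma, as in the usual textbook form of this algorithm. The function name `to_csd` is hypothetical.

```python
def to_csd(value, width):
    """CSD digits (LSB first, each -1, 0, or +1) of a width-bit two's complement integer."""
    bits = [(value >> i) & 1 for i in range(width)]    # a'_0 .. a'_{W-1}
    bits.append(bits[-1])                              # a'_W = a'_{W-1} (sign extension)
    prev_bit, prev_gamma = 0, 0                        # a'_{-1} = 0, gamma_{-1} = 0
    digits = []
    for i in range(width):
        theta = bits[i] ^ prev_bit                     # 1 where the bit pattern changes
        gamma = (1 - prev_gamma) & theta               # never two non-zero digits in a row
        digits.append((1 - 2 * bits[i + 1]) * gamma)   # sign chosen by the next bit
        prev_bit, prev_gamma = bits[i], gamma
    return digits


digits = to_csd(0b001011101111100010101111, 24)
print(sum(1 for d in digits if d))                     # number of non-zero digits
```

Multiplying by a constant then costs one shifted add or subtract per non-zero CSD digit, for example `sum(d * (x << i) for i, d in enumerate(digits) if d)`.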
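Example
Two’s complement number (number of non-zero terms = 15):
  0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 1
Step-by-step recoding, starting from the least significant end (-1 denotes a negative digit):
  0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 1 0 0 0 -1
  0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 -1 0 0 0 -1
  0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0 -1 0 -1 0 0 0 -1
  0 0 1 0 1 1 1 1 0 0 0 0 -1 0 0 1 0 -1 0 -1 0 0 0 -1
CSD representation (number of non-zero terms = 8):
  0 1 0 -1 0 0 0 -1 0 0 0 0 -1 0 0 1 0 -1 0 -1 0 0 0 -1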
CSD Multiplication
Using Horner’s rule for partial product
accumulation to reduce the truncation error.
Reduce the number of additions in adding the
partial sums - Wallace-Tree Based Multiplier
• Principle
– Sum N shifted partial products
– Do the N-input addition efficiently
– Reduce the N-input addition in steps
– Use counters, e.g. a carry-save adder (CSA) (3:2
reduction)
• A CSA is simple; it is just a full adder per bit position
– At the end of the array you need to add the two remaining
parts (sum and carry) together.
– This takes a fast adder, but you only need one at the
end, not one for each partial product.
Carry-save adder - 3-to-2 reduction
• The Wallace tree multiplier uses logic tricks to
speed up the required addition. It is an adder tree
built from carry-save adders using 3-to-2 reduction.
• A 1-bit adder provides a 3:2 compression in the number
of bits. The addition of partial products in a column
of an array multiplier may be thought of as totaling up the
number of 1’s in that column, with a carry being passed
to the next column to the left.
  Inputs A B C | Outputs C S | No. of 1’s
         0 0 0 |         0 0 | 0
         0 0 1 |         0 1 | 1
         0 1 0 |         0 1 | 1
         0 1 1 |         1 0 | 2
         1 0 0 |         0 1 | 1
         1 0 1 |         1 0 | 2
         1 1 0 |         1 0 | 2
         1 1 1 |         1 1 | 3
Carry-save adder example
    10010101
    01101011
  + 10101010
  ----------
    01010100     sum bits
   10101011      carry bits (one position to the left)
Reduce 2 N-bit additions to 1 (N+1)-bit addition
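The example above can be reproduced with bitwise operations: the sum word is the XOR of the three operands and the carry word is their bitwise majority. This is a word-level sketch; the function name `carry_save_add` is hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 reduction: three operands in, a sum word and a carry word out."""
    sum_word = a ^ b ^ c                         # per-bit sum of a full adder
    carry_word = (a & b) | (a & c) | (b & c)     # per-bit carry of a full adder
    return sum_word, carry_word


s, carry = carry_save_add(0b10010101, 0b01101011, 0b10101010)
print(f"{s:08b} {carry:08b}")
# The carry word has twice the weight, so one ordinary addition finishes the job.
print(s + (carry << 1) == 0b10010101 + 0b01101011 + 0b10101010)
```

No carry ripples between bit positions inside the CSA, so its delay is one full-adder delay regardless of the word width.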
Wallace Tree Multiplier
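[Figure: Wallace tree multiplier. The multiplicand and multiplier form the partial products, which feed a summation network. The network is a tree of full adders (FA) reducing the column inputs y0..y5, with carries Ci-1 coming from the previous column and carries Ci going to the next column, down to a final carry C and sum S.]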
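The reduction principle can be sketched at word level: keep applying 3:2 carry-save reduction to groups of three rows until only two rows remain, then add them once. This is a simplified carry-save tree, not a column-by-column Wallace layout, and the function name `wallace_multiply` is hypothetical. Python's `sum` stands in for the final fast adder.

```python
def wallace_multiply(x, y, n):
    """Multiply unsigned n-bit numbers with a tree of 3:2 carry-save reductions."""
    rows = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]   # shifted partial products
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        reduced = []
        for k in range(0, full, 3):                  # every group of three becomes two
            a, b, c = rows[k:k + 3]
            reduced.append(a ^ b ^ c)                            # sum word
            reduced.append(((a & b) | (a & c) | (b & c)) << 1)   # carry word
        reduced.extend(rows[full:])                  # leftover rows pass through
        rows = reduced
        levels += 1
    return sum(rows), levels                         # one carry-propagate addition at the end


print(wallace_multiply(200, 123, 8))
```

For eight partial products the tree needs four CSA levels, compared with seven row additions in the plain array.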
Use 4:2 counter instead of 3:2 reduction
• Use two (3:2) reduction circuits to build a 4:2 counter
• Generates two carries, one internal and one external
• The internal carry is sent to the next higher digit
Advantages:
– More regular layout
[Figure: a 4:2 counter built from two stacked 3:2 CSAs, and a row of 3:2 CSAs.]
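A single bit position of such a 4:2 counter can be sketched from two full adders. The names `full_adder`, `compressor_4_2`, `cin`, and `cout` are assumptions for the illustration: `cout` is the internal carry handed to the next higher digit, and `cin` is the internal carry arriving from the next lower digit.

```python
def full_adder(a, b, c):
    """3:2 counter on single bits: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a, b, c, d, cin):
    """4:2 counter for one bit position, built from two full adders."""
    s1, cout = full_adder(a, b, c)          # cout does not depend on cin
    s, carry = full_adder(s1, d, cin)
    return s, carry, cout


print(compressor_4_2(1, 1, 1, 1, 0))
```

Because `cout` never depends on `cin`, the internal carry moves only one digit sideways and does not ripple along the row.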
Example 4*4 multiplier
Example 8*8 multiplier
Sign extension with Booth multiplier
• When a partial product is negative, we
need to do sign extension.
• If we do it just by copying the sign bit, the delay
suffers because the fanout of that bit can be large.
• We can do some tricks
– Pre-add the triangle of 1’s
– Then clear out the 1’s when needed by adding the complement of
the sign bit S to the row, so that the extension becomes
0 0 0 0 0 0 0 0 (S=0) or 1 1 1 1 1 1 1 1 (S=1)
– Now you only need to add a few bits
– Adding these few bits is equivalent to
complete sign extension
[Figure: triangle of sign-extension 1’s next to the row 1 0 1 0 1 0 1 1, and the reduced form in which each row carries only a few extra bits (labels S S S, 1 S, 1 S, 1 S).]
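The trick rests on one identity: extending a w-bit row with copies of its sign bit S gives the same result, modulo the result width, as leaving the row unextended, adding the complement of S just above the row, and adding a string of 1's that is the same for every input. The sketch below checks that identity; the row width `w`, the shift of two bits per row (radix-4 Booth), and the function names are assumptions.

```python
def sum_rows_reference(rows, total_bits):
    """Add signed partial products with full sign extension."""
    return sum(p << (2 * k) for k, p in enumerate(rows)) & ((1 << total_bits) - 1)


def sum_rows_with_trick(rows, w, total_bits):
    """Same sum without copying sign bits across the whole word."""
    acc = 0
    constant = 0
    for k, p in enumerate(rows):
        shift = 2 * k
        raw = p & ((1 << w) - 1)                    # the w-bit row as generated
        s = raw >> (w - 1)                          # its sign bit
        acc += (raw << shift) + ((1 - s) << (w + shift))   # row plus complemented sign bit
        constant -= 1 << (w + shift)                # the pre-added triangle of 1's
    return (acc + constant) & ((1 << total_bits) - 1)


rows = [-6, 5, -3]
print(sum_rows_with_trick(rows, 5, 12) == sum_rows_reference(rows, 12))
```

The constant does not depend on the data, so it can be computed once at design time and folded into the adder tree.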
Example to illustrate
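[Figure: sign-extension example in three stages. First, each row is extended with a triangle of sign bits s. Second, each triangle row is replaced by a row of 1’s plus a single s-dependent bit. Third, after pre-adding the 1’s, only a few bits remain per row (s s s on the first row, then 1 s on each following row, and a final s).]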
Implementing Large multiplier
using several smaller multipliers
• If we already have n*n-bit multipliers, we can use them
to implement multipliers with a larger bit width
• E.g. a 2n*2n-bit multiplier can be implemented with 4
n*n-bit multipliers:
\[ X \cdot Y = (X_H \cdot 2^n + X_L) \cdot (Y_H \cdot 2^n + Y_L)
             = X_H \cdot Y_H \cdot 2^{2n} + (X_H \cdot Y_L + Y_H \cdot X_L) \cdot 2^n + X_L \cdot Y_L \]
Implementing Large multiplier
using several smaller multipliers
• Use 4 n*n-bit multipliers to generate 4 partial products of 2n
bits each, align them correctly, and add them
[Figure: operands split into XH XL and YH YL; the partial product XL*YL occupies the least significant 2n bits of the result.]
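The decomposition can be sketched as follows. The function name `multiply_2n` and the `small_mul` parameter (standing in for an existing n*n-bit multiplier block) are assumptions for the illustration.

```python
def multiply_2n(x, y, n, small_mul=lambda a, b: a * b):
    """Multiply two unsigned 2n-bit numbers using four n*n-bit multiplications."""
    mask = (1 << n) - 1
    xh, xl = x >> n, x & mask               # split X into high and low halves
    yh, yl = y >> n, y & mask               # split Y into high and low halves
    hh = small_mul(xh, yh)                  # weight 2^(2n)
    hl = small_mul(xh, yl)                  # weight 2^n
    lh = small_mul(xl, yh)                  # weight 2^n
    ll = small_mul(xl, yl)                  # weight 2^0
    return (hh << (2 * n)) + ((hl + lh) << n) + ll


print(hex(multiply_2n(0xABCD, 0x1234, 8)))
```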
Multipliers - Summary
Floating-point units
• More complex operations that take more time
• Accessed less often
• Often designed outside the normal ALU
• Co-processor
• Floating-point representation
• Data = (-1)^sign * 0.1Fraction * 2^exp
• Normalization:
– ½ <= Data < 1 (Exp = 0, Sign = 0)
– The first binary digit after the point is one
– No need to represent it
• IEEE standard (double precision): sign – 1 bit, exponent – 11 bits,
fraction – 52 bits => total 64 bits
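The 1/11/52-bit split can be inspected with Python's standard `struct` module. Note that IEEE 754 double precision normalizes the significand as 1.Fraction with an exponent bias of 1023, which differs from the 0.1Fraction convention written above. The function name `double_fields` is hypothetical.

```python
import struct


def double_fields(value):
    """Return (sign, biased exponent, fraction) of a 64-bit IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))   # reinterpret the 8 bytes as an integer
    sign = raw >> 63                         # 1 bit
    exponent = (raw >> 52) & 0x7FF           # 11 bits, biased by 1023
    fraction = raw & ((1 << 52) - 1)         # 52 bits, hidden leading 1 not stored
    return sign, exponent, fraction


print(double_fields(1.0))
print(double_fields(-2.5))
```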
Floating Point Addition
• Align operands
– Check exponents
– Shift data
• Add fractional bits
– Integer addition
• Normalization
– Shift data
– Increment or decrement exponents
• Rounding data
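The four steps can be sketched on a toy format. Everything about the format is an assumption made for illustration: numbers are tuples (sign, exp, mant) with value (-1)^sign * (mant / 2^W) * 2^exp, the mantissa is W = 8 bits with its top bit set when normalized, and rounding is plain truncation.

```python
W = 8  # mantissa width of the toy format (assumption)


def fp_add(a, b):
    """Add two toy floating-point numbers given as (sign, exp, mant)."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    # 1. Align operands: shift the mantissa with the smaller exponent to the right.
    exp = max(ea, eb)
    ma >>= exp - ea
    mb >>= exp - eb
    # 2. Add the mantissas as signed integers.
    total = (-ma if sa else ma) + (-mb if sb else mb)
    sign = 1 if total < 0 else 0
    mant = abs(total)
    if mant == 0:
        return (0, 0, 0)
    # 3. Normalize: shift and increment or decrement the exponent.
    while mant >= (1 << W):
        mant >>= 1
        exp += 1
    while mant < (1 << (W - 1)):
        mant <<= 1
        exp -= 1
    # 4. Rounding: bits shifted out above are simply dropped (truncation).
    return (sign, exp, mant)


print(fp_add((0, 1, 192), (0, -1, 128)))    # 1.5 + 0.25
print(fp_add((0, 1, 192), (1, 1, 160)))     # 1.5 - 1.25
```

Subtraction of nearly equal values is the case that needs a long left shift and a decremented exponent during normalization.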
Floating point adder
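[Figure: floating-point adder. Inputs A and B each supply a sign, an exponent, and a mantissa, plus a +/- control. Blocks: exponent difference (Exp. Diff.), shift/align, sign unit, mantissa adder, exponent update, normalize (Norm), and round. Outputs: C sign, C exponent, C mantissa.]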
Floating Point Multiplication
• Add exponents
– 11 bit addition
• Multiply the mantissa
– Integer multiplication
• Normalization
– Shift data (at most by one)
– Decrement exponent
• Rounding data
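The same steps can be sketched on a toy format. The format is an assumption made for illustration: numbers are tuples (sign, exp, mant) with value (-1)^sign * (mant / 2^W) * 2^exp, the mantissa is W = 8 bits with its top bit set when normalized, and rounding is plain truncation.

```python
W = 8  # mantissa width of the toy format (assumption)


def fp_mul(a, b):
    """Multiply two normalized toy floating-point numbers given as (sign, exp, mant)."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb                          # sign of the product
    exp = ea + eb                           # add exponents
    prod = ma * mb                          # integer multiplication, 2W bits
    if prod == 0:
        return (0, 0, 0)
    # Both mantissas are in [1/2, 1), so the product is in [1/4, 1):
    # at most one left shift is needed, with the exponent decremented.
    if prod < (1 << (2 * W - 1)):
        prod <<= 1
        exp -= 1
    return (sign, exp, prod >> W)           # keep the top W bits (truncation)


print(fp_mul((0, 0, 192), (0, 0, 128)))     # 0.75 * 0.5
print(fp_mul((1, 1, 192), (0, 2, 192)))     # -1.5 * 3
```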
Floating Point Multiplier
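[Figure: floating-point multiplier. Inputs A and B each supply a sign, an exponent, and a mantissa. Blocks: Ex-or of the signs, exponent adder (Exp. Add), mantissa multiplier, exponent update, normalize (Norm), and round. Outputs: C sign, C exponent, C mantissa.]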
Other Data Operators
Comparator
• A = B, A > B, A < B
High speed comparator
• A single-cycle comparator based on the priority-
encoding algorithm and a dynamic circuit design
technique
• 4 steps:
1. XOR gates are used to determine whether each pair of corresponding
bits of the two numbers is equal or not.
2. A priority encoder is used to set the most significant unequal bit
of the result from step 1 to ‘1’ and reset all other bits to ‘0’.
3. The result of step 2 is “ANDed” with the two input numbers.
4. All the bits of each result of step 3 are “ORed” together to
determine which number is greater.
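The four steps can be sketched with integer bit operations for unsigned inputs. The function name `compare` is hypothetical, and `int.bit_length` stands in for the priority encoder.

```python
def compare(a, b):
    """Return (a_greater, b_greater); both False means the numbers are equal."""
    diff = a ^ b                                            # step 1: bitwise XOR
    msb = (1 << (diff.bit_length() - 1)) if diff else 0     # step 2: keep only the top unequal bit
    a_masked = a & msb                                      # step 3: AND with each input
    b_masked = b & msb
    return a_masked != 0, b_masked != 0                     # step 4: OR-reduce each result


print(compare(0b1011, 0b1101))
print(compare(7, 7))
```

The number that holds a 1 at the most significant position where the inputs differ is the greater one.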
Dynamic Priority Encoder
Wide bit width comparator – 64 bits
• Hierarchical, multi-stage
• Phase pipelining to achieve single-clock operation