Lecture4 Multiplier

Uploaded by

Yi

EESM5020 VLSI System Design and Design Automation
Spring 2019
Lecture 4 - Shifter and Multiplier Design
Reading Assignment:
Weste: Chapter 8
Rabaey: Chapter 11

Note: some of the figures in this slide set are adapted from the slide set
of “Digital Integrated Circuits” by Rabaey et al., 2002
1 EESM5020/19 Lecture 4
Shifter Design
• Shifting operations are important and are used extensively for
– arithmetic shifting, logical shifting, rotation,
– floating point operations, scaling and multiplications by
constant number
– Data alignment
– Field extraction/combination
– Address generation
• Shifting a data word left or right by a constant amount is a
trivial hardware operation (it only needs wiring). A programmable
shifter, however, is more complex.
• E.g. shift left or right by a variable number of bits
• Design style
– Two dimension arrays
– Variable size
– Rotate
– Padding with zeros/ones

A simple shifter
[Figure: bit-slice i of a simple programmable shifter. Control signals Right, nop and Left; inputs Ai, Ai-1; outputs Bi, Bi-1]

• The above design rapidly becomes complex and slow for larger
shift values.
• A more structured approach is advisable. Two commonly used
shift structures are the barrel shifter and the logarithmic shifter.
Barrel Shifter

• It consists of an array of transmission gates, where the
number of rows equals the word length of the data
and the number of columns equals the maximum
shift width.
• A major advantage of this shifter is that the signal
passes through at most one transmission gate,
so the delay is theoretically constant and
independent of the shift value or shifter size. This is
not true in reality, since the capacitance at the input
of the buffers rises linearly with the maximum shift
width.
The Barrel Shifter (2)

[Figure: 4-bit barrel shifter. Data wires A3..A0 (inputs) and B3..B0 (outputs); control wires Sh0..Sh3]

Area Dominated by Wiring
Logarithmic Shifter
• While the barrel shifter implements the whole shifter as a single
array of pass transistors, the logarithmic shifter uses a staged approach:
stages of multiplexers decompose the shift into power-of-two steps.
• A shifter with a maximum shift width of M consists of log2(M) stages,
where the i-th stage either shifts by 2^i or passes the data
unchanged.
• The logarithmic shifter is usually smaller than the barrel shifter. For
larger values of M, it is definitely the structure of choice.
• The speed depends on the shift width logarithmically, since an n-bit
shifter requires log2(n) stages.
• Other shift options are frequently required, for instance shuffles,
bit reversals, and interchanges.
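The staged idea can be sketched in a few lines of Python. This is only a behavioural model under assumptions: the function name `log_shift_right`, the default 8-bit width and the choice of a logical right shift are illustrative, not taken from the slides.

```python
def log_shift_right(value: int, shift: int, width: int = 8) -> int:
    """Logical right shift modelled as log2(width) multiplexer stages."""
    value &= (1 << width) - 1
    stages = (width - 1).bit_length()  # log2(width) for a power-of-two width
    for i in range(stages):
        # Stage i is a row of 2:1 muxes controlled by bit i of the shift
        # amount: it either shifts by 2**i or passes the data unchanged.
        if (shift >> i) & 1:
            value >>= 1 << i
    return value


print(bin(log_shift_right(0b10110000, 5)))  # shift by 4, then by 1
```

Every shift amount from 0 to width-1 is covered because its binary digits select the stages to enable.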

Logarithmic Shifter (2)
[Figure: 4-bit logarithmic shifter. Inputs A3..A0 pass through multiplexer stages controlled by Sh1, Sh2 and Sh4 to outputs B3..B0]

• In general, a barrel shifter is appropriate for smaller
shifters. For large shift values, the logarithmic shifter
becomes more effective in terms of both area and
speed. The logarithmic shifter is also more regular and
hence can easily be generated automatically.
Multiplexer-based shifter

The Multiplier
• Very important operation. Often the speed of multiplication
limits the performance of the digital processor.
• Multiplications are used in many digital signal processing
applications:
– correlations, convolution, filtering, and frequency analysis.
– Vector product, matrix multiplication.
– Weighted sums, required in many DSP applications such as
neural networks and filtering.
• Multipliers are in fact complex adder arrays.
• The analysis of the multiplier gives us some further insight
into how to optimize the performance (or the area) of
complex circuit topologies.
Example
• Example: 10 x 5

          1 0 1 0        multiplicand (10)
    x     0 1 0 1        multiplier (5)
    -------------
          1 0 1 0
        0 0 0 0          4 partial products
      1 0 1 0
    0 0 0 0
    -------------
    0 1 1 0 0 1 0        product (50)

• The multiplication process may be viewed as two steps:
– Evaluation of the partial products
– Accumulation of the shifted partial products
• Partial products can be generated using an array of AND gates.
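The two steps can be mimicked in software as a rough sketch. The function name `partial_products` and the 4-bit default are assumptions made for illustration; the AND of each multiplicand bit with one multiplier bit stands in for the AND-gate array.

```python
def partial_products(multiplicand: int, multiplier: int, n: int = 4) -> list[int]:
    """Return the n shifted partial products of an unsigned n x n multiply."""
    rows = []
    for j in range(n):
        y_j = (multiplier >> j) & 1
        # One row of AND gates: every multiplicand bit ANDed with y_j.
        row = sum((((multiplicand >> i) & 1) & y_j) << i for i in range(n))
        rows.append(row << j)  # weight the row by 2**j
    return rows


rows = partial_products(0b1010, 0b0101)
print(rows, sum(rows))  # step 1: evaluation, step 2: accumulation
```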
The Multiplier (II)

• Single-bit binary multiplication is equivalent to a logical AND.
Evaluation of the partial products consists of the
logical ANDing of the multiplicand with the relevant
multiplier bit.
• Different techniques exist. The choice of technique
is based on factors such as speed, throughput,
numerical accuracy and area.
• An n*n multiplier has a 2n-bit output
– Integer multiplier: takes the n LSBs
– Floating point multiplier (or fixed point with the binary
point at the MSB), e.g. 1.XXX * 1.XXX: takes the n MSBs
Simple multiplier

• Generates and adds one partial product in each cycle.
• Takes n cycles.

[Figure: sequential multiplier datapath. Blocks: multiplicand, multiplier (shifted), partial product generation, adder, and a shift right every cycle]
The Array Multiplier
• Consider two unsigned binary numbers X and Y that
are M and N bits wide, respectively:

$$X = \sum_{i=0}^{M-1} X_i 2^i, \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$$

$$P = X \times Y = \sum_{k=0}^{M+N-1} P_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\left(\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}\right)$$

• The terms X_i Y_j are the partial product terms, called summands. There are
M*N summands, which are generated in parallel by a set of
M*N AND gates.
The Array Multiplier (II)
• An n*n multiplier requires n(n-2) full adders, n half
adders, and n^2 AND gates. The worst-case delay is
(2n+1)·t_g, where t_g is the worst-case adder delay.
The Array Multiplier (III)
• The following is the basic cell used in an array multiplier

[Figure: basic array multiplier cell with inputs X, Y, B and C, an adder (+), and outputs CO and PO]
A 4*4 array multiplier
[Figure: 4*4 array multiplier. Inputs X3..X0 and Y0..Y3; three adder rows (HA FA FA HA, then FA FA FA HA twice); outputs Z7..Z0]
The MxN Array Multiplier - Critical Path

[Figure: adder array (rows HA FA FA HA, FA FA FA HA, FA FA FA HA) with Critical Path 1 and Critical Path 2 marked]

$$t_{mult} \approx [(M-1) + (N-2)]\,t_{carry} + (N-1)\,t_{sum} + t_{and}$$
Adder Cells in Array Multiplier

[Figure: transistor-level adder cell with inputs A, B and Ci, propagate signal P, and outputs S and Co]

Identical Delays for Carry and Sum
Issues in designing a fast multiplier

• Reduce the number of partial products
• Use fast adder cells
• Reduce the number of additions required to sum
the partial products, e.g. use tree adders
Booth Encoding
• The multipliers we studied so far use radix-2
multiplication, i.e. they observe one bit of the
multiplier at a time.
• Higher-radix multipliers may be designed to reduce the
number of adders and hence the delay required to
compute the partial sums.
• Booth encoding performs two’s complement
multiplication and performs several steps of the
multiplication at once.
• It takes advantage of the fact that an adder-subtractor
is nearly as fast and small as a simple adder.
• The most common form of Booth’s algorithm looks at
three bits of the multiplier at a time to perform two
stages of multiplication.
Booth Multiplier: Example
• 2^a = 2^(a+1) - 2^a, and hence we can recode each 1 in the
multiplier as “+2 -1”
– Converts a sequence of 1’s into 1 0…0 (-1)
– Might reduce the number of 1’s

     0  0  1  1  1  1  1  1  0  0     original multiplier
       +1 -1
          +1 -1
             +1 -1                    each 1 recoded as (+1 -1);
                +1 -1                 adjacent -1 and +1 cancel
                   +1 -1
                      +1 -1
     0 +1  0  0  0  0  0 -1  0  0     fewer non-zero digits in this sequence

[© K. Bazaragan]
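A minimal sketch of this recoding, assuming the multiplier is given as a list of bits with the MSB first and an implicit 0 to the right of the LSB. The helper name `booth_recode` is hypothetical. Digit i is computed as (bit i-1) minus (bit i), which is what the pairwise cancellation above amounts to.

```python
def booth_recode(bits: list[int]) -> list[int]:
    """Radix-2 Booth recoding. bits and result are MSB first."""
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for b in reversed(bits):
        digits.append(prev - b)  # digit_i = y_(i-1) - y_i, in {-1, 0, +1}
        prev = b
    return digits[::-1]


print(booth_recode([0, 0, 1, 1, 1, 1, 1, 1, 0, 0]))
```

The recoded digits represent the same value when the input is read as a two's complement number.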
Booth Recoding: Multiplication Example

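                0 0 1 1 0      multiplicand (6)
      x         0 1 1 1 0      multiplier (14)
      recoded: +1 0 0 -1 0
      -----------------
      0 0 0 0 0 0 0 0 0        digit  0
      1 1 1 1 1 0 1 0 0        digit -1: (-6), sign-extended, shifted left 1
      0 0 0 0 0 0 0 0 0        digit  0
      0 0 0 0 0 0 0 0 0        digit  0
      0 0 1 1 0 0 0 0 0        digit +1: (+6), shifted left 4
      -----------------
      0 0 1 0 1 0 1 0 0        product (84)

Only two rows of non-zero partial sums; the negative row needs sign extension.

[© K. Bazaragan]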


Booth Recoding: Advantages and Disadvantages

• Major advantage: can reduce the number of 1’s
in the multiplier
• So far:
– We have not improved the speed of the multiplier, as we
still have to wait for the critical path, e.g. the shift-add
delay in a sequential multiplier.
– Booth recoding results in increased area, as we need
recoding circuitry and subtraction.
Modified Booth Multiplier
• We can reduce the number of partial sums by grouping more bits
• Group pairs of recoded digits, leaving digits in {-2, -1, 0, 1, 2}
– Grouping reduces the number of partial products by half
• Booth recoding gets rid of 3’s (sequences of 1’s in general)

     0  1  1  0  1  1  1  0  0  0  1  0     multiplier
    +1  0 -1 +1  0  0 -1  0  0 +1 -1  0     after recoding each 1 as (+1 -1)
      +2    -1     0    -2    +1    -2      after grouping pairs (2*left + right)

[© Hauck]
Modified Booth Encoding (II)

• Consider the two’s complement representation of
the multiplier y:

$$y = -2^n y_n + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \cdots$$

• We can rewrite each weight using $2^a = 2^{a+1} - 2^a$, and hence

$$y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \cdots$$

• Look at the first two terms:

$$2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1})$$
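As a worked step (a sketch that follows from the expression above, using the same symbols), factoring $2^{n-1}$ out of the first two terms gives a single digit that depends on three adjacent multiplier bits:

$$2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) = 2^{n-1}\left(-2 y_n + y_{n-1} + y_{n-2}\right)$$

The bracket can only take the values -2, -1, 0, +1 and +2, so each pair of terms contributes one partial product that is 0, ±1 or ±2 times the multiplicand.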

Modified Booth Multiplier

• Can encode the digits by looking at three bits at a
time (reduces the number of partial sums)
• Booth recoding table:

    i+1   i   i-1   add
     0    0    0    0*M
     0    0    1    1*M
     0    1    0    1*M
     0    1    1    2*M
     1    0    0   –2*M
     1    0    1   –1*M
     1    1    0   –1*M
     1    1    1    0*M

– Must be able to add the multiplicand times –2, -1, 0, 1 and 2
– Since Booth recoding got rid of 3’s, generating partial products
is not that hard (shifting and negating)
[© Hauck]
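The table can be reproduced from the expression y_(i-1) + y_i - 2*y_(i+1). The sketch below assumes an even word length n and a two's complement multiplier; the function name `booth_radix4_digits` is hypothetical.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Radix-4 Booth digits of an n-bit two's complement y, least significant first."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    y &= (1 << n) - 1

    def bit(k: int) -> int:
        return 0 if k < 0 else (y >> k) & 1  # y_(-1) is defined as 0

    # One digit per overlapping window (y_(k+1), y_k, y_(k-1)), k = 0, 2, 4, ...
    return [bit(k - 1) + bit(k) - 2 * bit(k + 1) for k in range(0, n, 2)]


print(booth_radix4_digits(0b011011100010, 12))  # least significant digit first
```

Each digit has weight 4^k, so n/2 partial products replace the n rows of the radix-2 scheme.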
Booth Multiplier: Example

• Retire two bits per shift operation
• Addition: signed
– Sign extend 2 bits if adding two partial products at a time

              0 0 1 1 0 1        multiplicand M (13)
      x       1 1 1 0 1 0        multiplier (-6)
      recoded digits:  0  -1  -2
      -------------------
      1 1 1 1 1 0 0 1 1 0        -2*M, sign-extended
      1 1 1 1 0 0 1 1            -1*M, shifted left 2
      0 0 0 0 0 0                 0*M, shifted left 4
      -------------------
      1 1 1 0 1 1 0 0 1 0        product (-78)

    i+1   i   i-1   add
     0    0    0    0*M
     0    0    1    1*M
     0    1    0    1*M
     0    1    1    2*M
     1    0    0   –2*M
     1    0    1   –1*M
     1    1    0   –1*M
     1    1    1    0*M

[© K. Bazaragan]
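The example can be replayed with a table-driven sketch. The names `BOOTH_TABLE` and `booth_multiply` are illustrative assumptions; Python integers are unbounded, so sign extension happens implicitly rather than bit by bit as in hardware.

```python
# (y_(i+1), y_i, y_(i-1)) -> multiple of the multiplicand M to add
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_multiply(m: int, y: int, n: int) -> int:
    """Multiply m by an n-bit two's complement y (n even), two bits per step."""
    y_ext = (y & ((1 << n) - 1)) << 1  # append y_(-1) = 0 on the right
    product = 0
    for step in range(n // 2):
        window = (y_ext >> (2 * step)) & 0b111
        key = ((window >> 2) & 1, (window >> 1) & 1, window & 1)
        product += (BOOTH_TABLE[key] * m) << (2 * step)  # retire two bits
    return product


print(booth_multiply(13, -6, 6))
```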
Booth Multiplier
• The following shows the structure of a Booth multiplier

[Figure: two cascaded Booth stages, j and j+1. Each stage has a recoder (“code”) fed by three multiplier bits (yi, yi+1, yi+2 for stage j; yi+2, yi+3, yi+4 for stage j+1), a mux selecting 0, x or 2x, an adder/subtractor, and a left shift by 2. Partial product Pj enters stage j and Pj+1 leaves it]
Modified Booth Multiplier - Summary
• Uses a high radix to reduce the number of intermediate
addition operands
– Can go higher: radix-8, radix-16
– Radix-8 must also implement *3, *-3, *4, *-4
– Recoding and partial product generation become
more complex
• Can automatically take care of signed multiplication
Multiply by a constant

• Occurs very often in digital signal processing:
multiplying the data by a coefficient.
• Use shift and add to calculate
• The number of additions depends on the number of 1’s in
the constant.
• Objective: reduce the number of 1’s needed to represent
the constant.
• Canonical signed digit arithmetic
Canonic Signed Digit Arithmetic
• Encoding a binary number such that it contains the
fewest non-zero digits is called canonic
signed digit (CSD) encoding.
• The following are the properties of CSD numbers:
– No two consecutive digits in a CSD number are non-zero.
– The CSD representation of a number contains the minimum
possible number of non-zero digits, thus the name canonic.
– The CSD representation of a number is unique.
– CSD numbers cover the range (-4/3, 4/3), out of which the
values in the range [-1, 1) are of greatest interest.
– Among the W-bit CSD numbers in the range [-1, 1), the
average number of non-zero digits is W/3 + 1/9 + O(2^-W).
Hence, on average, CSD numbers contain about 33% fewer
non-zero digits than two’s complement numbers.
Canonic Signed Digit Arithmetic
• Conversion of a W-bit number to CSD format:
– A = a'_{W-1}.a'_{W-2} … a'_1 a'_0 is the two’s complement number
– Its CSD representation is a_{W-1}.a_{W-2} … a_1 a_0
• Algorithm to obtain the CSD representation:

    a'_{-1} = 0;  γ_{-1} = 0;  a'_W = a'_{W-1};
    for (i = 0 to W-1)
    {
        θ_i = a'_i ⊕ a'_{i-1};
        γ_i = (NOT γ_{i-1}) · θ_i;
        a_i = (1 - 2a'_{i+1}) · γ_i;
    }
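A direct transcription of this loop into Python, as a sketch only. It assumes the value is handled as a W-bit two's complement integer rather than a fraction, and that γ_{i-1} enters complemented, since without the complement γ would stay 0 for every i. The name `to_csd` is hypothetical.

```python
def to_csd(value: int, width: int) -> list[int]:
    """CSD digits (-1, 0, +1) of a width-bit two's complement value, LSB first."""
    a = value & ((1 << width) - 1)

    def bit(k: int) -> int:
        if k < 0:
            return 0                             # a'_(-1) = 0
        return (a >> min(k, width - 1)) & 1      # a'_W = a'_(W-1)

    digits = []
    gamma_prev = 0                               # gamma_(-1) = 0
    for i in range(width):
        theta = bit(i) ^ bit(i - 1)
        gamma = (1 - gamma_prev) & theta
        digits.append((1 - 2 * bit(i + 1)) * gamma)
        gamma_prev = gamma
    return digits


print(to_csd(7, 4))  # LSB first: 7 = 8 - 1
```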

Example

Step-by-step conversion (N denotes the digit -1):

    001011101111100010101111     # of non-zero terms = 15
    00101110111110001011000N
    0010111011111000110N000N
    00101110111110010N0N000N
    001011110000N0010N0N000N
    010N000N0000N0010N0N000N     # of non-zero terms = 8
CSD Multiplication

• Horner’s rule for precision improvement: this
involves delaying the scaling operations common
to the two partial products, thus increasing accuracy.
• For example, x·2^-5 + x·2^-3 can be implemented as
(x·2^-2 + x)·2^-3 to increase the accuracy.
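The effect can be seen with integer arithmetic, where each right shift truncates. This is a sketch under that assumption (truncating shifts on non-negative integers); the function names `direct` and `horner` are illustrative.

```python
def direct(x: int) -> int:
    """x*2^-5 + x*2^-3 with each term truncated separately."""
    return (x >> 5) + (x >> 3)


def horner(x: int) -> int:
    """(x*2^-2 + x)*2^-3: the common scaling by 2^-3 is applied last."""
    return ((x >> 2) + x) >> 3


x = 31
print(direct(x), horner(x), x * (2 ** -5 + 2 ** -3))  # truncated results vs exact value
```

Delaying the common shift keeps more low-order bits alive during the addition, so the Horner form is never further from the exact value than the direct form for non-negative x.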

Using Horner’s rule for partial product
accumulation to reduce the truncation error.

Reducing the number of additions in summing the
partial sums - Wallace-Tree Based Multiplier

• Principle
– Sum N shifted partial products
– Do the N-input addition efficiently
– Reduce the N-input addition in steps
– Use counters, e.g. carry-save adder (CSA) (3:2
reduction)
• A CSA is simple: it is just a full adder per bit position
– At the end of the array you need to add the two remaining parts
together.
– This takes a fast adder, but you only need one at the
end, not one for each partial product.
Carry-save adder - 3-to-2 reduction
• The Wallace tree multiplier uses logic tricks to
speed up the required addition. It is an adder tree
built from carry-save adders using 3-to-2 reduction

    A B C | C S | No. of 1’s
    0 0 0 | 0 0 | 0
    0 0 1 | 0 1 | 1
    0 1 0 | 0 1 | 1
    0 1 1 | 1 0 | 2
    1 0 0 | 0 1 | 1
    1 0 1 | 1 0 | 2
    1 1 0 | 1 0 | 2
    1 1 1 | 1 1 | 3

• A 1-bit adder provides a 3:2 compression in the number
of bits. The addition of partial products in a column
of an array multiplier may be thought of as totaling up the
number of 1’s in that column, with a carry being passed
to the next column to the left.
Carry-save adder example

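Adding 3 numbers at the same time

Reduces to adding 2 numbers after one 1-bit full adder
delay time.

      1 0 0 1 0 1 0 1
      0 1 1 0 1 0 1 1
      1 0 1 0 1 0 1 0
      ---------------
      0 1 0 1 0 1 0 0      sum bits
    1 0 1 0 1 0 1 1        carry bits (shifted left by one)

Reduces two N-bit additions to one (N+1)-bit addition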
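The example can be checked with bitwise operations: the sum word is the bitwise XOR of the three inputs and the carry word is their bitwise majority, shifted left by one. The function name `carry_save_add` is an assumption for illustration.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 reduction: return (sum_word, carry_word) with a + b + c == sum + carry."""
    sum_word = a ^ b ^ c                              # per-bit full adder sum
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, one column left
    return sum_word, carry_word


s, c = carry_save_add(0b10010101, 0b01101011, 0b10101010)
print(f"{s:08b} {c >> 1:08b}")  # sum bits, carry bits before the shift
```

No carry ripples between bit positions, which is why the delay is one full adder regardless of word length.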

Wallace Tree Multiplier
[Figure: block diagram]
Multiplicand → Partial Product Generator → Partial Products →
Summation Network → two 2n-bit operands → Carry Propagate Adder →
final 2n-bit product
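The flow in the block diagram can be sketched behaviourally as below. Assumptions: unsigned operands, whole words are reduced three at a time (a real Wallace tree reduces each bit column separately), and the final `sum` stands in for the carry-propagate adder. The name `wallace_multiply` is hypothetical.

```python
def wallace_multiply(x: int, y: int, n: int) -> tuple[int, int]:
    """Unsigned n x n multiply; returns (product, number of CSA levels used)."""
    # Partial product generator: one shifted row per multiplier bit.
    rows = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]
    levels = 0
    # Summation network: 3:2 reduce until only two operands remain.
    while len(rows) > 2:
        reduced = []
        for k in range(0, len(rows) - 2, 3):
            a, b, c = rows[k:k + 3]
            reduced += [a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1]
        reduced += rows[len(rows) - len(rows) % 3:]  # rows left over this level
        rows = reduced
        levels += 1
    return sum(rows), levels  # final carry-propagate addition


print(wallace_multiply(200, 123, 8))
```

With this grouping, 8 partial products shrink as 8, 6, 4, 3, 2, so the level count grows roughly logarithmically with n.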


Wallace Tree Example

[Figure: Wallace tree example, [Par00] p130, © Oxford U Press]

Delay = 4 CSA + 1 CLA
Wallace-Tree Based Multiplier
[Figure: Wallace-tree column adders built from full adders (FA). Inputs y0..y5, column carries Ci-1 (in) and Ci (out), outputs C and S]
Use 4:2 counter instead of 3:2 reduction
• Use two 3:2 reduction circuits (CSAs) to build a 4:2 counter
• Generates two carries, one internal and one external
• The internal carry is sent to the next higher digit
• Advantage: more regular layout

[Figure: a 4:2 counter built from two stacked 3:2 CSAs, and a row of such counters]
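A one-bit sketch of this construction, assuming the first full adder takes three of the four inputs and the second combines its sum with the fourth input and the carry coming from the neighbouring column. The names `full_adder` and `compressor_4_2` are illustrative.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 counter: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a: int, b: int, c: int, d: int, cin: int) -> tuple[int, int, int]:
    """4:2 counter from two full adders: returns (sum, carry, cout)."""
    s1, cout = full_adder(a, b, c)       # cout goes to the next higher column
    s, carry = full_adder(s1, d, cin)    # cin arrives from the next lower column
    return s, carry, cout


# The five inputs always equal sum + 2*(carry + cout).
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

In this arrangement cout does not depend on cin, so the carry between columns does not ripple along the row.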
Example 4*4 multiplier

Example 8*8 multiplier

Sign extension with Booth multiplier
• When a partial product is negative, we
need to do sign extension.
• If we do it just by copying the sign bit, the delay
suffers because the fanout of that bit can
be large.
• We can do some tricks
– Pre-add the triangle of 1’s
– Then clear out the 1’s by adding 1 to the row when the sign is 0:
a row of 1’s plus the complement of S gives
0 0 0 0 0 0 0 0 (S=0) or 1 1 1 1 1 1 1 1 (S=1)
– Now you only need to add a few bits
– Adding these few bits is equivalent to
complete sign extension

[Figure: the triangle of 1’s (11111111, 111111, 1111, 11) is pre-added into the constant 10101011; each row then keeps only a few S-dependent bits]
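The identity behind the trick can be checked numerically: a k-bit field filled with copies of the sign bit S equals a field of all 1's plus the complement of S, with the carry out of the field discarded. This is a sketch of that identity only, not of the full per-row bit arrangement; the function name is an assumption.

```python
def sign_extension_via_ones(s: int, k: int) -> int:
    """k-bit field of copies of sign bit s, computed as all-ones plus NOT s."""
    ones = (1 << k) - 1
    return (ones + (1 - s)) & ones   # carry out of the field is dropped


print(f"{sign_extension_via_ones(0, 8):08b}", f"{sign_extension_via_ones(1, 8):08b}")
```

Because the all-ones fields are constants, they can be summed once in advance, leaving only the complemented sign bits to be added per row.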
Example to illustrate
[Figure: sign-extension example in three stages. The triangle of sign bits s is rewritten as rows of 1’s with one s-dependent bit per row, and then collapsed to a few bits per row (patterns “s s s” and “1 s”)]
Implementing Large multiplier
using several smaller multipliers
• If we already have n*n-bit multipliers, we can use them
to implement multipliers of larger bit width
• E.g. a 2n*2n-bit multiplier can be implemented with four
n*n-bit multipliers

$$X \cdot Y = (X_H \cdot 2^n + X_L)(Y_H \cdot 2^n + Y_L) = X_H Y_H \cdot 2^{2n} + (X_H Y_L + Y_H X_L) \cdot 2^n + X_L Y_L$$

• XH, XL: most and least significant halves of X
• YH, YL: most and least significant halves of Y
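The decomposition can be sketched as follows, assuming unsigned operands. The name `multiply_2n` and the `mul` parameter (a stand-in for the existing n*n-bit multiplier block) are hypothetical.

```python
def multiply_2n(x: int, y: int, n: int, mul=lambda a, b: a * b) -> int:
    """Unsigned 2n x 2n multiply built from four n x n multiplies."""
    mask = (1 << n) - 1
    xh, xl = x >> n, x & mask   # most / least significant halves of X
    yh, yl = y >> n, y & mask   # most / least significant halves of Y
    return (mul(xh, yh) << (2 * n)) + ((mul(xh, yl) + mul(yh, xl)) << n) + mul(xl, yl)


print(hex(multiply_2n(0xAB, 0xCD, 4)))
```

The shifts only align the four 2n-bit products by weight; the cost lies in the additions that merge them.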
Implementing Large multiplier
using several smaller multipliers
• Use four n*n-bit multipliers to generate four partial products of 2n
bits each, align them correctly and add
• n least significant bits: already there, no further
calculation needed
• 2n center bits: use a 2n-bit CSA to
calculate
• n most significant bits: use a CPA
to merge with the outputs
generated by the 2n-bit CSA for
the 2n center bits

[Figure: operands XH XL and YH YL; partial products XL*YL, XH*YL, XL*YH and XH*YH aligned by weight. XH*YH and XL*YL sit side by side; XH*YL and XL*YH overlap the center 2n bits]
Multipliers - Summary

• Optimization goals differ from those of the binary adder

• Once again: identify the critical path

• Other possible techniques
- Logarithmic versus linear (Wallace tree multiplier)
- Data encoding (Booth)
- Pipelining
FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION
Floating-point units
• More complex operation, takes more time
• Accessed less often
• Often designed outside the normal ALU
• Co-processor
• Floating point representation
• Data = (-1)^sign * 0.1Fraction * 2^exp
• Normalization:
– ½ <= Data < 1 (Exp = 0, Sign = 0)
– First digit after the binary point is one
– No need to represent it
• IEEE standard (double precision): sign 1 bit, exponent 11 bits,
fraction 52 bits => total 64 bits
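The three fields of the 64-bit format can be inspected from Python with the standard `struct` module. One caveat for this sketch: IEEE 754 double precision stores a biased exponent (bias 1023) and normalizes to 1.Fraction, which differs from the 0.1Fraction convention written above. The function name `double_fields` is an assumption.

```python
import struct


def double_fields(x: float) -> tuple[int, int, int]:
    """Split a Python float into (sign, biased exponent, fraction) bit fields."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))  # 64 raw bits
    sign = raw >> 63                    # 1 bit
    exponent = (raw >> 52) & 0x7FF      # 11 bits
    fraction = raw & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, fraction


print(double_fields(1.0), double_fields(-2.0))
```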

Floating Point Addition

• Align operands
– Check exponents
– Shift data
• Add fractional bits
– Integer addition
• Normalization
– Shift data
– Increment or decrement exponents
• Rounding data

Floating point adder
[Figure: floating point adder block diagram. Inputs: +/- control and the sign, exponent and mantissa of A and B. Blocks: sign unit, exponent difference, shift/align, mantissa adder, normalize, round, exponent update. Outputs: sign, exponent and mantissa of C]
Floating Point Multiplication

• Add exponents
– 11 bit addition
• Multiply the mantissa
– Integer multiplication
• Normalization
– Shift data (at most by one)
– Decrement exponent
• Rounding data

Floating Point Multiplier
[Figure: floating point multiplier block diagram. Inputs: sign, exponent and mantissa of A and B. Blocks: XOR of the signs, exponent adder, mantissa multiplier, normalize, round, exponent update. Outputs: sign, exponent and mantissa of C]
Other Data Operators

• Parity generators, using XOR gates

$$\text{Parity} = A_0 \oplus A_1 \oplus A_2 \oplus \cdots \oplus A_n$$
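As a small sketch, the XOR chain maps directly onto a reduction over the bits. The function name `parity` and the list-of-bits input are illustrative choices.

```python
from functools import reduce
from operator import xor


def parity(bits: list[int]) -> int:
    """XOR of all bits: 1 when the number of 1's is odd."""
    return reduce(xor, bits, 0)


print(parity([1, 0, 1, 1]))
```

In hardware the same function is usually built as a tree of XOR gates rather than a chain, which gives logarithmic instead of linear depth.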

Comparator

• A = B, A > B, A < B

High speed comparator
• A single-cycle comparator based on the priority-
encoding algorithm and dynamic circuit design
techniques
• 4 steps:
1. XOR gates determine whether each pair of corresponding bits of
the two numbers is equal or not.
2. A priority encoder sets the most significant unequal bit
of the result from step 1 to ‘1’ and resets all other bits to ‘0’.
3. The result of step 2 is ANDed with each of the two input numbers.
4. All the bits of each result of step 3 are ORed together to
determine which number is greater.
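The four steps can be followed in a behavioural sketch, assuming unsigned inputs. The function name `compare` and the string results are illustrative; `int.bit_length` plays the role of the priority encoder.

```python
def compare(a: int, b: int) -> str:
    """Compare two unsigned integers using the XOR / priority-encode / AND / OR steps."""
    diff = a ^ b                                # step 1: 1 where the bits differ
    if diff == 0:
        return "A = B"
    one_hot = 1 << (diff.bit_length() - 1)      # step 2: keep only the MSB that differs
    a_masked = a & one_hot                      # step 3: AND with each input
    b_masked = b & one_hot
    a_greater = a_masked != 0                   # step 4: OR of all bits of each result
    b_greater = b_masked != 0
    return "A > B" if a_greater and not b_greater else "A < B"


print(compare(0b1011, 0b1001), compare(5, 9), compare(7, 7))
```

At the most significant position where the numbers differ, exactly one of them has a 1, and that number is the larger one.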

Dynamic Priority Encoder

Critical path: 7 transistors because of the NAND gate implementation

Wide bit width comparator – 64 bits

• Hierarchical, multi-stage
• Phase pipelining to achieve single-clock operation