4-Distributed Arithmetic SD

Distributed arithmetic is a technique for implementing digital multipliers using lookup tables (LUTs) in field programmable gate arrays (FPGAs). It works by performing multiplication one bit at a time and storing precomputed partial products in a lookup table. This allows for efficient implementation of filters with multiple taps. Distributed arithmetic can trade off processing speed, which is slower due to bit serial operations, for reduced logic area requirements compared to parallel multipliers. It is commonly used to implement finite impulse response (FIR) filters in FPGAs.
Distributed Arithmetic

Dr Sumam David S.
Dept. of E&C, NITK Surathkal

Courtesy for slides – Xilinx Professor’s Workshop Resources


Objective

• Distributed arithmetic
• What?
• Where?
• How?
What is DA?

• Multiplication using LUT
• Used to implement multipliers in LUT-rich FPGAs
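Two's Complement Multiplication

• One bit at a time: each data bit selects either zero or the coefficient, the selection is weighted by the bit position, and the contribution of the sign bit is subtracted rather than added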
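The bit-at-a-time view can be written out as a short derivation. This is a sketch in standard two's complement notation; the symbols x, c, b_n and N are introduced here for illustration and do not appear on the slide.

```latex
% x is an N-bit two's complement sample with bits b_{N-1} ... b_0 (b_{N-1} is the sign bit)
x = -b_{N-1}\,2^{N-1} + \sum_{n=0}^{N-2} b_n\,2^{n}, \qquad b_n \in \{0, 1\}

% Multiplying by a constant coefficient c distributes over the bits:
% each term c b_n is either 0 or c, so it can be looked up instead of multiplied
c\,x = -\,(c\,b_{N-1})\,2^{N-1} + \sum_{n=0}^{N-2} (c\,b_n)\,2^{n}

% Example with c = -7 and x = 0111_2 = 7 (N = 4)
c\,x = -(0)\,2^{3} + (-7)\,2^{2} + (-7)\,2^{1} + (-7)\,2^{0} = -49
```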


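SDA 1-Tap FIR Filter

[Figure: the N-bit-wide sample data X0 passes through a parallel-to-serial converter and addresses the partial product ROM one bit (A0) at a time; the ROM output feeds the scaling accumulator (+/-, Z-1).]

A0   LUT contents
0    00000...0
1    C0

LUT contains two locations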
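A minimal Python sketch of the 1-tap structure may help here. It assumes the sample fits in `n_bits` two's complement bits, and it models the scaling accumulator by weighting each LUT output by its bit position instead of shifting the accumulator as hardware typically would. The function names `to_bits` and `sda_one_tap` are illustrative, not taken from the slides.

```python
def to_bits(value, n_bits):
    """Return the n_bits two's complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n_bits)]


def sda_one_tap(x0, c0, n_bits):
    """Multiply sample x0 by coefficient c0 using one data bit per clock cycle."""
    lut = [0, c0]   # two locations: address 0 -> 0, address 1 -> C0
    acc = 0         # accumulator register (the Z^-1 element in the figure)
    for i, a0 in enumerate(to_bits(x0, n_bits)):
        partial = lut[a0] << i      # weight the looked-up value by 2^i
        if i == n_bits - 1:
            acc -= partial          # sign bit of the sample: subtract (the "-" of +/-)
        else:
            acc += partial
    return acc


print(sda_one_tap(7, -7, 4))
```

With the values used later in the 2-tap example (X0 = 7, C0 = -7), the call above reproduces the product C0 x X0.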


Distributed Arithmetic for a 2-Tap Filter

• Partial products of equal weight are added together before being summed to the next higher partial product weight
• Create a look-up table of summed partial products

Bit weights: -2^3  2^2  2^1  2^0

  C0 = 1001 (-7)        C1 = 0110 ( 6)
x X0 = 0111 ( 7)      x X1 = 0101 ( 5)

Data bit weight   Tap 0 + Tap 1   Sum    Weighted value
2^0               1001 + 0110     1111   (-1)
2^1               1001 + 0000     1001   (-14)
2^2               1001 + 0110     1111   (-4)
-2^3              0000 + 0000     0000   (0)

C0 x X0 = 11001111 (-49)    C1 x X1 = 00011110 (30)    Total = 11101101 (-19)

[Legend in the original slide: highlighted bits = sign extension] (Serial-Data / Tap-Parallel Multiply)
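The worked example above can be checked with a short script. This is a sketch that uses the slide's values (C0 = -7, C1 = 6, X0 = 7, X1 = 5, 4-bit data); the helper names `to_bits` and `da_two_tap` are made up for illustration.

```python
def to_bits(value, n_bits):
    """Return the n_bits two's complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n_bits)]


C0, C1 = -7, 6      # coefficients from the slide
X0, X1 = 7, 5       # data samples from the slide
N_BITS = 4

# Look-up table of summed partial products, keyed by (bit of X1, bit of X0)
LUT = {(0, 0): 0, (0, 1): C0, (1, 0): C1, (1, 1): C0 + C1}


def da_two_tap(x0, x1, n_bits=N_BITS):
    """Return per-bit rows (bit index, LUT entry, weighted value) and the total."""
    bits0, bits1 = to_bits(x0, n_bits), to_bits(x1, n_bits)
    rows, total = [], 0
    for i in range(n_bits):
        entry = LUT[(bits1[i], bits0[i])]
        # The most significant bit carries weight -2^(n_bits - 1)
        weight = -(1 << i) if i == n_bits - 1 else (1 << i)
        rows.append((i, entry, entry * weight))
        total += entry * weight
    return rows, total


rows, total = da_two_tap(X0, X1)
for bit, entry, weighted in rows:
    print(f"bit {bit}: LUT entry = {entry:3d}, weighted = {weighted:4d}")
print("total =", total)
```

The weighted values come out as -1, -14, -4 and 0, and the total equals C0 x X0 + C1 x X1 = -19, matching the slide.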


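SDA 2-Tap FIR Filter

[Figure: the N-bit-wide sample data X0 and X1 are each serialized one bit per clock to form the address lines A0 and A1 of the partial product ROM, whose output feeds the scaling accumulator (+/-, Z-1).]

A1 A0   LUT contents
00      0000...0
01      C0
10      C1
11      C0 + C1

LUT contains all possible sums of the partial products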
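The table generalizes to any number of taps: the entry at each address is the sum of the coefficients whose address bit is 1. The sketch below assumes address bit k (A_k) corresponds to coefficient C_k, as in the table above; `build_da_lut` is a hypothetical helper name.

```python
def build_da_lut(coeffs):
    """Return a list where lut[address] is the sum of coeffs[k] for every set bit k."""
    n_taps = len(coeffs)
    lut = []
    for address in range(1 << n_taps):      # 2^n_taps locations
        lut.append(sum(c for k, c in enumerate(coeffs) if (address >> k) & 1))
    return lut


# Two taps with the coefficients of the worked example: addresses 00, 01, 10, 11
print(build_da_lut([-7, 6]))
```

For two taps this yields the four entries 0, C0, C1 and C0 + C1; each added tap doubles the number of locations.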
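SDA 4-Tap FIR Filter

[Figure: the N-bit-wide sample data X0 to X3 are each serialized one bit per clock to form the address lines A0 to A3. Each address bit selects either 0000...0 or its coefficient (C0 to C3); the selected values are added to form the partial product ROM output, which feeds the scaling accumulator (+/-, Z-1).]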
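As a rough behavioural model of the 4-tap structure, the sketch below runs a sample stream through a delay line and evaluates each output bit-serially from a 16-entry LUT. It assumes all samples fit in `n_bits` two's complement bits and that the delay line starts at zero; the names `build_da_lut`, `da_dot` and `da_fir` are illustrative only.

```python
def build_da_lut(coeffs):
    """lut[address] = sum of coeffs[k] for every k whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << len(coeffs))]


def da_dot(samples, lut, n_bits):
    """Inner product of samples with the coefficients baked into lut, one bit per step."""
    acc = 0
    for i in range(n_bits):
        # Bit i of every sample forms the ROM address (A0 from samples[0], and so on)
        address = 0
        for k, x in enumerate(samples):
            address |= ((x >> i) & 1) << k
        term = lut[address] << i
        acc += -term if i == n_bits - 1 else term   # sign bits are subtracted
    return acc


def da_fir(signal, coeffs, n_bits):
    """Filter signal with an FIR whose taps are coeffs, using distributed arithmetic."""
    lut = build_da_lut(coeffs)
    delay = [0] * len(coeffs)           # delay[k] holds x[n - k]
    out = []
    for x in signal:
        delay = [x] + delay[:-1]
        out.append(da_dot(delay, lut, n_bits))
    return out


print(da_fir([3, -5, 7, -8, 0, 2], [1, -2, 3, 4], n_bits=4))
```

No multiplier appears anywhere in the loop: each output costs `n_bits` look-ups and additions regardless of the number of taps.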
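SDA 8-Tap FIR Filter

[Figure: the N-bit-wide sample data X0 to X7 are each serialized one bit per clock. X0 to X3 drive address lines A0 to A3 of one partial product ROM and X4 to X7 drive A0 to A3 of a second partial product ROM; a pre-adder sums the two ROM outputs ahead of the scaling accumulator (+/-, Z-1).]

Each 4-input LUT contains all possible sums of the partial products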
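Splitting the taps across several small LUTs, as the 8-tap figure does, can be sketched as follows. The grouping into 4-input LUTs follows the slide; the function names and the `lut_inputs` parameter are assumptions made for illustration.

```python
def build_da_lut(coeffs):
    """lut[address] = sum of coeffs[k] for every k whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << len(coeffs))]


def da_dot_partitioned(samples, coeffs, n_bits, lut_inputs=4):
    """Bit-serial inner product using one small LUT per group of lut_inputs taps."""
    luts = [build_da_lut(coeffs[s:s + lut_inputs])
            for s in range(0, len(coeffs), lut_inputs)]
    acc = 0
    for i in range(n_bits):
        partial = 0
        for g, lut in enumerate(luts):
            group = samples[g * lut_inputs:(g + 1) * lut_inputs]
            address = 0
            for k, x in enumerate(group):
                address |= ((x >> i) & 1) << k
            partial += lut[address]     # adder combining the ROM outputs
        term = partial << i
        acc += -term if i == n_bits - 1 else term
    return acc


coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
samples = [12, -128, 127, -1, 0, 55, -77, 100]
print(da_dot_partitioned(samples, coeffs, n_bits=8))
```

For 8 taps, two 4-input LUTs hold 2 x 16 = 32 words, whereas a single 8-input LUT would need 2^8 = 256; the price is one extra adder.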
Xilinx DA FIR Performance

[Figure: two plots against Filter Length (Taps), 0 to 250. Left plot: Sample Rate (MSPS), 0 to 60. Right plot: Performance (MMACs/s), 0 to 6000. Curve labels: Single MAC, Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR.]

fclk = 200 MHz for both processor and FPGA
B = data sample precision for FPGA
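To get a feel for why sample precision B matters more than filter length for a serial DA filter, the following back-of-the-envelope model may be useful. It is an idealisation and not the data behind the plots: it assumes a serial DA filter needs exactly B clocks per sample whatever the tap count, and that a MAC-based filter needs one clock per tap per MAC unit. All function names are hypothetical.

```python
def serial_da_sample_rate_msps(fclk_mhz, sample_bits):
    """Idealised serial DA: one sample takes sample_bits clocks, independent of taps."""
    return fclk_mhz / sample_bits


def mac_sample_rate_msps(fclk_mhz, n_taps, n_macs=1):
    """Idealised MAC-based FIR: n_taps multiply-accumulates shared among n_macs units."""
    return fclk_mhz * n_macs / n_taps


def mmacs_per_second(sample_rate_msps, n_taps):
    """Equivalent multiply-accumulate throughput for a filter of n_taps taps."""
    return sample_rate_msps * n_taps


FCLK_MHZ = 200      # clock stated on the slide
for taps in (10, 50, 100, 250):
    da = serial_da_sample_rate_msps(FCLK_MHZ, 8)
    mac = mac_sample_rate_msps(FCLK_MHZ, taps)
    print(f"{taps:3d} taps: DA (B=8) {da:.1f} MSPS, single MAC {mac:.1f} MSPS")
```

Under these assumptions the DA sample rate stays flat as taps are added, so its equivalent MAC throughput grows with filter length, while the MAC-based sample rate falls.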
Trade Clock Cycles for Logic Area

[Figure: a scale running from Serial-DA (20 Ms/s) through multiple bits per clock cycle to Parallel-DA (160 Ms/s), with one hardware block per group of bits of an 8-bit sample b7 to b0.]

• Over-sampling = 8: The sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample
• Over-sampling = 4: The sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample
• Over-sampling = 2: The sample is serialized and processed 4 bits per clock cycle
• Over-sampling = 1: The sample is processed in parallel, 8 bits per clock cycle
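The trade-off can be modelled by letting several copies of the LUT work on different bits in the same clock cycle. The sketch below counts clock cycles per sample for each setting. The 160 MHz clock used in the usage lines is an assumption inferred from the slide's 20 Ms/s figure for 8-bit serial operation; the function names are illustrative.

```python
def build_da_lut(coeffs):
    """lut[address] = sum of coeffs[k] for every k whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << len(coeffs))]


def da_dot_multibit(samples, coeffs, n_bits, bits_per_clock):
    """Return (result, clock_cycles) when bits_per_clock data bits are handled per cycle."""
    if n_bits % bits_per_clock != 0:
        raise ValueError("bits_per_clock must divide n_bits")
    lut = build_da_lut(coeffs)
    acc, cycles = 0, 0
    for base in range(0, n_bits, bits_per_clock):
        # One clock cycle: bits_per_clock LUT copies operate side by side
        for i in range(base, base + bits_per_clock):
            address = 0
            for k, x in enumerate(samples):
                address |= ((x >> i) & 1) << k
            term = lut[address] << i
            acc += -term if i == n_bits - 1 else term
        cycles += 1
    return acc, cycles


def sample_rate_msps(fclk_mhz, n_bits, bits_per_clock):
    """Samples per second (in millions) when each sample takes n_bits / bits_per_clock clocks."""
    return fclk_mhz * bits_per_clock / n_bits


coeffs, samples = [3, -1, 4, 1], [12, -128, 127, -1]
for bpc in (1, 2, 4, 8):
    result, cycles = da_dot_multibit(samples, coeffs, 8, bpc)
    rate = sample_rate_msps(160, 8, bpc)    # 160 MHz is an assumed clock
    print(f"{bpc} bit(s)/clock: result {result}, {cycles} cycles, {rate:.0f} Ms/s")
```

Every setting produces the same result; only the cycle count and the amount of replicated hardware change.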
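Conclusion

• Efficiency of computation: multiplications are replaced by LUT look-ups and a scaling accumulator
• Slow, as it is bit serial
• Memory requirements: the LUT needs one location per combination of address bits (2, 4, and 16 locations for 1, 2, and 4 inputs)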
References

• The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com
