FPGAIntroduction Xilinx
Introduction
• In the last 20 years the majority of DSP applications have been enabled
by DSP processors:
• But the most recent technology platform for high speed DSP
applications is the Field Programmable Gate Array (FPGA)
This primer course is all about DSP with Xilinx FPGAs!
Hence a DSP algorithm or problem is often specified in terms of its MAC (multiply-accumulate) requirements. In
particular, when comparing two algorithms that perform the same job, the one that needs fewer MACs is clearly
the “cheaper” and therefore better choice. However this implies some assumptions. One is that the required
MACs are the same in each case - but surely a multiply is a multiply! Well, yes: in traditional DSP processor
based situations we are likely to be using, say, a 16 bit device which will process 16 bit inputs, using 16 bit
digital filter coefficients etc. With FPGAs this constraint is removed - we can use as many, or as few, bits as
are required. Therefore we can choose to optimise and schedule DSP algorithms in a completely different way.
[Figure: circuit board with a DSP processor, general purpose input/output bus, and amplifiers/filters]
• Since around 1998 the evolution of FPGAs into the DSP market has
been sustained by classic technology progress such as the ever
present Moore’s law.
Anyone who has purchased a new laptop knows the feeling. If you just wait, then in the next quarter you will get
the new model with integrated WiFi or WiMax and a faster processor. Of course wait another quarter and in 6
months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such
is technology.
DSP for FPGAs is just the same. If you wait another year it’s likely the next Xilinx device will feature more
pre-packaged algorithms for precisely what you want to do. And they will be easier to work with - higher level
design tools, design wizards and so on.
So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software
radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?
Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like
all technologies you still need to know how it works if you really want to use it.
FPGAs: A “Box” of DSP blocks 1.3
• However, the high level concept is simple: take the blocks and build it:
[Figure: FPGA as a box of DSP blocks with clocks and input/output; design flow of design, verify, place and route]
Yes in both cases, but a toolset such as Simulink, System Generator and the ISE tools makes the design flow
very accessible and we will find both FPGA and DSP engineers designing advanced DSP systems.
There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows,
saturates etc.)? Do the latencies or delays used allow the integrity to be maintained? However the tools will
give us lots of support for this.
For the FPGA, can we clock at a high enough rate? Does the device place and route? What device do we need,
and how efficient is the implementation (just like compilers)?
As higher level components become available (such as the Xilinx DSP48 slice, which allows a complete FIR to be
implemented), issues such as overflow, numerical integrity and so on are taken care of.
Binary Addition and Multiply 1.4
• The bottom line for DSP is multiplies and adds - and lots of them!
[Figure: adding two N bit numbers gives an N+1 bit result]
Within traditional DSP processors this wordlength growth is well known and catered for.
For a typical DSP filtering type operation we may require to take, say, an array of 24 bit numbers and multiply
by an array of another 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two
48 bit numbers together, and they both just happen to be large positive values, then the result could be a 49 bit
number. Now if we add many 48 bit numbers together (and they all just happen to be large positive values),
then the final result may have a word growth of quite a few bits.
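The word growth described above can be estimated with a few lines of Python. This is only a sketch under the usual worst case assumption that every term is as large as its wordlength allows; the function names are hypothetical and not part of any Xilinx tool.

```python
import math


def product_bits(a_bits: int, b_bits: int) -> int:
    """Worst case wordlength of the product of an a_bits and a b_bits number."""
    return a_bits + b_bits


def accumulator_bits(term_bits: int, num_terms: int) -> int:
    """Worst case wordlength needed to sum num_terms values of term_bits each."""
    # Each doubling of the number of terms can add one more bit of growth.
    return term_bits + math.ceil(math.log2(num_terms))


prod = product_bits(24, 24)
print(prod)                          # 48
print(accumulator_bits(prod, 2))     # 49
print(accumulator_bits(prod, 256))   # 56
```

On an FPGA the accumulator can be sized to exactly this width rather than to a fixed processor wordlength.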
The “Cost” of Addition 1.5
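[Figure: 4 bit ripple carry adder. Four full adders (Σ) add A3..A0 and B3..B0, passing carries C3..C0 along the
chain with a carry in of ‘0’, to produce S4..S0 (MSB on the left, LSB on the right)]
      A3 A2 A1 A0
    + B3 B2 B1 B0
      C3 C2 C1 C0   (carries)
                0   (carry in)
   S4 S3 S2 S1 S0
Each full adder adds two bits + one carry in bit, to produce sum and carry out.
Example:
     1011   +11
   + 1101   +13
    11000   +24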
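The full adder chain can be modelled bit by bit in software. The sketch below is illustrative only (the function names are assumptions, and it models the logic, not the timing) and reproduces the 11 + 13 = 24 example.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add two bits plus a carry in bit; return (sum bit, carry out bit)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_carry_add(a: int, b: int, n_bits: int = 4) -> int:
    """Add two n_bits numbers with a chain of full adders, LSB first."""
    carry = 0  # carry in to the LSB stage is '0'
    total = 0
    for i in range(n_bits):
        sum_bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= sum_bit << i
    # The final carry out becomes the extra (N+1)th result bit.
    return total | (carry << n_bits)


print(format(ripple_carry_add(0b1011, 0b1101), "05b"))  # 11000
```

The cost of an N bit addition is therefore N full adders, and the carry has to ripple through all N stages.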
The “Cost” of Multiply 1.6
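[Figure: 4 bit array multiplier producing product bits p7..p0; each cell passes b through (bout = b) and forms
the partial product bit z = a.b]
[Figure: DSP processor block diagram with X, Y and P memories (16k), data and address registers, instruction
decoder, arithmetic logic unit and “parallel” multiplier]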
So what if you only require 8 bit data and 8 bit operators? Fine - this 24 bit processor can still be used, however
the 8 bit numbers will be treated as 24 bit numbers, meaning that 16 bits are essentially not used.
Clearly it is not very efficient to use the 24 bit device for an 8 bit problem.
Of course if you know you have an “8 bit” problem, then at the design outset you should aim to use an 8 bit
processor or microcontroller. But what if you have a 17 bit problem? Using the 24 bit device means that we still
have 7 “wasted” bits of resolution.
If you have an application where at different places in the calculation you require different wordlengths then the
DSP processor you choose must be capable of working with the longest wordlength that occurs. This simply
means that when processing the shorter wordlength data, the device is NOT being used efficiently.
Note that the general speed-area complexity of an N by N bit multiplier is one quarter of the general speed-area
complexity of a 2N by 2N multiplier. Therefore if using a 16 bit processor to solve an 8 bit problem the
arithmetic processing silicon is being used at 25% MAC efficiency. If using a 24 bit processor for an 8 bit
problem then we are at around 11% (1/9th) MAC efficiency.
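The efficiency figures quoted above follow from the multiplier cost growing with the square of the wordlength. A small sketch, assuming exactly that quadratic cost model (the function name is hypothetical):

```python
def mac_efficiency(problem_bits: int, processor_bits: int) -> float:
    """Fraction of the multiplier silicon doing useful work.

    Assumes an N by N multiplier costs in proportion to N squared.
    """
    return (problem_bits / processor_bits) ** 2


print(mac_efficiency(8, 16))             # 0.25
print(round(mac_efficiency(8, 24), 3))   # 0.111
```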
So here’s one of the differences between “traditional” and FPGA enabled. With FPGAs you can choose your
wordlength. Not only are you not constrained to the traditional 8, 16 or 32, you can choose whatever is required
whether it be 5, 9 or 21! Moreover you have the flexibility to change! At one part of your FPGA you might be
using 7 bits resolution and at another, you might be using 19 bits. You are in charge! So you’d better know what
is going on!
Using a DSP Processor 1.8
• If the clock speed is 200MHz, then a peak rate of 200 million MACs per second is
achievable.
The concurrency inside any DSP processor is of course best suited to DSP algorithms. Many DSP algorithms
require that data is read from memory and multiplied with filter weights or coefficients also stored in memory.
Therefore if we can read data from two memories (i.e. the distinct X and Y memories for this processor) at the
same time as doing a MAC, then only one clock cycle is required for fetch and multiply and store.
Later in the course we will discuss again the DSP requirements of divide and square root. Both operations are
in fact quite rare in DSP, however we do note that they are found in some communications applications, such
as the QR algorithm for beamforming or equalisation, and in communications where we often “rotate” a
constellation (requiring cosine and sine, in the form $x / \sqrt{x^2 + y^2}$).
Assembly Language Code 1.9
• A digital FIR filter requires two lines of code to specify the computation
required (need also to read in data samples, and output samples):
$$y(k) = \sum_{n=0}^{N-1} w_n \, x(k-n)$$
(N is the number of weights; the code below uses N = 5.)
REP 5
MAC X0,Y0,A X(R0)+,X0 Y(R4)+,Y0
[Figure: FIR filter with input x(k) and output y(k)]
Declare variables
int weights[5], x_data[5], i; /* declare arrays & i*/
long Y_k, X_k_input; /* input/output variables*/
As part of the design flow, we could then cross compile the C code to assembler.
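The FIR sum maps directly onto a loop of multiply-accumulates, one per weight. The following is a minimal Python sketch rather than the course's own code: the function names and the example weights are hypothetical, and the delay line is modelled as a plain list.

```python
def fir_output(weights: list[int], x_history: list[int]) -> int:
    """Compute y(k) = sum over n of w_n * x(k-n); x_history[n] holds x(k-n)."""
    acc = 0
    for w, x in zip(weights, x_history):
        acc += w * x  # one multiply-accumulate (MAC) per weight
    return acc


def fir_filter(weights: list[int], samples: list[int]) -> list[int]:
    """Run the filter over a sequence of input samples."""
    history = [0] * len(weights)  # delay line, initially empty
    outputs = []
    for x in samples:
        history = [x] + history[:-1]  # shift in the newest sample
        outputs.append(fir_output(weights, history))
    return outputs


weights = [1, 2, 3, 2, 1]  # hypothetical 5 weight filter
print(fir_filter(weights, [1, 0, 0, 0, 0, 0]))  # [1, 2, 3, 2, 1, 0]
```

Feeding in a unit impulse returns the weights themselves, which is a quick sanity check on any FIR implementation.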
The Gate Array (GA) 1.10
Early gate array design flow would be design, simulate/verify, device production and test.
From GA to FPGA
However simple gate arrays, although very generic, were used by many different users for similar systems,
for example to implement two level logic functions, flip-flops and registers and perhaps addition and
subtraction functions.
For a GA once a layer(s) of metal had been laid on a device - that’s it! No changes, no updates, no fixes.
So then we move to field programmable gate arrays. Two key differences between these and gate arrays:
• They can be reprogrammed in the “field”, i.e. the logic specified is changeable.
• They are no longer composed of just NAND gates, but of a carefully balanced selection of multi-input logic,
flip-flops, multiplexors and memory.
Generic FPGA Architecture 1.11
[Figure: generic FPGA architecture with I/O blocks around the edge and column interconnects]
If you were designing a slice from first principles you might find the components below would give a very
versatile set of logic with which to build large arithmetic and DSP systems:
[Figure: logic slice containing LUTs, select MUX logic, cascade/carry logic, flip-flops and interconnects]
Early uses of FPGAs were for implementation of general sequential and combinational digital logic functions
(counters, MUXs, etc.). More recently FPGAs have been used for DSP. Specifically we find FPGAs used for the
front end of many digital or software radio systems where billions of MACs per second are required.
Using FPGA - Rethinking DSP Design 1.12
• Think very fast - Current data clocking rates of 200MHz and more are
achievable now.
• Think DSP “tricks” - we will be using some “simple” (i.e. low FPGA
cost) filters - CIC, difference filters, moving average.
• Think algorithms with square root and divide operations, which are
traditionally avoided for conventional DSP processors.
DSP “tricks”
If we can constrain certain multiplications etc to be 1, 0 or -1, then the cost is negligible compared to a full N by
N bit multiply. Similarly where possible, if we constrain multiplier values in filters etc to be powers of 2 (2, 4, 8
etc) then we have multiplies that can be implemented as shifts.
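A short sketch of these two tricks, with hypothetical function names. In hardware a shift is just wiring and a multiply by 1, 0 or -1 is at most a negation, which is why neither needs a multiplier array.

```python
def multiply_by_power_of_two(x: int, shift: int) -> int:
    """Multiply x by 2**shift using a left shift instead of a multiplier."""
    return x << shift


def multiply_by_ternary(x: int, coeff: int) -> int:
    """Multiply x by a coefficient constrained to -1, 0 or 1."""
    if coeff not in (-1, 0, 1):
        raise ValueError("coefficient must be -1, 0 or 1")
    if coeff == 0:
        return 0
    return x if coeff == 1 else -x


print(multiply_by_power_of_two(13, 3))  # 104, i.e. 13 * 8
print(multiply_by_ternary(13, -1))      # -13
```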
Oversampling Strategies
We can trade off sampling rate and wordlength. Oversampling by 4 gives one more bit of Nyquist band
resolution. For example, a signal sampled at 100kHz with 10 bits resolution has the same baseband
resolution (from 0 to 50kHz) as one sampled at 400kHz with 9 bits, or at 1600kHz with 8 bits. Therefore if
appropriate we can decrease the wordlength at the expense of increased sampling rate.
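The rule of one bit per factor of 4 corresponds to half a bit per doubling of the sampling rate. The sketch below assumes the usual model in which quantisation noise is white and spread evenly up to half the sampling rate; the function names are hypothetical.

```python
import math


def bits_gained(oversampling_ratio: float) -> float:
    """Extra bits of baseband resolution from oversampling (1 bit per factor of 4)."""
    return 0.5 * math.log2(oversampling_ratio)


def equivalent_wordlength(base_bits: float, base_rate_hz: float, new_rate_hz: float) -> float:
    """Wordlength at new_rate_hz giving the same baseband resolution as base_bits at base_rate_hz."""
    return base_bits - bits_gained(new_rate_hz / base_rate_hz)


print(equivalent_wordlength(10, 100e3, 400e3))    # 9.0
print(equivalent_wordlength(10, 100e3, 1600e3))   # 8.0
```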
Undersampling Strategies
We can perform direct digital downconversion using undersampling.
• The power of FPGAs for DSP is primarily in their low level simplicity
on which to build high level complexity.
• With a typical FPGA logic block we can produce one or more FAs
(either from available logic or via look-up table approaches).
In the following sequence of high level designs we want to demonstrate how this simple FA can be used to
produce a powerful DSP digital filter, potentially running at 200 Msamples/second. The slides
perhaps do not present exactly how a custom DSP digital filter would be produced. However the design will
allow us to demonstrate the difference between data rates and logic clock rates, and strategies for reducing
costs by efficient design.
The design techniques associated with implementing multipliers and adders and so on, are probably well known
to ASIC engineers, but probably not well known to DSP engineers.
FPGA - 8 Bit Parallel Adder 1.14
[Figure: 8 bit parallel adder Σp built from 8 full adders (Σ) with a carry in of 0, fclk = 200MHz]
   00101001
 + 01000101
   01101110
If we chose not to pipeline (insert no single bit delays or flip-flops between FAs) then the adder has a carry
ripple which can limit the maximum clocking speed. (This is not necessarily a wrong thing to do and in some
cases not pipelining may be desirable, however for this sequence of examples we choose to pipeline.)
[Figure: 8 full adders (Σ) with no pipelining between them]
Note that we could also use a FA and perform the addition serially. Because we are sharing the FA then the
data processing rate is reduced by a factor of 8, i.e. fdata = 200 / 8 = 25MHz, 25,000,000 adds/s.
[Figure: bit serial 8 bit adder Σs, a single full adder with a delay feeding the carry back. Operands 10010100
and 10100010 and result 01110110 are shown LSB first; fclk = 200MHz, fdata = 25MHz]
Note that to extend to, for example, a 14 bit adder just add 6 more stages for the parallel adder, or clock the
serial adder for another 6 cycles; however the serial data rate then reduces to 200/14 ≈ 14.3 million adds/s.
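The bit serial adder can be modelled as one full adder reused on every clock cycle, with the carry held in a one bit delay between cycles. This is a behavioural sketch only, with hypothetical function names.

```python
def bit_serial_add(a: int, b: int, n_bits: int = 8) -> tuple[int, int]:
    """Add two n_bits numbers one bit per clock cycle, LSB first.

    Returns (n_bits sum, final carry out).
    """
    carry = 0  # contents of the single bit delay (flip-flop)
    result = 0
    for cycle in range(n_bits):
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << cycle
    return result, carry


def serial_data_rate_mhz(f_clk_mhz: float, n_bits: int) -> float:
    """Data rate when one full adder is shared over n_bits clock cycles."""
    return f_clk_mhz / n_bits


total, carry_out = bit_serial_add(0b00101001, 0b01000101)
print(format(total, "08b"), carry_out)           # 01101110 0
print(serial_data_rate_mhz(200, 8))              # 25.0
print(round(serial_data_rate_mhz(200, 14), 2))   # 14.29
```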
FPGA - 8 bit Multiplier 1.15
• With just a few additional logic gates, we can produce an 8 bit parallel
multiplier:
[Figure: parallel multiplier Mp, an array of 8 rows of 8 full adders (FAx, the full adder for the multiply
array), with one multiplier bit applied to each row; fdata = 200MHz]
Alternatively we could reduce the hardware costs and use one parallel adder and feedback partial products to
produce a serial multiplier. The logic in this circuit could still be clocked at 200MHz, but one multiply would
take 8 cycles and hence the data rate is only 200/8 = 25MHz.
[Figure: serial multiplier Ms, a single row of 8 full adders with partial product feedback; fdata = 25MHz]
The concept of a constant area-speed product is evident from a comparison of the parallel and serial multipliers
(and also the parallel and serial adder example above). Generally speaking if we reduce the silicon area/
resources required by a factor of N, then the computation time increases by a factor of N, as the resources
must be shared by N different sub-computations in a time sequential manner.
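A behavioural sketch of the serial multiplier: one adder is reused once per multiplier bit, so an 8 bit multiply takes 8 cycles. The function name and the cycle counter are illustrative assumptions, not a description of any particular Xilinx core.

```python
def serial_multiply(a: int, b: int, n_bits: int = 8) -> tuple[int, int]:
    """Shift-and-add multiply using one adder per clock cycle.

    Returns (product, number of clock cycles used).
    """
    product = 0
    cycles = 0
    for i in range(n_bits):
        if (b >> i) & 1:
            product += a << i  # add the shifted partial product
        cycles += 1            # one adder pass per multiplier bit
    return product, cycles


print(serial_multiply(214, 29))  # (6206, 8)
```

The parallel multiplier performs the same n_bits additions in space rather than time, which is the area-speed trade described above.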
FPGA FIR Filter 1.16
[Figure: 7 weight FIR filter with weights w0 to w6. The parallel version FIRp uses 7 parallel multipliers Mp and
6 parallel adders Σp, fdata = 200MHz. The serial version FIRs uses 7 serial multipliers Ms and 6 serial adders
Σs, fdata = 25MHz]
Or alternatively we could share a single parallel multiplier and a single parallel adder - approximately the same
cost of the circuit above - but this time only have a data sampling rate of 200/8 = 25MHz.
[Figure: FIRs built from one shared parallel multiplier Mp and one parallel adder Σp, fdata = 25MHz]
Sharing a single parallel multiplier is of course similar to the concept of a DSP processor!
FIR Filter Banks 1.17
[Figure: filter bank magnitude response against frequency]
• We can set up 4 parallel FIR filters with each one running at 200MHz
sampling rate
[Figure: 4 parallel FIR filters, fdata = 200MHz]
Typically a state of the art DSP processor could implement around 500 million multiply-adds per second. Four
7 weight filters at 200MHz need 4 × 7 × 200 million = 5600 million multiply-adds per second, so around 12 such
processors are required to sustain this rate! Of course the DSP processor has other capabilities and
flexibilities that the FPGA does not have; however, for this specific requirement a DSP processor is a poor
solution compared to the FPGA solution.
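The processor count can be checked with a short calculation. The sketch assumes four filters of 7 weights each at a 200MHz sample rate, as in the filter bank above, and uses hypothetical function names.

```python
import math


def required_macs_per_second(n_filters: int, n_weights: int, f_data_hz: int) -> int:
    """Total multiply-adds per second for a bank of FIR filters."""
    return n_filters * n_weights * f_data_hz


def processors_needed(required_macs: int, macs_per_processor: int) -> int:
    """Whole number of processors needed to sustain the required MAC rate."""
    return math.ceil(required_macs / macs_per_processor)


required = required_macs_per_second(4, 7, 200_000_000)
print(required)                                   # 5600000000
print(processors_needed(required, 500_000_000))   # 12
```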
Once again, if our requirement was different and only a data sampling rate of 25MHz was required then we
could design using serial FIR filters with a total of 1/8 of the hardware cost (remember the individual logic
elements are still clocked at 200MHz):
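[Figure: 4 serial FIR filters, fdata = 25MHz]
Or we could share one fully parallel FIR filter and multiplex the four channels:
[Figure: one parallel FIR filter FIRp multiplexed across the four channels, fdata = 25MHz]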
Optimising the Design 1.18
• For this 4 channel FIR filter design we could (gu)esstimate “a” cost as:
• Note if we used a fixed point DSP processor, then any “savings” related
to wordlengths are irrelevant - the ALU works with 16 bits resolution
and only using 5 bits of this gives no benefit.
[Figure: an 8 by 8 bit array multiplier needs 8 rows of 8 full adders (64 full adders); with a 5 bit
coefficient only 5 rows are needed (40 full adders)]
Worked example, 214 × 29 = 6206, with an 8 bit data value and a 5 bit coefficient:
        11010110   (214)
   x       11101   (29)
        11010110
       00000000
      11010110
     11010110
    11010110
   1100000111110   (6206)
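The full adder counts and the 214 × 29 example can be reproduced with a small sketch. It assumes the simple cost model used on the slide, one row of full adders per coefficient bit; the function names are hypothetical.

```python
def array_multiplier_full_adders(data_bits: int, coeff_bits: int) -> int:
    """Full adder count for an array multiplier: one row of data_bits per coefficient bit."""
    return data_bits * coeff_bits


def partial_products(a: int, b: int, b_bits: int) -> list[int]:
    """Shifted partial products of a * b, one per bit of b (LSB first)."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(b_bits)]


print(array_multiplier_full_adders(8, 8))  # 64
print(array_multiplier_full_adders(8, 5))  # 40

rows = partial_products(0b11010110, 0b11101, 5)
print(format(sum(rows), "b"))  # 1100000111110
```

Shortening the coefficient wordlength from 8 to 5 bits removes three rows of the array, a saving that a fixed wordlength processor cannot exploit.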
Using Some DSP Knowledge.....! 1.19
• For this 4 channel application we note from a DSP point of view that we
are in fact performing subband filtering with 4 channel filters:
[Figure: gain (dB) against frequency for the 4 channel filters 0, 1, 2 and 3, covering 0 to 100MHz in 25MHz
bands; fs = 200MHz]
• For this type of subband filtering structure we know that the channel 0
and channel 3 filters can have the same magnitude for the filter
weights/coefficients (similarly for channel 1 and channel 2):
• If, for example, the 7 weights of the channel 0 low pass filter have the magnitudes
11 64 31 127 31 64 11
then the weights of the channel 3 high pass filter are obtained by negating every 2nd weight.
[Figure: low pass and high pass channel filters with output z(k), sharing the weight multiplies]
Efficient reordering of computation...
....and we can now implement two filters for little more than the price of one! (Note the filter here is very
“simple”; a real system is more likely to have 10s or 100s of weights.)
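One way to see the saving: if the high pass weights are the low pass weights with every 2nd weight negated, both outputs can be formed from the same 7 products. The sketch below is an illustration under that assumption, not the slide's exact circuit, and the function name is hypothetical.

```python
WEIGHTS = [11, 64, 31, 127, 31, 64, 11]  # weight magnitudes from the example above


def shared_lowpass_highpass(x_history: list[int], weights: list[int] = WEIGHTS) -> tuple[int, int]:
    """Return (low pass, high pass) outputs using a single set of multiplies.

    x_history[n] holds x(k-n). The high pass filter is assumed to use the
    same weights with every 2nd weight negated.
    """
    products = [w * x for w, x in zip(weights, x_history)]  # 7 multiplies, shared
    even_sum = sum(products[0::2])
    odd_sum = sum(products[1::2])
    low_pass = even_sum + odd_sum   # all weights positive
    high_pass = even_sum - odd_sum  # odd numbered weights negated
    return low_pass, high_pass


low, high = shared_lowpass_highpass([3, -1, 4, 1, -5, 9, 2])
print(low, high)
```

The second filter costs only one extra subtraction, rather than another 7 multipliers.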
Downsampling Hardware Savings 1.20
• In DSP you only calculate what you need, and therefore we only
calculate every 4th sample; computation therefore reduces by a factor
of 4 again
• Based on this saving, and the observation above we can now re-
engineer the DSP design to be:
With some DSP knowledge we have reduced the cost of the design to 1/8th of the original overall hardware
cost. Some further careful DSP computation reordering would bring this down by another factor of
2!
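The downsampling saving comes from never computing the outputs that would be thrown away. A minimal sketch, with a hypothetical function name and hypothetical example data:

```python
def fir_decimate(weights: list[int], samples: list[int], factor: int = 4) -> list[int]:
    """FIR filter that only computes every factor-th output sample."""
    outputs = []
    for k in range(0, len(samples), factor):  # skip outputs that would be discarded
        acc = 0
        for n, w in enumerate(weights):
            if k - n >= 0:  # samples before the start are taken as zero
                acc += w * samples[k - n]
        outputs.append(acc)
    return outputs


weights = [1, 2, 3, 2, 1]          # hypothetical weights
samples = [1, 0, 2, -1, 3, 4, 0, 5, -2]
print(fir_decimate(weights, samples))
```

The results match filtering at the full rate and then keeping every 4th output, but with a quarter of the multiply-adds.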
Simplifying the Multiplies 1.21
We need to understand DSP first and foremost. Get that right, then the next stage is to ensure that your
DSP algorithms are mapped as efficiently as possible to the target FPGA. Software tools will undoubtedly help
here, but there is always scope for some final stages of hand-crafting to yield the lowest cost implementation.
DSP FPGA Design Software 1.22
• Industry Trends
• HDL synthesis
• IP Core libraries
• Simulink library of arithmetic, logic operators and DSP functions (Xilinx Blockset)
• Arithmetic abstraction
• As part of this course we will design from System Generator and then
implement to FPGA device, ensuring that we test and verify on the way!
[Figure: design flow from MATLAB/Simulink and HDL through System Generator to system verification]
• Being able to include new or legacy modules is essential for many DSP
system designers
[Figure callout: the Configuration Wizard detects VHDL files and customizes the block]
Hardware in the Loop Simulation 1.26
A free running clock is provided to the design, thus the hardware is no longer running in lockstep with the
software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and
display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of
magnitude performance enhancement over original software simulation.
If a system has feedback present then hardware in the loop can be more difficult to set up as the feedback must
be done on a sample by sample basis, rather than with blocks of data.
System Debug Facility 1.27
System Generator also provides a resource estimator so that for a large design you can get a quick “estimate”
of the hardware cost before going through the full ISE design flow.
Most of the blocks in the System Generator Blockset carry resource information for:
• LUTs
• FFs
• BRAM
• Embedded multipliers
• 3-state buffers
• I/Os
System Generator for DSP Design 1.28
Advantages:
• Portability
• Complete control of the design implementation and tradeoffs
• Easier to debug and understand a code that you own
Disadvantages:
• Can be time-consuming
• Don’t always have control over the Synthesis tool
• Need to be familiar with the algorithm and how to write it
• Must be conversant with the synthesis tools to obtain optimized design
• Know how to use Simulink and System Generator for FPGA DSP
algorithm design.
• Know how to use Xilinx ISE tools to implement System Generator DSP
designs on FPGAs.