
The DSP Primer 1

Introduction


August 2005, University of Strathclyde, Scotland, UK For Academic Use Only


Introduction: DSP and FPGAs 1.1

• In the last 20 years the majority of DSP applications have been enabled
by DSP processors.

• More recently a number of DSP cores have become available.

• ASICs (application specific integrated circuits) have been widely used
for specific (high volume) DSP applications.

• But the most recent technology platform for high speed DSP applications
is the Field Programmable Gate Array (FPGA).

This primer course is all about DSP with Xilinx FPGAs!

... and how to do it!



Notes:
DSP is all about multiplies and accumulates/adds (MACs). As we progress through the course, we will see that
most algorithms used in different applications employ digital filters, adaptive filters, Fourier transforms and so
on. These algorithms all require multiplies and adds (note that a divide or square root is quite a rare thing in
DSP).

Hence a DSP algorithm or problem is often specified in terms of its MAC requirements. In particular, when
comparing two algorithms that both perform the same job, if one requires fewer MACs than the other, then clearly
the "cheaper" one would seem to be the better choice. However, this comparison hides an assumption: that all
MACs cost the same - surely a multiply is a multiply! Well, yes, in the traditional DSP processor situation we are
likely to be using, say, a 16 bit device which processes 16 bit inputs with 16 bit digital filter coefficients, and so
on. With FPGAs this constraint is removed - we can use as many, or as few, bits as are required. Therefore we
can choose to optimise and schedule DSP algorithms in a completely different way.
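As a rough sketch of this trade-off (the numbers, names and cost model below are invented purely for
illustration and are not from the course notes), two algorithm variants can be compared in C by weighting
their MAC counts by the wordlengths involved:

#include <stdio.h>

/* Hypothetical comparison of two algorithm variants by their MAC "cost".
 * On a fixed 16 bit DSP processor every MAC uses the full 16x16 multiplier;
 * on an FPGA the hardware cost per MAC scales roughly with data_bits x coef_bits. */
struct algo {
    const char *name;
    long macs_per_sample;
    int data_bits;
    int coef_bits;
};

static long fpga_cost(struct algo a) { return a.macs_per_sample * a.data_bits * a.coef_bits; }
static long dsp_cost(struct algo a)  { return a.macs_per_sample * 16 * 16; }

int main(void)
{
    /* Illustrative numbers only: variant B needs more MACs but narrower words. */
    struct algo a = { "variant A", 64, 12, 12 };
    struct algo b = { "variant B", 80,  8,  5 };

    printf("%-10s FPGA cost %6ld, processor cost %6ld\n", a.name, fpga_cost(a), dsp_cost(a));
    printf("%-10s FPGA cost %6ld, processor cost %6ld\n", b.name, fpga_cost(b), dsp_cost(b));
    return 0;
}

On the fixed-wordlength processor the variant with more MACs always loses; on the FPGA the narrower
wordlengths can make it the cheaper choice.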
[Figure: a typical DSP circuit board - a DSP processor connected over a general purpose input/output bus to a
DAC and an ADC, with amplifiers/filters providing the analogue voltage output and voltage input.]


The FPGA DSP Evolution 1.2

• Since around 1998 the evolution of FPGAs into the DSP market has been
sustained by classic technology progress such as the ever present Moore's law.

• Late 1990s - FPGAs allow multipliers to be implemented in the FPGA logic
fabric. A few multipliers per device are possible.

• Early 2000s - FPGAs place hardwired multipliers onto the device with
clocking speeds of > 100MHz. The number of multipliers ranges from 4 to
more than 500.

• Mid 2000s - FPGAs place DSP algorithm signal flow graphs (SFGs) onto
devices. Full (pipelined) FIR filter SFGs, for example, are available
(DSP48 slice).

• Late 2000s - who knows! Probably more DSP power, more arithmetic
capability (fast square root, divide), perhaps block floating point. But
rest assured there is more coming....



Notes:
Technology just keeps moving.

Anyone who has purchased a new laptop knows the feeling. If you just wait, then next quarter you can get the
new model with integrated WiFi or WiMax and a faster processor. Of course, wait another quarter and in 6
months it will be improved again - and the new faster, better, bigger machine is likely to be cheaper too! Such
is technology.

DSP for FPGAs is just the same. If you wait another year it is likely the next Xilinx device will feature more
pre-packaged algorithms for precisely what you want to do. And they will be easier to work with - higher level
design tools, design wizards and so on.

So if you are planning to design a QR adaptive equalizing beamformer for MIMO implementation of a software
radio for 802.16 - then if you wait, it will probably be a free download in a few years. But of course who can wait?

Therefore in this course, we discuss and review the fundamental strategies of designing DSP for FPGAs. Like
all technologies you still need to know how it works if you really want to use it.
FPGAs: A “Box” of DSP blocks 1.3

• We might be tempted to view the latest Xilinx FPGAs as repositories of
DSP components just waiting to be connected together.

• In the days of circuit boards one had to be careful about running busses
close together, lengths of wires, etc. Similar considerations are required
for FPGAs, and are dealt with by synthesis and other tools.

• However, the high level concept is: take the blocks and build it:

[Figure: the FPGA as a "box" of DSP building blocks - clocks, input/output, registers and memory, logic,
arithmetic and "connectors" - brought together by a design / verify / place and route flow.]



Notes:
This is undoubtedly the modern concept of FPGA design. Take the blocks, connect them together and the
algorithm is in place.

Do we actually need an FPGA/IC engineer then?

Do we actually need a DSP engineer?

Yes in both cases, but a toolset such as Simulink, System Generator and the ISE tools makes the design flow
very accessible and we will find both FPGA and DSP engineers designing advanced DSP systems.

There is lots to worry about. In terms of the DSP design: is the arithmetic correct (i.e. overflows, underflows,
saturation, etc.)? Do the latencies or delays used allow the numerical integrity to be maintained? However, the
tools will give us lots of support for this.

For the FPGA: can we clock at a high enough rate? Does the design place and route? What device do we need,
and how efficient is the implementation (just as with compilers)?

As higher level components become available (the Xilinx DSP48 slice, for example, allows a complete FIR to be
implemented), issues such as overflow, numerical integrity and so on are increasingly taken care of.
Binary Addition and Multiply 1.4

• The bottom line for DSP is multiplies and adds - and lots of them!

• Adding two N bit numbers will produce up to an (N+1) bit number:

      N bits + N bits -> (N+1) bits

• Multiplying two N bit numbers can produce up to a 2N bit number:

      N bits x N bits -> 2N bits

• So with a MAC (multiply and accumulate/add) of two N bit numbers we
could, in the worst case, end up with a (2N+1) bit wordlength.



Notes:
If the wordlength grows beyond the maximum value you can store, we clearly have a situation of numerical
overflow, which is a non-linear operation and not desirable.

Within traditional DSP processors this wordlength growth is well known and catered for.

For a typical DSP filtering type operation we may need to take, say, an array of 24 bit numbers and multiply it
by another array of 24 bit numbers. The result of each multiply will be a 48 bit number. If we then add two
48 bit numbers together, and they both happen to be large positive values, then the result could be a 49 bit
number. If we add many 48 bit numbers together (and they all happen to be large positive values), then the
final result may have a word growth of quite a few bits.
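A small C sketch of this growth (illustrative only, not from the original notes): each product of two n-bit
numbers needs up to 2n bits, and summing several such products adds roughly ceil(log2(number of terms))
further bits in the worst case:

#include <math.h>
#include <stdio.h>

/* Worst-case accumulator width when summing 'terms' products of two n-bit numbers:
 * each product needs up to 2n bits, and the sum can grow by ceil(log2(terms)) bits. */
static int acc_bits(int n, int terms)
{
    return 2 * n + (int)ceil(log2((double)terms));
}

int main(void)
{
    printf("24 x 24 bit product               : %d bits\n", acc_bits(24, 1));  /* 48 */
    printf("sum of two such products          : %d bits\n", acc_bits(24, 2));  /* 49 */
    printf("sum of 64 such products (64 taps) : %d bits\n", acc_bits(24, 64)); /* 54 */
    return 0;
}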
The “Cost” of Addition 1.5

• A 4 bit addition can be performed using a simple ripple adder:

[Figure: a 4 bit ripple adder - four full adders in a chain with carry in '0' at the LSB, inputs A3..A0 and B3..B0,
carries C0..C3 rippling between stages, and sum outputs S4..S0 (S4 is the carry out / MSB).]

• Therefore an N bit addition can be performed in parallel at a cost of N
full adders.



Notes:
The simple Full Adder (FA):

Adds two bits plus one carry-in bit, to produce a sum and a carry-out:

    Sout = A'B'Cin + A'BCin' + AB'Cin' + ABCin = A ⊕ B ⊕ Cin
    Cout = A'BCin + AB'Cin + ABCin' + ABCin = AB + ACin + BCin

    A  B  Cin | Cout Sout
    0  0  0   |  0    0
    0  0  1   |  0    1
    0  1  0   |  0    1
    0  1  1   |  1    0
    1  0  0   |  0    1
    1  0  1   |  1    0
    1  1  0   |  1    0
    1  1  1   |  1    1

Worked example:     1011   (+11)
                  + 1101   (+13)
                  --------
                   11000   (+24)
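The full adder and the ripple-carry structure above can be modelled in a few lines of C (an illustrative sketch,
not FPGA code or material from the original notes):

#include <stdio.h>

/* Bit-level model of the full adder table above: sum = A xor B xor Cin,
 * carry out = majority(A, B, Cin). */
static void full_adder(int a, int b, int cin, int *sum, int *cout)
{
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (a & cin) | (b & cin);
}

/* N-bit ripple-carry adder built from N full adders, carry in = 0. */
static unsigned ripple_add(unsigned a, unsigned b, int nbits)
{
    unsigned result = 0;
    int carry = 0;
    for (int i = 0; i < nbits; i++) {
        int s;
        full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        result |= (unsigned)s << i;
    }
    result |= (unsigned)carry << nbits;   /* the (N+1)th sum bit */
    return result;
}

int main(void)
{
    /* 1011 (+11) + 1101 (+13) = 11000 (+24), as in the worked example above */
    printf("%u\n", ripple_add(0xB, 0xD, 4));   /* prints 24 */
    return 0;
}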
The “Cost” of Multiply 1.6

• A 4 bit multiply operation requires an array of 16 multiply/add cells:


[Figure: a 4 bit x 4 bit multiplier array - 16 multiply/add cells arranged in four rows, one row per bit of b; each
bit of b gates a shifted copy of a3..a0 into its row, and the partial products are summed to give the product bits
p7..p0.]

• Therefore an N by N multiply requires N² cells......

......so for example a 16 bit multiply is nominally 4 times more expensive
to perform than an 8 bit multiply.
Notes:
Each cell is composed of a Full Adder (FA) and an AND gate, plus some broadcast wires:

    z    = a.b                                       (the AND gate)
    sout = (s ⊕ z) ⊕ c
    cout = s'.z.c + s.z'.c + s.z.c' + s.z.c  =  s.z + s.c + z.c
    aout = a
    bout = b

Worked example:     1011   (11)
                  x 1001   (x9)
                  --------
                    1011   partial product
                   0000
                  0000
                 1011
                 --------
                 1100011   (99)

(Regardless of how the multiplier is implemented there is a cost associated, and the more bits, the higher
the cost; e.g. if done using memory then more bits require more memory.)

An 8 bit by 8 bit multiplier would require 8 x 8 = 64 cells.
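The behaviour of the array can be sketched in C as a shift-and-add multiplication, one partial product per bit
of the multiplier (purely an illustrative model; the hardware array forms all the rows as physical cells rather
than in a loop):

#include <stdio.h>

/* Shift-and-add model of the multiplier array: one row of AND gates forms a
 * partial product for each bit of b, and a row of full adders accumulates it,
 * so an N x N multiply uses roughly N*N cells. */
static unsigned multiply(unsigned a, unsigned b, int nbits)
{
    unsigned product = 0;
    for (int i = 0; i < nbits; i++) {
        unsigned partial = ((b >> i) & 1) ? a : 0;  /* AND gates form the partial product */
        product += partial << i;                    /* row of full adders accumulates it  */
    }
    return product;
}

int main(void)
{
    printf("%u\n", multiply(11, 9, 4));     /* 1011 x 1001 = 1100011 = 99 */
    printf("%u\n", multiply(214, 45, 8));   /* 11010110 x 00101101 = 9630 */
    return 0;
}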


“Traditional” DSP Processors 1.7

• Consider a traditional 24 bit DSP processor:


[Figure: block diagram of a traditional 24 bit DSP processor - X, Y and P memories (sizes of 24k, 24k and 16k
shown), data and address registers, an instruction decoder, an arithmetic logic unit and a "parallel"
multiplier, all on a 24 bit datapath.]

• The DSP programmer only needs knowledge of a few parts of the actual
processor.
Notes:
This particular example device is "24 bit". This simply means that ALL arithmetic operations are performed on
24 bit data. Therefore if two 24 bit numbers are added, then the result is up to a 25 bit number. If two 24 bit
numbers are multiplied, then the result is up to 48 bits.

So what if you only require 8 bit data and 8 bit operators? Fine - this 24 bit processor can still be used; however,
the 8 bit numbers will be treated as 24 bit numbers, meaning that 16 bits are essentially not used. Clearly it is
not very efficient to use the 24 bit device for an 8 bit problem.

Of course if you know you have an "8 bit" problem, then at the design outset you should aim to use an 8 bit
processor or microcontroller. But what if you have a 17 bit problem? Using the 24 bit device means that we still
have 7 "wasted" bits of resolution.

If you have an application where different places in the calculation require different wordlengths, then the
DSP processor you choose must be capable of working with the longest wordlength that occurs. This simply
means that when processing the shorter wordlength data, the device is NOT being used efficiently.

Note that the general speed-area complexity of an N by N bit multiply is one quarter of the general speed-area
complexity of a 2N by 2N multiplier. Therefore if using a 16 bit processor to solve an 8 bit problem, the
arithmetic processing silicon is being used at 25% MAC efficiency. If using a 24 bit processor for an 8 bit
problem, then we are at around 11% (1/9th) MAC efficiency.

So here is one of the differences between "traditional" and FPGA-enabled DSP. With FPGAs you can choose your
wordlength. Not only are you not constrained to the traditional 8, 16 or 32 bits, you can choose whatever is
required, whether it be 5, 9 or 21! Moreover you have the flexibility to change. At one part of your FPGA you
might be using 7 bits of resolution and at another you might be using 19 bits. You are in charge! So you'd better
know what is going on!
Using a DSP Processor 1.8

• Using the above processor we are constrained to using 24 bit arithmetic.

• Working with a smaller wordlength means that efficiency is low.

• This DSP processor is single-cycle-MAC,

i.e. one multiply-accumulate per clock cycle.

• If the clock speed is 200MHz, then a peak rate of 200 million MACs per
second is achievable.

• Certain concurrent operations can be performed: parallel data fetch -

data is concurrently read from the X and Y memories during a MAC operation.

• The ALU (arithmetic logic unit) provides other facilities such as the
arithmetic operators divide and square root - these are NOT single cycle.



Notes:
The peak MAC rate is of course unlikely to be achievable as the processor will also be performing other related
processing such as branching, fetching data, and so on. However the architecture of DSP processors has been
so optimised in the last few years, that with very careful programming and using all of the concurrent aspects
of the processor we might expect to get very close to the peak MAC rate.

The concurrency inside any DSP processor is of course best suited to DSP algorithms. Many DSP algorithms
require that data is read from memory and multiplied with filter weights or coefficients also stored in memory.
Therefore if we can read data from two memories (i.e. the distinct X and Y memories for this processor) at the
same time as doing a MAC, then only one clock cycle is required for fetch and multiply and store.

Later in the course we will return to the DSP requirements of divide and square root. Both operations are
in fact quite rare in DSP; however, we do note that they are found in some communications applications, such
as the QR algorithm for beamforming or equalisation, and in communications where we often "rotate" a
constellation (requiring cosine and sine values, in the form x / sqrt(x² + y²)).
Assembly Language Code 1.9

• A digital FIR filter requires two lines of code to specify the computation
required (we also need to read in data samples and output results):

           N-1
    y(k) =  Σ  wn x(k - n)        (a 5 tap filter in this example, N = 5)
           n=0

    REP 5
    MAC X0,Y0,A X(R0)+,X0 Y(R4)+,Y0

[Figure: signal flow graph of the FIR filter, with input x(k) and output y(k).]



Notes:
The C code for this filter is also relatively straightforward:

/* declare variables */
int weights[5], x_data[5], i;        /* coefficient and data arrays, loop index */
long Y_k, X_k_input;                 /* output and input variables */

/* for every input sample, repeat: */
for (i = 4; i > 0; i--)
    { x_data[i] = x_data[i - 1]; }   /* shift data, starting from the oldest sample */
x_data[0] = (int) X_k_input;         /* get new input data */

Y_k = 0;                             /* initialise accumulator to zero */

for (i = 0; i < 5; i++)              /* FIR MACs */
    { Y_k += (long) weights[i] * x_data[i]; }

As part of the design flow, we could then cross-compile the C code to assembler.
The Gate Array (GA) 1.10

• Early gate arrays were simply arrays of NAND gates.

• Designs were produced by interconnecting the gates to form
combinational and sequential functions.



Notes:
The NAND gate is often called the universal gate, meaning that it can be used to produce any Boolean logic
function.

The early gate array design flow was: design, simulate/verify, device production and test.

From GA to FPGA

Simple gate arrays, although very generic, were used by many different users for similar systems.....

....for example to implement two-level logic functions, flip-flops and registers, and perhaps addition and
subtraction functions.

For a GA, once the layer(s) of metal had been laid down on a device - that's it! No changes, no updates, no fixes.

So then we move to field programmable gate arrays. There are two key differences between these and gate arrays:

• They can be reprogrammed in the "field", i.e. the logic specified is changeable.

• They are no longer composed just of NAND gates, but of a carefully balanced selection of multi-input logic,
flip-flops, multiplexers and memory.
Generic FPGA Architecture 1.11

[Figure: generic FPGA architecture - a regular array of logic blocks connected by row and column
interconnects, surrounded by I/O blocks around the periphery of the device.]



Notes:
The logic block in this “sketch” FPGA contains a few logic elements. Our terminology will specifically be that
of CLBs (configurable logic blocks) being composed of slices.

If you were designing a slice from first principles you might find the components below would give a very
versatile set of logic with which to build large arithmetic and DSP systems:

[Figure: a logic slice - a look-up table (LUT), cascade/carry logic, select MUX logic and two flip-flops, with
interconnects to the rest of the device.]

Early uses of FPGAs were for the implementation of general sequential and combinational digital logic functions
(counters, MUXs, etc.). More recently FPGAs have been used for DSP. Specifically, we find FPGAs used in the
front end of many digital or software radio systems where billions of MACs per second are required.
Using FPGA - Rethinking DSP Design 1.12

• Think very fast - current data clocking rates of 200MHz and more are
achievable now.

• Think minimum data bit-widths - FPGA data words need only be as wide
as is necessary for the algorithm/application.

• Think DSP "tricks" - we will be using some "simple" (i.e. low FPGA cost)
filters: CIC, difference filters, moving average.

• Think oversampling strategies - using sigma delta techniques to
produce simple multiplier-free digital filters.

• Think undersampling strategies - for communications we can make use
of high sampling rates and digital filters for digital downconversion.

• Think algorithms with square root and divide operations, which are
traditionally avoided for conventional DSP processors.

• Think differently - it's a new design challenge.



Notes:
Very fast
With the latest FPGAs we can find many hundreds of multipliers (18 bit, for example, on the Xilinx Virtex family)
which can operate at 100's of MHz, giving MAC rates of the order of 10's of billions of MACs per second.

Minimum data bit-widths

Frequently DSP algorithms are designed and tested with floating point arithmetic, and then the final version
is implemented in 16 bits - the typical wordlength of many DSP processors and cores. With FPGAs we have more
options for the wordlength - it can be any value: 13 bits, 11 bits. In fact we can even mix wordlengths (e.g. 8
bit data, 12 bit filter coefficients). The smaller the wordlength, the lower the hardware cost. Therefore at all
stages we want to use the fewest bits possible. Hence the design stage should now include a careful fixed
point analysis where we design for the lowest cost implementation.

DSP “tricks”
If we can constrain certain multiplications etc to be 1, 0 or -1, then the cost is negligible compared to a full N by
N bit multiply. Similarly where possible, if we constrain multiplier values in filters etc to be powers of 2 (2, 4, 8
etc) then we have multiplies that can be implemented as shifts.

Oversampling Strategies
We can trade off sampling rate and wordlength. Oversampling by 4 gives one more bit of resolution over the
original Nyquist band. For example, a signal sampled at 100kHz with 10 bits of resolution has the same
baseband resolution (from 0 to 50kHz) as one sampled at 400kHz with 9 bits, or at 1600kHz with 8 bits.
Therefore, if appropriate, we can decrease the wordlength at the expense of an increased sampling rate.

Undersampling Strategies
We can perform direct digital downconversion using undersampling techniques.

Square root and divide operations
Square roots and divides have long been avoided in DSP algorithms - this will change!
DSP Implementation with FPGAs 1.13

• The power of FPGAs for DSP is primarily in their low level simplicity,
on which to build high level complexity.

• We can demonstrate some of the design options by building a digital filter
from first principles using just full adders (FAs):

    Sout = A'B'Cin + A'BCin' + AB'Cin' + ABCin = A ⊕ B ⊕ Cin
    Cout = A'BCin + AB'Cin + ABCin' + ABCin = AB + ACin + BCin

    A  B  Cin | Cout Sout
    0  0  0   |  0    0
    0  0  1   |  0    1
    0  1  0   |  0    1
    0  1  1   |  1    0
    1  0  0   |  0    1
    1  0  1   |  1    0
    1  1  0   |  1    0
    1  1  1   |  1    1

    fclk = 200MHz

• With a typical FPGA logic block we can produce one or more FAs
(either from available logic or via look-up table approaches).



Notes:
In a typical FPGA the FA circuit can be clocked at a very high rate; for argument's sake we will specify fclk =
200MHz.

In the following sequence of high level designs we want to demonstrate how this simple FA can be used to
produce a powerful DSP digital filter, also potentially running at 200 Msamples/second. The slides therefore
do not present exactly how a custom DSP digital filter would be produced. However, the design will allow us to
demonstrate the difference between data rates and logic clock rates, and strategies for reducing costs by
efficient design.

The design techniques associated with implementing multipliers, adders and so on are probably well known
to ASIC engineers, but probably not well known to DSP engineers.
FPGA - 8 Bit Parallel Adder 1.14

• 8 FAs and some flip-flops produce an 8 bit full adder (pipelined):

[Figure: an 8 bit pipelined parallel adder - eight full adders with pipeline flip-flops, carry in '0', clocked at
fclk = 200MHz. Worked example: 00101001 + 01000101 = 01101110 (41 + 69 = 110).]

• Data can be clocked into this circuit at a rate of 200MHz

i.e. 200,000,000 8-bit additions per second.



Notes:
If we choose to pipeline the 8 bit parallel adder then we can reliably clock it at fclk = 200MHz, meaning
200,000,000 adds/s.

If we choose not to pipeline (i.e. insert no single bit delays or flip-flops between the FAs), then the adder has a
carry ripple which can limit the maximum clocking speed. (This is not necessarily a wrong thing to do, and in
some cases not pipelining may be desirable; however, for this sequence of examples we choose to pipeline.)

Note that we could also use a single FA and perform the addition serially. Because we are sharing the FA, the
data processing rate is reduced by a factor of 8, i.e. fdata = 200 / 8 = 25MHz, or 25,000,000 adds/s.

[Figure: a bit serial 8 bit adder - a single full adder with a one bit carry delay, operands entering LSB first at
fdata = 25MHz while the logic is clocked at fclk = 200MHz.]

Note that to extend to, for example, a 14 bit addition we would add 6 more stages to the parallel adder, or clock
the serial adder for another 6 cycles; for the serial adder the data rate then reduces to 200/14 = 14.3 million
adds/s.
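The bit serial adder can be modelled in C as follows (an illustrative sketch, not from the original notes): a
single full adder is reused for one clock cycle per bit, with the carry held in a one bit register, which is why
the data rate falls to fclk/8 for 8 bit operands:

#include <stdio.h>

/* Bit-serial addition model: one full adder reused for N cycles, LSB first,
 * with the carry held between cycles.  Area ~1/N of the parallel adder,
 * throughput reduced by the same factor. */
static unsigned serial_add(unsigned a, unsigned b, int nbits)
{
    unsigned sum = 0;
    int carry = 0;                              /* the carry flip-flop */
    for (int cycle = 0; cycle < nbits; cycle++) {
        int abit = (a >> cycle) & 1;            /* bits arrive LSB first */
        int bbit = (b >> cycle) & 1;
        int s    = abit ^ bbit ^ carry;
        carry    = (abit & bbit) | (abit & carry) | (bbit & carry);
        sum     |= (unsigned)s << cycle;
    }
    return sum | ((unsigned)carry << nbits);    /* final carry gives bit N */
}

int main(void)
{
    /* 00101001 + 01000101 = 01101110 (41 + 69 = 110), as in the parallel example above */
    printf("%u\n", serial_add(41, 69, 8));
    return 0;
}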
FPGA - 8 bit Multiplier 1.15

• With just a few additional logic gates, we can produce an 8 bit parallel
multiplier:

[Figure: an 8 bit parallel multiplier (Mp) - an 8 x 8 array of full-adder-based multiply cells (FAx) plus some
additional logic (flip-flops, XOR gates, ...), with data clocked in at fdata = 200MHz.]

• Data in this circuit can also be pipelined and clocked in at 200MHz
(although there would be a latency),
i.e. 200,000,000 8 bit multiplies per second.
Notes:
This "example" array is simply a "mapping" of a direct 8 bit multiplication (e.g. 11010110 x 00101101 =
0010010110011110), whereby 8 partial products are created and added together. The cost of each "cell" is just
a little more than the logic cost of a full adder (FA), which we might denote simply as FAx. (Regardless of how
the multiplier is implemented there is a cost associated, and the more bits, the higher the cost; e.g. if done
using memory then more bits require more memory.)

The array has many variants (for signed numbers, carry lookahead, etc.).

Alternatively, we could reduce the hardware cost and use one parallel adder, feeding back partial products, to
produce a serial multiplier. The logic in this circuit could still be clocked at 200MHz, but one multiply would
take 8 cycles and hence the data rate is only 200/8 = 25MHz.

[Figure: a serial multiplier (Ms) - a single row of eight full adders with partial product feedback, fdata = 25MHz.]

The concept of a constant area-speed product is evident from a comparison of the parallel and serial multipliers
(and also the parallel and serial adder examples above). Generally speaking, if we reduce the silicon area/
resources required by a factor of N, then the computation time increases by N, as the resources must be shared
by N different sub-computations in a time sequential manner.
FPGA FIR Filter 1.16

• Using 7 parallel multipliers and 6 parallel adders we can produce a 7 tap
parallel FIR digital filter (FIRp):

[Figure: a parallel FIR filter (FIRp) - 7 parallel multipliers (Mp) with weights w0 to w6 and 6 parallel adders
(Σp), fdata = 200MHz.]

• Data in this FIR filter is pipelined and clocked at 200MHz,

i.e. 200,000,000 samples per second.



Notes:
If we chose to use the slower serial multipliers and adders, then the hardware cost reduces to about 1/8;
however, the data rate is only 25MHz:

[Figure: a serial FIR filter (FIRs) - 7 serial multipliers (Ms) and 6 serial adders (Σs), fdata = 25MHz.]

Alternatively, we could share a single parallel multiplier and a single parallel adder - approximately the same
cost as the circuit above - but again with a data sampling rate of only 200/8 = 25MHz.

[Figure: a serial FIR filter built by time-sharing one parallel multiplier (Mp) and one parallel adder (Σp),
fdata = 25MHz.]

Sharing a single parallel multiplier is of course similar to the concept of a DSP processor!
FIR Filter Banks 1.17

• For a particular digital communications application we require 4
channels, each of 25MHz bandwidth:

[Figure: magnitude vs frequency sketch showing four adjacent 25MHz channels.]

• We can set up 4 parallel FIR filters, each one running at a 200MHz
sampling rate:

[Figure: four parallel FIR filters (FIRp), fdata = 200MHz.]

• So the total computation rate is 4 x 7 x 200M = 5.6 billion MACs/sec!

(MAC - multiply/accumulate operation)


Notes:
5.6 billion MACs/sec is a lot of processing! With current FPGA technology this type of design is absolutely
possible. In fact we can easily go an order of magnitude higher with high specification FPGAs.

Typically a state of the art DSP processor could implement around 500 million multiply-adds per second.
Therefore around 12 of them would be required to sustain this rate! Of course the DSP processor has other
capabilities and flexibilities that the FPGA does not have; however, for this specific requirement a DSP
processor is a poor solution compared to the FPGA.

Once again, if our requirement were different and only a data sampling rate of 25MHz were required, then we
could design using serial FIR filters with a total of 1/8 of the hardware cost (remember the individual logic
elements are still clocked at 200MHz):

[Figure: four serial FIR filters (FIRs), fdata = 25MHz.]

Or we could share one fully parallel FIR filter (FIRp) and multiplex the four channels, fdata = 25MHz per channel.
Optimising the Design 1.18

• For this 4 channel FIR filter design we could (gu)esstimate "a" cost as:

(Channels) x (Filter Length) x (8 bits data) x (8 bits weight)

= 4 x 7 x 8 x 8 = 1792 FAx's + some other bits of logic / memory!!!

• If, after careful simulation investigation, we note that in fact we only
require a 5 bit wordlength for the filter coefficients, then the cost is
4 x 7 x 8 x 5 = 1120 full adders, which is 62.5% of the previous
hardware requirement!

• Note that if we used a fixed point DSP processor, then any "savings" related
to wordlengths are irrelevant - the ALU works with 16 bits of resolution
and only using 5 bits of this gives no benefit.



Notes:
The reduction to 62.5% of the cost in going from 8 bit to 5 bit coefficients can be demonstrated by noting the
simplification of the multiplier array. With an 8 bit coefficient, an 8 x 8 array of 64 full adders (FAx cells) is
required:

        11010110     (214)
    x   00011101     (x29)
    ----------------
    0001100000111110 (6206)

With a 5 bit coefficient the top three rows of partial products disappear, leaving an 8 x 5 array of 40 full
adders for exactly the same result - a reduction to 62.5% of the cost:

        11010110     (214)
    x      11101     (x29)
    ----------------
       1100000111110 (6206)
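The slide's cost estimate can be reproduced with a trivial C helper (a hypothetical cost model for illustration
only - real FPGA resource counts will differ):

#include <stdio.h>

/* Rough cost model from the slide: full-adder-type cells scale with
 * channels x taps x data_bits x coefficient_bits. */
static long fa_cells(int channels, int taps, int data_bits, int coef_bits)
{
    return (long)channels * taps * data_bits * coef_bits;
}

int main(void)
{
    long full    = fa_cells(4, 7, 8, 8);   /* 8-bit coefficients: 1792 cells */
    long trimmed = fa_cells(4, 7, 8, 5);   /* 5-bit coefficients: 1120 cells */
    printf("8-bit coeffs: %ld cells, 5-bit coeffs: %ld cells (%.1f%%)\n",
           full, trimmed, 100.0 * trimmed / full);
    return 0;
}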
Using Some DSP Knowledge.....! 1.19

• For this 4 channel application we note, from a DSP point of view, that we
are in fact performing subband filtering with 4 channel filters:

[Figure: gain/dB vs frequency - four subband filter responses (channels 0 to 3) covering 0 to 100MHz,
fs = 200MHz.]

• For this type of subband filtering structure we know that the channel 0
and channel 3 filters can have the same magnitudes for the filter
weights/coefficients (similarly for channel 1 and channel 2).

• If, for example, the 7 weights of the channel 0 low pass filter are:

{11, 64, 31, 127, 31, 64, 11}

then the weights of the channel 3 high pass filter are (negate every 2nd weight):

{11, -64, 31, -127, 31, -64, 11}



Notes:
Therefore we can share the cost of the actual multiply operations, since there are only 7 distinct multiplies to be
done for channels 0 and 3 (and similarly for channels 1 and 2): the result of the low pass channel is calculated
by adding the multiplication results, and the result of the paired high pass channel by adding/subtracting them
in a suitable order:

[Figure: two FIR signal flow graphs with weights {11, 64, 31, 127, 31, 64, 11} (low pass, output y(k)) and
{11, -64, 31, -127, 31, -64, 11} (high pass, output z(k)), followed by an efficient reordering of the computation
in which the 7 products are formed once and then combined by adders/subtractors to give both y(k) and z(k).]

....and we can now implement two filters for little more than the price of one! (Note that the filter here is very
"simple" - it is more likely to have 10's or 100's of weights in a real system.)
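The reordering can be sketched in C as follows (illustrative only; the FPGA implementation would be a parallel
structure rather than a loop): each of the 7 products is formed once and then combined by additions and
subtractions to give both channel outputs.

#include <stdio.h>

#define TAPS 7

/* Form the 7 products once; the low-pass output adds them all, the high-pass
 * output negates every 2nd product before adding. */
static void shared_filter_pair(const int x[TAPS], long *lowpass, long *highpass)
{
    static const int w[TAPS] = { 11, 64, 31, 127, 31, 64, 11 };
    long y = 0, z = 0;
    for (int n = 0; n < TAPS; n++) {
        long p = (long)w[n] * x[n];     /* one multiply, used by both channels */
        y += p;
        z += (n % 2) ? -p : p;          /* negate every 2nd product for the high-pass */
    }
    *lowpass = y;
    *highpass = z;
}

int main(void)
{
    int x[TAPS] = { 1, 2, 3, 4, 3, 2, 1 };    /* arbitrary data window */
    long y, z;
    shared_filter_pair(x, &y, &z);
    printf("low-pass %ld, high-pass %ld\n", y, z);
    return 0;
}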
Downsampling Hardware Savings 1.20

• For the channelisation being performed we also note that the
bandlimiting allows us to downsample each channel by 4:

[Figure: time domain sketch of the filtered signal before and after downsampling by 4.]

• In DSP you only calculate what you need, and therefore we only
calculate every 4th output sample, so the computation reduces by a
factor of 4.

• Based on this saving, and the observation above, we can now re-engineer
the DSP design to be:

1/4 x 1/2 = 1/8th of the full implementation cost.



Notes:
(Bear in mind we are using some simple filters to demonstrate the concepts; it is most likely that a 7 weight
filter would not have sufficient roll-off to allow downsampling without aliasing!)

With some DSP knowledge we have reduced the cost of the design by a factor of 8, i.e. to 1/8th of the overall
hardware cost. Some further careful reordering of the DSP computation would bring this down by another
factor of 2!
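A C sketch of "only calculate what you need" (illustrative only, and ignoring the filter-design issues noted
above): the FIR output is evaluated at every 4th input sample only, so the MAC rate falls by a factor of 4.

#include <stdio.h>

#define TAPS 7
#define DECIMATE 4

/* FIR filtering combined with downsampling by 4: outputs are computed only at
 * every 4th input sample, so only 1/4 of the MACs are performed. */
static void fir_decimate(const int *x, int n_in, const int w[TAPS], long *y, int *n_out)
{
    int m = 0;
    for (int k = TAPS - 1; k < n_in; k += DECIMATE) {   /* step by 4 samples */
        long acc = 0;
        for (int n = 0; n < TAPS; n++)
            acc += (long)w[n] * x[k - n];
        y[m++] = acc;
    }
    *n_out = m;
}

int main(void)
{
    int w[TAPS] = { 11, 64, 31, 127, 31, 64, 11 };
    int x[32];
    for (int i = 0; i < 32; i++) x[i] = (i % 8) - 4;    /* arbitrary test input */

    long y[32];
    int n_out;
    fir_decimate(x, 32, w, y, &n_out);
    printf("%d decimated outputs (instead of %d at the full rate)\n", n_out, 32 - TAPS + 1);
    return 0;
}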
Simplifying the Multiplies 1.21

• Then we again look at the filter weights.....

{11, 64, 31, 127, 31, 64, 11}

.....and note that multiply by 64 is a left shift of 6 places (2^6),

....and multiply by 31 is equivalent to a left shift of 5 places (2^5)
followed by a subtraction of the multiplicand,

both much cheaper than a generic multiply (see the C sketch after this list).

• And so on... Careful thought and knowledge of DSP allows us to realise
a circuit that is precisely optimised for a particular application, and often
much less costly than first thought.
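A minimal C sketch of these shift-based multiplies (illustrative only; the 127 case is added by analogy and is
not mentioned on the slide):

#include <stdio.h>

/* Shift-based constant multiplies: no general-purpose multiplier needed. */
static int times64(int x)  { return x << 6; }           /* x * 64  = x * 2^6       */
static int times31(int x)  { return (x << 5) - x; }     /* x * 31  = x * 32 - x    */
static int times127(int x) { return (x << 7) - x; }     /* x * 127 = x * 128 - x   */

int main(void)
{
    int x = 9;
    printf("%d %d %d\n", times64(x), times31(x), times127(x));  /* 576 279 1143 */
    return 0;
}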



Notes:
And this is what it is all about with DSP for FPGAs....

We need to understand the DSP first and foremost. Get that right, and then the next stage is to ensure that your
DSP algorithms are mapped as efficiently as possible to the target FPGA. Software tools will undoubtedly help
here, but there is always scope for some final hand-crafting to yield the lowest cost implementation.
DSP FPGA Design Software 1.22

• Industry Trends

• Towards high complexity platform chips (FPGAs, DSP)

• Highly flexible systems required to meet changing standards

• Multiple design methodologies - control plane/datapath

• Challenges in modeling and implementing an entire platform

• Hardware in the loop verification for complex system design

• System Design Challenges

• Leveraging legacy HDL code

• Modeling & implementing control logic and datapath

• No expert exists for all facets of system design



Notes:
In this course we will use MATLAB (release 14) from The MathWorks, or more precisely Simulink, which runs in
MATLAB, to design DSP systems using Xilinx System Generator.

System Generator is an incredibly flexible tool which has many features, including:

• Industry’s system-level design environment (IDE) for FPGAs

• Integrated design flow from Simulink to bit file

• Leverages existing technologies

• HDL synthesis

• IP Core libraries

• Integration to Xilinx ISE FPGA implementation tools

• Simulink library of arithmetic, logic operators and DSP functions (Xilinx Blockset)

• Bit and cycle true to FPGA implementation

• Arithmetic abstraction

• Arbitrary precision fixed-point, including quantization and overflow

• Simulation of double precision as well as fixed point


Xilinx System Generator 1.23

• VHDL code generation for Virtex-4™, Virtex-II-Pro™, Virtex™-II,


Virtex™-E, Virtex™, Spartan™-3, Spartan™-IIE & Spartan™-II
devices

• Hardware expansion and mapping


• Synthesizable VHDL with model hierarchy preserved
• Mixed language support for Verilog
• Automatic invocation of CORE Generator to utilize IP cores
• ISE project generation to simplify the design flow
• HDL testbench and test vector generation
• Constraint file (.xcf), simulation ‘.do’ files generation
• HDL Co-Simulation via HDL C-Simulation
• Verification acceleration using Hardware in the Loop



Notes:
The general System Generator system and features can be summarised as below:
System Generator Design Flow 1.24

• As part of this course we will design in System Generator and then
implement on an FPGA device, ensuring that we test and verify along the way!

[Figure: System Generator design flow - MATLAB/Simulink and System Generator produce HDL, which is then
synthesised, implemented and downloaded to the device, with system verification, functional simulation, timing
simulation and in-circuit verification at the corresponding stages.]



Notes:
A Simulink System Generator model allows us to visually relate the design to the actual FPGA hardware:

[Figure: a Simulink model in which I/O blocks form the interface between the Xilinx Blockset and other Simulink
blocks - Simulink sources drive SysGen blocks that are realisable in hardware.]
HDL Co-simulation 1.25

• Being able to include new or legacy modules is essential for many DSP
system designers

• HDL modules can be imported into Simulink

• “Black box” function allows designers to import HDL

• Single HDL simulator for multiple black boxes

• HDL modules can be simulated in Simulink to significantly reduce
development time

• HDL is co-simulated transparently

• HDL is simulated using the industry-standard ModelSim tool from
Mentor Graphics, directly from the Simulink framework



Notes:
Later in the course we can consider bringing in black boxes to our design.

[Figure: a Black Box block is dragged into the model; the Configuration Wizard detects the VHDL files and
customises the block.]
Hardware in the Loop Simulation 1.26

• Configure any development board for hardware-in-the-loop using the
JTAG header in <20 minutes

• Automatically create the FPGA bit-stream from Simulink

• Transparent use of FPGA implementation tools

• Accelerate & verify the Simulink design using FPGA hardware

• Mirrors traditional DSP processor design flows

• Combine with black box to simulate HDL & EDIF



Notes:
Results for hardware in the loop can lead to significant speed-ups:

Single Step Clock Mode (bit and cycle accurate)

    Application                              Software sim (s)   Hardware sim (s)   Speed-up
    Image Filtering                                676                  6            112X
    QAM Demodulator + Extension                   1203                 18             67X
    5 x 5 Image Filter                             170                  4             43X
    Cordic Arc Tangent                             187                 27              7X
    Additive White Gaussian Noise Channel          600                 80            7.5X

A free running clock is provided to the design, so the hardware is no longer running in lockstep with the
software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and
display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of
magnitude performance improvement over the original software simulation.

If a system has feedback present then hardware in the loop can be more difficult to set up, as the feedback must
be handled on a sample by sample basis, rather than with blocks of data.
System Debug Facility 1.27

• Insert a ChipScope block into the Simulink design

• Configure the FPGA using the JTAG interface

• Perform in-system debug at near system speeds



Notes:

System Generator also provides a resource estimator, so that for a large design you can get a quick "estimate"
of the hardware cost before going through the full ISE design flow.

Most of the blocks in the System Generator Blockset carry resource information for:

• LUTs

• FFs

• BRAM

• Embedded multipliers

• 3-state buffers

• I/Os
System Generator for DSP Design 1.28

• Advantages

• Huge productivity gains through high-level modeling


• Ability to simulate the complete designs at a system level
• Very attractive for DSP engineers
• Excellent capabilities for designing complex testbenches
• HDL Testbench, test vector produced automatically
• Hardware in the loop simulation improves productivity and
provides quick verification
• Disadvantages

• Cost of abstraction: not always minimum FPGA resources


• Needs external support for multiple clock designs
• No bi-directional bus supported
Notes:
System Generator is of course not the only way to design for FPGAs; one could use direct VHDL/Verilog, or the
CORE Generator.

Designing with Full VHDL/Verilog (RTL code)

Advantages:

• Portability
• Complete control of the design implementation and tradeoffs
• Easier to debug and understand a code that you own

Disadvantages:

• Can be time-consuming
• Don’t always have control over the Synthesis tool
• Need to be familiar with the algorithm and how to write it
• Must be conversant with the synthesis tools to obtain optimized design

Designing with Xilinx CORE Generator

Advantages

• Can quickly access and generate existing functions
• No need to reinvent the wheel and re-design a block if it meets specifications
• IP is optimized for the specified architecture

Disadvantages

• IP doesn't always do exactly what you are looking for
• Need to understand the signals and parameters and match them to your specification
• You are dealing with a black box and have little information on how the function is implemented
Course Objectives......To 1.29

• Understand the type of DSP algorithms & applications for FPGAs.

• Know some reduced complexity strategies for algorithms on FPGA.

• Appreciate FPGA families, architectures and capabilities.

• Know how to use Simulink and System Generator for FPGA DSP
algorithm design.

• Know how to use Xilinx ISE tools to implement System Generator DSP
designs on FPGAs.

• Understand how FPGA-DSP will complement and extend current DSP
processing techniques.


