Unit-5 DSP

The document describes the design of an FFT processor with the following key points: 1. The FFT processor will perform 1024-point FFTs or IFFTs at a throughput of over 1000 per second, with 16-bit input/output data and 21-bit internal data. 2. An initial design partitions the system into input, FFT, output, and memory processes to read data in, perform the computation, write data out, and handle data storage. 3. Further iterations aim to parallelize the design by sharing twiddle factors between two butterfly units and mapping the algorithm onto a parallel description with input, output, control, and memory transactions as separate processes.


We will not discuss the specification of the FFT processor in detail. Here
we only assume that the speed and the required accuracy of the FFT
algorithm are given. The FFT processor shall compute a 1024-point Sande-
Tukey FFT or IFFT. It will have a sustained throughput of more than
1000 FFTs per second.
The system will communicate with an operator through a host processor.
A 32-bit I/O interface to a microprocessor is required. The I/O data rate
will be at least 8 MHz. The input and output data word length will be
16 bits for the real and imaginary part, respectively. The internal data
word length will be 21 bits. The word length for the coefficients, i.e.,
the twiddle factors, is 16 bits. The FFT processor will be implemented
as a self-contained system on one single chip. The chip area and power
consumption of the system shall be minimized.

Fig. 1. The FFT processor (the operator, the host processor, and the
input and output interfaces of the FFT processor).
4 Partitioning of the FFT
The original algorithm is described in a sequential form, in this case
using Pascal, as shown in Fig. 2. An inverse Fourier transform (IFFT)
is performed if the real and imaginary parts of the input and output
sequences are interchanged [Wanh91].

Program ST_FFT;
const
  N = 1024;
  NU = 10;
  Nminus1 = 1023;
type
  Complex = record
    re : Double;
    im : Double;
  end;
  Complexarr = array[0..Nminus1] of Complex;
var
  x : Complexarr;
  Wp : Complex;
  Stage, Ns, k, kNs, i, p, j : Integer;
  WCos, WSin, TwoPiN, TempRe, TempIm : Double;
begin
  { READ INPUT DATA INTO x }
  Ns := N;
  TwoPiN := 2 * Pi / N;
  for Stage := 1 to NU do
  begin
    k := 0;
    Ns := Ns div 2;
    for j := 1 to (N div (2 * Ns)) do
    begin
      for i := 1 to Ns do
      begin
        { 2^(Stage - 1) computed with a shift; Pascal has no ^ operator }
        p := (k * (1 shl (Stage - 1))) mod (N div 2);
        W_Process(Wp, p);
        kNs := k + Ns;
        Butterfly(k, kNs, Wp); { Butterfly process }
        k := k + 1;
      end;
      k := k + Ns;
    end;
  end;
  Unscramble;
  { OUTPUT DATA STORED IN x }
end.

Fig. 2. The original algorithm.
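The loop structure above can also be expressed in an executable form. The following Python sketch (not the paper's code; the function name and the bit-reversal unscrambling are our own) implements the same Sande-Tukey decimation-in-frequency recursion, with the twiddle exponent p = k · 2^(Stage-1) mod N/2 as in Fig. 2:

```python
import cmath

def st_fft(x):
    """Sande-Tukey (decimation-in-frequency) radix-2 FFT of a sequence x."""
    x = list(x)
    N = len(x)
    NU = N.bit_length() - 1          # log2(N)
    Ns = N
    for stage in range(1, NU + 1):
        k = 0
        Ns //= 2
        for _ in range(N // (2 * Ns)):       # groups in this stage
            for _ in range(Ns):              # butterflies in one group
                p = (k * 2 ** (stage - 1)) % (N // 2)
                Wp = cmath.exp(-2j * cmath.pi * p / N)   # twiddle factor W^p
                kNs = k + Ns
                # DIF butterfly: the sum passes through, the difference is twiddled
                x[k], x[kNs] = x[k] + x[kNs], (x[k] - x[kNs]) * Wp
                k += 1
            k += Ns
    # Unscramble: the outputs appear in bit-reversed order
    for i in range(N):
        j = int(format(i, f'0{NU}b')[::-1], 2)
        if i < j:
            x[i], x[j] = x[j], x[i]
    return x
```

As noted above, an IFFT is obtained by interchanging the real and imaginary parts of the input and output sequences.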

4.1 First Design Iteration


In the first design iteration, the system is partitioned into four commu-
nicating processes: an input process, an FFT process, an output process,
and a memory process. The input process is responsible for reading data
into the FFT processor from the outside world. The data are stored by the
memory process. Conversely, the output process writes data from the
memory to the outside world. The output process will also handle the
unscrambling of the data array in the last stage of the FFT, and the FFT
process handles the interchange of real and imaginary parts of data that
is required for computation of the IFFT. Hence, both of these tasks can
be accomplished without using any extra processing time.
Fig. 3. First iteration: the input, FFT, output, and memory processes,
connected by data, real/imaginary, read/write, and address signals.

The total time required for input and output is estimated to:

tI/O = (2 · 1024) / (8 · 10^6) s = 0.256 ms

This is assuming that complex words are transferred sequentially
to and from the FFT processor. The time remaining for the actual FFT
computation, with a throughput of 1000 FFTs per second, is:

tFFT = 0.744 ms

The number of butterfly operations required in the FFT is:

(N/2) log2(N) = 5120

A bit-serial butterfly PE can be implemented using 24 clock cycles.
Typical clock frequencies are about 110 MHz. Hence, the minimal
number of butterfly PEs is:

NPEb = (24 · 5120) / (0.744 · 10^-3 · 110 · 10^6) ≈ 1.5

Thus, we need at least two butterfly PEs to reach the necessary
speed. We can also make estimates of the data rate to and from the
memory process. For each butterfly operation we must read and write
two complex data. Hence, the data rate will be:

(2 + 2) · 5120 / (0.744 · 10^-3) ≈ 27.5 · 10^6 complex words/s
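These estimates are easy to reproduce; the following Python sketch (constants taken from the text above) recomputes the I/O time, the butterfly count, the PE lower bound, and the memory data rate:

```python
N = 1024                    # FFT length
f_io = 8e6                  # I/O rate, complex words/s
f_clk = 110e6               # typical bit-serial PE clock frequency
cycles_per_bfly = 24        # clock cycles per bit-serial butterfly

t_io = 2 * N / f_io                       # read in + write out, sequentially
t_fft = 1e-3 - t_io                       # time left per FFT at 1000 FFTs/s
n_bfly = (N // 2) * 10                    # (N/2) * log2(N) butterflies
n_pe = cycles_per_bfly * n_bfly / (t_fft * f_clk)  # lower bound on PE count
rate = 4 * n_bfly / t_fft                 # 2 reads + 2 writes per butterfly

print(round(t_io * 1e3, 3))    # 0.256 ms
print(round(n_pe, 1))          # 1.5 -> at least two butterfly PEs
print(round(rate / 1e6, 1))    # 27.5 million complex words/s
```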

In principle, it is possible to use only one logical memory, and
memories with this data rate can easily be implemented. However, we
choose to use two logical memories, which makes the implementation
of the memories simpler. Also, it is desirable that the memory clock
frequency is a multiple of the I/O frequency. We therefore select a
memory clock frequency of 16 MHz.
The following design iterations aim to transform the original
sequential algorithm into a parallel description that can be efficiently
mapped onto the hardware resources.

4.2 Second Design Iteration


In the second design iteration we exploit the fact that the twiddle factors
can be shared between the two PEs. In stage 1 of the 16-point Sande-Tukey
FFT, shown in Fig. 4, we have the following relation between W^p and
W^(p+N/4):

W^(p+N/4) = W^(N/4) · W^p = -j · W^p, since W^p = e^(-j2πp/N)
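This relation is easy to verify numerically; a small Python check for N = 16:

```python
import cmath

N = 16

def W(p):
    # twiddle factor W^p = e^(-j*2*pi*p/N)
    return cmath.exp(-2j * cmath.pi * p / N)

# W^(p+N/4) = -j * W^p holds for every exponent p
for p in range(N):
    assert abs(W(p + N // 4) - (-1j) * W(p)) < 1e-12
```

So in stage 1 a single stored factor W^p serves both butterflies; the second one only applies the trivial factor -j.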


Fig. 4. 16-point Sande-Tukey FFT: stages 1 to 4, with the twiddle
factors W^0 to W^7 annotated on the butterflies of each stage, followed
by unscrambling.

Hence, only one factor is required. In stages 2 to 4 it is possible to
schedule the butterfly processes so that two butterflies that use the same
twiddle factor are performed concurrently.
The sequential description is transformed into the form shown in
Fig. 5. This is still a sequential description, but the two butterfly
processes can in principle be performed in parallel.

Program ST_FFT;
begin
  { READ INPUT DATA INTO x }
  TwoPiN := 2 * Pi / N;
  Ns := N;
  for Stage := 1 to NU do
  begin
    Ns := Ns div 2;
    for m := 0 to N div 4 - 1 do
    begin
      Addresses(p, k, kNs, k2, k2Ns, m, Stage);
      W_Process(WCos, WSin, p);
      Butterfly1(k, kNs, Wp, Stage);   { the two butterflies share Wp }
      Butterfly2(k2, k2Ns, Wp, Stage);
    end;
  end;
  Unscramble;
  { OUTPUT DATA STORED IN x }
end.

Fig. 5. Second sequential description.
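The scheduling opportunity used here can be checked explicitly. With the index conventions of the original loop in Fig. 2 (the actual Addresses procedure is not shown in the text), the butterfly starting at data index k and the one starting at k + 2·Ns use the same twiddle exponent p in every stage from 2 onwards. A Python sketch for N = 16:

```python
N, NU = 16, 4

def exponent(k, stage):
    # twiddle exponent p of the butterfly whose first input is x[k]
    return (k * 2 ** (stage - 1)) % (N // 2)

for stage in range(2, NU + 1):
    Ns = N // 2 ** stage
    for base in range(0, N, 4 * Ns):      # take two groups (2*Ns apart) at a time
        for i in range(Ns):
            k1, k2 = base + i, base + 2 * Ns + i
            # the paired butterflies can share one twiddle factor
            assert exponent(k1, stage) == exponent(k2, stage)
```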

4.3 Final Design Iteration of the Algorithm


In the final iteration, we map the sequential description onto a parallel
description, as shown in Fig. 6. We also include memory transactions
and control loops as processes.
Fig. 6. Parallel description of the FFT: the input, output, and memory
processes, the two control loops (over Stage and m), and the Addresses,
W_Process, and two Butterfly processes.

5 Scheduling
As an example of the scheduling process, we will describe the schedul-
ing of the inner loop [1]. The inner loop processes are shown in Fig. 7.
Estimates of the execution time of the corresponding PEs are included in
the figure. The precedence relations denoted with dashed arrows are
imposed for control purposes; hence, they do not correspond to any
interchange of data. The inner loop is scheduled according to Fig. 8.

Fig. 7. The inner loop processes (estimated execution times: Addresses 2,
W_Process 2, each of the four Reads and Writes 2, each Butterfly 8).

Fig. 8. Final schedule of the inner loop: the Addresses and W_Process
processes, the four reads (R1-R4), the two butterfly processes, and the
four writes (W1-W4).

The rectangles denote the lifetimes of the processes. The white area
is the interval in which a process is active; however, its result is not
available until after the grey interval, because of pipelining of the PE
that executes the process.

6 Resource Allocation
Generally, the resource allocation step is simple. A lower bound on the
number of PEs can be found from the total number of operations per
second; the required number has to be determined from the process
schedule. The number of logical memories, or ports, is also determined
from the schedule, and is equal to the maximal number of values that
are read or written simultaneously [7].

7 Resource Assignment
In this step, the processes are assigned to specific resources, e.g.,
butterfly processes to butterfly PEs and variables to memories and
memory cells. The chip area required for memory is significant in this
application. We will therefore use an in-place FFT, where the result of a
butterfly operation is always written back to the same memory cells that
were used as inputs. Using this scheme, only 1024 complex-valued
memory cells are required.
Several memory assignments are possible. Figure 9 shows two
alternatives for a 16-point FFT. In the first alternative, the first half of
the data variables is allocated to RAM 0 and the second half to RAM 1.
In the second alternative, the variables are assigned so that a butterfly
always receives its input data from two different memories. This second
assignment alternative can be described by an EXOR function of all bits
in the binary representation of the memory address index i. The resulting
assignment is also shown in Fig. 9.

Fig. 9. Two different RAM assignment alternatives for the data variables
x(0) to x(15): Alt. 1 assigns the first half to RAM 0 and the second half
to RAM 1; Alt. 2 assigns by the EXOR of the address bits.
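The property claimed for the second alternative can be verified mechanically. Assigning x(i) to the RAM number given by the EXOR (parity) of the bits of i, the two inputs x(k) and x(k + Ns) of every butterfly in every stage of a 16-point Sande-Tukey FFT fall in different RAMs; a Python sketch:

```python
N, NU = 16, 4

def ram(i):
    # Alt. 2: EXOR of all bits of the memory address index i
    return bin(i).count("1") % 2

for stage in range(1, NU + 1):
    Ns = N // 2 ** stage
    for base in range(0, N, 2 * Ns):      # one butterfly group per 2*Ns addresses
        for i in range(Ns):
            k = base + i
            # the two butterfly inputs always come from different RAMs
            assert ram(k) != ram(k + Ns)
```

The reason is that bit log2(Ns) of k is always 0 within a group, so adding Ns flips exactly one bit and therefore the parity.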

There are also several alternatives for assignment of the
butterfly PEs. One alternative is to assign the first half of the butterfly
processes to PE0 and the second half to PE1, as indicated in Fig. 10.
Another alternative assures that a data variable, x(i), is always used as
input to the same input port of the butterfly PE. The resulting
assignment is also shown in Fig. 10. The mapping function of a
process corresponds to the EXOR function of all bits in the binary
representation of the butterfly number in a stage, counted from top to
bottom.

Fig. 10. The two different PE assignment alternatives for the 16-point
FFT: Alt. 1 assigns the first half of the butterflies in each stage to PE 0
and the second half to PE 1; Alt. 2 assigns by the EXOR of the bits of
the butterfly number.
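Combining the two EXOR patterns yields the property exploited by the third architectural alternative below: PE p always reads its first input from RAM p and its second input from the other RAM, so the bit-serial interconnection can be fixed. A Python check (butterfly number b counted from top to bottom in each stage, first input index k as in the Fig. 2 loop):

```python
N, NU = 16, 4
parity = lambda n: bin(n).count("1") % 2   # the EXOR of all bits

for stage in range(1, NU + 1):
    Ns = N // 2 ** stage
    for b in range(N // 2):                # butterfly number, top to bottom
        g, i = divmod(b, Ns)               # group and offset within group
        k = g * 2 * Ns + i                 # first input index; second is k + Ns
        pe = parity(b)                     # Alt. 2 PE assignment
        assert parity(k) == pe             # first input always from RAM pe
        assert parity(k + Ns) == 1 - pe    # second input from the other RAM
```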

7 Synthesis of an Optimal Architecture


First, we use a shared-memory architecture with bit-serial PEs and bit-
parallel RAMs. To convert between bit-parallel and bit-serial we use a
set of shift registers, one for each input/output of the PEs. These
registers are also used as cache memories to equalize the
communication data rates to the main memories.
The implementation of the interconnection network (ICN) depends
heavily on the PE and RAM assignment. Further, this part of the circuit
can prove to be expensive in terms of chip area and power consumption.
The scheduling, resource allocation, and assignment steps usually
optimize only resources such as PEs and RAMs. In this section we will
show that the resource assignment of PEs and RAMs influences the ICN
as well as the complexity of the control structures that are required.
In principle, we will have four different alternatives, due to the two
different assignment alternatives for RAMs and PEs, respectively.
The first alternative consists of the simple RAM assignment combined
with the EXOR pattern for the PE assignment. Its main advantage is
simple address generation for the RAMs. However, the ICN contains
switches on the bit-serial side of the architecture, and it would be
favorable to remove them.

Fig. 11. First architectural alternative (RAM 0 and RAM 1 connected
through S/P registers to PE 0 and PE 1).

The second architectural alternative is the combination of the simple
type of both RAM and PE assignment.

Fig. 12. Second architectural alternative.


The third architectural alternative is shown in Fig. 13. This architecture
is the result of using an EXOR pattern for both the memory and PE
assignments. The main advantage of this architecture is that the high-
speed interconnection network on the bit-serial side is fixed. The only
means to control the ICN is the possibility to choose which of the S/P
registers to write to or read from. Further, the address generation for
the RAMs can be designed so that this control becomes simple.

Fig. 13. Third architectural alternative.
Fig. 14. Final architecture: a base index generator driving address
generators 0 and 1 for RAM 0 and RAM 1, cache controls 0 and 1 for
the S/P registers, and PE 0 and PE 1.

The third architectural alternative is chosen as the final one. In Fig. 14
we have included the control of the architecture.

8 Acknowledgements
This work was supported by the Swedish Board for Technical Develop-
ment (STU).

9 References
International Journal of Computer Applications (0975 – 8887)
Volume 116 – No. 7, April 2015

FFT Architectures: A Review


Shubhangi M. Joshi.
Sathyabhama University,
Chennai

ABSTRACT
Fast Fourier Transform (FFT) is one of the most efficient algorithms widely used in the field of modern digital signal processing to compute the Discrete Fourier Transform (DFT). FFT is used in everything from broadband to 3G and digital TV to radio LANs. Due to its intensive computational requirements, it occupies large area and consumes high power in hardware. Different efficient algorithms have been developed to improve its architecture. This paper gives an overview of previous work on different FFT processors. A comparison of the different architectures is also discussed.

Keywords
Fast Fourier Transform (FFT), FFT architectures

1. INTRODUCTION
FFT processors are involved in a wide range of applications today: not only as a very important block in broadband systems, digital TV, etc., but also in areas like radar, medical electronics, imaging, and the SETI project (Search for Extra-Terrestrial Intelligence). Many of these systems are real-time systems, which means that the system has to produce a result within a specified time.

The workload for FFT computations is high, and a better approach than a general-purpose processor is required to fulfil the requirements at a reasonable cost. The major concerns for researchers are to meet real-time processing requirements, to reduce hardware complexity mainly with respect to area and power, and to improve the processing speed of the processor.

The DFT Algorithm: The DFT is defined as

X(k) = sum over n = 0 ... N-1 of x(n) · W_N^(nk), k = 0, 1, ..., N-1 (Eq. 1)

W_N = e^(-j2π/N) (Eq. 2)

These equations show that to compute all N values the DFT requires N² complex multiplications and N(N-1) complex additions. Since the amount of computation, and thus the computation time, is approximately proportional to N², it will cost a long computation time for large values of N. For this reason, it is very important to reduce the number of multiplications and additions. An efficient algorithm to compute the DFT is called the Fast Fourier Transform (FFT) algorithm. The FFT algorithm deals with these complexity problems by exploiting regularities in the DFT algorithm.

1.1 FFT Processor
The structure of an FFT processor contains a butterfly processing unit, RAM and ROM units for the storage of data, an address generation unit, and a sequential control unit. The main units of the FFT processor are the butterfly processing unit and the address generation unit. The dual-port RAM is used to store input data, intermediate results, and output. Twiddle factors are stored in ROM. The address generation unit generates the addresses for reading data for butterfly operations and also for storing the output data results in RAM. The sequential control unit generates the control signals for each module. [1]

Fig. 1. Block diagram of an FFT processor: a butterfly processing unit, ROM, RAM1 and RAM2, an address generator, serial-to-parallel and parallel-to-serial converters, and a sequential control unit.

2. FFT ARCHITECTURES
Different FFT architectures are classified as:
1. Memory Based
2. Cache Memory Based
3. Sequential
4. Parallel
5. Parallel Iterative
6. Array Architecture
7. Pipelined

2.1 Memory Based
Memory-based architectures mainly rely on the use of memory for their operation. These architectures generally consist of one or more processing elements (PEs) or butterflies, depending on the computation, memory blocks, and a control unit.

Memory-based architectures are classified into
 Single memory architecture
 Dual memory architecture

2.1.1 Single memory architecture
In this architecture the processing element is connected to a single memory unit by a bidirectional bus. Data exchanges take place between the processor and the memory at every stage using this bus.

Fig. 2. Single memory architecture.
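As a concrete illustration of the N² complexity that the architectures in this section are designed to mitigate, here is a minimal direct implementation of the DFT of Eq. 1 in Python (our own sketch, not from the paper), counting the complex multiplications explicitly:

```python
import cmath

def dft(x):
    """Direct DFT per Eq. 1; also counts the complex multiplications."""
    N = len(x)
    mults = 0
    X = []
    for k in range(N):
        acc = 0j
        for n in range(N):
            acc += x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
            mults += 1
        X.append(acc)
    return X, mults

X, mults = dft([1, 0, 0, 0, 0, 0, 0, 0])   # unit impulse, N = 8
print(mults)                               # 64 = N**2 for N = 8
```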

2.1.2 Dual memory architecture
In this type of architecture both memories are connected to the processing element with two separate bidirectional data buses. Data inputs are passed from one memory to the other through the processing element (PE), and vice versa, until the transform is completed. [2]

Fig. 3. Dual memory architecture.

2.2 Cache Memory Architecture
This architecture is mainly used to increase the speed of memory access and the energy efficiency, and to reduce the power consumption. The architecture is similar to that of the single memory architecture, except that a cache between the processor and the main memory prefetches the data. This architecture is not widely used due to the extra hardware and controller complexity. [2]

Fig. 4. Cache memory architecture.

2.3 Sequential Architecture
The basic sequential processor uses processing elements (PEs) for computing butterflies. The same memory can be used to store input data, output data, intermediate results, and twiddle factors. The amount of hardware involved is very small, and it requires (N/2) log2 N sequential operations to compute the FFT.

2.4 Parallel Architecture
This is also known as the in-place architecture. It consists of a butterfly unit and three multiport buffers: one to parallelize the input data, one for the data being processed, and one for the output. At the butterfly output a switching module branches the result to the right memory locations. The control of this type of architecture is complicated, as there is a lot of resource sharing. It is used for low- to moderate-speed applications. The feature of this architecture is high throughput but the worst hardware efficiency. [17][18]

Fig. 5. Parallel architecture (a grid of butterfly units between input and output).

2.5 Parallel Iterative Architecture
The performance of an FFT processor can be improved further by adding more processing elements in every sequential pipeline stage. Butterflies are computed in parallel in every stage. The total execution time required is log2 N cycles.

2.6 Array Architecture
A fully parallel structure can be obtained by having a PE for each of the butterfly operations. A number of processing elements with local buffers are interconnected in a network fashion to compute the FFT. As the architecture requires huge area and a lot of hardware, this is not an attractive option for large N.

Fig. 6. Array architecture (processing elements with local buffers interconnected in a network).

2.7 Pipeline Architecture
This architecture is also known as the cascaded FFT architecture, and it is used in most designs. The basic structure of a pipelined FFT is shown in Fig. 7: between each stage of radix-r PEs there is a commutator, and the last stage is an unscrambling stage. The commutator reorders the output data from the previous stage and feeds it to the next stage. The unscrambler rearranges the data in natural sorted order. In the pipeline figures, a denotes the stage number in the pipeline, the number in the boxes gives the size of that FIFO in complex samples, and C2 is a switch.

Fig. 7. General structure of a pipelined FFT architecture.

The performance of this architecture can be improved by using a separate arithmetic unit for each stage of the FFT processor, and the throughput can be increased by a factor of log2 N using different units in the pipeline. Pipelined FFT processors have features like high throughput, simplicity, speed, small area, and energy-efficient implementation.

The most commonly used pipelined architectures are the Multipath Delay Commutator (MDC), the Single-Path Delay Commutator (SDC), and the Single-Path Delay Feedback (SDF). [2][18]

2.7.1 Multipath Delay Commutator
In this architecture, the input sequence is first divided into multiple parallel data streams by a commutator. This data then goes to the butterfly unit for computation. The butterfly operation is then followed by a twiddle-factor multiplication, with a proper delay at each data stream. All butterfly and multiplier units are 100% utilised with proper input buffering.

Fig. 8. Multipath Delay Commutator structure.

Fig. 9. Radix-2 Multipath Delay Commutator structure (N = 16): commutators and radix-2 butterflies separated by delay elements.

2.7.2 Single Path Delay Commutator

Fig. 8. Single Path Delay Commutator structure.

2.7.3 Single Path Delay Feedback
In this architecture a single data stream goes through the multiplier in every stage. The commutator used for the SDF is somewhat different because it also feeds data backwards. The delay units are more efficiently utilised by sharing the same storage between the input and the output of the butterfly unit. The multiplier and butterfly units are utilised only 50% of the time because they are bypassed half the time.

Fig. 10. Single Path Delay Feedback structure.

3. RESULTS AND CONCLUSION
Among these various architectures, memory-based and pipelined architectures are the most widely used.

Table 1: Comparison of Pipelined FFT Architectures

Architecture            R2-SDF     R4-SDF     R2-MDC     R4-MDC
Delay Buffer            N-1        N-1        3N/2-2     5N/2-4
Complex Adders          2 log2 N   8 log4 N   2 log2 N   8 log4 N
Adder Utilization       50%        25%        100%       100%
Complex Multipliers     log2 N-1   log4 N-1   log2 N-1   3 log4 N-1
Multiplier Utilization  50%        75%        100%       100%
Clock Rate              1          1          0.5        0.25
Control                 Simple     Medium     Simple     Simple

The above comparison shows that in the case of the multipath delay commutator (MDC) two samples can be processed in parallel, which improves performance over serial designs but requires larger memory.

Table 2 shows the comparison between the pipelined Single-Path Delay Feedback (SDF) architecture and the memory-based architecture for a radix-r N-point FFT implementation. The comparison is made in terms of storage requirement, memory banks, complex multipliers, and complex adders. Power consumption can be reduced in the pipelined SDF architecture with efficient implementations of the sequential buffers, whereas in the memory-based architecture random addressing is necessary to achieve conflict-free memory access. So, pipelined architectures are preferred when performance and power are the main concerns rather than hardware complexity. On the other hand, memory-based architectures are a good choice where complexity is the main concern. [2]

Further, the performance can be improved by using higher-radix algorithms, more parallel architectures, or folding techniques.

Table 2: Comparison between Memory Based and Pipelined SDF FFT Architectures

Architecture              Memory Based   Single Path Delay Feedback
Algorithm                 Radix-r        Radix-r
Storage Requirement       N              N-1
Memory banks (dual port)  r              log2 N