Unit-5 DSP
We will not discuss the specification of the FFT processor in detail. Here
we only assume that the speed and required accuracy of the FFT algorithm
are given. The FFT processor shall compute a 1024-point Sande-Tukey FFT
or IFFT. It will have a sustained throughput of more than 1000 FFTs per
second.
The system will communicate with an operator through a host processor.
A 32-bit I/O interface to a microprocessor is required. The I/O data rate
will be at least 8 MHz. The input and output data word length will be
16 bits each for the real and imaginary parts. The internal data word
length will be 21 bits. The word length for the coefficients, i.e., the
twiddle factors, is 16 bits.
The FFT processor will be implemented as a self-contained system on a
single chip. The chip area and power consumption of the system shall be
minimized.
[Figure: the operator communicates with the host processor, which is
connected to the FFT processor through an input interface and an output
interface.]
Fig. 1. The FFT processor.
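To give a feeling for the 16-bit coefficient word length, the following minimal sketch quantizes the 512 twiddle factors of a 1024-point FFT. The number format is an assumption that the specification above does not state: two's-complement fractions with 15 fractional bits, rounding to nearest, and saturation at the positive limit. The names `quantize` and `twiddle_table` are illustrative only.

```python
import math

COEFF_BITS = 16  # coefficient word length from the specification


def quantize(value, bits=COEFF_BITS):
    """Round value in [-1, 1] to a signed integer with bits-1 fractional bits.

    Assumption: two's-complement fractional format with saturation, so that
    +1.0 maps to the largest positive code instead of overflowing.
    """
    scale = 1 << (bits - 1)
    q = round(value * scale)
    return max(-scale, min(scale - 1, q))


def twiddle_table(n=1024, bits=COEFF_BITS):
    """Quantized (real, imaginary) parts of W^p = exp(-j*2*pi*p/n), p < n/2."""
    table = []
    for p in range(n // 2):
        angle = 2 * math.pi * p / n
        table.append((quantize(math.cos(angle), bits),
                      quantize(-math.sin(angle), bits)))
    return table
```

Under these assumptions, every coefficient except W^0 is represented with an error of at most half a least significant bit, while the real part of W^0 saturates one step below 1.0.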
4 Partitioning of the FFT
The original algorithm is described in a sequential form, in this case
using Pascal, as shown in Fig. 2. An inverse Fourier transform (IFFT)
is performed if the real and imaginary parts of the input and output
sequences are interchanged [Wanh91].
Program ST_FFT;
{ W_Process, Butterfly, and Unscramble are declared elsewhere }
const
  N = 1024;
  NU = 10;                { NU = log2(N) }
  Nminus1 = 1023;
type
  Complex = record
    re : Double;
    im : Double;
  end;
  Complexarr = array[0..Nminus1] of Complex;
var
  x : Complexarr;
  Wp : Complex;
  Stage, Ns, k, kNs, i, p, j : Integer;
  WCos, WSin, TwoPiN, TempRe, TempIm : Double;
begin
  { READ INPUT DATA INTO x }
  Ns := N;
  TwoPiN := 2 * Pi/N;
  for Stage := 1 to NU do
  begin
    k := 0;
    Ns := Ns div 2;
    for j := 1 to (N div (2 * Ns)) do
    begin
      for i := 1 to Ns do
      begin
        { Twiddle factor exponent; 1 shl (Stage - 1) = 2^(Stage - 1) }
        p := (k * (1 shl (Stage - 1))) mod (N div 2);
        W_Process(Wp, p);          { Twiddle factor process }
        kNs := k + Ns;
        Butterfly(k, kNs, Wp);     { Butterfly process }
        k := k + 1;
      end;
      k := k + Ns;
    end;
  end;
  Unscramble;
  { OUTPUT DATA STORED IN x }
end.
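As a cross-check of the loop structure in Fig. 2, the following is a minimal Python sketch of the same in-place Sande-Tukey algorithm. Several details are assumptions, since the bodies of the processes are not given here: the butterfly is taken to compute the sum and the twiddled difference with W = exp(-j2π/N), and Unscramble is taken to be a bit-reversal permutation. The names `st_fft`, `st_ifft`, `swap`, and `unscramble` are illustrative.

```python
import cmath


def unscramble(x):
    """Reorder x in place by bit-reversed index (assumed meaning of Unscramble)."""
    n = len(x)
    bits = n.bit_length() - 1
    for i in range(n):
        r = int(format(i, "0{}b".format(bits))[::-1], 2)
        if r > i:
            x[i], x[r] = x[r], x[i]


def st_fft(x):
    """In-place radix-2 Sande-Tukey FFT with the loop structure of Fig. 2."""
    n = len(x)
    nu = n.bit_length() - 1
    assert n >= 2 and n == 1 << nu, "length must be a power of two"
    ns = n
    for stage in range(1, nu + 1):
        k = 0
        ns //= 2
        for _ in range(n // (2 * ns)):
            for _ in range(ns):
                p = (k * 2 ** (stage - 1)) % (n // 2)
                w = cmath.exp(-2j * cmath.pi * p / n)   # W_Process
                k_ns = k + ns
                a, b = x[k], x[k_ns]                    # Butterfly
                x[k] = a + b
                x[k_ns] = (a - b) * w
                k += 1
            k += ns
    unscramble(x)
    return x


def swap(z):
    """Interchange the real and imaginary parts of z."""
    return complex(z.imag, z.real)


def st_ifft(x):
    """Inverse transform by swapping real and imaginary parts (unscaled)."""
    y = st_fft([swap(v) for v in x])
    return [swap(v) for v in y]
```

Note that `st_ifft` returns the inverse transform multiplied by N; whether the processor applies the 1/N scaling is not stated in this section, so the sketch leaves it out.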
[Figure: block diagram of the Input, FFT, and Output processes and the
Memory. Signal labels: Data, Address, Read/Write, Real/Imag, FFT/IFFT,
and Finished.]
Fig. 3. First iteration.
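The total time required for input and output is estimated to be:
$$t_{I/O} = \frac{2 \cdot 1024}{8 \cdot 10^{6}} = 0.256 \text{ ms}$$
$$t_{FFT} = 0.744 \text{ ms}$$
$$\frac{N}{2} \log_2(N) = 5120$$
$$N_{PEb} = \frac{24 \cdot 5120}{0.744 \cdot 10^{-3} \cdot 110 \cdot 10^{6}} \approx 1.5$$
$$\frac{(2 + 2) \cdot 5120}{0.744 \cdot 10^{-3}} = 27.5 \cdot 10^{6} \text{ complex words/s}$$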
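The estimates above can be reproduced with a few lines of Python. The interpretation of the constants is partly an assumption: we read the factor 24 as clock cycles per butterfly, 110 · 10^6 as the PE clock frequency in Hz, and (2 + 2) as two reads plus two writes of complex words per butterfly. The time left for the FFT is taken as the 1 ms frame implied by 1000 FFTs per second minus the I/O time.

```python
N = 1024
FFTS_PER_SECOND = 1000        # required sustained throughput
IO_RATE_HZ = 8e6              # I/O data rate
CYCLES_PER_BUTTERFLY = 24     # assumption: meaning of the factor 24
PE_CLOCK_HZ = 110e6           # assumption: meaning of 110 * 10^6
ACCESSES_PER_BUTTERFLY = 2 + 2  # assumption: two reads and two writes

# N words in and N words out over the I/O interface.
t_io = 2 * N / IO_RATE_HZ

# Time remaining for the computation within one frame.
t_fft = 1 / FFTS_PER_SECOND - t_io

# Number of butterflies in a radix-2 FFT: (N/2) * log2(N).
butterflies = (N // 2) * (N.bit_length() - 1)

# Lower bound on the number of butterfly PEs.
n_pe = CYCLES_PER_BUTTERFLY * butterflies / (t_fft * PE_CLOCK_HZ)

# Required memory bandwidth in complex words per second.
memory_rate = ACCESSES_PER_BUTTERFLY * butterflies / t_fft
```

Since a fractional PE cannot be built, an estimate of about 1.5 means that at least two butterfly PEs are needed.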
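[Figure: signal-flow graph of the 16-point FFT for data indices 0 to 15,
stages 1 to 4, followed by the unscramble step. Twiddle-factor labels in
the figure, from top to bottom:]
Stage 1   Stage 2   Stage 3   Stage 4
W0        W0        W0        W0
W1        W2        W4        W0
W2        W4        W0        W0
W3        W6        W4        W0
W4        W0        W0        W0
W5        W2        W4        W0
W6        W4        W0        W0
W7        W6        W4        W0
Fig. 4. 16-point Sande-Tukey FFT.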
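The twiddle-factor labels in Fig. 4 follow directly from the index expression p = k · 2^(Stage − 1) mod (N/2) in Fig. 2. The short sketch below enumerates the exponents in loop order for each stage; the function name `twiddle_exponents` is illustrative.

```python
def twiddle_exponents(n):
    """Return one list per stage with the exponent p of each butterfly,
    in the order in which the loops of Fig. 2 visit the butterflies."""
    nu = n.bit_length() - 1
    table = []
    ns = n
    for stage in range(1, nu + 1):
        ns //= 2
        k = 0
        column = []
        for _ in range(n // (2 * ns)):
            for _ in range(ns):
                column.append((k * 2 ** (stage - 1)) % (n // 2))
                k += 1
            k += ns          # skip the lower halves of the butterflies
        table.append(column)
    return table


for stage, column in enumerate(twiddle_exponents(16), start=1):
    print("Stage", stage, ["W%d" % p for p in column])
```

For N = 16 only the exponents 0 to 7 occur in stage 1, the even exponents in stage 2, the exponents 0 and 4 in stage 3, and only W0 in stage 4.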
Program ST_FFT;
{ Declarations as in Fig. 2, with m, k2, k2Ns : Integer added }
begin
  { READ INPUT DATA INTO x }
  TwoPiN := 2 * Pi / N;
  Ns := N;
  for Stage := 1 to NU do
  begin
    Ns := Ns div 2;
    for m := 0 to N div 4 - 1 do
    begin
      { Two butterflies are computed in each iteration }
      Addresses(p, k, kNs, k2, k2Ns, m, Stage);
      W_Process(Wp, p);
      Butterfly1(k, kNs, Wp, Stage);
      Butterfly2(k2, k2Ns, Wp, Stage);
    end;
  end;
  Unscramble;
  { OUTPUT DATA STORED IN x }
end.
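The Addresses process is not spelled out in this section. As an illustration only, the sketch below shows one address mapping that is consistent with the program above, in which both butterflies of an iteration share the twiddle factor Wp. It is an assumption, not necessarily the mapping used in the design: butterfly number m is paired with butterfly number m + N/4 of the same stage.

```python
def addresses(m, stage, n):
    """Hypothetical Addresses process: return (p, k, kNs, k2, k2Ns).

    The first butterfly is number m of the stage and the second is number
    m + n/4, both counted in the loop order of Fig. 2.
    """
    ns = n >> stage                      # Ns after "Ns := Ns div 2"

    def upper_index(b):
        # Index of the upper input of butterfly number b in this stage.
        return (b // ns) * 2 * ns + (b % ns)

    k = upper_index(m)
    k2 = upper_index(m + n // 4)
    p = (k << (stage - 1)) % (n // 2)    # twiddle exponent of butterfly 1
    return p, k, k + ns, k2, k2 + ns
```

With this pairing, the exponent of the second butterfly equals p in every stage except the first, where it is p + N/4, i.e., the twiddle factor differs only by the trivial factor W^(N/4) = -j. This may be the reason why Butterfly1 and Butterfly2 receive Stage as a parameter, but that reading is our inference.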
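[Figure: process graph of the partitioned algorithm. The initialization
(TwoPiN := 2 * Pi / N; Ns := N), the Stage loop (for Stage := 1 to NU do)
with Ns := Ns div 2, and the inner loop (for m := 0 to (N div 4 - 1) do)
are linked by "doit" and "next" signals. The inner loop contains the
Addresses, W_Process, and two Butterfly processes, which are connected
to the Memory; an Output process is also shown.]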
5 Scheduling
As an example of the scheduling process, we will describe the scheduling
of the inner loop [1]. The inner loop processes are shown in Fig. 7.
Estimates of the execution times of the corresponding PEs are included in
the figure. The precedence relations denoted with dashed arrows are
imposed for control purposes. Hence, they do not correspond to any
interchange of data. The inner loop is scheduled according to Fig. 8.
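[Fig. 7: the inner loop (Ns := Ns div 2; for m := 0 to (N div 4 - 1) do)
with the Addresses process and two Butterfly processes. Legible execution
time estimates: 2, Addresses 2, Butterfly 8.]
[Fig. 8: schedule of the inner loop. Rows: m and A; Butterfly1 with R1,
W1, R2, W2; p and W; Butterfly2 with R3, W3, R4, W4.]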
The rectangles denote the life-time of the processes. The white area
is the interval when the process is active. However, the result is not
available until after the grey interval, because of pipelining of the PE
that will execute the process.
6 Resource Allocation
Generally, the resource allocation step is simple. A lower bound on the
number of PEs can be found from the total number of operations per
second. The number actually required has to be determined from the
process schedule. The number of logical memories, or ports, is also
determined from the schedule, and is equal to the maximal number of
values that are read/written simultaneously [7].
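Both quantities can be computed mechanically. The sketch below is a minimal illustration: `pe_lower_bound` rounds the operation load up to a whole number of PEs, and `max_concurrent` finds the largest number of memory accesses that overlap in a schedule. The function names are illustrative, and any access intervals passed to `max_concurrent` in the usage below are hypothetical values, not taken from Fig. 8.

```python
import math


def pe_lower_bound(ops_per_second, seconds_per_op):
    """Smallest whole number of PEs that can sustain the operation rate."""
    return math.ceil(ops_per_second * seconds_per_op)


def max_concurrent(intervals):
    """Largest number of half-open intervals [start, end) active at once."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times an end (-1) sorts before a start (+1), so intervals
    # that merely touch are not counted as simultaneous.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak
```

For example, 5120 butterflies in 0.744 ms with an assumed 24 cycles at 110 MHz per butterfly give `pe_lower_bound(5120 / 0.744e-3, 24 / 110e6) == 2`, and a hypothetical schedule with accesses `[(0, 1), (0, 1), (1, 2), (1, 2)]` requires two ports.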
7 Resource Assignment
In this step, the processes are assigned to specific resources, e.g.,
butterfly processes to butterfly PEs and variables to memories and
memory cells. The chip area required for memory is significant in this
application. We will therefore use an in-place FFT, where the result of a
butterfly operation is always written back to the same memory cells that
were used as inputs. Using this scheme, only 1024 complex-valued
memory cells are required.
Several memory assignments are possible. Figure 9 shows two
alternatives for a 16-point FFT. In the first alternative, the first half of
the data variables is allocated to RAM 0 and the second half to RAM 1.
In the second alternative, the variables are assigned so that a butterfly
always receives input data from two different memories. The second
[Fig. 9: assignment of the data variables x(i), i = 0 to 15, to RAM 0
and RAM 1 for Alt. 1 and Alt. 2. Legible assignment: indices 0 to 7 in
RAM 0 and indices 8 to 15 in RAM 1.]
[Figure: assignment of the butterflies of the 16-point FFT, stages 1 to 4
with twiddle factors as in Fig. 4, to PE 0 and PE 1 for Alt. 1 and Alt. 2.]
[Figure: architecture with RAM 0 and RAM 1 connected through S/P
registers to PE 0 and PE 1.]
The third architectural alternative is shown in Fig. 13. This
architecture is the result of using an EXOR pattern for both the memory
and PE assignments. The main advantage of this architecture is that the
high-speed interconnection network (ICN) on the bit-serial side is fixed.
The only means of controlling the ICN is the choice of which S/P
register to write to or read from. Further, the address generation for
the RAM can be designed so that this control becomes simple.
[Figure: S/P registers connected to PE 0 and PE 1.]
Fig. 13. Third architectural alternative.
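The requirement that a butterfly always receives its two inputs from different memories can be checked mechanically. The sketch below compares the first memory alternative (first half of the indices in RAM 0) with an exclusive-or assignment. Here we assume that the EXOR pattern means the exclusive-or (parity) of all bits of the data index; the exact pattern is not spelled out in this section, and the function names are illustrative.

```python
def ram_first_half(i, n):
    """Alt. 1: first half of the data in RAM 0, second half in RAM 1."""
    return 0 if i < n // 2 else 1


def ram_parity(i):
    """Assumed EXOR pattern: exclusive-or of all bits of the index."""
    return bin(i).count("1") % 2


def butterfly_pairs(n):
    """Yield the index pairs (k, k + Ns) of every butterfly in every stage."""
    ns = n
    while ns > 1:
        ns //= 2
        for group in range(0, n, 2 * ns):
            for k in range(group, group + ns):
                yield k, k + ns


def conflict_free(n, ram_of):
    """True if no butterfly reads both of its inputs from the same RAM."""
    return all(ram_of(a) != ram_of(b) for a, b in butterfly_pairs(n))
```

The two inputs of a butterfly differ in exactly one index bit, so their parities always differ and the exclusive-or assignment is conflict-free in every stage. The first alternative satisfies the requirement only in stage 1.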
[Figure: a base index generator feeds address generator 0 and address
generator 1, which address RAM 0 and RAM 1. Cache control 0 and cache
control 1 control the S/P registers that connect the RAMs to PE 0 and
PE 1.]
8 Acknowledgements
This work was supported by the Swedish Board for Technical Development
(STU).
9 References
International Journal of Computer Applications (0975 – 8887)
Volume 116 – No. 7, April 2015
...