FFT Full Docc
DISSERTATION
SUBMITTED IN PARTIAL FULFILMENT OF
THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
BY
1999-2000
CONTENTS
CHAIRMAN
DEPARTMENT OF ELECTRONICS ENGINEERING
Z.H. COLLEGE OF ENGINEERING AND TECHNOLOGY
AMU, ALIGARH
INDIA
CERTIFICATE
Certified that the dissertation entitled "VHDL modeling and FPGA based
record of the candidates own work carried out by him under our supervision and
guidance. The matter embodied in this dissertation has not been submitted for the
(Reader)
I feel privileged and highly obliged to my parents and all my family.
My special thanks go to all the lab attendants for their cooperation.
^C- fabrication
am.aruzzam.an
August 2000
CHAPTER-1
INTRODUCTION
Digital signal processing, a field which has its roots in 17th and 18th century
mathematics, has become an important modern tool in a multitude of diverse fields of
science and technology. The techniques and applications of this field are as old as
Newton and Gauss and as new as digital computers and integrated circuits.
Digital signal processing is concerned with the representation of signals by
sequences of numbers or symbols and the processing of these sequences. The purpose
of such processing may be to estimate characteristic parameters of a signal or to
transform a signal into a form, which is in some sense more desirable.
The evolution of a new point of view toward digital signal processing was
further accelerated by the disclosure in 1965 of an efficient algorithm for computation
of Fourier transforms. This class of algorithms has come to be known as the fast
Fourier transform or FFT. The implications of the FFT were significant from a
number of points of view. Many signal-processing algorithms, which had been
developed on digital computers, required processing times several orders of
magnitude greater than real time. Often this was tied to the facts that spectrum
analysis was an important component of the signal processing and that no efficient
means had been known for implementing it. The fast Fourier transform algorithm
reduced the computation time of the Fourier transform by orders of magnitude. This
permitted the implementation of increasingly sophisticated signal processing
algorithms with processing times that allowed interaction with the system.
Furthermore, with the realization that the fast Fourier transform algorithm might, in
fact, be implementable in special purpose digital hardware, many signal processing
algorithms which previously had appeared to be impractical began to appear to have
practical implementations with special purpose digital hardware.
2.1 Introduction
Discrete Fourier transform plays an important role in the analysis, the design,
and the implementation of digital signal processing algorithms and systems. One of
the reasons that Fourier analysis is of such wide-ranging importance in digital signal
processing is because of the existence of efficient algorithms for computing the
discrete Fourier transform [1].
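The discrete Fourier transform (DFT) is
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1 \qquad [2.1]$$
where $W_N = e^{-j(2\pi/N)}$. The inverse discrete Fourier transform (IDFT) is
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-kn}, \qquad n = 0, 1, \ldots, N-1 \qquad [2.2]$$
In Eqs. [2.1] and [2.2], both x(n) and X(k) may be complex. The expressions of Eqs.
[2.1] and [2.2] differ only in the sign of the exponent of $W_N$ and in a scale factor
1/N. Thus a discussion of computation procedures for Eq. [2.1] applies, with
straightforward modifications, to Eq. [2.2]. Since x(n) may be complex, Eq. [2.1] can
be written in terms of real operations as
$$X(k) = \sum_{n=0}^{N-1} \Big\{ \big( \mathrm{Re}[x(n)]\,\mathrm{Re}[W_N^{kn}] - \mathrm{Im}[x(n)]\,\mathrm{Im}[W_N^{kn}] \big) + j \big( \mathrm{Re}[x(n)]\,\mathrm{Im}[W_N^{kn}] + \mathrm{Im}[x(n)]\,\mathrm{Re}[W_N^{kn}] \big) \Big\}, \qquad k = 0, 1, \ldots, N-1 \qquad [2.3]$$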
From Eq. [2.3] it is clear that for each value of k, the direct computation of X(k)
requires 4N real multiplications and (4N - 2) real additions. Since X(k) must be
computed for N different values of k, the direct computation of the discrete Fourier
transform of a sequence x(n) requires $4N^2$ real multiplications and N(4N - 2) real
additions or, alternatively, $N^2$ complex multiplications and N(N - 1) complex
additions. In addition to the multiplications and additions called for by Eq. [2.3], the
implementation of the computation of the DFT on a general-purpose digital computer
or with special-purpose hardware of course requires provision for storing and
accessing the input sequence values x(n) and values of the coefficients $W_N^{kn}$. Since
the amount of accessing and storing of data in numerical computation algorithms is
generally proportional to the number of arithmetic operations, it is generally accepted
that a meaningful measure of complexity, or of the time required to implement a
computational algorithm, is the number of multiplications and additions required.
Thus, for the direct computation of the discrete Fourier transform, a convenient
measure of the efficiency of the computation is the fact that $4N^2$ real multiplications
and N(4N - 2) real additions are required. Since the amount of computation, and thus
the computation time, is approximately proportional to $N^2$, it is evident that the
number of arithmetic operations required to compute the DFT by the direct method
becomes very large for large values of N. For this reason, computational procedures
that reduce the number of multiplications and additions are of considerable interest.
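The operation counts quoted above can be checked with a small Python sketch of the direct method. This is an illustration only, not part of the processor design; the function names `direct_dft` and `direct_dft_real_ops` are chosen here for the example, and the convention $W_N = e^{-j(2\pi/N)}$ is assumed.

```python
import cmath

def direct_dft(x):
    """Direct evaluation of X(k) = sum_n x(n) * W_N^(k*n), W_N = exp(-2j*pi/N).

    Returns the transform and the number of complex multiplications used.
    """
    N = len(x)
    # Table of the N distinct coefficient values W_N^m, m = 0 .. N-1
    W = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    X = []
    complex_mults = 0
    for k in range(N):
        acc = 0j
        for n in range(N):
            # W_N^(k*n) is periodic in its exponent with period N
            acc += x[n] * W[(k * n) % N]
            complex_mults += 1
        X.append(acc)
    return X, complex_mults

def direct_dft_real_ops(N):
    """Real multiplications and real additions of the direct method."""
    return 4 * N * N, N * (4 * N - 2)

if __name__ == "__main__":
    X, mults = direct_dft([1, 0, 0, 0])
    print(X, mults)                 # a unit impulse transforms to all ones
    print(direct_dft_real_ops(8))   # counts for N = 8
```

The inner loop executes N times for each of the N output points, which is the $N^2$ growth discussed in the text.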
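Most approaches to improving the efficiency of the computation of the DFT
exploit one or both of the following special properties of the quantities $W_N^{kn}$:
1. $W_N^{k(N-n)} = (W_N^{kn})^{*}$ (symmetry)
2. $W_N^{kn} = W_N^{k(n+N)} = W_N^{(k+N)n}$ (periodicity)
For example, using the first property, i.e., the symmetry of the cosine and sine
functions, we can group terms in Eq. [2.3] as
$$\mathrm{Re}[x(n)]\,\mathrm{Re}[W_N^{kn}] + \mathrm{Re}[x(N-n)]\,\mathrm{Re}[W_N^{k(N-n)}] = \big( \mathrm{Re}[x(n)] + \mathrm{Re}[x(N-n)] \big)\, \mathrm{Re}[W_N^{kn}]$$
and
$$-\mathrm{Im}[x(n)]\,\mathrm{Im}[W_N^{kn}] - \mathrm{Im}[x(N-n)]\,\mathrm{Im}[W_N^{k(N-n)}] = -\big( \mathrm{Im}[x(n)] - \mathrm{Im}[x(N-n)] \big)\, \mathrm{Im}[W_N^{kn}]$$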
Algorithms that exploit both the symmetry and the periodicity of the
sequence $W_N^{kn}$ were known long before the era of high-speed digital computation. At
that time, any scheme that reduced hand computation by even a factor of 2 was
welcomed. Runge [2] and later Danielson and Lanczos [3] described algorithms for
which computation was roughly proportional to N log N rather than $N^2$. The
possibility of greatly reduced computation was generally overlooked until about 1965,
when Cooley and Tukey [1] published an algorithm for the computation of the
discrete Fourier transform that is applicable when N is a composite number; i.e., N is
the product of two or more integers. The publication of this paper touched off a flurry
of activity in the application of the discrete Fourier transform to signal processing and
resulted in the discovery of a number of computational algorithms which have come
to be known as fast Fourier transform, or simply FFT, algorithms. Collectively, the
entire set of such algorithms is often loosely referred to as "the FFT" [5].
The fundamental principle that all these algorithms are based upon is that of
decomposing the computation of the discrete Fourier transform of a sequence of
length N into successively smaller discrete Fourier transforms. The manner in which
this principle is implemented leads to a variety of different algorithms, all with
comparable improvements in computational speed. Here we will discuss two classes
of algorithms.
1. Decimation-in-time FFT algorithm.
2. Decimation-in-frequency FFT algorithm.
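Algorithms in which the decomposition is based on decomposing
the sequence x(n) into successively smaller subsequences are called
decimation-in-time algorithms. The principle of decimation-in-time is most
conveniently illustrated by considering the special case of N an integer power of 2; i.e.,
$$N = 2^{\nu}$$
Since N is an even integer, we can consider computing X(k) by separating x(n) into
two N/2-point sequences consisting of the even-numbered points in x(n) and the
odd-numbered points in x(n). With X(k) given by
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1$$
and separating x(n) into its even- and odd-numbered points we obtain
$$X(k) = \sum_{n\ \mathrm{even}} x(n)\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x(n)\, W_N^{nk}$$
or with the substitution of variables n = 2r for n even and n = 2r + 1 for n odd,
$$X(k) = \sum_{r=0}^{(N/2)-1} x(2r)\, (W_N^{2})^{rk} + W_N^{k} \sum_{r=0}^{(N/2)-1} x(2r+1)\, (W_N^{2})^{rk}$$
Since $W_N^{2} = e^{-2j(2\pi/N)} = W_{N/2}$, this can be written as
$$X(k) = \sum_{r=0}^{(N/2)-1} x(2r)\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{(N/2)-1} x(2r+1)\, W_{N/2}^{rk} = G(k) + W_N^{k} H(k) \qquad [2.6]$$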
Each of the sums in Eq. [2.6] is recognized as an N/2-point DFT, the first sum being
the N/2-point DFT of the even-numbered points of the original sequence and the
second being the N/2-point DFT of the odd-numbered points of the original sequence.
Although the index k ranges over N values, k = 0, 1, ..., N - 1, each of the sums need
only be computed for k between 0 and N/2 - 1, since G(k) and H(k) are each periodic
in k with period N/2. After the two DFTs corresponding to the two sums in Eq. [2.6]
are computed, they are then combined to yield the N-point DFT, X(k). Fig. 2.1
indicates the computation involved in computing X(k) according to Eq. [2.6] for an
eight-point sequence, i.e., for N = 8. In this figure we have used the signal flow graph
conventions for representing difference equations [5,7]. That is, branches entering a
node are summed to produce the node variable. When no coefficient is indicated, the
branch transmittance is assumed to be one. For other branches, the transmittance of a
branch is an integer power of $W_N$.
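[Fig. 2.1: flow graph of the decimation-in-time decomposition of an eight-point DFT into two four-point DFTs, with inputs x(0) to x(7) and outputs X(0) to X(7) (figure not reproduced).]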
Thus we note in Fig. 2.1 that two four-point DFTs are computed, with G(k)
designating the four-point DFT of the even-numbered points and H(k) designating the
four-point DFT of the odd-numbered points. X(0) is then obtained by multiplying
H(0) by $W_N^{0}$ and adding the product to G(0). X(1) is obtained by multiplying H(1) by
$W_N^{1}$ and adding the result to G(1). For X(4) we would want to multiply H(4) by $W_N^{4}$
and add the result to G(4). However, since G(k) and H(k) are both periodic in k with
period 4, H(4) = H(0) and G(4) = G(0). Thus X(4) is obtained by multiplying H(0) by
$W_N^{4}$ and adding the result to G(0).
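With the computation restructured according to Eq. [2.6], we can compare the
number of multiplications and additions required with those required for a direct
computation of the DFT. Previously we saw that for direct computation without
exploiting symmetry, $N^2$ complex multiplications and additions were required. By
comparison, Eq. [2.6] requires the computation of two N/2-point DFTs, which in turn
requires $2(N/2)^2$ complex multiplications and approximately $2(N/2)^2$ complex
additions. Then the two N/2-point DFTs must be combined, requiring N complex
multiplications and N complex additions, so that Eq. [2.6] requires $N + 2(N/2)^2$
complex multiplications and additions in all.
If N/2 is even, each of the N/2-point DFTs in Eq. [2.6] can in turn be decomposed
into two N/4-point DFTs. With g(r) = x(2r) and h(r) = x(2r + 1), G(k) can be written as
$$G(k) = \sum_{r=0}^{(N/2)-1} g(r)\, W_{N/2}^{rk}$$
or
$$G(k) = \sum_{l=0}^{(N/4)-1} g(2l)\, W_{N/4}^{lk} + W_{N/2}^{k} \sum_{l=0}^{(N/4)-1} g(2l+1)\, W_{N/4}^{lk} \qquad [2.7]$$
Similarly,
$$H(k) = \sum_{l=0}^{(N/4)-1} h(2l)\, W_{N/4}^{lk} + W_{N/2}^{k} \sum_{l=0}^{(N/4)-1} h(2l+1)\, W_{N/4}^{lk} \qquad [2.8]$$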
Thus if the four-point DFTs in Fig. 2.1 are computed according to Eqs. [2.7]
and [2.8], then that computation would be carried out as indicated in Fig. 2.2.
Inserting the computation indicated in Fig. 2.2 into the flow graph of Fig. 2.1, we
obtain the complete flow graph of Fig. 2.3. Note that we have used the fact that
$W_{N/2} = W_N^{2}$.
For the eight-point DFT that we have been using as an illustration, the
computation has been reduced to a computation of two-point DFTs. The two-point
DFT of, for example, x(0) and x(4), is depicted in Fig. 2.4. With the computation of
Fig. 2.4 inserted in the flow graph of Fig. 2.3, we obtain the complete flow graph for
computation of the eight-point DFT, as shown in Fig. 2.5.
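[Fig. 2.2: flow graph of the decomposition of an N/2-point DFT into two N/4-point DFTs (figure not reproduced).]
[Fig. 2.3: result of substituting Fig. 2.2 into Fig. 2.1; inputs in the order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), outputs X(0) to X(7) (figure not reproduced).]
Fig. 2.4 Flow graph of a two-point DFT, where $W_N^{N/2} = -1$.
[Fig. 2.5: complete flow graph of the eight-point decimation-in-time FFT; inputs in the order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), outputs X(0) to X(7) (figure not reproduced).]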
For the more general case with $N = 2^{\nu}$ and $\nu$ greater than 3, we would
proceed by decomposing the N/4-point transforms in Eqs. [2.7] and [2.8] into N/8-
point transforms, and continue until left with only two-point transforms. This requires
$\nu$ stages of computation, where $\nu = \log_2 N$. Previously we found that in the original
decomposition of an N-point transform into two N/2-point transforms, the number of
complex multiplications and additions required was $N + 2(N/2)^2$. When the N/2-point
transforms are decomposed into N/4-point transforms, then the factor of $(N/2)^2$ is
replaced by $N/2 + 2(N/4)^2$, so the overall computation then requires $N + N + 4(N/4)^2$
complex multiplications and additions. If $N = 2^{\nu}$, this can be done at most $\nu = \log_2 N$
times, so that after carrying out this decomposition as many times as possible the
number of complex multiplications and additions is equal to $N \log_2 N$.
The flow graph of Fig. 2.5 displays the operations explicitly. By counting
branches with transmittances of the form $W_N^{r}$, we note that each stage has N
complex multiplications and N complex additions. Since there are $\log_2 N$ stages, we
have, as before, a total of $N \log_2 N$ complex multiplications and additions. This is the
substantial computational savings that we previously indicated was possible. We shall see
that the symmetry and periodicity of $W_N^{r}$ can be exploited to obtain further
reductions in computation.
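The decimation-in-time decomposition can be sketched recursively in Python as follows. This is a minimal illustration under the assumption that N is a power of 2; the function name `fft_dit` is chosen for this example. It already uses the symmetry $W_N^{k+N/2} = -W_N^{k}$, so each stage needs only N/2 complex multiplications rather than the N counted above.

```python
import cmath

def fft_dit(x):
    """Radix-2 decimation-in-time FFT of a sequence whose length is a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    G = fft_dit(x[0::2])   # N/2-point DFT of the even-numbered points
    H = fft_dit(x[1::2])   # N/2-point DFT of the odd-numbered points
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]   # W_N^k * H(k)
        X[k] = G[k] + t
        # G and H are periodic with period N/2, and W_N^(k + N/2) = -W_N^k
        X[k + N // 2] = G[k] - t
    return X

if __name__ == "__main__":
    print(fft_dit([1, 1, 1, 1]))   # constant input: energy only in X(0)
```

Each level of recursion corresponds to one stage of the flow graph, and there are $\log_2 N$ levels.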
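Dividing the input sequence into its first half and its last half, X(k) can be written as
$$X(k) = \sum_{n=0}^{(N/2)-1} x(n)\, W_N^{nk} + \sum_{n=N/2}^{N-1} x(n)\, W_N^{nk}$$
or
$$X(k) = \sum_{n=0}^{(N/2)-1} x(n)\, W_N^{nk} + W_N^{(N/2)k} \sum_{n=0}^{(N/2)-1} x\!\left(n + \frac{N}{2}\right) W_N^{nk} \qquad [2.9]$$
It is important to observe that while Eq. [2.9] contains two summations over N/2
points, each of these summations is not an N/2-point DFT since $W_N^{nk}$ rather than
$W_{N/2}^{nk}$ appears in each of the sums. Combining the two summations in Eq. [2.9] and
using the fact that $W_N^{(N/2)k} = (-1)^{k}$, we obtain
$$X(k) = \sum_{n=0}^{(N/2)-1} \left[ x(n) + (-1)^{k}\, x\!\left(n + \frac{N}{2}\right) \right] W_N^{nk} \qquad [2.10]$$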
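Let us now consider k even and k odd separately, with X(2r) and X(2r + 1)
representing the even-numbered points and the odd-numbered points, respectively, so
that
$$X(2r) = \sum_{n=0}^{(N/2)-1} \left[ x(n) + x\!\left(n + \frac{N}{2}\right) \right] W_N^{2rn} \qquad [2.11]$$
$$X(2r+1) = \sum_{n=0}^{(N/2)-1} \left[ x(n) - x\!\left(n + \frac{N}{2}\right) \right] W_N^{n}\, W_N^{2rn}, \qquad r = 0, 1, \ldots, (N/2 - 1) \qquad [2.12]$$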
Equations [2.11] and [2.12] can be recognized as N/2-point DFTs; in the case of Eq.
[2.11], of the sum of the first half and the last half of the input sequence, and in the
case of Eq. [2.12], of the product of $W_N^{n}$ with the difference of the first half and the
last half of the input sequence. As distinguished from Eq. [2.10] the two summations
in Eqs. [2.11] and [2.12] correspond to N/2-point DFTs because
$$W_N^{2rn} = W_{N/2}^{rn}$$
Thus on the basis of Eqs. [2.11] and [2.12] with g(n) = x(n) + x(n + N/2) and h(n) =
x(n) - x(n + N/2), the DFT can be computed by first forming the sequences g(n) and
h(n), then computing h(n) $W_N^{n}$, and finally computing the N/2-point DFTs of these
two sequences to obtain the even-numbered output points and odd-numbered output
points, respectively. The procedure suggested by Eqs. [2.11] and [2.12] is illustrated
for the case of an eight-point DFT in Fig. 2.6.
Proceeding in a manner similar to that followed in deriving the decimation-in-
time algorithm, we note that since N is a power of 2, N/2 is even, and consequently,
the N/2-point DFTs can be computed by computing the even-numbered and odd-
numbered output points for those DFTs separately. As in the case of the original
decomposition leading to Eqs. [2.11] and [2.12], this is accomplished by combining
the first half and the last half of the input points for each of the N/2-point DFTs and
then computing N/4-point DFTs. The flow chart resulting from taking this step for the
eight-point example is shown in Fig. 2.7. For the eight-point example, the
computation has now been reduced to the computation of two-point DFTs, which, as
was discussed previously, are implemented by adding and subtracting the input points.
Thus the two-point DFTs in Fig. 2.7 can be replaced by the computation shown in
Fig. 2.8, so the computation of the eight-point DFT becomes that shown in Fig. 2.9.
By counting the arithmetic operations in Fig. 2.9, and generalizing to $N = 2^{\nu}$,
we see that the computation of Fig. 2.9 requires $(N/2) \log_2 N$ complex multiplications
and $N \log_2 N$ complex additions. Thus the total computation is the same for the
decimation-in-frequency and the decimation-in-time algorithms.
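The decimation-in-frequency procedure of Eqs. [2.11] and [2.12] can be sketched as an in-place Python routine. This is an illustrative model that assumes N is a power of 2; the name `fft_dif_inplace` is chosen for this example. With normally ordered input, the result is left in bit-reversed order, so for N = 8 the array positions 0 to 7 hold X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7).

```python
import cmath

def fft_dif_inplace(x):
    """Radix-2 decimation-in-frequency FFT; output is in bit-reversed order."""
    a = [complex(v) for v in x]
    N = len(a)
    span = N // 2                      # distance between the two butterfly inputs
    while span >= 1:
        for start in range(0, N, 2 * span):
            for n in range(span):
                # Coefficient of the current sub-transform of length 2*span
                w = cmath.exp(-2j * cmath.pi * n / (2 * span))
                p, q = start + n, start + n + span
                g = a[p] + a[q]            # g(n) = x(n) + x(n + N/2)
                h = (a[p] - a[q]) * w      # h(n) * W^n
                a[p], a[q] = g, h
        span //= 2
    return a

if __name__ == "__main__":
    # For a 4-point input the positions hold X(0), X(2), X(1), X(3)
    print(fft_dif_inplace([1, 2, 3, 4]))
```

Each pass of the outer loop is one stage of N/2 butterflies, giving the $(N/2) \log_2 N$ complex multiplications stated above.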
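[Figs. 2.6 to 2.9: flow graphs of the decimation-in-frequency decomposition for N = 8. Fig. 2.6 shows the decomposition into two four-point DFTs, Fig. 2.7 the decomposition into two-point DFTs, Fig. 2.8 the two-point butterfly, and Fig. 2.9 the complete eight-point flow graph (figures not reproduced).]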
2.4 Prevention of overflow in fixed point arithmetic
CHAPTER-3
INTRODUCTION TO VHDL
As the size and complexity of digital systems increase, more computer-aided
design tools are introduced into the hardware design process. The early
paper-and-pencil design methods have given way to sophisticated design entry, verification
and automatic hardware generation tools. The newest addition to this design
methodology is the introduction of Hardware Description Languages (HDL). Based
on HDLs, new digital system CAD (Computer Aided Design) tools have been
developed and are now being utilized by the hardware designers. Hardware
description languages are used to describe hardware for the purpose of simulation,
modeling, testing, design, and documentation of digital systems. These languages
provide a convenient and compact format for the hierarchical representation of
functional and wiring details of digital systems. Some hardware description
languages consist of a simple set of symbols and notations which replace schematic
diagrams of digital circuits, while others are more formally defined and may present
the hardware at one or more levels of abstraction. Available software for HDLs
includes simulators and hardware synthesis programs. For the design of large digital
systems, much engineering time is spent in changing formats for using various design
aids and simulators. An integrated design environment is useful for better design
efficiency in these systems. In an ideal design environment, the high level description
of the system is understandable to the managers and to the designers, and it uniquely
and unambiguously defines the hardware. This high level description can serve as the
documentation for the part as well as an entry point into the design process. As the
design process advances, additional details are added to the initial description of the
part. These details enable the simulation and testing of the system at various levels of
abstraction. By the last stage of design, the initial description has evolved into a
detailed description, which can be used by a program controlled machine for
generation of final hardware in the form of layout, printed circuit board, or gate
arrays. This ideal design process exists only if a language exists to describe hardware
at various levels so that it can be understood by the managers, users, designers,
testers, simulators and machines. The IEEE standard VHDL hardware description
language is such a language. VHDL stands for VHSIC hardware description language,
where VHSIC stands for very high speed integrated circuit. In 1980 the US government
launched the VHSIC project to enhance the electronic design process, technology and
procurement, spawning development of many advanced integrated circuit process technologies.
VHDL was defined because a need existed for an integrated design and
documentation language to communicate design data between various levels of
abstraction. At the time, none of the existing hardware description languages fully
satisfied these requirements, and the lack of precision in English made it too
ambiguous for this purpose. Introducing VHDL and synthesis enables the design
community to explore a new design methodology. The traditional design approach, as
shown in Fig. 3.1, starts with drawing schematics and then performs functional and
timing simulation based on the same schematic. If there is any design error, the
process iterates back to update the schematics. After the layout, functions and
back-annotated timing are verified again with the same schematics.
[Fig. 3.1: traditional design approach. Schematic, then function and timing checking, then layout (figure not reproduced).]
The VHDL based design approach is illustrated in Fig. 3.2. The design is
functionally described with VHDL. VHDL simulation is used to verify the
functionality of the design. In general, modifying VHDL source code is much faster
than changing schematics. This allows designers to make faster functionally correct
designs, to explore more architecture trade-offs, and to have more impact on the
designs. After the function matches the requirements, the VHDL code is synthesized to
generate schematics (or equivalent netlists). The netlist can be used to lay out the
circuit and to verify the timing requirements (both before and after the layout). The
design changes can be made by modifying the VHDL code or changing the
constraints (timing, area and so on) in the synthesis. This new design approach and
methodology has improved the design process by shortening the design time, reducing
the number of design iterations, and increasing the design complexity that designers
can manage.
[Fig. 3.2: VHDL based design approach. VHDL description, functional simulation, synthesis, timing verification, layout (figure not reproduced).]
(IV) Portability: The same VHDL code can be simulated and used in
many design tools and at different stages of the design process. This
reduces dependency on a set of design tools whose limited capability
may not be competitive in later markets. The VHDL standard also
makes design data much easier to transfer than the design database of a
proprietary design tool.
(V) Modeling capability: VHDL was developed to model all levels of
design, from electronic boxes to transistors. VHDL can accommodate
behavioral constructs and mathematical routines that describe complex
models, such as queuing networks and analog circuits. It allows
multiple architectures to be associated with the same design during
various stages of the design process. As shown in Fig. 3.3, VHDL can
describe low-level transistors up to very large systems.
efficiency at higher levels of design abstraction is much greater than at the gate level.
The VHDL design process flowchart is shown in Fig. 3.4.
[Fig. 3.4: VHDL design process flowchart. Boxes: design idea; behavioral description in VHDL; simulation; design not feasible; RTL description in VHDL; modify RTL description (figure not reproduced).]
CHAPTER-4
DESIGN OF FFT PROCESSOR
The architecture of the FFT processor aims to achieve a high degree
of parallelism and thus high speed. In most general-purpose computers, a single
hardware multiplier is available. In the pipeline FFT there can be as many as ten
separate "butterfly boxes" (for 1024 point radix 2-FFT), which correspond to 40 real
multipliers (since each butterfly contains a complex multiplier that contains 4 real
multipliers). The pipeline FFT structure is 2 to 20 times more efficient than any
general-purpose computer structures. Because of its high efficiency and also because
of a relatively simple control mechanism, the pipeline FFT appears at present to be
the most important special FFT processor for very high-speed applications.
The flow diagram for the in-place 8-point FFT with normally ordered inputs and
bit-reversed outputs is shown in Fig. 4.1.
[Fig. 4.1: flow diagram of the in-place 8-point FFT; inputs x(0) to x(7) in normal order, outputs in the order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7) (figure not reproduced).]
Let us assume that the signal samples appear at the input sequentially, x(0),
x(1), etc. Fig. 4.2 shows a very simple arrangement for performing the first stage
of an FFT corresponding to the flow diagram of Fig. 4.1. The first four samples x(0)
through x(3) are switched into the four-stage delay element. The next four samples are
switched to the other input line of the system. Assuming that the butterfly
computation time is exactly equal to the sampling interval, the entire first stage of the
FFT is performed in the subsequent four-sample intervals following the switching.
The results of the first stage, labeled as x'(n), appear in parallel pairs at the butterfly
output.
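[Fig. 4.2: first stage of the pipeline FFT. Samples n = 0, 1, 2, 3 pass through the four-stage delay element, samples n = 4, 5, 6, 7 go to the other input of the 1st butterfly, which is fed by the coefficient memory (figure not reproduced).]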
Since the coefficient $W_N^{n}$ changes from sample to sample, the coefficient
memory must deliver its contents to the butterfly at the same rate (the
sampling rate) as the signal. It is evident from Fig. 4.1 that the structural form of
stage 1 is repeated twice in stage 2. Thus, an arrangement has to be devised that will
process x'(n) {n = 0, 1, ..., 3} and x'(n) {n = 4, 5, ..., 7} in a manner similar to the way
x(n) {n = 0, 1, ..., 7} was processed. This arrangement is shown in Fig. 4.3. By using
appropriate delays and switches, the partly processed samples are lined up in exactly
the way specified by Fig. 4.1. Thus, the "spacing" (difference between the samples in
time) is four time units for the first butterfly and two time units for the second. A
complete 8-point pipeline FFT is shown in Fig. 4.4. The symmetry in the structure can
be exploited to construct pipeline FFTs with larger N through extrapolation.
Fig. 4.3 First and second stage of 8-point pipeline FFT, radix 2, DIF.
[Fig. 4.4: complete 8-point pipeline FFT with 4-stage, 2-stage and 1-stage delay elements, switches and three butterflies, annotated with the sample indices present on each line (figure not reproduced).]
The following points are important with reference to Fig. 4.4.
1. The delay in a given stage is half that of the delay elements in the
preceding stage.
2. The arithmetic elements are busy only half of the time.
3. Each switch switches at double the rate of its predecessor.
4. The basic clocking interval of the whole system is naturally equal to the
sampling interval.
5. The output is bit reversed as a function of real time.
To prove statement 5, note that the indices in Fig. 4.4 are in
exact correspondence with the indices in Fig. 4.1. Since in Fig. 4.1 the resultant
output is bit reversed, so is the output of Fig. 4.4. Fig. 4.4 is a specific implementation
of Fig. 4.1, and thus possesses all its properties in addition to timing properties. The
pipeline FFT structure has a two-port output so that two frequency samples at a time
are available. The important point is that the indices shown on the last two lines of
Fig. 4.4 are in actuality the bit-reversed indices of the output frequency samples.
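The bit-reversed ordering referred to in statement 5 can be made concrete with a short Python sketch. The helper names `bit_reverse` and `bit_reversed_order` are illustrative and are not modules of the design.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the non-negative integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # move the current LSB of i into r
        i >>= 1
    return r

def bit_reversed_order(N):
    """Index of the frequency sample found at each output position, N = 2**bits."""
    bits = N.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(N)]

if __name__ == "__main__":
    print(bit_reversed_order(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

For N = 8 this reproduces the output order of Fig. 4.1: X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7).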
With regard to statement 2, this is a rather tricky point: the on-time of the
butterfly really depends on how the input is interfaced with the processor. For
example, in Fig. 4.5 it is required that contiguous data blocks be processed in real
time.
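[Fig. 4.5: on/off timing of the butterflies for a real-time input of contiguous data blocks of length N. Each of the 1st to 4th butterflies alternates between ON and OFF, with the offsets N/2, 3N/4, 7N/8 and 15N/16 marked for the 1st, 2nd, 3rd and 4th butterfly respectively (figure not reproduced).]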
It is clear from Figs. 4.2 through 4.4 that processing cannot begin until half of the
data block has entered the processor. The first stage is completed in the next (N/2)
cycles. At this moment, the first butterfly is turned off until the initial (N/2) values of
the next data block have been gathered into the 4-stage delay element. The other
butterflies follow the same pattern with a delay. Therefore, the overall system efficiency is
50%, since every butterfly is on exactly half the time. The timing diagram for the 8-point
FFT is shown in Fig. 4.6.
[Fig. 4.6: timing diagram for the 8-point FFT. Signals: CLK, RESET, sampling pulse, samples, COUNT, 1st, 2nd and 3rd butterfly operation, 1st switch operation (SW1 straight connected or criss-cross connected) and 2nd switch operation (SW2 straight connected or criss-cross connected) (figure not reproduced).]
switches (Sw4_1, Sw4_2), three butterfly blocks (Bfl), one counter (Cn), and one
weight factor generator block (Wfg), all connected in a particular fashion.
The radix-2 butterfly (Bfl) consists of a divide-by-two and adder/subtracter module
(Adsb16), four two's complement multiplier modules (Tm16), and a correction and
rounding-off module (CR16). The hierarchical order of modules is shown in Fig. 4.8.
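[Fig. 4.8: hierarchical order of modules. The top-level module (labeled "Plff" in the scan) contains Sh4_1_16, Sh2_2_16, Sh1_2_16, Sw4_1, Sw4_2, Cn, Wfg and Bfl (figure not reproduced).]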
The signal flow graph of butterfly is shown in Fig. 4.9. As shown in Fig. 4.9
the butterfly circuit requires two adders, two subtracters and one complex multiplier.
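[Fig. 4.9: signal flow graph of the butterfly. Inputs $A = A_R + jA_I$, $B = B_R + jB_I$ and weight factor $W = W_R + jW_I$; outputs $(A_R + B_R) + j(A_I + B_I)$ and $\{(A_R - B_R) + j(A_I - B_I)\} \times (W_R + jW_I)$ (figure not reproduced).]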
The expression $\{(A_R - B_R) + j(A_I - B_I)\} \times (W_R + jW_I)$ can be expanded as follows.
$$\{(A_R - B_R) + j(A_I - B_I)\} \times (W_R + jW_I) = \big[ (A_R - B_R)\,W_R - (A_I - B_I)\,W_I \big] + j \big[ (A_R - B_R)\,W_I + (A_I - B_I)\,W_R \big] \qquad [4.1]$$
The expression shows that four real multipliers are required for complex number
multiplication. The block diagram of butterfly is shown in Fig. 4.10. It has three
blocks namely ADSB16, TM16, CR16. Here overflow is prevented by having
|x(n)| < 1, i.e. the input sequence is normalized to a fraction, and by incorporating an
attenuation of 1/2 at the input of each stage (right shifting the input at every stage).
The prevention of overflow, and addition and subtraction are performed in block
ADSB16. TM16 is a two's complement multiplier, which performs multiplication
with weight factors. CR16 is correction and rounding-off circuit.
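The butterfly arithmetic described above can be modelled in Python on 16-bit fractional integers. This is a behavioural sketch only: the Q15 scaling (value = integer / 2^15), the function name `butterfly_q15`, and the round-to-nearest step are assumptions made for illustration, since the exact behaviour of CR16 is not given in this section.

```python
def butterfly_q15(ar, ai, br, bi, wr, wi):
    """Radix-2 DIF butterfly on Q15 integers (value = integer / 2**15).

    Returns (sum_re, sum_im, prod_re, prod_im).
    """
    # Attenuation of 1/2 at the stage input: arithmetic right shift by one bit,
    # so that the sum and the difference stay inside the 16-bit range
    ar, ai, br, bi = ar >> 1, ai >> 1, br >> 1, bi >> 1
    sr, si = ar + br, ai + bi        # upper output: A + B
    dr, di = ar - br, ai - bi        # difference A - B, fed to the multipliers
    # Eq. [4.1]: four real multiplications, one subtraction and one addition
    pr = dr * wr - di * wi
    pi = dr * wi + di * wr
    # Assumed rounding: add half an LSB, then drop the 15 extra fraction bits
    half = 1 << 14
    return sr, si, (pr + half) >> 15, (pi + half) >> 15

if __name__ == "__main__":
    # A = 0.5, B = -0.5, W = -j: (A - B)/2 * W = -0.5j, i.e. -16384 in Q15
    print(butterfly_q15(16384, 0, -16384, 0, 0, -32768))
```

The four products in the sketch correspond to the four TM16 instances in the butterfly.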
[Fig. 4.10: block diagram of the butterfly, built from one ADSB16 block, four TM16 multipliers and the CR16 block (figure not reproduced).]
(a) Add-Subtract module (ADSB16)
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;

Entity ADSB16 is
  Generic (SIZE : INTEGER := 16);
  Port (AR, AI, BR, BI : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
        ARL, AIL, BRL, BIL : OUT STD_ULOGIC;
        ABSR, ABSI, ABDR, ABDI
            : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End ADSB16;

Architecture BEHAVE of ADSB16 is
  Signal ARI, AII, BRI, BII : STD_ULOGIC_VECTOR(SIZE - 1 downto 0);

  -- TC returns the two's complement of QQ (invert all bits, then add one)
  FUNCTION TC(QQ : STD_ULOGIC_VECTOR) return STD_ULOGIC_VECTOR is
    Constant INC : STD_ULOGIC_VECTOR(SIZE - 1 downto 0)
        := INT_TO_STD_ULOGIC_VECTOR(1, SIZE);
  Begin
    return (NOT(QQ) + INC);
  End TC;
Begin
  -- Divide each input by two: shift right by one bit with sign extension.
  -- The bit shifted out is made available on a separate output.
  ARI(SIZE - 2 downto 0) <= AR(SIZE - 1 downto 1);
  ARI(SIZE - 1) <= AR(SIZE - 1);
  ARL <= AR(0);
  AII(SIZE - 2 downto 0) <= AI(SIZE - 1 downto 1);
  AII(SIZE - 1) <= AI(SIZE - 1);
  AIL <= AI(0);
  BRI(SIZE - 2 downto 0) <= BR(SIZE - 1 downto 1);
  BRI(SIZE - 1) <= BR(SIZE - 1);
  BRL <= BR(0);
  BII(SIZE - 2 downto 0) <= BI(SIZE - 1 downto 1);
  BII(SIZE - 1) <= BI(SIZE - 1);
  BIL <= BI(0);
  -- Sum and difference of the halved real and imaginary parts
  ABSR <= ARI + BRI;
  ABSI <= AII + BII;
  ABDR <= ARI + TC(BRI);
  ABDI <= AII + TC(BII);
End BEHAVE;
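A bit-level Python model can help when checking the ADSB16 listing against expected values. The sketch below is a hedged illustration: it assumes that the "+" operator supplied by the vendor package wraps around at 16 bits, and the names `asr1`, `twos_complement`, `adsb16` and `to_signed` are chosen for this example only.

```python
MASK = 0xFFFF   # 16-bit words, matching SIZE = 16 in the listing

def asr1(word):
    """Arithmetic right shift of a 16-bit word by one bit.

    Returns (shifted word, dropped LSB), like ARI and ARL in the listing.
    """
    sign = word & 0x8000
    return ((word >> 1) | sign) & MASK, word & 1

def twos_complement(word):
    """Invert all bits and add one, as the TC function does."""
    return ((~word) + 1) & MASK

def adsb16(ar, ai, br, bi):
    """Model of ADSB16: halve the inputs, then form sum and difference."""
    (ari, arl), (aii, ail), (bri, brl), (bii, bil) = map(asr1, (ar, ai, br, bi))
    absr = (ari + bri) & MASK                    # assumed 16-bit wraparound
    absi = (aii + bii) & MASK
    abdr = (ari + twos_complement(bri)) & MASK
    abdi = (aii + twos_complement(bii)) & MASK
    return absr, absi, abdr, abdi, (arl, ail, brl, bil)

def to_signed(word):
    """Interpret a 16-bit word as a signed two's complement integer."""
    return word - 0x10000 if word & 0x8000 else word

if __name__ == "__main__":
    # AR = 0x4000 (+16384), BR = 0xC000 (-16384)
    out = adsb16(0x4000, 0, 0xC000, 0)
    print([to_signed(v) for v in out[:4]], out[4])
```

With these inputs the halved values are +8192 and -8192, so the sum output is 0 and the difference output is 16384.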
This module multiplies two numbers in two's complement format. The multiplier
used is an array multiplier and thus it is a very fast multiplier. Four real multipliers are
required for complex number multiplication.
[Fig. 4.12: block diagram of the two's complement array multiplier for a 4-bit example, built from cells B and BI, with multiplier bits INX(0) to INX(3), multiplicand INY and product bits PROD(0) to PROD(6) (figure not reproduced).]
It has two cells namely 'B' and 'BI' as shown in Fig. 4.13. The functions of
cells B and BI are defined in the truth table in Fig. 4.14.
Cell B
It has two inputs namely 'a' and 'b', two control inputs namely 'x_i' and 'x_{i-1}'
and two outputs namely 'z' and 'bo'.
Fig. 4.13 Block diagram of cells B and BI.
Cell BI
It has two inputs namely 'a' and 'b', two control inputs namely 'x_i' and 'x_{i-1}'
and one output namely 'z'. The truth table for BI is the same as for B except that
there is no 'bo' output.
The block diagram for two's complement multiplier as shown in Fig. 4.12 can
be simplified and is shown in Fig. 4.15.
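[Fig. 4.15: simplified block diagram of the two's complement multiplier, with inputs INY(3:0) and INX(0) to INX(3) and product output PROD (figure not reproduced).]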
It has three cells namely 'BLEX', 'BLI' and 'BLII' for which symbols and
their truth table are shown in Fig. 4.16 and Fig. 4.17 respectively.
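[Fig. 4.16: symbols of cells BLEX, BLI and BLII, with inputs A(S : 0), B(S : 0) and the control pair x_i, x_{i-1}, and outputs Z(S - 1 : 0) and zl (figure not reproduced).]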
TRUTH TABLE FOR CELL BLEX
P(SZx2 : SZ + 1) = K(SZ) & K(SZ) & ... & K(SZ), (SZx2 - SZ) times
P(SZ : 0) = K(SZ : 0)

TRUTH TABLE FOR CELL BLI
x_i  x_{i-1}  z                          zl                     BO
0    0        A(S : 1)                   A(0)                   B(S - 1 : 0)
0    1        (A + B)(S : 1)             (A + B)(0)             B(S - 1 : 0)
1    0        (A + NOT(B) + 1)(S : 1)    (A + NOT(B) + 1)(0)    B(S - 1 : 0)
1    1        A(S : 1)                   A(0)                   B(S - 1 : 0)

TRUTH TABLE FOR CELL BLII
x_i  x_{i-1}  z                          zl
0    0        A(S : 1)                   A(0)
0    1        (A + B)(S : 1)             (A + B)(0)
1    0        (A + NOT(B) + 1)(S : 1)    (A + NOT(B) + 1)(0)
1    1        A(S : 1)                   A(0)

Fig. 4.17 Truth tables for cells BLEX, BLI and BLII
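The row operation in these truth tables (pass A, add B, or subtract B depending on the bit pair x_i, x_{i-1}, then shift one bit out as zl) matches radix-2 Booth recoding. The Python sketch below works under that reading; the function name `booth_multiply` and the use of unbounded Python integers instead of fixed-width vectors are assumptions for illustration, and the cell wiring of the actual design may differ.

```python
def booth_multiply(x, b, bits):
    """Multiply signed integers x and b, with x given as a `bits`-bit
    two's complement multiplier, one row per multiplier bit."""
    x_bits = x & ((1 << bits) - 1)   # two's complement bit pattern of x
    acc = 0                          # partial sum, "A" in the truth table
    low = 0                          # product bits already shifted out ("zl")
    prev = 0                         # x_{i-1}; zero for the first row
    for i in range(bits):
        cur = (x_bits >> i) & 1      # x_i
        if (cur, prev) == (0, 1):
            acc += b                 # row 01: A + B
        elif (cur, prev) == (1, 0):
            acc -= b                 # row 10: A + NOT(B) + 1, i.e. A - B
        # rows 00 and 11 pass A through unchanged
        low |= (acc & 1) << i        # zl: least significant bit of the row result
        acc >>= 1                    # z: remaining bits (arithmetic shift right)
        prev = cur
    return (acc << bits) | low

if __name__ == "__main__":
    print(booth_multiply(3, 5, 4))     # 15
    print(booth_multiply(-3, 5, 4))    # -15
    print(booth_multiply(-8, -8, 4))   # 64
```

Because runs of equal multiplier bits need no addition, only the 01 and 10 rows change the partial sum.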
The two's complement multiplier shown in Fig. 4.15 can be further simplified and
is shown in Fig. 4.19. It has four cells namely 'BLKO_I', 'BLKO_II', 'BLKI' and
'BLKII' for which symbols and their truth tables are shown in Fig. 4.18 and Fig. 4.20
respectively.
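[Fig. 4.18: symbols of cells BLKO_I (input p(SZ : 0), output j(SZx2 : 0)), BLKO_II (input k(S : 0), outputs n(S - 1 : 0) and m), BLKI (inputs a(S : 0), b(S : 0), btc(S : 0) and xi_xi_1, outputs bo(S - 1 : 0), botc(S - 1 : 0), zh(S - 1 : 0) and zl) and BLKII (same inputs, outputs zh(S - 1 : 0) and zl) (figure not reproduced).]
[Fig. 4.19: further simplified two's complement multiplier, with multiplicand input INY and multiplier bit groups INX(0), INX(1:0), INX(2:1) applied to successive rows (figure not reproduced).]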
TRUTH TABLE FOR CELL BLKO_I
j(SZx2 : SZ + 1) = p(SZ) & p(SZ) & ... & p(SZ), (SZx2 - SZ) times
j(SZ : 0) = p(SZ : 0)

TRUTH TABLE FOR CELL BLKO_II
l    n                  m
0    000...0 (S 0's)    0
1    k(S : 1)           k(0)
The truth table for cell BLKII is not shown in Fig. 4.20, but it is the same as that of
cell BLKI except that it has only two outputs, namely 'zh' and 'zl'. The VHDL code can
now be written and is shown below.
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.std_logic_1164.all;

entity Tm16 is
  Generic (SIZE : INTEGER := 16);
  Port (INX, INY : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
        PROD : OUT STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0));
end Tm16;

architecture behave of Tm16 is
signal ai, bi, btci :
STD_ULOGIC_VECTOR((SIZE - 1) * ((SIZE * 2 - 2) - (SIZE / 2)
downto 0) ;
procedure BLKO (signal X : IN STD_UL0G1C;
signal Y : IN STD_UL0G1C_VECT0R;
signal zl : OUT STD ULOGIC;
35
signal zh : OUT STD_ULOGIC_VECTOR;
signal bo, bote : OUT STD_ULOGIC_VECTOR) is
Constant INC: STD_ULOGIC_VECTOR(Y'High * 2 - Y'Low downto Y'Low)
:=INT_TO_STD_ULOGIC_VECTOR(l, Y'Length * 2 - 1 ) ;
Constant G : STD_ULOGIC_VECTOR(Y'High * 2 - Y"Low downto Y'Low)
:=INT_TO_STD_ULOGIC_VECTOR{0, Y'Length * 2 - 1 ) ;
Variable YX, YXTC : STD_ULOGIC_VECTOR(Y'High * 2 - Y'Low downto
Y'Low) ;
Variable zi : STD_ULOGIC_VECTOR(Y'High * 2 - Y'Low downto Y'Low);
begin
YX(Y'High downto Y'Low) := Y;
LOl : For i in 1 to Y'Length - 1 loop
YX(Y'High + i) := Y(Y'High);
YXTC := ((NOT YX) + INC);
end loop LOl;
Case X is
WHEN '0' => zi := G;
WHEN others => zi := YXTC;
End Case;
zl <= zi(Y'Low);
zh <= zi(Y'High * 2 - Y'Low downto Y'Low + 1 ) ;
bo <= YX(Y'High * 2 - Y'Low - 1 downto Y'Low);
bote <= YXTC(Y'High * 2 - Y'Low - 1 downto Y'Low);
return;
End BLKO;
-- BLKI: intermediate row. Adds 0, b or btc to a according to xi_xi_l.
procedure BLKI (signal a, b, btc : IN STD_ULOGIC_VECTOR;
signal xi_xi_l : IN STD_ULOGIC_VECTOR(1 downto 0);
signal zl : OUT STD_ULOGIC;
signal zh : OUT STD_ULOGIC_VECTOR;
signal bo, bote : OUT STD_ULOGIC_VECTOR) is
Constant G : STD_ULOGIC_VECTOR(b'High downto b'Low)
:= INT_TO_STD_ULOGIC_VECTOR(0, b'Length);
Variable zix, zi : STD_ULOGIC_VECTOR(b'High downto b'Low);
begin
Case xi_xi_l is
WHEN "01" => zix := b;
WHEN "10" => zix := btc;
WHEN Others => zix := G;
End Case;
zi := a + zix;
zl <= zi(b'Low);
zh <= zi(b'High downto b'Low + 1);
bo <= b(b'High - 1 downto b'Low);
bote <= btc(btc'High - 1 downto btc'Low);
return;
End BLKI;
procedure BLKII (signal a, b, btc : IN STD_ULOGIC_VECTOR;
signal xi_xi_l : IN STD_ULOGIC_VECTOR(1 downto 0 ) ;
signal zl : OUT STD_ULOGIC;
signal zh : OUT STD_ULOGIC_VECTOR) is
Constant G : STD_ULOGIC_VECTOR(b'High downto b'Low)
:= INT_TO_STD_ULOGIC_VECTOR(0, b'Length);
Variable zix, zi : STD_ULOGIC_VECTOR(b'High downto b'Low);
begin
Case xi_xi_l is
WHEN "01" => zix := b;
WHEN "10" => zix := btc;
WHEN Others => zix := G;
End Case;
zi := a + zix;
zl <= zi(b'Low);
zh <= zi(b'High downto b'Low + 1 ) ;
return;
End BLKII;
begin
GO : For i in 0 to SIZE - 1 generate
GI : if (i = 0) generate
BLKO(X => INX(i), Y => INY, zl => PROD(i),
zh => ai((i + 1) * (SIZE * 2 - 2) - (i * (i + 1) / 2) - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2),
bo => bi((i + 1) * (SIZE * 2 - 2) - (i * (i + 1) / 2) - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2),
bote => btci((i + 1) * (SIZE * 2 - 2) - (i * (i + 1) / 2) - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2));
end generate;
GII : if (i > 0) AND (i < SIZE - 1) generate
BLKI(a => ai(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
b => bi(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
btc => btci(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
xi_xi_l(0) => INX(i - 1), xi_xi_l(1) => INX(i), zl => PROD(i),
zh => ai((i + 1) * (SIZE * 2 - 2) - i * (i + 1) / 2 - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2),
bo => bi((i + 1) * (SIZE * 2 - 2) - i * (i + 1) / 2 - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2),
bote => btci((i + 1) * (SIZE * 2 - 2) - i * (i + 1) / 2 - 1
downto i * (SIZE * 2 - 2) - i * (i - 1) / 2));
end generate;
GIII : if (i = SIZE - 1) generate
BLKII(a => ai(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
b => bi(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
btc => btci(i * (SIZE * 2 - 2) - i * (i - 1) / 2 - 1
downto (i - 1) * (SIZE * 2 - 2) - (i - 1) * (i - 2) / 2),
xi_xi_l(0) => INX(i - 1), xi_xi_l(1) => INX(i),
zl => PROD(i), zh => PROD(SIZE * 2 - 2 downto SIZE));
end generate;
end generate;
end behave;
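The cell selection used in the listing above (add b for "01", add its two's
complement for "10", add zero otherwise) has the form of radix-2 Booth recoding.
As a cross-check, the following is a minimal software sketch of that recoding in
Python. It is an illustrative reference model only, not the author's code; the
function name booth_multiply and its parameters are assumptions introduced here.

```python
def booth_multiply(x, y, size=16):
    """Multiply two signed `size`-bit integers using radix-2 Booth recoding.

    For every multiplier bit pair (x_i, x_i-1) the running sum gains +y for
    "01", -y for "10" and nothing for "00" or "11", weighted by 2**i.
    """
    x_bits = x & ((1 << size) - 1)    # two's complement bit pattern of x
    acc = 0
    prev = 0                          # the bit to the right of x_0 is 0
    for i in range(size):
        cur = (x_bits >> i) & 1
        if (cur, prev) == (0, 1):
            acc += y << i             # pair "01": add the multiplicand
        elif (cur, prev) == (1, 0):
            acc -= y << i             # pair "10": add its two's complement
        prev = cur
    return acc


if __name__ == "__main__":
    print(booth_multiply(-7, 12))
```

The model works on Python integers rather than bit vectors, so it checks only
the arithmetic of the recoding, not the slicing of the ai, bi and btci buses.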
overcome by choosing a number out of the 2^16 possible combinations which is not
used as a weight factor. The weight factors for a 1024-point FFT are calculated,
and a number other than these (1024/2) - 1 = 511 weight factors is chosen to
represent the unity weight factor, say 012CH. A number multiplied by 1.0 gives
the number itself, but when a number is multiplied by 012CH the output of the
multiplier is not correct, so it has to be corrected. This correction is done in
the correction module.
The output of the multiplier is 31 bits wide and has to be rounded to 16 bits.
Suppose a 5-bit number is to be rounded off to 3 bits: the two LSBs are compared
with 10 (binary). If the two LSBs are greater than or equal to 10, then 001
(binary) is added to the three MSBs; otherwise 000 (binary) is added. The result
of the addition is the number rounded off to 3 bits. The correction and
rounding-off circuit is shown in Fig. 4.21. Truth tables for MUX1 and MUX2 are
shown in Fig. 4.22. The VHDL code is shown below.
Fig. 4.21 Correction and rounding-off circuit.
INPUT OUTPUT
L K
0 I
Others J
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity CR16 is
Generic (SIZE : INTEGER := 16);
Port (RSLCOR, ISLCOR, A, B : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
D, E, F, G : IN STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0 ) ;
H, I : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End CR16;
Architecture BEHAVE of CR16 is
Signal COI, COII, COIII, COIV, ICOIV, IICOIII
: STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0 ) ;
Signal AZI, BZI : STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
Begin
PO : Process (RSLCOR, ISLCOR, A, B, D, E, F, G )
Constant SIZE : INTEGER := 16;
Variable RSL, ISL : STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
Begin
RSL := RSLCOR; ISL := ISLCOR;
Case RSL is
WHEN "0000000100101100" => COI(SIZE * 2 - 2 downto SIZE - 1)
<= A(SIZE - 1 downto 0 ) ;
COI(SIZE - 2 downto 0) <= "000000000000000";
COIII(SIZE * 2 - 2 downto SIZE - 1)
<= B(SIZE - 1 downto 0 ) ;
COIII(SIZE - 2 downto 0) <= "000000000000000";
WHEN Others => COI <= D; COIII <= F;
End Case;
Case ISL is
WHEN "0000000100101100" => COII(SIZE * 2 - 2 downto SIZE - 1)
<= A(SIZE - 1 downto 0 ) ;
COII(SIZE - 2 downto 0) <= "000000000000000";
COIV(SIZE * 2 - 2 downto SIZE - 1)
<= B(SIZE - 1 downto 0 ) ;
COIV(SIZE - 2 downto 0) <= "000000000000000";
WHEN Others => COII <= E; COIV <= G;
End Case;
End Process PO;
ICOIV <= (COI + COIV);
IICOIII <= (COII + COIII);
P2 : Process(ICOIV, AZI)
Constant SIZE : INTEGER := 16;
Begin
AZI <= ('0' & ICOIV(SIZE * 2 - 2 - 16 downto 0))
+ "0100000000000000";
Case AZI(SIZE * 2 - 2 - 15) is
WHEN '0' => H <= ICOIV(SIZE * 2 - 2 downto SIZE - 1);
WHEN others => H <= ICOIV(SIZE * 2 - 2 downto SIZE - 1)
+ "0000000000000001";
End Case;
End Process P2;
P3 : Process(IICOIII, BZI)
Constant SIZE : INTEGER := 16;
Begin
BZI <= ('0' & IICOIII(SIZE * 2 - 2 - 16 downto 0))
+ "0100000000000000";
Case BZI(SIZE * 2 - 2 - 15) is
WHEN '0' => I <= IICOIII(SIZE * 2 - 2 downto SIZE - 1 ) ;
WHEN others => I <= IICOIII(SIZE * 2 - 2 downto SIZE - 1)
+ "0000000000000001";
End Case;
End Process P3;
End BEHAVE;
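The correction and rounding rules described for this module can be sketched in
Python as follows. This is a hedged illustration on unsigned bit patterns, not a
translation of CR16: the names round_off, corrected_product and UNITY_MARKER are
assumptions introduced here, and wrapping the result to the output width on
overflow is an assumption that mirrors a fixed-width adder.

```python
UNITY_MARKER = 0x012C  # reserved code standing for a weight factor of 1.0


def corrected_product(operand, weight, raw_product, size=16):
    """Return operand * 1.0 (operand above size-1 zero bits) when the weight is
    the unity marker; otherwise pass the multiplier output through."""
    if weight == UNITY_MARKER:
        return (operand & ((1 << size) - 1)) << (size - 1)
    return raw_product


def round_off(value, in_bits, out_bits):
    """Round an in_bits-wide pattern to its out_bits most significant bits.

    The discarded LSBs are compared with binary 10...0; if they are greater
    than or equal to it, 1 is added to the kept MSBs.
    """
    drop = in_bits - out_bits
    kept = value >> drop
    discarded = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    increment = 1 if discarded >= half else 0
    return (kept + increment) & ((1 << out_bits) - 1)  # wrap like a fixed-width add


if __name__ == "__main__":
    # The 5-bit to 3-bit example from the text: 10110 -> 101 + 1 = 110
    print(format(round_off(0b10110, 5, 3), "03b"))
```

For the 31-bit to 16-bit case the comparison value is 2^14, which corresponds to
the constant "0100000000000000" added in processes P2 and P3.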
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity BFL is
Generic (SIZE : INTEGER := 16);
Port (INAR, INAI, INBR, INBI, INWR, INWI
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
ARLZ, AILZ, BRLZ, BILZ : OUT STD_ULOGIC;
OSR, OSI, ODWR, ODWI
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End BFL;
Architecture BEHAVE of BFL is
Component ADSB16 Port (AR, AI, BR, BI
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
ARL, AIL, BRL, BIL : OUT STD_ULOGIC;
ABSR, ABSI, ABDR, ABDI
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component TM16 Port (INX, INY
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
PROD
: OUT STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0));
End Component;
Component CR16 Port (RSLCOR, ISLCOR, A, B
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
D, E, F, G
: IN STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0);
H, I : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
For all : ADSB16 USE ENTITY WORK.ADSB16;
For all : TM16 USE ENTITY WORK.TM16;
For all : CR16 USE ENTITY WORK.CR16;
Signal ABDRX, ABDIX : STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
Signal DX, EX, FX, GX : STD_ULOGIC_VECTOR(SIZE * 2 - 2 downto 0);
Begin
-- CP0: add/subtract stage; CP1 to CP4: four real multipliers;
-- CP5: unity-weight correction and rounding
CP0 : ADSB16 Port Map (INAR, INAI, INBR, INBI, ARLZ, AILZ,
BRLZ, BILZ, OSR, OSI, ABDRX, ABDIX);
CP1 : TM16 Port Map (ABDRX, INWR, DX);
CP2 : TM16 Port Map (ABDRX, INWI, EX);
CP3 : TM16 Port Map (ABDIX, INWR, FX);
CP4 : TM16 Port Map (ABDIX, INWI, GX);
CP5 : CR16 Port Map (INWR, INWI, ABDRX, ABDIX, DX, EX, FX,
GX, ODWR, ODWI);
End BEHAVE;
As mentioned above, there are three shift register modules, namely Sh4_1_16,
Sh2_2_16 and Sh1_2_16. All shift registers are positive-edge triggered.
This module is a collection of sixteen 4-bit shift registers. The inputs of the
shift registers are merged to form a 16-bit input bus, namely 'A', and the
outputs of the registers are merged to form a 16-bit output bus, namely 'C'. The
VHDL code is given below.
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity SH4_1_16 is
Generic (SIZE : INTEGER := 16);
Port (A : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
CLK : IN STD_ULOGIC;
C : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End SH4_1_16;
Architecture BEHAVE of SH4_1_16 is
Signal AI, AII, AIII : STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
Begin
Process
Begin
WAIT UNTIL CLK'EVENT AND CLK='1';
-- four register stages: A -> AI -> AII -> AIII -> C
AI <= A;
AII <= AI;
AIII <= AII;
C <= AIII;
End Process;
End BEHAVE;
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity SH2_2_16 is
Generic (SIZE : INTEGER := 16);
Port (A, B : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
CLK : IN STD_ULOGIC;
C, D : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End SH2_2_16;
Architecture BEHAVE of SH2_2_16 is
Signal AI, BI : STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
Begin
Process
Begin
WAIT UNTIL CLK'EVENT AND CLK='1';
AI <= A;
C <= AI;
BI <= B;
D <= BI;
End Process;
End BEHAVE;
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity SH1_2_16 is
Generic (SIZE : INTEGER := 16);
Port (A, B : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
CLK : IN STD_ULOGIC;
C, D : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End SH1_2_16;
Architecture BEHAVE of SH1_2_16 is
Begin
Process
Begin
WAIT UNTIL CLK'EVENT AND CLK='1';
C <= A;
D <= B;
End Process;
End BEHAVE;
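The three shift register modules differ only in the number of register stages
between input and output. A minimal Python sketch of such a clocked delay line is
given below for illustration; the class name DelayLine and the use of None for
the unknown power-up contents are assumptions introduced here.

```python
from collections import deque


class DelayLine:
    """Behavioural model of an n-stage, positive-edge-triggered shift register."""

    def __init__(self, stages, reset_value=None):
        # One entry per register stage; index -1 is the output register.
        self.regs = deque([reset_value] * stages, maxlen=stages)

    def clock(self, data_in):
        """Model one rising clock edge and return the output bus value after it."""
        self.regs.appendleft(data_in)   # the oldest value drops off the far end
        return self.regs[-1]


if __name__ == "__main__":
    sh4 = DelayLine(4)                  # corresponds to the four stages of SH4_1_16
    print([sh4.clock(v) for v in [1, 2, 3, 4, 5]])
```

With four stages, a word applied before the first clock edge appears at the
output after the fourth edge, which matches the chain AI, AII, AIII, C in the
SH4_1_16 listing.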
4.3.3 Switches
There are two switches, namely Sw4_1 and Sw4_2, which route the data in the
particular fashion required by the signal-flow graph shown in Fig. 4.1. Routing
of data is controlled by the counter (Cn), which will be described later.
The switch has four input buses of sixteen bits each, namely 'A', 'B', 'C', 'D';
a two-bit select bus, namely 'SEL', which controls the routing of data; and four
output buses of sixteen bits each, namely 'E', 'F', 'G', 'H'. The truth table
for switch Sw4_1 is shown in Fig. 4.23. The behavior of the switch is described
in the VHDL code given below.
INPUT OUTPUT
SEL E F G H
00 C D A B
01 A B C D
10 A B C D
11 C D A B
[Fig. 4.23 Switch Sw4_1; inputs A[15:0], B[15:0], C[15:0], D[15:0] and SEL[1:0]; outputs E[15:0], F[15:0], G[15:0], H[15:0]]
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity SW4_1 is
Generic (SIZE : INTEGER := 16);
Port (A, B, C, D : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
SEL : IN STD_ULOGIC_VECTOR(1 downto 0);
E, F, G, H : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End SW4_1;
Architecture BEHAVE of SW4_1 is
Begin
PO : Process (A, B, C, D, SEL)
Begin
Case SEL is
WHEN "10" => E <= A; F <= B; G <= C; H <= D;
WHEN "01" => E <= A; F <= B; G <= C; H <= D;
WHEN Others => G <= A; H <= B; E <= C; F <= D;
End Case;
End Process PO;
End BEHAVE;
The switch has four input buses of sixteen bits each, namely 'A', 'B', 'C', 'D';
a select line, namely 'SELO', which controls the routing of data; and four
output buses of sixteen bits each, namely 'E', 'F', 'G', 'H'. The truth table
for switch Sw4_2 is shown in Fig. 4.24. The behavior of the switch is described
in the VHDL code given below.
INPUT OUTPUT
SELO E F G H
0 C D A B
1 A B C D
[Fig. 4.24 Switch Sw4_2; routing shown for SELO = '1' and for SELO = '0']
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity SW4_2 is
Generic (SIZE : INTEGER := 16);
Port (A, B, C, D : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
SELO : IN STD_ULOGIC;
E, F, G, H : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End SW4_2;
Architecture BEHAVE of SW4_2 is
Begin
PO : Process (A, B, C, D, SELO)
Begin
Case SELO is
WHEN '0' => G <= A; H <= B; E <= C; F <= D;
WHEN Others => E <= A; F <= B; G <= C; H <= D;
End Case;
End Process PO;
End BEHAVE;
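The routing performed by the two switches can be summarised in a few lines of
Python. The sketch below is illustrative only; the function names sw4_1 and
sw4_2 are assumptions, and the buses are modelled as plain Python values rather
than 16-bit vectors.

```python
def sw4_1(a, b, c, d, sel):
    """Return (E, F, G, H) for the two-bit select value sel (0 to 3)."""
    if sel in (0b01, 0b10):
        return a, b, c, d      # straight through
    return c, d, a, b          # pairs exchanged for "00" and "11"


def sw4_2(a, b, c, d, sel0):
    """Return (E, F, G, H) for the single select line sel0 (0 or 1)."""
    if sel0:
        return a, b, c, d      # straight through
    return c, d, a, b          # pairs exchanged


if __name__ == "__main__":
    print(sw4_1("A", "B", "C", "D", 0b00))
    print(sw4_2("A", "B", "C", "D", 1))
```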
Counter Cn is a two-bit counter. It has one 2-bit output, namely 'COUNT', and
the lower bit of COUNT (i.e. COUNT(0)) is also taken out to form an output line,
namely 'CSWII'. The counter is negative-edge triggered and is cleared
asynchronously when RSTn is '1'. The behavior of the counter is described in the
VHDL code given below.
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity CN is
Port (RSTn, CLK : IN STD_ULOGIC;
COUNT : BUFFER STD_ULOGIC_VECTOR(1 downto 0 ) ;
CSWII : BUFFER STD_ULOGIC);
End CN;
Architecture BEHAVE of CN is
Begin
Process (RSTn, CLK )
Begin
If ( RSTn = '1') then
COUNT <= "00" ;
Elsif (CLK'event and CLK = '0') then
COUNT <= COUNT + "01";
End if;
End Process;
CSWII <= COUNT(0);
End BEHAVE;
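For illustration, the counter behaviour can be modelled in Python as below. This
is a sketch under stated assumptions: the class name Counter2 and its method
names are hypothetical, and each call to falling_edge stands for one negative
clock edge.

```python
class Counter2:
    """Two-bit wrap-around counter with an asynchronous reset."""

    def __init__(self):
        self.count = 0

    def reset(self):
        """Model the reset input being asserted."""
        self.count = 0

    def falling_edge(self):
        """Model one negative clock edge; returns the new COUNT value."""
        self.count = (self.count + 1) & 0b11
        return self.count

    @property
    def csw(self):
        """Lower bit of COUNT, taken out as a separate select line."""
        return self.count & 1


if __name__ == "__main__":
    cn = Counter2()
    print([cn.falling_edge() for _ in range(5)])
```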
As the name implies, the weight factor generator generates the 16-bit weight
factors for the three butterflies. It is controlled by counter Cn so that the
appropriate weight factors are produced at the appropriate time. It has one
2-bit control input, namely 'SELC', and six 16-bit outputs, namely 'WFRI',
'WFII', 'WFRII', 'WFIII', 'WFRIII', 'WFIIII'. The truth table for the weight
factor generator is shown in Fig. 4.25. The behavior of the weight factor
generator is described in the VHDL code given below.
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity WFG is
Generic (SIZE : INTEGER := 16);
Port (SELC : IN STD_ULOGIC_VECTOR(1 downto 0 ) ;
WFRI, WFII, WFRII, WFIII, WFRIII, WFIIII
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End WFG;
Architecture BEHAVE of WFG is
Begin
PO : Process(SELC)
Begin
Case SELC is
WHEN "10" => WFRI <= "0101101010000010"; --5A82
WFII <= "1010010101111110"; --A57E
WHEN "11" => WFRI <= "0000000000000000"; --0000
WFII <= "1000000000000000"; --8000
WHEN "00" => WFRI <= "1010010101111110"; --A57E
WFII <= "1010010101111110"; --A57E
WHEN Others => WFRI <= "0000000100101100"; --012C
WFII <= "0000000000000000"; --0000
End Case;
End Process PO;
PI : Process(SELC)
Begin
Case SELC is
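The 16-bit constants in the weight factor listing above (5A82, A57E, 8000) agree
with a 16-bit two's complement fraction with 15 fractional bits. The Python
sketch below shows how such codes can be derived for an 8-point transform. It is
an illustration under assumptions: the function names to_q15 and twiddle_q15 are
hypothetical, and the use of 012C as the code for 1.0 follows the unity weight
factor described earlier.

```python
import math

UNITY_MARKER = 0x012C  # reserved code used in place of the unrepresentable 1.0


def to_q15(value):
    """Encode a fraction in [-1.0, 1.0) as a 16-bit two's complement pattern."""
    if not -1.0 <= value < 1.0:
        raise ValueError("value outside the representable range")
    code = min(int(round(value * 32768)), 32767)  # clamp the top of the range
    return code & 0xFFFF


def twiddle_q15(k, n=8):
    """Return (real, imaginary) codes of W^k = exp(-j*2*pi*k/n)."""
    if k % n == 0:
        return UNITY_MARKER, 0x0000   # W^0 = 1.0 needs the reserved code
    angle = 2 * math.pi * k / n
    return to_q15(math.cos(angle)), to_q15(-math.sin(angle))


if __name__ == "__main__":
    for k in range(4):
        re, im = twiddle_q15(k)
        print(k, format(re, "04X"), format(im, "04X"))
```

Since +1.0 would need the value 32768, which does not fit in a signed 16-bit
word, the sketch rejects it, which is the limitation the reserved code works
around.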
The overall code for the pipeline FFT is shown below. It mainly describes the
interconnection of all the blocks discussed previously.
library SYNTH;
use SYNTH.VHDLSYNTH.all;
library IEEE;
use IEEE.STD_LOGIC_1164.all;
Entity PLFF is
Generic (SIZE : INTEGER := 16);
IARL, IAIL, IBRL, IBIL, IIARL, IIAIL, IIBRL, IIBIL,
IIIARL, IIIAIL, IIIBRL, IIIBIL
: OUT STD_ULOGIC;
OPAR, OPAI, OPBR, OPBI
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End PLFF;
Architecture BEHAVE of PLFF is
Component SH4_1_16 Port (A : IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
CLK : IN STD_ULOGIC;
C : OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component BFL Port (INAR, INAI, INBR, INBI, INWR, INWI
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
ARLZ, AILZ, BRLZ, BILZ : OUT STD_ULOGIC;
OSR, OSI, ODWR, ODWI
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component SH2_2_16 Port (A, B
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
CLK : IN STD_ULOGIC;
C, D
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ) ;
End Component;
Component SH1_2_16 Port (A, B
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0 ) ;
CLK : IN STD_ULOGIC;
C, D
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component SW4_1 Port (A, B, C, D
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
SEL : IN STD_ULOGIC_VECTOR(1 downto 0);
E, F, G, H
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component SW4_2 Port (A, B, C, D
: IN STD_ULOGIC_VECTOR(SIZE - 1 downto 0);
SELO : IN STD_ULOGIC;
E, F, G, H
: OUT STD_ULOGIC_VECTOR(SIZE - 1 downto 0));
End Component;
Component CN Port (RSTn, CLK : IN STD_ULOGIC;
COUNT : BUFFER STD_ULOGIC_VECTOR(1 downto 0);
COMPH : SW4_2 Port Map (IISWA, IISWB, IISWC, IISWD, SELOX,
IISWE, IISWF, IISWG, IISWH);
COMPI : SH1_2_16 Port Map (IISWE, IISWF, CL, IISHIC, IISHID);
COMPJ : BFL Port Map (IISHIC, IISHID, IISWG, IISWH, IIIWR, IIIWI,
IIIARL, IIIAIL, IIIBRL, IIIBIL,
OPAR, OPAI, OPBR, OPBI);
COMPK : CN Port Map (REST, CL, SELIX, SELOX);
COMPL : WFG Port Map (SELIX, IWR, IWI, IIWR, IIWI, IIIWR, IIIWI);
End BEHAVE;
The code is synthesized and simulated using Viewlogic's synthesis tool. The
schematics are attached in chapter 6 , 'Simulation and implementation results'.
CHAPTER-5
FPGA BASED IMPLEMENTATION
5.1 Introduction
[Figure: classification of ASICs; labels: Full Custom, Semicustom, Linear, Mixed, FPLDs, Gate, Standard]
There are basically four major technologies in common use today for ASIC
implementation. These are field-programmable logic devices (FPLDs), gate arrays,
standard cells, and full custom.
FPLDs are characterized by their ability to be configured by the customer.
Although known by many mnemonics, all FPLD devices are basically one of two
types: (1) programmable logic devices (PLDs) and (2) field-programmable gate arrays
(FPGAs).
PLDs are characterized by a fixed interconnect and an AND-OR plane driving
flip-flops, which are routed to output pins. FPGAs, on the other hand, possess a
more flexible interconnect and comprise an array of logic blocks, which can be
configured to perform various logic functions. FPLDs find application primarily in
lower complexity (fewer than 2000 gates) and low volume applications. However, a
number of manufacturers are readying offerings that will reportedly contain over
20,000 gates. PLDs are available in CMOS, bipolar, and GaAs, while FPGAs are
most often fabricated in CMOS.
Some of the advantages of FPLDs are:
• Shortest fabrication time. Since the devices are not actually fabricated for
personalization but are typically programmed using a PROM programmer,
a completed design may be implemented in a matter of hours or days
instead of weeks.
• Low cost in low volume. For very low volumes, FPLDs are very cost-
effective since there are no non-recurring engineering (NRE) charges.
• Changes easier and faster. If changes are likely to be made or
personalization is necessary, FPLDs are possibly the most effective
vehicle. This also makes them an effective functional verification tool.
Lowest integration. For larger designs, FPLDs may require the use of multiple
chips, where other ASIC approaches would allow the implementation in a
single chip.
Recently Field Programmable Gate Arrays (FPGAs) have become very
popular for implementing Application Specific Integrated Circuits (ASICs). As the
technology evolves the low and medium end ASICs are being implemented using
FPGAs. An effective design approach with FPGAs allows earlier market entry than
with other ASICs.
A graph depicting relative cost versus quantity for the design in the preceding
four product categories is shown in Fig. 5.2. The assumption is that the design
implemented can be realized in all four of the above products, and that each product
utilizes the same basic process technology, such as CMOS.
parts. The result is a low-risk design style, where the price of a logic error is
small, both in money and project delay. The reduced risk makes FPGAs useful for
rapid product development and prototyping. Moreover, FPGAs can be fully tested
after manufacture, so the user's design does not require test program generation
or design for testability [14].
Many kinds of programmable logic products are called FPGAs. FPGAs are
categorized according to their combination of programming technology and device
architecture. Three programming technologies are commonly used for FPGAs. Each
has associated area and performance costs, and the device architectures reflect
these costs [14].
[Figure: FPGA architecture; labels: switch box, wiring channel]
The IOBs provide a programmable interface between the internal logic array
and the device package pins, CLBs perform user-specific logic functions, and the
interconnect resources carry signals among the blocks [15]. A configuration program
stored in internal static memory cells determines the logic functions and the
interconnect. Interconnect segments connect to CLB pins in the channels and to other
segments in the switch boxes through pass transistors controlled by configuration
memory cells. Because SRAM cells and pass transistors are comparatively expensive
in area and delay, the switch boxes are not full crossbar switches.
An SRAM FPGA program consists of a single long program word. On-chip
circuitry loads the program word, reading it serially out of an external memory
every time power is applied to the chip. The program bits set the values of all
configuration memory cells on the chip, selecting which segments connect to each
other. SRAM FPGAs are inherently re-programmable. They can be updated in the
system, providing designers with new design options and capabilities, such as logic
updates that do not require hardware modification and time-shared virtual logic [14].
All Xilinx FPGAs use CMOS SRAM technology.
[Figure: FPGA architecture; labels: logic block, wiring channel]
extend into the channel. A logic block is a comparatively simple gate-level
network, which one programs by connecting its input pins to fixed values or to
interconnect nets [14].
In the CPLD architecture of Fig. 5.3(c), the user creates logic and
interconnections by programming EPROM (EEPROM) transistors to form wide fan-in
gates. A CPLD consists of a few function blocks, each containing a PLD AND-array
that feeds its macrocells. The user programs the AND-array by turning on EPROM
transistors that allow selected inputs to be included in a product term.
[Figure: CPLD architecture with function blocks FB3 and FB4. Legend: MC = macrocell, FB = function block, UIM = universal interconnect mechanism]
5.3.1 CMOS XC4000 series
[Figure: XC4000 series logic block; labels: logic function generators, S/R control, inputs C1-C4, input clock]
Features :
1. It has logic densities up to 130,000 usable gates and supports system clock
rates of up to 66 MHz.
2. Compared to older Xilinx FPGA families, the XC4000EX/XL families are more
powerful, offering on-chip ultra-fast RAM with synchronous write option and
dual-port RAM option
3. The XC4000EX/XL families are fully PCI compliant.
4. The XC4000EX/XL families have abundant flip-flops, flexible function
generators, dedicated high-speed carry logic, wide edge decoders on each edge,
internal 3-state bus capability, 8 global low-skew clock or signal distribution
networks, flexible Array Architecture
5. The XC4000EX/XL families have generous routing resources to accommodate the
most complex interconnect patterns.
6. The XC4000EX/XL families are supported by powerful and sophisticated
software, covering every aspect of design from schematic entry, to simulation, to
automatic block placement and routing of interconnections, and finally the
creation of the configuration bit stream.
7. The schematic library for the XC4000EX/XL FPGAs contains 400 primitives and
macros, ranging from 2-input AND gates to 16-bit accumulators, and including
arithmetic functions, comparators, counters, data registers, decoders, encoders, I/O
functions, latches, Boolean functions, RAM and ROM memory blocks,
multiplexers, and shift registers.
8. Operational power consumption is totally dynamic.
9. Typical power consumption is between 100 mW and 2 W, depending upon the size
of the device.
10. Buffered interconnect for maximum speed.
11. Flexible new high-speed clock network
• 8 additional early buffers for shorter clock delays.
• 4 additional fastCLK buffers for fastest clock input.
• Virtually unlimited number of clock signals.
12. Optional multiplexer or 2-input function generator on device outputs.
CHAPTER-6
SIMULATION AND IMPLEMENTATION RESULTS
An eight-point FFT is calculated for the eight samples x(0) = x(1) = x(2) = x(3)
= 32512 x 2^-15 = 0.9921875 and x(4) = x(5) = x(6) = x(7) = 0.0. The signal flow
graph for the 8-point FFT (DIF) is shown in Fig. 6.1.
[Fig. 6.1 Signal flow graph for 8-point FFT (DIF); inputs x(0) to x(7), outputs X(0) to X(7)]
As mentioned in previous chapters, in fixed-point arithmetic, the input should
be normalized to a fraction and the input to each butterfly should be attenuated by a
factor of 2. Thus the signal flow graph for 8-point FFT is modified and is shown in
Fig. 6.2.
Both signal flow graphs give identical results (intermediate values will be
different but the final output will be identical). The results are calculated (using signal
flow graph in Fig. 6.2) by hand as follows.
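Here,
\[ W^0 = 1, \quad W^1 = \frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}}, \quad W^2 = -j, \quad W^3 = -\frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \]

First stage:
\[ x'(0) = \frac{x(0)}{2} + \frac{x(4)}{2} = \frac{32512 \times 2^{-15}}{2} + \frac{0}{2} = \frac{1}{2}(0.9921875) = 0.49609375 \]
\[ x'(5) = \left\{ \frac{x(1)}{2} - \frac{x(5)}{2} \right\} \times W^1 = \left\{ \frac{32512 \times 2^{-15}}{2} - \frac{0}{2} \right\} \left( \frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \right) = \frac{1}{2}(0.701582509 - j0.701582509) = 0.350791254 - j0.350791254 \]
\[ x'(6) = \left\{ \frac{x(2)}{2} - \frac{x(6)}{2} \right\} \times W^2 = \left\{ \frac{32512 \times 2^{-15}}{2} - \frac{0}{2} \right\} (-j) = \frac{1}{2}(-j0.9921875) = -j0.49609375 \]

Second stage:
\[ x''(0) = \frac{x'(0)}{2} + \frac{x'(2)}{2} = \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} + \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} = \frac{1}{4}(1.984375) = 0.49609375 \]
\[ x''(1) = \frac{x'(1)}{2} + \frac{x'(3)}{2} = \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} + \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} = \frac{1}{4}(1.984375) = 0.49609375 \]
\[ x''(2) = \left\{ \frac{x'(0)}{2} - \frac{x'(2)}{2} \right\} \times W^0 = \left\{ \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} - \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \right\} \times 1 = \frac{1}{4}(0.0) = 0.0 \]
\[ x''(3) = \left\{ \frac{x'(1)}{2} - \frac{x'(3)}{2} \right\} \times W^2 = \left\{ \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} - \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \right\} \times (-j) = \frac{1}{4}(0.0) = 0.0 \]
\[ x''(4) = \frac{x'(4)}{2} + \frac{x'(6)}{2} = \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} + \frac{1}{2} \cdot \frac{-j\,32512 \times 2^{-15}}{2} = \frac{1}{4}(0.9921875 - j0.9921875) = 0.248046875 - j0.248046875 \]
\[ x''(5) = \frac{x'(5)}{2} + \frac{x'(7)}{2} = \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \left( \frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \right) + \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \left( -\frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \right) = \frac{1}{4} \left( -j\,32512 \times 2^{-15} \times \sqrt{2} \right) = \frac{1}{4}(-j1.403165019) = -j0.350791254 \]
\[ x''(7) = \left\{ \frac{x'(5)}{2} - \frac{x'(7)}{2} \right\} \times W^2 = \left\{ \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \left( \frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \right) - \frac{1}{2} \cdot \frac{32512 \times 2^{-15}}{2} \left( -\frac{1}{\sqrt{2}} - j\frac{1}{\sqrt{2}} \right) \right\} \times (-j) = -j0.350791254 \]

Third stage:
\[ X(0) = \left\{ \frac{x''(0)}{2} + \frac{x''(1)}{2} \right\} \times 8 = \left\{ \frac{1}{2} \cdot \frac{65024 \times 2^{-15}}{4} + \frac{1}{2} \cdot \frac{65024 \times 2^{-15}}{4} \right\} \times 8 = \frac{1}{8} \left\{ 130048 \times 2^{-15} \right\} \times 8 = \mathbf{3.96875} \]
\[ X(2) = \left\{ \frac{x''(2)}{2} + \frac{x''(3)}{2} \right\} \times 8 = (0 + 0) \times 8 = \mathbf{0.0} \]
\[ X(4) = \left\{ \frac{x''(0)}{2} - \frac{x''(1)}{2} \right\} \times W^0 \times 8 = \left\{ \frac{1}{2} \cdot \frac{65024 \times 2^{-15}}{4} - \frac{1}{2} \cdot \frac{65024 \times 2^{-15}}{4} \right\} \times 1 \times 8 = \mathbf{0.0} \]
\[ X(6) = \left\{ \frac{x''(2)}{2} - \frac{x''(3)}{2} \right\} \times W^0 \times 8 = (0 - 0) \times 1 \times 8 = \mathbf{0.0} \]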
The values shown in bold face correspond to the signal flow graph shown in Fig. 6.1.
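The hand calculation can be cross-checked numerically. The Python sketch below
is an illustrative floating-point model of a radix-2 decimation-in-frequency FFT
in which both inputs of every butterfly are halved; it is not the fixed-point
hardware, and the function name dif_fft_scaled is an assumption introduced here.
With three stages, the result equals X(k)/8 for the 8-point case.

```python
import cmath


def dif_fft_scaled(samples):
    """Radix-2 DIF FFT with every butterfly input attenuated by 2.

    Returns X(k) / N in natural order; len(samples) must be a power of two.
    """
    n = len(samples)
    data = [complex(v) for v in samples]
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # twiddle factor
                a = data[start + k] / 2           # attenuate upper input
                b = data[start + k + span] / 2    # attenuate lower input
                data[start + k] = a + b
                data[start + k + span] = (a - b) * w
        span //= 2
    # DIF leaves the outputs in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    result = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        result[rev] = data[i]
    return result


if __name__ == "__main__":
    x = [0.9921875] * 4 + [0.0] * 4
    for k, value in enumerate(dif_fft_scaled(x)):
        print(k, value, value * 8)
```

Multiplying each output by 8 gives the unscaled transform, so X(0) x 8
reproduces the value 3.96875 obtained by hand.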
The butterfly circuit is simulated and the results obtained are tabulated in
Table 6.1. The results are correct up to four decimal places. The error is due to
the right shifting of the input to the butterfly, which attenuates it by a factor
of 2, and to rounding off the output of the butterfly to 16 bits.
(b) Timing diagram of FFT circuit.
The pipeline FFT circuit is also simulated and the waveforms are attached. The
outputs available at a particular time are shown in Table 6.2.
Time              Output
1600ns-1700ns     X(0) = (3F80, 0000)    X(4) = (0000, 0000)
1800ns-1900ns     X(2) = (0000, 0000)    X(6) = (0000, 0000)
2000ns-2100ns     X(1) = (0FE0, D9AD)    X(5) = (0FE0, 0693)
2200ns-2300ns     X(3) = (0FE0, F96D)    X(7) = (0FE0, 2653)
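The hexadecimal outputs of Table 6.2 can be compared with the expected values
X(k)/8 by decoding each word as a 16-bit two's complement fraction with 15
fractional bits. The Python sketch below does this. It is illustrative only: the
names q15_to_float, TABLE_6_2 and max_table_error are assumptions, the pairs are
taken to be (real, imaginary), and the letter O in the printed table is read as
the digit 0.

```python
import cmath


def q15_to_float(code):
    """Decode a 16-bit two's complement pattern with 15 fractional bits."""
    code &= 0xFFFF
    if code & 0x8000:
        code -= 0x10000
    return code / 32768.0


# (real, imaginary) words as read from Table 6.2, indexed by k
TABLE_6_2 = {
    0: (0x3F80, 0x0000), 4: (0x0000, 0x0000),
    2: (0x0000, 0x0000), 6: (0x0000, 0x0000),
    1: (0x0FE0, 0xD9AD), 5: (0x0FE0, 0x0693),
    3: (0x0FE0, 0xF96D), 7: (0x0FE0, 0x2653),
}


def max_table_error(samples):
    """Largest |simulated - expected| over all k, with expected = X(k)/8."""
    n = len(samples)
    worst = 0.0
    for k, (re_code, im_code) in TABLE_6_2.items():
        expected = sum(samples[i] * cmath.exp(-2j * cmath.pi * k * i / n)
                       for i in range(n)) / n
        simulated = complex(q15_to_float(re_code), q15_to_float(im_code))
        worst = max(worst, abs(simulated - expected))
    return worst


if __name__ == "__main__":
    x = [0.9921875] * 4 + [0.0] * 4
    print(max_table_error(x))
```

Under these assumptions the largest deviation stays below 1e-4, in line with the
statement that the results are correct up to four decimal places.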
The gate-level schematics are generated using Viewlogic's synthesis tool. Some
important schematics are attached, such as plff (pipeline FFT), bfl (butterfly
circuit), adsb16 (add/subtract circuit), sh4_1_16 (four-stage shift register),
sh2_2_16 (two-stage shift register), sh1_2_16 (one-stage shift register), sw4_1
(first switch), sw4_2 (second switch), cn (counter) and wfg (weight factor
generator).
IMPLEMENTATION REPORT
                                                       Remarks
Target device                  xc4052xl
Target package                 -1-bg432
Number of CLBs                 1897 out of 1936        97%
Total CLB flip-flops           0 out of 3872           0%
Total CLB latches              0 out of 3872           0%
4 input LUTs                   3488 out of 3872        90%
3 input LUTs                   60 out of 1936          3%
Number of bonded IOBs          164 out of 352          46%
IOB flops                      0
IOB latches                    0
Number of TBUFs                0
Total equivalent gate count    32748
Maximum delay                  350.29ns
Maximum net delay              233.706ns
MAP REPORT
Xilinx Mapping Report File for Design "BFL"
Copyright (c) 1995-1997 Xilinx, Inc. All rights reserved.
Design Information
Design Summary
Number of errors: 0
Number of warnings: 107
Number of CLBs: 1897 out of 1936 97%
CLB Flip Flops: 0
CLB Latches: 0
4 input LUTs: 3488
3 input LUTs: 60
Number of bonded IOBs 164 out of 352 46%
IOB Flops: 0
IOB Latches: 0
Total equivalent gate count for design: 32748
Additional JTAG gate count for IOBs: 7872
LOGIC LEVEL TIMING REPORT
Timing summary:
Design statistics:
Maximum combinational path delay: 78.000ns
Maximum net delay: 0.984ns
Placer score = 1134120
Placer score = 1128480
Placer score = 1124940
Placer score = 1122210
Placer score = 1119570
Placer score = 1117560
Placer score = 1115940
Placer score = 1113990
Finished Constructive Placer. REAL time: 2 mins 28 secs
The Number of signals not completely routed for this design is: 0
The Average Clock Skew for this design is: 0.000 ns
The Maximum Pin Delay is: 233.706 ns
The Average Connection Delay on the 10 Worst Nets is: 203.143 ns
d <= 10 < d <= 20 < d <= 30 < d <= 40 < d <= 50 d >
50
Timing Score: 0
PAR done.
PAD REPORT
PAR: Xilinx Place And Route M1.4.12.
Copyright (c) 1995-1997 Xilinx, Inc. All rights reserved.
Thu Oct 07 20:32:19 1999
AILZ      AG31
ARLZ      F4
BILZ      R1
BRLZ      AC29
INAI0     AF29
INAI1     U28
INAI10    K28
INAI11    K29
INAI12    B23
INAI13    C24
INAI14    L29
INAI15    J28
INAI2     W29
INAI3     V30
INAI4     R30
INAI5     R28
INAI6     T29
INAI7     W30
INAI8     N31
INAI9     N30
INAR0     E1
INAR1     A13
INAR10    B11
INAR11    D15
INAR12    D13
INAR13    A12
INAR14    C15
INAR15    D14
INAR2     C14
INAR3     C13
INAR4     C10
INAR5     B14
INAR6     B10
INAR7     B15
INAR8     B12
INAR9     A15
INBI0     R2
INBI1     AK24
INBI10    A24
INBI11    AJ24
INBI12    D23
INBI13    A26
INBI14    K31
INBI15    J29
INBI2     U29
INBI3     V29
INBI4     U30
INBI5     R29
INBI6     P29
INBI7     R31
INBI8     M28
INBI9     H31
INBR0     AC30
INBR1     C22
INBRIO A20
INBRll C21
INBR12 C20
INBR13 C23
INBR14 A22
INBR15 D12
INBR2 C18
INBR3 04
INBR4 C19
INBR5 D19
INBR6 A17
INBR7 A16
INBR8 C17
INBR9 B21
INWIO F30
INWIl AK8
INWIIO AK4
INWIll P3
INWI12 RK5
INWI13 AK15
INWI14 Kl
INWIl5 G29
INWI2 P4
INWI3 AHl
INWI4 AA3
INWI5 AH25
INWI6 AK25
INWI7 AJ26
INWI8 AJ21
INWI9 Ml
INWRO AL12
INWRl AFl
INWRIO AK3
INWRll D9
INWR12 K2
INWR13 B7
INWR14 D28
INWRl5 E29
INWR2 AL17
INWR3 Wl
INWR4 AH23
INWR5 AJ5
INWR6 AL10
INWR7 H4
INWR8 AG3
INWR9 D26
ODWI0 N3
ODWI1 M4
ODWI10 D10
ODWI11 K3
ODWI12 J2
ODWI13 B9
ODWI14 J4
ODWI15 H2
ODWI2 J3
ODWI3 M3
ODWI4 N4
ODWI5 M2
ODWI6 M1
ODWI7 K4
ODWI8 N2
ODWI9 L3
ODWR0 AK20
ODWR1 AJ19
ODWR10 AH19
ODWR11 AH18
ODWR12 AL13
ODWR13 AH14
ODWR14 AK14
ODWR15 AJ14
ODWR2 AK21
ODWR3 AL19
ODWR4 AH17
ODWR5 AK19
ODWR6 AL20
ODWR7 AK18
ODWR8 AK17
ODWR9 p^jie
OSI0 U31
OSI1 T31
OSI10 N29
OSI11 M31
OSI12 B24
OSI13 L30
OSI14 j^29
OSI15 K30
OSI2 V28
OSI3 „3i
OSI4 T30
OSI5 N28
OSI6 P28
OSI7 f,3o
OSI8 P30
OSI9 J30
OSR0 JJ20
OSR1 C16
OSR10 ^^g
OSR11 P22
OSR12 B20
OSR13 ^-^2
OSR14 B22
OSR15 gj^3
OSR2 g^g
OSR3 ^^^
OSR4
OSR5 A10
OSR6
OSR8
OSR9 ^^^
ASYNCHRONOUS DELAY REPORT
Thu Oct 07 20:31:51 1999
File: bfl.dly
233.706 CP1/VLX_PROCESS_OSIG_114 5
226.325 CP1/VLX_PROCESS_OSIG_114 12
208.967 CP1/VLX_PROCESS_OSIG_114 3
204.696 CP1/VLX_PROCESS_OSIG_114 0
203.288 CP1/VLX_PROCESS_OSIG_114 4
200.997 CP1/SGEN_NODE_390
200.639 CP1/SGEN_NODE_387
188.841 CP1/VLX_PROCESS_OSIG_114 11
186.468 CP1/SGEN_NODE_386
177.504 CP1/VLX_PROCESS_OSIG_114 9
170.385 CP1/VLX_PROCESS_OSIG_114 20
169.060 CP1/VLX_PROCESS_OSIG_114 10
160.459 CP4/SGEN_NODE_377
159.968 CP1/SGEN_NODE_378
142.851 CP1/VLX_PROCESS_OSIG_114 2
139.017 CP4/VLX_PROCESS_OSIG_114 0
138.722 CP4/SGEN_NODE_378
138.543 CP4/VLX_PROCESS_OSIG_114 11
134.467 CP1/SGEN_NODE_382
127.850 CP4/VLX_PROCESS_OSIG_114 12
POST LAYOUT TIMING REPORT
Timing summary:
Design statistics:
Maximum combinational path delay: 350.290ns
Maximum net delay: 233.706ns
CHAPTER-7
CONCLUSION AND FUTURE SCOPE
A fast Fourier transform processor has been successfully coded at a high level
of design abstraction using VHDL. The design was simulated exhaustively at the
VHDL level using Viewlogic's Speedwave simulator. It was then synthesized with
Viewlogic's Aurora synthesis tool using the Xilinx FPGA library. The gate-level
schematics generated by the synthesis tool were verified using the Viewlogic
Viewsim gate-level simulator. The logically verified gate-level netlist was then
implemented in the Xilinx XC4052XL device. The worst-case static timing
information generated by the implementation tools indicates that the post-layout
delay of the butterfly is about 350.29 ns. The designed FFT processor calculates
the 8-point FFT, but the code can easily be modified for higher-point FFTs. The
multiplier used is an array multiplier, and the code is written so that it can be
reused for higher-order multipliers by changing only the generic parameter
'SIZE'. The delay is expected to be considerably lower if the same VHDL code is
implemented in the form of an ASIC.
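The arithmetic performed by one butterfly can be sketched in C as a behavioural
reference. This is only an illustration under stated assumptions, not the VHDL
design itself: the decimation-in-frequency structure (sum = A + B, product =
(A - B) * W) is inferred from the port names in the pad report (INA, INB, INW,
OS, ODW), and the type cplx16, the function names, the rounding rule and the
saturation on overflow are all hypothetical choices made for this sketch. The
weight W is assumed to be scaled by 2^15.

```c
#include <assert.h>
#include <stdint.h>

/* Complex value with 16-bit real and imaginary parts (hypothetical type). */
typedef struct { int16_t re, im; } cplx16;

/* Clamp a wide intermediate result to the signed 16-bit range.
   Saturation is an assumption; the original overflow handling is not stated. */
static int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Undo the 2^15 weight scaling, rounding half away from zero. */
static int64_t unscale_q15(int64_t p)
{
    return (p >= 0) ? (p + 16384) / 32768 : -((-p + 16384) / 32768);
}

/* Radix-2 decimation-in-frequency butterfly (assumed structure):
     sum = a + b
     dw  = (a - b) * w,  with w scaled by 2^15 */
void butterfly_dif(cplx16 a, cplx16 b, cplx16 w, cplx16 *sum, cplx16 *dw)
{
    assert(sum && dw);

    int64_t dr = (int64_t)a.re - b.re;   /* real part of a - b */
    int64_t di = (int64_t)a.im - b.im;   /* imaginary part of a - b */

    sum->re = sat16((int64_t)a.re + b.re);
    sum->im = sat16((int64_t)a.im + b.im);

    /* (dr + j*di) * (wr + j*wi) needs four real multiplications. */
    dw->re = sat16(unscale_q15(dr * w.re - di * w.im));
    dw->im = sat16(unscale_q15(dr * w.im + di * w.re));
}
```

For example, with a = 3000 + j1000, b = 1000 - j1000 and w = -j (stored as
re = 0, im = -32768), the sketch gives sum = 4000 + j0 and dw = 2000 - j2000.
The four real multiplications in the complex product are where a hardware
design would use its multiplier blocks.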
The concept of intellectual property (IP) has become extremely popular in the
design world. Most complex designs are built by combining pre-designed blocks
called IPs. The objective of this dissertation is to create an IP for the FFT.
This IP is available in the form of verified VHDL code. Anybody can use this IP
in his or her complex ASIC design.
REFERENCES
1. J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of
Complex Fourier Series," Math. Computation, Vol. 19, 1965, pp. 297-301.
2. C. Runge, Z. Math. Physik, Vol. 48, 1903, p. 443; also Vol. 53, 1905, p. 117.
3. G. C. Danielson and C. Lanczos, "Some Improvements in Practical Fourier
Analysis and Their Application to X-Ray Scattering from Liquids," J.
Franklin Inst., Vol. 233, 1942, pp. 365-380, 435-452.
4. J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "Historical Notes on the Fast
Fourier Transform," IEEE Trans. Audio Electroacoust., Vol. AU-15, June
1967, pp. 76-79.
5. W. T. Cochran et al., "What is the Fast Fourier Transform?" IEEE Trans.
Audio Electroacoust., Vol. AU-15, June 1967, pp. 45-55.
6. R. C. Singleton, "A Method for Computing the Fast Fourier Transform with
Auxiliary Memory and Limited High-Speed Storage," IEEE Trans. Audio
Electroacoust., Vol. AU-15, June 1967, pp. 91-97.
7. B. Gold and C. M. Rader, Digital Processing of Signals, McGraw-Hill Book
Company, New York, 1969.
8. A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice Hall,
Englewood Cliffs, N. J., 1975.
9. L. R. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing, Prentice Hall, Englewood Cliffs, N. J., 1975.
10. John G. Proakis and Dimitris G. Manolakis, Digital Signal Processing:
Principles, Algorithms and Applications, Prentice Hall, Englewood Cliffs,
N. J., 1996.
11. K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE
Computer Society Press, Los Alamitos, California, 1997.
12. Zainalabedin Navabi, VHDL: Analysis and Modeling of Digital Systems,
McGraw-Hill Inc., New York, 1993.
13. Jeffory, I. Hilbert, "ASIC Technology," pp. 217-219, Academic Press Inc.,
1991.
14. Stephen Trimberger (Manager, Advanced Development, Xilinx Inc.), "Field
Programmable Gate Arrays," Guest Editor's Introduction, pp. 3-5, IEEE, Sept.
1992.
15. Xilinx, "XACT User Guide", pp. 1-1 to 1-3, April 1993.
16. Xilinx, "Programmable Logic Data Book", 1994.
APPENDIX
The program shown below is written in C and calculates the weight factors for a
1024-point FFT.
/**********************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.141592654
#define TA 32768.0    /* scale factor 2^15 */

/* Print a table row with all five cells empty. */
static void blank_row(FILE *fp)
{
    fprintf(fp, "| %-8s | %12s | %12s | %10s | %10s |\n", "", "", "", "", "");
}

int main(void)
{
    FILE *fp;
    double x, rw, iw, n, r, rmax, rww, iww;

    fp = fopen("co", "w");
    if (fp == NULL) {
        perror("co");
        return EXIT_FAILURE;
    }
    n = 1024.0;          /* FFT length */
    r = 0.0;             /* weight factor index */
    rmax = n / 2 - 1;    /* last index needed */
    fprintf(fp, "                              RESULTS\n");
    fprintf(fp, "--------------------------------------------------------------------\n");
    blank_row(fp);
    fprintf(fp, "| %-8s | %12s | %12s | %10s | %10s |\n",
            "WTFAC", "RWF", "IWF", "RWW", "IWW");
    blank_row(fp);
    do {
        x = (PI * 2.0) * (r / n);   /* angle in radians */
        rw = cos(x);                /* real part of weight factor */
        rww = rw * TA;              /* scaled real part */
        iw = -sin(x);               /* imaginary part of weight factor */
        iww = iw * TA;              /* scaled imaginary part */
        blank_row(fp);
        fprintf(fp, "| %-8.1f | %12.7f | %12.7f | %10.2f | %10.2f |\n",
                r, rw, iw, rww, iww);
        blank_row(fp);
        r = r + 1;
    } while (r <= rmax);
    fprintf(fp, "\n");
    fclose(fp);
    return 0;
}
This program was compiled and run on a UNIX system. Each weight factor
calculated is multiplied by 2^15 = 32768.0 to facilitate easy searching. The
scaled real and imaginary parts of the weight factor vary from 32768.00 to
-32767.38 and from 0.0 to -32768.00 respectively. A portion of the output is
illustrated below.
RESULTS
| WTFAC | RWF | IWF | RWW | IWW |