Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units
Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a low-power FFT processor, because it has the lowest number of arithmetic operations among all the FFT algorithms. In the design of such processors, an efficient addressing scheme for FFT data as well as twiddle factors is required. The signal flow graph of SRFFT is the same as that of radix-2 FFT, and therefore, the conventional address generation schemes of FFT data can also be applied to SRFFT. However, SRFFT has irregular locations of twiddle factors, which forbids the application of radix-2 address generation methods to the twiddle factors. This brief presents a shared-memory low-power SRFFT processor architecture. We show that SRFFT can be computed by using a modified radix-2 butterfly unit. The butterfly unit exploits the multiplier-gating technique to save dynamic power at the expense of using more hardware resources. In addition, two novel address generation algorithms for both the trivial and nontrivial twiddle factors are developed. Simulation results show that, compared with the conventional radix-2 shared-memory implementations, the proposed design achieves over 20% lower power consumption when computing a 1024-point complex-valued transform.

Index Terms: Address generation, low power, radix-2, split-radix fast Fourier transform (SRFFT), twiddle factors.

Manuscript received August 23, 2015; revised November 28, 2015 and January 30, 2016; accepted March 4, 2016. Date of publication April 12, 2016; date of current version August 23, 2016.
The authors are with the Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA 01851 USA.
Digital Object Identifier 10.1109/TVLSI.2016.2544838

I. INTRODUCTION

The fast Fourier transform (FFT) is one of the most important and fundamental algorithms in the digital signal processing area. Since the discovery of FFT, many variants of the FFT algorithm have been developed, such as radix-2 and radix-4 FFT. In 1984, Duhamel and Hollmann [1] proposed a new variant of the FFT algorithm called split-radix FFT (SRFFT). Their algorithm requires the least number of multiplications and additions among all the known FFT algorithms. Since arithmetic operations significantly contribute to overall system power consumption, SRFFT is a good candidate for the implementation of a low-power FFT processor.

In general, all the FFT processors can be categorized into two main groups: pipelined processors and shared-memory processors. Examples of pipelined FFT processors can be found in [2] and [3]. A pipelined architecture provides high throughput, but it requires more hardware resources at the same time. One or multiple pipelines are often implemented, each consisting of butterfly units and control logic. In contrast, the shared-memory-based architecture requires the least amount of hardware resources at the expense of slower throughput. Examples of such processors can be found in [4] and [5]. In the radix-2 shared-memory architecture, the FFT data are organized into two memory banks. At each clock cycle, two FFT data are provided by the memory banks and one butterfly unit is used to process the data. At the next clock cycle, the calculation results are written back to the memory banks and replace the old data. The scope of this brief is limited to the shared-memory architecture.

In the shared-memory architecture, an efficient addressing scheme for FFT data as well as coefficients (called twiddle factors) is required. For the fixed-radix FFT, previous works on this topic can be found in [5] and [6]. The split-radix FFT conventionally involves an L-shaped butterfly datapath whose irregular shape has uneven latencies and makes scheduling difficult. In this brief, we show that the SRFFT can be computed by using a modified radix-2 butterfly structure. Our contribution consists of mapping the split-radix FFT algorithm to the shared-memory architecture, leveraging the lower multiplicative complexity of the algorithm to reduce the dynamic power, and developing two novel twiddle factor addressing schemes for the split-radix FFT.

The rest of this brief is organized as follows. Section II provides a theoretical comparison of the number of complex multiplications between the radix-2 FFT and the SRFFT. Section III discusses the architecture of the proposed design. Section IV provides the implementation results and Section V concludes this brief.

II. COMPARISON OF SRFFT AND RADIX-2 FFT

The N-point discrete Fourier transform is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{1}$$

where $k = 0, 1, \ldots, N-1$ and $W_N^{nk} = e^{-j 2\pi nk/N}$. If we split $X(k)$ into even and odd terms, radix-2 FFT can be derived as

$$X(2k) = \sum_{n=0}^{N/2-1} \left[x(n) + x(n + N/2)\right] W_{N/2}^{nk} \tag{2}$$

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[x(n) - x(n + N/2)\right] W_N^{n}\, W_{N/2}^{nk}. \tag{3}$$

The basic idea behind the SRFFT is the application of a radix-2 index map to the even-index terms and a radix-4 map to the odd-index terms. The even-index terms are decomposed as in (2). The odd-index terms are decomposed as

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left[x(n) - x(n + N/2) - j\left(x(n + N/4) - x(n + 3N/4)\right)\right] W_N^{n}\, W_{N/4}^{nk} \tag{4}$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left[x(n) - x(n + N/2) + j\left(x(n + N/4) - x(n + 3N/4)\right)\right] W_N^{3n}\, W_{N/4}^{nk} \tag{5}$$

where $k = 0, 1, \ldots, N/4 - 1$. The formulas above result in the L-shaped split-radix butterfly structure, which can be found in [2]; the scheduling of the L-shaped butterfly is irregular.

Assume that we have an $N = 2^S$ point FFT. Both SRFFT and radix-2 FFT require S passes to finish the computation, as shown in Figs. 1 and 2. For SRFFT, the total number of the L butterflies $N_{SR}$ is given by [2]

$$N_{SR} = \left[(3S - 2)\,2^{S-1} + (-1)^S\right]/9. \tag{6}$$
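As an illustrative check of the decomposition in (2), (4), and (5), the following minimal Python sketch applies it recursively and compares the result with a direct evaluation of (1). It is a software model only, not the proposed hardware; the function names `dft`, `srfft`, and `max_error` are hypothetical, and the sketch uses the twiddle factor $W_N^{3n}$ in the $X(4k+3)$ branch, which is what substituting the index $4k+3$ into (1) gives.

```python
import cmath

def dft(x):
    """Direct evaluation of the N-point DFT in (1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def srfft(x):
    """Recursive split-radix FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = N // 2, N // 4
    # Even-index outputs X(2k): one N/2-point transform, as in (2).
    even_in = [x[n] + x[n + half] for n in range(half)]
    # Odd-index outputs X(4k+1), X(4k+3): two N/4-point transforms, as in (4), (5).
    in1, in3 = [], []
    for n in range(quarter):
        a = x[n] - x[n + half]
        b = 1j * (x[n + quarter] - x[n + 3 * quarter])  # trivial factor j
        in1.append((a - b) * cmath.exp(-2j * cmath.pi * n / N))      # times W_N^n
        in3.append((a + b) * cmath.exp(-2j * cmath.pi * 3 * n / N))  # times W_N^{3n}
    X = [0j] * N
    X[0::2] = srfft(even_in)
    X[1::4] = srfft(in1)
    X[3::4] = srfft(in3)
    return X

def max_error(N):
    """Largest deviation between srfft and the direct DFT on a fixed test vector."""
    x = [complex((3 * n) % 7 - 3, (5 * n) % 11 - 5) for n in range(N)]
    return max(abs(u - v) for u, v in zip(srfft(x), dft(x)))
```

Calling `max_error(16)` should return a value at the level of floating-point rounding error, which indicates that the three branches together reproduce all N outputs of (1).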
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 9, SEPTEMBER 2016 3009
$$M_{R2} = 2^{S-1}(S - 1). \tag{10}$$
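The closed-form counts in (6) and (10) can be tabulated with a short Python sketch. The recurrence used as a cross-check for (6) is an assumption derived from the split-radix decomposition itself: an N-point SRFFT uses N/4 L butterflies in its first stage and then continues with one N/2-point and two N/4-point SRFFTs. Reading $M_{R2}$ as the radix-2 multiplication count follows the stated purpose of this section; all function names are hypothetical.

```python
def l_butterflies_closed_form(S):
    """Equation (6): number of L butterflies in a 2**S-point SRFFT."""
    return ((3 * S - 2) * 2 ** (S - 1) + (-1) ** S) // 9

def l_butterflies_recursive(N):
    """Count L butterflies by following the split-radix decomposition."""
    if N < 4:
        return 0  # a 1- or 2-point transform contains no L butterfly
    return (N // 4
            + l_butterflies_recursive(N // 2)
            + 2 * l_butterflies_recursive(N // 4))

def radix2_multiplications(S):
    """Equation (10): M_R2 = 2**(S-1) * (S-1)."""
    return 2 ** (S - 1) * (S - 1)

def count_table(max_S=10):
    """Rows of (N, N_SR from (6), M_R2 from (10)) for N = 4 .. 2**max_S."""
    return [(2 ** S, l_butterflies_closed_form(S), radix2_multiplications(S))
            for S in range(2, max_S + 1)]
```

For every S from 1 to 10 the recurrence and the closed form (6) agree, which is a useful sanity check on the signs and exponents in (6).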
III. HARDWARE IMPLEMENTATION

A. Shared-Memory Architecture

The architecture of the shared-memory processor is shown in Fig. 3. The FFT data and the twiddle factors are stored in the RAM and ROM banks, respectively. We observed that the flow graph of the split-radix algorithm is the same as that of radix-2 FFT except for the locations and values of the twiddle factors. Therefore, the conventional radix-2 FFT data address generation schemes can also be applied to SRFFT (RAM address generator). However, the mixed-radix property

C. Address Generation of Twiddle Factors

The flow graph for the 16-point SRFFT is shown in Fig. 2. In Fig. 2, there are two kinds of twiddle factors: $j$ and $W^n$. Multiplications involving $j$ are called trivial multiplications, because these operations essentially swap the real and imaginary parts of the operand (with one sign change), so no multiplier is involved. Multiplications involving $W^n$ are called nontrivial multiplications, because complex multipliers are used to complete these operations. In Fig. 2, each area surrounded by dashed lines is called one L block, which is formed by the L butterflies in each pass [8]; there are five L blocks in total for a 16-point SRFFT.
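The reason a multiplication by $j$ needs no multiplier can be seen in a minimal Python sketch that operates on separate real and imaginary parts, as a fixed-point datapath would. The function names are hypothetical and the sketch is a behavioral illustration only, not the butterfly unit of the proposed design.

```python
def mul_by_j(re, im):
    """(re + j*im) * j = -im + j*re: swap the parts, negate the new real part."""
    return -im, re

def mul_by_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: swap the parts, negate the new imaginary part."""
    return im, -re
```

Both functions use only a swap and a negation, whereas a general complex multiplication by $W^n$ needs real multipliers, which is why only the nontrivial case requires the complex multiplier to be active.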
TABLE I. ROM CONFIGURATION FOR A 16-POINT SRFFT
TABLE II. ADDRESS GENERATION TABLE OF THE PROPOSED ALGORITHM 1 FOR A 16-POINT SRFFT
TABLE III. IMPLEMENTATION RESULTS ON SPARTAN-6 FPGA
TABLE IV. COMPARISON OF EACH COMPONENT FOR A 1024-POINT FFT ON SPARTAN-6 FPGA
TABLE V. COMPARISON OF EACH COMPONENT FOR A 1024-POINT FFT USING gscl45-nm TECHNOLOGY AT 100 MHz AND 1.1 V

In the last pass, L_Flag is set to one and J_Flag is set to zero.

When a nontrivial multiplication is required, twiddle factors need to be retrieved from the ROM banks. Unlike the conventional method that stores all the $W^n$ in one ROM bank, we organize $W^n$ in two ROM banks: one stores $W^n$ for the upper leg of the butterfly unit and the other stores those for the lower leg of the butterfly unit. Table I shows the content of the two ROM banks for a 16-point SRFFT. Starting from pass 1, in the Pth pass the address of each ROM bank is given by

$$b_{S-2-P}\, b_{S-3-P} \cdots b_0\, 0 \cdots 0 \quad \text{(followed by } (P-1) \text{ zeros)}. \tag{12}$$

It is worth mentioning that in conventional implementations, the twiddle factors are required for each butterfly, so the ROM banks are always enabled. In our implementation, the L_Flag signal can be used as the enable signal for the ROM banks, since if the butterfly belongs to the L block, no multiplication is required. This could lead to further power reduction. Table II shows an example of the proposed algorithm for the 16-point SRFFT.

Both the address generation of FFT data and that of twiddle factors depend on a certain butterfly processing order in each pass. Other than Xiao's method [6] (Fig. 2), there are other methods of ordering the butterfly sequence, such as [5]. We have also developed address generation methods for this kind of butterfly sequence using ideas similar to those stated above. The details are not discussed here and we only give the conclusions. In each pass except for the last one, J_Flag equals

$$b_{S-2}. \tag{13}$$

The address for each ROM bank is given by

$$b_{S-2}\, b_{S-3} \cdots b_1. \tag{14}$$

IV. IMPLEMENTATION AND RESULTS

The proposed design is compared with the two conventional shared-memory architectures. Our two proposed address generation algorithms for the twiddle factors are similar and therefore, we only implement algorithm 1, which is within the ROM address generator. The address generation of RAM is based on [6] and the datapath width is 32 b. The three FFTs are synthesized under the constraint of 100 MHz in Xilinx ISE 14.7 targeting the Spartan-6 XC6SLX4 device. The simulation results are shown in Table III. Power is measured by the Xilinx XPower analyzer using the switching activity interchange format file recorded over a sufficiently long simulation time. For a 1024-point FFT, the proposed algorithm achieves over 20% lower power consumption while almost maintaining the static power consumption. In the given architecture, when the FFT size increases, a larger RAM and ROM size is required, but the butterfly unit does not change. The limitation of the proposed design is the large amount of resources used in the butterfly unit. This limitation could possibly be removed by using different butterfly structures for additions and multiplications. Table IV shows the power consumption of each component for a 1024-point FFT. The reduction of dynamic power consumption is due to the fact that multipliers and ROM banks are enabled only when necessary.

We have also synthesized the design using the OSU gscl45nm library in Cadence RTL Compiler. All three FFTs are able to run above 200 MHz. The library does not have a memory intellectual property, so the memories are constructed using basic cells and flip-flops. The implementation results of a 1024-point FFT are shown in Table V. A large number of cells are used to implement the memory banks, which become the most power-hungry component in the design.

Compared with the radix-2 addressing schemes in [5] and [6], our addressing method requires an additional $2^{S-1}$-bit memory. However, the SRFFT algorithm has an irregular signal flow graph, which makes the control of such processors more difficult than that of the fixed-radix ones. Although a software solution for the indexing problem has been given in [9], the indexing scheme is designed for the L butterfly structure, which is not suitable for hardware implementation due to its uneven latencies. Some previous works such as [10] use lookup tables to solve the indexing problem. The proposed algorithm requires significantly less memory than the lookup table approach.

Compared with a pipelined SRFFT architecture such as the split-radix single-path delay feedback (SRSDF) given in [11], the shared-memory architecture offers significantly reduced hardware cost and power consumption at the expense of slower throughput. For an N-point FFT, SRSDF requires $\log_4 N - 1$ multipliers and $4\log_4 N$ adders. In contrast, only two multipliers and two adders are used in
the proposed architecture. In addition, in order to arrange the different butterfly structures for different operations, SRSDF still needs to track the trivial and nontrivial multiplications, and its indexing scheme is much more complicated than the proposed one, since an additional encoding step (bit-inverse and bit-reverse) is applied to the butterfly sequences.

V. CONCLUSION

In this brief, a shared-memory-based SRFFT processor is proposed. The proposed method reduces the dynamic power consumption at the expense of more hardware resources. We also present two addressing schemes for both the trivial and nontrivial twiddle factors. Since SRFFT has the minimum number of multiplications compared with other types of FFT, the results could be more optimal in terms of floating-point operations.

REFERENCES

[1] P. Duhamel and H. Hollmann, "Split radix FFT algorithm," Electron. Lett., vol. 20, no. 1, pp. 14-16, Jan. 1984.
[2] M. A. Richards, "On hardware implementation of the split-radix FFT," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 10, pp. 1575-1581, Oct. 1988.
[3] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, "Hardware efficient mixed radix-25/16/9 FFT for LTE systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 221-229, Feb. 2015.
[4] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 5, pp. 312-316, May 1992.
[5] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 6, pp. 577-579, Dec. 1976.
[6] X. Xiao, E. Oruklu, and J. Saniie, "An efficient FFT engine with reduced addressing logic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 11, pp. 1149-1153, Nov. 2008.
[7] Z. Qian, N. Nasiri, O. Segal, and M. Margala, "FPGA implementation of low-power split-radix FFT processors," in Proc. 24th Int. Conf. Field Program. Logic Appl., Munich, Germany, Sep. 2014, pp. 1-2.
[8] A. N. Skodras and A. G. Constantinides, "Efficient computation of the split-radix FFT," IEE Proc. F-Radar Signal Process., vol. 139, no. 1, pp. 56-60, Feb. 1992.
[9] H. V. Sorensen, M. T. Heideman, and C. S. Burrus, "On computing the split-radix FFT," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 1, pp. 152-156, Feb. 1986.
[10] J. Kwong and M. Goel, "A high performance split-radix FFT with constant geometry architecture," in Proc. Design, Autom. Test Eur. Conf. Exhibit. (DATE), Dresden, Germany, Mar. 2012, pp. 1537-1542.
[11] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 864-874, Mar. 2003.