A Simple Circular-Shift Network
Abstract—There is an increasing need for configurable quasi-cyclic low-density parity-check (QC-LDPC) decoders that can support a family of structurally compatible codes instead of a single code. The key component in a configurable QC-LDPC decoder is a programmable circular-shift network that supports cyclic shifts of any size up to a predefined maximum submatrix size. This paper presents a QC-LDPC shift network (QSN), which has two key advantages over state-of-the-art solutions in recent literature. First, the QSN reduces the number of stages in the critical path, which improves the clock frequency and makes it scalable, particularly in a field-programmable gate array (FPGA)-based implementation where the interconnect delay is dominant. Second, the QSN’s control logic is simple to generate and occupies a significantly smaller area. The QSNs for a variety of codes suitable for emerging applications are implemented, targeting both a 180-nm Taiwan Semiconductor Manufacturing Company Ltd. complementary metal–oxide–semiconductor library and a Xilinx Virtex 4 FPGA. The proposed implementation is shown to be 2.1 times faster than the best known implementation in literature and requires almost eight times less control area. Furthermore, this paper presents analytical models of the critical-path and datapath complexity for arbitrary-sized submatrices and proves that the QSN indeed generates all the output combinations required for implementing reconfigurable QC-LDPC decoders.

Index Terms—Benes network, error correction codes, quasi-cyclic low-density parity-check (QC-LDPC) codes, very large scale integration, WiFi, WiMAX.

I. INTRODUCTION

the parity-check matrices for the QC-LDPC codes can be partitioned blockwise or CPM-wise and implemented with partially parallel decoder architectures [2], [3], which achieve an efficient tradeoff between very-large-scale-integration complexity and decoding throughput. For a given code, the interconnect network between the CNUs and the VNUs is predetermined; therefore, it can be optimized for that code. However, emerging applications such as 802.11n, 802.16e, and DVB-S2 need decoders that work for a set of codes, not just a single code. A decoder that can implement a set of structurally compatible codes is called a reconfigurable (or sometimes just configurable or flexible) decoder. A reconfigurable QC-LDPC decoder requires a programmable shift network to accommodate different submatrix sizes, code rates, and block lengths.

The field-programmable gate array (FPGA)-based emulation of LDPC decoders is widely used to design and optimize LDPC codes. However, the design and the optimization of an efficient LDPC decoder on an FPGA are formidable and time-consuming tasks. A single reconfigurable decoder that can operate for a family of related codes [1] is highly desirable. Such a reconfigurable decoder once again requires a programmable shift network to implement different codes that are derived from a mother code through algebraic transformations, such as masking, row, and column decomposition [1]. The dominance of the interconnect (or routing delay) makes the design of a fast and scalable circular-shift network even more challenging on an FPGA. An architecture that minimizes the interconnect
$\{O[0], O[1], \ldots, O[P_M - 1]\}$ be the input and output messages of the QSN, respectively. When $p = P_M$, $O[i] = I[(i + c) \bmod p]$ for all $0 \le i \le P_M - 1$. When $p < P_M$, the input messages $\{I[p], I[p + 1], \ldots, I[P_M - 1]\}$ and the output messages $\{O[p], O[p + 1], \ldots, O[P_M - 1]\}$ would be ineffective, and the related ports would not be used, as shown in Fig. 1(b). The other ports maintain the circular-shift property, i.e., $O[i] = I[(i + c) \bmod p]$ for $0 \le i \le p - 1$. Overall, the number of output combinations for the QSN network would be $\sum_{p=2}^{P_M} (p - 1) + 1$ since each input size $p$ has $p - 1$ shifted combinations, and when $c = 0$, the combinations for all $p$ are the same.

In the succeeding subsections, we propose a low-complexity switch network design for the configurable LDPC decoders, which can be implemented with a small area. Also, an efficient algorithm and its hardware implementation to generate all the control signals of the proposed switch network are discussed.

A. Overall Architecture

The output of the QSN can be divided into two parts: 1) the left part, i.e., for $0 \le i < p - c$, $O[i] = I[i + c]$; 2) the right part, i.e., for $p - c \le i < p$, $O[i] = I[i - (p - c)]$. Based on this observation, the output of the QSN can be viewed as the combination of two shifted arrays of the inputs. Thus, the generation of the circular-shifted array has three steps, which are listed below:

Step 1) Left shift: Generate the left part of the final output messages by performing a left-shift operation on the array; let $L[i]$ be the left-shift output, then $L[i] \leftarrow I[i + c]$.
Step 2) Right shift: Generate the right part of the final output messages by performing a right-shift operation on the array; let $R[i]$ be the right-shift output, then $R[i] \leftarrow I[i - (p - c)]$.
Step 3) Merge: Extract the useful part from the left-shift output and the right-shift output. When $0 \le i < (p - c)$, $O[i] \leftarrow L[i]$; when $(p - c) \le i < p$, $O[i] \leftarrow R[i]$.

Steps 1 and 2 are independent and, thus, can be performed in parallel. Step 3 depends on the outputs of steps 1 and 2. The overall architecture is shown in Fig. 2.

As shown in Fig. 2, the QSN has three components, i.e., the left-shift network, the right-shift network, and the merge network, which correspond to steps 1, 2, and 3 described

$$N_{\text{left}} = \sum_{i=0}^{b-1} \left(P_M - 2^i\right) = bP_M - \sum_{i=0}^{b-1} 2^i = bP_M - 2^b + 1. \tag{2}$$

Similarly, the total number of multiplexors $N_{\text{right}}$ in the right-shift network is equal to $bP_M - 2^b + 1$.

The merge network chooses the proper output messages from $L[0, 1, \ldots, P_M - 1]$ and $R[0, 1, \ldots, P_M - 1]$, based on a $(P_M - 1)$-bit control signal $m$. $m[i]$ corresponds to the multiplexor whose inputs are $L[i]$ and $R[i]$. When $m[i] = 0$, $O[i] \leftarrow R[i]$; otherwise, $O[i] \leftarrow L[i]$. $R[P_M - 1]$ is routed directly to $O[P_M - 1]$. The total number of multiplexors in the merge network is denoted by $N_{\text{merge}}$ and is equal to $P_M - 1$. Therefore, the total number of 2:1 multiplexors (width equal to the input-message width) is given by

$$\begin{aligned} N_{\text{total}} &= N_{\text{left}} + N_{\text{right}} + N_{\text{merge}} \\ &= 2\left(bP_M - 2^b + 1\right) + (P_M - 1) \\ &= (2b + 1)P_M - 2^{b+1} + 1. \end{aligned} \tag{3}$$

Fig. 3 shows an example of an 11 × 11 QSN and its output when $c = 7$ and $p = 11$. The $p - c = 4$ effective outputs of the left-shift network $\{I_7, I_8, I_9, I_{10}\}$ and the $c = 7$ effective outputs of the right-shift network $\{I_0, I_1, I_2, I_3, I_4, I_5, I_6\}$ are reorganized at the merge network and generate the final output, i.e., $\{I_7, I_8, I_9, I_{10}, I_0, I_1, I_2, I_3, I_4, I_5, I_6\}$. The number of multiplexors used is 68, which agrees with the calculation shown in (3). Furthermore, the number of stages in our design is only five. In contrast, the OPN would need a seven-stage network with 72 multiplexors to implement the same functionality, as shown in Fig. 4.

B. Proof of Correctness

Next, we prove that the QSN implementation can actually generate all the required output combinations, given by $O[i] = I[(i + c) \bmod p]$ for all $0 \le i < p$. The output of the left-shift network is $L[i] = I[i + c]$ for $0 \le i < (p - c)$. The output of the right-shift network is $R[i] = I[i - (p - c)]$ for $(p - c) \le i < p$. Based on our merge-network control-signal generating function, the final output would be $O[i] = I[i + c]$ for $0 \le i < (p - c)$ and $O[i] = I[i - (p - c)]$ for $(p - c) \le i < p$. Since $I[(i + c) \bmod p] = I[i + c]$ when $0 \le i < (p - c)$ and $I[(i + c) \bmod p] = I[i + c - p] = I[i - (p - c)]$ when $(p - c) \le i < p$, the final output equals $I[(i + c) \bmod p]$ on both ranges. Hence, it is proven.
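The three steps and the multiplexor count can be checked with a small behavioral model. This is a minimal sketch, not the authors' hardware description: the function names `qsn_shift` and `mux_count` are hypothetical, unused ports are represented by `None`, and `b` is assumed to be the ceiling of log2 of the maximum submatrix size, which is consistent with the 11 × 11 example (b = 4, 68 multiplexors) but is not stated in this excerpt.

```python
def qsn_shift(inputs, p, c):
    """Behavioral model of the left-shift, right-shift, and merge steps.

    inputs has P_M entries; only the first p ports are active.
    Ports p..P_M-1 are unused and are returned as None.
    """
    pm = len(inputs)
    if not (1 <= p <= pm and 0 <= c < p):
        raise ValueError("need 1 <= p <= P_M and 0 <= c < p")

    # Step 1: left shift, L[i] = I[i + c] (defined while i + c is in range).
    left = [inputs[i + c] if i + c < pm else None for i in range(pm)]
    # Step 2: right shift, R[i] = I[i - (p - c)] (defined once i >= p - c).
    right = [inputs[i - (p - c)] if i >= p - c else None for i in range(pm)]
    # Step 3: merge, L for 0 <= i < p - c and R for p - c <= i < p.
    return [
        (left[i] if i < p - c else right[i]) if i < p else None
        for i in range(pm)
    ]


def mux_count(pm):
    """Total 2:1 multiplexors, summed stage by stage as in (2) and (3)."""
    # Assumption: b = ceil(log2(P_M)), computed with integer arithmetic.
    b = (pm - 1).bit_length()
    n_shift = sum(pm - 2**i for i in range(b))  # N_left, equal to N_right
    n_merge = pm - 1
    total = 2 * n_shift + n_merge
    # Cross-check against the closed form in (3).
    assert total == (2 * b + 1) * pm - 2 ** (b + 1) + 1
    return total


if __name__ == "__main__":
    messages = [f"I{k}" for k in range(11)]
    print(qsn_shift(messages, p=11, c=7))
    print(mux_count(11))  # 68
```

With `p = 11` and `c = 7`, the model returns the order I7, I8, I9, I10, I0, ..., I6 described for Fig. 3. Looping over every valid pair `(p, c)` and comparing against `inputs[(i + c) % p]` gives an exhaustive check of the proof for a given maximum size.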
CHEN et al.: QSN—A SIMPLE CIRCULAR-SHIFT NETWORK FOR RECONFIGURABLE QUASI-CYCLIC LDPC DECODERS 785
TABLE I
ALGORITHM TO GENERATE CONTROL SIGNALS IN PSEUDO VERILOG
Fig. 5. Circuit for generating the control signals of the merge network for the
11 × 11 QSN.
TABLE II
HARDWARE COMPLEXITY COMPARISONS FOR CONFIGURABLE SWITCH NETWORKS (8-BIT WORD LENGTH)
TABLE III
FPGA RESULTS ON VIRTEX 4 LX160-10 AFTER PLACE AND ROUTING (8-BIT WORD LENGTH). NETWORK ONLY, NO CONTROLLER
TABLE IV
FPGA RESULTS ON VIRTEX 4 LX160-10 FOR A CONFIGURABLE DECODER SUPPORTING THE IEEE 802.11n AND 802.16e STANDARDS

standard-cell complementary metal–oxide–semiconductor library. Table II compares the datapath and the controller area and clock frequency of the QSN with the OPN (which represents the state of the art in the published research literature) and a more conventional Benes-network-based implementation. Note that the control complexity, as quantified by the area required to implement the control logic, is almost a factor of 8 smaller than that required by the OPN (0.015 versus 0.114 mm^2). Also, because of the reduced critical-path delay, the clock frequency of our implementation is 200 MHz, as compared with the 94 MHz reported for the OPN in [13].

As mentioned before, our goal is to use the QSN to build a single decoder to emulate a family of complex QC-LDPC codes. Therefore, we also mapped the design to an FPGA and compared our results with the approach described in [13], as shown in Table III. Based on our design scheme, we designed a reconfigurable decoder for the IEEE 802.11n and 802.16e standards ($P_M = 96$) and present the results in Table IV.

V. CONCLUSIONS

We have presented the architecture and the implementation of the QSN, a simple circular-shift network that can be used to implement reconfigurable QC-LDPC decoders efficiently. Unlike existing solutions to this problem in the research literature, the QSN has not been derived from a Benes topology and has hence resulted in simpler control logic and fewer stages in the critical path. Consequently, the proposed network is suitable for both the ASIC and FPGA implementation of low-overhead and fast circular-shift networks for decoding QC-LDPC codes.

ACKNOWLEDGMENT

The authors would like to thank D. Truong for helping with the TSMC ASIC library and synthesis.

REFERENCES

[1] L. Lan, L. Zeng, Y. Tai, L. Chen, S. Lin, and K. Abdel-Ghaffar, “Construction of quasi-cyclic LDPC codes for AWGN and binary erasure channels: A finite field approach,” IEEE Trans. Inf. Theory, vol. 53, no. 7, pp. 2429–2458, Jul. 2007.
[2] Y. Chen and K. Parhi, “Overlapped message passing for quasi-cyclic low-density parity check codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 6, pp. 1106–1113, Jun. 2004.
[3] Y. Dai, Z. Yan, and N. Chen, “Optimal overlapped message passing decoding of quasi-cyclic LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 5, pp. 565–578, May 2008.
[4] G. Masera, F. Quaglio, and F. Vacca, “Implementation of a flexible LDPC decoder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 6, pp. 542–546, Jun. 2007.
[5] M. Karkooti, P. Radosavljevic, and J. Cavallaro, “Configurable LDPC decoder architectures for regular and irregular codes,” J. Signal Process. Syst., vol. 53, no. 1, pp. 73–88, 2008.
[6] J. Tang, T. Bhatt, V. Sundaramurthy, and K. Parhi, “Reconfigurable shuffle network design in LDPC decoders,” in Proc. Int. Conf. Appl.-Specific Syst. Archit. Process., Sep. 2006, pp. 81–86.
[7] K. Gunnam, G. Choi, M. Yeary, and M. Atiquzzaman, “VLSI architectures for layered decoding for irregular LDPC codes of WiMax,” in Proc. IEEE ICC, 2007, pp. 4542–4547.
[8] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A synthesizable IP core for WIMAX 802.16E LDPC code decoding,” in Proc. IEEE Int. Symp. Pers., Indoor, Mobile Radio Commun., Sep. 2006, pp. 1–5.
[9] M. Rovini, G. Gentile, and L. Fanucci, “Multi-size circular shifting networks for decoders of structured LDPC codes,” Electron. Lett., vol. 43, no. 17, pp. 938–940, Aug. 2007.
[10] C.-H. Liu, C.-C. Lin, H.-C. Chang, C.-Y. Lee, and Y. Hsua, “Multi-mode message passing switch networks applied for QC-LDPC decoder,” in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 752–755.
[11] D. Oh and K. Parhi, “Area efficient controller design of barrel shifters for reconfigurable LDPC decoders,” in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 240–243.
[12] J. Lin, Z. Wang, L. Li, J. Sha, and M. Gao, “Efficient shuffle network architecture and application for WiMAX LDPC decoders,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 3, pp. 215–219, Mar. 2009.
[13] D. Oh and K. Parhi, “Low-complexity switch network for reconfigurable LDPC decoders,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 85–94, Jan. 2010.
[14] C. Liu, C. Lin, S. Yen, C. Chen, H. Chang, C. Lee, Y. Hsu, and S. Jou, “Design of a multimode QC-LDPC decoder based on shift-routing network,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 9, pp. 734–738, Sep. 2009.
[15] V. Benes, “Optimal rearrangeable multistage connecting networks,” Bell Syst. Tech. J., vol. 43, no. 7, pp. 1641–1656, 1964.
[16] Z. Li, L. Chen, L. Zeng, S. Lin, and W. Fong, “Efficient encoding of quasi-cyclic low-density parity-check codes,” IEEE Trans. Commun., vol. 53, no. 11, p. 1973, Nov. 2005.