
VLSI Decoder Architecture For High Throughput, Variable Block-Size and Multi-Rate LDPC Codes

This document describes a VLSI decoder architecture presented at the IEEE International Symposium on Circuits and Systems (ISCAS'07) in May 2007. The architecture supports variable block sizes from 360 to 4200 bits and multiple code rates between 1/4 and 9/10 for low-density parity-check (LDPC) codes. It is based on structured quasi-cyclic LDPC codes and addresses the hardware complexity of supporting variable block sizes and code rates. The overall decoder implemented on a 0.13-micron CMOS technology has a core area of 4.5 square millimeters and can achieve an average throughput of 1 Gbps at 2.2 dB SNR.

Copyright
© Attribution Non-Commercial (BY-NC)

IEEE International Symposium on Circuits and Systems (ISCAS'07).

May 2007

VLSI Decoder Architecture for High Throughput, Variable Block-size and Multi-rate LDPC Codes
Yang Sun, Marjan Karkooti and Joseph R. Cavallaro
Department of Electrical and Computer Engineering
Rice University, Houston, TX 77005
Email: {ysun, marjan, cavallar}@rice.edu

Abstract: A low-density parity-check (LDPC) decoder architecture that supports variable block sizes and multiple code rates is presented. The proposed architecture is based on the structured quasi-cyclic (QC-LDPC) codes whose performance compares favorably with that of randomly constructed LDPC codes for short to moderate block sizes. The main contribution of this work is to address the variable block-size and multi-rate decoder hardware complexity that stems from the irregular LDPC codes. The overall decoder, which was synthesized, placed and routed on TSMC 0.13-micron CMOS technology with a core area of 4.5 square millimeters, supports variable code lengths from 360 to 4200 bits and multiple code rates between 1/4 and 9/10. The average throughput can achieve 1 Gbps at 2.2 dB SNR.

I. INTRODUCTION

Low-density parity-check (LDPC) codes have received tremendous attention in the coding community because of their excellent error correction capability and near-capacity performance. Some randomly constructed LDPC codes, measured in Bit Error Rate (BER), come very close to the Shannon limit for the AWGN channel (within 0.05 dB) with iterative decoding and very long block sizes (on the order of $10^6$ to $10^7$). However, for many practical applications (e.g. packet-based communication systems), shorter and variable block-size LDPC codes with good Frame Error Rate (FER) performance are desired. Communications in packet-based wireless networks usually involve a large per-frame overhead including both the physical (PHY) layer and MAC layer headers. As a result, the design for a reliable wireless link often faces a trade-off between channel utilization (frame size) and error correction capability. One solution is to use adaptive burst profiles in which transmission parameters relevant to modulation and coding may be assigned dynamically on a burst-by-burst basis. Therefore, LDPC codes with variable block lengths and multiple code rates for different quality-of-service under various channel conditions are highly desired.

In the recent literature, there are many LDPC decoder architectures but few of them support variable block-size and multi-rate decoding. For example, in [1] a 1 Gbps 1024-bit, rate 1/2 LDPC decoder has been implemented. However, this architecture supports just one particular LDPC code by wiring the whole Tanner graph into hardware. In [2], a code rate programmable LDPC decoder is proposed, but the code length is still fixed to 2048 bits for simple VLSI implementation. In [3], an LDPC decoder that supports three block sizes and four code rates is designed by storing 12 different parity check matrices on-chip. As we can see, the main design challenge for supporting variable block sizes and multiple code rates stems from the random or unstructured nature of the LDPC codes. Generally, support for different block sizes of LDPC codes would require different hardware architectures. To address this problem, we propose a generalized decoder architecture based on the quasi-cyclic LDPC (QC-LDPC) codes that can support a wider range of block sizes and code rates at a low hardware requirement.

II. STRUCTURED QC-LDPC CODES

[Fig. 1. Parity check matrix and its factor graph representation. (a) B x D seed matrix. (b) BP x DP generated PCM, obtained by expanding the seed matrix by P and organized as layers 0 to B-1; each block is either a P x P identity matrix cyclically shifted by x or a zero matrix. (c) Factor graph representation of a BP x DP PCM: check node clusters (size P) and variable node clusters (size P) exchange check node messages and variable node messages through a permutation (shift) network.]

To balance the implementation complexity and the decoding throughput, a structured QC-LDPC code was proposed in [4] recently for modern wireless communication systems including but not limited to IEEE 802.16e and IEEE 802.11n. As shown in Fig. 1(a)(b), for a QC-LDPC code, the parity check matrix (PCM) is constructed from a B × D seed matrix by replacing each '1' in the seed matrix with a P × P cyclically shifted identity sub-matrix, where P is an expansion factor. A corresponding Tanner factor graph representation of this BP × DP generated PCM is shown in Fig. 1(c). It divides
the variable nodes and the check nodes into clusters of size P such that if there exists an edge between variable and check clusters, then it means P variable nodes connect to P check nodes via a permutation (cyclic shift) network.

Generally, support for different block sizes and code rates implies usage of multiple PCMs. Storing all the PCMs on-chip is expensive and close to impractical. In this work, we utilize the expansion factor P so that only one parity check matrix needs to be stored for a given seed matrix. Denoting by $P_0$ the largest possible expansion factor for a given seed matrix, we can construct a QC-LDPC code as a B × D array of $P_0 \times P_0$ sub-matrices, which are either zero matrices or cyclically shifted identity matrices. The corresponding shift values, denoted as $m(P_0, i, j)$, are stored on-chip. For all the other expansion factors $P_x$, the shift values are derived from $m(P_0, i, j)$ by:

$$m(P_x, i, j) = \left\lfloor \frac{m(P_0, i, j) \cdot P_x}{P_0} \right\rfloor. \tag{1}$$

With the help of the expansion factor $P_x$ ($P_x < P_0$), we are able to generate PCMs of different sizes from the same seed matrix to support codes of different sizes. However, to support different code rates, different seed matrices as well as the shift values associated with each seed matrix must be constructed. Table I presents the seed matrices needed to support different code rate requirements. Each seed matrix can be constructed by an algebraic construction method proposed by Tanner in [4].

TABLE I
CODE RATE VERSUS SEED MATRIX

Rate   Hseed      Rate   Hseed      Rate   Hseed
1/4    18 × 24    3/5    10 × 25    5/6    4 × 24
1/3    16 × 24    2/3    8 × 24     7/8    3 × 24
2/5    15 × 25    3/4    6 × 24     8/9    3 × 27
1/2    12 × 24    4/5    5 × 25     9/10   3 × 30

III. LDPC DECODER HARDWARE ARCHITECTURE

A. Layered partially parallel soft decoding algorithm

A good tradeoff between design complexity and decoding throughput is partially parallel decoding by grouping a certain number of variable and check nodes into a cluster for parallel processing. Furthermore, the layered decoding algorithm [5] can be applied to improve the decoding convergence time by a factor of two and hence increases the throughput by 2X. The structured QC-LDPC code is well suited to efficient VLSI implementation because it significantly simplifies the memory access and message passing. As shown in Fig. 1(b), the PCM can be viewed as a group of concatenated horizontal layers, where the column weight is at most 1 in each layer due to the cyclic shift structure. The belief propagation algorithm is repeated for each horizontal layer and the updated APP (a posteriori probability) messages are passed between layers. Let $R_{ij}$ denote the check node LLR (log-likelihood ratio) messages sent from the check node i to the variable node j. Let $L(q_{ij})$ denote the variable node LLR messages sent from the variable node j to the check node i. Let $L(q_j)$ ($j = 1, \ldots, N$) represent the APP messages for all the variable nodes (coded bits), which are initialized with the channel messages (assuming BPSK on AWGN channel) for each code bit j by $2 r_j / \sigma^2$, where $\sigma^2$ is the noise variance and $r_j$ is the received value. For each variable node j inside the current horizontal layer, messages $L(q_{ij})$ that correspond to a particular check equation i are computed according to:

$$L(q_{ij}) = L(q_j) - R_{ij}. \tag{2}$$

For each check node i, messages $R_{ij}$, corresponding to all variable nodes j that participate in a particular parity-check equation, are computed according to:

$$R_{ij} = \prod_{j' \in N(i) \setminus \{j\}} \mathrm{sign}\big(L(q_{ij'})\big) \, \Psi\!\left( \sum_{j' \in N(i) \setminus \{j\}} \Psi\big(L(q_{ij'})\big) \right), \tag{3}$$

where $N(i)$ is the set of all variable nodes from parity-check equation i, and $\Psi(x) = -\log\left[\tanh\left(\frac{|x|}{2}\right)\right]$. The APP messages in the current horizontal layer are updated by:

$$L(q_j) = L(q_{ij}) + R_{ij}. \tag{4}$$

For QC-LDPC codes, the parity check matrix can be viewed as a structure of B row clusters × D column clusters by grouping variable nodes and check nodes into clusters of size P. Now letting i and j denote the row cluster index and the column cluster index, a layered partially parallel decoding algorithm is given by:

    for iter = 0 : max_iteration − 1
      for layer (row cluster) i = 0 : B − 1
        for column cluster j = 0 : D − 1
          if PCM_{i,j} is a non-zero sub-matrix {
            Read a cluster of APP data L(q_j) from APP memory
            Read a cluster of Check data R_ij from Check memory
            Calculate shift value from (1) and permute APP data
            Calculate equations (2), (3), (4)
            Update new APP and Check data to memory
          }

where the decoding will stop whenever all the parity check constraints are satisfied or the maximum number of iterations is reached.

B. Min-sum algorithm and fixed-point implementation

The belief propagation algorithm [6] is the most powerful iterative soft decoding algorithm for LDPC codes. But due to the high design complexity of (3), many implementations for decoding LDPC codes are based on the modified (normalized or offset) min-sum algorithm because of its satisfactory performance and simple implementation [7]. By applying the offset min-sum algorithm, equation (3) is reduced to:

$$R_{ij} \approx \prod_{j' \in N(i) \setminus \{j\}} \mathrm{sign}\big(L(q_{ij'})\big) \times \max\left( \min_{j' \in N(i) \setminus \{j\}} \big|L(q_{ij'})\big| - \beta, \; 0 \right). \tag{3'}$$
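The row update defined by (2), (3') and (4) can be sketched in a few lines of Python. This is a floating-point illustration for a single parity-check row, not the fixed-point hardware datapath; the function name `offset_min_sum_row` and the default offset `beta=0.5` are assumptions made for the example. It uses the common two-minimum shortcut (the PE figure also labels `min1` and `min2` outputs): the minimum over all other edges equals the second-smallest magnitude for the edge that holds the smallest one, and the smallest magnitude for every other edge.

```python
def offset_min_sum_row(app, r_old, beta=0.5):
    """Update one parity-check row using (2), (3') and (4).

    app   : APP values L(q_j) of the variable nodes in this row (degree >= 2)
    r_old : check messages R_ij from the previous iteration, same order
    beta  : offset; 0.5 is an assumed illustrative value, not from the paper
    Returns (new APP values, new check messages).
    """
    # (2): variable-to-check messages L(q_ij) = L(q_j) - R_ij
    q = [a - r for a, r in zip(app, r_old)]
    mags = [abs(x) for x in q]

    # Product of all signs in the row; zero is treated as positive
    sign_prod = 1
    for x in q:
        if x < 0:
            sign_prod = -sign_prod

    # Smallest and second-smallest magnitudes in the row
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for k, m in enumerate(mags) if k != min1_idx)

    r_new = []
    for k, x in enumerate(q):
        # (3'): minimum over the other edges, reduced by beta, floored at 0
        mag = max((min2 if k == min1_idx else min1) - beta, 0.0)
        # Multiplying by the edge's own sign removes it from the product
        sgn = sign_prod * (-1 if x < 0 else 1)
        r_new.append(sgn * mag)

    # (4): L(q_j) = L(q_ij) + R_ij
    app_new = [qk + rk for qk, rk in zip(q, r_new)]
    return app_new, r_new
```

In a layered schedule this function would be applied to every row of a layer, with the updated APP values written back before the next layer is processed.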
For logic circuit design with finite precision, we consider the received values to be quantized in the range of $[-Z, Z]$ and represented by W quantization bits. With a properly chosen offset value $\beta$ and Z, a 6-bit quantized min-sum algorithm exhibits only about 0.1 dB of degradation in performance compared with the unquantized standard BP algorithm [7].

C. Partially parallel decoder architecture

[Fig. 2. Top level decoder architecture. Labeled blocks: APP Memory (D x PW bits), CHECK Memory (E x PW bits), Addr Gen, PCMs, Shift CTRL, Flexible Permuter, and processing engines PE 1 to PE P with APP/CHECK data inputs and outputs; L(qj)_new and Rij_new from the PEs are written back to the memories.]

Fig. 2 shows the block diagram of the decoder architecture based on the layered partially parallel decoding algorithm. In each sub-iteration, a cluster of APP messages and check messages are fetched from APP and Check memory, and then the APP messages are passed through a flexible permuter to be routed to the correct Processing Engines (PEs) for updating new APP messages and check messages. The PEs are the central processing units of the architecture that are responsible for updating messages based on (2), (3') and (4). The number of PEs determines the parallelism factor of the design. For a given block-size code, only $P_x$ PEs are working while the rest are in a power saving mode. As shown in Fig. 3, the PE takes as input $w_r$ elements of $L(q_j)$ and $R_{ij}$, where $w_r$ is the number of nonzero values in each row of the PCM. $L(q_{ij})$ is calculated based on (2). The sign and magnitude of $L(q_{ij})$ are processed based on (3') to generate new $R_{ij}$. Then the $L(q_{ij})$ are added to the $R_{ij}$ to generate new $L(q_j)$ ($w_r$ of them) based on (4). The outputs ($L(q_j)$ and $R_{ij}$) of all the $P_x$ PEs are concatenated and stored in one address of the APP and Check memories. Each layer's sub-iteration takes about $2 w_r$ clock cycles to process, so the decoding throughput is:

$$\text{Throughput} \approx \frac{D \times P_x \times R \times f_{\text{clk\_max}}}{2 \times E \times \text{iterations}}$$

where R is the code rate and E is the total number of edges between all variable nodes and check nodes in the seed matrix. Clearly, the throughput is linearly proportional to the expansion factor $P_x$ for a given seed matrix.

[Fig. 3. Processing Engine (PE). Labeled elements: ABS, FMIN (outputs min1 and min2), sign XOR logic with a DFF, and a FIFO holding L(qij); inputs L(qj) and Rij, outputs L(qj)_new and Rij_new.]

D. Flexible permuter

One of the main challenges of the LDPC decoder architecture is the permuter ($\pi$) design that is responsible for routing the messages between variable nodes and check nodes. However, for QC-LDPC codes, the permuter is just a barrel shifter network (size-P) for cyclically shifting the node messages to the correct PEs. Fig. 4 gives an example of a size-4 barrel shifter network. The hardware design complexity of this type of network is $O(P \lceil \log_2 P \rceil)$ as compared to $O(P^2)$ for the directly connected network. For large size P (e.g. 128), the barrel shifter network needs to be partitioned into multiple pipeline stages for high speed VLSI implementation.

Traditionally a de-permuter ($\pi^{-1}$) would be needed to permute the shuffled data back and save it to memory, which would occupy a significant portion of the chip area [2]. However, due to the cyclic shift property of the QC-LDPC codes, no de-permuter is needed. We can just store the shuffled data back to memory, and for the next iteration we then shift this "shuffled data" by an incremental value $\Delta = (\text{shift}_n - \text{shift}_{n-1}) \bmod P$.

[Fig. 4. A 4 × 4 barrel shifter network built from two rows of switches. Example shown: inputs 0, 1, 2, 3 barrel shifted by 1 give outputs 3, 0, 1, 2.]

E. Pipelined decoding for higher throughput

[Fig. 5. Pipelined decoding. (a) Two layer pipelined decoding: each layer goes through a Read/Min-sum stage followed by a Write back stage, and layer i+1 overlaps with layer i. (b) Two adjacent layers of the matrix over column indices 0 to 5: layer i has nonzero entries at indices 0, 2, 3, 5 and layer i+1 at indices 1, 3, 4. (c) Pipelining data hazard over clock cycles 0 to 13 (R = Read, W = Write, ST = Stall): layer i performs R0 R2 R3 R5 W0 W2 W3 W5; layer i+1 performs R1 ST ST R3 R4 W1 W3 W4, with two memory read stalls due to data dependency.]

The decoding throughput can be further improved by overlapping the decoding of two layers using a pipelined method. The decoding of each layer of the parity check matrix is
performed in two stages: 1) Memory read and min-sum calculation and 2) Memory write back. However, due to the possible data dependence between two consecutive layers (there is no data dependency inside each layer because the column weight is at most 1 in each layer), a pipelining data hazard might occur. Fig. 5 shows an example of pipelined decoding. In Fig. 5(c), at clock cycle 6, layer (i + 1) is trying to access APP memory address 3 which will not be updated by layer i until clock cycle 7, hence two pipeline stalls need to be inserted. Moreover, a horizontal rescheduling algorithm can also be applied to help reduce pipeline stalls. For example, in Fig. 5, layer (i + 1)'s reading can be rescheduled from the original sequence 1-3-4 to 1-4-3 to reduce pipeline stalls. This way, the decoding throughput will be increased to

$$\text{Pipelined Throughput} \approx \frac{D \times P_x \times R \times f_{\text{clk\_max}}}{E \times \text{iterations}}$$
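To make the two throughput expressions concrete, the short Python sketch below evaluates both for one operating point. Only the 24 seed-matrix columns of the rate-1/2 entry in Table I and the 350 MHz clock in Table II come from this paper; the expansion factor, the edge count E and the iteration count are hypothetical values chosen for illustration, so the printed numbers are not reported results. The function name `decoder_throughput` is likewise an assumption.

```python
def decoder_throughput(d_cols, p_x, rate, f_clk_hz, edges, iterations, pipelined=False):
    """Approximate information throughput in bits per second.

    Non-pipelined: D * Px * R * f_clk / (2 * E * iterations)
    Pipelined:     D * Px * R * f_clk / (E * iterations)
    Pipeline stalls are ignored in both cases.
    """
    cycles_per_edge = 1 if pipelined else 2
    return d_cols * p_x * rate * f_clk_hz / (cycles_per_edge * edges * iterations)


D = 24           # seed-matrix columns for the 12 x 24, rate-1/2 seed (Table I)
P_X = 100        # assumed expansion factor (block size D * P_X = 2400 bits)
RATE = 0.5       # code rate of that seed matrix
F_CLK = 350e6    # clock frequency listed in Table II
EDGES = 80       # assumed number of nonzero entries in the seed matrix
ITERATIONS = 10  # assumed number of decoding iterations

plain = decoder_throughput(D, P_X, RATE, F_CLK, EDGES, ITERATIONS)
piped = decoder_throughput(D, P_X, RATE, F_CLK, EDGES, ITERATIONS, pipelined=True)
print(f"non-pipelined: {plain / 1e6:.1f} Mbps, pipelined: {piped / 1e6:.1f} Mbps")
```

Under these assumptions, pipelining doubles the estimate because the write-back of one layer overlaps with the read and min-sum stage of the next; any stalls from data hazards would reduce the gain.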
IV. PHYSICAL VLSI DESIGN

A flexible LDPC decoder which supports variable block sizes from 360 to 4200 bits in fine steps, where the step size can be 24 (at rate 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 and 7/8), or 25 (at rate 2/5, 3/5 and 4/5), or 27 (at rate 8/9), or 30 (at rate 9/10), was described in Verilog HDL. Layout was generated for a TSMC 0.13µm CMOS technology as shown in Fig. 6.

[Fig. 6. Flexible LDPC decoder VLSI layout (0.13µm). Labeled blocks: Check Memory, APP Memory, Permuter, PEs, CTRL, PCM Memory, Glue Logic.]

V. PERFORMANCE ANALYSIS AND COMPARISON

Fig. 7 shows the FER performance and compares the two cases that also exist in the IEEE 802.11n (WWiSE Proposal) codes. Table II compares this decoder with the state-of-the-art LDPC decoders of [1] and [2]. As we can see, the proposed decoder shows significant performance in throughput, flexibility, area and power.

[Fig. 7. FER performance comparison with IEEE 802.11n codes. Frame Error Rate (FER) versus Eb/No [dB] for LDPC codes, BPSK on AWGN channel. Curves: proposed code (N=2400, R=3/4), proposed code (N=1296, R=1/2), proposed code (N=1944, R=2/3), proposed code (N=1600, R=2/5), 802.11n code (N=1296, R=1/2), 802.11n code (N=1944, R=2/3).]

TABLE II
COMPARISON OF PROPOSED DECODER WITH EXISTING LDPC DECODERS

              Proposed Decoder     Blanksby [1]       Mansour [2]
Throughput    1.0 Gbps @ 2.2 dB    1.0 Gbps           [email protected]
Area          4.5 mm²              52.5 mm²           14.3 mm²
Frequency     350 MHz              64 MHz             125 MHz
Power         740 mW               690 mW             787 mW
Block size    360 to 4200 bits     1024 bits fixed    2048 bits fixed
Code rate     1/4 : 9/10           1/2 fixed          1/16 : 14/16
Technology    0.13 µm, 1.2 V       0.16 µm, 1.5 V     0.18 µm, 1.8 V

VI. CONCLUSION

A VLSI decoder architecture that supports variable block-size and multi-rate LDPC codes has been presented. By utilizing structured QC-LDPC codes, we proposed a pipelined partially parallel decoding algorithm which is well suited for VLSI implementation. The decoder has been placed and routed using a TSMC 0.13 µm, 1.2 V, eight-metal-layer CMOS technology. The decoder can support high throughput decoding, for example, 1 Gbps at 2.2 dB SNR, in a smaller area.

VII. ACKNOWLEDGEMENT

This work was supported in part by Nokia and by NSF under grants CCF-0541363, CNS-0551692, and CNS-0619767.

REFERENCES

[1] A.J. Blanksby and C.J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.
[2] M.M. Mansour and N.R. Shanbhag, “A 640-Mb/s 2048-Bit Programmable LDPC Decoder Chip,” IEEE Journal of Solid-State Circuits, vol. 41, pp. 684–698, March 2006.
[3] M. Karkooti, P. Radosavljevic, and J.R. Cavallaro, “Configurable, High Throughput, Irregular LDPC Decoder Architecture: Tradeoff Analysis and Implementation,” IEEE 17th International Conference on Application-specific Systems, Architectures and Processors, pp. 360–367, Sep. 2006.
[4] R.M. Tanner, D. Sridhara, A. Sridharan, T.E. Fuja, and D.J. Costello Jr., “LDPC block and convolutional codes based on circulant matrices,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 2966–2984, 2004.
[5] M.M. Mansour and N.R. Shanbhag, “High-throughput LDPC decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, pp. 976–996, Dec. 2003.
[6] R. Gallager, “Low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 8, pp. 21–28, Jan. 1962.
[7] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier and X. Hu, “Reduced-Complexity Decoding of LDPC Codes,” IEEE Transactions on Communications, vol. 53, pp. 1232–1232, 2005.
