
Yu et al.

EURASIP Journal on Wireless Communications and Networking 2012, 2012:84


http://jwcn.eurasipjournals.com/content/2012/1/84

RESEARCH Open Access

Systematic construction, verification and implementation methodology for LDPC codes
Hui Yu*, Jing Cui, Yixiang Wang and Yibin Yang

Abstract
In this article, a novel and systematic Low-Density Parity-Check (LDPC) code construction, verification and implementation methodology is proposed. The methodology is composed of a simulated annealing based LDPC code constructor, a GPU based high-speed code selector, an ant colony optimization based pipeline scheduler and an FPGA-based hardware implementer. Compared with traditional approaches, this methodology enables us to construct LDPC codes that are aware of both decoding performance and hardware efficiency in a short time. Simulation results show that the generated codes have far fewer cycles (length-6 cycles are eliminated) and memory conflicts (a 75% reduction in idle clocks), with no BER performance loss compared to the WiMAX codes. Additionally, the simulation runs 490 times faster than a CPU implementation at floating-point precision, and a net throughput of 24.5 Mbps is achieved. Finally, a multi-mode LDPC decoder with a net throughput of 1.2 Gbps (bit-throughput 2.4 Gbps) is implemented on FPGA, with completely on-the-fly configurations and less than 0.2 dB BER performance loss.
Keywords: low-density parity-check codes, simulated annealing, ant colony optimization, graphic processing unit, decoder architecture

1. Introduction
Low-density parity-check (LDPC) codes were first proposed by Gallager [1] and rediscovered by MacKay and Neal [3] after the Tanner graph [2] was introduced into LDPC coding. LDPC codes with soft decoding algorithms on the Tanner graph achieve outstanding capacity and approach the Shannon limit over noisy channels at moderate decoding complexity [4]. Most decoding algorithms derive from the well-known belief propagation (BP) algorithm, such as the min-sum algorithm (MSA) with simplified calculation, the modified MSA (MMSA) [5] with improved BER performance, and layered versions [6] with fast decoding convergence.

The existence of "cycles" in the Tanner graph is a critical constraint on the above algorithms, as cycles break the "message independence hypothesis" and degrade the BER performance. As a result, the "girth" has become an important metric for estimating the performance of an LDPC code. The progressive edge-growth (PEG) algorithm [7] is a girth-aware construction method that tries to make the shortest cycle as large as possible. The approximate cycle extrinsic (ACE) message degree constraint is further combined with PEG [8] to lower the error floor. However, these performance-aware methods do not take hardware implementation into account, which usually results in low efficiency or high complexity.

As to the decoder implementation, the fully-parallel architecture [9] was first proposed to achieve the highest decoding throughput, but its hardware complexity due to routing overhead is very high. The semi-parallel layered decoder [10] was then proposed to achieve a trade-off between hardware complexity and decoding throughput. Memory conflict is a critical problem for the layered decoder; it is modeled as a single-layer traveling salesman problem (TSP) in [11]. However, this model ignores "element permutation", i.e., the order assignment of the edges in each layer, so its search does not cover the entire solution space. Further, a fully-parallel graphic processing unit (GPU) based implementation is proposed in [12].

In this article, a novel and systematic LDPC code construction, verification, and implementation methodology is proposed, and a software and hardware platform is implemented, composed of the four modules shown in Figure 1. The simulated annealing (SA) based LDPC code constructor continuously constructs good candidate codes.

* Correspondence: [email protected]
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, P. R. China

© 2012 Yu et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution
License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.

[Figure 1 shows the LDPC code construction, verification and hardware implementation platform: the SA-based LDPC code constructor generates candidate codes, the GPU-based high-speed performance evaluator selects codes, the ACO-based pipelining schedule optimizer improves hardware efficiency, and the FPGA-based multi-mode high-throughput LDPC decoder hardware architecture implementer yields the good LDPC codes.]
Figure 1 LDPC code construction and verification platform.

The BER performance of the generated codes, especially the error floor, is then evaluated by the high-speed GPU based simulation platform. Next, the hardware pipelines of the selected codes are optimized by the ant colony optimization (ACO) based scheduling algorithm, which removes much of the memory conflict. Finally, detailed implementation schemes are proposed, i.e., the reconfigurable switch network (adopted from [13]), offset-threshold decoding, the split-row MMSA core, the early-stopping scheme and the multi-block scheme, and the corresponding multi-mode high-throughput decoder of the optimized codes is implemented on FPGA. The novelties of the proposed methodology are as follows:

• Compared to traditional methods (PEG, ACE), the SA-based constructor takes both decoding performance and hardware efficiency into consideration during the construction process.
• Compared to existing work [11], the ACO-based scheduler covers both layer and element permutation and maps the problem to a double-layered TSP, which is a complete solution and can provide a better pipelining schedule.
• Compared to existing works, the GPU-based evaluator is the first to implement the semi-parallel layered architecture on GPU. The obtained net throughput is similar to the highest reported [12] (about 25 Mbps), while the proposed scheme has higher precision and better BER performance. Further, we put the whole coding and decoding system on the GPU rather than a single decoder.
• Compared to existing FPGA or ASIC implementations [14-16], the proposed multi-mode high-throughput decoder not only supports multiple modes with completely on-the-fly configurations, but also has a performance loss within 0.2 dB against floating-point precision and 20 iterations, and a stable net throughput of 721.58 Mbps under code rate 1/2 and 20 iterations. With the early-stopping scheme, a net throughput of 1.2 Gbps is further achieved on a Stratix III FPGA.

The remainder of this paper is organized as follows. Section 2 presents the background of our research. Sections 3, 4, and 5 introduce the ACO based pipeline scheduler, the SA based code constructor and the GPU based performance evaluator, respectively, followed by hardware implementation schemes and issues of the multi-mode high-throughput LDPC decoder discussed in Section 6. Simulation results are provided in Section 7 and hardware implementation results are given in Section 8. Finally, Section 9 concludes this article.

2. Background
2.1. LDPC codes and Tanner graph
An LDPC code is a special linear block code, characterized by a sparse parity-check matrix H with dimensions M × N; H_{j,i} = 1 if code bit i is involved in parity-check equation j, and 0 otherwise. An LDPC code is usually described by its Tanner graph, a bipartite graph defined on the code bit set and the parity-check equation set, whose elements are called "bit nodes" and "check nodes", respectively. An edge is assigned between bit node BN_i and check node CN_j if H_{j,i} = 1. A simple 4 × 6 LDPC code and the corresponding Tanner graph are shown in Figure 2.

Quasi-cyclic LDPC (QC-LDPC) codes are a popular class of structured LDPC codes. A QC-LDPC code is defined by its base matrix Hb, whose elements satisfy −1 ≤ Hb_{j,i} < zf, where zf is called the expansion factor. Each element in the base matrix is expanded to a zf × zf matrix to obtain H.

The elements Hb_{j,i} = −1 are expanded to zero matrices, while the elements Hb_{j,i} ≥ 0 are expanded to cyclically shifted identity matrices with permutation factors Hb_{j,i}. QC-LDPC codes are naturally suitable for layered algorithms: the j-th block row of Hb is exactly layer j. We call the set of "1"s of the j-th row

P_j = { Hb_{j,i} | Hb_{j,i} ≥ 0 }   (1)

See Figure 3 for an example of a 4 × 6 base matrix with zf = 4.

[Figure 2 shows (a) the H matrix and (b) the Tanner graph of a simple 4 × 6 LDPC code; a length-6 cycle is drawn in bold, and the position of the split row used later in Section 6.4 is indicated.]
Figure 2 H matrix and Tanner Graph of LDPC.

[Figure 3 shows a simple 4 × 6 base matrix Hb with expansion factor zf = 4 and its permutation factors.]
Figure 3 A simple 4 × 6 base matrix Hb with zf = 4.
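To make the expansion rule concrete, the following Python sketch builds the full parity-check matrix H from a base matrix Hb and an expansion factor zf (entries of −1 become zero blocks, non-negative entries become cyclically shifted identity blocks). The helper name and the small example matrix are illustrative, not taken from the paper's software.

```python
import numpy as np

def expand_base_matrix(Hb, zf):
    """Expand a QC-LDPC base matrix Hb (entries in {-1, 0, ..., zf-1})
    into the binary parity-check matrix H of size (M*zf) x (N*zf)."""
    Hb = np.asarray(Hb)
    M, N = Hb.shape
    H = np.zeros((M * zf, N * zf), dtype=np.uint8)
    I = np.eye(zf, dtype=np.uint8)
    for j in range(M):
        for i in range(N):
            shift = Hb[j, i]
            if shift >= 0:
                # cyclically shifted identity: row r has its 1 in column (r + shift) mod zf
                H[j*zf:(j+1)*zf, i*zf:(i+1)*zf] = np.roll(I, shift, axis=1)
    return H

# Example: a 2 x 3 base matrix with zf = 4 (values are illustrative only)
Hb = [[0, 2, -1],
      [1, -1, 3]]
H = expand_base_matrix(Hb, zf=4)
print(H.shape)  # (8, 12)
```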
2.2. The BP algorithm and effect of cycles
The BP algorithm is a general soft decoding scheme for codes described by a Tanner graph. It can be viewed as a process of iterative message exchange between bit nodes and check nodes. In each iteration, every bit node or check node collects the messages passed from its neighborhood, updates its own message, and passes the updated message back to its neighborhood. The BP algorithm has many modified versions, such as log-domain BP, MSA, and layered BP. All of them originate from the basic log-domain message passing equations, given as follows:

L(q_{ij}) = L(Q_i) − L(r_{ji})   (2)

L(Q_i) = L(c_i) + Σ_{j'∈C_i} L(r_{j'i}) = L(q_{ij}) + L(r_{ji})   (3)

where L(c_i) is the initial channel message, L(q_{ij}) is the message passed from BN_i to CN_j, L(r_{ji}) is the message in the inverse direction, and L(Q_i) is the a-posteriori LLR of bit node BN_i. C_i is the neighbor set of BN_i and R_j is the neighbor set of CN_j. The check-node message L(r_{ji}) is computed from the incoming messages L(q_{i'j}), i' ∈ R_j \ i, through the function Φ(x) = log((e^x + 1)/(e^x − 1)). These equations can also be applied in layered BP; the difference is that L(q_{ij}) and L(r_{ji}) are updated in each layer of the iteration.

The above equations require the independence of all the messages L(q_{i'j}), i' ∈ R_j. However, the existence of "cycles" in the Tanner graph invalidates this independence assumption and thus degrades the BER performance of the BP algorithm. A length-6 cycle is shown with bold lines in Figure 2. In this case, if the BP algorithm proceeds for more than three iterations, the received messages of the involved bit nodes v2, v4 and v5 will partly contain their own messages sent three iterations before. For this reason, the minimum cycle length in the Tanner graph, called the "girth", has a strong relationship with the BER performance and is considered an important metric in LDPC code construction algorithms (PEG, ACE) [7,8].
The BP algorithm is a general soft decoding scheme for 2.3. Decoder architecture and memory conflict
codes described by Tanner Graph. It can be viewed as The semi-parallel structure with layered MMSA core is
the process of iterative message exchange between bit a popular decoder architecture due to its good tradeoff
nodes and check nodes. For each iteration, each bit among low complexity, high BER performance and high
node or check node collects the messages passed from throughput. As shown in Figure 4, the main compo-
its neighborhood, updates its own message and passes nents in the top-level architecture include an LLRSUM
the updated message back to its neighborhood. BP algo- RAM storing L(Qi), an LLREX RAM storing L(rji) and a
rithm has many modified versions, such as log-domain layered MMSA core pipeline. The two RAMs should be
BP, MSA, and layered BP. All of them originate from readable and writable. Old values of L(Qi) and L(rji) are
the basic log-domain message passing equations, given read, and new values are calculated through the pipeline
as follows. and written back to RAMs. For QC-LDPC codes, the
 values are processed layer by layer, and the “1"s in each

{Hbj,i Hbj,i ≥ 0} (1) layer is processed one by one.
Memory conflict is a critical problem that constrains
L(qij ) = L(Qi ) − L(rji ) (2) the throughput of the semi-parallel decoder. Essentially,
memory conflict occurs when the read-after-write
 (RAW) dependency of L(Qi) is violated. Note that the
L(Qi ) = L(ci ) + L(rj i ) = L(qij ) + L(rji ) (3) new value of L(Q i ) will not be written back to RAM
j ∈ci until the pipelined calculation finishes. If L(Qi) is again
needed during this calculation period, the old value will
where L(ci) is the initial channel message, L(qij) is the
be read, while the new one is still under processing, see
message passing from BNi to CNj, L(rji) is the message
L(Q6) in Figure 4. This case happens when the layers j
of inverse direction, and L(Qi) is the a-posteriori of bit
and j + l have “1"s in the same position
node BN i . Ci is the neighbor set of BN i , ℛ j is the
i (Hbj,i ≥ 0, Hbj+l,i ≥ 0) . We call it a gap-l conflict.
Memory conflict slows the decoding convergence and
Y Y Y Y Y Y thus reduces the BER performance. The traditional

F §      ·  method of handling memory conflict is to insert idle
¨ ¸
F ¨      ¸  clocks in the pipeline, with the cost of throughput

F ¨      ¸ reduction. It’s obvious that the smaller l, the more idle
¨ ¸ H[SDQVLRQ IDFWRU 
clocks should be inserted, since the pipeline need to
F ©      ¹ SHUPXWDWLRQ IDFWRU 

wait at least K stages before writing back the new


Figure 3 A simple 4 × 6 base matrix Hb with zf = 4.
values. Usually, the number of gap-1, gap-2, gap-3

[Figure 4 depicts the layered MMSA decoder architecture (LLRSUM RAM, LLREX RAM and the LMMSA core) and a memory conflict on the pipeline time axis: L(Q) values read at stage 1 are written back only after stage K, so reading the same L(Q) before its write-back causes a read-before-write conflict.]
Figure 4 The layered MMSA decoder architecture and memory conflict.

Usually, the numbers of gap-1, gap-2 and gap-3 conflicts, denoted c1, c2 and c3, are considered the metrics for measuring memory conflict.

3. The ACO-based pipelining scheduler
In this section, we propose the ACO-based pipeline scheduling algorithm to minimize memory conflict. We first formulate the problem, then map it to a double-layered TSP, and finally use ACO to solve it.

3.1. Problem formulation
Consider a QC-LDPC code described by its base matrix H with dimensions M × N. Thus, there are M layers. Denote by w_m, 1 ≤ m ≤ M, the number of elements ("1"s) in the m-th layer, and by h_{m,n}, 1 ≤ n ≤ w_m, the column index in H of the n-th element of the m-th layer. Additionally, we assume the core pipeline has K stages.

As discussed above, the decoder processes all the "1"s in H exactly once per iteration, layer by layer, and element by element within each layer. However, the order can be arbitrary, which enables us to schedule the elements carefully to minimize memory conflict. We have two ways to do so.

• Layer permutation: We can assign which layer is processed first and which next. If two layers i, j have "1"s at totally different positions, i.e., no k, l exist such that h_{i,k} = h_{j,l}, they tend to be assigned as adjacent layers with no conflict.
• Element permutation: Within a certain layer, we can assign which element is processed first and which next. If two adjacent layers i, j still have a conflict, i.e., h_{i,k} = h_{j,l} for some k, l, then we can assign element k to be first in layer i and element l to be last in layer j. In this way, we increase the time interval between the conflicting elements k and l.

Therefore, the memory conflict minimization problem is exactly a scheduling problem, in which the layer permutation and the element permutation should be designed to minimize the number of idle pipeline clock insertions. We denote the layer permutation as m → l_m, 1 ≤ m, l_m ≤ M, and the element permutation of layer m as n → μ_{m,n}, 1 ≤ n, μ_{m,n} ≤ w_m.

Based on the above definitions, a memory conflict occurs between layer i, element k and layer j, element l if the following conditions are satisfied: (1) layers i, j are assigned to be adjacent, i.e., l_j = l_i + 1; (2) h_{i,k} = h_{j,l}, i.e., the two elements lie in the same column; (3) the pipeline time interval is less than the number of pipeline stages, i.e., w_i − μ_{i,k} + μ_{j,l} ≤ K. Further, we define the "conflict set" C(i, j) = {(k, l) | elements (i, k) and (j, l) cause a memory conflict}, and the "conflict stages", i.e., the minimum number of idle clocks inserted due to this conflict, as

c(i, k; j, l) = max{ K − (w_i − μ_{i,k} + μ_{j,l}), 0 }   (4)
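As a rough illustration of this cost model, the Python sketch below evaluates a given layer order and element order against the conflict rule (4) and returns the idle clocks implied for each pair of adjacent layers. The function and variable names and the toy base matrix are our own illustrative assumptions.

```python
def idle_clocks(Hb, K, layer_order, elem_order):
    """Count idle clocks for a schedule of a QC-LDPC base matrix.
    Hb          : list of rows, entries -1 (no edge) or >= 0 (an edge)
    K           : pipeline depth in stages
    layer_order : processing order of the layers, e.g. [0, 2, 1, 3]
    elem_order  : elem_order[m] lists the columns of layer m in processing order."""
    pos = {m: {c: n + 1 for n, c in enumerate(elem_order[m])}   # mu_{m,n}, 1-based
           for m in range(len(Hb))}
    w = {m: len(elem_order[m]) for m in range(len(Hb))}
    total, per_pair = 0, []
    order = list(layer_order) + [layer_order[0]]     # wrap to the next iteration
    for a, b in zip(order, order[1:]):               # adjacent layers only
        worst = 0
        common = set(pos[a]) & set(pos[b])           # same-column "1"s
        for c in common:
            interval = w[a] - pos[a][c] + pos[b][c]
            worst = max(worst, max(K - interval, 0))  # conflict stages, Eq. (4)
        per_pair.append(((a, b), worst))
        total += worst
    return total, per_pair

# Illustrative 4-layer base matrix and a naive schedule
Hb = [[0, -1, 3, -1, 1, -1],
      [-1, 2, -1, 0, -1, 1],
      [1, -1, -1, 2, 0, -1],
      [-1, 0, 2, -1, -1, 3]]
elems = {m: [c for c, v in enumerate(row) if v >= 0] for m, row in enumerate(Hb)}
print(idle_clocks(Hb, K=5, layer_order=[0, 1, 2, 3], elem_order=elems))
```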
3.2. The double-layered TSP
This part introduces the mapping from the above memory conflict minimization problem to a double-layered TSP. The TSP is a famous NP-hard problem in which a salesman must find the shortest path that visits all n cities exactly once and finally returns to the starting point. Denote by d_{i,j} the distance between city i and city j. The TSP can be described mathematically as follows: given the distance matrix D = [d_{i,j}]_{n×n}, find the permutation of the city indices x_1, x_2, ..., x_n that minimizes the loop distance,

min { Σ_{i=1}^{n−1} d_{x_i, x_{i+1}} + d_{x_n, x_1} }   (5)

Compared to layer permutation, which contributes most of the memory conflict reduction, element permutation only makes minor changes to the optimization once the layer permutation is determined. Therefore, we map the problem to a double-layered TSP, where layer permutation is mapped to the first layer, and element permutation is mapped to the second layer based on the result of the first layer. Details are described as follows:

• Layer permutation layer: In this layer we only deal with layer permutation. We define the "distance",

also the "cost", between layers i and j as the minimum number of idle clocks inserted before the processing of layer j. If more than one conflicting position pair exists, i.e., |C(i, j)| > 1, then we take the maximum one. Thus, in this layer the distance matrix is defined by

d_{i,j} = max_{(k,l)∈C(i,j)} c(i, k; j, l)   (6)

and the target function remains the same as (5).
• Element permutation layer: In this layer we inherit the layer permutation result and map the element permutation of each layer to an independent TSP. In the TSP for layer i, we fix the schedules of the prior layer p (l_p = l_i − 1) and the next layer q (l_q = l_i + 1), and only tune the elements of layer i. We define the "distance" d_{k,l} as the change in the number of idle clocks if element k is assigned to position l, i.e., μ_{i,k} = l. Note that element k can conflict with layer p or with layer q, and d_{k,l} varies with the conflict case, given by

d_{k,l} = 0 if k conflicts with both layers or with neither;
d_{k,l} = k − l if k conflicts only with layer p;
d_{k,l} = l − k if k conflicts only with layer q.   (7)

Since the largest d_{k,l} becomes the bottleneck of element permutation, the target function changes to the following max form:

min max{ d_{x_1,x_2}, d_{x_2,x_3}, ..., d_{x_{n−1},x_n}, d_{x_n,x_1} }   (8)

3.3. The ACO-based algorithm
This part introduces the ACO based algorithm used to solve the double-layered TSP discussed above. ACO is a heuristic algorithm for computational problems that can be reduced to finding good paths through graphs. Its idea originates from mimicking the behavior of ants seeking a path between their colony and a source of food. ACO is especially suitable for solving the TSP.

Algorithm 1 [see Additional file 1] gives the ACO-based double-layered memory conflict minimization algorithm. First we try layer permutations LAYER1_MAX times, and for each layer permutation we try element permutations LAYER2_MAX times. We record the pipeline schedule with the smallest number of idle clocks as the best solution of the algorithm.

The detailed ACO algorithm for the TSP is described in Algorithm 2. We try SOL_MAX solutions, and for each solution all ants finish CYCLE_MAX cycles, in which the shortest cycle is recorded as the best solution. One ant cycle is finished in VERTEX_NUM ant-move steps, where one step consists of four sub-steps: Ant Choose, Ant Move, Local Update and Global Update. Further, a bonus is rewarded to the shortest cycle. All specific parameters are set following the suggestions of [17].
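Algorithms 1 and 2 themselves are given in the additional file; as a rough illustration of the ACO component only, the sketch below applies standard ant-colony rules (pheromone-biased tour construction, evaporation, and reinforcement of the best tour) to a TSP distance matrix such as the one defined by (6). The parameter names and values are illustrative and do not reproduce the settings of [17].

```python
import random

def aco_tsp(D, n_ants=20, n_cycles=200, alpha=1.0, beta=2.0, rho=0.5, seed=0):
    """Minimal ant colony optimization for a TSP distance matrix D (list of lists).
    Returns (best_tour, best_length). A sketch, not the paper's Algorithm 2."""
    rng = random.Random(seed)
    n = len(D)
    tau = [[1.0] * n for _ in range(n)]                       # pheromone trails
    eta = [[1.0 / (D[i][j] + 1e-9) if i != j else 0.0 for j in range(n)] for i in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(n_cycles):
        for _ant in range(n_ants):
            start = rng.randrange(n)
            tour, visited = [start], {start}
            while len(tour) < n:
                i = tour[-1]
                cand = [j for j in range(n) if j not in visited]
                weights = [(tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in cand]
                r, acc, nxt = rng.random() * sum(weights), 0.0, cand[-1]
                for j, w in zip(cand, weights):            # roulette-wheel choice
                    acc += w
                    if acc >= r:
                        nxt = j
                        break
                tour.append(nxt)
                visited.add(nxt)
            length = sum(D[tour[k]][tour[(k + 1) % n]] for k in range(n))
            if length < best_len:
                best_tour, best_len = tour, length
        for i in range(n):                                  # evaporation
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        for k in range(n):                                  # reinforce the best tour
            i, j = best_tour[k], best_tour[(k + 1) % n]
            tau[i][j] += 1.0 / best_len
            tau[j][i] += 1.0 / best_len
    return best_tour, best_len

# Toy symmetric distance matrix (e.g., the layer-level distances of Eq. (6))
D = [[0, 3, 1, 4], [3, 0, 2, 5], [1, 2, 0, 2], [4, 5, 2, 0]]
print(aco_tsp(D))
```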
4. The SA-based code constructor
In this section, we propose a jointly optimized construction algorithm that takes both performance and efficiency into consideration while constructing the H matrix of an LDPC code. We first give the SA based framework and then discuss the details of the algorithm.

4.1. Problem formulation
We now deal with the classic code construction problem. Given the code length N, the code rate R, and perhaps other constraints such as a QC-RA type (e.g., WiMAX, DVB-S2) or a fixed degree distribution (optimized by density evolution), we should construct a "good" LDPC code, described by its H matrix, that meets practical needs. The word "good" here mainly covers the following two metrics.

• High performance, which means the code should have high coding gain and good BER/BLER performance, including an early water-fall region, a low error floor and anti-fading ability. This is strongly related to large girth, a large ACE spectrum, few trapping sets, etc.
• High efficiency, which means the implementation of the encoder and decoder should have moderate complexity and high throughput. This is strongly related to a QC-RA type, a high degree of parallelism, a short decoding pipeline, few memory conflicts, etc.

Traditional construction methods, such as PEG and ACE, mainly focus on high performance of the code, which motivates us to find a jointly optimized construction method concerning both performance and efficiency.

4.2. The double-stage SA framework
In this part, we introduce the double-stage SA [18] based framework for the jointly optimized construction problem. SA is a generic probabilistic metaheuristic for global optimization that locates a good approximation to the global optimum of a given function in a large search space. Since our search space is a large 0-1 matrix space, denoted {0, 1}^{M×N}, SA is very suitable for this problem.

Note that the performance metric is the more important one for LDPC construction compared with the

efficiency metric. Therefore, we divide the algorithm into two stages, aiming at performance and efficiency, respectively, and regard performance as the major objective that should be satisfied first. For a specific target measured by the "performance energy" e1 and the "efficiency energy" e2, we set two thresholds: an upper bound e1h = e1 and a lower bound e1l < e1. The algorithm enters the second stage when the current performance energy is less than e1l. In the second stage, the algorithm keeps the performance energy no larger than e1h and tries to reduce e2. Algorithm 3 shows the details.

4.3. Details of the algorithm
This part discusses the details of the important functions and configurations of Algorithm 3.

• sample_temperature is the temperature sampling function, decreasing with the step index k. It can take an exponential form a·e^{−bk}.
• prob is the acceptance probability function of the new search point h_new. If h_new is better (E_new < E), it returns 1; otherwise, it decreases with E_new − E and increases with the temperature t. It can take an exponential form a·e^{−b(E_new−E)/t}.
• perf_energy is the performance energy function. It evaluates the performance related factors of the matrix h and gives a lower energy for better performance. Typically, we can count the number of length-l cycles c_l and then calculate a total cost Σ_l w_l c_l, where w_l is the cost weight of a length-l cycle, decreasing with l.
• effi_energy is the efficiency energy function, similar to perf_energy except that it gives a lower energy for higher efficiency. Typically, we can count the number of gap-l memory conflicts c_l and then calculate a total cost Σ_l w_l c_l, where w_l is the cost weight of a gap-l conflict, decreasing with l.
• perf_neighbor searches for a neighbor of h in the matrix space when aiming at performance; it is based on minor changes of h. For QC-LDPC, we can define three atomic operations on the base matrix Hb as follows.
  - Horizontal swap: For chosen rows i, j and columns k, l, swap the values of Hb_{i,k} and Hb_{i,l}, then swap the values of Hb_{j,k} and Hb_{j,l}.
  - Vertical swap: For chosen rows i, j and columns k, l, swap the values of Hb_{i,k} and Hb_{j,k}, then swap the values of Hb_{i,l} and Hb_{j,l}.
  - Permutation change: Change the permutation factor of a chosen element Hb_{i,k}.
  For a higher temperature t, we allow the neighbor searching process to search in a wider space. This is done by performing the atomic operations more times.
• effi_neighbor searches for a neighbor of h in the matrix space when aiming at efficiency. It is similar to perf_neighbor; however, we typically remove the permutation change operation, as it does nothing to help reduce conflicts.
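As a compact illustration of the double-stage framework (Algorithm 3 itself is in the additional file), the Python sketch below runs a first stage on the performance energy until it drops below e1l, then a second stage that reduces the efficiency energy while keeping the performance energy below e1h. The energy and neighbor functions are passed in as callables; all names, the cooling schedule and the acceptance form are illustrative assumptions.

```python
import math
import random

def double_stage_sa(h0, perf_energy, effi_energy, perf_neighbor, effi_neighbor,
                    e1_low, e1_high, n_steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Sketch of a double-stage simulated annealing search over base matrices.
    Stage 1 minimizes the performance energy e1; once e1 < e1_low the search
    switches to stage 2, which minimizes the efficiency energy e2 subject to
    e1 <= e1_high. Returns the best (matrix, e1, e2) found."""
    rng = random.Random(seed)
    h, t = h0, t0
    e1, e2 = perf_energy(h), effi_energy(h)
    best = (h, e1, e2)
    stage = 1
    for _ in range(n_steps):
        t *= cooling                                     # sample_temperature
        if stage == 1:
            h_new = perf_neighbor(h, t)
            e_old, e_new = e1, perf_energy(h_new)
            ok = True
        else:
            h_new = effi_neighbor(h, t)
            e_old, e_new = e2, effi_energy(h_new)
            ok = perf_energy(h_new) <= e1_high           # keep performance acceptable
        # acceptance: 1 if better, otherwise exp(-(E_new - E)/t)   (the prob function)
        accept = ok and (e_new < e_old or
                         rng.random() < math.exp(-(e_new - e_old) / max(t, 1e-9)))
        if accept:
            h = h_new
            e1, e2 = perf_energy(h), effi_energy(h)
            if (e1, e2) < (best[1], best[2]):
                best = (h, e1, e2)
        if stage == 1 and e1 < e1_low:
            stage = 2                                    # enter the efficiency stage
    return best
```

In this sketch, perf_energy would count short cycles (Σ_l w_l c_l), effi_energy would count gap-l conflicts, and the neighbor callables would apply the horizontal/vertical swap and permutation-change moves listed above.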
mance. Typically, we can calculate the number of is low (10−7 to 10−10 ). This motivates us to implement
length-l cycles cl, then calculate a total cost given by the verification platform on GPU where many decoders
∑ l w l c l , where w l is the cost weight of a length-l run parallel like hardware such as ASIC/FPGA to provide
cycle, decreasing with l. statistics.
• effi_energy is the efficiency energy function, similar Figure 5 shows our GPU architecture. CPU is used as
as perf_energy except that it gives a lower energy for the controller, which puts the code into GPU constant
higher efficiency. Typically, we can calculate the the memory, raises the GPU kernels and gets back the sta-
number of gap-l memory conflicts cl, then calculate tistics. While in GPU grid, we implement the whole
a total cost given by ∑ l w l c l , where w l is the cost coding system for each GPU block, including source
weight of a layer gap l conflict, decreasing with l. generator, LDPC encoder, AWGN channel, LDPC deco-
• perf_neighbor searches for a neighbor of h in the der and statistics. Our decoding algorithm is layered
matrix space when aiming at performance, which is MMSA. In each GPU block, we assign zf threads to cal-
based on minor changes of h. For QC LDPC, we can culate new LLRSUM and LLREX of the zf rows in each
define three atomic operations for the base matrix layer, where zf is the expansion factor of QC LDPC. The
Hb as follows. zf threads cooperate to complete the decoding job.
- Horizontal swap: For chosen row i,j and col-
umn k, l, swap values of Hbi,k and Hbi,l , then 5.2. Algorithm and procedure
This part introduces the procedure that implements the
swap values of Hbj,k and Hbj,l . GPU simulation, given by Algorithm 4. P × Q blocks run
- Vertical swap: For chosen row i,j and column k, parallel, each simulating an individual coding system,
l, swap values of Hbi,k and Hbj,k , then swap values where P is the number of multiprocessors (MP) on the
device and Q is the number of cores per MP. In each sys-
of Hbi,l and Hbj,l . tem, z f threads cooperatively do the job of encoding,
- Permutation change: Change the permutation channel and decoding. When decoding, the threads pro-
factor for chosen element Hbi,k . cess data layer after layer, each thread performing

[Figure 5 shows the GPU architecture of the BER simulation: the host CPU acts as the controller and launches the kernel; the device grid contains many blocks, each running a complete coding system (generate source bits, encode, pass channel, then decode iterations that read Q_i and r_ji, calculate q_ij, update r_ji and Q_i, and collect statistics) with its threads cooperating through shared memory, while code parameters are kept in constant memory and results in global memory.]
Figure 5 GPU architecture of the BER simulation for LDPC code.

LMMSA for one row of the layer. The procedure ends with the statistics of the P × Q LDPC blocks.

5.3. Details and instructions
• Ensure "coalesced access" when reading or writing global memory, or the operation will be serialized automatically. In our algorithm, adjacent threads should access adjacent L(Q_i) and L(r_{ji}).
• Shared memory and registers are fast yet limited resources and their use should be carefully planned. In our algorithm, we store L(Q_i) in shared memory and L(r_{ji}) in registers due to the lack of resources.
• Make sure all the P × Q cores are running. This calls for careful assignment of the limited resources (i.e., warps, shared memory, registers). In our case, we limit the registers per thread to 16 and the threads per block to 128, or some of the Q cores on each MP will "starve" and be disabled.
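The structure of the per-block coding system is easy to mirror on a CPU. The Python skeleton below runs the same source, channel, decode and statistics loop that each GPU block executes in Algorithm 4, here sequentially and with a stand-in hard-decision "decoder"; it is only a sketch of the flow, not the CUDA kernel, and exploits the usual all-zero-codeword simplification for linear codes over AWGN.

```python
import numpy as np

def awgn_llr(bits, ebn0_db, rate):
    """BPSK over AWGN: return channel LLRs L(c_i) for the given Eb/N0 and code rate."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    sigma2 = 1.0 / (2.0 * rate * ebn0)            # noise variance per dimension
    x = 1.0 - 2.0 * bits                          # bit 0 -> +1, bit 1 -> -1
    y = x + np.sqrt(sigma2) * np.random.randn(bits.size)
    return 2.0 * y / sigma2

def ber_simulation(N, rate, ebn0_db, n_blocks, decode):
    """Monte Carlo BER/BLER loop; each GPU block in Algorithm 4 runs one such system.
    'decode' maps channel LLRs to hard bit decisions."""
    bit_errors, blk_errors = 0, 0
    for _ in range(n_blocks):
        tx = np.zeros(N, dtype=np.uint8)          # all-zero codeword is sufficient
        llr = awgn_llr(tx, ebn0_db, rate)         # for a linear code over AWGN
        rx = decode(llr)
        errs = int(np.sum(rx != tx))
        bit_errors += errs
        blk_errors += (errs > 0)
    return bit_errors / (n_blocks * N), blk_errors / n_blocks

# Stand-in decoder: hard decision on the channel LLRs.
# Replace it with the layered MMSA decoder to reproduce the platform's statistics.
hard = lambda llr: (llr < 0).astype(np.uint8)
print(ber_simulation(N=2304, rate=0.5, ebn0_db=1.5, n_blocks=200, decode=hard))
```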
of 1,2,...,S. For the design of reconfigurable LDPC deco-
ders, two special kinds of output order are more impor-
6. Hardware implementation schemes tant, described as follows.
6.1. Top-level hardware architecture
Our goal is to implement a multi-mode high-throughput • Full cyclic-shift: The output has the cyclic-shift
QC-LDPC decoder, which can support multiple code rates form of the total S inputs, i.e., xc, xc+1,...xS, x1,x2...,xc
and expansion factors on-the-fly. The proposed decoder −1, where 1 ≤ c ≤ S.
consists of three main parts, namely, the interface part, the • Partial cyclic-shift: The output has the cyclic-shift
execution part and the control part. The top level architec- form of the first p inputs, while other signal can be
ture is shown in Figure 6. in arbitrary order, i.e., i.e., xc, xc+1,...xp, x1,x2...,xc−1,
The interface part buffers the input and output data as x*,...x*, where 1 ≤ c<p <S, and x* can be any signal
well as handling the configuration commands. In the from xp+1 to xS.
execution part, the LLRSUM and LLREX are read out
from the RAMs, updated in the Σ parallel LMMSA cores, For the implementation of QC-LDPC decoder, the
and written back to the RAMs, thus forming the switch network is an essential module. Suppose
LLRSUM loop and the LLREX loop, as marked red in b
Hj,i = Hk,i
b
≥ 0, j < k , and for any j < l < k, Hl,i
b
= −1 ,
Figure 6. The control part generates control signals,
then the same data is involved in the processing of the
including port control, LLRSUM control, LLREX control
above two “1"s, i.e., LLRSUM and LLREX of BNi × Zf to
and iteration control.

[Figure 6 shows the top-level decoder: an interface part (input/output ports, LLRINIT RAM, DOUT RAM and configuration input), an execution part with Σ parallel SR-MMSA cores whose LLRSUM RAM and LLREX RAM form the LLRSUM and LLREX loops through the shuffle (switch) network, and a control part (port control, LLRSUM/LLREX control, iteration control with early stopping).]
Figure 6 Top-level multi-mode high-throughput LDPC decoder architecture.

BN_{(i+1)·zf − 1}. However, after processing Hb_{j,i}, these data must be cyclically shifted to ensure the correct order for the processing of Hb_{k,i}, which corresponds to the full cyclic-shift case with

S = zf,   c = (Hb_{k,i} − Hb_{j,i} + S) mod S   (9)

Further, in the case of multiple expansion factors, such as WiMAX [19] (zf = 24:4:96), the partial cyclic-shift is required, with

S = zf_max,   p = zf,   c = (Hb_{k,i} − Hb_{j,i} + p) mod p   (10)
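A small Python helper makes the shift computations of (9) and (10) explicit; it returns the shift c to apply between two consecutive non-negative entries of the same base-matrix column. The function names and the example values are illustrative.

```python
def full_shift(hb_j, hb_k, zf):
    """Full cyclic shift, Eq. (9): both entries are expanded with the same zf."""
    S = zf
    return (hb_k - hb_j + S) % S

def partial_shift(hb_j, hb_k, zf, zf_max):
    """Partial cyclic shift, Eq. (10): the network is built for S = zf_max inputs,
    but only the first p = zf signals take part in the rotation."""
    S, p = zf_max, zf
    c = (hb_k - hb_j + p) % p
    return S, p, c

# Example: consecutive permutation factors 37 and 5 in one column, zf = 96, zf_max = 256
print(full_shift(37, 5, zf=96))                   # shift within a 96-wide rotation
print(partial_shift(37, 5, zf=96, zf_max=256))    # (S, p, c) for the partial case
```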
The existing schemes to implement switch networks include the MS-CS network [20] and the Benes network [13,21,22]. The former structure can handle the case where S is not a power of 2, while the latter has proved more efficient in area and gate count. In [13], the most efficient on-the-fly generation method for the control signals is proposed. Therefore, we adopt the Benes network proposed in [13] for our decoder. Its structure is shown in Figure 7 and its features are given in Table 1.

Table 1 Features of the reconfigurable Benes network
Scale: W = 9, S = 256, 15 stages, 128 × 15 MUX
Support: p = 64:8:256, 1 ≤ c ≤ p
Resource: 20866 LE, 388 memory bits
Clock: 100 MHz for Cyclone II, 240 MHz for Stratix III
Delay: two clocks for control signals, four clocks for output

6.3. The offset-threshold decoding scheme
In this part, we propose the offset-threshold decoding method, which is adopted in our decoder architecture. Unlike existing modifications of the MSA [5], the proposed scheme uses an offset-threshold correction to further improve the BER/BLER performance.

The traditional MSA is a simplified version of BP, obtained by replacing the complicated check-node update with a simple min operation, as follows:

sgn(L(r_{ji})) = Π_{i'∈R_j\i} sgn(L(q_{i'j}))   (11)

abs(L(r_{ji})) = min_{i'∈R_j\i} abs(L(q_{i'j}))   (12)

In [5], the normalized and offset MMSA schemes, (13) and (14), are proposed to compensate for the loss of the above approximation:

abs(L(r_{ji})) = α · abs(L(r_{ji}))   (13)

abs(L(r_{ji})) = max( abs(L(r_{ji})) − β, 0 )   (14)

In our simulation, the offset MMSA performs better than the normalized one in terms of BLER. However, as to BER, both schemes show an error floor at 10^{−6}, as shown in Figure 8. The problem is that, in most cases, the offset MMSA works well, while in a few cases the decoding fails with many bit errors in one block. The intuitive explanation of this phenomenon is the existence of extremely large likelihoods (L(q_{ij}), L(r_{ji})). In the high SNR region, the L(q_{ij}) likelihoods converge fast to a large value, in some cases for both correct and wrong bits. The wrong bits not only remain wrong, but also propagate large L(r_{ji}) to other bits, resulting in more wrong bits and finally the failure of decoding. For this reason, we need to set a threshold on top of the offset MMSA to keep the likelihoods from becoming extremely large, which leads to the proposed offset-threshold scheme, given by the following equation.

[Figure 7 shows the structure of the Benes network: a control signal generation module driven by the parameters p, c and S, and two S/2-input Benes sub-networks whose outputs are combined to place the S LLR inputs LLR_1 ... LLR_S in the desired cyclically shifted order.]
Figure 7 The structure of Benes network.

abs(L(r_{ji})) = min( max( abs(L(r_{ji})) − β, 0 ), γ )   (15)

The difference between the traditional MSA, normalized MMSA, offset MMSA and offset-threshold MMSA corrections is illustrated in Figure 9. Simulation results (Figure 8) show that the proposed scheme has the lowest error floor (10^{−8}) among the above schemes, while achieving BLER performance as good as the offset MMSA.

[Figure 8 plots the BER and BLER of the normalized, offset and offset-threshold MMSA schemes versus Eb/N0 from 1.0 to 1.9 dB; the offset-threshold curves reach the lowest error floor.]
Figure 8 BER/BLER performance of different MSA schemes.

[Figure 9 compares the correction types of the different MSA schemes by plotting abs(L(r_{ji})) after correction versus abs(L(r_{ji})) before correction for the non-modified, normalized, offset and offset-threshold rules, with the offset β and the threshold γ marked.]
Figure 9 Comparison of the correction types of different MSA schemes.
6.4. The split-row MMSA core
This part presents the split-row MMSA core. In the traditional semi-parallel structure with a layered MMSA core (see Figure 4), the "1"s in the j-th row are processed one by one to find the minimum and sub-minimum of all L(q_{ij}), so the number of decoding stages K for one iteration is proportional to the number of "1"s in each row of the base matrix Hb. The idea is that, if k "1"s can be processed at the same time, the decoding time of one iteration is shortened by a factor of k and the throughput gains a factor of k. This is achieved by the split-row scheme, which vertically splits Hb into multiple parts. The "1"s in each part are processed simultaneously to find the local minimum, and the results are then merged. In this way, for an Hb with maximum row weight w, the minimum and sub-minimum can be obtained in w/k clocks. Take Figure 2 as an example: we split the 4 × 6 Hb into two parts, each having one or two "1"s in every row. The corresponding architecture with k = 2 is shown in Figure 10. The LLRSUM (L(Q_i)), LLREX (L(r_{ji})) and LLR (L(q_{ij})) of the left part and of the right part are stored in two individual RAMs/FIFOs, respectively. Two minimum/sub-minimum finders pass their results to the merger for the final comparison, thus approximately shortening the processing pipeline by half. Note that a split position must exist for the code Hb such that each row in each part contains nearly the same number of "1"s. Otherwise, we would need RAMs with multiple read and write ports, which is not practical for FPGA implementation.

[Figure 10 shows the architecture of the split-row MMSA core: two halves, each with its own LLRSUM/LLREX subtractor, minimum finder and LLR FIFO, feed a shared merger followed by the offset-threshold corrector and the comparators that write the updated LLRSUM and LLREX back.]
Figure 10 The architecture of split-row MMSA core.

6.5. The early-stopping scheme
This part introduces the early-stopping scheme applied in our decoder. In practical scenarios, the decoding process often converges much earlier than the preset maximum number of iterations is reached, especially under favorable transmission conditions when the SNR is large. Thus, if the decoder can terminate the decoding iterations as

soon as it detects convergence, the power of the circuit can be reduced, as well as the decoding delay. Throughput is also increased if the system dynamically adjusts the transmission rate according to the statistics of the average iteration number under the current channel state.

Traditional stopping criteria focus on whether the code can be decoded successfully or not; they either cost too much extra resource to store iteration parameters, such as HDA [23] and NSPC [24], or use floating-point calculation to evaluate the current iteration situation, such as VNR [25] and CMM [23]. None of these methods is suitable for hardware implementation.

Here, we propose a simple and effective scheme to detect the convergence of the decoding. "Convergence" means that, at some point, all of the hard decisions sgn(L(Q_i)) satisfy the check equations. Detecting convergence usually demands parallel calculation of every check equation. However, thanks to the layered (QC) structure, and because of the limited hardware resources, we can use a semi-parallel algorithm to implement the iteration-stopping module, which evaluates one layer (zf equations) simultaneously. If the number of continuously successful-check layers reaches a threshold ω, the module triggers a signal indicating that the decoding has converged and the iteration can be stopped.

One important issue of Algorithm 5 is the estimation of the threshold ω. The BER/BLER performance and the average iteration count for different ω are shown in Figure 11, where the stopping criterion of the ideal iteration is H·c^T = 0. We choose ω = 2.5 × M to achieve a trade-off between time and performance. In this case, if the average iteration count is I_ave (ideal iteration case), the decoding terminates at approximately I_ave + 2 iterations.

[Figure 11 plots the BER and the average number of iterations at termination versus Eb/N0 for ω = 2M, 2.5M, 3M and the ideal-iteration criterion.]
Figure 11 BER/BLER performance and average iterations under different ω.
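A minimal sketch of the semi-parallel stopping rule (Algorithm 5 in the additional file): after each layer update, the zf parity checks of that layer are evaluated on the current hard decisions, and decoding stops once ω consecutive layers check out. The function names are ours, and llr_update stands in for the layered MMSA step.

```python
import numpy as np

def layer_checks_ok(H_layer, hard_bits):
    """True if all zf parity checks of one layer are satisfied."""
    return not np.any(H_layer.dot(hard_bits) % 2)

def early_stop_decode(layers, llr_update, Q0, omega, max_iter=20):
    """Layered decoding with the early-stopping module.
    layers     : list of (zf x N) binary sub-matrices, one per layer
    llr_update : callable (layer_index, Q) -> new Q (the layered MMSA step)
    omega      : required number of consecutive successful-check layers,
                 e.g. 2.5 * M as chosen in the paper."""
    Q, consecutive = Q0, 0
    for it in range(max_iter):
        for m, H_layer in enumerate(layers):
            Q = llr_update(m, Q)
            hard = (Q < 0).astype(np.uint8)
            consecutive = consecutive + 1 if layer_checks_ok(H_layer, hard) else 0
            if consecutive >= omega:
                return Q, it + 1, True          # converged early
    return Q, max_iter, False                   # reached the maximum iteration count
```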
6.6. The multi-block scheme
Suppose an LDPC code with code length N and expansion factor zf still has serious memory conflicts even after being optimized by our SA and ACO algorithms, which is common for large zf and relatively small N. To address this problem, we propose a hardware method called "multi-block" to further avoid memory conflicts and increase the pipeline efficiency. The "multi-block" scheme is explained as follows.

We construct a new matrix Hv from the parity-check matrix H with M rows:

Hv = [ H, 0; 0, H ]   (16)

Here the "virtual matrix" Hv is the combination of two codes H without any "cross constraint" (an edge between nodes from different codes in the Tanner graph) between each other. Suppose v1 and v2 are any two legal encoded blocks satisfying

v1·H^T = 0,   v2·H^T = 0   (17)

Thus the vector (v1 v2) is also one legal block for Hv:

(v1 v2)·Hv^T = (v1 v2)·[ H^T, 0; 0, H^T ] = 0   (18)

The key observation is that there are no memory conflicts between the two codes H, due to the diagonal form of Hv. This enables us to reorder and combine the decoding schedules of the two codes to reduce the memory conflicts of each code. We rewrite H and Hv as follows:

H = [ H_1; H_2; ...; H_M ],
Hv_opt = [ H_1^(1), 0;  0, H_1^(2);  H_2^(1), 0;  0, H_2^(2);  ...;  H_M^(1), 0;  0, H_M^(2) ]   (19)

where H_i^(j) denotes the i-th row of the j-th code. The decoding schedule is given by the above equation, i.e., H_i^(1) comes first, followed by H_i^(2), then H_{i+1}^(1), and so forth. The benefit of this "multi-block" scheme is that the insertion of H_i^(2) provides extra stages for the conflicts between H_i^(1) and H_{i+1}^(1).

To sum up, the "multi-block" scheme changes any gap-l memory conflict into a gap-(2l − 1) conflict and thus can improve the pipeline efficiency significantly. Meanwhile, it demands no extra logic resources (LEs) for the design, but may double the memory bits for buffering two encoded blocks. Since the depth of the memory is not fully used on our FPGA, the proposed method can make full use of it with no extra resource cost.
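The virtual-matrix construction and the interleaved schedule of (16)-(19) can be sketched as follows; the code builds Hv = diag(H, H) and emits the row order H_1^(1), H_1^(2), H_2^(1), and so on. The names and the toy matrix are illustrative.

```python
import numpy as np

def multi_block(H):
    """Return the virtual matrix Hv = diag(H, H) and the interleaved row order."""
    M, N = H.shape
    Hv = np.zeros((2 * M, 2 * N), dtype=H.dtype)
    Hv[:M, :N] = H                      # code copy 1
    Hv[M:, N:] = H                      # code copy 2, no cross constraints
    # schedule: row i of copy 1, then row i of copy 2, then row i+1 of copy 1, ...
    order = [copy * M + i for i in range(M) for copy in (0, 1)]
    return Hv, order

H = np.array([[1, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 1, 1]], dtype=np.uint8)
Hv, order = multi_block(H)
print(order)   # [0, 3, 1, 4, 2, 5]
# Rows of the other copy are inserted between previously adjacent rows of each copy,
# which relaxes the gap-l conflicts of the original schedule as described above.
```

Because the two copies share no columns, the inserted copy-2 row adds processing time between conflicting copy-1 rows without creating any new RAW dependency.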
7. Numerical simulation
In this section, we show how our platform produces "good" LDPC codes with outstanding decoding performance and hardware efficiency. For comparison, we target the WiMAX LDPC code (N = 2304, R = 0.5, zf = 96). We use the same parameters and degree distributions as WiMAX for our SA-based constructor. We set "cycles" as the performance metric and memory conflicts as the efficiency metric. The properties of one of the candidate codes and of the WiMAX code are listed in Table 2. The candidate code has far fewer length-6/8 cycles and gap-1/2/3 memory conflicts. Usually, the candidate codes can eliminate length-6 cycles and gap-1 conflicts, which ensures a girth of at least 8 and no conflict under a short pipeline (when K ≤ w_m).

Table 2 Cycle and conflict performance of the two codes
                       Candidate code                         WiMAX code
Cycles: length 6/8     0/55                                   5/150
Conflicts: gap 1/2/3   0/3/9                                  5/11/15
Pipeline occupancy     Before ACO: 76/88; After ACO: 76/81    Only layer permu.: 76/96

We simulate the candidate code and the WiMAX code on the GPU platform. The BER/BLER performance is shown in Figure 12, while the platform parameters and throughput are listed in Table 3. The water-fall region and the error floor of our candidate code are almost the same as those of the WiMAX code. For speed comparison, we also include the fastest result ever reported [12]. The "net throughput" is defined as the number of decoded "message bits" per second, given by

net throughput = P · Q · N · R / t   (20)

where t is the time consumed for running through the GPU kernel (for us, Algorithm 4). As shown in Table 3, our GPU platform speeds up 490 times against the CPU and achieves a net throughput of 24.5 Mbps. Further, our throughput approaches the fastest one, while providing better precision (floating-point vs. 8-bit fixed-point) for the simulation.

[Figure 12 plots the BER and BLER of the candidate code and the WiMAX code versus Eb/N0 from 1.0 to 2.6 dB; the two codes perform almost identically.]
Figure 12 BER and BLER performance of the two codes.

Finally, we optimize the pipeline schedule with the ACO-based scheduler, as shown in Table 2. The "pipeline occupancy" is given by the running/total clocks required for one iteration. For the candidate code, the number of idle clock insertions after ACO is 5, compared with 12 before ACO, a 58.3% reduction. For the WiMAX code, 20 idle clock insertions are still required after the layer-permutation-only (single-layer) scheme

proposed by [11]. In this case, the double-layered ACO achieves a 75% reduction against the single-layer scheme (5 vs. 20 idle clocks).

Table 3 Parameters and performance: GPU vs CPU (20 iterations)
                   GPU (ours)            CPU                  GPU [12]
Platform           NV. GTX260            Intel Core2 Quad     NV. 8800GTX
Clock frequency    1.24 GHz              2.66 GHz             1.35 GHz
Decoding method    Semi-parallel LMMSA   Semi-parallel LMMSA  Full-parallel BP
Blocks × threads   216 × 96              1                    128 × 256
Net throughput     24.5 Mbps             50 Kbps              25 Mbps
Precision          Floating-point        Floating-point       8-bit fixed-point

8. The multi-mode high-throughput decoder
Based on the above techniques, namely the reconfigurable switch network, offset-threshold decoding, split-row MMSA core, early-stopping scheme and multi-block scheme, we implement the multi-mode high-throughput LDPC decoder on an Altera Stratix III FPGA. The proposed decoder supports 27 modes, including nine different code lengths and three different code rates, and a maximum of 31 iterations. The configurations for code length, code rate and iteration number are completely on-the-fly. Further, it has a BER gap of less than 0.2 dB against floating-point LMMSA, while achieving a stable net throughput of 721.58 Mbps under code rate R = 1/2 and 20 iterations (corresponding to a bit-throughput of 1.44 Gbps). With the early-stopping module working, the net throughput can boost up to 1.2 Gbps (bit-throughput 2.4 Gbps), calculated under an average of 12 iterations. The features are listed in Table 4.

Table 4 Features of the multi-mode high-throughput decoder
FPGA platform           Altera Stratix III EP3SL340F1517C2
Decoding scheme         Layered offset-threshold MSA
Modes supported         9 × 3 = 27 modes
Code length             N = 1536:768:6144 (zf = 64:32:256)
Code rate               R = 1/2, 2/3, 3/4 (Hb: 12 × 24, 8 × 24, 6 × 24)
Iteration number        iter = 1-31, 20 recommended
Resource usage          149,976 LEs, 3,157,136 memory bits
BER performance gap     ≤ 0.2 dB vs. 20-iteration float LMMSA
Clock setup             225.58 MHz
Stable net throughput   721.58 Mbps (zf = 256, R = 1/2, iter = 20)
Max. net throughput     1.2 Gbps (early-stopping, iter = 12 on average)

One great advantage of the proposed multi-mode high-throughput LDPC decoder is that more modes can be supported with only more memory bits consumed and no architecture-level change. Since the reconfigurable switch network supports all expansion factors zf ≤ 256, and the layered MMSA cores support arbitrary QC-LDPC codes, more code lengths and code rates are naturally supported, for example the WiMAX codes (zf = 24:4:96, R = 1/2, 2/3, 3/4, 5/6, 114 modes in total). The only cost is that more memory bits are required to store the new base matrices Hb.

9. Conclusion
In this article, a novel LDPC code construction, verification, and implementation methodology is proposed, which can produce LDPC codes with both good decoding performance and high hardware efficiency. Additionally, a GPU verification platform is built that achieves a 490× speed-up against the CPU, and a multi-mode high-throughput decoder is implemented on FPGA, achieving a net throughput of 1.2 Gbps with a performance loss within 0.2 dB.

Additional material
Additional file 1: Algorithms. This file contains Algorithm 1, the memory conflict minimization algorithm; Algorithm 2, the ACO algorithm for the TSP; Algorithm 3, the SA based LDPC construction framework; Algorithm 4, the GPU based LDPC simulation; and Algorithm 5, the semi-parallel early-stopping algorithm.

Acknowledgements
This paper is partially sponsored by the Shanghai Basic Research Key Project (No. 11DZ1500206) and the National Key Project of China (No. 2011ZX03001-002-01).

Competing interests
The authors declare that they have no competing interests.

Received: 15 May 2011 Accepted: 6 March 2012
Published: 6 March 2012

References
1. R Gallager, Low-density parity-check codes. IRE Trans. Inf. Theory. 8(1), 21–28 (1962). doi:10.1109/TIT.1962.1057683
2. R Tanner, A recursive approach to low complexity codes. IEEE Trans. Inf. Theory. 27(9), 533–547 (1981)
3. D MacKay, Good error-correcting codes based on very sparse matrices. IEEE Trans. Inf. Theory. 45(3), 399–431 (1999)
4. T Richardson, M Shokrollahi, R Urbanke, Design of capacity approaching irregular low-density parity-check codes. IEEE Trans. Inf. Theory. 47(2), 619–637 (2001). doi:10.1109/18.910578
5. J Chen, RM Tanner, C Jones, Y Li, Improved min-sum decoding algorithms for irregular LDPC codes, in Proc. ISIT, (Adelaide, 2005), pp. 449–453
6. DE Hocevar, A reduced complexity decoder architecture via layered decoding of LDPC codes, in IEEE Workshop on SiPS, pp. 107–112 (2004)
7. Y Hu, E Eleftheriou, DM Arnold, Regular and irregular progressive edge-growth Tanner graphs. IEEE Trans. Inf. Theory. 51(1), 386–398 (2005)
8. D Vukobratovic, V Senk, Generalized ACE constrained progressive edge-growth LDPC code design. IEEE Comm. Lett. 12(1), 32–34 (2008)
9. AJ Blanksby, CJ Howland, A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE J. Solid-State Circuits. 37(3), 404–412 (2002). doi:10.1109/4.987093
10. Z Cui, Z Wang, Y Liu, High-throughput layered LDPC decoding architecture. IEEE Trans. VLSI Syst. 17(4), 582–587 (2009)
11. C Marchand, J Dore, L Canencia, E Boutillon, Conflict resolution for pipelined layered LDPC decoders, in IEEE Workshop on SiPS, (Tampere, 2009), pp. 220–225

12. G Falcao, V Silva, L Sousa, How GPUs can outperform ASICs for fast LDPC
decoding, in Proc. international conf on Supercomputing, (New York, 2009),
pp. 390–399
13. J Lin, Z Wang, Efficient shuffle network architecture and application for
WiMAX LDPC decoders, in IEEE Trans. on Circuits and Systems. 56(3),
215–219 (2009)
14. KK Gunnam, GS Choi, MB Yeary, M Atiquzzaman, VLSI architectures for
layered decoding for irregular LDPC codes of WiMax, in IEEE International
Conference on Communications, (Glasgow, 2007), pp. 4542–4547
15. T Brack, M Alles, F Kienle, N Wehn, A synthesizable IP core for WIMAX
802.16E LDPC code decodings, in IEEE Inter. Symp. on Personal, Indoor and
Mobile Radio Comm, (Helsinki, 2006), pp. 1–5
16. K Tzu-Chieh, AN Willson, A flexible decoder IC for WiMAX QC-LDPC codes,
in Custom Integrated Circuits Conference, (San Jose, 2008), pp. 527–530
17. M Dorigo, LM Gambardella, Ant colonies for the travelling salesman
problem. Biosystems. 43(2), 73–81 (1997). doi:10.1016/S0303-2647(97)01708-
5
18. S Kirkpatrick, CD Gelatt, MP Vecchi, Optimization by simulated annealing.
Science, New Series. 220(4598), 671–680 (1983)
19. IEEE Standard for Local and Metropolitan Area Networks Part 16. IEEE
Standard 802.16e (2008)
20. M Rovini, G Gentile, F Rossi, Multi-size circular shifting networking for
decoders of structured LDPC codes. Electron Lett. 43(17), 938–940 (2007).
doi:10.1049/el:20071157
21. J Tang, T Bhatt, V Sundaramurthy, Reconfigurable shuffle network design in
LDPC decoders, IEEE Intern Conf ASAP, (Steamboat Springs, CO, 2006), pp. 81–86
22. D Oh, K Parhi, Area efficient controller design of barrel shifters for
reconfigurable LDPC decoders, in IEEE Intern Symp on Circuits and Systems,
(Seattle, 2008), pp. 240–243
23. L Jin, Y Xiao-hu, L Jing, Early stopping for LDPC decoding: convergence of
mean magnitude (CMM). IEEE Commun Lett. 10(9), 667–669 (2006).
doi:10.1109/LCOMM.2006.1714539
24. S Donghyuk, H Kyoungwoo, O Sangbong, A Jeongseok Ha, A stopping
criterion for low-density parity-check codes, in Vehicular Technology
Conference, (Dublin, 2007), pp. 1529–1533
25. F Kienle, N Wehn, Low complexity stopping criterion for LDPC code
decoders, in Vehicular Technology Conference. 1, 606–609 (2005)

doi:10.1186/1687-1499-2012-84
Cite this article as: Yu et al.: Systematic construction, verification and
implementation methodology for LDPC codes. EURASIP Journal on
Wireless Communications and Networking 2012 2012:84.
