Systematic construction, verification and implementation methodology for LDPC codes
Abstract
In this article, a novel and systematic low-density parity-check (LDPC) code construction, verification and
implementation methodology is proposed. The methodology is composed of a simulated annealing (SA) based LDPC
code constructor, a GPU-based high-speed code selector, an ant colony optimization (ACO) based pipeline scheduler
and an FPGA-based hardware implementer. Compared to traditional approaches, this methodology enables us to
construct LDPC codes that are aware of both decoding performance and hardware efficiency in a short time. Simulation
results show that the generated codes have far fewer cycles (all length-6 cycles eliminated) and memory conflicts
(75% reduction in idle clocks), with no BER performance loss compared to the WiMAX codes. Additionally, the
simulation runs 490 times faster than on CPU under floating-point precision, and a net throughput of 24.5 Mbps is achieved.
Finally, a multi-mode LDPC decoder with a net throughput of 1.2 Gbps (bit throughput 2.4 Gbps) is implemented on FPGA,
with completely on-the-fly configuration and less than 0.2 dB BER performance loss.
Keywords: low-density parity-check codes, simulated annealing, ant colony optimization, graphics processing unit,
decoder architecture
© 2012 Yu et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution
License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
Yu et al. EURASIP Journal on Wireless Communications and Networking 2012, 2012:84 Page 2 of 13
http://jwcn.eurasipjournals.com/content/2012/1/84
[Figure 1: The proposed methodology: the SA-based LDPC code constructor, the GPU-based high-speed performance evaluator, the ACO-based pipelining schedule optimizer, and the FPGA-based multi-mode high-throughput LDPC decoder hardware architecture implementer, which together produce good LDPC codes and their hardware implementation.]
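The four-stage flow in Figure 1 can be sketched as a simple driver loop. Every function below is a hypothetical stand-in for the corresponding stage (constructor, evaluator, scheduler), not the authors' actual implementation; the acceptance criteria are illustrative only.

```python
def sa_construct(seed):
    """Stand-in for the SA-based constructor: returns a candidate code
    summarized by illustrative metrics (girth, residual idle clocks)."""
    return {"id": seed, "girth": 8 if seed % 3 == 0 else 6, "idle_clocks": seed % 10}

def gpu_evaluate(code):
    """Stand-in for the GPU-based evaluator: True if the BER target is met."""
    return code["girth"] >= 8

def aco_schedule(code):
    """Stand-in for the ACO-based scheduler: residual idle clocks after
    pipeline rescheduling (here simply reduced by a fixed amount)."""
    return max(code["idle_clocks"] - 5, 0)

def select_code(max_trials=100):
    """Loop candidates until one passes both performance and efficiency
    checks; the survivor would be handed to the FPGA implementer."""
    for seed in range(max_trials):
        code = sa_construct(seed)
        if gpu_evaluate(code) and aco_schedule(code) == 0:
            return code
    return None
```

The point of the sketch is only the control flow: performance screening and pipeline optimization both gate the hand-off to hardware implementation.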
codes, especially the error floor, is then evaluated by the high-speed GPU-based simulation platform. Next, the hardware pipeline of the selected codes is optimized by the ant colony optimization (ACO) based scheduling algorithm, which removes most of the memory conflicts. Finally, detailed implementation schemes are proposed, i.e., the reconfigurable switch network (adopted from [13]), offset-threshold decoding, the split-row MMSA core, the early-stopping scheme and the multi-block scheme, and the corresponding multi-mode high-throughput decoder of the optimized codes is implemented on FPGA. The novelties of the proposed methodology are as follows:

• Compared to traditional methods (PEG, ACE), the SA-based constructor takes both decoding performance and hardware efficiency into consideration during the construction process.
• Compared to existing work [11], the ACO-based scheduler covers both layer and element permutation and maps the problem to a double-layered TSP, which is a complete solution and provides a better pipelining schedule.
• Compared to existing works, the GPU-based evaluator is the first to implement the semi-parallel layered architecture on GPU. The obtained net throughput is similar to the highest reported [12] (about 25 Mbps), while the proposed scheme has higher precision and better BER performance. Further, we put the whole coding and decoding system on the GPU, rather than a single decoder.
• Compared to existing FPGA or ASIC implementations [14-16], the proposed multi-mode high-throughput decoder not only supports multiple modes with completely on-the-fly configuration, but also has a performance loss within 0.2 dB against float precision and 20 iterations, and a stable net throughput of 721.58 Mbps under code rate 1/2 and 20 iterations. With the early-stopping scheme, a net throughput of 1.2 Gbps is further achieved on a Stratix III FPGA.

The remainder of this article is organized as follows. Section 2 presents the background of our research. Sections 3, 4, and 5 introduce the ACO-based pipeline scheduler, the SA-based code constructor and the GPU-based performance evaluator, respectively, followed by the hardware implementation schemes and issues of the multi-mode high-throughput LDPC decoder discussed in Section 6. Simulation results are provided in Section 7 and hardware implementation results are given in Section 8. Finally, Section 9 concludes this article.

2. Background
2.1. LDPC codes and Tanner graph
An LDPC code is a special linear block code, characterized by a sparse parity-check matrix H with dimensions M × N; Hj,i = 1 if code bit i is involved in parity-check equation j, and 0 otherwise. An LDPC code is usually described by its Tanner Graph, a bipartite graph defined on the code bit set ℝ and the parity-check equation set ℂ, whose elements are called "bit nodes" and "check nodes", respectively. An edge is assigned between bit node BNi and check node CNj if Hj,i = 1. A simple 4 × 6 LDPC code and the corresponding Tanner Graph are shown in Figure 2.
Quasi-cyclic LDPC codes (QC-LDPC) are a popular class of structured LDPC codes, defined by a base matrix Hb whose elements satisfy −1 ≤ Hbj,i < zf, where zf is called the expansion factor. Each element in the base matrix is further expanded to a zf × zf matrix to obtain H. The elements Hbj,i = −1 are expanded to zero matrices, while
the others are expanded to cyclic-shift identity matrices with permutation factors Hbj,i ≥ 0. QC-LDPC is naturally suitable for layered algorithms: the j-th row of the base matrix is exactly layer j. We call the "1"s of the j-th row the set

p = {Hbj,i | Hbj,i ≥ 0}  (1)

See Figure 3 for an example of a 4 × 6 base matrix with zf = 4.

[Figure 2: (a) H matrix and (b) Tanner Graph of an LDPC code, with bit nodes, check nodes, a length-6 cycle and the position of the split row marked.]

[Figure 3: A 4 × 6 base matrix with expansion factor zf = 4 and its permutation factors.]

2.2. The BP algorithm and the effect of cycles
The BP algorithm is a general soft decoding scheme for codes described by a Tanner Graph. It can be viewed as a process of iterative message exchange between bit nodes and check nodes. In each iteration, every bit node and check node collects the messages passed from its neighborhood, updates its own message, and passes the updated message back to its neighborhood. The BP algorithm has many modified versions, such as log-domain BP, MSA, and layered BP. All of them originate from the basic log-domain message-passing equations, given as follows:

L(qij) = L(Qi) − L(rji)  (2)

L(Qi) = L(ci) + Σj∈Ci L(rji) = L(qij) + L(rji)  (3)

where L(ci) is the initial channel message, L(qij) is the message passed from BNi to CNj, L(rji) is the message in the inverse direction, and L(Qi) is the a-posteriori LLR of bit node BNi. Ci is the neighbor set of BNi, ℛj is the neighbor set of CNj, and Ψ(x) = log((e^x + 1)/(e^x − 1)). These equations can also be applied in layered BP; the difference is that L(qij) and L(rji) should be updated in each layer of the iteration.
The above equations require the independence of all the messages L(qi′j), i′ ∈ ℛj. However, the existence of cycles in the Tanner Graph invalidates this independence assumption and thus degrades the BER performance of the BP algorithm. A length-6 cycle is shown with bold lines in Figure 2. In this case, if the BP algorithm proceeds for more than three iterations, the received messages of the involved bit nodes v2, v4, v5 will partly contain their own messages sent three iterations before. For this reason, the minimum cycle length in the Tanner Graph, called the "girth", has a strong relationship with BER performance and is considered an important metric in LDPC code construction algorithms (PEG, ACE) [7,8].

2.3. Decoder architecture and memory conflict
The semi-parallel structure with a layered MMSA core is a popular decoder architecture due to its good tradeoff among low complexity, high BER performance and high throughput. As shown in Figure 4, the main components of the top-level architecture include an LLRSUM RAM storing L(Qi), an LLREX RAM storing L(rji), and a layered MMSA core pipeline. The two RAMs should be readable and writable. Old values of L(Qi) and L(rji) are read, and new values are calculated through the pipeline and written back to the RAMs. For QC-LDPC codes, the values are processed layer by layer, and the "1"s in each layer are processed one by one.

[Figure 4: The semi-parallel decoder architecture and its pipeline timing (Read/Write stages of L(Qi) along the time axis), showing a read-before-write conflict and the decode pipeline delay.]

Memory conflict is a critical problem that constrains the throughput of the semi-parallel decoder. Essentially, a memory conflict occurs when the read-after-write (RAW) dependency of L(Qi) is violated. Note that the new value of L(Qi) will not be written back to RAM until the pipelined calculation finishes. If L(Qi) is needed again during this calculation period, the old value will be read while the new one is still under processing; see L(Q6) in Figure 4. This case happens when layers j and j + l have "1"s in the same position i (Hbj,i ≥ 0, Hbj+l,i ≥ 0). We call it a gap-l conflict.
Memory conflict slows the decoding convergence and thus reduces the BER performance. The traditional method of handling memory conflicts is to insert idle clocks into the pipeline, at the cost of throughput reduction. Obviously, the smaller l is, the more idle clocks must be inserted, since the pipeline needs to [...]
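The gap-l conflict just defined can be checked directly on the base matrix: layers j and j + l conflict whenever they share a column with a non-negative entry. A minimal sketch follows; the toy matrix and the wrap-around between iterations are our illustrative assumptions, not taken from the paper.

```python
def gap_l_conflicts(Hb, max_gap):
    """Find gap-l conflicts in a QC-LDPC base matrix Hb: layers j and j+l
    sharing a '1' (entry >= 0) in some column i (entries -1 expand to the
    all-zero matrix). Layers wrap around between decoding iterations."""
    M, N = len(Hb), len(Hb[0])
    conflicts = []
    for j in range(M):
        for l in range(1, max_gap + 1):
            k = (j + l) % M
            for i in range(N):
                if Hb[j][i] >= 0 and Hb[k][i] >= 0:
                    conflicts.append((j, k, l, i))
    return conflicts

# A toy 4-layer base matrix: column 0 is shared by layers 0 and 1,
# and column 2 by layers 3 and 0 (wrap-around) -- two gap-1 conflicts.
Hb = [
    [ 3, -1,  0, -1],
    [ 1, -1, -1,  2],
    [-1,  0, -1, -1],
    [-1, -1,  1, -1],
]
```

Each reported conflict forces idle clocks into the pipeline, which is exactly what the scheduler of Section 3 tries to minimize.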
conflicts, denoted c1, c2 and c3, are considered the metrics for measuring memory conflict.

3. The ACO-based pipelining scheduler
In this section, we propose the ACO-based pipeline scheduling algorithm to minimize memory conflict. We first formulate the problem, then map it to a double-layered TSP, and finally use ACO to solve it.

3.1. Problem formulation
Consider a QC-LDPC code described by its base matrix H with dimensions M × N. Thus, there are M layers. Denote by wm, 1 ≤ m ≤ M, the number of elements ("1"s) in the m-th layer. Denote by hm,n, 1 ≤ n ≤ wm, the column index in H of the n-th element of the m-th layer. Additionally, we assume the core pipeline has K stages.
As discussed above, the decoder processes all the "1"s in H exactly once per iteration, layer by layer, and element by element within each layer. However, the order can be arbitrary, which enables us to schedule the elements carefully to minimize memory conflict. We have two ways to do so.

• Layer permutation: We can assign which layer is processed first and which next. If two layers i, j have "1"s at totally different positions, i.e., no k, l exist such that hi,k = hj,l, they tend to be assigned as adjacent layers with no conflict.
• Element permutation: Within a certain layer, we can assign which element is processed first and which next. If two adjacent layers i, j still have a conflict, i.e., hi,k = hj,l for some k, l, then we can assign element k to be first in layer i, and l to be last in layer j. In this way, we increase the time interval between the conflicting elements k and l.

Therefore, the memory conflict minimization problem is exactly a scheduling problem, in which layer permutation and element permutation should be designed to minimize the number of idle pipeline clock insertions. We denote layer permutation as m → lm, 1 ≤ m, lm ≤ M, and element permutation of layer m as n → μm,n, 1 ≤ n, μm,n ≤ wm.
Based on the above definitions, a memory conflict occurs between layer i, element k and layer j, element l if the following conditions are satisfied: (1) layers i, j are assigned to be adjacent, i.e., lj = li + 1; (2) hi,k = hj,l; (3) the pipeline time interval is less than the number of pipeline stages, i.e., wi − μi,k + μj,l ≤ K. Further, we define the "conflict set" C(i, j) = {(k, l) | elements (i, k) and (j, l) cause a memory conflict}, and the "conflict stages", also the minimum number of idle clocks inserted due to this conflict, as

c(i, k; j, l) = max{K − (wi − μi,k + μj,l), 0}  (4)

3.2. The double-layered TSP
This part introduces the mapping from the above memory conflict minimization problem to a double-layered TSP. TSP is a famous NP-hard problem, in which a salesman must find the shortest path that visits all n cities exactly once and finally returns to the starting point. Denote by di,j the distance between city i and city j. TSP can be described mathematically as follows: given the distance matrix D = [di,j]n×n, find the optimal permutation of the city indices x1, x2, ..., xn that minimizes the loop distance,

min Σi=1..n−1 dxi,xi+1 + dxn,x1  (5)

Compared to layer permutation, which contributes most of the memory conflict reduction, element permutation only makes minor adjustments once the layer permutation is determined. Therefore, we map the problem to a double-layered TSP, where layer permutation is mapped to the first layer, and element permutation is mapped to the second layer based on the result of the first layer. Details are as follows:

• Layer permutation layer: In this layer we only deal with layer permutation. We define the "distance",
also the "cost", between layers i and j as the minimum number of idle clocks inserted before the processing of layer j. If more than one conflict position pair exists, i.e., |C(i, j)| > 1, then we should take the maximum one. Thus, in this layer, the distance matrix is defined by

di,j = max(k,l)∈C(i,j) c(i, k; j, l)  (6)

and the target function remains the same as (5).
• Element permutation layer: In this layer we inherit the layer permutation result and map the element permutation of each layer to an independent TSP. In the TSP for layer i, we fix the schedules of the prior layer p (lp = li − 1) and the next layer q (lq = li + 1), and only tune the elements of layer i. We define the "distance" dk,l as the change in the number of idle clocks if element k is assigned to position l, i.e., μi,k = l. Note that element k can conflict with layer p or q, and dk,l varies with the conflict case, given by

dk,l = 0, if k conflicts with both layers or with neither;
dk,l = k − l, if k conflicts only with layer p;
dk,l = l − k, if k conflicts only with layer q.  (7)

Since the largest dk,l becomes the bottleneck of element permutation, the target function changes to the following min-max form:

min max{dx1,x2, dx2,x3, ..., dxn−1,xn, dxn,x1}  (8)

3.3. The ACO-based algorithm
This part introduces the ACO-based algorithm that solves the double-layered TSP discussed above. ACO is a heuristic algorithm for computational problems that can be reduced to finding good paths through graphs. Its idea originates from mimicking the behavior of ants seeking a path between their colony and a source of food. ACO is especially suitable for solving TSP.
Algorithm 1 [see Additional file 1] gives the ACO-based double-layered memory conflict minimization algorithm. First we try layer permutation LAYER1_MAX times, and for each layer permutation, we try element permutation LAYER2_MAX times. We record the pipeline schedule with the fewest idle clocks as the best solution of the algorithm.
The detailed ACO algorithm for TSP is described in Algorithm 2. We try SOL_MAX solutions, and for each solution, all ants must finish CYCLE_MAX cycles, in which the shortest cycle is recorded as the best solution. One ant cycle finishes in VERTEX_NUM ant-move steps, where one step consists of four sub-steps: Ant Choose, Ant Move, Local Update and Global Update. Further, a bonus is rewarded to the shortest cycle. All specific parameters (e.g., p and [...]) follow the suggestions of [17].

4. The SA-based code constructor
In this section, we propose a joint optimized construction algorithm that takes both performance and efficiency into consideration while constructing the H matrix of the LDPC code. We first give the SA-based framework and then discuss the details of the algorithm.

4.1. Problem formulation
We now deal with the classic code construction problem. Given the code length N, the code rate R, and perhaps other constraints such as QC-RA type (e.g., WiMAX, DVB-S2) or a fixed degree distribution (optimized by density evolution), we should construct a "good" LDPC code, described by its H matrix, that meets practical needs. The word "good" here mainly covers the following two metrics.

• High performance, which means the code should have high coding gain and good BER/BLER performance, including an early water-fall region, a low error floor and anti-fading ability. This is strongly related to large girth, a large ACE spectrum, few trapping sets, etc.
• High efficiency, which means the implementation of the encoder and decoder should have moderate complexity and high throughput. This is strongly related to QC-RA type, a high degree of parallelism, a short decoding pipeline, few memory conflicts, etc.

Traditional construction methods, such as PEG and ACE, mainly focus on the high performance of the code, which motivates us to find a joint optimized construction method concerning both performance and efficiency.

4.2. The double-stage SA framework
In this part, we introduce the double-stage SA [18] based framework for the joint optimized construction problem. SA is a generic probabilistic metaheuristic for global optimization problems that locates a good approximation to the global optimum of a given function in a large search space. Since our search space is a large 0-1 matrix space, denoted {0, 1}M×N, SA is very useful for this problem.
Note that the performance metric is the more important metric for LDPC construction compared with
the efficiency metric. Therefore, we divide the algorithm into two stages, aiming at performance and efficiency, respectively, and regard performance as the major stage that should be satisfied first. For a specific target measured by the "performance energy" e1 and the "efficiency energy" e2, we set two thresholds: an upper bound e1h = e1 and a lower bound e1l < e1. The algorithm enters the second stage when the current performance energy is less than e1l. In the second stage, the algorithm keeps the performance energy no larger than e1h and tries to reduce e2. Algorithm 3 shows the details.

4.3. Details of the algorithm
This part discusses the details of the important functions and configurations of Algorithm 3.

• sample_temperature is the temperature sampling function, decreasing with k. It can have an exponential form ae−bk.
• prob is the acceptance probability function of the new search point h_new. If h_new is better (E_new < E), it returns 1; otherwise, it decreases with E_new − E and increases with t. It can have an exponential form ae−b(E_new−E)/t.
• perf_energy is the performance energy function. It evaluates the performance-related factors of the matrix h and gives a lower energy for better performance. Typically, we can calculate the number of length-l cycles cl, then calculate a total cost Σl wl cl, where wl is the cost weight of a length-l cycle, decreasing with l.
• effi_energy is the efficiency energy function, similar to perf_energy except that it gives a lower energy for higher efficiency. Typically, we can calculate the number of gap-l memory conflicts cl, then calculate a total cost Σl wl cl, where wl is the cost weight of a gap-l conflict, decreasing with l.
• perf_neighbor searches for a neighbor of h in the matrix space when aiming at performance, based on minor changes to h. For QC-LDPC, we can define three atomic operations on the base matrix Hb as follows.
- Horizontal swap: For chosen rows i, j and columns k, l, swap the values of Hbi,k and Hbi,l, then swap the values of Hbj,k and Hbj,l.
- Vertical swap: For chosen rows i, j and columns k, l, swap the values of Hbi,k and Hbj,k, then swap the values of Hbi,l and Hbj,l.
- Permutation change: Change the permutation factor of a chosen element Hbi,k.
For a higher temperature t, we allow the neighbor searching process to search in a wider space. This is done by performing the atomic operations more times.
• effi_neighbor searches for a neighbor of h in the matrix space when aiming at efficiency. This is similar to perf_neighbor; however, we typically remove the permutation change operation, as it does nothing to help reduce conflicts.

5. The GPU-based performance evaluator
In this section, we introduce the implementation of a high-speed LDPC verification platform based on compute unified device architecture (CUDA) supported GPUs. We first give the architecture and algorithm on GPU, and then discuss some details.

5.1. Motivation and architecture
Compute unified device architecture (CUDA) is NVIDIA's parallel computing architecture. It enables dramatic increases in computing performance by executing many parallel independent and cooperating threads on the GPU, and is thus particularly suitable for the Monte Carlo model. The BER simulation of an LDPC code is Monte Carlo in nature, since it collects a huge amount of bit error statistics of the same decoding process, especially in the error floor region where the BER is low (10−7 to 10−10). This motivates us to implement the verification platform on GPU, where many decoders run in parallel, like hardware such as ASIC/FPGA, to provide statistics.
Figure 5 shows our GPU architecture. The CPU is used as the controller, which puts the code into GPU constant memory, launches the GPU kernels and gets back the statistics. In the GPU grid, we implement the whole coding system in each GPU block, including the source generator, LDPC encoder, AWGN channel, LDPC decoder and statistics. Our decoding algorithm is layered MMSA. In each GPU block, we assign zf threads to calculate the new LLRSUM and LLREX of the zf rows in each layer, where zf is the expansion factor of the QC-LDPC code. The zf threads cooperate to complete the decoding job.

5.2. Algorithm and procedure
This part introduces the procedure that implements the GPU simulation, given by Algorithm 4. P × Q blocks run in parallel, each simulating an individual coding system, where P is the number of multiprocessors (MPs) on the device and Q is the number of cores per MP. In each system, zf threads cooperatively do the job of encoding, channel and decoding. When decoding, the threads process data layer after layer, each thread performing
LMMSA for one row of this layer. The procedure ends with the statistics of P × Q LDPC blocks.

5.3. Details and instructions
• Ensure "coalesced access" when reading or writing global memory, or the operation will be auto-serialized. In our algorithm, adjacent threads should access adjacent L(Qi) and L(rji).
• Shared memory and registers are fast yet limited resources, and their use should be carefully planned. In our algorithm, we store L(Qi) in shared memory and L(rji) in registers due to the lack of resources.
• Make sure all the P × Q cores are running. This calls for careful assignment of the limited resources (i.e., warps, shared memory, registers). In our case, we limit the registers per thread to 16 and the threads per block to 128, or some of the Q cores on each MP will "starve" and be disabled.

6. Hardware implementation schemes
6.1. Top-level hardware architecture
Our goal is to implement a multi-mode high-throughput QC-LDPC decoder, which can support multiple code rates and expansion factors on-the-fly. The proposed decoder consists of three main parts, namely the interface part, the execution part and the control part. The top-level architecture is shown in Figure 6.
The interface part buffers the input and output data as well as handling the configuration commands. In the execution part, the LLRSUM and LLREX are read out from the RAMs, updated in the Σ parallel LMMSA cores, and written back to the RAMs, thus forming the LLRSUM loop and the LLREX loop, marked red in Figure 6. The control part generates control signals, including port control, LLRSUM control, LLREX control and iteration control.
Note that the reconfigurable switch network is placed in the LLRSUM loop to support the multi-mode feature. To achieve high throughput, we propose the split-row MMSA core, the early-stopping scheme and the multi-block scheme. The split-row core has two data inputs and two data outputs; hence it also "splits" the LLRSUM RAM and LLREX RAM into two parts, and meanwhile two identical switch networks are needed to shuffle the data simultaneously. We also propose the offset-threshold decoding scheme to improve the BER/BLER performance. The above five techniques are described in detail as follows.

6.2. The reconfigurable switch network
A switch network is an S-input, S-output hardware structure that can put the input signals in an arbitrary order at the output. Formally, given input signals x1, x2, ..., xS with data width W, the output of the switch network has the form xa1, xa2, ..., xaS, where a1, a2, ..., aS is any desired permutation of 1, 2, ..., S. For the design of reconfigurable LDPC decoders, two special kinds of output order are most important, described as follows.

• Full cyclic-shift: The output is a cyclic shift of all S inputs, i.e., xc, xc+1, ..., xS, x1, x2, ..., xc−1, where 1 ≤ c ≤ S.
• Partial cyclic-shift: The output is a cyclic shift of the first p inputs, while the other signals can be in arbitrary order, i.e., xc, xc+1, ..., xp, x1, x2, ..., xc−1, x*, ..., x*, where 1 ≤ c < p < S, and x* can be any signal from xp+1 to xS.

For the implementation of a QC-LDPC decoder, the switch network is an essential module. Suppose Hbj,i = Hbk,i ≥ 0, j < k, and Hbl,i = −1 for any j < l < k; then the same data is involved in the processing of the above two "1"s, i.e., the LLRSUM and LLREX of BNi×zf to
BN(i+1)×zf−1. However, after processing Hbj,i, this data should be cyclic-shifted to ensure the correct order for processing Hbk,i, which corresponds to the full cyclic-shift case with

S = zf, c = (Hbk,i − Hbj,i + S) mod S  (9)

Further, in the case of multiple expansion factors, such as WiMAX [19] (zf = 24:4:96), the partial cyclic-shift is required, with

S = zfmax, p = zf, c = (Hbk,i − Hbj,i + p) mod p  (10)

[Figure 6: The top-level architecture: the interface part (port in/port control, RAM_LLR_IN), the execution part with the LLRINIT RAM, LLRSUM RAM and Σ SR-MMSA cores forming the LLRSUM loop, the control part (iteration control, configuration), and the early-stopping module.]

[6.3. Offset-threshold decoding]
The MMSA check-node update approximates the check-to-bit messages as

sgn(L(rji)) = Π i′∈ℛj\i sgn(L(qi′j))  (11)

abs(L(rji)) = min i′∈ℛj\i abs(L(qi′j))  (12)

In [5], the normalized (13) and offset (14) MMSA schemes are proposed to compensate the loss of the above approximation, described as follows:

abs(L(rji)) = α · abs(L(rji))  (13)

abs(L(rji)) = max(abs(L(rji)) − β, 0)  (14)
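The check-node update of (11)-(12) and the normalized/offset corrections (13)-(14) can be sketched in a few lines. This is an illustrative model only (plain lists, no fixed-point effects); the parameter names alpha and beta mirror α and β above.

```python
def check_node_update(lq, alpha=None, beta=None):
    """Min-sum check-node update per (11)-(12): for each edge i, the output
    magnitude is the minimum |L(q)| over the OTHER edges of the check node,
    and the sign is the product of the other edges' signs. 'alpha' applies
    the normalized correction (13), 'beta' the offset correction (14)."""
    out = []
    for i in range(len(lq)):
        others = lq[:i] + lq[i + 1:]
        sign = 1.0
        for v in others:
            sign *= 1.0 if v >= 0 else -1.0
        mag = min(abs(v) for v in others)
        if alpha is not None:           # normalized MMSA, eq. (13)
            mag *= alpha
        if beta is not None:            # offset MMSA, eq. (14)
            mag = max(mag - beta, 0.0)
        out.append(sign * mag)
    return out
```

In a hardware core, the per-edge minimum over "the others" reduces to tracking only the global minimum and sub-minimum of the row, which is what the split-row core of Section 6.4 parallelizes.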
[Figure 7: The reconfigurable Benes network: a control signal generation module drives two S-input Benes networks, shuffling the inputs LLR_1, ..., LLR_S to the cyclically shifted outputs.]
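The full and partial cyclic-shift mappings (9)-(10) that the switch network must realize can be sketched directly on lists. The 0-indexed rotation convention below is our assumption; hj and hk stand for the permutation factors Hbj,i and Hbk,i.

```python
def full_cyclic_shift(llr, hj, hk):
    """Full cyclic shift, eq. (9): realign a z_f-sized LLR block processed
    with permutation factor hj for the next '1' with factor hk (S = z_f)."""
    S = len(llr)
    c = (hk - hj + S) % S
    return llr[c:] + llr[:c]

def partial_cyclic_shift(llr, hj, hk, p):
    """Partial cyclic shift, eq. (10): only the first p = z_f entries are
    rotated (the network width is S = z_f^max); the tail entries may appear
    in arbitrary order, here simply passed through."""
    c = (hk - hj + p) % p
    return llr[c:p] + llr[:c] + llr[p:]
```

A Benes network implements exactly such permutations in hardware; the control signal generation module of Figure 7 computes the per-stage switch settings from c.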
abs(L(rji)) = min(max(abs(L(rji)) − β, 0), γ)  (15)

The difference between the traditional MSA, normalized MMSA, offset MMSA and offset-threshold MMSA is shown in Figure 9. Simulation results (Figure 8) show that the proposed scheme has the lowest error floor (10−8) among the above schemes, while achieving BLER performance as good as offset MMSA.

[Figure 8: Modified MSA comparison: BLER curves of the normalized, offset and offset-threshold schemes.]

Table 1 Features of the reconfigurable Benes network
Scale: W = 9, S = 256, 15 stages, 128 × 15 MUX

6.4. The split-row MMSA core
This part presents the split-row MMSA core. In the traditional semi-parallel structure with a layered MMSA core (see Figure 4), since the "1"s in the j-th row are processed one by one to find the minimum and sub-minimum of all L(qij), the number of decoding stages K for one iteration is proportional to the number of "1"s in each row of the base matrix Hb. The idea is that, if k "1"s can be processed at the same time, the decoding time of one iteration is shortened by a factor of k, and the throughput gains a factor of k. This is done by the split-row scheme, which vertically splits Hb into multiple parts. The "1"s in each part are processed simultaneously to find the local minimum, and the results are merged together. In this way, for Hb [...] the architecture with k = 2 is shown in Figure 10. The LLRSUM (L(Qi)), LLREX (L(rji)) and LLR (L(qij)) of the left part and the right part are stored in two individual RAMs/FIFOs, respectively. Two minimum/sub-minimum finders pass their results to the merger for the final comparison, thus approximately shortening the process pipeline by half. Note that a split position must exist for the code Hb such that each row in each part contains nearly the same number of "1"s. Otherwise, we would need RAMs with multiple read and write ports, which is not practical for FPGA implementation.

[Figure 10: The architecture of the split-row MMSA core: two subtractor/minimum-finder/LLR-FIFO branches feed a merger, followed by the offset-threshold corrector and comparators, writing back to the LLRSUM and LLREX RAMs.]

6.5. The early-stopping scheme
This part introduces the early-stopping scheme applied in our decoder. In practical scenarios, the decoding process often converges much earlier than the preset maximum number of iterations, especially under favorable transmission conditions when the SNR is large. Thus, if the decoder can terminate the decoding iterations as [...]

[Figure 11: BER/BLER performance and average iterations at termination under different ω (ω = 2M, 2.5M, 3M, and the ideal iteration count), versus Eb/N0.]
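The split-row search of Section 6.4 can be sketched as follows: each half of the row finds its local minimum and sub-minimum in parallel, and a merger combines the two partial results. The function names and the requirement that each half hold at least two magnitudes are our illustrative assumptions.

```python
def two_smallest(vals):
    """Return (minimum, sub-minimum) of a list of magnitudes (len >= 2)."""
    s = sorted(vals)
    return s[0], s[1]

def split_row_min(magnitudes, split):
    """Sketch of the split-row (k = 2) search: the left and right halves are
    scanned in parallel by two finders, and a merger compares the four
    candidates, roughly halving the pipeline length of a serial scan."""
    left, right = magnitudes[:split], magnitudes[split:]
    m1l, m2l = two_smallest(left)
    m1r, m2r = two_smallest(right)
    merged = sorted([m1l, m2l, m1r, m2r])
    return merged[0], merged[1]     # global minimum and sub-minimum
```

Balancing the number of "1"s across the two halves, as the text requires, keeps the two finders' latencies equal so neither branch stalls the merger.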
v1 .HT = 0 v2 .HT = 0 (17) Table 2 Cycle and conflict performance of the two codes
Candidate code WiMAX code
Thus the vector (v1 v2) is also one legal block for Hv: Cycle:length 6/8 0/55 5/150
Conflict:gap 1/2/3 0/3/9 5/11/15
HT 0 Pipeline occupancy Before ACO: 76/88 Only layer
v1 v2 .HTυ = v1 v2 . =0 (18)
0 HT After ACO: 76/81 permu.: 76/96
The key observation is that there are no memory conflicts between the two codes H due to the diagonal form of $H_v$. This enables us to reorder and combine the decoding schedules of the two codes to reduce the memory conflicts of each code. We rewrite H and $H_v$ as follows:

$H = \begin{pmatrix} H_1 \\ \vdots \\ H_M \end{pmatrix}, \qquad H_v^{\mathrm{opt}} = \begin{pmatrix} H_1^{(1)} & 0 \\ 0 & H_1^{(2)} \\ \vdots & \vdots \\ H_M^{(1)} & 0 \\ 0 & H_M^{(2)} \end{pmatrix} \qquad (19)$

where $H_i^{(j)}$ denotes the i-th row of the j-th code. The decoding schedule is given by the above equation, i.e., $H_i^{(1)}$ comes first, followed by $H_i^{(2)}$, and then $H_{i+1}^{(1)}$, and so forth. The benefit of this "multi-block" scheme is that the insertion of $H_i^{(2)}$ provides extra stages for the conflicts between $H_i^{(1)}$ and $H_{i+1}^{(1)}$.

To sum up, the "multi-block" scheme changes any gap-l memory conflict to gap-(2l − 1), and thus can improve the pipeline efficiency significantly. Meanwhile, it demands no extra logic resources (LE) for the design, but may double the memory bits for buffering two encoded blocks. Since the depth of memory is not fully used on our FPGA, the proposed method can make full use of it with no extra resource cost.

7. Numerical simulation
In this section, we show how our platform produces [...] configurations as WiMAX for our SA-based constructor. We set "cycle" as the performance metric and memory conflict as [...] under short pipeline (when K ≤ wm).

Figure 12 BER and BLER performance of the two codes. (Figure: BER and BLER curves of the candidate code and the WiMAX code.)

We simulate the candidate code and the WiMAX code on the GPU platform. The BER/BLER performance is shown in Figure 12, while the platform parameters and throughput are listed in Table 3. The water-fall region and the error floor of our candidate code are almost the same as those of the WiMAX code. For speed comparison, we also include the fastest result ever reported [12]. The "net throughput" is defined as the decoded "message bits" per second, given by:

$\text{net throughput} = \dfrac{P \cdot Q \cdot N \cdot R}{t} \qquad (20)$

where t is the time consumed running through the GPU kernel (for us, Algorithm 4). As shown in Table 3, our GPU platform speeds up 490 times against the CPU and achieves a net throughput of 24.5 Mbps. Further, our throughput approaches the fastest one, while providing better precision (floating-point vs. 8-bit fixed-point) for the simulation.

Finally, we optimize the pipeline schedule with the ACO-based scheduler, shown in Table 2. The "pipeline occupancy" is given by the running/total clocks required for one iteration. For the candidate code, the number of idle clock insertions after ACO is 5, compared with 12 before ACO, achieving a 58.3% reduction. For the WiMAX code, 20 idle clock insertions are still required after the layer-permutation-only (single-layer) scheme
proposed by [11]. In this case, the double-layered ACO achieves a 75% reduction against the single-layer scheme (5 vs. 20 idle clocks).

Table 3 Parameters and performance: GPU vs CPU (20 iterations)

                      GPU (ours)            CPU                   GPU [12]
  Platform            NV. GTX260            Intel Core2 Quad      NV. 8800GTX
  Clock frequency     1.24 GHz              2.66 GHz              1.35 GHz
  Decoding method     Semi-parallel LMMSA   Semi-parallel LMMSA   Full-parallel BP
  Blocks × threads    216 × 96              1                     128 × 256
  Net throughput      24.5 Mbps             50 Kbps               25 Mbps
  Precision           Floating-point        Floating-point        8-bit fixed-point

8. The multi-mode high-throughput decoder
Based on the above techniques, namely, the reconfigurable switch network, offset-threshold decoding, the split-row MMSA core, the early-stopping scheme, and the multi-block scheme, we implement the multi-mode high-throughput LDPC decoder on an Altera Stratix III FPGA. The proposed decoder supports 27 modes, including nine different code lengths and three different code rates, with a maximum of 31 iterations. The configurations for code length, code rate, and iteration number are completely on-the-fly. Further, it has a BER gap of less than 0.2 dB against floating-point LMMSA, while achieving a stable net throughput of 721.58 Mbps under code rate R = 1/2 and 20 iterations (corresponding to a bit-throughput of 1.44 Gbps). With the early-stopping module working, the net throughput can boost up to 1.2 Gbps (bit-throughput 2.4 Gbps), calculated under an average of 12 iterations. The features are listed in Table 4.

Table 4 Features of the multi-mode high-throughput decoder

  FPGA platform           Altera Stratix III EP3SL340F1517C2
  Decoding scheme         Layered offset-threshold MSA
  Modes supported         9 × 3 = 27 modes
  Code length             N = 1536:768:6144 (zf = 64:32:256)
  Code rate               R = 1/2, 2/3, 3/4 (Hb: 12 × 24, 8 × 24, 6 × 24)
  Iteration number        iter = 1–31, 20 recommended
  Resources usage         149,976 LE, 3,157,136 bits memory
  BER performance gap     ≤ 0.2 dB vs. 20-iteration float LMMSA
  Clock setup             225.58 MHz
  Stable net throughput   721.58 Mbps (zf = 256, R = 1/2, iter = 20)
  Max. net throughput     1.2 Gbps (early-stopping, iter = 12 ave.)

One great advantage of the proposed multi-mode high-throughput LDPC decoder is that more modes can be supported at the cost of only more memory bits, with no architecture-level change. Since the reconfigurable switch network supports all expansion factors zf ≤ 256, and the layered MMSA cores support arbitrary QC-LDPC codes, more code lengths and code rates are naturally supported, for example, the WiMAX codes (zf = 24:4:96, R = 1/2, 2/3, 3/4, 5/6, 114 modes in total). The only cost is that more memory bits are required to store the new base matrices Hb.

9. Conclusion
In this article, a novel LDPC code construction, verification, and implementation methodology is proposed, which can produce LDPC codes with both good decoding performance and high hardware efficiency. Additionally, a GPU verification platform is built that achieves a 490× speedup against the CPU, and a multi-mode high-throughput decoder is implemented on FPGA, achieving a net throughput of 1.2 Gbps with a performance loss within 0.2 dB.

Additional material

Additional file 1: Algorithm. This file contains Algorithm 1, Memory conflict minimization algorithm; Algorithm 2, ACO algorithm for TSP; Algorithm 3, The SA-based LDPC construction framework; Algorithm 4, The GPU-based LDPC simulation; and Algorithm 5, Semi-parallel early-stopping algorithm.

Acknowledgements
This paper is partially sponsored by the Shanghai Basic Research Key Project (No. 11DZ1500206) and the National Key Project of China (No. 2011ZX03001-002-01).

Competing interests
The authors declare that they have no competing interests.

Received: 15 May 2011 Accepted: 6 March 2012
Published: 6 March 2012

References
1. R Gallager, Low-density parity-check codes. IRE Trans. Inf. Theory. 8(1), 21–28 (1962). doi:10.1109/TIT.1962.1057683
2. R Tanner, A recursive approach to low complexity codes. IEEE Trans. Inf. Theory. 27(9), 533–547 (1981)
3. D MacKay, Good error-correcting codes based on very sparse matrices. IEEE Trans. Inf. Theory. 45(3), 399–431 (1999)
4. T Richardson, M Shokrollahi, R Urbanke, Design of capacity approaching irregular low-density parity-check codes. IEEE Trans. Inf. Theory. 47(2), 619–637 (2001). doi:10.1109/18.910578
5. J Chen, RM Tanner, C Jones, Y Li, Improved min-sum decoding algorithms for irregular LDPC codes, in Proc. ISIT, (Adelaide, 2005), pp. 449–453
6. DE Hocevar, A reduced complexity decoder architecture via layered decoding of LDPC codes, in IEEE Workshop on SiPS, pp. 107–112 (2004)
7. Y Hu, E Eleftheriou, DM Arnold, Regular and irregular progressive edge-growth Tanner graphs. IEEE Trans. Inf. Theory. 51(1), 386–398 (2005)
8. D Vukobratovic, V Senk, Generalized ACE constrained progressive edge-growth LDPC code design. IEEE Comm. Lett. 12(1), 32–34 (2008)
9. AJ Blanksby, CJ Howland, A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. J. Solid State Circ. 37(3), 404–412 (2002). doi:10.1109/4.987093
10. Z Cui, Z Wang, Y Liu, High-throughput layered LDPC decoding architecture. IEEE Trans. VLSI Syst. 17(4), 582–587 (2009)
11. C Marchand, J Dore, L Canencia, E Boutillon, Conflict resolution for pipelined layered LDPC decoders, in IEEE Workshop on SiPS, (Tampere, 2009), pp. 220–225
12. G Falcao, V Silva, L Sousa, How GPUs can outperform ASICs for fast LDPC
decoding, in Proc. international conf on Supercomputing, (New York, 2009),
pp. 390–399
13. J Lin, Z Wang, Efficient shuffle network architecture and application for WiMAX LDPC decoders. IEEE Trans. Circuits Syst. II. 56(3), 215–219 (2009)
14. KK Gunnam, GS Choi, MB Yeary, M Atiquzzaman, VLSI architectures for
layered decoding for irregular LDPC codes of WiMax, in IEEE International
Conference on Communications, (Glasgow, 2007), pp. 4542–4547
15. T Brack, M Alles, F Kienle, N Wehn, A synthesizable IP core for WiMAX 802.16E LDPC code decoding, in IEEE Int. Symp. on Personal, Indoor and Mobile Radio Comm, (Helsinki, 2006), pp. 1–5
16. K Tzu-Chieh, AN Willson, A flexible decoder IC for WiMAX QC-LDPC codes,
in Custom Integrated Circuits Conference, (San Jose, 2008), pp. 527–530
17. M Dorigo, LM Gambardella, Ant colonies for the travelling salesman problem. Biosystems. 43(2), 73–81 (1997). doi:10.1016/S0303-2647(97)01708-5
18. S Kirkpatrick, CD Gelatt, MP Vecchi, Optimization by simulated annealing.
Science, New Series. 220(4598), 671–680 (1983)
19. IEEE Standard for Local and Metropolitan Area Networks Part 16. IEEE
Standard 802.16e (2008)
20. M Rovini, G Gentile, F Rossi, Multi-size circular shifting networks for decoders of structured LDPC codes. Electron Lett. 43(17), 938–940 (2007). doi:10.1049/el:20071157
21. J Tang, T Bhatt, V Sundaramurthy, Reconfigurable shuffle network design in
LDPC decoders, IEEE Intern Conf ASAP, (Steamboat Springs, CO, 2006), pp. 81–86
22. D Oh, K Parhi, Area efficient controller design of barrel shifters for
reconfigurable LDPC decoders, in IEEE Intern Symp on Circuits and Systems,
(Seattle, 2008), pp. 240–243
23. J Li, X-H You, J Li, Early stopping for LDPC decoding: convergence of mean magnitude (CMM). IEEE Commun Lett. 10(9), 667–669 (2006). doi:10.1109/LCOMM.2006.1714539
24. D Shin, K Heo, S Oh, J Ha, A stopping criterion for low-density parity-check codes, in Vehicular Technology Conference, (Dublin, 2007), pp. 1529–1533
25. F Kienle, N Wehn, Low complexity stopping criterion for LDPC code
decoders, in Vehicular Technology Conference. 1, 606–609 (2005)
doi:10.1186/1687-1499-2012-84
Cite this article as: Yu et al.: Systematic construction, verification and
implementation methodology for LDPC codes. EURASIP Journal on
Wireless Communications and Networking 2012 2012:84.