Flexible, Cost-Efficient, High-Throughput Architecture For Layered LDPC Decoders With Fully-Parallel Processing Units
I. INTRODUCTION
Low Density Parity Check (LDPC) codes are a class of
error correction codes known to closely approach the
Shannon limit under iterative message-passing (MP) decoding
algorithms. MP architectures are composed of processing units
that perform the desired computation by passing messages
to each other. The way such an architecture applies to LDPC
decoding is closely related to the bipartite graph representation
of LDPC codes [1]. This graph comprises two types of nodes, known as
variable-nodes and check-nodes, corresponding respectively to
coded bits and parity-check equations. Accordingly, an LDPC
decoder comprises two types of processing units, namely
Variable-Node Units (VNUs) and Check-Node Units (CNUs),
which exchange messages according to the structure of the
bipartite graph.
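As a minimal sketch of this correspondence, the snippet below derives the two neighborhood maps of the bipartite graph from a small parity-check matrix. The matrix `H_example` and the function name `tanner_neighbors` are illustrative assumptions and do not come from the paper.

```python
def tanner_neighbors(H):
    """Return (check_nbrs, var_nbrs) for a binary parity-check matrix H.

    check_nbrs[m] lists the variable-nodes n with H[m][n] == 1,
    var_nbrs[n] lists the check-nodes m with H[m][n] == 1.
    """
    num_checks = len(H)
    num_vars = len(H[0])
    check_nbrs = [[n for n in range(num_vars) if H[m][n]] for m in range(num_checks)]
    var_nbrs = [[m for m in range(num_checks) if H[m][n]] for n in range(num_vars)]
    return check_nbrs, var_nbrs


# Hypothetical toy matrix (3 parity checks, 6 coded bits), for illustration only.
H_example = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
check_nbrs, var_nbrs = tanner_neighbors(H_example)
```

Each CNU would exchange messages with the variable-nodes listed in `check_nbrs[m]`, and each VNU with the check-nodes listed in `var_nbrs[n]`.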
MP decoders may use different scheduling strategies,
according to the order in which variable-node and check-node messages are updated during the iterative message-passing process.
The classical convention is that, at each iteration, all check-nodes and subsequently all variable-nodes pass new messages
978-1-5090-2817-7/16 $31.00 © 2016 IEEE
DOI 10.1109/DSD.2016.33
// where the subscript SAT denotes the value of the message indexed by (m, n) saturated to q bits; the check-node update ranges over n' in H(m) \ n
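The saturation step and the check-node update over the other neighbors can be sketched as follows. This is a hedged illustration of a plain Min-Sum update, not the paper's algorithm listing: the symmetric saturation range, the name `alpha` for incoming messages, and both function names are assumptions.

```python
def saturate(value, q):
    """Clip an integer to the symmetric q-bit range [-(2**(q-1) - 1), 2**(q-1) - 1]."""
    limit = 2 ** (q - 1) - 1
    return max(-limit, min(limit, value))


def min_sum_check_update(alpha, q):
    """Min-Sum update for one check-node m (assumed form).

    alpha maps each neighbor n of the check-node to its incoming message.
    For each n, the outgoing message uses all other neighbors
    (the set H(m) excluding n): product of signs times minimum magnitude.
    """
    out = {}
    for n in alpha:
        others = [alpha[k] for k in alpha if k != n]
        sign = 1
        for a in others:
            if a < 0:
                sign = -sign
        magnitude = min(abs(a) for a in others)
        out[n] = saturate(sign * magnitude, q)
    return out


# Illustrative values only: three neighbors, messages on q = 4 bits.
messages = min_sum_check_update({0: 5, 1: -3, 2: 9}, q=4)
```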
Figure 1.
Figure 2. Compressed check-node message
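The field layout of Figure 2 is not recoverable here. A common compressed form for Min-Sum check-node messages, assumed in the sketch below, keeps only the two smallest input magnitudes, the position of the smallest one, and one sign per output. The function names and the tuple layout are hypothetical.

```python
def compress_check_message(alpha):
    """Compress the outgoing messages of one check-node (assumed format).

    alpha is a list of incoming messages, one per neighbor (at least two).
    Returns (min1, min2, idx1, signs): the two smallest magnitudes, the
    position of the smallest one, and the sign (+1 or -1) of every output.
    """
    mags = [abs(a) for a in alpha]
    idx1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx1]
    min2 = min(m for i, m in enumerate(mags) if i != idx1)
    total_sign = 1
    for a in alpha:
        if a < 0:
            total_sign = -total_sign
    # Output sign at position i is the product of all the other input signs.
    signs = [total_sign * (-1 if a < 0 else 1) for a in alpha]
    return min1, min2, idx1, signs


def decompress_check_message(compressed, i):
    """Rebuild the outgoing message for neighbor position i."""
    min1, min2, idx1, signs = compressed
    magnitude = min2 if i == idx1 else min1
    return signs[i] * magnitude


# Illustrative values only.
packed = compress_check_message([5, -3, 9])
unpacked = [decompress_check_message(packed, i) for i in range(3)]
```

Storing one such tuple per check-node, rather than one message per edge, is what makes the representation compact.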
A. Baseline Architecture
Figure 1 illustrates the baseline architecture of the layered
MS decoder, whose main blocks are further discussed below.
Each decoding iteration takes two clock cycles. All data are
read and processed at the first rising clock edge, then written
at the second rising clock edge.
Memory blocks. Two memory blocks are used, one for the AP-LLR values (AP memory) and one for the check-node messages (message memory). AP-LLR values are quantized on $\tilde{q}$ bits, and check-node messages on $q$ bits. The AP memory is implemented by registers, in order to allow massively parallel read or write operations. It is organized in $C$ blocks, denoted by $AP_i$ ($i = 0, \ldots, C - 1$), where $C$ is the number of columns of the base matrix $B$, each block consisting of $Z \times \tilde{q}$ bits. Data are read from and written to the blocks corresponding to the non-negative entries in the row of $B$ (layer) being processed. The message memory is implemented as a Random Access Memory (RAM). Each memory
Figure 3.
Figure 4.
Figure 5.
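The block selection described for the AP memory in the baseline architecture can be sketched as below. Only the rule "use the blocks at non-negative entries of the current base-matrix row" comes from the text; the cyclic rotation by the entry value, its direction, the list-based memory model, and all names are assumptions made for illustration.

```python
def blocks_for_layer(base_row):
    """Indices i of the AP blocks touched when processing one row of the base matrix.

    Entries >= 0 select a block; negative entries (here -1) mark blocks
    that are left untouched for this layer.
    """
    return [i for i, entry in enumerate(base_row) if entry >= 0]


def read_layer(ap_memory, base_row):
    """Read the selected blocks, each rotated by its entry (assumed left rotation)."""
    selected = {}
    for i in blocks_for_layer(base_row):
        block = ap_memory[i]
        shift = base_row[i] % len(block)
        selected[i] = block[shift:] + block[:shift]
    return selected


# Hypothetical example with C = 4 blocks of Z = 4 values each.
ap_memory = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
    [40, 41, 42, 43],
]
layer = read_layer(ap_memory, [1, -1, 0, 3])
```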
B. Enhanced Architecture
In this section we discuss the main enhancements that we
incorporate into the baseline architecture, which consist
of (1) a low-cost VNU/AP-LLR processing unit that efficiently merges the logical functionalities of the VNU
and AP-LLR units, (2) a low-cost CNU architecture, which
is executed twice in order to complete the computation of the
check-node messages, and (3) a splitting of the iteration processing
into two perfectly symmetric stages, yielding an optimal clock
frequency. The VNU/AP-LLR unit and the new CNU replace
the VNU, AP-LLR, and old CNU units of the baseline
architecture, as shown in Figure 3 (where VNU/AP-LLR is
shortened to VN/AP). All the other blocks of the architecture
remain the same.
Figure 6. 4-FMIG architecture
Figure 7.
Figure 8.
Figure 9. 2-FMIG architecture
Figure 10. Modified VNU to accommodate variable check-node degree (example for dcmin = dcmax - 1)
Figure 11. Modified CNU to accommodate variable check-node degree (example for dcmin = dcmax - 1)
$$\text{Throughput} = \frac{N \times f_{\max}}{\text{iter\_number} \times \text{cyc\_iter}},$$
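The throughput expression above can be checked numerically against the figures reported for the enhanced WiMAX decoder (code length 2304, 250 MHz, 20 iterations, 1200 Mbps). In the sketch below, the value of 24 clock cycles per iteration is an assumption chosen because it reproduces the reported throughput; it is not stated in the surviving text, and the function name is hypothetical.

```python
def throughput_bps(code_length, f_max_hz, iter_number, cycles_per_iter):
    """Throughput in bit/s: N * f_max / (iter_number * cyc_iter)."""
    return code_length * f_max_hz / (iter_number * cycles_per_iter)


# WiMAX, enhanced architecture. cycles_per_iter = 24 is an assumed value.
wimax_mbps = throughput_bps(
    code_length=2304,
    f_max_hz=250e6,
    iter_number=20,
    cycles_per_iter=24,
) / 1e6
```

Because throughput scales linearly with the clock frequency, raising the frequency from 83 MHz to 250 MHz at a fixed cycle count roughly triples the throughput, which is in line with the reported 398 Mbps and 1200 Mbps.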
Figure 12.

Table I
PARAMETERS OF THE QC-LDPC CODES
RPL  dcmin  dcmax  q  iter number
(3,6)-regular  12  24  54  20
WiMAX  12  24  96  20

Table II
COMPARISON BETWEEN ENHANCED AND BASELINE ARCHITECTURES FOR (3,6)-REGULAR AND WiMAX QC-LDPC CODES

                     (3,6)-regular QC-LDPC     WiMAX QC-LDPC
                     Baseline    Enhanced      Baseline    Enhanced
Frequency (MHz)      111         250           83          250
Throughput (Mbps)    1198        2700          398         1200
Area (mm2)           0.95        0.72          0.88        0.86

Frequency (MHz)      111                       83
Area (mm2)           0.95        0.71          0.88        0.76

Figure 13.
V. CONCLUSION
In this paper we proposed a low-cost and flexible architecture for high-throughput layered LDPC decoders with
fully-parallel processing units. To do so, we proposed new
processing unit architectures that allow a more efficient hardware usage, thus yielding a significant cost reduction. The
proposed CNU further allows splitting the iteration processing
in two perfectly symmetric stages, resulting in a significant
increase in the maximum operating frequency. The proposed
Table III
COMPARISON BETWEEN THE PROPOSED ENHANCED ARCHITECTURE AND STATE OF THE ART IMPLEMENTATIONS FOR THE WiMAX QC-LDPC CODE

Compared designs include Y. Ueng (2008) [14]; the last column is the proposed decoder.

Code length          2304            2304      2304      576-2304   2304
Technology (nm)      180             90        130       90         65
Frequency (MHz)      200             950       100       149        250
Iterations           4.6 (average)   10        10                   20
Throughput (Mbps)    106             2200      183       955        1200
                     294             3036      366       1318       1200
Area (mm2)                           2.90      6.90      11.42      0.86
(mm2)                                1.51      1.73      5.94       0.86
(Mbps/mm2)                           2010.60   211.56    221.89     1395.35
                                     20106     2115.6    1109.45    27907
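The (Mbps/mm2) entries of Table III appear consistent with a throughput-to-area ratio, and the row that follows with that ratio multiplied by the number of iterations. This reading is an assumption inferred from the numbers, since the row labels did not survive. The sketch below checks it for the proposed decoder only (1200 Mbps, 0.86 mm2, 20 iterations); the function name is hypothetical.

```python
def throughput_to_area(throughput_mbps, area_mm2):
    """Throughput-to-area ratio in Mbps/mm2 (assumed definition)."""
    return throughput_mbps / area_mm2


# Proposed decoder values as listed in Table III.
proposed_ratio = throughput_to_area(1200, 0.86)
# Assumed normalization: ratio multiplied by the iteration count.
proposed_ratio_times_iters = proposed_ratio * 20
```

Rounded to two decimals and to the nearest integer respectively, these values match the 1395.35 and 27907 listed for the proposed decoder.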
ACKNOWLEDGMENT
The authors acknowledge support from the European H2020
Work Programme, project Flex5Gware, and the French ANR
Programme Blanc-2013, project DIAMOND.
REFERENCES
[1] R. Tanner, "A recursive approach to low complexity codes," IEEE Trans. on Inf. Theory, vol. 27, no. 5, pp. 533-547, 1981.
[2] F. R. Kschischang and B. J. Frey, "Iterative decoding of compound codes by probability propagation in graphical models," IEEE Journal on Selected Areas in Communications, vol. 16, no. 2, pp. 219-230, 1998.
[3] D. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," in IEEE Workshop on Signal Processing Systems (SIPS), 2004, pp. 107-112.
[4] J. Zhang, Y. Wang, M. P. Fossorier, and J. S. Yedidia, "Iterative decoding with replicas," IEEE Transactions on Information Theory, vol. 53, no. 5, pp. 1644-1663, 2007.
[5] M. P. Fossorier, "Quasicyclic low-density parity-check codes from circulant permutation matrices," IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1788-1793, 2004.
[6] E. Boutillon and G. Masera, "Hardware design and realization for iteratively decodable codes," in Channel Coding: Theory, Algorithms, and Applications, D. Declercq, M. Fossorier, and E. Biglieri, Eds. Academic Press Library in Mobile and Wireless Communications, Elsevier, June 2014.
[7] O. Boncalo, A. Amaricai, A. Hera, and V. Savin, "Cost efficient FPGA layered LDPC decoder with serial AP-LLR processing," in IEEE International Conference on Field Programmable Logic and Applications (FPL), Munich, Germany, September 2014, pp. 1-6.
[8] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Trans. on Communications, vol. 47, no. 5, pp. 673-680, 1999.