2016 Euromicro Conference on Digital System Design

Flexible, Cost-Efficient, High-Throughput Architecture for Layered LDPC Decoders with Fully-Parallel Processing Units
Thien T. Nguyen-Ly, Tushar Gupta, Manuel Pezzin, Valentin Savin, David Declercq and Sorin Cotofana
CEA-LETI, MINATEC Campus, Grenoble, France
{thientruong.nguyen-ly, tushar.gupta, manuel.pezzin, valentin.savin}@cea.fr
ETIS, ENSEA / CNRS UMR-8051 / University of Cergy-Pontoise, France, [email protected]
Computer Engineering Laboratory, Delft University of Technology, The Netherlands, [email protected]

Abstract: In this paper, we propose a layered LDPC decoder architecture targeting flexibility, high throughput, low cost, and efficient use of the hardware resources. The proposed architecture provides full design time flexibility, i.e., it can accommodate any Quasi-Cyclic (QC) LDPC code, and also allows redefining a number of parameters of the QC-LDPC code at run time. The main novelty of the paper consists of: (1) a new low-cost processing unit that merges the logical functionalities of the Variable-Node Unit (VNU) and the A Posteriori Log-Likelihood Ratio (AP-LLR) unit in an efficient way, (2) a high-speed, low-cost Check-Node Unit (CNU) architecture, which is executed twice at each iteration in order to complete the computation of the check-node messages, (3) a splitting of the iteration processing in two perfectly symmetric stages, executed in two consecutive clock cycles, each one using exactly the same processing resources; the processing load is perfectly balanced between the two clock cycles, thus yielding an optimal clock frequency. Synthesis results targeting a 65nm CMOS technology for a (3, 6)-regular (648, 1296) Quasi-Cyclic LDPC code and for the WiMAX (1152, 2304) irregular QC-LDPC code show significant improvements in terms of area and throughput compared to the baseline architecture discussed in this paper, as well as several state of the art implementations.

978-1-5090-2817-7/16 $31.00 2016 IEEE
DOI 10.1109/DSD.2016.33

I. INTRODUCTION
Low Density Parity Check (LDPC) codes are a class of error correction codes known to closely approach the Shannon limit under iterative message-passing (MP) decoding algorithms. MP architectures are composed of processing units that perform the desired computation by passing messages to each other. The way such an architecture applies to LDPC decoding is closely related to the bipartite graph representation of LDPC codes [1]. This graph comprises two types of nodes, known as variable-nodes and check-nodes, corresponding respectively to coded bits and parity-check equations. Accordingly, an LDPC decoder comprises two types of processing units, namely Variable-Node Units (VNUs) and Check-Node Units (CNUs), which exchange messages according to the structure of the bipartite graph.

MP decoders may deal with different scheduling strategies, according to the order in which variable and check-node messages are updated during the iterative message-passing process. The classical convention is that, at each iteration, all check-nodes and subsequently all variable-nodes pass new messages to their neighbors. This message-passing schedule is usually referred to as flooding scheduling [2]. A different approach is to split the parity-check matrix into several horizontal layers, then process the horizontal layers sequentially, while check-nodes (rows) within the same layer are processed by using a flooding schedule strategy. Each time a layer is processed the decoder updates the neighbor variable-nodes, so as to profit from the propagated messages, and then proceeds to the next layer. This message scheduling, known as layered scheduling [3], propagates information faster and converges in about half the number of iterations compared to the fully parallel (flooding) scheduling [4], thus yielding a lower decoding latency. Layered scheduling advantageously applies to Quasi-Cyclic (QC) LDPC codes [5], which are naturally equipped with a layered structure, and are also known to significantly reduce the complexity of the interconnection network. Due to their benefits in terms of area/throughput/flexibility, layered QC-LDPC decoders have been widely adopted, and can be considered as a de facto standard solution in most applications [6]. Additional considerations may address different optimizations at the processing unit level, e.g., implementing different decoding algorithms or processing the input data in either a serial or a parallel manner [7]. Regarding the MP decoding algorithm, hardware implementations of LDPC decoders mostly rely on the Min-Sum (MS) algorithm [8], since the corresponding VNUs and CNUs can be implemented by very simple arithmetic operations (additions and comparisons).

In this work, we propose a layered MS decoder architecture targeting (i) flexibility, (ii) high throughput, and (iii) low cost and efficient use of the hardware resources. The highest flexibility can be achieved by using serial processing units: VNUs and CNUs process incoming messages in a serial manner, which makes their implementation independent of the variable or check-node degree. However, this comes at the cost of a reduced throughput. Thus, in this paper we focus on layered LDPC decoder architectures with fully parallel processing units. Such an architecture has some inherent limitations in terms of flexibility, mainly concerning the number of incoming messages into VNUs and CNUs, corresponding to the degrees (i.e., number of connections) of the corresponding variable and check nodes in the Tanner graph [1]. To ensure the highest possible flexibility, the proposed architecture can accommodate any QC-LDPC code, and also allows redefining a number of parameters at run time, e.g., the number of rows of the QC base matrix, as well as the positions and values of the non-negative entries within each row.

The classical solution to increase throughput and to also ensure an efficient use of hardware resources in layered architectures is to pipeline the datapath. However, the number of stages in the datapath may impose specific constraints on the base matrix of the QC-LDPC code, in order to ensure that no memory conflicts occur during the read/write operations from/to the memory storing the exchanged messages or the a posteriori log-likelihood ratio (AP-LLR) values. Moreover, pipelined architectures violate the layered scheduling principle, in the sense that the processing of each layer starts before the processing of the previous layer has completed, thus reducing the convergence speed. To avoid such limitations, the proposed architecture does not use pipelining. Instead, we propose a specific design of the datapath processing units (VNUs, CNUs, and AP-LLR units) that allows an efficient reuse of the hardware resources, thus yielding a significant cost reduction. Accordingly, the main novelty of the paper consists of:
(1) A low-cost VNU/AP-LLR processing unit that merges in an efficient way the logical functionalities of the VNU and AP-LLR units, and can be executed by selecting either the VNU or the AP-LLR mode.
(2) A high-speed, low-cost CNU architecture, which only computes the first minimum (min1) and the index of the first minimum (indx_min1), instead of the first two minima and indx_min1 as required by the MS decoding algorithm. To compute the second minimum (min2), the CNU is executed a second time with the input at position indx_min1 set to the maximum value (according to the bit-length of the exchanged messages). Due to a specific organization of the datapath, the second execution of the CNU does not induce any penalty in terms of throughput, as explained below.
(3) A split of the iteration processing in two perfectly symmetric stages, executed in two consecutive clock cycles, each one using the same processing resources. In the first clock cycle we perform read operations, then execute the VNU/AP-LLR unit in VNU mode, and the CNU to compute min1 and indx_min1. In the second clock cycle we execute the CNU to compute min2, the VNU/AP-LLR unit in AP-LLR mode, and perform write back operations. The processing load is perfectly balanced between the two clock cycles, thus yielding an optimal clock frequency. In particular, the second execution of the CNU during the second clock cycle does not impose any penalty on the operating clock frequency.

The paper is organized as follows. In Section II we briefly review QC-LDPC codes and the MS decoding algorithm. Section III details the proposed low-cost, high-throughput flexible architecture for the layered MS decoder. We discuss first the baseline architecture, and then the main enhancements that we are incorporating into this architecture. Implementation results are provided in Section IV, and Section V concludes the paper.

II. LAYERED MS DECODING FOR QC-LDPC CODES
We consider a QC-LDPC code defined by a base matrix $B$ of size $R \times C$, with integer entries $b_{i,j} \geq -1$. The parity-check matrix $H$ is obtained by expanding the base matrix $B$ by an expansion factor $Z$; thus, each entry of $B$ is replaced by a square matrix of size $Z \times Z$, defined as follows: $-1$ entries are replaced by the all-zero matrix, while $b_{i,j} \geq 0$ entries are replaced by a circulant matrix, obtained by right-shifting the identity matrix by $b_{i,j}$ positions. Hence, $H$ has $M = R \times Z$ rows and $N = C \times Z$ columns. We also denote by $\mathcal{M}_r$ the set of $Z$ consecutive rows of $H$ corresponding to the $r$-th row of $B$. $\mathcal{M}_r$ is further referred to as a (decoding) layer of $H$. Finally, we denote by $\mathcal{N}(m)$ the set of columns of $H$ having a non-zero (1) entry in the $m$-th row, for any $m = 1, \ldots, M$. In the bipartite graph representation, check and variable nodes correspond respectively to rows and columns of $H$, and they are connected by edges according to the non-zero entries of $H$. The number of edges incident to each check or variable node (or equivalently, the weight of the corresponding row/column) is referred to as the node degree.

Let $(x_1, \ldots, x_N)$ denote a codeword that is sent over a binary input channel, and $(y_1, \ldots, y_N)$ be the received word. The following notation for MP decoders will be used throughout the paper:
$\gamma_n = \log\left(\Pr(x_n = 0 \mid y_n) / \Pr(x_n = 1 \mid y_n)\right)$: the LLR value of $x_n$ according to the received $y_n$ value; it is also referred to as the a priori LLR of variable node $n$;
$\tilde{\gamma}_n$: the a posteriori (AP) LLR of variable node $n$;
$\alpha_{m,n}$: message sent from variable-node $n$ to check-node $m$;
$\beta_{m,n}$: message sent from check-node $m$ to variable-node $n$.

The layered MS decoding is described in Algorithm 1. To match the hardware implementation that will be discussed in the next section, we assume that input LLRs $\gamma_n$ and check-to-variable node messages $\beta_{m,n}$ are quantized on $q$ bits, while AP-LLR values $\tilde{\gamma}_n$ are quantized on $\tilde{q}$ bits, with $q < \tilde{q}$. Subtractions and additions used in the VNU and AP-LLR steps are implemented through the use of $\tilde{q}$-bit saturated adders. Hence, variable-to-check messages $\alpha_{m,n}$ computed at the VNU step are quantized on $\tilde{q}$ bits, and they are saturated to $q$ bits just before entering the CNU. The $\alpha_{m,n}$ values used at the AP-LLR step are the unsaturated $\tilde{q}$-bit values.

It is worth noting that for a given $m$, the absolute values of the $\beta_{m,n}$ messages computed at the CNU step are equal to either the first or the second minimum of the absolute values $|\alpha^{SAT}_{m,n}|$ of the input messages. Moreover, there is only one $\beta_{m,n}$ message whose absolute value is equal to the second minimum, namely the one whose variable-node index corresponds to the first minimum. In the sequel, we shall denote by min1 and min2 the first and second minimum, and by indx_min1 the index of the first minimum. Thus, $\beta_{m,n}$ messages can be stored in a compressed format [9] to reduce memory requirements, by storing only their signs, min1, min2, and indx_min1 values, as shown in Figure 2.

Algorithm 1 Layered MS decoding algorithm
Input: $(\gamma_1, \ldots, \gamma_N)$ // input LLRs
Output: $(\hat{x}_1, \ldots, \hat{x}_N)$ // estimated codeword
[Initialization]
for all $n = 1, \ldots, N$ do $\tilde{\gamma}_n = \gamma_n$;
for all $m = 1, \ldots, M$ and $n \in \mathcal{N}(m)$ do $\beta_{m,n} = 0$;
[Decoding Iterations]
for all iter $= 1, \ldots,$ iter_max do // iteration loop
  for all $r = 1, \ldots, R$ do // loop over horizontal layers
    for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do // VNU
      $\alpha_{m,n} = \tilde{\gamma}_n - \beta_{m,n}$;
    for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do // SAT, then CNU
      $\beta_{m,n} = \left( \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sign}(\alpha_{m,n'}) \right) \cdot \min_{n' \in \mathcal{N}(m) \setminus n} |\alpha^{SAT}_{m,n'}|$;
      // where $\alpha^{SAT}_{m,n'}$ is the value of $\alpha_{m,n'}$ saturated to $q$ bits
    for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do // AP-LLR
      $\tilde{\gamma}_n = \alpha_{m,n} + \beta_{m,n}$;
  end (horizontal layers loop)
  for all $n = 1, \ldots, N$ do $\hat{x}_n =$ sign_bit$(\tilde{\gamma}_n)$; // hard decision
  if $H \hat{x}^{T} = 0$ then exit iteration loop; // syndrome check
end (iteration loop)

Figure 2. Compressed $\beta$-message

III. LAYERED MS DECODER ARCHITECTURE
For the sake of simplicity, we shall first assume that all the check-nodes have the same degree, which will be denoted in the sequel by $d_{c\max}$. No further assumptions are made regarding the base matrix $B$. The case of check-node irregular codes will be discussed in Section III-C. We start by discussing the baseline architecture, then the proposed enhancements are discussed in Section III-B.

Figure 1. Block diagram of the baseline layered MS decoder architecture
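As a software reference for Section II, the base-matrix expansion and the layered MS schedule of Algorithm 1 can be sketched as follows. This is a minimal behavioural model under stated assumptions, not the authors' implementation: the function names (`expand_base_matrix`, `sat`, `layered_min_sum`), the default word lengths `q=4` and `q_tilde=6`, the symmetric saturation range, and the convention that a zero message has positive sign are all choices made here for illustration. It also assumes every check node has degree at least 2.

```python
import numpy as np

def expand_base_matrix(B, Z):
    """Expand base matrix B (entries >= -1) into the parity-check matrix H."""
    R, C = B.shape
    H = np.zeros((R * Z, C * Z), dtype=np.uint8)
    identity = np.eye(Z, dtype=np.uint8)
    for i in range(R):
        for j in range(C):
            if B[i, j] >= 0:
                # circulant: identity right-shifted by b_ij positions
                H[i*Z:(i+1)*Z, j*Z:(j+1)*Z] = np.roll(identity, B[i, j], axis=1)
    return H

def sat(x, bits):
    """Symmetric saturation to `bits` bits including sign (assumed convention)."""
    lim = 2 ** (bits - 1) - 1
    return np.clip(x, -lim, lim)

def layered_min_sum(H, llr, Z, iter_max=20, q=4, q_tilde=6):
    """Layered Min-Sum decoding; one layer = Z consecutive rows of H."""
    M, N = H.shape
    nbrs = [np.flatnonzero(H[m]) for m in range(M)]          # N(m)
    ap = sat(np.asarray(llr, dtype=np.int64), q_tilde)       # AP-LLRs, init to input LLRs
    beta = [np.zeros(len(nb), dtype=np.int64) for nb in nbrs]
    x_hat = (ap < 0).astype(np.int64)
    for _ in range(iter_max):
        for r in range(M // Z):                               # loop over layers
            for m in range(r * Z, (r + 1) * Z):               # rows of layer r
                nb = nbrs[m]
                alpha = sat(ap[nb] - beta[m], q_tilde)        # VNU step
                a_sat = sat(alpha, q)                         # SAT before the CNU
                mag = np.abs(a_sat)
                order = np.argsort(mag, kind="stable")
                indx_min1 = order[0]
                min1, min2 = mag[order[0]], mag[order[1]]
                out_mag = np.full(len(nb), min1)
                out_mag[indx_min1] = min2                     # exclude own input
                neg = a_sat < 0
                out_neg = np.logical_xor(neg, neg.sum() % 2 == 1)
                beta[m] = np.where(out_neg, -out_mag, out_mag)  # CNU step
                ap[nb] = sat(alpha + beta[m], q_tilde)        # AP-LLR step
        x_hat = (ap < 0).astype(np.int64)                     # hard decision
        if not np.any((H.astype(np.int64) @ x_hat) % 2):      # syndrome check
            break
    return x_hat
```

For example, with a toy base matrix `B = np.array([[0, 1, -1, 2], [2, -1, 0, 1]])` and `Z = 4`, calling `layered_min_sum(expand_base_matrix(B, 4), llr, Z=4)` on LLRs that are all positive except one weakly negative value returns the all-zero codeword. Rows within a layer are visited sequentially here, which matches parallel processing as long as each column of the base matrix has at most one non-negative entry per layer.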

A. Baseline Architecture
Figure 1 illustrates the baseline architecture of the layered MS decoder, whose main blocks are further discussed below. Each decoding iteration over a layer takes two clock cycles. All data are read and processed at the first rising clock edge, then written at the second rising clock edge.

Memory blocks. Two memory blocks are used, one for the $\tilde{\gamma}_n$ values ($\tilde{\gamma}$ memory) and one for the $\beta_{m,n}$ messages ($\beta$ memory). $\tilde{\gamma}_n$ values are quantized on $\tilde{q}$ bits, and $\beta_{m,n}$ messages on $q$ bits. The $\tilde{\gamma}$ memory is implemented by registers, in order to allow massively parallel read or write operations. The memory is organized in $C$ blocks, denoted by AP_i ($i = 0, \ldots, C-1$), corresponding to the number of columns of the base matrix, each one consisting of $Z \times \tilde{q}$ bits. Data are read from/written to the blocks corresponding to non-negative entries in the row of $B$ (layer) being processed. The $\beta$ memory is implemented as a Random Access Memory (RAM). Each memory word consists of $Z$ compressed $\beta$-messages, corresponding to one row of $B$.

Permutations for Reading and Writing (PER_R, PER_W). The PER_R permutation is used to rearrange the data read from the $\tilde{\gamma}$ memory, according to the processed layer, so as to ensure processing by the proper VNU/CNU. The PER_W block operates oppositely to PER_R.

Barrel Shifter for Reading and Writing (BS_R, BS_W). Barrel shifters are used to implement the cyclic (shift) permutations corresponding to the non-negative entries of the base matrix $B$. We use $d_{c\max}$ BS_R and $d_{c\max}$ BS_W blocks, corresponding to the check-node degree, each of them having $Z \times \tilde{q}$-bit inputs and $Z \times \tilde{q}$-bit outputs.

Decompress. This block is used to convert $\beta_{m,n}$ messages from the compressed format to the uncompressed one.

Variable Node Units (VNUs). These processing units compute the $\alpha_{m,n}$ messages. The inputs of the VNUs are read from the $\tilde{\gamma}$ memory and the $\beta$ memory. Each VNU_i block ($i = 0, \ldots, d_{c\max}-1$) in Figure 1 consists of $Z$ $\tilde{q}$-bit saturated subtractors for the parallel execution of $Z$ variable-nodes (one column of $B$).

Saturators (SATs). Prior to CNU processing, $\alpha_{m,n}$ values are saturated to $q$ bits.

Check Node Units (CNUs). These processing units compute the $\beta_{m,n}$ messages. For simplicity, Figure 1 shows one CNU block with $d_{c\max}$ inputs, each one of size $Z \times q$ bits. Thus, this block actually includes $Z$ computing units, used to process in parallel the $Z$ check-nodes within one layer. The check-node processing consists of computing the signs of the $\beta_{m,n}$ messages, as well as the min1, min2 and indx_min1 values, and is implemented by using the high-speed low-cost tree-structure (TS) approach proposed in [10].

AP-LLR Units. These units compute the $\tilde{\gamma}_n$ values. Each AP_LLR_i block ($i = 0, \ldots, d_{c\max}-1$) in Figure 1 consists of $Z$ $\tilde{q}$-bit saturated adders, for the parallel execution of $Z$ variable-nodes (one column of $B$).

Controller. This block generates control signals such as count_layer for indicating which layer is being processed, En_read and En_write for reading and writing data, etc. It also controls the synchronous execution of the other blocks.
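The compressed $\beta$-message format used by the $\beta$ memory and the Decompress block can be illustrated in software. The sketch below is only an illustration under assumptions: the `CompressedBeta` record and the helper names `cnu_compress` and `decompress` are hypothetical, and the field layout of Figure 2 is not reproduced bit-exactly. It stores, for one check node, the output signs together with min1, min2 and indx_min1, and then rebuilds the individual messages.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CompressedBeta:
    signs: List[int]   # one sign bit per outgoing message (1 = negative)
    min1: int          # first minimum of the input magnitudes
    min2: int          # second minimum of the input magnitudes
    indx_min1: int     # position of the first minimum

def cnu_compress(alpha_sat: List[int]) -> CompressedBeta:
    """Min-Sum check-node update, returned directly in compressed form."""
    mags = [abs(a) for a in alpha_sat]
    indx_min1 = min(range(len(mags)), key=lambda i: mags[i])
    min1 = mags[indx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != indx_min1)
    negs = [1 if a < 0 else 0 for a in alpha_sat]
    parity = sum(negs) % 2
    # sign of output n = product of all input signs except input n
    signs = [parity ^ s for s in negs]
    return CompressedBeta(signs, min1, min2, indx_min1)

def decompress(c: CompressedBeta) -> List[int]:
    """Rebuild the individual beta messages from the compressed record."""
    out = []
    for i, s in enumerate(c.signs):
        mag = c.min2 if i == c.indx_min1 else c.min1
        out.append(-mag if s else mag)
    return out
```

For the inputs `[3, -1, 5, -2]`, the record holds min1 = 1, min2 = 2 and indx_min1 = 1, and `decompress` yields `[1, -2, 1, -1]`, which agrees with evaluating the CNU rule of Algorithm 1 edge by edge.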

B. Enhanced Architecture
In this section we discuss the main enhancements that we are incorporating into the baseline architecture, which consist of (1) a low-cost VNU/AP-LLR processing unit that merges in an efficient way the logical functionalities of the VNU and AP-LLR units, (2) a low-cost CNU architecture, which is executed twice in order to complete the computation of the check-node messages, (3) a splitting of the iteration processing in two perfectly symmetric stages, yielding an optimal clock frequency. The VNU/AP-LLR unit and the new CNU replace the VNU, AP-LLR, and old CNU units of the baseline architecture, as shown in Figure 3 (where VNU/AP-LLR is shortened to VN/AP). All the other blocks of the architecture remain the same.

Figure 3. New processing units for the layered MS decoder architecture

1) VNU/AP-LLR Unit: The main difference between the VNU and AP-LLR processing units is that subtractors are used within the first, while adders are used within the second. We propose a new VNU/AP-LLR processing unit that merges their logical functionalities, controlled by a specific signal (sel) that selects between the VNU and the AP-LLR mode. The control signal is generated by the controller, such that the VNU mode is selected during the first clock cycle, and the AP-LLR mode during the second.
The block diagram of the VNU/AP-LLR unit is detailed in Figure 4. At the input, two multiplexers are used to select the input data according to either the VNU or the AP-LLR mode. Similarly, at the output, a de-multiplexer is used to choose the value of either $\alpha_{m,n}$ or $\tilde{\gamma}_n$, depending on the sel signal. The block in the middle, which may act as either a subtractor or an adder, is detailed in Figure 5 (for the sake of simplicity, we illustrate this block for $\tilde{q} = 4$ bits). It consists of a modified Ripple Carry Adder (RCA) with carry-in given by the complement of the sel signal ($C_0 = \overline{sel}$), which is further XORed with all the bits of the second input. It can be easily seen that the VNU/AP-LLR unit operates in VNU mode if sel = 0 ($C_0 = 1$), or in AP-LLR mode if sel = 1 ($C_0 = 0$).

Figure 4. VNU/AP-LLR processing unit
Figure 5. Adder/subtractor block used within the VNU/AP-LLR unit

2) CNU Unit: We focus only on the computation of min1, min2, and indx_min1, as the signs of the output messages can be simply computed by XORing the adequate signs of the input messages. We propose a high-speed low-cost CNU architecture inspired by the TS architecture proposed in [10], which is further simplified so as to compute only the value and the index of the first minimum. As shown in Figure 6, our CNU is executed during the first clock cycle to compute min1 and indx_min1, then it is re-executed during the second clock cycle with the input at position indx_min1 set to the maximum value, so as to compute min2. The sel control signal is used to indicate whether the CNU is in first or second minimum mode (first or second clock cycle). The compare and select block is used to set the input at position indx_min1 to the maximum value, in case the sel signal indicates that the second minimum is being computed (second clock cycle).
The proposed CNU architecture is detailed in Figure 6 for a number of inputs ($2^k + 2^r$) equal to the sum of two powers of 2. The general case can be worked out by decomposing the number of inputs as a sum of powers of 2, then combining the corresponding blocks similarly to the technique used in [10]. The $2^k$-FMIG (First Minimum and Index Generator) block computes the value and the index of the first minimum among the $2^k$ input values. The 2-FMIG block includes one comparator and one multiplexer, as shown in Figure 7. The 4-FMIG consists of three 2-FMIG blocks for finding the minimum value and one multiplexer for indicating its index, as shown in Figure 8. Similarly, the $2^{k+1}$-FMIG block can be constructed from three $2^k$-FMIG blocks and one multiplexer. The IG (Index Generator) block in Figure 6 is used to determine the index of the minimum value, and is further detailed in Figure 9.

Figure 6. Block diagram of the proposed CNU architecture
Figure 7. 2-FMIG architecture
Figure 8. 4-FMIG architecture
Figure 9. IG (Index Generator) architecture

3) Iteration Processing Split: As shown in Figure 3, in the new architecture the clock signal is fed to the CNU. This allows splitting the iteration processing in two perfectly symmetric stages, executed in two consecutive clock cycles, each one using the same processing units, but in a different mode. In the first clock cycle we perform read operations, then execute the VNU/AP-LLR unit in VNU mode, and the CNU to compute min1 and indx_min1. In the second clock cycle we execute the CNU to compute min2, the VNU/AP-LLR unit in AP-LLR mode, and perform write back operations. The processing load is perfectly balanced between the two clock cycles, thus yielding an optimal clock frequency. In particular, the second execution of the CNU during the second clock cycle does not impose any penalty on the operating clock frequency. Executing the baseline CNU (i.e., computing min1, min2, and indx_min1) in one of the two clock cycles would lead to an increased critical path, and therefore a reduced clock frequency, while splitting its execution between the two clock cycles would result in an inefficient use of the hardware resources.

Figure 10. Modified VNU to accommodate variable check-node degree (example for $d_{c\min} = d_{c\max} - 1$)
Figure 11. Modified CNU to accommodate variable check-node degree (example for $d_{c\min} = d_{c\max} - 1$)
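The two processing units of Section III-B can be mimicked in software to check their behaviour. The sketch below is a behavioural model under assumptions, not the authors' RTL: the names `add_sub`, `fmig` and `cnu_two_pass` are hypothetical, saturation is assumed symmetric and is modelled by running the ripple-carry chain on one extra bit and clipping afterwards, and the comparison tree is a plain recursive halving rather than the exact FMIG block composition of Figures 6 to 9.

```python
def add_sub(a, b, sel, width):
    """Merged VNU/AP-LLR arithmetic: a - b if sel == 0 (VNU), a + b if sel == 1 (AP-LLR).

    Bit-level ripple-carry model: carry-in C0 = NOT sel, and C0 is XORed
    with every bit of the second input (two's-complement subtraction).
    """
    c0 = 1 - sel
    w = width + 1                      # one guard bit so overflow can be detected
    mask = (1 << w) - 1
    ua, ub = a & mask, b & mask        # two's-complement encodings on w bits
    carry, result = c0, 0
    for i in range(w):
        abit = (ua >> i) & 1
        bbit = ((ub >> i) & 1) ^ c0
        result |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    if result >> (w - 1):              # reinterpret as a signed value
        result -= 1 << w
    lim = (1 << (width - 1)) - 1       # assumed symmetric saturation range
    return max(-lim, min(lim, result))

def fmig(values, base=0):
    """First minimum and its index, found by a binary comparison tree."""
    if len(values) == 1:
        return values[0], base
    half = len(values) // 2
    lmin, lidx = fmig(values[:half], base)
    rmin, ridx = fmig(values[half:], base + half)
    return (lmin, lidx) if lmin <= rmin else (rmin, ridx)

def cnu_two_pass(mags, q):
    """Two executions of the same first-minimum unit yield min1, min2, indx_min1."""
    max_val = (1 << (q - 1)) - 1
    min1, indx_min1 = fmig(mags)       # first clock cycle
    masked = list(mags)
    masked[indx_min1] = max_val        # compare-and-select: mask the first minimum
    min2, _ = fmig(masked)             # second clock cycle
    return min1, min2, indx_min1
```

With a 6-bit word, `add_sub(5, 3, 0, 6)` gives 2 and `add_sub(5, 3, 1, 6)` gives 8, while out-of-range results are clipped to plus or minus 31. For 4-bit magnitudes `[3, 1, 5, 2, 7, 4]`, `cnu_two_pass` returns min1 = 1, min2 = 2 and indx_min1 = 1. An unused CNU input can be modelled by feeding the maximum value 7, so that it never wins a comparison against a smaller magnitude.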

C. Case of Check-Node Irregular Codes
To accommodate QC-LDPC codes with variable check-node degree $d_c \in [d_{c\min}, d_{c\max}]$, some extra control logic is required in order to inactivate the last $d_{c\max} - d_c$ VNU/AP-LLR units, as well as the last $d_{c\max} - d_c$ inputs of the CNU, for check-nodes of degree $d_c$. A VNU/AP-LLR unit is inactivated by setting the corresponding $\beta$-inputs to 0, while an input of the CNU is inactivated by setting it to the maximum value ($2^{q-1} - 1$, where $q$ is the number of quantization bits of the input $\alpha^{SAT}_{m,n}$ values, including the sign bit). The modified VNU/AP-LLR and CNU architectures are shown in Figure 10 and Figure 11, respectively, for $d_{c\min} = d_{c\max} - 1$.

D. Design and Run Time Flexibility
Figure 12 details the flowchart of the QC-LDPC decoder generation. The VHDL inputs consist of two configuration files, for the base-matrix related parameters and the user-defined parameters. Base-matrix parameters relate to either the matrix size (number of rows and columns, expansion factor) or to the number, positions and values of the non-negative entries ($d_{c\min}$, $d_{c\max}$, positions and values of non-negative entries per row). While some of these parameters are fixed, meaning that they cannot be overwritten at run time, the number of rows of the base matrix as well as the positions and values of non-negative entries per row can be overwritten at run time, while still ensuring proper operation of the decoder using the redefined base matrix. This property is particularly useful to achieve flexibility of the implemented decoder with respect to the coding rate. Note that it would also be possible to achieve flexibility with respect to the expansion factor ($Z$) value, by including some extra control logic. However, such control logic has not been included in our current implementation, so we report this parameter as being fixed.
The RPL parameter shown in Figure 12 defines the number of base matrix Rows Per Layer. For the sake of simplicity, we have assumed so far that one decoding layer corresponds to one row of the base matrix $B$. However, in general it is also possible to define a decoding layer as RPL consecutive rows of the base matrix, as long as each column of $B$ has at most one non-negative entry in each layer. This feature has been integrated into our design. If RPL > 1, the number of decoding layers is equal to $R$/RPL, with RPL $\times Z$ check nodes per layer.
Finally, the user-defined parameters allow specifying the quantization parameters ($q$, $\tilde{q}$), and the number of decoding iterations.

Figure 12. Flowchart for QC-LDPC decoder generation

IV. IMPLEMENTATION RESULTS
We have implemented the baseline and enhanced layered MS decoder architectures for a regular QC-LDPC code with variable-nodes of degree $d_v = 3$, and for the irregular WiMAX QC-LDPC code with rate 1/2 [11]. For both codes, the size of the base matrix is $R \times C = 12 \times 24$. For the regular code, the base matrix $B$ is shown in Figure 13. It can be divided into 3 horizontal layers, with each layer corresponding to RPL = 4 consecutive rows of $B$. For the WiMAX code, the RPL value is set to 1, thus the number of decoding layers is equal to 12. Configuration parameters of the two decoders are further detailed in Table I.

Figure 13. Base matrix of the (3, 6)-regular QC-LDPC code

Table I. Parameters of the QC-LDPC codes
Code             R    C    Z    RPL   iter_number
(3,6)-regular    12   24   54   4     20
WiMAX            12   24   96   1     20
[The dcmin, dcmax, q and q~ columns of this table are not legible in the available text.]

ASIC synthesis results targeting a 65nm CMOS technology are shown in Table II. The top part of the table reports the maximum operating frequency, the corresponding throughput, and the area. The reported throughput is given by the formula:

$$\text{Throughput} = \frac{N \times f_{\max}}{\text{iter\_number} \times \text{cyc\_iter}},$$

where $N = C \times Z$ is the codeword length, and cyc_iter $= 2 \times (R/\text{RPL})$ is the number of clock cycles to complete one iteration (2 clock cycles per layer, times the number of layers). First, we note that the enhanced architecture provides a significant increase in the maximum operating frequency compared to the baseline architecture, by a factor of 2.25 and 3, for the (3, 6)-regular and the WiMAX code, respectively. This is due to the proposed increased-speed CNU together with the proposed split of the iteration processing. Regarding the area, it can be seen that the enhanced architecture provides a significant area reduction for the (3, 6)-regular code, by 24.2% compared to the baseline architecture. However, the area reduction is only 2.27% for the WiMAX code. In order to keep the area comparison on an equal basis with respect to synthesis timing constraints, in the bottom part of Table II we report area figures when the same timing constraints are applied to both the baseline and the enhanced architecture. We consider timing constraints corresponding to the maximum operating frequency of the baseline architecture. In this case, it can be seen that the proposed cost-efficient VNU/AP-LLR and CNU processing units yield an area reduction of 25.26% for the (3, 6)-regular code, and of 13.64% for the WiMAX code.

Table II. Comparison between enhanced and baseline architectures for (3, 6)-regular and WiMAX QC-LDPC codes
                       (3, 6)-regular QC-LDPC      WiMAX QC-LDPC
                       Baseline     Enhanced       Baseline     Enhanced
Max. Freq. (MHz)       111          250            83           250
Throughput (Mbps)      1198         2700           398          1200
Area (mm^2)            0.95         0.72           0.88         0.86
Same timing constraints for both architectures:
Frequency (MHz)        111          111            83           83
Area (mm^2)            0.95         0.71           0.88         0.76

For the WiMAX QC-LDPC code, the proposed enhanced architecture is further compared with other state of the art implementations in Table III. We also report throughput and area figures scaled to 65nm [12], as well as the Throughput to Area Ratio (TAR) and the Normalized TAR (NTAR) metrics [13], so as to keep the throughput comparison on an equal basis with respect to technology, area, and number of iterations. To scale throughput and area to 65nm, we use the scale factors (technology size/65) and (65/technology size)$^2$, respectively, as suggested in [12]. The computation of the TAR and NTAR metrics is detailed in the footnote to Table III. Note that for all the reported implementations, the achieved throughput is inversely proportional to the number of iterations, hence the NTAR metric corresponds to the TAR value assuming that only one decoding iteration is performed. We mention that the decoder proposed in [13] is a reconfigurable decoder that supports the IEEE 802.16e (WiMAX) and the IEEE 802.11n (WiFi) wireless standards. The reported throughput is the maximum achievable coded throughput for the (1152, 2304) WiMAX code with 5 decoding iterations. From Table III it can be seen that the proposed enhanced architecture compares favorably with state of the art implementations, yielding an NTAR value of 27.9 Gbps/mm$^2$/iteration.
Finally, we mention that for the (3, 6)-regular QC-LDPC code, the proposed enhanced architecture achieves an NTAR value of 75 Gbps/mm$^2$/iteration.

V. CONCLUSION
In this paper we proposed a low-cost and flexible architecture for high-throughput layered LDPC decoders with fully-parallel processing units. To do so, we proposed new processing unit architectures that allow a more efficient hardware usage, thus yielding a significant cost reduction. The proposed CNU further allows splitting the iteration processing in two perfectly symmetric stages, resulting in a significant increase in the maximum operating frequency. The proposed enhanced architecture allows full design time flexibility, and also provides good run time flexibility, by allowing the same architecture to be executed with different base matrices sharing a number of common characteristics. Finally, the benefits of the proposed architecture have been demonstrated through comparison with a baseline layered architecture with fully-parallel processing units, as well as several state of the art implementations of layered LDPC decoders.

Table III. Comparison between the proposed enhanced architecture and state of the art implementations for the WiMAX QC-LDPC code
                               Y. Ueng        K. Zhang     T. Heidari   K. Kanchetla   Proposed
                               (2008) [14]    (2009) [15]  (2013) [16]  (2016) [13]    decoder
Code length                    2304           2304         2304         576-2304       2304
Technology (nm)                180            90           130          90             65
Frequency (MHz)                200            950          100          149            250
Iterations                     4.6 (average)  10           10           5              20
Throughput (Mbps)              106            2200         183          955            1200
Tput. scaled to 65nm (Mbps)    294            3036         366          1318           1200
Area (mm^2)                    [?]            2.90         6.90         11.42          0.86
Area scaled to 65nm (mm^2)     [?]            1.51         1.73         5.94           0.86
TAR (Mbps/mm^2)                [?]            2010.60      211.56       221.89         1395.35
NTAR (Mbps/mm^2/iter)          [?]            20106        2115.6       1109.45        27907
Area figures refer either to the core area only or to the total chip area, depending on the implementation.
TAR = (Throughput scaled to 65nm) / (Area scaled to 65nm)
NTAR = TAR $\times$ Iterations

ACKNOWLEDGMENT
The authors acknowledge support from the European H2020
Work Programme, project Flex5Gware, and the French ANR
Programme Blanc-2013, project DIAMOND.

REFERENCES
[1] R. Tanner, "A recursive approach to low complexity codes," IEEE Trans.
on Inf. Theory, vol. 27, no. 5, pp. 533-547, 1981.
[2] F. R. Kschischang and B. J. Frey, "Iterative decoding of compound
codes by probability propagation in graphical models," IEEE Journal on
Selected Areas in Communications, vol. 16, no. 2, pp. 219-230, 1998.
[3] D. Hocevar, "A reduced complexity decoder architecture via layered
decoding of LDPC codes," in IEEE Workshop on Signal Processing
Systems (SIPS), 2004, pp. 107-112.
[4] J. Zhang, Y. Wang, M. P. Fossorier, and J. S. Yedidia, "Iterative decoding
with replicas," IEEE Transactions on Information Theory, vol. 53, no. 5,
pp. 1644-1663, 2007.
[5] M. P. Fossorier, "Quasicyclic low-density parity-check codes from circulant
permutation matrices," IEEE Transactions on Information Theory,
vol. 50, no. 8, pp. 1788-1793, 2004.
[6] E. Boutillon and G. Masera, "Hardware design and realization for iteratively
decodable codes," in Channel Coding: Theory, Algorithms, and
Applications, D. Declercq, M. Fossorier, and E. Biglieri, Eds. Academic
Press Library in Mobile and Wireless Communications, Elsevier, June
2014.
[7] O. Boncalo, A. Amaricai, A. Hera, and V. Savin, "Cost efficient FPGA
layered LDPC decoder with serial AP-LLR processing," in IEEE International
Conference on Field Programmable Logic and Applications
(FPL), Munich, Germany, September 2014, pp. 1-6.
[8] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative
decoding of low-density parity check codes based on belief propagation,"
IEEE Trans. on Communications, vol. 47, no. 5, pp. 673-680, 1999.
[9] Z. Wang and Z. Cui, "A memory efficient partially parallel decoder
architecture for quasi-cyclic LDPC codes," IEEE Trans. on Very Large
Scale Integration (VLSI) Systems, vol. 15, no. 4, pp. 483-488, 2007.
[10] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, "Algorithms of finding the
first two minimum values and their hardware implementation," IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 11,
pp. 3430-3437, 2008.
[11] IEEE-802.16e, "Physical and medium access control layers for combined
fixed and mobile operation in licensed bands," 2005, amendment
to Air Interface for Fixed Broadband Wireless Access Systems.
[12] J. R. Hauser, "MOSFET device scaling," in Handbook of Semiconductor
Manufacturing Technology. Boca Raton, FL: CRC Press, 2008, pp. 8-21.
[13] V. K. Kanchetla, R. Shrestha, and R. Paily, "Multi-standard high-throughput
and low-power quasi-cyclic low density parity check decoder
for worldwide interoperability for microwave access and wireless fidelity
standards," IET Circuits, Devices & Systems, vol. 10, no. 2, pp. 111-120,
2016.
[14] Y.-L. Ueng, C.-J. Yang, Z.-C. Wu, C.-E. Wu, and Y.-L. Wang, "VLSI
decoding architecture with improved convergence speed and reduced
decoding latency for irregular LDPC codes in WiMAX," in IEEE
International Symposium on Circuits and Systems, ISCAS 2008, 2008,
pp. 520-523.
[15] K. Zhang, X. Huang, and Z. Wang, "High-throughput layered decoder
implementation for quasi-cyclic LDPC codes," IEEE Journal on Selected
Areas in Communications, vol. 27, no. 6, pp. 985-994, 2009.
[16] T. Heidari and A. Jannesari, "Design of high-throughput QC-LDPC decoder
for WiMAX standard," in 2013 21st Iranian Conference on Electrical
Engineering (ICEE), 2013, pp. 1-4.
