
2016 Euromicro Conference on Digital System Design

Flexible, Cost-Efficient, High-Throughput Architecture for Layered LDPC Decoders with Fully-Parallel Processing Units
Thien T. Nguyen-Ly , Tushar Gupta , Manuel Pezzin , Valentin Savin , David Declercq and Sorin Cotofana
CEA-LETI, MINATEC Campus, Grenoble, France

{thientruong.nguyen-ly, tushar.gupta, manuel.pezzin, valentin.savin}@cea.fr
ETIS, ENSEA / CNRS UMR-8051 / University of Cergy-Pontoise, France, [email protected]
Computer Engineering Laboratory, Delft University of Technology, The Netherlands, [email protected]

Abstract: In this paper, we propose a layered LDPC decoder
architecture targeting flexibility, high throughput, low cost, and
efficient use of the hardware resources. The proposed architecture
provides full design-time flexibility, i.e., it can accommodate
any Quasi-Cyclic (QC) LDPC code, and also allows redefining a
number of parameters of the QC-LDPC code at run time.
The main novelty of the paper consists of: (1) a new low-cost
processing unit that efficiently merges the logical functionalities of the
Variable-Node Unit (VNU) and the A Posteriori Log-Likelihood
Ratio (AP-LLR) unit, (2) a high-speed, low-cost
Check-Node Unit (CNU) architecture, which is executed twice
at each iteration in order to complete the computation of the
check-node messages, (3) a splitting of the iteration processing
into two perfectly symmetric stages, executed in two consecutive
clock cycles, each one using exactly the same processing
resources; the processing load is perfectly balanced between
the two clock cycles, thus yielding an optimal clock frequency.
Synthesis results targeting a 65nm CMOS technology for a
(3, 6)-regular (648, 1296) Quasi-Cyclic LDPC code and for the
WiMAX (1152, 2304) irregular QC-LDPC code show significant
improvements in terms of area and throughput compared to the
baseline architecture discussed in this paper, as well as to several
state-of-the-art implementations.

to their neighbors. This message-passing schedule is usually
referred to as flooding scheduling [2]. A different approach is
to split the parity-check matrix into several horizontal layers,
then process the horizontal layers sequentially, while check-nodes
(rows) within the same layer are processed using a flooding
schedule. Each time a layer is processed, the decoder
updates the neighbor variable-nodes, so as to profit from the
propagated messages, and then proceeds to the next layer.
This message scheduling, known as layered scheduling [3],
propagates information faster and converges in about half the
number of iterations compared to the fully parallel scheduling
[4], thus yielding a lower decoding latency. Layered scheduling
advantageously applies to Quasi-Cyclic (QC) LDPC codes
[5], which are naturally equipped with a layered structure,
and are also known to significantly reduce the complexity of the
interconnection network. Due to their benefits in terms of
area/throughput/flexibility, layered QC-LDPC decoders have
been widely adopted, and can be considered a de facto
standard solution in most applications [6]. Additional considerations
may address different optimizations at the processing
unit level, e.g., implementing different decoding algorithms
or processing the input data in either a serial or a parallel
manner [7]. Regarding the MP decoding algorithm, hardware
implementations of LDPC decoders mostly rely on the Min-Sum
(MS) algorithm [8], since the corresponding VNUs
and CNUs can be implemented by very simple arithmetic
operations (additions and comparisons).
In this work, we propose a layered MS decoder architecture
targeting (i) flexibility, (ii) high throughput, and (iii) low
cost and efficient use of the hardware resources. The highest
flexibility can be achieved by using serial processing units:
VNUs and CNUs process incoming messages in a serial
manner, which makes their implementation independent of the
variable or check-node degree. However, this comes at the
cost of a reduced throughput. Thus, in this paper we focus
on layered LDPC decoder architectures with fully parallel
processing units. Such an architecture has some inherent limitations
in terms of flexibility, mainly concerning the number
of incoming messages into VNUs and CNUs, corresponding to
the degrees (i.e., number of connections) of the corresponding

I. INTRODUCTION
Low Density Parity Check (LDPC) codes are a class of
error correction codes known to closely approach the
Shannon limit under iterative message-passing (MP) decoding
algorithms. MP architectures are composed of processing units
that perform the desired computation by passing messages
to each other. The way such an architecture applies to LDPC
decoding is closely related to the bipartite graph representation
of LDPC codes [1]. It comprises two types of nodes, known as
variable-nodes and check-nodes, corresponding respectively to
coded bits and parity-check equations. Accordingly, an LDPC
decoder comprises two types of processing units, namely
Variable-Node Units (VNUs) and Check-Node Units (CNUs),
which exchange messages according to the structure of the
bipartite graph.
MP decoders may use different scheduling strategies,
according to the order in which variable and check-node messages
are updated during the iterative message-passing process.
The classical convention is that, at each iteration, all check-nodes
and subsequently all variable-nodes pass new messages
978-1-5090-2817-7/16 $31.00 2016 IEEE
DOI 10.1109/DSD.2016.33


II. LAYERED MS DECODING FOR QC-LDPC CODES

We consider a QC-LDPC code defined by a base matrix B
of size $R \times C$, with integer entries $b_{i,j} \geq -1$. The parity-check
matrix H is obtained by expanding the base matrix B
by an expansion factor Z; thus, each entry of B is replaced
by a square matrix of size $Z \times Z$, defined as follows: $-1$
entries are replaced by the all-zero matrix, while $b_{i,j} \geq 0$
entries are replaced by a circulant matrix, obtained by right-shifting
the identity matrix by $b_{i,j}$ positions. Hence, H has
$M = R \times Z$ rows and $N = C \times Z$ columns. We also denote
by $\mathcal{M}_r$ the set of Z consecutive rows of H corresponding to
the r-th row of B. $\mathcal{M}_r$ is further referred to as a (decoding)
layer of H. Finally, we denote by $\mathcal{N}(m)$ the set of columns
of H having a non-zero (1) entry in the m-th row, for
any $m = 1, \ldots, M$. In the bipartite graph representation,
check and variable nodes correspond respectively to rows and
columns of H, and they are connected by edges according to
the non-zero entries of H. The number of edges incident to
each check or variable node (or equivalently, the weight of the
corresponding row/column) is referred to as the node degree.
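The expansion of the base matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `expand_base_matrix` and `neighbours` and the toy base matrix in the usage note are hypothetical and do not come from the paper, and a dense list-of-lists is used for readability rather than efficiency.

```python
def expand_base_matrix(B, Z):
    """Expand a base matrix B (integer entries >= -1) into the binary matrix H.

    A -1 entry becomes the Z x Z all-zero block; an entry b >= 0 becomes the
    Z x Z identity matrix cyclically right-shifted by b positions.
    """
    R, C = len(B), len(B[0])
    H = [[0] * (C * Z) for _ in range(R * Z)]
    for i in range(R):
        for j in range(C):
            shift = B[i][j]
            if shift < 0:
                continue  # -1 entry: leave the all-zero block
            for k in range(Z):
                # Row k of the shifted identity has its 1 in column (k + shift) mod Z.
                H[i * Z + k][j * Z + (k + shift) % Z] = 1
    return H


def neighbours(H, m):
    """Return N(m): the columns of H with a non-zero entry in row m."""
    return [n for n, h in enumerate(H[m]) if h]
```

For a hypothetical base matrix `B = [[0, 1, -1], [2, -1, 0]]` with `Z = 3`, the result has 6 rows and 9 columns, every check node has degree 2, and rows 0 to 2 form the first decoding layer.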
Let $(x_1, \ldots, x_N)$ denote a codeword that is sent over
a binary-input channel, and $(y_1, \ldots, y_N)$ be the received
word. The following notation for MP decoders will be used
throughout the paper:
$\gamma_n = \log(\Pr(x_n = 0 \mid y_n) / \Pr(x_n = 1 \mid y_n))$: the LLR value
of $x_n$ according to the received value $y_n$; it is also referred to
as the a priori LLR of variable node n;
$\tilde{\gamma}_n$: the a posteriori (AP) LLR of variable node n;
$\alpha_{m,n}$: message sent from variable-node n to check-node m;
$\beta_{m,n}$: message sent from check-node m to variable-node n.
The layered MS decoding is described in Algorithm 1. To
match the hardware implementation that will be discussed

variable and check nodes in the Tanner graph [1]. To ensure
the highest possible flexibility, the proposed architecture can
accommodate any QC-LDPC code, and also allows redefining
a number of parameters at run time, e.g., the number of rows
of the QC base matrix, as well as the positions and values of
the non-negative entries within each row.
The classical solution to increase throughput and to also
ensure an efficient use of hardware resources in layered
architectures is to pipeline the datapath. However, the number
of stages in the datapath may impose specific constraints on
the base matrix of the QC-LDPC code, in order to ensure
that no memory conflicts occur during the read/write operations
from/to the memory storing the exchanged messages
or the a posteriori log-likelihood ratio (AP-LLR)
values. Moreover, pipelined architectures violate the layered
scheduling principle, in the sense that the processing of each layer
starts before the processing of the previous layer has completed, thus
reducing the convergence speed. To avoid such limitations,
the proposed architecture does not use pipelining. Instead, we
propose a specific design of the datapath processing units
(VNUs, CNUs, and AP-LLR units) that allows an efficient
reuse of the hardware resources, thus yielding a significant cost
reduction. Accordingly, the main novelty of the paper consists
of: (1) A low-cost VNU/AP-LLR processing unit that efficiently
merges the logical functionalities of the VNU
and AP-LLR units, and can be executed by selecting either
the VNU or the AP-LLR mode. (2) A high-speed, low-cost
CNU architecture, which only computes the first minimum
(min1) and the index of the first minimum (indx_min1), instead
of the first two minima and indx_min1 as required by the MS
decoding algorithm. To compute the second minimum (min2),
the CNU is executed a second time with the input at position
indx_min1 set to the maximum value (according to the bit-length of the
exchanged messages). Due to a specific organization of the
datapath, the second execution of the CNU does not induce
any penalty in terms of throughput, as explained below. (3) We
split the iteration processing into two perfectly symmetric stages,
executed in two consecutive clock cycles, each one using the
same processing resources. In the first clock cycle we perform
read operations, then execute the VNU/AP-LLR unit in VNU
mode, and the CNU to compute min1 and indx_min1. In
the second clock cycle we execute the CNU to compute min2,
the VNU/AP-LLR unit in AP-LLR mode, and perform write-back
operations. The processing load is perfectly balanced
between the two clock cycles, thus yielding an optimal clock
frequency. In particular, the second execution of the CNU
during the second clock cycle does not impose any penalty
on the operating clock frequency.
The paper is organized as follows. In Section II we briefly
review QC-LDPC codes and the MS decoding algorithm.
Section III details the proposed low-cost, high-throughput
flexible architecture for the layered MS decoder. We first discuss
the baseline architecture, and then the main enhancements
that we incorporate into this architecture. Implementation
results are provided in Section IV, and Section V concludes
the paper.

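Algorithm 1 Layered MS decoding algorithm

Input: $(\gamma_1, \ldots, \gamma_N)$    // input LLRs
Output: $(\hat{x}_1, \ldots, \hat{x}_N)$    // estimated codeword
[Initialization]
for all $n = 1, \ldots, N$ do $\tilde{\gamma}_n = \gamma_n$;
for all $m = 1, \ldots, M$ and $n \in \mathcal{N}(m)$ do $\beta_{m,n} = 0$;
[Decoding Iterations]
for all iter = 1, ..., iter_max do    // iteration loop
    for all $r = 1, \ldots, R$ do    // loop over horizontal layers
        for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do    // VNU
            $\alpha_{m,n} = \tilde{\gamma}_n - \beta_{m,n}$;
        for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do    // CNU
            $\beta_{m,n} = \left( \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sign}(\alpha_{m,n'}) \right) \min_{n' \in \mathcal{N}(m) \setminus n} |\alpha^{\mathrm{SAT}}_{m,n'}|$;
            // where $\alpha^{\mathrm{SAT}}_{m,n'}$ is the value of $\alpha_{m,n'}$ saturated to q bits
        for all $m \in \mathcal{M}_r$ and $n \in \mathcal{N}(m)$ do    // AP-LLR
            $\tilde{\gamma}_n = \alpha_{m,n} + \beta_{m,n}$;
    end (horizontal layers loop)
    for all $n = 1, \ldots, N$ do $\hat{x}_n$ = sign_bit($\tilde{\gamma}_n$);    // hard decision
    if $H \hat{x}^{T} = 0$ then exit iteration loop;    // syndrome check
end (iteration loop)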
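As a software illustration of Algorithm 1, the following Python sketch runs the VNU, CNU, and AP-LLR steps layer by layer on integer messages. It is a behavioural model only, not the hardware datapath: the names `sat` and `layered_ms_decode`, the default bit widths, and the tiny three-check code in the usage note are assumptions made for this example. Check nodes of a layer are visited sequentially, which gives the same result as parallel processing when each variable node appears at most once per layer.

```python
def sat(v, bits):
    """Saturate integer v to the symmetric signed range of a `bits`-bit word."""
    lim = (1 << (bits - 1)) - 1
    return max(-lim, min(lim, v))


def layered_ms_decode(layers, llr, q=4, q_tilde=6, iter_max=20):
    """Layered Min-Sum decoding.

    layers : list of layers; a layer is a list of check nodes; a check node
             is the list N(m) of the variable-node indices it is connected to.
    llr    : integer input LLRs (gamma_n).
    """
    ap = [sat(g, q_tilde) for g in llr]      # AP-LLRs, initialised to gamma_n
    beta = {}                                # check-to-variable messages, 0 by default
    checks = [nm for layer in layers for nm in layer]
    x_hat = [1 if v < 0 else 0 for v in ap]
    for _ in range(iter_max):
        m = 0
        for layer in layers:
            for nm in layer:
                # VNU: alpha = AP-LLR - beta, kept on q_tilde bits
                alpha = [sat(ap[n] - beta.get((m, n), 0), q_tilde) for n in nm]
                a_sat = [sat(a, q) for a in alpha]   # saturation before the CNU
                for k, n in enumerate(nm):
                    # CNU: sign product and minimum over all inputs except k
                    others = a_sat[:k] + a_sat[k + 1:]
                    sign = -1 if sum(a < 0 for a in others) % 2 else 1
                    beta[(m, n)] = sign * min(abs(a) for a in others)
                    # AP-LLR update uses the unsaturated alpha
                    ap[n] = sat(alpha[k] + beta[(m, n)], q_tilde)
                m += 1
        x_hat = [1 if v < 0 else 0 for v in ap]      # hard decision
        if all(sum(x_hat[n] for n in nm) % 2 == 0 for nm in checks):
            break                                    # syndrome check passed
    return x_hat
```

With the toy code `layers = [[[0, 1]], [[1, 2]], [[0, 2]]]` (three layers of one degree-2 check each, whose only codewords are 000 and 111), the input `[3, -1, 3]` is corrected to `[0, 0, 0]` within the first iteration.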

Figure 1. Block diagram of the baseline layered MS decoder architecture

Figure 2. Compressed $\beta$-message

III. LAYERED MS DECODER ARCHITECTURE

For the sake of simplicity, we shall first assume that all
the check-nodes have the same degree, which will be denoted
in the sequel by dcmax. No further assumptions are made
regarding the base matrix B. The case of check-node irregular
codes will be discussed in Section III-C. We start by discussing
the baseline architecture, then the proposed enhancements are
discussed in Section III-B.

in the next section, we assume that input LLRs $\gamma_n$ and
check-to-variable node messages $\beta_{m,n}$ are quantized on q
bits, while AP-LLR values $\tilde{\gamma}_n$ are quantized on $\tilde{q}$ bits, with
$q < \tilde{q}$. Subtractions and additions used in the VNU and AP-LLR
steps are implemented through the use of $\tilde{q}$-bit saturated
adders. Hence, variable-to-check messages $\alpha_{m,n}$ computed at
the VNU step are quantized on $\tilde{q}$ bits, and they are saturated
to q bits just before entering the CNU. The $\alpha_{m,n}$ values used
at the AP-LLR step are the unsaturated $\tilde{q}$-bit values.
It is worth noting that for a given m, the absolute values
of the $\beta_{m,n}$ messages computed at the CNU step are equal
to either the first or the second minimum of the absolute values
$|\alpha^{\mathrm{SAT}}_{m,n}|$ of the input messages. Moreover, there is only one
$\beta_{m,n}$ message whose absolute value is equal to the second
minimum, namely the one whose variable-node index corresponds to the
first minimum. In the sequel, we shall denote by min1 and
min2 the first and second minimum, and by indx_min1 the
index of the first minimum. Thus, $\beta_{m,n}$ messages can be stored
in a compressed format [9] to reduce memory requirements,
by storing only their signs and the min1, min2, and indx_min1
values, as shown in Figure 2.
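The compressed format can be illustrated with a short Python sketch. The function names `compress` and `decompress` are hypothetical, and the tuple layout (signs, min1, min2, indx_min1) is an assumption chosen for readability; the actual bit-level packing of Figure 2 is not reproduced here.

```python
def compress(alpha_sat):
    """Compress the check-node outputs computed from one check node's inputs.

    alpha_sat : saturated variable-to-check messages entering the check node.
    Returns (signs, min1, min2, indx_min1), where signs[k] is 1 when the
    outgoing message to input k is negative.
    """
    mags = [abs(a) for a in alpha_sat]
    indx_min1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[indx_min1]
    min2 = min(v for k, v in enumerate(mags) if k != indx_min1)
    parity = sum(a < 0 for a in alpha_sat) % 2
    # Sign of the message to input k: parity of all input signs except input k.
    signs = [parity ^ int(a < 0) for a in alpha_sat]
    return signs, min1, min2, indx_min1


def decompress(signs, min1, min2, indx_min1):
    """Rebuild the individual check-to-variable messages."""
    return [(-1 if s else 1) * (min2 if k == indx_min1 else min1)
            for k, s in enumerate(signs)]
```

For the inputs `[3, -1, 5, -2]`, the stored values are min1 = 1, min2 = 2, and indx_min1 = 1, and decompression returns `[1, -2, 1, -1]`: only the message at index 1 carries the second minimum.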

A. Baseline Architecture
Figure 1 illustrates the baseline architecture of the layered
MS decoder, whose main blocks are further discussed below.
Each decoding iteration takes two clock cycles. All data are
read and processed at the first rising clock edge, then written
at the second rising clock edge.
Memory blocks. Two memory blocks are used, one for
the $\tilde{\gamma}_n$ values ($\tilde{\gamma}$ memory) and one for the $\beta_{m,n}$ messages
($\beta$ memory). $\tilde{\gamma}_n$ values are quantized on $\tilde{q}$ bits, and $\beta_{m,n}$
messages on q bits. The $\tilde{\gamma}$ memory is implemented by registers,
in order to allow massively parallel read or write operations.
The memory is organized in C blocks, denoted by AP_i
(i = 0, ..., C − 1), corresponding to the number of columns of the
base matrix, each one consisting of $Z \times \tilde{q}$ bits. Data are read
from/written to blocks corresponding to non-negative entries in
the row of B (layer) being processed. The $\beta$ memory is implemented
as a Random Access Memory (RAM). Each memory
word consists of Z compressed $\beta$-messages, corresponding to
one row of B.
Permutations for Reading and Writing (PER_R, PER_W).
The PER_R permutation is used to rearrange the data read from
the $\tilde{\gamma}$ memory, according to the processed layer, so as to ensure
processing by the proper VNU/CNU. The PER_W block performs
the inverse operation of PER_R.
Barrel Shifters for Reading and Writing (BS_R, BS_W).
Barrel shifters are used to implement the cyclic (shift) permutations
corresponding to the non-negative entries of the base
matrix B. We use dcmax BS_R and dcmax BS_W blocks,
corresponding to the check-node degree, each of them having
$Z \times \tilde{q}$-bit inputs and $Z \times \tilde{q}$-bit outputs.
Decompress. This block is used to convert $\beta_{m,n}$ messages
from the compressed format to the uncompressed one.
Variable Node Units (VNUs). These processing units compute
the $\alpha_{m,n}$ messages. The inputs of the VNUs are read
from the $\tilde{\gamma}$ memory and the $\beta$ memory. Each VNU_i block (i =
0, ..., dcmax − 1) in Figure 1 consists of Z $\tilde{q}$-bit saturated
subtractors for the parallel execution of Z variable-nodes (one
column of B).
Saturators (SATs). Prior to CNU processing, $\alpha_{m,n}$ values are
saturated to q bits.
Check Node Units (CNUs). These processing units compute
the $\beta_{m,n}$ messages. For simplicity, Figure 1 shows one CNU
block with dcmax inputs, each one of size $Z \times q$ bits. Thus, this
block actually includes Z computing units, used to process
in parallel the Z check-nodes within one layer. The check-node
processing consists of computing the signs of the $\beta_{m,n}$ messages,
as well as the min1, min2, and indx_min1 values,
and is implemented by using the high-speed, low-cost tree-structure
(TS) approach proposed in [10].
AP-LLR Units. These units compute the $\tilde{\gamma}_n$ values. Each
AP_LLR_i block (i = 0, ..., dcmax − 1) in Figure 1 consists
of Z $\tilde{q}$-bit saturated adders, for the parallel execution of Z
variable-nodes (one column of B).
Controller. This block generates control signals such as
count_layer for indicating which layer is being processed,
En_read and En_write for reading and writing data, etc. It
also controls the synchronous execution of the other blocks.

Figure 3. New processing units for the layered MS decoder architecture

Figure 4. VNU/AP-LLR processing unit

Figure 5. Adder/subtractor block used within the VNU/AP-LLR unit

1) VNU/AP-LLR Unit: The main difference between the VNU
and AP-LLR processing units is that subtractors are used
within the former, while adders are used within the latter. We
propose a new VNU/AP-LLR processing unit that merges their
logical functionalities, controlled by a specific signal (sel)
that selects between the VNU and AP-LLR modes. The
control signal is generated by the controller, such that VNU
mode is selected during the first clock cycle, and AP-LLR mode
during the second.
The block diagram of the VNU/AP-LLR unit is detailed
in Figure 4. At the input, two multiplexers are used to select
the input data according to either the VNU or AP-LLR mode.
Similarly, at the output, a de-multiplexer is used to choose the
value of either $\alpha_{m,n}$ or $\tilde{\gamma}_n$, depending on the sel signal. The
block in the middle, which may act as either a subtractor or

B. Enhanced Architecture
In this section we discuss the main enhancements that we
incorporate into the baseline architecture, which consist
of (1) a low-cost VNU/AP-LLR processing unit that efficiently
merges the logical functionalities of the VNU
and AP-LLR units, (2) a low-cost CNU architecture, which
is executed twice in order to complete the computation of the
check-node messages, (3) a splitting of the iteration processing
into two perfectly symmetric stages, yielding an optimal clock
frequency. The VNU/AP-LLR unit and the new CNU replace
the VNU, AP-LLR, and old CNU units of the baseline
architecture, as shown in Figure 3 (where VNU/AP-LLR is
shortened to VN/AP). All the other blocks of the architecture
remain the same.

Figure 6. Block diagram of the proposed CNU architecture

Figure 7. 2-FMIG architecture

Figure 9. IG (Index Generator) architecture

a number of inputs ($2^k + 2^r$) equal to the sum of two powers
of 2. The general case can be worked out by decomposing the
number of inputs as a sum of powers of 2, then combining the
corresponding blocks similarly to the technique used in [10].
The $2^k$-FMIG (First Minimum and Index Generator) block
computes the value and the index of the first minimum among
the $2^k$ input values. The 2-FMIG block includes one comparator
and one multiplexer, as shown in Figure 7. The 4-FMIG
consists of three 2-FMIG blocks for finding the minimum
value and one multiplexer for indicating its index, as shown in
Figure 8. Similarly, the $2^{k+1}$-FMIG block can be constructed
from three $2^k$-FMIG blocks and one multiplexer. The IG
(Index Generator) block in Figure 6 is used to determine the
index of the minimum value, and is further detailed in Figure 9.
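The tree-structured search for the first minimum and its index can be mimicked in software by a recursive compare-and-select, as sketched below in Python. This is a behavioural illustration under assumptions, not the gate-level FMIG of Figures 7 to 9: the function name `fmig` is hypothetical, the input count is assumed to be a power of two, and ties are resolved in favour of the lower index.

```python
def fmig(values):
    """Return (minimum, index of minimum) of a list whose length is a power of two.

    Each recursion level plays the role of one comparator stage: the minima of
    the two halves are compared, and the index of the winner is forwarded.
    """
    if len(values) == 1:
        return values[0], 0
    half = len(values) // 2
    lo_min, lo_idx = fmig(values[:half])
    hi_min, hi_idx = fmig(values[half:])
    # 2-input compare-and-select; on a tie the lower half wins.
    if lo_min <= hi_min:
        return lo_min, lo_idx
    return hi_min, half + hi_idx
```

For eight inputs `[4, 9, 2, 6, 8, 1, 7, 3]`, the call returns the minimum 1 together with its index 5.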
3) Iteration Processing Split: As shown in Figure 3, in
the new architecture the clock signal is fed to the CNU.
This allows splitting the iteration processing into two perfectly
symmetric stages, executed in two consecutive clock cycles,
each one using the same processing units, but in a different
mode. In the first clock cycle we perform read operations, then
execute the VNU/AP-LLR unit in VNU mode, and the CNU to
compute min1 and indx_min1. In the second clock cycle
we execute the CNU to compute min2, the VNU/AP-LLR
unit in AP-LLR mode, and perform write-back operations.
The processing load is perfectly balanced between the two
clock cycles, thus yielding an optimal clock frequency. In
particular, the second execution of the CNU during the second
clock cycle does not impose any penalty on the operating
clock frequency. The baseline CNU (i.e., computing min1,
min2, and indx_min1) executed in one of the two clock

Figure 8. 4-FMIG architecture

an adder, is detailed in Figure 5 (for the sake of simplicity,
we illustrate this block for 4-bit inputs). It consists of a
modified Ripple Carry Adder (RCA) with carry-in given by
the complement of the sel signal ($C_0 = \overline{sel}$), which
is further XORed with all the bits of the second input. It can
be easily seen that the VNU/AP-LLR unit operates in VNU
mode if sel = 0 ($C_0 = 1$), or in AP-LLR mode if sel = 1
($C_0 = 0$).
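The behaviour of the modified ripple-carry adder can be checked with a small bit-level model in Python. The sketch relies on the two's complement identity a − b = a + NOT(b) + 1. The function name `add_sub` and the default width are assumptions made for this example, and the model wraps around on overflow: the saturation logic of the actual unit is deliberately left out.

```python
def add_sub(a, b, sel, bits=4):
    """Bit-level model of the adder/subtractor (two's complement, `bits` wide).

    sel = 0 (VNU mode):    C0 = 1, computes a - b.
    sel = 1 (AP-LLR mode): C0 = 0, computes a + b.
    """
    c0 = 1 - sel                      # carry-in C0 is the complement of sel
    carry = c0
    result = 0
    for i in range(bits):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ c0      # every bit of the second input is XORed with C0
        result |= (ai ^ bi ^ carry) << i              # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))       # full-adder carry out
    if result >= 1 << (bits - 1):     # reinterpret the bit pattern as a signed value
        result -= 1 << bits
    return result
```

For example, `add_sub(3, 2, 0)` yields 1 (subtraction in VNU mode) and `add_sub(3, 2, 1)` yields 5 (addition in AP-LLR mode).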
2) CNU Unit: We focus only on the computation of min1,
min2, and indx_min1, as the signs of the output messages
can be simply computed by XORing the appropriate signs of the
input messages. We propose a high-speed, low-cost CNU
architecture inspired by the TS architecture proposed in [10],
which is further simplified so as to compute only the value
and the index of the first minimum. As shown in Figure 6, our
CNU is executed during the first clock cycle to compute min1
and indx_min1, then it is re-executed during the second
clock cycle with the input at position indx_min1 set to the maximum value,
so as to compute min2. The sel control signal is used to
indicate whether the CNU is in first or second minimum mode
(first or second clock cycle). The compare and select block is
used to set the input at position indx_min1 to the maximum value
when the sel signal indicates that the second minimum
is being computed (second clock cycle).
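The two-pass computation of min1 and min2 can be sketched as follows in Python. The function names `cnu_pass` and `two_cycle_cnu` are hypothetical, and a plain linear search stands in for the tree-structured minimum finder; the point is only to show that masking the input at indx_min1 with the maximum magnitude turns a single-minimum unit into a second-minimum unit.

```python
def cnu_pass(mags, sel, indx_min1, max_val):
    """One execution of a CNU that only finds the first minimum and its index.

    sel = 0: first clock cycle, the inputs are used as they are.
    sel = 1: second clock cycle, the input at indx_min1 is replaced by max_val
             (the role of the compare and select block).
    """
    if sel == 1:
        mags = [max_val if k == indx_min1 else v for k, v in enumerate(mags)]
    idx = min(range(len(mags)), key=mags.__getitem__)
    return mags[idx], idx


def two_cycle_cnu(mags, q=4):
    """Compute (min1, min2, indx_min1) of q-bit magnitudes in two passes."""
    max_val = (1 << (q - 1)) - 1                        # largest magnitude on q bits
    min1, indx_min1 = cnu_pass(mags, 0, None, max_val)  # first clock cycle
    min2, _ = cnu_pass(mags, 1, indx_min1, max_val)     # second clock cycle
    return min1, min2, indx_min1
```

For the magnitudes `[3, 1, 5, 2]`, the first pass finds min1 = 1 at index 1 and the second pass, run on `[3, 7, 5, 2]`, finds min2 = 2.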
The proposed CNU architecture is detailed in Figure 6 for

Figure 10. Modified VNU to accommodate variable check-node degree
(example for dcmin = dcmax − 1)

Figure 11. Modified CNU to accommodate variable check-node degree
(example for dcmin = dcmax − 1)

cycles would lead to an increased critical path, and therefore a
reduced clock frequency, while splitting its execution between
the two clock cycles would have resulted in an inefficient use
of the hardware resources.

corresponds to one row of the base matrix B. However, in
general it is also possible to define a decoding layer as RPL
consecutive rows of the base matrix, as long as each column
of B has at most one non-negative entry in each layer. This
feature has been integrated into our design. If RPL > 1, the
number of decoding layers is equal to R/RPL, with RPL × Z
check nodes per layer.
Finally, the user-defined parameters allow specifying the
quantization parameters ($q$, $\tilde{q}$) and the number of decoding
iterations.

C. Case of Check-Node Irregular Codes

To accommodate QC-LDPC codes with a variable check-node
degree dc, ranging from dcmin to dcmax, some extra control logic is
required in order to inactivate the last dcmax − dc VNU/AP-LLR
units, as well as the last dcmax − dc inputs of the CNU,
for check-nodes of degree dc. A VNU/AP-LLR unit is
inactivated by setting the corresponding $\beta$-inputs to 0, while an
input of the CNU is inactivated by setting it to the maximum
value ($2^{q-1} - 1$, where q is the number of quantization bits
of the input $\alpha^{\mathrm{SAT}}_{m,n}$ values, including the sign bit). The modified
VNU/AP-LLR and CNU architectures are shown in Figure 10
and Figure 11, respectively, for dcmin = dcmax − 1.

IV. IMPLEMENTATION RESULTS

We have implemented the baseline and enhanced layered
MS decoder architectures for a regular QC-LDPC code with
variable-nodes of degree dv = 3, and for the irregular WiMAX
QC-LDPC code with rate 1/2 [11]. For both codes, the size
of the base matrix is equal to R × C = 12 × 24. For the regular code,
the base matrix B is shown in Figure 13. It can be divided into
3 horizontal layers, with each layer corresponding to RPL = 4
consecutive rows of B. For the WiMAX code, the RPL value
is set to 1, thus the number of decoding layers is equal to
12. Configuration parameters of the two decoders are further
detailed in Table I.
ASIC synthesis results targeting a 65nm CMOS technology
are shown in Table II. The top part of the table reports the
maximum operating frequency, the corresponding throughput,
and the area. The reported throughput is given by the formula:

D. Design and Run Time Flexibility

Figure 12 details the flowchart of the QC-LDPC decoder
generation. The VHDL inputs consist of two configuration
files, for the base-matrix related parameters and the user-defined
parameters. Base-matrix parameters relate to either the
matrix size (number of rows and columns, expansion factor) or
to the number, positions, and values of the non-negative entries
(dcmin, dcmax, positions and values of non-negative entries
per row). While some of these parameters are fixed, meaning
that they cannot be overwritten at run time, the number of
rows of the base matrix as well as the positions and values of
non-negative entries per row can be overwritten at run time,
while still ensuring proper operation of the decoder using the
redefined base matrix. This property is particularly useful to
achieve flexibility of the implemented decoder with respect
to the coding rate. Note that it would also be possible
to achieve flexibility with respect to the expansion factor
(Z) value, by including some extra control logic. However,
such control logic has not been included in our current
implementation, so we report this parameter as being fixed.
The RPL parameter shown in Figure 12 allows defining
the number of base matrix Rows Per Layer. For the sake of
simplicity, we have assumed so far that one decoding layer

$$\text{Throughput} = \frac{N \times f_{\max}}{\text{iter\_number} \times \text{cyc\_iter}},$$

where N = C × Z is the codeword length, and cyc_iter =
2 × (R/RPL) is the number of clock cycles needed to complete
one iteration (2 clock cycles per layer, times the number of
layers). First, we note that the enhanced architecture provides
a significant increase in the maximum operating frequency
compared to the baseline architecture, by a factor of 2.25 and
3, for the (3, 6)-regular and the WiMAX code, respectively.
This is due to the proposed increased-speed CNU together with
the proposed split of the iteration processing. Regarding the
area, it can be seen that the enhanced architecture provides
a significant area reduction for the (3, 6)-regular code, by
24.2% compared to the baseline architecture. However, the

Figure 12. Flowchart for QC-LDPC decoder generation

Table I
PARAMETERS OF THE QC-LDPC CODES
[Rows: (3, 6)-regular, WiMAX. Recoverable column headers: RPL, dcmin, dcmax, q, iter_number.]

Table II
COMPARISON BETWEEN ENHANCED AND BASELINE ARCHITECTURES FOR (3, 6)-REGULAR AND WIMAX QC-LDPC CODES

                       (3, 6)-regular QC-LDPC      WiMAX QC-LDPC
                       Baseline     Enhanced       Baseline     Enhanced
Max. Freq. (MHz)       111          250            83           250
Throughput (Mbps)      1198         2700           398          1200
Area (mm^2)            0.95         0.72           0.88         0.86

Same timing constraints applied to both architectures:
Frequency (MHz)        111          111            83           83
Area (mm^2)            0.95         0.71           0.88         0.76

Figure 13. Base matrix of the (3, 6)-regular QC-LDPC code

metrics is detailed in the footnote to Table III. Note that for
all the reported implementations, the achieved throughput is
inversely proportional to the number of iterations, hence the
NTAR metric corresponds to the TAR value assuming that
only one decoding iteration is performed. We mention that the
decoder proposed in [13] is a reconfigurable decoder that supports
the IEEE 802.16e (WiMAX) and the IEEE 802.11n
(WiFi) wireless standards. The reported throughput is the
maximum achievable coded throughput for the (1152, 2304)
WiMAX code with 5 decoding iterations. From Table III it
can be seen that the proposed enhanced architecture compares
favorably with state-of-the-art implementations, yielding an
NTAR value of 27.9 Gbps/mm^2/iteration.
Finally, we mention that for the (3, 6)-regular QC-LDPC
code, the proposed enhanced architecture achieves an NTAR
value of 75 Gbps/mm^2/iteration.
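The TAR and NTAR figures quoted above can be recomputed with a few lines of Python. The helper names `scale_to_65nm` and `tar_ntar` are hypothetical. The iteration count of 20 used in the usage note is an assumption: it is not read from the text, but it is the value for which the reported throughput, area, and NTAR figures of the proposed decoder are mutually consistent.

```python
def scale_to_65nm(throughput_mbps, area_mm2, tech_nm):
    """Scale throughput by (tech/65) and area by (65/tech)^2."""
    return throughput_mbps * tech_nm / 65, area_mm2 * (65 / tech_nm) ** 2


def tar_ntar(throughput_mbps, area_mm2, tech_nm, iterations):
    """Return (TAR, NTAR) in Mbps/mm^2 and Mbps/mm^2/iteration."""
    tput, area = scale_to_65nm(throughput_mbps, area_mm2, tech_nm)
    tar = tput / area
    return tar, tar * iterations   # NTAR = TAR x Iterations
```

With 1200 Mbps, 0.86 mm^2, 65 nm, and the assumed 20 iterations, `tar_ntar` gives a TAR of about 1395.35 Mbps/mm^2 and an NTAR of about 27907 Mbps/mm^2/iteration, i.e. 27.9 Gbps/mm^2/iteration. With 2700 Mbps and 0.72 mm^2 it gives about 75 Gbps/mm^2/iteration for the (3, 6)-regular code.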

area reduction is only 2.27% for the WiMAX code. In order
to keep the area comparison on an equal basis with respect
to synthesis timing constraints, in the bottom part of Table II
we report area figures when the same timing constraints are
applied to both the baseline and the enhanced architecture.
We consider timing constraints corresponding to the maximum
operating frequency of the baseline architecture. In this case,
it can be seen that the proposed cost-efficient VNU/AP-LLR
and CNU processing units yield an area reduction of 25.26%
for the (3, 6)-regular code, and of 13.64% for the WiMAX
code.
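As a sanity check of the throughput formula of this section, the reported throughput values can be reproduced with the Python sketch below. The function name `throughput_mbps` is hypothetical, and the iteration count of 20 is an assumption that is not stated in the text: it is the value for which the formula returns the reported 1200 Mbps and 2700 Mbps at 250 MHz.

```python
def throughput_mbps(n, f_max_mhz, iter_number, rows, rpl):
    """Throughput = N * f_max / (iter_number * cyc_iter), in Mbps.

    cyc_iter = 2 * (R / RPL): two clock cycles per layer, times the number of layers.
    """
    cyc_iter = 2 * (rows // rpl)
    return n * f_max_mhz / (iter_number * cyc_iter)


# WiMAX code: N = 2304, R = 12, RPL = 1 (12 layers, 24 cycles per iteration)
wimax = throughput_mbps(2304, 250, 20, 12, 1)
# (3, 6)-regular code: N = 1296, R = 12, RPL = 4 (3 layers, 6 cycles per iteration)
regular = throughput_mbps(1296, 250, 20, 12, 4)
```

Under the same assumption, the baseline frequency of 111 MHz gives about 1198.8 Mbps for the (3, 6)-regular code, in line with the 1198 Mbps reported for the baseline architecture.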
For the WiMAX QC-LDPC code, the proposed enhanced
architecture is further compared with other state-of-the-art
implementations in Table III. We also report throughput and
area figures scaled to 65nm [12], as well as the Throughput to
Area Ratio (TAR) and the Normalized TAR (NTAR) metrics
[13], so as to keep the throughput comparison on an equal
basis with respect to technology, area, and number of iterations.
To scale throughput and area to 65nm, we use the scale
factors (technology size/65) and (65/technology size)^2, respectively, as
suggested in [12]. The computation of the TAR and NTAR

V. CONCLUSION
In this paper we proposed a low-cost and flexible architecture
for high-throughput layered LDPC decoders with
fully-parallel processing units. To do so, we proposed new
processing unit architectures that allow a more efficient hardware
usage, thus yielding a significant cost reduction. The
proposed CNU further allows splitting the iteration processing
into two perfectly symmetric stages, resulting in a significant
increase in the maximum operating frequency. The proposed

Table III
COMPARISON BETWEEN THE PROPOSED ENHANCED ARCHITECTURE AND STATE OF THE ART IMPLEMENTATIONS FOR THE WIMAX QC-LDPC CODE

                               Y. Ueng       K. Zhang      T. Heidari    K. Kanchetla   Proposed
                               (2008) [14]   (2009) [15]   (2013) [16]   (2016) [13]    decoder
Code length                    2304          2304          2304          576-2304       2304
Technology (nm)                180           90            130           90             65
Frequency (MHz)                200           950           100           149            250
Iterations                     4.6 (average) 10            10            5              20
Throughput (Mbps)              106           2200          183           955            1200
Tput. scaled to 65nm (Mbps)    294           3036          366           1318           1200
Area (mm^2)                    -             2.90          6.90          11.42          0.86
Area scaled to 65nm (mm^2)     -             1.51          1.73          5.94           0.86
TAR (Mbps/mm^2)                -             2010.60       211.56        221.89         1395.35
NTAR (Mbps/mm^2/iter)          -             20106         2115.6        1109.45        27907

Depending on the implementation, either only the core area or the total chip area is reported.
TAR = (Throughput scaled to 65nm) / (Area scaled to 65nm)
NTAR = TAR × Iterations

enhanced architecture allows full design-time flexibility, and
also provides good run-time flexibility, by allowing the same
architecture to be executed with different base matrices sharing
a number of common characteristics. Finally, the benefits
of the proposed architecture have been demonstrated through
comparison with a baseline layered architecture with fully-parallel
processing units, as well as with several state-of-the-art
implementations of layered LDPC decoders.

[9] Z. Wang and Z. Cui, "A memory efficient partially parallel decoder
architecture for quasi-cyclic LDPC codes," IEEE Trans. on Very Large
Scale Integration (VLSI) Systems, vol. 15, no. 4, pp. 483-488, 2007.
[10] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, "Algorithms of finding the
first two minimum values and their hardware implementation," IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 11,
pp. 3430-3437, 2008.
[11] IEEE-802.16e, "Physical and medium access control layers for combined
fixed and mobile operation in licensed bands," 2005, amendment
to Air Interface for Fixed Broadband Wireless Access Systems.
[12] J. R. Hauser, "MOSFET device scaling," in Handbook of Semiconductor
Manufacturing Technology. Boca Raton, FL: CRC Press, 2008, pp. 8-21.
[13] V. K. Kanchetla, R. Shrestha, and R. Paily, "Multi-standard high-throughput
and low-power quasi-cyclic low density parity check decoder
for worldwide interoperability for microwave access and wireless fidelity
standards," IET Circuits, Devices & Systems, vol. 10, no. 2, pp. 111-120,
2016.
[14] Y.-L. Ueng, C.-J. Yang, Z.-C. Wu, C.-E. Wu, and Y.-L. Wang, "VLSI
decoding architecture with improved convergence speed and reduced
decoding latency for irregular LDPC codes in WiMAX," in IEEE
International Symposium on Circuits and Systems (ISCAS), 2008,
pp. 520-523.
[15] K. Zhang, X. Huang, and Z. Wang, "High-throughput layered decoder
implementation for quasi-cyclic LDPC codes," IEEE Journal on Selected
Areas in Communications, vol. 27, no. 6, pp. 985-994, 2009.
[16] T. Heidari and A. Jannesari, "Design of high-throughput QC-LDPC decoder
for WiMAX standard," in 2013 21st Iranian Conference on Electrical
Engineering (ICEE), 2013, pp. 1-4.

ACKNOWLEDGMENT
The authors acknowledge support from the European H2020
Work Programme, project Flex5Gware, and the French ANR
Programme Blanc-2013, project DIAMOND.
R EFERENCES
[1] R. Tanner, A recursive approach to low complexity codes, IEEE Trans.
on Inf. Theory, vol. 27, no. 5, pp. 533547, 1981.
[2] F. R. Kschischang and B. J. Frey, Iterative decoding of compound
codes by probability propagation in graphical models, IEEE Journal on
Selected Areas in Communications, vol. 16, no. 2, pp. 219230, 1998.
[3] D. Hocevar, A reduced complexity decoder architecture via layered
decoding of LDPC codes, in IEEE Workshop on Signal Processing
Systems (SIPS), 2004, pp. 107112.
[4] J. Zhang, Y. Wang, M. P. Fossorier, and J. S. Yedidia, Iterative decoding
with replicas, IEEE Transactions on Information Theory, vol. 53, no. 5,
pp. 16441663, 2007.
[5] M. P. Fossorier, Quasicyclic low-density parity-check codes from circulant permutation matrices, IEEE Transactions on Information Theory,
vol. 50, no. 8, pp. 17881793, 2004.
[6] E. Boutillon and G. Masera, Hardware design and realization for iteratively decodable codes, in Channel Coding: Theory, Algorithms, and
Applications, D. Declercq, M. Fossorier, and E. Biglieri, Eds. Academic
Press Library in Mobile and Wireless Communications, Elsevier, June
2014.
[7] O. Boncalo, A. Amaricai, A. Hera, and V. Savin, Cost efcient FPGA
layered LDPC decoder with serial AP-LLR processing, in IEEE International Conference on Field Programmable Logic and Applications
(FPL), Munich, Germany, September 2014, pp. 16.
[8] M. Fossorier, M. Mihaljevic, and H. Imai, Reduced complexity iterative
decoding of low-density parity check codes based on belief propagation,
IEEE Trans. on Communications, vol. 47, no. 5, pp. 673680, 1999.
