Abstract—Low-Density Parity Check (LDPC) error correction decoders have become popular in communications systems, as a benefit of their strong error correction performance and their suitability to parallel hardware implementation. A great deal of research effort has been invested into LDPC decoder designs that exploit the flexibility, the high processing speed and the parallelism of Field-Programmable Gate Array (FPGA) devices. FPGAs are ideal for design prototyping and for the manufacturing of small-production-run devices, where their in-system programmability makes them far more cost-effective than Application-Specific Integrated Circuits (ASICs). However, the FPGA-based LDPC decoder designs published in the open literature vary greatly in terms of design choices and performance criteria, making them a challenge to compare. This paper explores the key factors involved in FPGA-based LDPC decoder design and presents an extensive review of the current literature. In-depth comparisons are drawn amongst 140 published designs (both academic and industrial) and the associated performance trade-offs are characterised, discussed and illustrated. Seven key performance characteristics are described, namely their processing throughput, processing latency, hardware resource requirements, error correction capability, processing energy efficiency, bandwidth efficiency and flexibility. We offer recommendations that will facilitate fairer comparisons of future designs, as well as opportunities for improving the design of FPGA-based LDPC decoders.

Index Terms—Digital communication, error correction codes, low-density parity check (LDPC) codes, field programmable gate array, iterative decoding

The financial support of the PhD studentship provided by Altera, California USA, the grants EP/J015520/1 and EP/L010550/1 provided by EPSRC, Swindon UK, the grant TS/L009390/1 provided by Innovate UK, Swindon UK, as well as the Advanced Fellow grant provided by the European Research Council is gratefully acknowledged. The research data for this paper is available at http://dx.doi.org/10.5258/SOTON/384946.

I. INTRODUCTION

LOW-Density Parity Check (LDPC) codes may be employed for correcting transmission errors in communication systems. They represent a class of Forward Error Correction (FEC) codes that are currently the focus of much research within the communications community. They were first proposed by Gallager in 1962 [1], but they were considered to be too complex for practical simulation and implementation at the time of their conception, hence they were left largely untouched for decades. Apart from their excellent performance, perhaps partially motivated by the fact that the turbo codes patented during the early 1990s attracted a license-fee, in 1996 LDPC codes were rediscovered by MacKay and Neal [2], and ever since have enjoyed a renaissance. Given the increased computing power available today they have become a key component of many commercialised communication systems, including WiFi [3], WiMAX [4], DVB-S2 [5], CCSDS [6] and ITU G.hn [7].

LDPC codes benefit from a number of appealing features that make them very attractive for implementation. The LDPC decoding algorithm can be implemented using low-complexity calculations, resulting in a relatively low design and implementation cost for the processing hardware. Like turbo codes, LDPC codes are decoded iteratively, achieving an error correction performance that is close to the theoretical limit when decoding messages that have large block lengths [8]. However, in contrast to turbo codes, there is a wide variety of possible algorithms and levels of parallelisation that may be considered for the design of LDPC decoders, presenting designers with a range of options that may be relied upon to achieve the desired characteristics.

However, while the design of the individual processing components is relatively simple, the design of a complete LDPC decoder is subject to a complex interplay between a number of system characteristics, namely the processing throughput, processing latency, hardware resource requirements, error correction capability, processing energy efficiency, bandwidth efficiency and flexibility. These characteristics depend on a number of system parameters, namely the architecture, the LDPC code employed, the algorithm used and the number of decoding iterations. This relationship is shown in Fig. 1. Note that the bandwidth efficiency also depends on the modulation scheme chosen, as does the transmission energy efficiency, which furthermore depends on the coding gain and the error correction capability of the chosen LDPC code. To elaborate a little further in the context of Fig. 1, we can improve the error correction capability in many different ways, for example by using a stronger LDPC code or more decoding iterations. Naturally, increasing the number of iterations increases the complexity and hence reduces the processing energy efficiency, but increases the transmit energy efficiency. Hence the total energy dissipation should be considered holistically, when designing an LDPC decoder. Further similar trade-offs will emerge throughout our forthcoming discussions.

In order to fully characterise an LDPC decoder design, it is necessary to physically implement it. Perhaps the simplest way of doing so is to use a Field-Programmable Gate Array (FPGA) device, which facilitates rapid prototyping and fast parallel logic processing. This approach is especially useful for measuring the Bit Error Rate (BER) performance, since simulations that would take days on a computer can be completed in only hours when using a custom FPGA implementation [9]. These advantages are evident from the sheer number of published FPGA-based LDPC decoder designs that exist in the open literature, which will be compared
later in this paper. Furthermore, the decoding techniques and implementation-oriented research presented alongside these designs has been of significant benefit to the wider communications research community [10]–[15]. In particular, the implementational characteristics of these FPGA-based LDPC decoders are increasingly informing the holistic design of communication systems.

Fig. 1. FPGA-based LDPC decoder system parameters (algorithm, architecture, LDPC code, number of iterations) and characteristics (processing throughput, processing latency, hardware requirements, processing energy efficiency, transmission energy efficiency, bandwidth efficiency and flexibility)

In addition to their suitability for prototyping, FPGAs constitute a viable alternative to Application-Specific Integrated Circuits (ASICs) for the LDPC decoders of small-production-run communication devices, while their programmability has made them attractive for software-defined radios. This paper focuses exclusively on FPGA implementations of LDPC decoders, since they cannot be fairly compared to ASIC implementations, which are designed at a significantly higher development cost to have particularly high performance for high-production-run applications. Indeed, ASIC implementations are even difficult to compare with each other, because some papers provide post-synthesis results, while others offer post-layout results. Meanwhile, some papers consider only the ASIC core, while others include both the memory and Input/Output (I/O) resources.

This paper has been conceived for achieving the following aims:
• Provide a tutorial on LDPC decoding, discussing both the parameters and characteristics that affect the performance of FPGA implementations.
• Accurately compare all implementations of FPGA-based LDPC decoders that we are aware of.
• Characterise the observed trade-offs and relationships between the system parameters and characteristics.
• Recommend good practice to aid future designs of FPGA-based LDPC decoders, and to make published designs more comparable with each other.
• Identify opportunities for the further enhancement of FPGA-based LDPC decoders.

The structure of the paper is as follows. Section II presents a brief tutorial on the LDPC code structure and encoding, as well as describing variations on the decoding algorithms, decoder architectures and FPGA devices. Section III provides our comparison of all FPGA-based LDPC decoders that we are aware of, whilst discussing the parameters and characteristics of an LDPC decoder in more detail. Section IV illustrates and characterises the observed trade-offs and relationships between the various parameters and characteristics of FPGA-based LDPC decoders. Recommendations for readers interested in developing their own FPGA-based LDPC decoders are offered in Section V, along with suggestions for further work in the area. Finally, we offer our conclusions in Section VI. This structure is depicted in Fig. 2.

Fig. 2. Structure of this paper (Background: LDPC codes and encoding, decoding algorithms, code structures, decoder architectures, FPGA devices; Comparison: table of survey results, description of parameters, description of characteristics; Discussions: trade-offs, influence of parameters on certain characteristics — processing throughput, processing latency, hardware resource requirements, transmission energy efficiency; Recommendations and further work; Conclusion)

II. BACKGROUND

This section presents a tutorial on FPGA-based LDPC decoders. Section II-A commences by discussing FEC, before LDPC codes are introduced in Section II-B. This is followed by a discussion of how LDPC codes are decoded and designed in Sections II-C and II-D, respectively. The practicalities of LDPC decoder implementations are then discussed in Section II-E, which is followed by a brief introduction to FPGAs in Section II-F.

A. Forward error correction

Fig. 3 shows a schematic of a simplified communications system, where the information message word m = {m_i}_{i=1}^{K} is a vector of K bits, which is FEC encoded in order to obtain
an N-bit codeword, where N = K + M [17]. Codes which include the K bits of the message word within the N bits of the codeword are referred to as systematic, while non-systematic codes have codewords which do not directly contain the original message bits. There are 2^K possible permutations of the K-bit message word, each of which is mapped by the LDPC encoder to a corresponding one of 2^K legitimate codeword permutations. The error correction capability of the LDPC code depends on the minimum Hamming distance between any pair of these 2^K legitimate codeword permutations. Naturally, high minimum distances are preferred, since these make it unlikely for a legitimate codeword to be transformed into another by the distortion introduced during transmission.

For example, a code with a message word length of K = 6 and a codeword length of N = 10 employs M = N − K = 4 parity bits and has a coding rate of R = K/N = 3/5. In the case where the code is systematic, each codeword c may be of the form

c = [c1, c2, c3, c4, c5, c6, c7, c8, c9, c10],   (6)

where c1 . . . c6 are the K = 6 bits of the message word m and c7 . . . c10 are the M = 4 parity bits. Each of the parity bits represents a parity check covering a specific subset of the message bits. As an example, the parity check bits may be obtained according to the following modulo-2 summations of message bits:

c7 = c4 ⊕ c6   (7a)
c8 = c1 ⊕ c3 ⊕ c5 ⊕ c6   (7b)
c9 = c2 ⊕ c5   (7c)
c10 = c1 ⊕ c2 ⊕ c6.   (7d)

The design of an LDPC code's parity check equations is subject to many complex factors, as will be briefly described in Section II-D. Using these equations, a (K × N)-element generator matrix G can be constructed to efficiently describe the encoding process. In a systematic code, G may adopt the form

G = [IK  A],   (8)

where IK is the (K × K)-element identity matrix and the columns of A represent each of the parity checks. The generator matrix of the systematic code described above would therefore be

        [ 1 0 0 0 0 0 0 1 0 1 ]
        [ 0 1 0 0 0 0 0 0 1 1 ]
    G = [ 0 0 1 0 0 0 0 1 0 0 ]   (9)
        [ 0 0 0 1 0 0 1 0 0 0 ]
        [ 0 0 0 0 1 0 0 1 1 0 ]
        [ 0 0 0 0 0 1 1 1 0 1 ]

Codewords can be calculated using this matrix by finding the modulo-2 matrix product of the message m and the generator matrix G, according to c = m × G. For example, it may be readily verified that the message m = [0 1 1 1 0 1] has the corresponding codeword c = m × G = [0 1 1 1 0 1 0 0 1 0].

2) Parity-check matrix: In the decoder, the parity checks are used to detect the presence of transmission errors in the received codeword ĉ. Since all of the codeword bits involved in a parity check (including the parity bit itself) should have a modulo-2 summation of 0, Equations (7a)–(7d) can be rewritten as follows:

c4 ⊕ c6 ⊕ c7 = 0   (10a)
c1 ⊕ c3 ⊕ c5 ⊕ c6 ⊕ c8 = 0   (10b)
c2 ⊕ c5 ⊕ c9 = 0   (10c)
c1 ⊕ c2 ⊕ c6 ⊕ c10 = 0.   (10d)

These equations are more commonly viewed as a PCM H, which has N columns corresponding to the bits of the codeword and M rows corresponding to the parity checks. A non-zero entry in any position Hji indicates that the i-th bit ci takes part in the j-th parity check. In the case of systematic codes, H is related to G according to

H = [A^T  IM].   (11)

Continuing our example from above, we have

        [ 0 0 0 1 0 1 1 0 0 0 ]
    H = [ 1 0 1 0 1 1 0 1 0 0 ]   (12)
        [ 0 1 0 0 1 0 0 0 1 0 ]
        [ 1 1 0 0 0 1 0 0 0 1 ]

Upon obtaining a received codeword ĉ, the syndrome s can be calculated according to s = ĉ × H^T. In the case where ĉ is a legitimate codeword permutation, the syndrome will equate to a vector of zeros. This may be demonstrated by re-using the codeword calculated in the previous subsection, which equates to a (1 × M)-element vector of 0s when multiplied by H^T.

Note however that an LDPC H matrix of the form shown in (12) is very unusual in practice. As will be explained in Section II-C1, the decoder's error correction ability is dictated by the number of non-zero entries in each row or column, which is referred to as its weight. More specifically, columns with a weight of 1 can result in the decoder being unable to correct some transmission errors. This can be avoided by modifying the PCM H using elementary row operations (modulo-2 additions and swaps). In the case of the above example, this may lead to:

        [ 1 0 0 1 1 0 1 0 1 1 ]
    H = [ 0 1 1 0 1 0 0 1 0 1 ]   (13)
        [ 1 1 1 0 0 1 0 1 1 0 ]
        [ 0 1 0 1 1 1 1 0 1 0 ]

This modified H avoids any weight-1 columns, while still checking the same distribution of parity bits that was added to codewords by the generator matrix G of (9). Note however that this toy-example PCM is still unusual for a realistic LDPC code. Specifically, the PCM used in LDPC decoding should be sparse, containing far fewer non-zero entries than 0s. Clearly, the H of (13) does not satisfy this constraint, owing to its codeword length of N = 10, which is very short compared to practical LDPC codewords, which tend to be hundreds or even thousands of bits long.

Owing to its significance in the decoding process, the PCM H is commonly used to define a particular LDPC code design.
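To make the toy example concrete, the encoding of (9) and the syndrome check described above can be reproduced in a few lines. This is only an illustrative sketch using the matrices of (9) and (12); it is not part of any surveyed decoder design.

```python
import numpy as np

# Generator matrix G of (9) and parity-check matrix H of (12) for the toy code
G = np.array([[1,0,0,0,0,0, 0,1,0,1],
              [0,1,0,0,0,0, 0,0,1,1],
              [0,0,1,0,0,0, 0,1,0,0],
              [0,0,0,1,0,0, 1,0,0,0],
              [0,0,0,0,1,0, 0,1,1,0],
              [0,0,0,0,0,1, 1,1,0,1]])
H = np.array([[0,0,0,1,0,1, 1,0,0,0],
              [1,0,1,0,1,1, 0,1,0,0],
              [0,1,0,0,1,0, 0,0,1,0],
              [1,1,0,0,0,1, 0,0,0,1]])

m = np.array([0, 1, 1, 1, 0, 1])   # the K = 6 message bits of the example above
c = m @ G % 2                      # modulo-2 encoding, c = m x G
print(c)                           # [0 1 1 1 0 1 0 0 1 0], matching the example
s = c @ H.T % 2                    # syndrome s = c x H^T
print(s)                           # [0 0 0 0]: a legitimate codeword satisfies every parity check
```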
As discussed later in Section II-D, creating an H matrix that achieves a strong error correction capability is a complex task, so this is usually the first aspect of the code to be designed. Following this, the generator matrix G can be derived from H, by following the reverse of the process described above.

3) Factor graphs: The PCM H can also be visualised graphically using a factor graph, which is also known as a Tanner graph [18]. This is exemplified in Fig. 4 for the PCM of (13). A factor graph is comprised of two sets of connected nodes, namely N Variable Nodes (VNs) for representing the columns of H and M Check Nodes (CNs) for representing the rows.

Fig. 4. A factor graph for an example LDPC code

The connections P̃i above each VN in Fig. 4 pertain to LLRs associated with the N codeword bits of ĉ. An edge connects the i-th VN vi to the j-th CN cj if there is a non-zero element in the i-th column and j-th row of H, Hji = 1. To illustrate this, all of the edges that are connected to the 1st CN c1 in Fig. 4 are shown with thicker lines. These edges are connected to the 1st, 4th, 5th, 7th, 9th and 10th VNs, in accordance with the position of the 1s in the top row of H in (13).

The degree of a node is defined as the number of other nodes that it is connected to and is equal to the corresponding row or column weight in H. The degree of the CNs Dc and the degree of the VNs Dv are important parameters in an LDPC code. If all CNs have the same degree Dc and all VNs have the same degree Dv, the LDPC code is said to be regular. If either value varies from node to node, the code is said to be irregular and Dc and Dv can be expressed as the average degree over all nodes. For example, the factor graph of Fig. 4 is irregular with Dc = 5.75 and Dv = 2.3. In any case, the number of 1s in the PCM H must be the same whether it is counted row-by-row or column-by-column, giving Dc × M = Dv × N, with Dv = Dc × (1 − R).

C. LDPC decoding

LDPC codes are typically decoded using a belief propagation (BP) algorithm in which messages – typically in the form of LLRs – are iteratively passed in both directions along the edges between connected nodes [19]. For example, Fig. 4 illustrates a message q̃4−1 sent from the 4th VN v4 to the 1st CN c1, while the message r̃4−9 is sent from the 4th CN c4 to the 9th VN v9. The messages provided as inputs to a node are processed by activating that node, causing it to create new output messages that are sent back to the nodes it is connected to. Thus the processing of the LDPC decoder is delegated to the many individual calculations performed by the individual nodes, rather than being a single monolithic global equation. An important facet of the belief propagation algorithm is that any message sent to a particular node does not depend on the message received from that node. For example, CN c2 is connected to VNs v2, v3, v5, v8 and v10; however, the message r̃2−5 it sends to v5 will be calculated based only on the messages it has received from v2, v3, v8 and v10.

Nodes are activated in an order determined by the LDPC decoder's schedule. This has a significant effect upon the LDPC decoder's error correction capability, as well as on its other characteristics. Many different schedules exist and the most common options will be outlined in Section II-C1. Following this, variations of the specific calculations performed within CNs and VNs will be presented in Sections II-C2 and II-C3 respectively.

1) Scheduling: The schedule of the LDPC decoding process determines the order in which VNs and CNs are processed, as well as whether multiple nodes are processed in parallel. Many scheduling variations exist, but the three most common schedules are described here, namely flooding [20], Layered Belief Propagation (LBP) [21] and Informed Dynamic Scheduling (IDS) [22].

Flooding is perhaps the most conceptually simple LDPC decoding schedule. Here, the factor graph is processed in an iterative manner, where each iteration comprises the simultaneous activation of all CNs, followed by the simultaneous activation of all VNs [19]. An example of this schedule is depicted in Fig. 5.

Fig. 5. An example of the flooding schedule
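Returning briefly to the node degrees discussed above, the neighbourhoods that any schedule traverses can be read directly from H. The following sketch — purely illustrative, using the H of (13) — recovers the CN and VN neighbour lists and the average degrees Dc = 5.75 and Dv = 2.3.

```python
import numpy as np

# PCM H of (13); rows are the CNs c1-c4, columns are the VNs v1-v10
H = np.array([[1,0,0,1,1,0,1,0,1,1],
              [0,1,1,0,1,0,0,1,0,1],
              [1,1,1,0,0,1,0,1,1,0],
              [0,1,0,1,1,1,1,0,1,0]])
M, N = H.shape

# Neighbourhoods: the VNs connected to each CN, and vice versa (1-based indices)
cn_neighbours = [np.flatnonzero(H[j]) + 1 for j in range(M)]
vn_neighbours = [np.flatnonzero(H[:, i]) + 1 for i in range(N)]
print(cn_neighbours[0])        # [ 1  4  5  7  9 10]: the VNs connected to c1, as in Fig. 4

# Average degrees; the code is irregular because the per-node degrees differ
Dc = H.sum(axis=1).mean()      # 5.75
Dv = H.sum(axis=0).mean()      # 2.3
R = 1 - M / N
assert Dc * M == Dv * N and np.isclose(Dv, Dc * (1 - R))
```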
It can be seen in Fig. 5 that at first the CNs c1–c4, shown in dark grey, calculate their messages, which are then sent along every edge (in bold) to every receiving VN, shown in light grey. In the second half-iteration, the VNs are shown in dark grey to indicate that they are performing calculations, while the CNs are only receiving messages, so they are shown in light grey.

While Layered Belief Propagation also operates in an iterative manner, it processes the nodes more sequentially within each iteration, activating only one or a specific subset of nodes at a time [21]. LBP is commonly operated in a CN-centric manner, processing each CN in turn. Once a CN has been activated, all of its connected VNs are activated before moving on to the next CN. Once every CN has been processed, the iteration is complete. Using Fig. 6 as an example, LBP may commence each decoding iteration by activating CN c1 first, sending messages to each of its connected VNs: v1, v4, v5, v7, v9 and v10. Each of these VNs may then be activated, sending new messages to each of their connected CNs, except c1. Following this, c2 may be activated, allowing it to make use of the new information received from v5 and v10 alongside the information previously received from its other connected VNs. This process continues until every CN has been activated, which then marks the end of one decoding iteration.

LBP has the advantage that the information obtained during an iteration is available to aid the remainder of the iteration. Owing to this however, it does not have the same high level of parallelism as the flooding schedule, possibly resulting in a lower processing throughput and a higher processing latency. It can also be seen that M CN activations and Dc × M VN activations occur per iteration, resulting in a higher computational complexity per iteration, when compared to the flooding schedule. However, it will also be shown in Section II-C2 that CN activations can be significantly more computationally expensive than the VN activations, hence the increased cost is manageable. Additionally, LBP tends to converge to the correct codeword using fewer iterations and therefore with lower computational complexity than flooding [17], resulting in lower complexity overall.

Fig. 6. An example of the layered belief propagation schedule

Informed Dynamic Scheduling inspects the messages that are passed between the various nodes, selecting to activate whichever node is expected to offer the greatest improvement in belief [22]. This requires IDS to perform additional calculations in order to determine which node to activate at each stage of the decoding process. However, IDS facilitates convergence using fewer node activations than in either flooding or LBP, which may lead to a lower complexity overall.

During IDS, the difference between the previous message sent over an edge and the message that is obtained using recently-updated information [23] is calculated. This difference is termed the residual, and represents the improvement in belief that is achieved by the new message. Like the LBP schedule, IDS is commonly centred on the CNs. At the start of the iterative decoding process, the residual for each output of each CN is calculated as the magnitude of the message to be sent over that edge. The message with the greatest residual is identified, and the receiving VN is then activated, sending updated messages to each of its connected CNs. These CNs then calculate new residuals for each of their edges as the difference between each edge's new message and its previous message. All of the residuals in the graph are then compared for the sake of identifying the new maximum, before the process is repeated.

Using Fig. 7 as before, suppose that at the start of the iterative decoding process, the message r̃3−8 from CN c3 is identified as having the highest magnitude of all the check-to-variable messages in the graph. Owing to this, r̃3−8 is passed to the VN v8, which is then activated, in order to obtain the message q̃8−2, which is then passed to c2. The CN c2 can then be activated to calculate new residuals for its other four edges, as the difference between their previous messages and their new messages that have been obtained using the updated information from v8. These new residuals are then compared with the others from the previous step, allowing a new global maximum to be identified, to inform the next step of the decoding process. Note that the next highest residual within the factor graph does not necessarily have to originate from the most recently updated CN c2. In the example seen in Fig. 7, it can be seen that c2 is activated to calculate residuals but it is r̃1−4 from CN c1 to VN v4 that is sent. This implies that a residual calculated elsewhere in the graph exceeds all of those newly calculated by c2.
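The residual bookkeeping described above amounts to tracking, for every edge, the difference between the newest and the previously sent check-to-variable message, and then activating the VN at the receiving end of the largest residual. A small illustrative sketch is given below; the edge keys and LLR values are hypothetical and are not taken from any figure.

```python
# Illustrative IDS bookkeeping: residuals are |new - previous| check-to-variable messages.
# The (check, variable) edge keys and the LLR values below are hypothetical.
previous = {('c3', 'v8'): 0.0, ('c2', 'v5'): 0.0, ('c1', 'v4'): 0.0}
new      = {('c3', 'v8'): 2.4, ('c2', 'v5'): 0.7, ('c1', 'v4'): 1.9}

def residuals(new, previous):
    return {edge: abs(new[edge] - previous[edge]) for edge in new}

# Select the edge with the greatest residual: its message is sent and the
# receiving VN is activated next, exactly as in the schedule described above.
res = residuals(new, previous)
edge = max(res, key=res.get)
print(edge)                    # ('c3', 'v8') for these example values
previous[edge] = new[edge]     # the sent message becomes the new "previous" value
```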
The CN calculations can be decomposed into pairwise operations, for which we use the notation ã ⊞ b̃, referred to as the boxplus operator [27]. Inverting (4) and substituting into (14) yields:

ã ⊞ b̃ = 2 tanh⁻¹( tanh(ã/2) × tanh(b̃/2) )   (15)
      = sign(ã) × sign(b̃) × min(|ã|, |b̃|) + log(1 + e^(−|ã+b̃|)) − log(1 + e^(−|ã−b̃|)).   (16)

The SPA uses the full version of (15) given above, which leads to strong error correction performance but a high computational complexity. The MSA, on the other hand, is a reduced-complexity approximation of the SPA [28], using (16) without the correction factor terms, according to

ã ⊞ b̃ = sign(ã) × sign(b̃) × min(|ã|, |b̃|).   (17)
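The gap between the exact boxplus of (15) and its min-sum approximation (17) is easy to examine numerically. The sketch below simply evaluates both expressions for a pair of example LLRs; the input values are arbitrary and chosen only for illustration.

```python
import numpy as np

def boxplus_exact(a, b):
    # Sum-product boxplus operator of (15)
    return 2 * np.arctanh(np.tanh(a / 2) * np.tanh(b / 2))

def boxplus_minsum(a, b):
    # Min-sum approximation of (17): the correction terms of (16) are dropped
    return np.sign(a) * np.sign(b) * min(abs(a), abs(b))

a, b = 1.5, -2.0                       # arbitrary example LLRs
print(boxplus_exact(a, b))             # approx -1.06
print(boxplus_minsum(a, b))            # -1.5: the MSA overestimates the magnitude

# (16) equals (15): adding the correction terms back recovers the exact value
correction = np.log(1 + np.exp(-abs(a + b))) - np.log(1 + np.exp(-abs(a - b)))
assert np.isclose(boxplus_exact(a, b), boxplus_minsum(a, b) + correction)
```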
When placing edges in the PCM, care must be taken to avoid creating stopping sets [29] and short cycles [30] in the corresponding factor graph, which are associated with an eroded error correction performance. A number of techniques have been proposed for placing edges within the factor graph, as summarised in the following subsections.

1) Random codes: Unstructured randomly-designed codes potentially achieve the best LDPC error correction performance, owing to the maximised degree of freedom that is afforded, when placing edges in this manner [31]. However, this is achieved at the cost of having to implement complex unstructured routing or memory lookup tables, in order to exchange LLRs between the VNs and CNs. A straightforward recursive algorithm for creating unstructured PCMs of this form involves placing a 1 at a random unfilled location in H, then checking to see whether doing so has violated any design constraints, such as the maximum node degrees, stopping sets or cycle lengths. If the placement is valid, the algorithm will continue and repeat the process for the next randomly placed 1. This is repeated until the desired number of edges have been positioned. If a randomly placed 1 is not valid, then it will be rejected and a new location will be tried instead. This algorithm is conceptually very simple, but whether the process can successfully complete and how quickly is unpredictable.

2) Pseudorandom codes: The original LDPC code construction method proposed by Gallager [1] involves stacking Dv submatrices on top of each other. Each submatrix has the dimensions (M/Dv) × N, with each column having a weight of 1 and each row having a weight of Dc. The top matrix is pseudo-randomly generated, and random column permutations are applied to it in order to obtain all other submatrices.

Similarly to this, MacKay [2] proposed a code construction method, which involves constructing the PCM H on a column-by-column basis, where the columns are generated pseudo-randomly with appropriate weight, before being concatenated horizontally. Again, this process must be performed in a recursive manner, so that the row weights can be checked after each column is added. If Dc has been exceeded for any row, then the current column is regenerated.

3) Quasi-cyclic codes: An LDPC code wherein the cyclic shift of any legitimate codeword permutation by s places to the left or right yields another legitimate codeword permutation is termed Quasi-Cyclic (QC), while the code is termed cyclic in the special case of s = 1. The PCMs of QC codes are semi-structured, based on an upper matrix of elements which each represent an equally-sized square submatrix [32]. If a particular element in the upper matrix has a value of −1, then the corresponding submatrix is a null matrix. Otherwise, the submatrix is an identity matrix, which has been cyclically shifted a number of times according to the corresponding value in the upper matrix [33]. Adopting this structure facilitates low complexity memory addressing and routing for the hardware implementation, since the location of every edge in each submatrix can be determined using only knowledge of the relatively small upper matrix. This advantage can be achieved without incurring a significant sacrifice in error correction performance. Owing to this benefit, QC-LDPC codes are employed by a number of communications standards, including DVB-S2 [5], IEEE 802.11 (WiFi) [3] and IEEE 802.16 (Mobile WiMAX) [4].

4) Repeat-accumulate codes: Repeat-Accumulate (RA) codes constitute another type of semi-structured codes. Like QC codes, RA codes benefit from simpler encoding/decoding than random codes, without imposing an unacceptable loss in error correction performance. The PCMs of RA codes are composed of two horizontally-concatenated submatrices H1 and H2, where H2 is an (M × M)-element dual-diagonal matrix. This structure allows each parity bit to be calculated using only the previous parity bit and a subset of the message bits, leading to the accumulation alluded to in the code's name.

5) Progressive edge growth algorithm: Whilst not a code structure itself, the Progressive Edge Growth (PEG) algorithm [34] is an important technique for constructing codes having an excellent error correction performance. The operation of the PEG algorithm is VN-centric, focusing on each VN in turn in order to place edges. The algorithm repeatedly constructs a set of CNs as candidates for the VN to connect to. From this set, the subset of nodes having the lowest degree is extracted and one of these is randomly selected. This approach results in LDPC codes that have approximately regular degree distributions.

The PEG algorithm constructs a tree structure, alternating between the connection of VNs to CNs and vice versa. At each stage only nodes that are not already in the tree are considered for inclusion. This process continues until there are no remaining options meeting this constraint. The PEG algorithm then places an edge in the location that is identified as maximising the length of the resultant cycle within the graph, before continuing the algorithm with the selection of a different VN. In this way, a factor graph having no short cycles can be created, yielding a strong error correction performance.
E. LDPC decoding architectures

The implementation of a practical LDPC decoder is subject to numerous design decisions, such as the degree of parallelism, the representation of the LLRs and the stopping criteria. These three factors are discussed in the following subsections.

1) Parallelism: The inherent parallelism of the belief propagation algorithm facilitates the design of fully-parallel LDPC decoder architectures, in which every VN and CN in the factor graph is implemented separately in hardware [35]. Fully-parallel decoders can achieve very high processing throughputs by performing all of the VN updates and all of the CN updates simultaneously, using the flooding schedule of Fig. 5. However, this is achieved at the cost of excessive hardware resource consumption. For long codes comprising thousands of bits, the inter-node routing may require a greater area than the nodes themselves [36], rendering this architecture impractical for many decoder designs. Additionally, significant further hardware resources are required for implementing flexible routing, using a Beneš network [37], for example. Otherwise, fully-parallel decoders are completely inflexible, only supporting the single code that they are designed for.

By contrast, decoders associated with a fully-serial architecture implement just a single one of each node type in hardware. This hardware is time-multiplexed between the various nodes of the LDPC decoder, using memories to store interim results [35]. Fully-serial decoders require few hardware resources but suffer from a very low processing throughput, since each decoding iteration could require thousands of clock cycles. However, since all of the factor graph edges are represented by memory addresses, fully-serial decoders can be readily adapted at run-time to implement a different LDPC factor graph, by rearranging the memory accesses.

In order to strike a compromise between the high processing throughput of fully-parallel architectures and the more modest hardware requirement of fully-serial architectures, many LDPC decoders implement a number of time-multiplexed nodes in a so-called partially-parallel fashion. This parametrizable degree of parallelism facilitates control over the trade-off between processing throughput and hardware resource requirements. Furthermore, this approach is of particular benefit when any structure within the PCM H can be exploited in the configuration of the nodes implemented in hardware. For this reason, QC codes are particularly suited to partially-parallel implementations.

2) Representation of LLRs: Another architectural consideration is the digital representation of the LLRs passed between nodes. The algorithms described earlier can be modified to replace the LLRs with single-bit hard decisions, but this causes them to suffer from a significant error correction performance loss. In general, increasing the resolution and range of the two's complement fixed-point LLR representation by using a greater bit width has a positive effect on the error correction performance [38], at the cost of increasing the hardware resources required.

It is therefore desirable for a designer to quantify the effect of the fixed-point bit width on the performance of a chosen decoding algorithm, in order to determine the smallest number of bits that are required in order to achieve a satisfactory error correction performance. This may be achieved using Extrinsic Information Transfer (EXIT) charts [39], which have been conceived by ten Brink for characterising the operation of iterative decoding algorithms. More specifically, EXIT charts visualize the quality of the LLRs output by the VNs and CNs as functions of the quality of the LLRs provided to the corresponding inputs. By plotting these EXIT functions for LDPC decoders employing a range of fixed-point bit widths, a designer can quantify at a glance how each representation improves or degrades the quality of the LLRs and hence the resultant error correction performance of the LDPC decoder [40]. This eliminates the requirement to run multiple time-consuming BER simulations.

Further to this, some designs have demonstrated that the hardware requirement can be reduced by using non-uniform quantisation schemes [41], by sending the bits of the LLR in a serial fashion rather than in parallel [42], or by utilising stochastic [36] or non-binary [43] number representations. However, these methods can also have adverse effects on the node complexity and the decoding throughput, requiring yet further investigation.

3) Stopping criteria: The design of an LDPC decoder also has to consider how to terminate the decoding process. Commonly, checks are carried out following each decoding iteration to determine whether the current state of the recovered codeword is a legitimate permutation or not, signalling whether or not decoding has been successful. These checks are performed based on the output of the VNs, as mentioned previously in Section II-C3.

Occasionally however, a received frame is corrupted in such a way that it can never be corrected. In this case, the iterative decoding process would loop infinitely, unless other criteria for stopping it were implemented. Owing to this, a maximum iteration or complexity limit may be imposed. When this limit is reached, the iterative decoding process is terminated and decoding is deemed to have failed. In implementations where a low hardware resource requirement is a greater priority than high processing throughput, the iteration limit may be the only stopping criterion imposed. Here, every received message is decoded using the same number of iterations, without early stopping. In this case, the parity checks are only used at the end of the iterative decoding process, in order to determine whether the recovered codeword is valid or not. Early stopping can also be used to detect that no error correction progress is being made with successive decoding iterations, allowing the decoding process to fail and terminate before the iteration limit is reached.

F. FPGAs

FPGAs are digital logic devices that can be flexibly programmed to perform a variety of digital functions, using a Hardware Description Language (HDL). Their main advantages are their in-field programmability, as well as their high-speed, highly-parallel logic processing. Owing to these benefits, FPGAs are desirable for a multitude of applications, including software-defined radio, ASIC prototyping, digital signal processing, cryptography and computer hardware emulation. This section presents a simplified view of their internal structure, followed by a discussion of the main differences and similarities between different makes and models of FPGAs, and how they may be compared to each other.

1) Structure: The internal structure of an FPGA typically comprises a variable number of three main programmable elements, namely logic blocks, RAM blocks and I/O blocks [44]. The inputs and outputs of these blocks are linked by programmable routing, as shown in the sample schematic of Fig. 8.

The most fundamental design of a logic block comprises a Lookup Table (LUT) and a Flip-Flop (FF), as shown in Fig. 8. A LUT is a digital structure that can be programmed to perform any combinatorial function of its inputs, thus mimicking any possible combination of logic gates. Typically, FPGA LUTs have 4–6 inputs, which are used to select a value for a single output bit. Increasing the number of LUT inputs typically allows the same HDL design to be implemented using fewer LUTs, therefore reducing the amount of FPGA routing required. However, the hardware resources required by a LUT increase exponentially with its number of input bits, hence very large LUTs are impractical [44]. The output of each LUT can optionally be connected to a corresponding FF, for facilitating synchronous operation. Alternatively, the LUT output may bypass the FF, where registered operation is not required.
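As a software analogy of the LUT concept described above, a k-input LUT can be modelled as a 2^k-entry truth table that is indexed by the input bits. The 3-input majority function below is just an arbitrary example of a combinatorial function with which the table could be programmed.

```python
# Model a k-input LUT as a table of 2**k single-bit entries, indexed by the inputs.
def program_lut(k, func):
    """Build the LUT contents for an arbitrary combinatorial function of k bits."""
    table = []
    for index in range(2 ** k):
        bits = [(index >> n) & 1 for n in range(k)]
        table.append(func(bits))
    return table

def lut_output(table, bits):
    index = sum(bit << n for n, bit in enumerate(bits))
    return table[index]

# Example: programme a 3-input LUT as a majority gate (an arbitrary choice)
majority = program_lut(3, lambda b: int(sum(b) >= 2))
print(lut_output(majority, [1, 0, 1]))   # 1
print(lut_output(majority, [0, 0, 1]))   # 0
```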
Note that the ELB metric does not account for the routing between logic elements or the use of additional FPGA blocks, such as memory or embedded multipliers. However, it does serve as a functional approximation of the hardware requirements associated with each design considered, if they were all implemented on the same FPGA. Measuring the usage only in terms of these fundamental building blocks permits a comparison between modern FPGA models and much older designs, which would not otherwise be possible.

III. COMPARISON OF DECODERS

A comprehensive review of published FPGA-based LDPC decoder designs is presented in this section. The analysis of Table II considers both the parameters that are chosen by the designers, as well as the characteristics that may be measured based on the design. Each of these is discussed and characterised in Sections III-A and III-B, together with explanations and discussions of the symbols used in Table II where applicable. The entries in Table II have been sourced from both academic publications and commercially-available soft IP cores. Unfortunately, the licensers of these commercial designs were often unwilling to divulge many of the parameters and characteristics required for this analysis, resulting in several incomplete sets of results. Furthermore, none of the licensers were willing to provide pricing information for the purposes of this survey, preventing the comparison of this interesting but non-technical characteristic of their IP.

Note that Table II presents a condensed version of our findings, showing only the most significant parameters and characteristics. In the case of references that present multiple FPGA-based LDPC decoder designs, only a representative subset has been reproduced here. A full version of our survey results may be downloaded from [46].

A. Parameters

In this section, we consider the parameters of FPGA-based LDPC decoders, which include all factors of the design that are specified by the designer. These include which LDPC PCMs to support, the decoding algorithm to employ and the number of decoding iterations used. These parameters are discussed in Sections III-A1, III-A3 and III-A4 respectively. Section III-A2 describes the architectural parameters, namely the degree of parallelism, LLR representation, clock frequency, flexibility and choice of FPGA.

1) LDPC PCMs: One of the most fundamental features of an LDPC decoder is the selection of the PCMs that it is designed to support. Decoders may support just one PCM, be tailored to a family of related PCMs or may be designed to be completely flexible. As discussed in Section II-B, each PCM H has a number of parameters, namely N, M, Dc and Dv. However, the total number of edges in the corresponding factor graph can be considered to encompass all of these factors, representing the overall size and complexity of the code, as listed in Table II.

2) Architecture: Architectural decisions influence the physical implementation and hardware used by the decoder. As described in Section II-E, the primary architectural parameter is the degree of parallelism, which may be classified as fully-parallel, partially-parallel or fully-serial. This parameter may be quantified by the total number of Processing Units (PUs) instantiated by the decoder, as listed in Table II. Frequently these processors perform the function of individual VNs and CNs, although some designs use a different approach.

The operand width of the LLR representation, as listed in Table II, is also a measurable parameter, which affects the LDPC decoder's error correction performance. Designs using a higher number of bits may be expected to have superior error correction performance to that of their counterparts employing fewer bits. However, this is typically achieved at the cost of a larger hardware resource requirement or a lower processing throughput.

The quantisation scheme used in the LLR representation may be either uniform or non-uniform, as denoted by a 'U' or an 'N' in Table II, respectively. In uniform quantisation schemes, the entire range of representable LLR values has a constant resolution, allowing the VN and CN functions to be implemented using straightforward binary arithmetic. By contrast, non-uniform quantisation schemes typically adopt a finer resolution for lower LLR magnitudes and a lower resolution for larger magnitudes. This facilitates a more beneficial trade-off between range and resolution, but makes the associated processing significantly more complex. Many authors (e.g. [42], [47], [48]) mention the number of bits used in their FPGA-based LDPC decoders, but do not detail the quantisation scheme employed. Since non-uniform schemes require significantly more details than uniform representations, these cases are assumed to employ uniform quantisation and are marked with an asterisk in Table II.

The maximum achievable clock frequency of an FPGA-based LDPC decoder depends largely on the capabilities of the FPGA employed, but also on some design decisions such as the critical path length. For example, designs that process entire VNs or CNs in a single clock cycle typically have long critical paths, while designs that only perform one arithmetic or logical operation per clock cycle typically have much shorter critical paths. Based on this observation, the clock frequency is included as a parameter in this analysis. The majority of authors have explicitly stated the clock frequency at which their decoder operates. However, in some cases (e.g. [76]) we have derived the clock frequency from other data, as indicated by an asterisk in Table II.

Many decoder architectures are highly optimised to the specific characteristics of the single LDPC PCM that they are designed to support (e.g. [43], [62], [74]). By contrast, some other designs instead adopt a more general architecture (e.g. [57], [60], [66]), sacrificing performance for the flexibility to switch between several supported PCMs at run-time. A decoder's flexibility may be considered to be both a figure of merit and an architectural decision that is made by the designer, allowing it to be regarded as a characteristic or as a parameter. However, we show in Section IV-A that adding flexibility to a design can only be achieved as a trade-off against some other desirable characteristics. For this reason, we treat flexibility as a characteristic in this paper.
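As an illustration of the uniform fixed-point LLR representation discussed earlier in this subsection, the sketch below clips and rounds a floating-point LLR to a signed two's-complement value of a given bit width. The 4-bit width and 0.5 step size are arbitrary example choices rather than values taken from any surveyed design.

```python
def quantise_llr(llr, bits=4, step=0.5):
    """Uniformly quantise an LLR to a signed fixed-point value of the given bit width.
    The bit width and step size used by a real decoder are design choices."""
    lo = -(2 ** (bits - 1))                  # most negative representable level
    hi = 2 ** (bits - 1) - 1                 # most positive representable level
    level = round(llr / step)                # uniform resolution across the whole range
    return step * max(lo, min(hi, level))    # clip (saturate) to the representable range

print(quantise_llr(1.3))    # 1.5
print(quantise_llr(-9.7))   # -4.0, saturated at the most negative level
```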
Table II. Condensed survey of FPGA-based LDPC decoder designs (columns: Ref.; FPGA; ELBs (k); Run-time flexibility; Edges (k); H dimensions; Iterations; Algorithm; LLR bitwidth; Clock (MHz); Throughput, encoded bps; Throughput, decoded bps; Eb/N0 (dB); Bandwidth eff.)
The selection of an FPGA for the implementation of an LDPC decoder may have a significant impact upon its performance. The selected FPGA dictates the number of logic elements, memory blocks and I/O pins that are available for all processing and routing. Additionally, some FPGAs facilitate higher clock frequencies than others when implementing the same design, depending on the process technology employed. Unfortunately it is impossible to fairly compare the capabilities of all FPGAs numerically. For this reason, Table II simply states which FPGA has been employed for each LDPC decoder considered.

3) Algorithm: As discussed in Section II-C, several variations of the LDPC decoding algorithm exist. Some algorithms vary from each other only slightly, while others may employ vastly different mathematical concepts. Furthermore, different authors may use different terms to describe the same algorithm, making this parameter difficult to compare. Table II therefore only includes the terms used by the authors to describe their algorithms and no direct comparison between them is inferred.

4) Iterations: The limit placed on the maximum number of decoding iterations has a significant effect upon the processing throughput and error correction performance of decoders operating without early stopping functionality, as well as in cases where the received frame is too corrupted to be decoded successfully. Decreasing the maximum number of iterations will increase the LDPC decoder's processing throughput in terms of the maximum achievable bitrate, but runs the risk of allowing errors to remain in the recovered codeword that could have otherwise been corrected. Generally, it can be assumed that the number of iterations used in each considered design was selected by the author to offer the most desirable trade-off between error correction performance and processing throughput, subject to the influence of the other parameters outlined above. It is also worth noting that the maximum number of iterations is perhaps the easiest parameter to change at runtime. Owing to this, some designs (e.g. [28], [67], [69]) are presented with two sets of results, namely one employing a low number of iterations for maximum processing throughput (marked with a 'T' in Table II), and one with a high number for maximum error correction (marked with an 'E' in Table II).

Table II presents the fixed number of iterations that are employed in designs without early stopping functionality, while the average number of iterations is presented for designs employing early stopping. However, some papers proposing early stopping designs (e.g. [60], [61], [64]) do not present an average number of iterations, only providing the maximum limit imposed, as indicated with an asterisk in Table II. Likewise, some papers (e.g. [50], [53]) do not state the number of iterations employed, but this parameter can be inferred as a function of other parameters and characteristics. These cases are marked with a double asterisk (**) in Table II.

B. Characteristics

In this section, we consider all those characteristics of FPGA-based LDPC decoders, which we plan to quantify. Seven main characteristics are identified, namely processing throughput, processing latency, hardware resource requirements, transmission energy efficiency, processing energy efficiency, bandwidth efficiency and flexibility, as seen in Fig. 1. Each of these is described in turn in the following sections.

1) Processing throughput: Perhaps the most frequently-stated characteristic of an FPGA-based LDPC decoder is its processing throughput, which is the number of bits that it can
process per second. A high processing throughput is required for high-speed data transfers and video streaming applications, amongst other uses. A base station serving many users requires the sum of the individual throughputs to be high, so that each user receives a satisfactory service.

In an LDPC decoder it is important to note the difference between encoded and decoded processing throughput. We refer to the number of codeword bits processed per second as the encoded processing throughput, while we use decoded processing throughput to quantify the number of message word bits per second. For example, half of the codeword bits generated by a 1/2-rate LDPC code are parity bits, which carry no information of their own. Therefore if the encoded processing throughput is 2 Gbps then the corresponding decoded processing throughput would be 1 Gbps. Ultimately, it is the decoded processing throughput that matters most to the user of the decoder, so we have deemed this to be the more important characteristic in comparisons. For designs where the author has only presented encoded processing throughput, we have inferred the decoded processing throughput by multiplying by the coding rate, as denoted by an asterisk in Table II. In some cases it is unclear whether the stated processing throughput is encoded or decoded. This is reflected in Table II by allowing the stated processing throughput to span both columns. A double asterisk is used in Table II to identify designs in which the processing throughput was not explicitly stated, but has been inferred from other stated parameters and characteristics.

2) Processing latency: The processing latency of an FPGA-based LDPC decoder is the amount of time it requires to process a complete LDPC codeword. Low processing latency is therefore important for interactive cloud computing and safety-critical operations, where an immediate response is crucial. It may be observed that processing latency is strongly linked to processing throughput, since the processing latency can often be calculated as the message word length K divided by the decoded processing throughput. However, some decoder designs achieve a high processing throughput by decoding more than one codeword simultaneously. In these cases, the associated processing latency would be much higher than that of a decoder which achieves the same processing throughput while decoding only a single codeword at a time. For example, a decoder that decodes a single 1000-bit message word with a processing throughput of 2 Gbps would have a processing latency of 0.5 µs, while two 1 Gbps decoders operating in parallel would achieve the same processing throughput, but would have a processing latency of 1 µs. Processing latency is a key characteristic of an FPGA-based LDPC decoder, however most authors do not explicitly state it in their results, and it is therefore not included in Table II.

3) Hardware requirements: When implemented on an FPGA, the size and complexity of an LDPC decoder's design is represented by how much of the FPGA's hardware resources it utilises. Larger designs require more resources and therefore a bigger, more expensive FPGA, making smaller designs preferable.

The ELB metric described in Section II-F can be used to compare the hardware resource requirements of designs implemented on different FPGAs. However, the resource requirements stated by the various authors of FPGA-based LDPC decoder designs often do not directly translate to ELBs, hence requiring further analysis to be performed as follows:
• The conversion from 6LUTs to 4LUTs described in Section II-F is first employed to ensure that all measurements of LUTs consider an approximately equivalent quantity of hardware.
• Subsequently, if the hardware requirement of a design is quantified only in terms of either 4LUTs or FFs, then we assume a numerically equal number of ELBs.
• If the hardware requirement of a design is quantified in terms of both 4LUTs and FFs, then we assume that ELBs = max(4LUTs, FFs). These cases are identified using a single asterisk in Table II.
• For designs based on Xilinx FPGAs having complex multi-element slices, we have derived a "utilisation" figure of merit, which quantifies how many LUTs/FFs are commonly used per slice. We obtained this by calculating the average utilisation of designs for which both the number of slices and the number of LUTs/FFs used is stated. These utilisation figures were found to be approximately 0.83 for LUTs and 0.36 for FFs, demonstrating that the majority of slices are used for their LUTs. For designs where the hardware utilisation is presented only in terms of slices, we assume ELBs = slices × 4LUTs per slice × 0.83. These cases are indicated in Table II using a double asterisk.

4) Transmission energy efficiency: Another fundamental figure of merit for an LDPC decoder is its error correction capability, as a function of the channel's signal to noise power ratio per bit Eb/N0, which is typically expressed in decibels. If a codeword is transmitted using a high energy per bit Eb, then the energy of the noise corrupting each bit becomes relatively smaller, causing the BER at the receiver to decrease. However, energy-efficient transmitters are desirable, because they are cheaper to run and can operate for longer without requiring new batteries, particularly since transmission energy consumption is dominant in transmitter hardware. It is therefore desirable for an LDPC decoder to be capable of correcting errors and achieving a satisfactorily low BER, even at low Eb/N0 values.

The error correction performance of a decoder is typically characterised in the form of a BER curve, showing how the BER is reduced as the channel Eb/N0 increases. In order to convert these plots into a comparable metric, we specified a desirable target BER of 10^−4. For each considered design, the Eb/N0 required by the decoder in order to achieve this BER was noted. In some publications however (e.g. [17], [61]), the error correction performance is quantified using the Frame Error Rate (FER) rather than the BER. In these cases, we assumed that a BER of 10^−4 equates to an FER of 10^−2 [61], based on the observation that the considered designs typically have a message word length K of the order of 1000 bits, as well as a minimum Hamming distance of the order of 10 bits. These cases are indicated using a single asterisk in the Eb/N0 column of Table II.
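Extracting the Eb/N0 value at which a published BER curve crosses the 10^−4 target can be automated by interpolating between the plotted points. The curve points used below are hypothetical and merely illustrate the procedure.

```python
import numpy as np

def ebn0_at_target(ebn0_points, ber_points, target=1e-4):
    """Interpolate a BER curve (linearly in Eb/N0 versus log10(BER)) to find the
    Eb/N0 at which the target BER is reached."""
    log_ber = np.log10(ber_points)
    # np.interp needs increasing x-values, and log10(BER) decreases with Eb/N0
    return np.interp(np.log10(target), log_ber[::-1], np.asarray(ebn0_points)[::-1])

# Hypothetical points read from a published BER curve
ebn0 = [1.0, 1.5, 2.0, 2.5]
ber = [2e-2, 3e-3, 2e-4, 6e-6]
print(ebn0_at_target(ebn0, ber))   # approx 2.1 dB for these example values
```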
Quantifying the BER versus Eb/N0 facilitates a fair comparison of transmission energy for LDPC codes having different coding rates R, since it considers the transmission energy per message word bit. However, some publications present the BER as a function of the SNR Es/N0, which does not allow a fair comparison of codes having different coding rates, since it considers the transmission energy per codeword bit, Es = Eb × R. The corresponding Eb/N0 can therefore be obtained by dividing the SNR Es/N0 by the coding rate R, which is achieved in logarithmic terms according to

Eb/N0 [dB] = SNR [dB] − 10 log10(R). (18)
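As a minimal illustration of this conversion (the numbers below are hypothetical and are not taken from Table II):

import math

# Convert a BER plot's x-axis from SNR (Es/N0) to Eb/N0 using (18).
def snr_to_ebn0_db(snr_db, coding_rate):
    return snr_db - 10.0 * math.log10(coding_rate)

print(snr_to_ebn0_db(2.0, 0.5))  # a rate-1/2 code at 2 dB SNR corresponds to about 5.01 dB Eb/N0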
Entries calculated in this way are denoted in Table II using a double asterisk. Unfortunately, some authors have erroneously labelled the x-axis of BER plots as SNR, when Eb/N0 would be more appropriate. Some of these cases were clarified via private correspondence with the authors. However, in some cases there is other evidence that the presented results are in terms of Eb/N0 rather than SNR, such as comparisons with benchmarkers or capacity bounds. In these cases, Eb/N0 is assumed and identified using a triple asterisk (***) in Table II.

5) Processing energy efficiency: As for any electronic system, low processing energy consumption is desirable in the design of FPGA-based LDPC decoders. However, only a few publications ([28], [56], [73]) have included energy consumption measurements, hence this characteristic cannot be considered in our comparisons.

6) Bandwidth efficiency: The bandwidth efficiency of a communication system is given by the ratio of the information throughput that it can convey to the corresponding bandwidth required. For example, a scheme that conveys 500 bits per second over a channel having a bandwidth of 1 kHz has a bandwidth efficiency of 0.5 (bits/s)/Hz. For BPSK-modulated codewords using ideal Nyquist pulse shaping filters, bandwidth efficiency is numerically equal to the LDPC coding rate R. In this regard, LDPC codes with higher coding rates are more desirable, since they make more efficient use of their channel's bandwidth.
7) Flexibility: Flexibility is a desirable characteristic, because it allows an FPGA-based LDPC decoder to support different parity check matrices, having different coding rates, block lengths and node degrees. Some designs may support a selection of related PCMs from within a particular code family, such as the 21 PCMs included in the DVB-S2 standard [5]. Meanwhile, other designs may be completely flexible, supporting any PCM.

Decoders may exhibit flexibility either during their design or during their operation. True run-time flexibility allows a specific codeword to be decoded using a particular PCM, immediately before decoding a different codeword using a different PCM. This allows the communication system to dynamically adapt to time-varying channel conditions, such as by decreasing the coding rate R in high-noise environments in order to improve the BER performance. However, this advantage may only be achieved at the cost of requiring a more sophisticated design, typically having higher hardware resource requirements or lower processing throughput. By contrast, decoders that are only flexible at design-time may only be adapted to use a different PCM by reprogramming the FPGA, preventing a high degree of rapid reconfigurability. The degree of design-time flexibility can also be difficult to accurately quantify, since any design that is synthesised from a HDL can be modified and re-synthesised fairly rapidly. Design-time flexibility has therefore not been considered in this survey.

IV. DISCUSSIONS

The data presented in Section III inspires a great deal of discussion and visualisation of the relationships amongst the various parameters and characteristics of FPGA-based LDPC decoders. This section commences by characterising the fundamental trade-off between desirable characteristics in Section IV-A, before identifying the parameters that affect each characteristic in Section IV-B.

A. Trade-offs

As seen in Fig. 1 and discussed in Section III-B, the main measurable characteristics of an FPGA-based LDPC decoder are processing throughput, processing latency, hardware resource utilisation, transmission energy efficiency, processing energy efficiency, bandwidth efficiency and flexibility. Of these, it is the processing throughput, hardware resource utilisation, flexibility and transmission energy efficiency that provide the clearest and most fundamental trade-off, since the other characteristics are all in some way dependent on these. The relationship amongst these four characteristics is plotted in Fig. 9.

Note that all scatter plots presented in this paper are organised so that a decoder with desirable values for all characteristics would correspond to a data point in the top-right corner. In Fig. 9, the x-axis is plotted with the values reversed, so that decoders with smaller hardware resource requirements (preferred) are further to the right than larger ones. Meanwhile the y-axis is plotted as normal, so that decoders with the highest processing throughput are at the top. In this way, points above the trend line are superior to the average case, whilst points below it are inferior, notwithstanding the values of their other characteristics.

It can be seen in Fig. 9 that most designs can only excel in at most three of the four characteristics presented. The trend line presents the average processing throughput vs size trade-off, and decoders that perform above this line generally tend to suffer from poor transmission energy efficiency, whilst decoders with a high energy efficiency tend to either have larger hardware resource requirements or lower processing throughput than the average case. Any decoders that perform well in all three of these characteristics tend to be totally inflexible to any PCM changes at run-time.

The five points in Fig. 9 having the highest processing throughput are from [28], [74], [68] and [67], all of which employ fully-parallel architectures. The design of [67] has the smallest hardware resource requirement of the four, owing to its use of only one bit per LLR. By contrast, the designs of [28], [68] and [74] use two or three bits per LLR, which is reflected in their relative hardware resource requirements.
Fig. 9. Processing throughput vs. hardware requirements vs. transmission energy efficiency vs. flexibility. (Scatter plot of decoded throughput (Mbps) against Equivalent Logic Blocks (ELBs), with the ELB axis reversed; markers distinguish flexible and inflexible designs, with and without Eb/N0 data, and a throughput vs size trend line is shown. Labelled points include [28], [74], [68], [67], [61] and [60].)
None of these high-throughput decoders have any run-time flexibility, as is typical of fully-parallel architectures. The next highest processing throughput is achieved by the design of [65], which adopts a partially-parallel rather than fully-parallel architecture, but also uses only one bit per LLR. The effect of using these small numbers of bits can be seen in these decoders' poor transmission energy efficiency, since reducing the resolution of the LLRs impedes the associated error correction capability.

In addition to employing single-bit LLRs, the design of [65] achieves a high processing throughput by decoding two frames at once. The designs presented in [77] use a similar technique, processing three, four or even six frames in parallel using multiple decoder copies in the same FPGA. Owing to this, these designs have a larger processing throughput than the average case, while also being flexible and having reasonable error correction performance. However, as discussed in Section III-B2, the processing latency of these decoders is much higher than their processing throughput would imply, making them less suitable for time-critical applications.

The decoders presented in [36] and [45] both achieve good transmission energy efficiency, while also having higher processing throughputs (or lower hardware requirements) than the average case. Both of these designs use stochastic bitstreams to represent the LLRs, facilitating a fully-parallel architecture having single-wire serial transmission between nodes, greatly simplifying the hardware design.

The points in the bottom-right of Fig. 9 correspond to the designs presented in [60], which employ a fully-serial architecture and so have very low hardware resource requirements, but also low processing throughput. However, these designs also have the benefit of being truly run-time flexible for any LDPC code. By contrast, the other flexible designs shown in Fig. 9, such as [17] and [61], are only flexible for a set of related PCMs.

In addition to the trade-offs described above, Fig. 9 also demonstrates that it is difficult to consider all of the characteristics of an FPGA-based LDPC decoder at once. For example, Fig. 9 does not consider the capabilities of the FPGA that each decoder is implemented using. In particular, more recent FPGAs may be able to operate identical designs at higher clock speeds than older FPGAs. This could be crudely factored into the results by dividing the processing throughput by the clock frequency, but doing so would then negate the impact of other parameters such as the critical path length. Furthermore, no consideration is given in Fig. 9 to the processing latency of each considered design. Note, however, that by plotting the decoded processing throughput rather than the encoded processing throughput, the coding rate and the bandwidth efficiency of the LDPC code has been taken into at least partial consideration.

B. Relationships between parameters and each characteristic

Having established the fundamental trade-off that exists between the main characteristics of FPGA-based LDPC decoders, namely processing throughput, processing latency, hardware requirements and transmission energy efficiency, the following subsections present discussions of the parameters that affect each one. A discussion of bandwidth efficiency is combined with transmission energy efficiency in Section IV-B4, but a quantitative discussion of flexibility and processing energy efficiency could not be made, owing to the lack of the required information in the publications considered.
Fig. 10. Decoded throughput (Mbps) against the number of PUs per 1000 edges in H. (Marker shading indicates the number of bits per LLR; marker shape distinguishes designs using 3-8, 9-15 and 16+ decoding iterations, or for which this data is not available; a throughput vs parallelism trend line is shown. Labelled points include [67], [68], [75], [76] and [60].)
1) Processing throughput: Fig. 10 characterises the strong relationship between an FPGA-based LDPC decoder's degree of parallelism and its decoded processing throughput, confirming the expectation that designs having more parallel processors can decode a higher number of bits per second. Note that in Fig. 10 the number of parallel processing units has been divided by the number of edges in the PCM H, to remove the dependence on the LDPC code size. The shading and shapes of the markers in Fig. 10 also indicate the influence that the number of bits per LLR and the number of decoding iterations have on the processing throughput, respectively. Points above the trend line typically employ a small number of bits per LLR or iterations, evidenced by their dark shading or circular point shape. By contrast, slower-than-average designs typically employ a larger number of bits per LLR or iterations, and therefore have lighter shading or a square shape.

Perhaps the most prominent points in Fig. 10 are the light grey circles belonging to [69], which achieve a much higher processing throughput than the trend line, despite using 8 bits per LLR. This may be explained by this design's use of layered belief propagation with the aid of a novel joint row-column processor, which decreases the processing time of each iteration and helps to avoid memory conflicts, thereby increasing the processing throughput.

The light triangles in the bottom-right represent the fully-serial decoders presented in [60], which achieve a low processing throughput owing to their low number of processors. Conversely, the dark points in the top-left represent the fully parallel decoders of [67] and [68], which achieve a very high processing throughput by using few bits, few iterations and a large degree of parallelism, and by operating on the basis of the MSA [25]. The fact that the MSA can facilitate a higher processing throughput than more complicated alternatives such as the SPA [24] is also demonstrated by comparing the results of [75] and [76], which present two very similar designs that vary in algorithm. The design in [76] suffers from a 4-5 Mbps processing throughput drop compared to [75], caused by its employment of the SPA instead of the MSA, as well as by using a non-uniform quantisation scheme for the LLRs.

The point furthest above the trend line corresponds to the design of [65], which achieves a high processing throughput by using only a single bit per LLR, five iterations per frame and by decoding two frames simultaneously. This design also exploits the properties of quasi-cyclic LDPC codes to implement an efficient partially-parallel architecture, reducing the number of processing units required to achieve its high processing throughput.

2) Processing latency: As discussed above, processing latency is not treated as a quantifiable characteristic in our analysis, because the majority of publications do not quantify this characteristic of their design. However, the processing latency is dependent on the processing throughput, the message word length K, the scheduling and the number of frames that are decoded in parallel.

Some of the decoders considered, such as those of [65] and [77], process multiple frames in parallel by instantiating several independent copies of the decoder on the same FPGA. In these cases the total processing throughput and resource requirement could be divided by the number of decoders, in order to produce results that correspond to the processing latency of an equivalent design that only considers one frame at a time. However, other designs, such as [42], process multiple frames by making use of spare time within the decoding schedule, with the result that the hardware cost does
Fig. 11. Equivalent Logic Blocks (ELBs) against the number of Processing Units (PUs), with the ELB axis reversed. (Marker shading indicates the number of bits per LLR; marker shape distinguishes PCMs having 1k-10k, 10k-50k and 50k+ edges, or for which this data is not available; a size vs parallelism trend line is shown. Labelled points include [48], [76], [75], [47], [58], [61] and [77].)
not increase linearly with the processing throughput. Owing to this, it is not possible to normalise the data to only consider the processing throughput and hardware resources required for decoding one frame at a time, so the processing latency cannot be fairly inferred.

3) Hardware requirements: Unsurprisingly, the major contributing factor to the hardware resource requirement of an FPGA-based LDPC decoder design is its degree of parallelisation, as shown in Fig. 11. Additionally, Fig. 11 shows that the number of bits employed per LLR and the number of edges employed in the parity check matrix also have some influence on the hardware resource requirement, though the effects of these parameters are quite varied. This may be attributed to the difficulty of accurately comparing the hardware resource requirements of different designs, as well as suggesting that other factors are involved. It is however noticeable that there is a general reduction in the number of bits per LLR employed in designs with increased parallelism. This may be explained by the explosion in routing complexity upon increasing the number of PUs, which would be exacerbated by the requirement for data buses having large operand widths.
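A crude first-order indication of this effect in the fully-parallel extreme is sketched below; it is a back-of-the-envelope illustration with hypothetical figures, not a model used by any of the surveyed designs:

# Back-of-the-envelope estimate of inter-node wiring in a fully-parallel
# decoder: each Tanner-graph edge carries one variable-to-check and one
# check-to-variable message, each of bits_per_llr bits.
def interconnect_bits(pcm_edges, bits_per_llr):
    return 2 * pcm_edges * bits_per_llr

print(interconnect_bits(10_000, 8))  # 160,000 parallel wires
print(interconnect_bits(10_000, 2))  #  40,000 parallel wires for the same PCM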
The dark grey circles corresponding to the designs of [77] towards the bottom of Fig. 11 seemingly have a much larger hardware resource requirement than would be expected, considering the number of processing units, the number of PCM edges and the number of bits employed per LLR. However, these designs are each run-time flexible for a different family of codes, having HDL code that is automatically generated. This additional flexibility results in decoders that are not as fully optimised as one that was designed specifically for a single PCM, explaining the associated hardware overhead. This is confirmed by the observation that the run-time flexible designs of [61] also correspond to a set of points positioned very far below the trend line.

The results of [75], [48] and [76] all sit above the trend line, despite employing a large number of bits per LLR, as well as a moderate PCM size. This may be partially attributed to their implementation of quasi-cyclic LDPC codes, using partially-parallel architectures, leading to a very efficient use of hardware resources. Additionally, the smallest hardware resource requirement of these designs is achieved by one that uses the MSA rather than the SPA, illustrating that this algorithm requires fewer hardware resources.

The design of [47] requires more FPGA resources than the trend line would suggest, which is remarkable considering its small PCM and number of bits per LLR. At first glance this may be attributed to its use of the uncommon array-based LDPC code. However, the design of [58] also uses an array-based code but sits above the trend line, despite employing a large number of bits per LLR and a large PCM. On closer inspection, it can be observed that the design of [47] employs a simple FPGA from an old generation, suggesting that its comparably large hardware resource requirement stems from inefficient FPGA synthesis.

4) Transmission energy efficiency and bandwidth efficiency: The minimum SNR per bit Eb/N0 at which it becomes theoretically possible to reliably send information over a channel depends on the target bandwidth efficiency and therefore on the coding rate of the FEC code employed. A code having a lower coding rate may achieve a lower minimum transmission energy, owing to the increased number of parity bits that it employs for error correction. For this reason, we consider the transmission energy efficiency and the bandwidth efficiency jointly in this subsection.
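This dependence can be illustrated with the standard Shannon capacity bound for the AWGN channel. The sketch below assumes ideal Nyquist signalling, so that the bandwidth efficiency equals the coding rate for BPSK as in Section III-B6; it is provided purely for illustration and is not taken from the survey data:

import math

# Minimum Eb/N0 (in dB) for reliable communication over an AWGN channel
# at a bandwidth efficiency of eta (bits/s)/Hz: Eb/N0 >= (2**eta - 1) / eta.
def min_ebn0_db(eta):
    return 10.0 * math.log10((2.0 ** eta - 1.0) / eta)

print(min_ebn0_db(0.5))  # eta = 0.5, e.g. a rate-1/2 code with BPSK -> about -0.82 dB
print(min_ebn0_db(0.9))  # eta = 0.9, e.g. a rate-0.9 code with BPSK -> about -0.17 dB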
[Figure: Distance from capacity (dB) against the number of edges in H; marker shape distinguishes designs using 3-8, 9-15 and 16+ decoding iterations, or for which this data is not available; a performance loss vs edges trend line is shown. Labelled points include [41], [61], [60], [69], [50], [74] and [67].]
[Figure: design decisions when choosing which LDPC PCM(s) to decode: block length and coding rate; run-time vs. design-time PCM switching; regular/irregular codes and node degrees; other code features (e.g. quasi-cyclic).]

... completing each design element. More details about these issues can be found throughout Sections II-C to II-E.
• provide mathematical detail about the algorithm used and endeavour to use established terminology if the same formulae have been used before;
• provide power/energy consumption measurements obtained during BER simulation;
• when mentioning flexibility, explicitly state whether the changes can be made at run-time or whether they require a new synthesis run;
• endeavour to make it possible to compare new designs to old ones by selecting a benchmarker, and implementing a new design using exactly the same set of parameters on the same FPGA.

In addition to adhering to the above list of guidelines to facilitate fairer comparisons between different designs, it would be of significant benefit if authors of FPGA-based LDPC decoder designs were at liberty to make their source code freely available online. Open-source code can be readily found for many of the signal processing blocks used in communications systems, but unfortunately there are very few freely-available FPGA-based LDPC decoder designs. This inevitably hinders innovation within the field, since every prospective designer is required to commence by implementing a basic structure, rather than improving an existing design. Additionally, if a reader of a published design had access to the source code, it would significantly aid their comprehension of the novel techniques that are being described. Finally, making source code freely available facilitates the employment of current FPGA-based LDPC decoder designs as benchmarkers for future designs.

C. Further work

Performing the analysis described above has enabled us to identify several opportunities for further research and development in the field of FPGA-based LDPC decoders, as discussed in the following subsections.

1) Flexible decoders: Perhaps the biggest gap illustrated by the trade-offs described in Section IV-A is for high-speed decoders having run-time flexibility and low hardware resource cost. Run-time flexibility has huge advantages for commercial applications, since it allows a decoder to dynamically support the variety of different LDPC codes within a particular communications standard, without incurring the overhead of the time and technical intervention that is required to reprogram an FPGA. Further to this, flexible decoders can adapt automatically depending on the channel conditions without any user input, increasing the efficiency of FPGA-based LDPC decoders in consumer applications. Run-time flexibility can also be useful for research purposes, reducing the number of times an FPGA has to be re-synthesised when testing multiple different codes.

As seen in Section IV-A, decoders having a fully-serial architecture can be flexible with little or no extra hardware resource cost, but suffer in terms of their low processing throughput. Meanwhile, the extra hardware required to make a fully-parallel decoder flexible renders this approach impractical, regardless of their capacity for high processing throughputs. Initial research therefore suggests that partially-parallel decoder architectures utilising semi-structured (e.g. quasi-cyclic) LDPC codes such as those in [77] have the greatest potential for flexibility and high processing throughputs. Recent research into hierarchical quasi-cyclic codes as in [56] and [53] could be of particular interest.

2) Schedules: Unfortunately, different publications have used different terminology to describe the decoding schedules adopted in their decoder designs, so a direct comparison could not be easily drawn between them in this paper. However, there is an opportunity to investigate the effects of using different schedules in two otherwise equal FPGA-based LDPC decoders, assessing their effects not only on BER performance and complexity as in previous research on scheduling, but also on processing throughput, hardware resources, processing energy and flexibility. In particular, none of the reviewed decoders operate on the basis of calculating residuals as in IDS, implying that this schedule is largely under-represented within the field despite claims of its superiority to others [22]. Further research is required to determine whether these claims are valid in practical implementations, and to investigate the architectural constraints that employing IDS would impose on an FPGA-based LDPC decoder design.

3) Stochastic decoders: The two stochastic decoders presented in this report, [36] and [45], performed well in terms of processing throughput, BER performance and hardware requirements. Stochastic designs are associated with their own set of advantages and challenges, offering another opportunity for further research. The serial transmission of messages between processing nodes makes fully-parallel designs more feasible, and allows the error correction performance to be dynamically traded for processing throughput by simply increasing the number of bits used for each message.

4) Low processing energy consumption: It is unfortunate that the majority of the designs reviewed in this report did not present any information about the decoder's energy consumption. As with all electronic devices, low energy consumption is a key figure of merit in communication systems, since it dictates how long mobile devices can function for between battery recharges, as well as dictating the cost and environmental impact of operating base station equipment. This provides a motivation to investigate the factors behind energy consumption in FPGA-based LDPC decoders, possibly by implementing some of the published designs and measuring their energy consumption directly. Drawing upon these results, FPGA-based LDPC decoders having low energy consumption could then be designed.

5) Low processing latency: Similarly to processing energy consumption, processing latency is a crucial characteristic of communications hardware that was curiously under-represented in the works reviewed here. While the processing latency may be approximated for many designs as a function of the processing throughput and message word length, the fact that it was rarely quantified implies that it was rarely a design focus. Some applications of FPGA-based LDPC decoders may require ultra-low processing latency above all other characteristics, suggesting that this is a gap in the market that is currently unfilled. Further research could be conducted to determine whether this is indeed the case, before devising new designs having ultra-low processing latency. Such designs
would require large processing throughputs without processing multiple frames in parallel, so would be likely to employ highly parallel architectures and low-complexity algorithms. In this case, the cost of the low processing latency would be a higher hardware resource consumption and a lower transmission energy efficiency.

VI. CONCLUSIONS

In this paper, we have assessed the practicalities and limitations of FPGA-based LDPC decoders. Section II presented a tutorial on LDPC codes, covering their structure, encoding process, decoding process and construction techniques. A number of practical decoder implementation decisions were then highlighted, before providing background information on the structure of FPGAs and the differences between those produced by the main two FPGA vendors. In Section III, the results from an extensive survey were presented in a condensed form, featuring only a subset of the rows and columns available in the online version. The remainder of Section III was then devoted to describing the parameters and characteristics used in the evaluation, discussing the significance of each and how it was measured. Section IV then illustrated, characterised and discussed the complex interplay between all of these parameters and characteristics, using plots of the results to show how each one was affected by the others. Subsequently, using the experience gained from compiling the survey results, Section V presented a list of recommendations for future publications of FPGA-based LDPC decoder designs, in order to facilitate fairer, more comprehensive comparisons in future. Finally, we have identified a number of opportunities for future FPGA-based LDPC decoder designs.

Perhaps the most significant conclusion that can be drawn from the research described in this paper is that it is extremely difficult to predict how two different FPGA-based LDPC decoder designs might compare, when they are implemented using different codes, architectures, algorithms, schedules and hardware. This in itself lends further weight to the advantage of using FPGAs for prototyping designs, utilising their reprogrammability in an efficient design-implement-test development cycle. To do so requires accurate comparisons amongst competing designs to be made, which can only be achieved using the list of recommendations provided in Section V-B. However, even having completed this process, it may still be difficult to say which design is superior, as there is such a complex interplay of characteristics that each will inevitably have its own advantages and disadvantages.

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inform. Theory, vol. IT-8, pp. 21-28, Jan. 1962.
[2] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electron. Lett., vol. 32, no. 18, p. 1645, Aug. 1996.
[3] IEEE, "IEEE 802.11n-2009 Standard for Information technology - Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY)," 2009.
[4] ——, "IEEE 802.16-2004 Standard for Local and Metropolitan Area Networks - Part 16: Air Interface for Fixed Broadband Wireless Access Systems," 2004.
[5] ETSI, "ETSI EN 302 307 v1.3.1 Digital Video Broadcasting (DVB); Second generation," 2013. [Online]. Available: https://www.dvb.org/standards/dvb-s2
[6] CCSDS, "CCSDS 131.0-B-2 Recommendation for Space Data System Standards; TM Synchronization and Channel Coding," 2011.
[7] V. Oksman and S. Galli, "G.hn: The new ITU-T home networking standard," IEEE Commun. Mag., vol. 47, no. 10, pp. 138-145, Oct. 2009.
[8] G. D. Forney, T. J. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Commun. Lett., vol. 5, no. 2, pp. 58-60, 2001.
[9] Y. Cai, S. Jeon, K. Mai, and B. V. K. V. Kumar, "Highly parallel FPGA emulation for LDPC error floor characterization in perpendicular magnetic recording channel," IEEE Trans. Magn., vol. 45, no. 10, pp. 3761-3764, 2009.
[10] A. Naderi, S. Mannor, M. Sawan, and W. J. Gross, "Delayed Stochastic Decoding of LDPC Codes," IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5617-5626, Nov. 2011.
[11] G. Sundararajan, C. Winstead, and E. Boutillon, "Noisy Gradient Descent Bit-Flip Decoding for LDPC Codes," IEEE Trans. Commun., vol. 62, no. 10, pp. 3385-3400, 2014.
[12] G. Sarkis, S. Hemati, S. Mannor, and W. J. Gross, "Stochastic Decoding of LDPC Codes over GF(q)," IEEE Trans. Commun., vol. 61, no. 3, 2013.
[13] S. S. Tehrani, C. Jego, B. Zhu, and W. J. Gross, "Stochastic Decoding of Linear Block Codes With High-Density Parity-Check Matrices," IEEE Trans. Signal Process., vol. 56, no. 11, pp. 5733-5739, 2008.
[14] L. Zhang, L. Gui, Y. Xu, and W. Zhang, "Configurable Multi-Rate Decoder Architecture for QC-LDPC Codes Based Broadband Broadcasting System," IEEE Trans. Broadcast., vol. 54, no. 2, pp. 226-235, 2008.
[15] Z. Zhang, L. Dolecek, B. Nikolic, V. Anantharam, and M. J. Wainwright, "Design of LDPC decoders for improved low error rate performance: quantization and algorithm choices," IEEE Trans. Commun., vol. 57, no. 11, pp. 3258-3268, 2009.
[16] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, "High throughput low-density parity-check decoder architectures," in IEEE Glob. Telecommun. Conf. San Antonio, TX, USA: IEEE, Nov. 2001, pp. 3019-3024.
[17] M. Karkooti, P. Radosavljevic, and J. R. Cavallaro, "Configurable LDPC decoder architectures for regular and irregular codes," J. Signal Process. Syst., vol. 53, no. 1-2, pp. 73-88, May 2008.
[18] R. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inform. Theory, vol. 27, no. 5, pp. 533-547, Sep. 1981.
[19] L. Zhang, J. Huang, and L. L. Cheng, "Reliability-based high-efficient dynamic schedules for belief propagation decoding of LDPC codes," in IEEE Int. Conf. Signal Process. Beijing, China: IEEE, Oct. 2012, pp. 1388-1392.
[20] E. Sharon, S. Litsyn, and J. Goldberger, "Efficient serial message-passing schedules for LDPC decoding," IEEE Trans. Inform. Theory, vol. 53, no. 11, pp. 4076-4091, Nov. 2007.
[21] Y.-M. Chang, A. I. V. Casado, M.-C. F. Chang, and R. D. Wesel, "Lower-complexity layered belief-propagation decoding of LDPC codes," in IEEE Int. Conf. Commun. Beijing, China: IEEE, May 2008, pp. 1155-1160.
[22] A. I. Vila Casado, M. Griot, and R. D. Wesel, "Informed dynamic scheduling for belief-propagation decoding of LDPC codes," in IEEE Int. Conf. Commun. Glasgow, Scotland: IEEE, Jun. 2007, pp. 932-937.
[23] ——, "Improving LDPC decoders via informed dynamic scheduling," in IEEE Inform. Theory Work. Tahoe City, CA, USA: IEEE, Sep. 2007, pp. 208-213.
[24] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 498-519, 2001.
[25] F. Angarita, J. Valls, V. Almenar, and V. Torres, "Reduced-complexity min-sum algorithm for decoding LDPC codes with low error-floor," IEEE Trans. Circuits Syst. I, Reg. Pap., vol. 61, no. 7, pp. 2150-2158, Jul. 2014.
[26] Y. Chen and K. K. Parhi, "Overlapped message passing for quasi-cyclic low-density parity check codes," IEEE Trans. Circuits Syst. I, Reg. Pap., vol. 51, no. 6, pp. 1106-1113, Jun. 2004.
[27] C. Spagnol, W. Marnane, and E. Popovici, "FPGA implementations of LDPC over GF(2m) decoders," in IEEE Work. Signal Process. Syst. Shanghai, China: IEEE, Oct. 2007, pp. 273-278.
[28] V. A. Chandrasetty and S. M. Aziz, "An area efficient LDPC decoder using a reduced complexity min-sum algorithm," Integr. VLSI J., vol. 45, no. 2, pp. 141-148, Mar. 2012.
[29] A. Orlitsky, K. Viswanathan, and J. Zhang, "Stopping set distribution of LDPC code ensembles," IEEE Trans. Inform. Theory, vol. 51, no. 3, pp. 929-953, Mar. 2005.
[30] T. Tian, C. R. Jones, J. D. Villasenor, and R. D. Wesel, "Selective avoidance of cycles in irregular LDPC code construction," IEEE Trans. Commun., vol. 52, no. 8, pp. 1242-1247, Aug. 2004.
[31] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 399-431, Mar. 1999.
[32] M. P. C. Fossorier, "Quasi-cyclic low-density parity-check codes from circulant permutation matrices," IEEE Trans. Inform. Theory, vol. 50, no. 8, pp. 1788-1793, Aug. 2004.
[33] L. Chen, J. Xu, I. Djurdjevic, and S. Lin, "Near-Shannon-limit quasi-cyclic low-density parity-check codes," IEEE Trans. Commun., vol. 52, no. 7, pp. 1038-1042, Jul. 2004.
[34] E. Eleftheriou and D. M. Arnold, "Regular and irregular progressive edge-growth tanner graphs," IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 386-398, Jan. 2005.
[35] E. Yeo, B. Nikolic, and V. Anantharam, "Architectures and implementations of low-density parity check decoding algorithms," in Midwest Symp. Circuits Syst. Tulsa, OK, USA: IEEE, Aug. 2002, pp. 437-440.
[36] S. S. Tehrani, S. Mannor, and W. J. Gross, "Fully parallel stochastic LDPC decoders," IEEE Trans. Signal Process., vol. 56, no. 11, pp. 5692-5703, 2008.
[37] G. Masera, F. Quaglio, and F. Vacca, "Implementation of a flexible LDPC decoder," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 54, no. 6, pp. 542-546, Jun. 2007.
[38] T. Zhang, Z. Wang, and K. K. Parhi, "On finite precision implementation of low density parity check codes decoder," in IEEE Int. Symp. Circuits Syst. Sydney, Australia: IEEE, May 2001, pp. 202-205.
[39] S. ten Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, pp. 1727-1737, Jan. 2001.
[40] X. Zuo, R. G. Maunder, and L. L. Hanzo, "Design of Fixed-Point Processing Based LDPC Codes Using EXIT Charts," in IEEE Veh. Technol. Conf., San Francisco, CA, USA, Jan. 2011.
[41] Z. Cui and Z. Wang, "A 170 Mbps (8176, 7156) quasi-cyclic LDPC decoder implementation with FPGA," in IEEE Int. Symp. Circuits Syst. Kos, Greece: IEEE, May 2006, pp. 5095-5098.
[42] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "A bit-serial approximate min-sum LDPC decoder and FPGA implementation," in IEEE Int. Symp. Circuits Syst. Kos, Greece: IEEE, May 2006, pp. 1-4.
[43] Y. Sun, Y. Zhang, J. Hu, and Z. Zhang, "FPGA implementation of nonbinary quasi-cyclic LDPC decoder based on EMS algorithm," in Int. Conf. Commun. Circuits Syst. Milpitas, CA, USA: IEEE, Jul. 2009, pp. 1061-1065.
[44] I. Kuon, R. Tessier, and J. Rose, "FPGA architecture: survey and challenges," Found. Trends Electron. Des. Autom., vol. 2, no. 2, pp. 135-253, 2007.
[45] S. S. Tehrani, S. Mannor, and W. J. Gross, "An area-efficient FPGA-based architecture for fully-parallel stochastic LDPC decoding," in IEEE Work. Signal Process. Syst. Shanghai, China: IEEE, Oct. 2007, pp. 255-260.
[46] P. Hailes, L. Xu, R. G. Maunder, B. M. Al-Hashimi, and L. L. Hanzo, "Survey results for 'A survey of FPGA-based LDPC decoders'," 2015. [Online]. Available: http://dx.doi.org/10.5258/SOTON/384946
[47] P. Bhagawat, M. Uppal, and G. Choi, "FPGA based implementation of decoder for array low-density parity-check codes," in IEEE Proc. Int. Conf. Acoust. Speech Signal Process. Philadelphia, PA, USA: IEEE, Mar. 2005, pp. 29-32.
[48] K. Shimizu, T. Ishikawa, N. Togawa, T. Ikenaga, and S. Goto, "Partially-parallel LDPC decoder based on high-efficiency message-passing algorithm," in IEEE Proc. Int. Conf. Comput. Des. San Jose, CA, USA: IEEE Comput. Soc, Oct. 2005, pp. 503-510.
[49] T. Zhang and K. K. Parhi, "A 54 Mbps (3,6)-regular FPGA LDPC decoder," in IEEE Work. Signal Process. Syst. San Diego, CA, USA: IEEE, Oct. 2002, pp. 127-132.
[50] K. Wang, N. Liu, B. Sun, and H. Sun, "A configurable FPGA implementation of PEG-based PS-LDPC decoder," in Int. Conf. Pervasive Comput. Signal Process. Appl. Harbin, China: IEEE, Sep. 2010, pp. 670-674.
[51] Y. Chen and D. E. Hocevar, "A FPGA and ASIC implementation of rate 1/2, 8088-b irregular low density parity check decoder," in IEEE Glob. Telecommun. Conf. San Francisco, CA, USA: IEEE, Dec. 2003, pp. 113-117.
[52] F. Demangel, N. Fau, N. Drabik, F. Charot, and C. Wolinski, "A generic architecture of CCSDS low density parity check decoder for near-earth applications," in Proc. Conf. Des. Autom. Test Eur. Nice, France: European Design and Automation Association, Apr. 2009, pp. 1242-1245.
[53] V. A. Chandrasetty and S. M. Aziz, "A highly flexible LDPC decoder using hierarchical quasi-cyclic matrix with layered permutation," J. Networks, vol. 7, no. 3, pp. 441-450, Mar. 2012.
[54] P. Saunders and A. Fagan, "A high speed, low memory FPGA based LDPC decoder architecture for quasi-cyclic LDPC codes," in Int. Conf. F. Program. Log. Appl. Madrid, Spain: IEEE, Aug. 2006, pp. 1-6.
[55] Y.-H. Chien and M.-K. Ku, "A high throughput H-QC LDPC decoder," in IEEE Int. Symp. Circuits Syst. New Orleans, LA, USA: IEEE, May 2007, pp. 1649-1652.
[56] V. A. Chandrasetty and S. M. Aziz, "A multi-level hierarchical quasi-cyclic matrix for implementation of flexible partially-parallel LDPC decoders," in IEEE Int. Conf. Multimed. Expo. Barcelona, Spain: IEEE, Jul. 2011, pp. 1-7.
[57] F. Charot, C. Wolinski, N. Fau, and F. Hamon, "A new powerful scalable generic multi-standard LDPC decoder architecture," in Int. Symp. Field-Programmable Cust. Comput. Mach. Palo Alto, CA, USA: IEEE, Apr. 2008, pp. 314-315.
[58] J. Sha, M. Gao, Z. Zhang, L. Li, and Z. Wang, "An FPGA implementation of array LDPC decoder," in IEEE Asia Pac. Conf. Circuits Syst. Singapore: IEEE, Dec. 2006, pp. 1675-1678.
[59] Z. Cao, J. Kang, and P. Fan, "An FPGA implementation of a structured irregular LDPC decoder," in IEEE Int. Symp. Microw. Antenna Propag. EMC Technol. Wirel. Commun., vol. 1. Beijing, China: IEEE, Aug. 2005, pp. 1050-1053.
[60] S. M. E. Hosseini, K. S. Chan, and W. L. Goh, "A reconfigurable FPGA implementation of an LDPC decoder for unstructured codes," in Int. Conf. Signals Circuits Syst. Nabeul, Tunisia: IEEE, Nov. 2008, pp. 1-6.
[61] L. Yang, H. Liu, and C. J. R. Shi, "Code construction and FPGA implementation of a low-error-floor multi-rate low-density parity-check code decoder," IEEE Trans. Circuits Syst. I, Reg. Pap., vol. 53, no. 4, pp. 892-904, 2006.
[62] H. Ding, S. Yang, W. Luo, and M. Dong, "Design and implementation for high speed LDPC decoder with layered decoding," in WRI Int. Conf. Commun. Mob. Comput. Yunnan: IEEE, Jan. 2009, pp. 156-160.
[63] Y. Pei, L. Yin, and J. Lu, "Design of irregular LDPC codec on a single chip FPGA," in IEEE Proc. Circuits Syst. Symp. Emerg. Technol., vol. 1. Shanghai, China: IEEE, May 2004, pp. 221-224.
[64] M. Gomes, G. Falcão, V. Silva, V. Ferreira, A. Sengo, and M. Falcão, "Flexible parallel architecture for DVB-S2 LDPC decoders," in IEEE Glob. Telecommun. Conf. Washington, DC, USA: IEEE, Nov. 2007, pp. 3265-3269.
[65] X. Chen, Q. Huang, S. Lin, and V. Akella, "FPGA based low-complexity high-throughput tri-mode decoder for quasi-cyclic LDPC codes," in Annu. Allert. Conf. Commun. Control Comput. Monticello, IL, USA: IEEE, Sep. 2009, pp. 600-606.
[66] C. Beuschel and H. Pfleiderer, "FPGA implementation of a flexible decoder for long LDPC codes," in 2008 Int. Conf. F. Program. Log. Appl. Heidelberg, Germany: IEEE, Sep. 2008, pp. 185-190.
[67] V. A. Chandrasetty and S. M. Aziz, "FPGA Implementation of a LDPC Decoder using a Reduced Complexity Message Passing Algorithm," J. Networks, vol. 6, no. 1, pp. 36-45, Jan. 2011.
[68] ——, "FPGA implementation of high performance LDPC decoder using modified 2-bit min-sum algorithm," in Int. Conf. Comput. Res. Dev. Kuala Lumpur, Malaysia: IEEE, May 2010, pp. 881-885.
[69] Z. He, S. Roy, and P. Fortier, "FPGA implementation of LDPC decoders based on joint row-column decoding algorithm," in IEEE Int. Symp. Circuits Syst. New Orleans, LA, USA: IEEE, May 2007, pp. 1653-1656.
[70] A. Blad and O. Gustafsson, "FPGA implementation of rate-compatible QC-LDPC code decoder," in Eur. Conf. Circ. Theory Des. Linkoping, Sweden: IEEE, Aug. 2011, pp. 777-780.
[71] S. S. Khati, P. Bisht, and S. C. Pujari, "Improved decoder design for LDPC codes based on selective node processing," in World Congr. Inform. Commun. Technol. IEEE, Oct. 2012, pp. 413-418.
[72] Z. Zhang, L. Dolecek, B. Nikolic, V. Anantharam, and M. Wainwright, "Investigation of error floors of structured low-density parity-check codes by hardware emulation," in IEEE Glob. Telecommun. Conf. San Francisco, CA, USA: IEEE, Nov. 2006, pp. 1-6.
[73] X. Chen, J. Kang, S. Lin, and V. Akella, "Memory system optimization for FPGA-based implementation of quasi-cyclic LDPC codes decoders," IEEE Trans. Circuits Syst. I, Reg. Pap., vol. 58, no. 1, pp. 98-111, 2011.
[74] R. Zarubica, S. G. Wilson, and E. Hall, "Multi-Gbps FPGA-based low density parity check (LDPC) decoder design," in IEEE Glob. Telecommun. Conf. Washington, DC, USA: IEEE, Nov. 2007, pp. 548-552.
[75] Y. Dai, Z. Yan, and N. Chen, "Optimal overlapped message passing decoding of quasi-cyclic LDPC codes," IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 5, pp. 565-578, 2008.
[76] N. Chen, Y. Dai, and Z. Yan, "Partly parallel overlapped sum-product decoder architectures for quasi-cyclic LDPC codes," in IEEE Work. Signal Process. Syst. Banff, AB, Canada: IEEE, Oct. 2006, pp. 220-225.
[77] H. Li, Y. S. Park, and Z. Zhang, "Reconfigurable architecture and automated design flow for rapid FPGA-based LDPC code emulation," in Proc. ACM/SIGDA Int. Symp. F. Program. Gate Arrays. Monterey, CA, USA: ACM, Feb. 2012, pp. 167-170.
[78] C. Spagnol, W. Marnane, and E. Popovici, "Reduced complexity, FPGA implementation of quasi-cyclic LDPC decoder," in Proc. Eur. Conf. Circ. Theory Des., vol. 1. Cork, Ireland: IEEE, Aug. 2005, pp. 289-292.
[79] M. Karkooti and J. R. Cavallaro, "Semi-parallel reconfigurable architectures for real-time LDPC decoding," in Proc. Int. Conf. Inform. Technol. Coding Comput. Las Vegas, NV, USA: IEEE, Apr. 2004, pp. 579-585.
[80] L. Xiong, Z. Tan, and D. Yao, "The moderate-throughput and memory-efficient LDPC decoder," in 2006 8th Int. Conf. Signal Process. Beijing, China: IEEE, Nov. 2006, pp. 1-4.
[81] Softjin Technologies, "LDPC decoder for DVB-S2." [Online]. Available: http://www.softjin.com/IP Datasheet PDF version/LDPC Decoder datasheet.pdf
[82] Unicore Systems Ltd, "CCSDS C2 LDPC encoder/decoder IP cores," 2011. [Online]. Available: http://unicore.co.ua/uploads/File/CCSDS XX user manual(netlist).pdf
[83] ——, "IEEE 802.16e (WiMAX) LDPC decoder IP core," 2009. [Online]. Available: http://unicore.co.ua/uploads/File/ldpc dec brief.pdf
[84] Blue Rum Consulting Limited, "802.11n/802.11ac LDPC decoder," 2013. [Online]. Available: http://www.bluerum.co.uk/consulting/datasheets/BRC008 LdpcDecRtlDs.pdf
[85] Turbo Concept, "ITU G.hn LDPC decoder." [Online]. Available: http://www.turboconcept.com/prod tc4400.php
[86] Creonic GmbH, "IEEE 802.11ad WiGig LDPC decoder product brief," 2014. [Online]. Available: http://www.creonic.com/images/product briefs/PB Creonic IEEE 802 11ad WiGig LDPC Decoder IP.pdf
[87] IPrium Ltd., "I.6 LDPC encoder/decoder IP core short description," 2013. [Online]. Available: https://www.iprium.com/bins/pdf/iprium ug i6 ldpc codec.pdf
[88] TrellisWare Technologies, "Flexible low-density parity-check (F-LDPC)," 2014. [Online]. Available: http://www.trellisware.com/products/fec-products/f-ldpc/
[89] Logic Fruit Technologies, "LDPC decoder IP specification," 2010. [Online]. Available: http://www.logic-fruit.com/resource/LDPCDecoderIP.pdf
[90] L. L. Hanzo, S. X. Ng, T. Keller, and W. Webb, Quadrature Amplitude Modulation. Chichester: Wiley-IEEE Press, 2004.

Lei Xu has been working with Altera for more than 7 years within system solution engineering and marketing. His current role is wireless system architect, instrumental in defining and driving the strategic direction and solution roadmap of Altera's wireless business unit. Previously he has worked on various wireless system solutions in Altera such as DPD, MIMO, Turbo SIC, etc. Prior to that, he worked as a system algorithmic/architecture expert at VIA Technologies and Agilent Technologies on various wireless and broadcasting systems, such as DAB, DVB, GSM/WCDMA, WiFi and LTE. He holds BSEE and MSEE degrees from Tsinghua University, China, and a PhD in Wireless Communication from the University of Southampton, UK, and has published 20+ leading journal and conference papers and holds 11 patents.

Robert G. Maunder (S'03-M'08-SM'12) has been with the department of Electronics and Computer Science at the University of Southampton, UK, since October 2000. He was awarded the B.Eng. (Hons.) degree in electronic engineering in 2003, as well as a Ph.D. degree in wireless communications in 2007. He became a lecturer in 2007 and an Associate Professor in 2013. His research interests include joint source/channel coding, iterative decoding, irregular coding, and modulation techniques.