Scaling Deep Learning-Based Decoding of Polar Codes via Partitioning

Abstract—The training complexity of deep learning-based channel decoders scales exponentially with the codebook size and therefore with the number of information bits. Thus, neural network decoding (NND) is currently only feasible for very short block lengths. In this work, we show that the conventional iterative decoding algorithm for polar codes can be enhanced when sub-blocks of the decoder are replaced by neural network (NN) based components. Thus, we partition the encoding graph into smaller sub-blocks and train them individually, closely approaching maximum a posteriori (MAP) performance per sub-block. These blocks are then connected via the remaining conventional belief propagation decoding stage(s). The resulting decoding algorithm is non-iterative and inherently enables a high level of parallelization, while showing a competitive bit error rate (BER) performance. We examine the degradation through partitioning and compare the resulting decoder to state-of-the-art polar decoders such as successive cancellation list and belief propagation decoding.

I. INTRODUCTION

Non-iterative and consequently low-latency decoding together with close to maximum a posteriori (MAP) decoding performance are two advantages of deep learning-based channel decoding. However, this concept is mainly restricted by its limited scalability in terms of the supported block lengths, known as the curse of dimensionality [1]. For k information bits, the neural network (NN) needs to distinguish between 2^k different codewords, which results in an exponential training complexity in case the full codebook needs to be learned. In this work we focus on NN decoding of polar codes and show that the scalability can be significantly improved towards practical lengths when the NN only replaces sub-components of the current decoding algorithm.

List decoding [2] with manageable list sizes and excellent bit error rate (BER) performance for short block lengths makes polar codes [3] a potential candidate for future communication standards such as the upcoming 5G standard or internet of things (IoT) applications. As polar codes are currently proposed for the 5G control channel [4], decoding algorithms for very short block lengths are of practical importance [5]. Besides that, the rate can be adjusted completely flexibly with single-bit granularity. However, the price to pay is an inherently serial decoding algorithm which is in general hard to accelerate, e.g., through parallel processing [6]. This leads to high decoding latency when compared to state-of-the-art low density parity check (LDPC) codes/decoders [7]. Thus, there is a demand for alternative decoding strategies. Besides algorithmic optimizations [7] of the existing algorithms, modifying the code structure [8] can be considered to overcome this issue. However, once standardized, the encoder cannot be changed. In this work, we propose an alternative approach by applying machine learning techniques to find an alternative decoding algorithm instead of changing the code structure. Once trained, the final decoding algorithm (i.e., the weights of a deep neural network) itself is static and can be efficiently implemented and parallelized on a graphical processing unit (GPU), field programmable gate array (FPGA), or application-specific integrated circuit (ASIC).

A first investigation of the topic of learning to decode was already done in [1]. The authors showed that the main difficulty lies in the curse of dimensionality, meaning that for k information bits 2^k classes exist, leading to exponential complexity during the training phase. In other applications, such as computer vision, the number of possible output classes is typically limited, e.g., to the number of different objects. In contrast to many other machine learning fields, an unlimited amount of labeled training data is available, since the encoding function and the channel model are well known. Additionally, a clear benchmark with existing decoders is possible. Although very powerful machine learning libraries such as Theano [9] and Tensorflow [10] are available nowadays and the computational power has increased by orders of magnitude, the exponential complexity still hinders straightforward learning of practical code lengths, as shown in [11]. It was observed in [11] that there is a certain generalization of NN decoding, meaning that the NN can infer from certain codewords to others it has never seen before. This is essential for learning longer codes and gives hope that neural network decoding (NND) can be scaled to longer codes. However, to the best of our knowledge, the naive approach of learning to decode only works for rather small block lengths.

The authors in [12] proposed the idea of using machine learning techniques to train the weights of a belief propagation factor graph in order to improve its decoding performance for high density parity check (HDPC) codes. As the Tanner graph is already given initially and only its weights are refined, their approach scales very well for larger block lengths and does not suffer from the curse of dimensionality. However, in this case, the use of machine learning refines an existing solution. The decoding algorithm itself is not learned, since the iterative nature of the BP algorithm is kept.

In our work, we tackle the problem of completely replacing the polar code decoder by a machine learning approach. As it turns out, only small codeword lengths can be trained efficiently, and thus we divide the polar encoding graph into several sub-graphs (cf. [13]). We learn sub-block-wise decoding and couple the components by conventional belief propagation (BP) stages. This scales the results from [11] towards practical block lengths.
Fig. 1: Polar encoding circuit for N = 8; blue boxes indicate the independent partitions of the code for M = 2 and green boxes for M = 4.

II. POLAR CODES

An encoder for polar codes maps the k information bits onto the k most reliable bit positions of the vector u of length N, denoted as the information set A, while the remaining N − k positions are treated as frozen positions. These frozen positions are denoted as Ā and must be known at the decoder side. Now the input block u is encoded according to x = u · G_N, where G_N = F^{⊗n} is the generator matrix and F^{⊗n} denotes the n-th Kronecker power of the kernel F = [1 0; 1 1]. The resulting encoding circuit is depicted in Fig. 1 for N = 8, which also defines the decoding graph. This factor graph consists of n + 1 = log_2(N) + 1 stages, each consisting of N nodes.
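For illustration, the encoding rule x = u · G_N can be sketched in a few lines of numpy; the helper names and the example information set are ours and only illustrative, not part of [3]:

```python
import numpy as np

def polar_generator(n):
    """Return G_N = F^(kron n), the n-th Kronecker power of the kernel F = [[1, 0], [1, 1]]."""
    F = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F)
    return G

def polar_encode(info_bits, A, N):
    """Place the k information bits on the positions of the information set A,
    keep the frozen positions (A-bar) at zero, and encode via x = u * G_N (mod 2)."""
    u = np.zeros(N, dtype=np.uint8)
    u[A] = info_bits
    return u.dot(polar_generator(int(np.log2(N)))) % 2

# Example for the N = 8 graph of Fig. 1 (the chosen set A is illustrative only):
x = polar_encode(np.array([1, 0, 1, 1], dtype=np.uint8), A=[3, 5, 6, 7], N=8)
```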
The BER performance of a polar code highly depends on the type of decoder used and has been one of the most exciting and active areas of research related to polar coding. There are two main algorithmic avenues to tackle the polar decoding problem:

1) successive cancellation-based decoding, following a serial "channel layer unwrapping" decoding strategy [3],
2) belief propagation-based decoding based on Gallager's BP iterative algorithm [14].

Throughout this work, we stick with the BP decoder, as its structure is a better match to neural networks and enables parallel processing. For details about successive cancellation (SC) decoding and its list extension, called successive cancellation list (SCL) decoding, we refer the interested reader to [2] and [3]. The BP decoder describes an iterative message passing algorithm with soft values, i.e., log likelihood ratio (LLR) values, over the encoding graph. For the sake of simplicity, we assume binary phase shift keying (BPSK) modulation and an additive white Gaussian noise (AWGN) channel. However, other channels can be implemented straightforwardly. For a received value y, it holds that

LLR(y) = \ln\frac{P(x=0|y)}{P(x=1|y)} = \frac{2y}{\sigma^2}

where σ^2 is the noise variance. There are two types of LLR messages: the right-to-left messages (L-messages) and the left-to-right messages (R-messages). One BP iteration consists of two update propagations:

1) Left-to-right propagation: The R-messages are updated starting from the leftmost stage (i.e., the stage of a priori information) until reaching the rightmost stage.
2) Right-to-left propagation: The L-messages are updated starting from the rightmost stage (i.e., the stage of channel information) until reaching the leftmost stage.

The output from two nodes becomes the input to a specific neighboring processing element (PE) (for more details we refer to [15]). One PE updates the L- and R-messages as follows [15]:

L_{out,1} = f(L_{in,1}, L_{in,2} + R_{in,2})
R_{out,1} = f(R_{in,1}, L_{in,2} + R_{in,2})
L_{out,2} = f(R_{in,1}, L_{in,1}) + L_{in,2}
R_{out,2} = f(R_{in,1}, L_{in,1}) + R_{in,2}        (1)

where

f(a, b) = \ln\frac{1 + e^{a+b}}{e^{a} + e^{b}}.

For initialization, all messages are set to zero, except for the first and last stage, where

L_{i,n+1} = L_{i,ch}, and
R_{i,1} = L_{max} for all i ∈ Ā, and R_{i,1} = 0 otherwise,        (2)

with L_max denoting the clipping value of the decoder (in theory: L_max → ∞), as all values within the simulation are clipped to lie within (−L_max, L_max). This prevents numerical instabilities.
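A compact numpy sketch of the PE update (1), the function f(a, b) and the channel LLR computation with clipping follows; the function names and the concrete value of L_max are our own choices and only meant as an illustration:

```python
import numpy as np

L_MAX = 20.0  # clipping value L_max (in theory L_max -> infinity)

def clip(x):
    """All values in the simulation are clipped to (-L_max, L_max)."""
    return np.clip(x, -L_MAX, L_MAX)

def f(a, b):
    """f(a, b) = ln((1 + e^(a+b)) / (e^a + e^b))."""
    return clip(np.log1p(np.exp(a + b)) - np.log(np.exp(a) + np.exp(b)))

def pe_update(L_in1, L_in2, R_in1, R_in2):
    """One processing element (PE) update according to Eq. (1)."""
    L_out1 = f(L_in1, L_in2 + R_in2)
    R_out1 = f(R_in1, L_in2 + R_in2)
    L_out2 = clip(f(R_in1, L_in1) + L_in2)
    R_out2 = clip(f(R_in1, L_in1) + R_in2)
    return L_out1, L_out2, R_out1, R_out2

def channel_llr(y, sigma2):
    """Channel LLRs for BPSK over AWGN: LLR(y) = 2y / sigma^2."""
    return clip(2.0 * y / sigma2)
```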
A. Partitionable Codes

As opposed to other random-like channel codes with close-to-capacity performance, polar codes exhibit a very regular (algebraic) structure. It is instructive to realize that the encoding graph, as visualized in Fig. 1 for N = 8, can be partitioned into independent sub-graphs [13], [16], i.e., there is no interconnection in the first log_2(N_p) stages, where N_p denotes the number of bits per sub-block. We define a partitionable code in the sense that each sub-block can be decoded independently (i.e., no interconnections between sub-blocks exist within these first stages). This algorithm can be adopted for all partitionable codes and is not necessarily limited to polar codes. Each sub-graph (in the following called sub-block) is now coupled with the other sub-blocks only via the remaining polar stages, as depicted in Fig. 1. In order to simplify polar decoding, several sub-blocks B_i can now be decoded on a per-sub-block basis [13]. The set of frozen bit positions Ā needs to be split into sub-sets Ā_i corresponding to the sub-blocks B_i (with information vector u_i and sub-code parameters N_i, k_i and A_i, cf. Fig. 3).

The BER performance gap between a NND and MAP decoding is shown in Fig. 2. It illustrates that learning to decode is limited through exponential complexity as the number of information bits in the codewords increases.
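A minimal sketch of this split, under the assumption (suggested by Fig. 1) that sub-block B_i covers N_p consecutive positions of u; the function name and the example frozen set are purely illustrative:

```python
def split_frozen_set(frozen, N, Np):
    """Split the global frozen set A-bar into per-sub-block sets A-bar_i,
    assuming sub-block i covers the consecutive input positions [i*Np, (i+1)*Np)."""
    M = N // Np                              # number of partitions
    frozen = set(frozen)
    parts = []
    for i in range(M):
        lo = i * Np
        parts.append(sorted(a - lo for a in frozen if lo <= a < lo + Np))
    return parts                             # A-bar_i in sub-block-local indexing

# Example for N = 8, M = 2 (Np = 4), with a hypothetical frozen set:
print(split_frozen_set(frozen=[0, 1, 2, 4], N=8, Np=4))
# -> [[0, 1, 2], [0]], i.e., k_1 = 1 and k_2 = 3 information bits
```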
Fig. 3: Partitioned neural network polar decoding structure.

Fig. 4: Pipelined implementation of the proposed PNN decoder.
Each sub-block is decoded by its own NND, which replaces the first BP stages up to decoder stage n_NN (see Fig. 1). Then, the first NN decoder estimates the received sub-block NN_1. After having decoded the first sub-block, the results are propagated via the conventional BP decoding algorithm through the remaining coupling stages (see Fig. 1 and 3). Thus, this algorithm is block-wise sequential, where the M sub-blocks NN_i are sequentially decoded from top to bottom. The detailed decoding process works as follows (a schematic sketch is given after the list):

1) Initialize stage n + 1 with the received channel LLR values.
2) Update stages n_NN to n + 1 according to the BP algorithm rules (propagate LLRs).
3) Decode the next sub-block (top to bottom).
4) Re-encode the results and treat them as perfectly known, i.e., as frozen bits.
5) If not all sub-blocks are decoded, go to step 2.
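A schematic sketch of this one-shot, block-wise sequential procedure is given below; bp_update, nn_decoders and reencode are placeholders for the conventional coupling BP stages, the trained per-sub-block NNDs and the sub-block re-encoder, and the single L/R vector pair is a deliberate simplification of the per-stage BP messages:

```python
import numpy as np

L_MAX = 20.0  # LLR clipping value

def pnn_decode(llr_ch, partitions, bp_update, nn_decoders, reencode):
    """Schematic one-shot partitioned NN decoding (steps 1-5 above).

    llr_ch      : channel LLRs at stage n+1 (length N)
    partitions  : list of index lists, one per sub-block, ordered top to bottom
    bp_update   : callable (L, R) -> (L, R), the conventional coupling BP stages
    nn_decoders : list of callables mapping a sub-block LLR vector to bit estimates
    reencode    : callable mapping estimated bits u_i to the re-encoded sub-block x_i
    """
    L = np.asarray(llr_ch, dtype=float)        # step 1: initialize with channel LLRs
    R = np.zeros_like(L)                       # R-messages (frozen bits would get +L_MAX)
    u_hat = []
    for i, idx in enumerate(partitions):
        L, R = bp_update(L, R)                 # step 2: propagate LLRs via BP rules
        u_i = nn_decoders[i](L[idx] + R[idx])  # step 3: NND estimates sub-block i
        u_hat.append(u_i)
        x_i = np.asarray(reencode(u_i))        # step 4: re-encode the estimate and
        R[idx] = np.where(x_i == 0, L_MAX, -L_MAX)  # pin it as perfectly known ("frozen")
        R = np.clip(R, -L_MAX, L_MAX)
    return u_hat                               # step 5: loop runs until all blocks are done
```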
The interface between the NND and the BP stages depends on the trained input format of the NN. Typically, the NN prefers normalized input values between 0 and 1. Fortunately, it was observed in [11] that the NN can handle both input formats and effective training is possible.

In summary, the system itself can be modeled as one large NN as well. Each BP update iteration defines additional layers, which are deterministic and thus do not affect the training complexity, similar to regularization layers [18]. This finally leads to a pipelined structure as depicted in Fig. 4. As each NN is only passed once, and to emphasize the difference compared to iterative decoding, we term this kind of decoding one-shot-decoding.
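A toy sketch of this view, in which deterministic BP layers and trained NN sub-decoders are simply composed into a single feed-forward pass; the layer callables are placeholders, not an actual implementation:

```python
def one_shot_pass(llr_ch, layers):
    """Apply all (deterministic BP and trained NN) layers exactly once, in order;
    there are no decoding iterations."""
    state = llr_ch
    for layer in layers:
        state = layer(state)
    return state

# e.g. layers = [bp_stage, nnd_1, bp_stage, nnd_2, ..., bp_stage, nnd_M]
```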
A. Further Optimizations

As can be seen in Fig. 2, the limiting parameter is the number of information bits per sub-block k_i, since it defines the number of possible estimates of the NND. One further improvement in terms of sub-block size can be achieved by merging multiple equally-sized sub-blocks such that Σ_i k_i < k_max, leading to unequal sub-block sizes. This helps to obtain as few sub-blocks as possible. The results for M = 8 partitions are shown in Fig. 5.
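As a small, purely illustrative helper (not taken from [13]), the number of information bits k_i per merged sub-block can be computed for a candidate partition such as the one of Tab. I and checked against k_max:

```python
def info_bits_per_partition(A, sizes):
    """Count the information bits k_i falling into each sub-block for a given
    information set A and (possibly unequal) sub-block sizes N_i."""
    A = set(A)
    ks, start = [], 0
    for Ni in sizes:
        ks.append(sum(1 for a in A if start <= a < start + Ni))
        start += Ni
    return ks

# Partition sizes of Tab. I for N = 128; the information set A of the code is not
# listed in this excerpt, so the call below is only indicative:
sizes = [32, 16, 16, 16, 16, 8, 8, 16]
# info_bits_per_partition(A, sizes)  # would yield the k_i row of Tab. I
```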
Additionally, finetuning-learning can be applied to the overall network in order to adjust the independently trained components to the overall system. This means that the whole decoding setup is used to re-train the system such that the conventional stages are taken into consideration for decoding. This prevents performance degradation due to potentially non-Gaussian NN-input distributions, since a Gaussian input distribution was assumed during training. Such an effect can be observed whenever clipping of the LLRs is involved. However, the basic structure is already fixed and thus only a small amount of the 2^k possible codewords is sufficient for good training results. The required training set is created with the free-running decoder.

The coupling could also be done by an SC stage (as originally proposed in [13]) without requiring additional iterations. However, the BP structure suits the NN structure better, as both algorithms can be efficiently described by a graph and their corresponding edge weights. Thus, the BP algorithm is preferred.

For cyclic redundancy check (CRC) aided decoding (a CRC check over the whole codeword), the CRC check can be split into smaller parts as in [13], where each CRC only protects one sub-block, i.e., the CRC can be considered by the NN decoders and thus straightforwardly learned. However, this requires at least some larger NN sub-blocks, as otherwise the rate loss due to the CRC checks becomes prohibitive; it is thus not considered at the moment.

IV. COMPARISON WITH SCL/BP

In general, a fair comparison with existing solutions is hard, as many possible optimizations need to be considered. We see this idea as an alternative approach, for instance in cases where low latency is required. The BER results for N = 128 and different decoding algorithms are shown in Fig. 5. As shown in Tab. I, the size of the partitions is chosen such that each partition does not contain more than k_max = 12 information bits, which facilitates learning the sub-blocks. If

TABLE I: Number of information bits k_i for each sub-block

sub-block i          |  1 |  2 |  3 |  4 |  5 | 6 | 7 |  8
sub-block size N_i   | 32 | 16 | 16 | 16 | 16 | 8 | 8 | 16
information bits k_i |  1 |  3 | 11 |  5 | 13 | 7 | 8 | 16

[Fig. 5: BER performance of the proposed PNN decoder with M = 8 partitions, PNN8 {32, 16, 16, 16, 16, 8, 8, 16}, compared to BP decoding.]

[Figure: plot over the number of equally sized partitions (1 to 64) for N = 64, 128, 256, 512 and 1024.]