CDI15-04 - Arithmetic Coding
The case $N = 1$
For messages of length 1 (any one of the symbols $a_j$ of $\mathcal{A}$), we set
$$S_{a_j} = [\,p_1 + \cdots + p_{j-1},\; p_1 + \cdots + p_{j-1} + p_j\,) = [\sigma_{j-1}, \sigma_j),$$
where we define $\sigma_j = p_1 + \cdots + p_{j-1} + p_j$, $j = 1, \dots, n$, with the convention $\sigma_0 = 0$
(we say that $(\sigma_1, \dots, \sigma_n)$ is the cumulative probability distribution). Thus we have
$$l_{a_j} = \sigma_{j-1}, \quad h_{a_j} = \sigma_j, \quad h_{a_j} - l_{a_j} = p_j.$$
From the definitions it follows that $0 < p_1 = \sigma_1 < \cdots < \sigma_n = 1$ and hence
that the segments $S_{a_j}$ cover $[0,1)$ and are pairwise disjoint. See the left
column of the illustration on the next page for an example in which $n = 3$.
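The construction for length-1 messages can be sketched in a few lines of Python. This is only an illustration: the function name `symbol_segments` is hypothetical (it is not taken from cdi.py), and exact fractions are used instead of floats so that the endpoints are not affected by rounding.

```python
from fractions import Fraction
from itertools import accumulate

def symbol_segments(probs):
    """Map each symbol a_j to its segment [sigma_{j-1}, sigma_j).

    `probs` is a dict symbol -> probability, in the order a_1, ..., a_n.
    """
    symbols = list(probs)
    # sigma[0] = 0 and sigma[j] = p_1 + ... + p_j (cumulative distribution).
    sigma = [Fraction(0)] + list(
        accumulate(Fraction(str(probs[s])) for s in symbols)
    )
    return {s: (sigma[j], sigma[j + 1]) for j, s in enumerate(symbols)}

# A source with n = 3 symbols.
segments = symbol_segments({"a": 0.2, "b": 0.5, "c": 0.3})
for s, (low, high) in segments.items():
    print(s, float(low), float(high))
```

The segments are consecutive and the last one ends at 1, which is the covering and disjointness property stated above.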
Example. Before considering the general construction, we first describe it
in a simple example. We will see how to obtain the segment $S_M$ for the
message $babc$ produced by the source {'a': 0.2, 'b': 0.5, 'c': 0.3}.
Now the interval $S_{bab}$ is obtained in a similar way: subdivide $S_{ba}$ into
intervals that are proportional to $p(a)$, $p(b)$ and $p(c)$ and choose as $S_{bab}$
the segment corresponding to $b$. Explicitly, since $S_{ba}$ has lower endpoint 0.2 and length 0.1, we have
$$h_{baa} = l_{bab} = 0.2 + 0.1 \times \sigma(a) = 0.22,$$
$$h_{bab} = l_{bac} = 0.2 + 0.1 \times \sigma(b) = 0.27,$$
$$h_{bac} = 0.2 + 0.1 \times \sigma(c) = 0.30,$$
and hence
$$S_{bab} = [l_{bab}, h_{bab}) = [0.22, 0.27).$$
Finally we get, following the same procedure with $S_{bab}$,
$$S_{babc} = [\,0.22 + 0.05 \times \sigma(b),\; 0.22 + 0.05 \times \sigma(c)\,) = [0.255, 0.270).$$
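The successive subdivisions of this example can be reproduced with a short Python sketch. It reads the four-symbol message as "babc", which is what the endpoint values above imply. The helper name `message_segment` is an assumption made for illustration, and exact fractions stand in for the real numbers of the construction.

```python
from fractions import Fraction

def message_segment(message, probs):
    """Return (l, h) such that S_M = [l, h) for the message M."""
    p = {s: Fraction(str(q)) for s, q in probs.items()}
    # lower[s] is the cumulative probability of the symbols preceding s.
    lower, cum = {}, Fraction(0)
    for s in p:
        lower[s] = cum
        cum += p[s]
    l, width = Fraction(0), Fraction(1)
    for s in message:
        # Keep the part of the current segment that corresponds to s.
        l = l + width * lower[s]
        width = width * p[s]
    return l, l + width

source = {"a": 0.2, "b": 0.5, "c": 0.3}
for prefix in ("b", "ba", "bab", "babc"):
    l, h = message_segment(prefix, source)
    print(prefix, float(l), float(h))
```

The last line printed corresponds to the segment [0.255, 0.270) obtained above, and the length of each segment is the product of the probabilities of its symbols.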
The binary unit segment
[Figure: the binary unit segment, with the points 0.010000, 0.010001 and 0.011001 marked]
Example. If P=[('a', 0.25), ('b', 0.4), ('c', 0.15), ('d', 0.1), ('e', 0.1)] and
M="badbbdcbabea",
then the arithmetic encoding AE(M,P) is
C=[12,"0101010110111011011100101"]
and AD(C,P) is "badbbdcbabea".
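To make the decoding direction concrete, here is a minimal sketch of a decoder. It rests on an assumption: that the bit string in C is read as the binary fraction 0.b1b2... and that this point lies in the segment of the message. The function name `decode_point` is hypothetical, and the actual AD in cdi.py may be organized differently.

```python
from fractions import Fraction

def decode_point(code, P):
    """Decode code = [N, bits] for the source P (list of (symbol, prob))."""
    n_symbols, bits = code
    # Value of the binary fraction 0.bits, as an exact rational number.
    x = Fraction(int(bits, 2), 2 ** len(bits))
    l, width = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(n_symbols):
        cum = Fraction(0)
        for symbol, q in P:
            p = Fraction(str(q))
            low = l + width * cum
            high = low + width * p
            if low <= x < high:
                # x falls in the sub-segment of this symbol: emit it
                # and continue inside that sub-segment.
                decoded.append(symbol)
                l, width = low, width * p
                break
            cum += p
        else:
            raise ValueError("the point does not lie in the current segment")
    return "".join(decoded)

P = [("a", 0.25), ("b", 0.4), ("c", 0.15), ("d", 0.1), ("e", 0.1)]
C = [12, "0101010110111011011100101"]
print(decode_point(C, P))
```

The message length 12 stored in C is what tells the decoder when to stop, since the point itself belongs to the segments of arbitrarily long continuations.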
Remark. This description of arithmetic encoding and decoding would work if
floats had unlimited precision, but in Pyzo it only works for short messages (of
the order of 54 encoded bits), because floats have about 16 significant decimal
digits, which amounts to 16*log(10)/log(2) ≃ 53.15 bits.
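The precision limit can be observed directly in Python. The sketch below is only an illustration under stated assumptions: it encodes a message made of one repeated symbol with plain floats and reports after how many symbols the lower endpoint stops changing. The function name `float_stall_length` is hypothetical.

```python
import math
import sys

# Number of bits in the significand of a Python float.
print(sys.float_info.mant_dig)
# Bits carried by 16 decimal digits, as in the remark.
print(16 * math.log2(10))

def float_stall_length(p_low, p_width, max_n=1000):
    """Encode one symbol repeatedly with floats.

    p_low is the cumulative probability below the symbol and p_width its
    probability. Returns the first step at which the lower endpoint no
    longer changes, or None if that does not happen within max_n steps.
    """
    l, width = 0.0, 1.0
    for n in range(1, max_n + 1):
        new_l = l + width * p_low
        width *= p_width
        if new_l == l:
            return n
        l = new_l
    return None

# Symbol 'b' of the source P above: cumulative 0.25, probability 0.4.
print(float_stall_length(0.25, 0.4))
```

Once the lower endpoint stops moving, further symbols no longer influence the computed interval, so they cannot be recovered by a float-based decoder.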
and therefore
$$\frac{n_{k+1}+1}{2^{k+1}} < a + \frac{1}{2^{k+1}} + \frac{1}{2^{k+1}} = a + \frac{1}{2^{k}} \le a + (b - a) = b.$$
So in this case we can take the binary representation of $n_{k+1}$ as the bit
encoding of $[a, b)$.
This can be easily programmed into an alternative bit encoder of intervals.
Note. If $x$ is the binary string representing $n/2^k$, the function segment(x)
in cdi.py returns the endpoints of $I_k$.
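One possible way to program such a bit encoder is sketched below. It assumes that $k$ is the smallest integer with $1/2^k \le b - a$ and that $n_{k+1}$ is the smallest integer with $n_{k+1}/2^{k+1} \ge a$, which is what the inequality above requires. The names `encode_interval` and `dyadic_interval` are hypothetical and are not the functions of cdi.py.

```python
from fractions import Fraction
from math import ceil

def encode_interval(a, b):
    """Return a bit string whose dyadic interval lies inside [a, b)."""
    a, b = Fraction(a), Fraction(b)
    if not 0 <= a < b <= 1:
        raise ValueError("expected 0 <= a < b <= 1")
    # Smallest k such that 1/2^k <= b - a.
    k = 0
    while Fraction(1, 2 ** k) > b - a:
        k += 1
    # Smallest n with n / 2^(k+1) >= a.
    n = ceil(a * 2 ** (k + 1))
    # Binary representation of n, padded to k + 1 bits.
    return format(n, "b").zfill(k + 1)

def dyadic_interval(bits):
    """Endpoints of the interval [n/2^k, (n+1)/2^k) named by `bits`."""
    n, k = int(bits, 2), len(bits)
    return Fraction(n, 2 ** k), Fraction(n + 1, 2 ** k)

bits = encode_interval(Fraction("0.255"), Fraction("0.27"))
low, high = dyadic_interval(bits)
print(bits, float(low), float(high))
```

The output has $k + 1$ bits, that is, $\lceil \log_2 (1/(b-a)) \rceil + 1$ bits, and every binary fraction starting with these bits lies in $[a, b)$.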
From the analysis above, it is easy to show that arithmetic coding becomes
practically optimal for long messages. Indeed, for messages $M$ of length $N$,
the average length is $\bar{\ell}_N = \sum_M P(M)\,\ell(M)$, where $\ell(M)$ is the number of
bits of the arithmetic encoding of $M$. Since we have just seen that
$$\ell(M) \le \left\lceil \log_2 \frac{1}{P(M)} \right\rceil + 1 \le \log_2 \frac{1}{P(M)} + 2,$$
we get
$$H_N \le \bar{\ell}_N \le H_N + 2,$$
where $H_N$ stands for the entropy of messages of length $N$ (the first
inequality holds by Shannon's theorem). Now $H_N = NH$, where $H$ is the entropy of
the symbol source (an easy calculation using that
$P(a_{j_1} \cdots a_{j_N}) = \prod_k p_{j_k}$). Therefore
$$H \le \bar{\ell} \le H + \frac{2}{N},$$
where $\bar{\ell} = \bar{\ell}_N / N$ is the average length per symbol, and this means that $\bar{\ell} \approx H$ for
large $N$.
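The bound can be checked numerically for a small source. The sketch below makes a simplifying assumption: it takes $\ell(M)$ to be exactly $\lceil \log_2 (1/P(M)) \rceil + 1$, the upper bound used above, and enumerates all messages of length $N$. The function names are hypothetical, and the source is the three-symbol one of the earlier example.

```python
from fractions import Fraction
from itertools import product
from math import log2, prod

def entropy(probs):
    """Entropy H of the symbol source, in bits per symbol."""
    return -sum(float(p) * log2(float(p)) for p in probs.values())

def code_length_bound(p):
    """ceil(log2(1/p)) + 1, computed exactly for a Fraction 0 < p <= 1."""
    k = 0
    while p * 2 ** k < 1:
        k += 1
    return k + 1

def average_length(probs, N):
    """Average of the length bound over all messages of length N."""
    total = Fraction(0)
    for message in product(probs, repeat=N):
        p = prod(probs[s] for s in message)  # P(M) is a product of p_j
        total += p * code_length_bound(p)
    return total

probs = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}
H = entropy(probs)
for N in (1, 2, 4, 8):
    per_symbol = float(average_length(probs, N)) / N
    print(N, H, per_symbol, H + 2 / N)
```

Each printed line shows $H$, the average length per symbol and $H + 2/N$. The middle value stays between the other two and approaches $H$ as $N$ grows.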
Note that if sending the encoding of messages $M$ of length $N$ includes a
fixed overhead (like sending the number $N$ as part of the encoding), then
the conclusion still holds (the 2 above is replaced by some constant $K_N$
such that $K_N/N \to 0$ as $N$ grows).
$u = h - l + 1$
$h = l + (u \cdot \sigma'[x])/\sigma' - 1$
$l = h - (u \cdot F'[x])/\sigma'$
Decoding algorithm
$x$ = encoding value (string)
$l = 0$; $h = \sigma'$
$u = h - l + 1$
$h = l + (u \cdot \sigma'[x])/\sigma' - 1$
$l = h - (u \cdot F'[x])/\sigma'$