Data Compression Using Adaptive Coding and Partial String Matching

IEEE Transactions on Communications, no. 4, April 1984
Abstract - The recently developed technique of arithmetic coding, in conjunction with a Markov model of the source, is a powerful method of data compression in situations where a linear treatment is inappropriate. Adaptive coding allows the model to be constructed dynamically by both encoder and decoder during the course of the transmission, and has been shown to incur a smaller coding overhead than explicit transmission of the model's statistics. But there is a basic conflict between the desire to use high-order Markov models and the need to have them formed quickly as the initial part of the message is sent. This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.

I. INTRODUCTION

RECENT approaches to data compression have split the problem into two parts: modeling the statistics of the source and transmitting a particular message generated by that source in a small number of bits [19]. For the first part, Markov modeling is generally employed, although the use of language-dependent word dictionaries in data compression has also been explored [21]. In either case the problem of transmitting the model must be faced. The usual procedure is to arrange that when the transmission system is set up, both encoder and decoder share a general model of the sorts of messages that will be sent. The model could take the form of a table of letter or digram frequencies. Alternatively, one could extract appropriate statistics from the message itself. This involves a preliminary scan of the message by the encoder and a preamble to the transmission which informs the decoder of the model statistics. In spite of this overhead, significant improvement can be obtained over conventional compression techniques.

The second part, transmitting a message generated by the source in a small number of bits, is easier. Conceptually, one can simply enumerate all messages which can be generated by the model and allocate a part of the code space to each whose size depends on the message probability. This procedure of enumerative coding [6] unfortunately becomes impractical for models of any complexity. However, the recent invention of arithmetic coding [15] has provided a method which is guaranteed to transmit a message in a number of bits which can be made arbitrarily close to its entropy with respect to the model which is used. The method can be thought of as a generalization of Huffman coding which performs optimally even when the statistics do not have convenient power-of-two relationships to each other. It has been shown to be equivalent to enumerative encoding [5], [18], [19], and gives the same coding efficiency.

There are obvious disadvantages in having the encoder and decoder share a fixed model which governs the coding of all messages. While it may be appropriate in some tightly defined circumstances, such as special-purpose machines for facsimile transmission of documents [9], it will not work well for a variety of different types of message. For example, imagine an encoder embedded in a general-purpose modem or a computer disk channel. The most appropriate model to use for such general applications may be one of standard mixed-case English text. But the system may have to encode long sequences of upper-case-only text, or program text, or formatted bibliography files, all with statistics quite different from those of the model.

There is clearly a case for basing the model on the statistics of the message which is currently being transmitted. But to do so seems to require a two-pass approach, with a first pass through the message to acquire the statistics and a second for actual transmission. This procedure is quite unsuitable for many applications. Usually, one wishes to begin sending the message before the end of it has been seen. The obvious solution is to arrange that both sender and receiver adapt the model dynamically to the message statistics as the transmission proceeds. This is called "adaptive coding" [12]. It has been shown theoretically [5] that for some models, adaptive coding is never significantly worse than a two-pass approach and can be significantly better. This paper verifies these results in practice for adaptive coding using a Markov model of the source.

But even with adaptive coding, there is still a problem in the initial part of the message because not enough statistical information has been gained for efficient coding. On the one hand, one wishes to use a high-order Markov model to provide as much data compression as possible once appropriate statistics have been gathered. But it takes longer to gather the statistics for a high-order model. So, on the other hand, one wishes to use a low-order model to accelerate the acquisition of frequency counts so that efficient coding can begin sooner in the message. Our solution is to use a "partial match" strategy, where a high-order model is formed but used for lower order predictions in cases when high-order ones are not yet available.

The structure of this paper is as follows. The next section presents the coding method and the partial string match strategy which is used to gain good performance even early in the message. In Section III the results of some experiments with the coding scheme are presented. These experiments use a variety of different sorts of data for compression, and respectable (in some cases exceptionally good) performance is achieved with all of them. Finally, the resources required to use the method in practice are discussed.
An appendix gives a formal definition of how character probabilities are estimated using partial string matching.

Paper approved by the Editor for Data Communication Systems of the IEEE Communications Society for publication without oral presentation. Manuscript received May 7, 1982; revised May 22, 1983. This work was supported by the Natural Sciences and Engineering Research Council of Canada. The authors are with the Department of Computer Science, University of Calgary, Calgary, Alta., Canada T2N 1N4.

II. THE CODING METHOD

Arithmetic Coding

Arithmetic coding has been discussed recently by a number of authors [7], [10], [15], [19]. Imagine a sequence of
TABLE I
SUCCESS OF THE PARTIAL STRING MATCH METHOD ON DIFFERENT KINDS OF MESSAGE

                                                          bits/coded char at order o
data                                chars    bits/char    o=0    o=1    o=2    o=3    o=4    o=5
1. English text                     3,614    7            4.622  4.102  3.891  3.792  3.800  3.838
2. English text                     551,623  7            4.555  3.663  2.939  2.384  2.192  --
3. English text                     44,871   7            4.535  3.692  3.101  2.804  2.772  2.890
4. C source program                 3,913    7            5.309  3.719  2.923  2.795  2.789  2.831
5. Bibliographic data               20,115   7            5.287  3.472  2.951  2.713  2.695  2.766
6. Numeric data in binary format    102,400  8            5.668  5.189  5.527  --     --     --
7. Binary program                   21,505   8            6.024  5.171  4.931  4.902  4.914  4.950
8. Grey-scale picture as
   8-bit pixel values               65,536   8            6.893  5.131  5.803  6.106  6.161  --
9. Grey-scale picture as
   4-bit pixel values               65,536   4            3.870  1.970  1.923  1.943  1.991  --

Escapes calculated using Method A.

We take a pragmatic approach to the problem, and have investigated the performance of two different expressions for the probability of a novel event. One motivation for the experiments described in the next section was to see if there is any clear choice between them in practice. Further experiments are needed to compare them with the other techniques referred to above.

The first technique estimates the probabilities for arithmetic coding by allocating one count to the possibility that some symbol will occur in a context in which it has not been seen before. Let c(φ) denote the number of times that the symbol φ occurs in the current context, for each φ in the coding alphabet A (say, ASCII). Denote by C the total number of times that the context has been seen; that is, C = Σ_{φ∈A} c(φ). Then, for the purpose of arithmetic coding, we estimate the probability of a symbol φ occurring in the same context to be
p(φ) = c(φ)/(C + 1),   c(φ) ≥ 1.

Let q be the number of characters that have occurred in that context, and a be the size of the coding alphabet A. Then there are a − q characters that have not yet occurred in the context. We allocate to each novel character the overall coding probability

p(φ) = 1/((C + 1)(a − q)),   c(φ) = 0.

For convenience we call this technique method A.

The second technique, method B, classes a character in a particular context as novel unless it has already occurred twice. (This is motivated by the consideration that a once-off event may be an error or other anomaly, whereas an event that has occurred twice or more is likely to be repeated further.) The probability of an event which has occurred more than once in the context is then estimated to be

p(φ) = (c(φ) − 1)/C,   c(φ) ≥ 2.

The escape probability is therefore q/C.
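The two estimators just described can be stated compactly as code. This is a minimal sketch for a single context; the function names and the dictionary representation of the counts c(φ) are ours, not the paper's.

```python
# Escape-probability estimators for one context. `counts` maps each
# symbol seen in the context to c(phi); `alphabet_size` is a.

def method_a(counts, alphabet_size):
    """Method A: one extra count is reserved for novel symbols."""
    C = sum(counts.values())
    q = len(counts)
    novel = alphabet_size - q          # characters not yet seen here
    p = {s: c / (C + 1) for s, c in counts.items()}
    escape = 1 / (C + 1)               # total mass given to novel symbols
    per_novel = escape / novel if novel else 0.0
    return p, escape, per_novel

def method_b(counts):
    """Method B: a symbol counts as seen only after its second occurrence."""
    C = sum(counts.values())
    q = len(counts)                    # each distinct symbol donates one
                                       # count to the escape
    p = {s: (c - 1) / C for s, c in counts.items() if c >= 2}
    escape = q / C
    return p, escape

counts = {"e": 3, "t": 2, "x": 1}      # c(phi) in some context, C = 6
pa, esc_a, _ = method_a(counts, 128)
pb, esc_b = method_b(counts)
assert abs(sum(pa.values()) + esc_a - 1.0) < 1e-9
assert abs(sum(pb.values()) + esc_b - 1.0) < 1e-9
```

Note how both estimators hand out exactly probability 1: method A over the C + 1 counts, method B because the q "first occurrences" are surrendered to the escape.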
The fourth sample is an actual computer program in source form, written in the language C [11], including some comments and (newline) characters. The fifth is a short extract from a bibliography file, which contains authors' names, titles, and reference details in a structured manner suitable for computer indexing and keyword retrieval [14]. These first five samples are all represented as ASCII text, requiring 7 bits/character. The next three samples each use 8 bits/character. Sample 6

TABLE II
COMPARISON OF OPTIMUM COMPRESSION ACHIEVED BY ESCAPE TECHNIQUES

                        Method A          Method B          Improvement
data                    o   bits/char     o   bits/char     of B over A
1. English text         3   3.792         4   3.542          4.1%
2. English text         4   2.192         4   2.198         -0.3%
3. English text         4   2.777         5   2.746          0.9%
4. C source program     4   2.789         4   2.937         -5.0%
Fig. 3. Cumulative coding performance plotted against time (characters x 1000, 10 to 500), for sample 2.
[Two tables are garbled beyond recovery at this point in the copy. One lists, for each of the nine data samples, the number of nodes in the context tree at model orders o = 0 through 4 (for example, sample 2, 551,623 characters of English text, grows from 94 nodes at o = 0 to 154,077 at o = 4). The other is a worked example giving the counts c_o(φ) for the symbols a through e at each order o, with the resulting probabilities p_o(φ) shown in brackets and the order −1 model assigning one count to every symbol.]
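The fallback mechanism behind that worked example can be sketched as follows. This is our illustration, not the paper's code (and it omits refinements such as excluding already-predicted symbols): the coder predicts with the longest context whose statistics exist, codes an escape (here with method A's 1/(C + 1)) whenever the symbol is unseen there, and drops to the next shorter context, bottoming out at an order −1 model in which all a symbols are equiprobable.

```python
# Partial string match fallback (illustrative sketch). `models` maps
# (order, context_string) to a dict of symbol counts for that context.

def predict(symbol, history, models, alphabet_size, max_order):
    """Return the (label, probability) factors that would be coded."""
    factors = []
    for order in range(max_order, -1, -1):
        if order > len(history):
            continue
        ctx = history[len(history) - order:]
        counts = models.get((order, ctx), {})
        C = sum(counts.values())
        if symbol in counts:                     # method A estimate
            factors.append(("sym@%d" % order, counts[symbol] / (C + 1)))
            return factors
        if C:                                    # context seen, symbol novel
            factors.append(("esc@%d" % order, 1 / (C + 1)))
        # a context never seen before costs nothing: both encoder and
        # decoder know it is empty and skip it silently
    factors.append(("sym@-1", 1 / alphabet_size))
    return factors

models = {(1, "a"): {"b": 1}, (0, ""): {"a": 1, "b": 1}}
factors = predict("c", "ab", models, alphabet_size=4, max_order=1)
# "c" was never seen: escape out of the order-0 context, then code it
# under the uniform order -1 model
```

The product of the emitted factors is the probability actually fed to the arithmetic coder, so early in the message, when high-order contexts are empty, the cost degrades gracefully toward the low-order statistics.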
REFERENCES

[1] L. R. Bahl et al., "Recognition of a continuously read natural corpus," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tulsa, OK, Apr. 1978, pp. 418-424.
[2] J. G. Cleary, "An associative and impressible computer," Ph.D. dissertation, Univ. Canterbury, Christchurch, New Zealand, 1980.
[3] J. G. Cleary, "Compact hash tables," IEEE Trans. Comput., to be published.
[4] J. G. Cleary, "Representing trees and tries without pointers," Dep. Comput. Sci., Univ. Calgary, Calgary, Alta., Canada, Res. Rep. 82/102/21, Sept. 1982.
[5] J. G. Cleary and I. H. Witten, "Arithmetic, enumerative and adaptive coding," IEEE Trans. Inform. Theory, to be published.
[6] T. M. Cover, "Enumerative source encoding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 73-77, Jan. 1973.
[7] M. Guazzo, "A general minimum-redundancy source-coding algorithm," IEEE Trans. Inform. Theory, vol. IT-26, pp. 15-25, Jan. 1980.
[8] C. W. Harrison, "Experiments with linear prediction in television," Bell Syst. Tech. J., pp. 764-783, July 1952.
[9] R. Hunter and A. H. Robinson, "International digital facsimile coding standards," Proc. IEEE, vol. 68, pp. 854-867, July 1980.
[10] C. B. Jones, "An efficient coding system for long source sequences," IEEE Trans. Inform. Theory, vol. IT-27, pp. 280-291, May 1981.
[11] B. W. Kernighan and D. M. Ritchie, The C Programming Language. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[12] G. G. Langdon and J. Rissanen, "Compression of black-white images with arithmetic coding," IEEE Trans. Commun., vol. COM-29, pp. 858-867, June 1981.
[13] G. G. Langdon, "A note on the Ziv-Lempel model for compressing individual sequences," IEEE Trans. Inform. Theory, vol. IT-29, pp. 284-287, Mar. 1983.
[14] M. E. Lesk, "Refer," in Unix Programmer's Manual. Murray Hill, NJ: Bell Lab., 1979.
[15] R. Pasco, "Source coding algorithms for fast data compression," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1976.
[16] C. S. Peirce, "The probability of induction," in The World of Mathematics, vol. 2, J. R. Newman, Ed. New York: Simon and Schuster, 1956, pp. 1341-1354.
[17] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Inform. Theory, vol. IT-13, pp. 536-551, Oct. 1967.
[18] J. J. Rissanen, "Arithmetic codings as number representations," Acta Polytech. Scandinavica, vol. Math. 31, pp. 44-51, 1979.
[19] J. J. Rissanen and G. G. Langdon, "Universal modeling and coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, Jan. 1981.
[20] M. F. Roberts, "Local order estimating Markovian analysis for noiseless source coding and authorship identification," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1982.
[21] H. E. White, "Printed English compression by dictionary encoding," Proc. IEEE, vol. 55, pp. 390-396, 1967.
[22] I. H. Witten, "Approximate, non-deterministic modelling of behaviour sequences," Int. J. General Syst., vol. 5, pp. 1-12, Jan. 1979.
[23] I. H. Witten, Principles of Computer Speech. London, England: Academic, 1982.
[24] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, Sept. 1978.