
IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. COM-32, NO. 4, APRIL 1984

Data Compression Using Adaptive Coding and Partial String Matching

JOHN G. CLEARY AND IAN H. WITTEN, MEMBER, IEEE

Paper approved by the Editor for Data Communication Systems of the IEEE Communications Society for publication without oral presentation. Manuscript received May 7, 1982; revised May 22, 1983. This work was supported by the Natural Sciences and Engineering Research Council of Canada. The authors are with the Department of Computer Science, University of Calgary, Calgary, Alta., Canada T2N 1N4.

0090-6778/84/0400-0396$01.00 © 1984 IEEE

Abstract: The recently developed technique of arithmetic coding, in conjunction with a Markov model of the source, is a powerful method of data compression in situations where a linear treatment is inappropriate. Adaptive coding allows the model to be constructed dynamically by both encoder and decoder during the course of the transmission, and has been shown to incur a smaller coding overhead than explicit transmission of the model's statistics. But there is a basic conflict between the desire to use high-order Markov models and the need to have them formed quickly as the initial part of the message is sent. This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.

I. INTRODUCTION

RECENT approaches to data compression have split the problem into two parts: modeling the statistics of the source and transmitting a particular message generated by that source in a small number of bits [19]. For the first part, Markov modeling is generally employed, although the use of language-dependent word dictionaries in data compression has also been explored [21]. In either case the problem of transmitting the model must be faced. The usual procedure is to arrange that when the transmission system is set up, both encoder and decoder share a general model of the sorts of messages that will be sent. The model could take the form of a table of letter or digram frequencies. Alternatively, one could extract appropriate statistics from the message itself. This involves a preliminary scan of the message by the encoder and a preamble to the transmission which informs the decoder of the model statistics. In spite of this overhead, significant improvement can be obtained over conventional compression techniques.

The second part, transmitting a message generated by the source in a small number of bits, is easier. Conceptually, one can simply enumerate all messages which can be generated by the model and allocate a part of the code space to each, whose size depends on the message probability. This procedure of enumerative coding [6] unfortunately becomes impractical for models of any complexity. However, the recent invention of arithmetic coding [15] has provided a method which is guaranteed to transmit a message in a number of bits which can be made arbitrarily close to its entropy with respect to the model which is used. The method can be thought of as a generalization of Huffman coding which performs optimally even when the statistics do not have convenient power-of-two relationships to each other. It has been shown to be equivalent to enumerative encoding [5], [18], [19], and gives the same coding efficiency.

There are obvious disadvantages in having the encoder and decoder share a fixed model which governs the coding of all messages. While it may be appropriate in some tightly defined circumstances, such as special-purpose machines for facsimile transmission of documents [9], it will not work well for a variety of different types of message. For example, imagine an encoder embedded in a general-purpose modem or a computer disk channel. The most appropriate model to use for such general applications may be one of standard mixed-case English text. But the system may have to encode long sequences of upper-case-only text, or program text, or formatted bibliography files, all with statistics quite different from those of the model.

There is clearly a case for basing the model on the statistics of the message which is currently being transmitted. But to do so seems to require a two-pass approach, with a first pass through the message to acquire the statistics and a second for actual transmission. This procedure is quite unsuitable for many applications. Usually, one wishes to begin sending the message before the end of it has been seen. The obvious solution is to arrange that both sender and receiver adapt the model dynamically to the message statistics as the transmission proceeds. This is called "adaptive coding" [12]. It has been shown theoretically [5] that for some models, adaptive coding is never significantly worse than a two-pass approach and can be significantly better. This paper verifies these results in practice for adaptive coding using a Markov model of the source.

But even with adaptive coding, there is still a problem in the initial part of the message because not enough statistical information has been gained for efficient coding. On the one hand, one wishes to use a high-order Markov model to provide as much data compression as possible once appropriate statistics have been gathered. But it takes longer to gather the statistics for a high-order model. So, on the other hand, one wishes to use a low-order model to accelerate the acquisition of frequency counts so that efficient coding can begin sooner in the message. Our solution is to use a "partial match" strategy, where a high-order model is formed but used for lower order predictions in cases when high-order ones are not yet available.

The structure of this paper is as follows. The next section presents the coding method and the partial string match strategy which is used to gain good performance even early in the message. In Section III the results of some experiments with the coding scheme are presented. These experiments use a variety of different sorts of data for compression, and respectable, in some cases exceptionally good, performance is achieved with all of them. Finally, the resources required to use the method in practice are discussed. An appendix gives a formal definition of how character probabilities are estimated using partial string matching.

II. THE CODING METHOD

Arithmetic Coding

Arithmetic coding has been discussed recently by a number of authors [7], [10], [15], [19].



Imagine a sequence of symbols $X_1 X_2 \cdots X_N$ that is to be encoded as another sequence $Y_1 Y_2 \cdots Y_M$. After a sequence $X_1 \cdots X_{i-1}$, the possible values for $X_i$ are ordered arbitrarily, say $w_1 \cdots w_m$, and assigned probabilities $p(w_1) \cdots p(w_m)$. The encoding $Y$ is computed iteratively using only these probabilities at each step.

Arithmetic coding has some interesting properties which are important in what follows.

• The symbol probabilities $p(w_k)$ may be different at each step. Therefore, a wide range of algorithms can be used to compute the $p(w_k)$, independent of the arithmetic coding method itself.

• It is efficient and can be implemented using a small fixed number of finite arithmetic operations as each symbol of $X$ is processed.

• The output code length is determined by the probabilities of the symbols $X_i$ and can arbitrarily closely approximate the sum $\sum_i -\log p(X_i)$ if arithmetic of sufficient accuracy is used.
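To make these properties concrete, the following Python sketch narrows an interval by each symbol's probability, using exact rationals in place of the finite-precision arithmetic referred to above. It is an idealized illustration, not the authors' implementation; the model is passed in as a function so that, as the first property allows, the probabilities may change at every step.

    from fractions import Fraction
    from math import ceil, log2

    def encode(message, model):
        # [low, low + width) shrinks as each symbol selects its sub-interval
        low, width = Fraction(0), Fraction(1)
        for i, x in enumerate(message):
            p = model(message[:i])  # probabilities may depend on the preceding symbols
            for w in sorted(p):     # sub-intervals in an order the decoder can reproduce
                if w == x:
                    width *= p[w]
                    break
                low += width * p[w]
        # emit enough bits to name a binary fraction all of whose extensions
        # lie inside the final interval: about -log2(width) bits, i.e. the
        # entropy of the message with respect to the model
        n = ceil(-log2(width)) + 1
        return format(int((low + width / 2) * 2 ** n), f"0{n}b")

    # usage with a fixed (non-adaptive) model over a two-character alphabet
    print(encode("aab", lambda prefix: {"a": Fraction(2, 3), "b": Fraction(1, 3)}))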
Adaptive Transmission of the Model

Arithmetic coding commonly uses fixed probabilities, contingent on the current context, which are derived from statistical analysis of text. However, our method derives the probabilities from the message itself. Furthermore, because encoding is done in a single pass through the message, the statistics are gathered from the preceding portion of the message only. Thus, they are continually changing with time as the transmission proceeds. Such an adaptive strategy has been used by Langdon and Rissanen [12] with fixed-order Markov models. Roberts [20] also discusses similar techniques applied to encoding and authorship identification of English text. Ziv and Lempel [24] have proposed a coding technique which involves the adaptive matching of variable length strings. Superficially, this technique appears to be very different from the "arithmetic coding plus adaptive model" approach pursued here. Nevertheless, it has been shown by Langdon [13] that there exists a scheme of this form which is equivalent to Ziv-Lempel coding.

It may at first seem difficult to implement a coding system based upon predictions whose probabilities are changing all the time. However, the problem is not so great as might be imagined, because usually nothing extra need be transmitted to update the probabilities. After all, the decoder, if it is working properly, is seeing exactly the same message sequence as the encoder, and so it can update frequency counts just as easily as can the encoder. It is, of course, necessary that a character is encoded according to the old model, before the counts have been updated to take into account that occurrence of the character. Having encoded a character, the encoder updates its model. Having decoded it, the decoder updates its own model. Assuming error-free transmission (and this is an assumption that is made throughout this paper), the models will always agree, even though explicit details of the models are never transmitted. Appropriate error correction policies, or error detection and retransmission protocols, can be applied to the encoded data to make the probability of undetected errors arbitrarily small.
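The encode-then-update discipline can be summarized in a few lines. This is a sketch only; the model and coder interfaces (probabilities, update, encode, decode) are hypothetical names standing for an adaptive frequency model and an arithmetic coder, not the paper's API.

    def transmit(message, empty_model, coder):
        model = empty_model()                        # both ends start with no statistics
        for ch in message:
            coder.encode(ch, model.probabilities())  # code with the *old* counts...
            model.update(ch)                         # ...then record the occurrence

    def receive(length, empty_model, coder):
        model, out = empty_model(), []
        for _ in range(length):
            ch = coder.decode(model.probabilities())  # identical old counts
            model.update(ch)                          # identical update: models in lockstep
            out.append(ch)
        return "".join(out)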
difficult to imaginearationaleforoptimalchoiceofthis
Markov Modeling with Partial String Matching

The coding scheme uses a Markov model which conditions the probability that a particular symbol will occur on the sequence of characters which immediately precede the symbol. The order of the Markov model, for which we use the symbol o, is the number of characters in the context used for prediction. For example, suppose an order-2 model is selected, and the current symbol sequence is "...#and#the#current#symbol#sequence#i." The next character, which will be denoted by φ, could be any member of the coding alphabet; we use 7 bit ASCII codes in this section. (Of course, the technique is not restricted to this alphabet.) In particular, characters such as (space) are significant. (space) is written here as "#" merely to enhance legibility.

Since the model is of order o = 2, the next character is predicted on the basis of occurrences of trigrams "# i φ" earlier in the message. A scan through an English dictionary shows that φ = "e," "h," "i," "j," "k," "q," "u," "w," "y," and "z" are unlikely in this context, while the high frequency of the word "is" will give φ = "s" a reasonably high probability in this context.

The coding scheme does not use a fixed order. If it did and the order o were large, then predictions would be infrequent until most of the (o + 1)-sequences which actually occur in the message had been seen. When a context occurs in which the following character has not been seen before (for example, on the first occurrence of "# i #"), an escape mechanism is used to transmit the character identity.

Instead of using a fixed length context, both encoder and decoder recognize predictions on the basis of the longest string match between the present context and previously seen ones. This creates no ambiguity because each sees the same message sequence. For example, when the character φ = "s" occurs in the context "# i φ" for the first time, the prediction will be based on the length-1 context "i φ". Thus, if the string "is" has occurred previously in the message, even without a preceding space (as in the word "history"), the coding of the character "s" will be based on this foreshortened context. In essence, both encoder and decoder use an escape mechanism to back down to the previous level. Then the character is encoded at this level, using the order-1 model which is implicit in the stored order-2 one.

If the string has not occurred previously, the context will be further shortened to the empty string. The encoder will use a second escape sequence to inform the decoder of this event. This will cause the character to be predicted on the basis of the order-0 model, that is, on its frequency so far in the message. If, however, the character has never been seen before, so that it is not predicted by the order-0 model, the escape mechanism is used a third time. The actual identity of the character is then transmitted using a probability of 1/128 for each.
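The fallback can be written out from the encoder's side as follows. This is a sketch of the escape walk only: counts is assumed to map each previously seen context to the frequencies of its successor characters (the paper's implementation stores these in a tree), and dist_with_escape and uniform_over_128 are hypothetical helpers standing for the escape-probability calculations described in the next subsection.

    ESCAPE = object()  # hypothetical sentinel, distinct from every real character

    def encode_char(ch, context, counts, o, coder):
        # context is assumed to hold at least the last o characters
        for k in range(o, -1, -1):            # longest context first: o, o-1, ..., 0
            ctx = context[-k:] if k else ""
            seen = counts.get(ctx, {})
            if ch in seen:                    # positively predicted at this level
                coder.encode(ch, dist_with_escape(seen))
                return
            if seen:                          # context known but character novel here:
                coder.encode(ESCAPE, dist_with_escape(seen))  # back down one level
            # a context never seen at all costs nothing; the decoder, holding
            # identical counts, skips that level too
        coder.encode(ch, uniform_over_128())  # never seen anywhere: identity at 1/128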
Escape Probabilities

A case of great interest occurs when the encoder encounters a character in a context where it has never been seen before. For example, suppose φ = "s" occurs for the first time in the context "# i φ"; that is, the word "is," or indeed any other word which begins "is," has not occurred in the message so far. (This, of course, will happen quite frequently in the initial part of the message.) Then it is impossible for the encoder to encode it on the basis of its present model. Notice that we are talking here not just of the first occurrence of a particular character in the message, but of its first occurrence in each possible context.

For each context, one must allocate a probability to the event that a novel character occurs in that context. It is difficult to imagine a rationale for optimal choice of this probability. There has been extensive discussion of this problem by philosophers from at least the time of Kant. Pierce gives an outline of this early work in [16]. Some more modern solutions in the context of Markov models have been summarized by Roberts [20], who refers to this as the "zero frequency problem" (see also [17] and [1, p. 424]). Roberts also proposes his own solution (LOEMA) which takes weighted sums of the probability predictions of models of different orders. As noted by all these authors, in the absence of a priori knowledge, there seems to be no theoretical basis for choosing one solution over another.

We take a pragmatic approach to the problem, and have investigated the performance of two different expressions for the probability of a novel event. One motivation for the experiments described in the next section was to see if there is any clear choice between them in practice. Further experiments are needed to compare them with the other techniques referred to above.

The first technique estimates the probabilities for arithmetic coding by allocating one count to the possibility that some symbol will occur in a context in which it has not been seen before. Let c(φ) denote the number of times that the symbol φ occurs in the context "# i φ" for each φ in the coding alphabet A (say, ASCII). Denote by C the total number of times that the context "# i" has been seen; that is, $C = \sum_{\varphi \in A} c(\varphi)$. Then, for the purpose of arithmetic coding, we estimate the probability of a symbol φ occurring in the same context to be

$$p(\varphi) = \frac{c(\varphi)}{1 + C}.$$

The escape probability, that some character occurs which is novel in that context (one for which c(φ) = 0), is therefore what remains after accounting for all seen characters:

$$e = \frac{1}{1 + C}.$$

Let q be the number of characters that have occurred in that context, and a be the size of the coding alphabet A. Then there are a − q characters that have not yet occurred in the context. We allocate to each novel character the overall coding probability

$$p(\varphi) = \frac{1}{1 + C} \cdot \frac{1}{a - q}, \qquad c(\varphi) = 0.$$

For convenience we call this technique method A.

The second technique, method B, classes a character in a particular context as novel unless it has already occurred twice. (This is motivated by the consideration that a once-off event may be an error or other anomaly, whereas an event that has occurred twice or more is likely to be repeated further.) The probability of an event which has occurred more than once in the context is then estimated to be

$$p(\varphi) = \frac{c(\varphi) - 1}{C}.$$

The escape probability is therefore

$$e = \frac{q}{C}.$$

We allocate to each novel character the overall coding probability

$$p(\varphi) = \frac{q}{C} \cdot \frac{1}{a - q}.$$
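In code, the two estimators differ only in how much probability they reserve for the escape event. The following is a sketch of the formulas above, not the paper's tree-based implementation; counts is assumed to map each character already seen in the given context to its frequency there, and a is the alphabet size.

    from fractions import Fraction

    def method_a(counts, a):
        C, q = sum(counts.values()), len(counts)
        p = {ch: Fraction(c, C + 1) for ch, c in counts.items()}
        escape = Fraction(1, C + 1)   # one count reserved for the novel event
        novel = escape / (a - q)      # shared equally by the a - q unseen characters
        return p, escape, novel

    def method_b(counts, a):
        C, q = sum(counts.values()), len(counts)
        p = {ch: Fraction(c - 1, C) for ch, c in counts.items() if c > 1}
        escape = Fraction(q, C)       # one count of each seen character reserved
        novel = escape / (a - q)
        return p, escape, novel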
A formal definition of the probability calculations used by partial string matching is given in the Appendix, together with an example calculation of the probabilities. This includes an improvement whereby characters predicted by higher order models are neglected when calculating the probabilities of predictions by the lower order models.

III. EXPERIMENTAL PERFORMANCE

The Sample Messages

The adaptive partial string match coding method has been tested on several different kinds of message. Table I summarizes the results, for a few different values of o, the order of the Markov model, using method A.

TABLE I
SUCCESS OF THE PARTIAL STRING MATCH METHOD ON DIFFERENT KINDS OF MESSAGE

    data                                          chars     bits/char   o=0     o=1     o=2     o=3     o=4     o=9
    1. English text                               3,614     7           4.622   4.102   3.891   3.792   3.800   3.838
    2. English text                               551,623   7           4.555   3.663   2.939   2.384   2.192   --
    3. English text                               44,871    7           4.535   3.692   3.101   2.804   2.772   2.890
    4. C source program                           3,913     7           5.309   3.719   2.923   2.795   2.789   2.831
    5. Bibliographic data                         20,115    7           5.287   3.472   2.951   2.713   2.695   2.766
    6. Numeric data in binary format              102,400   8           5.668   5.189   5.527   --      --      --
    7. Binary program                             21,505    8           6.024   5.171   4.931   4.902   4.914   4.950
    8. Grey-scale picture as 8-bit pixel values   65,536    8           6.893   5.131   5.803   6.106   6.161   --
    9. Grey-scale picture as 4-bit pixel values   65,536    4           3.870   1.970   1.923   1.943   1.991   --
    (Entries are bits/coded character; escapes calculated using method A.)

Before discussing these results, however, we should say something about the kinds of message which were used.

The first three samples are English text. All of them use upper and lower case characters in the normal way. Sample 1, the shortest, is an abstract of a technical paper. It includes some formatting controls as well as a (newline) character at the end of each line. Sample 2, the longest, is a complete 11-chapter book [23]. Notice that this sample contains over half a million characters. Prior to coding, we removed the formatting controls and mathematical expressions automatically, which left some rather anomalous gaps in the text. Tabular illustrations were not deleted. Fig. 1 shows a representative part of the text which includes a small table.

[Fig. 1. Example text taken from data sample 3: a passage from [23] on computing the discrete Fourier transform of speech, including the book's Table 4.2 of time domain and frequency domain samples for a 256-point DFT.]

Sample 3 is the first chapter from the book and, thus, forms a subsequence of sample 2: it is included to study how the coding efficiency is improved by exposing the coding scheme to a large, representative sample of text before transmitting the actual message.
The fourth sample is a computer program in source form, written in the language C [11], including some comments and (newline) characters. The fifth is a short extract from a bibliography file, which contains authors' names, titles, and reference details in a structured manner suitable for computer indexing and keyword retrieval [14]. These first five samples are all represented as ASCII text, requiring 7 bits/character.

The next three samples each use 8 bits/character. Sample 6 is a data file containing geophysical information in binary format. Sample 7 is a binary program, compiled from Pascal for the VAX 11/780 computer. Care was taken to remove the symbol table from it before coding so that no English-like text is included (apart from literal strings). Sample 8 is a grey-scale picture, histogram equalized and stored in raster order, with 256 grey levels. Finally, sample 9 is the same picture with pixels truncated to 4 bits (16 grey levels).

Performance of the Coding Scheme

Now we can examine the results of coding with method A which are presented in Table I. These are expressed in terms of bits/coded character. For example, the first line shows that a short English message can be coded in 3.792 bits/character, using the optimal value of o = 3. The penalty paid by choosing o too large is very small, however; with o = 9 only 3.838 bits are needed, about 1.2 percent more. The optimal value of o grows slightly with the length of the message.

The piece of English text in sample 3 has an optimum at o = 4 (although this is not apparent from the table, because the figure for o = 5 is not shown). For the text of sample 2 we were unable to carry the experiment beyond o = 4 for resource reasons. However, we have demonstrated that the method is able to code mixed-case English, including tables and rather arbitrary spaces, below 2.2 bits/character. And this is the average coding performance over the entire message, with both the encoder and decoder starting from scratch with no prior information at all about the likely statistics of the message. Although the table does not show it, the final 90 percent of this sample (half a million characters) was coded in 2.132 bits/character with o = 4. This indicates the performance which can be expected when the coding and decoding modules are primed with a short but representative sample of the kind of English used (55 000 characters in this case). Notice how much better it is than operating in unprimed mode for the short (45 000 character) text of sample 3: 2.772 bits/character at the same value of o = 4.

It is interesting to compare the results for sample 4, the program in source form, with those for sample 1, which is English text of about the same length. For the lowest value of o, o = 0, the coding scheme does not perform as well with the program as it does with English. This is because of the abundance of unusual characters, like "{" and "*", in the program text, leading to a larger effective alphabet. (Anyone who has encountered the C language will assure you that it appears cryptic, especially at first.) However, performance improves with larger values of o, until at o = 4 (which is in fact the optimum for sample 4) the coded message occupies only 2.789 bits/character, 73 percent of that for sample 1. This is because of the more structured form of a program: variables are all declared before they are used and are repeated relatively often, keywords are drawn from a relatively small set, the syntax constrains most operators to occur only after variables and not after keywords, and so on.

Another example of structured information is the bibliographic text file of sample 5. This contains formatted information together with free text in the form of titles, authors' names, and so on. At o = 4, coding with 2.695 bits/character is achieved, better than that obtained on the text file of similar size in sample 3.

Not surprisingly, a much smaller coding gain was obtained with binary data. The geophysical data of sample 6 can be coded in 5.189 bits/byte at the optimal value o = 1. This represents a reduction to 65 percent of the unencoded value of 8 bits/byte, much less than the reduction to 31-54 percent which was achieved for the samples of English (2.192 and 3.792 bits on 7-bit characters, respectively). The compiled program of sample 7 takes 4.902 bits/byte, or 61 percent of the unencoded value. Presumably this is less "noisy" than the geophysical data, which would lead one to suspect that greater gains are possible for it. On the other hand, the coding in which machine-language programs are expressed has been carefully designed to eliminate redundancy.

The grey-scale pictures provide an interesting example. With 8 bit pixels, only 5.131 bits/pixel is achieved (64 percent) at the optimal value o = 1. This value of o is rather low, indicating that little information is obtainable from the context of a pixel. This is not surprising considering that the low-order few bits are undoubtedly very noisy. A linear treatment with the assumption of additive Gaussian noise would probably be much more appropriate for this kind of data. On the other hand, discarding the lower order bits to give a 4 bit pixel eliminates most of this noise, making the coding scheme perform much better: 1.923 bits/pixel, or 48 percent of the unencoded value. We suspect that this may be better than could be achieved using techniques such as linear prediction [8].

Selection of the Escape Probability

We have investigated the use of two algorithms for calculating the escape probability, that is, the probability that a character will occur in a context in which it has not occurred before. The two methods, called A and B, were described above. In practice, we find that there is no clear choice between them. This can be seen in Table II, which compares the best compressions achieved by the two techniques on each of the messages.

TABLE II
COMPARISON OF OPTIMUM COMPRESSION ACHIEVED BY ESCAPE TECHNIQUES

    data                                          Method A         Method B         Improvement of
                                                  o   bits/char    o   bits/char    B over A
    1. English text                               3   3.792        4   3.642        4.1%
    2. English text                               4   2.192        4   2.198        -0.3%
    3. English text                               4   2.772        5   2.746        0.9%
    4. C source program                           4   2.789        4   2.937        -5.0%
    5. Bibliographic data                         4   2.695        4   2.750        -2.0%
    6. Numeric data in binary format              1   5.189        3   4.680        10.9%
    7. Binary program                             3   4.902        3   4.731        3.6%
    8. Grey-scale picture as 8-bit pixel values   1   5.131        1   5.061        1.4%
    9. Grey-scale picture as 4-bit pixel values   2   1.923        2   1.938        -0.7%

Method B is slightly better than A on five of the texts and worse on four. Also, there is no apparent relation between the length and type of message and which escape technique fares better. This insensitivity to the escape probability calculation is actually quite satisfying. It illustrates that the coding method is robust, gaining its power from the idea of the escape mechanism rather than the precise details of the algorithm used to implement it. This point is further reinforced by Roberts [20], who used a very different technique of "blending" Markov models of different orders to achieve excellent results (unfortunately on texts which are not easily comparable with those used here). This insensitivity is particularly fortunate in view of the fact noted earlier that it is hard to see how any particular escape algorithm can be justified theoretically.

Fig. 2 shows graphs of the coding performance versus the value of o for both methods, using the text of samples 1 and 3. The general behavior shown there is typical of that for all the examples. In each case method B is relatively less efficient for small values of o but more efficient for large ones.
Also, method B's efficiency does not deteriorate so quickly past the optimum value of o. This relative lack of sensitivity to o once it is large enough may make method B preferable in situations where it is hard to estimate the best value for o.

Evaluation of Partial Matching

Recall that the coding scheme uses a "partial match" strategy, whereby it begins forming a model of the desired order at once but uses partial string matching to force predictions out of the nascent model in the early stages. The value of this approach is demonstrated in Fig. 3, which shows how the coding performance varies as time progresses during the long text of sample 2. Time, in terms of number of characters, is plotted horizontally on a logarithmic scale. The vertical axis represents coding performance over the entire initial substring of the message. The lower line shows the performance of the partial string match algorithm, while the same algorithm is used for the upper line but with partial string matching suppressed. In both cases, o was chosen to be 4.

[Fig. 3. Cumulative coding performance plotted against time, for sample 2; the horizontal axis shows characters x 1000 on a logarithmic scale from 10 to 500.]

It is partial string matching which allows efficient coding to be achieved early on in the message. For example, the bit rate in the first 10 000 characters is below 3.5 bits/character with partial string matching, whereas without it, it exceeds 5.5 bits/character. Moreover, the improved performance of partial string matching can be seen throughout this rather long piece of text. Eventually, of course, if the message really does have a homogeneous structure, partial string matching will cease to give any advantage. But Fig. 3 indicates that this will take a long time, even for a fairly modest value of o (o = 4).

There is an upturn in both lines between 320 000 and 550 000 characters. This is caused by a sudden disruption of the statistics of the text at around character 450 000, which the interested reader will find just over halfway through Chapter 9 of the book [23]. Perhaps this should be taken as a warning that text statistics in real life are not homogeneous and nicely behaved, making it particularly appropriate to use an adaptive encoding method.

IV. RESOURCE REQUIREMENTS

Let us now consider the resources required to run the coding algorithm. Its most important feature, from the point of view of practical coding, is that the time required for both encoding and decoding grows only linearly with the length of the message. Furthermore, it can be implemented in such a way that it grows only linearly with the order of the model. And impressive data compression has been demonstrated with models of low order, o = 3 or 4.

Our current implementation is experimental and inefficient. It is written in the Pascal language on a VAX 11/780 computer. For models of order between o = 0 and o = 4, encoding time is on the order of 10-50 ms/character, or 20-100 characters/s. Decoding takes a similar time. However, the performance of other implementations of parts of the system has been investigated previously in different contexts. In an Algol implementation, the partial string match search has been found to be possible in 9 ms/character, even for an eighth-order model, on a B6700 computer. We believe that it would be possible to reduce the time taken for partial string matching by the present program by a factor of ten, using better algorithms and hand-coding of critical parts. A tightly coded assembly-language program for arithmetic coding has already achieved 120 μs/character for encoding and 150 μs/character for decoding on a VAX 11/780. Since partial string matching and arithmetic coding between them cover the whole operation of the scheme, a complete data-compression system could operate at approaching 1000 characters/s. Special-purpose architectures using VLSI can be envisaged which could increase the speed to 100 000 characters/s.

The second important resource is the memory space required by both encoder and decoder. As is common in Markov models, this can grow exponentially with the order of the model, and is quite large in practice even for order-5 models of English text. However, notice that the scheme uses no prestored statistics; the required memory is empty initially. There are complicated tradeoffs between space, time, and implementation complexity in partial string matching algorithms [2], [4]. Our experimental implementation stores the Markov model in a tree structure (as must any implementation whose execution time grows at most linearly with o and which occupies a reasonable space).

For each sample of data, the number of nodes in the tree is shown in Table III for various orders of model. All results reported in this paper have been obtained with less than 200 000 nodes. Our experimental implementation in Pascal consumes 128 bits/node, but this can easily be improved. At each node must be stored a character code, a count of the number of times that node has been visited (to allow probabilities to be calculated), and two pointers: one to indicate the next node at the current level and the other to show the subtree for the next level. Allowing 32 bits for the count and each pointer, and 8 bits for the character code, the node consumes 104 bits of storage. For an implementation which accommodates 200 000 nodes (the maximum attained in any of our examples), only 18 bits are required for each pointer. Furthermore, the count could safely be reduced to the same figure or less, on the basis that limiting the counts to even a small maximum value would probably not impair coding efficiency significantly. This would reduce the storage for each node to about 54 bits, so that 1.4 Mbytes would suffice for 200 000 nodes.
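The node layout just described can be sketched as follows. This mirrors the 104-bit node of the text (8-bit character, 32-bit count, two 32-bit links); it is not the authors' Pascal record, and Python objects are used only to make the structure and the search explicit.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        char: int                          # character code (8 bits)
        count: int = 1                     # visits to this node, for probability estimates
        sibling: Optional["Node"] = None   # next node at the same level of context
        child: Optional["Node"] = None     # subtree for contexts one character longer

    def successor(node: Node, ch: int) -> Optional[Node]:
        # walk the sibling chain of node's children looking for ch; a full
        # lookup of an order-o context follows at most o such chains, which
        # is what keeps search time linear in the order of the model
        n = node.child
        while n is not None and n.char != ch:
            n = n.sibling
        return n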
[Table III. Size of data structure required for different kinds of message: the number of nodes in the model tree for each of the nine samples at model orders o = 0, 1, 2, 3, 4, and 9; no sample requires more than 200 000 nodes.]

Alternative implementations, such as the compact hash described by Cleary [3], [4], could reduce this to an estimated 28 bits/node and 0.7 Mbytes without sacrificing search time.

Modifications could be made to the coding method which reduce the number of nodes needed. One example is partial model storage. An order-1 model is stored initially. Only when this has been seen to give ambiguous predictions is it augmented to an order-2 model, and then only for the contexts in which ambiguity arises. In general, the order of each node in the model is increased selectively, up to a maximum value of o, whenever more than one prediction is seen to emanate from it. Another possibility is to construct a nondeterministic automaton model of the message string, and store a reduced form as described by Witten [22].

However, we are not overly concerned about the amount of storage that the method consumes. After all, only unfilled storage is needed. With the continued improvement in integrated circuit technology, empty store is becoming a cheap resource. The major expense associated with memory is the cost of filling it with information and maintaining and updating that information. But this is done automatically by the coding scheme.

Most coding methods do exact a cost by requiring statistics to be calculated and stored before coding begins. The one described does not. However, many applications will find it worthwhile to prime the encoder and decoder with representative statistics before transmitting a message. This is easily done by sending a representative sample of text before the main transmission begins, and we saw during discussion of Table I that this can be most effective. If the statistics are misleading, then of course some deterioration in coding efficiency is only to be expected. Adaptation will ensure that the initial priming is eventually outweighed by the statistics of the message itself.

APPENDIX

A formal definition is now given of the probabilities estimated for characters using partial string matching. To do this we extend the notation used earlier. Let $c_m(\varphi)$ be the count of the number of times the character φ has occurred in the current context of an order m model, where $0 \le m \le o$ and o is the maximum order of the stored model. In order to gracefully cover the special case when a character has never occurred before in the message (so that $c_m(\varphi) = 0$ for all the models), m is allowed to range from −1, and $c_{-1}(\varphi)$ is defined to be 1 for all φ in the alphabet.

Let the set of characters predicted by the model of order m, but not by higher order models, be $A_m$. Then using method A, $A_m$ is the set of characters which have counts greater than 0 in the order m model, less all those with counts greater than 0 in higher order models:

$$A_m = \{\varphi : c_m(\varphi) > 0\} - \bigcup_{l=m+1}^{o} A_l.$$

The sets $A_m$ allow the probability predictions to be improved by neglecting characters predicted by higher order models when calculating the probabilities predicted by the lower order models. For example, if the current context is "# i" and the sequence "i s" has occurred previously in the message but not the sequence "# i s", then "s" is a member of $A_1$ but not of $A_{-1}$, $A_0$, or $A_2$. Because of the definition above, no character can occur in more than one set $A_m$. Also, because of the definition of $c_{-1}$, every character in the alphabet will occur in precisely one of $A_{-1}$ through $A_o$.

Following method A, the probability for a character relative to a model of order m is estimated to be

$$p_m(\varphi) = \frac{c_m(\varphi)}{1 + C_m}, \qquad \varphi \in A_m$$

where $C_m$ is the total count for characters first predicted by a model of order m:

$$C_m = \sum_{\varphi \in A_m} c_m(\varphi).$$

This gives the estimated escape probability of a novel character occurring relative to order m as

$$e_m = \frac{1}{1 + C_m}.$$

Finally, the estimated probability for a character using partial string matching is

$$p(\varphi) = p_m(\varphi) \prod_{l=m+1}^{o} e_l$$

where m is the highest order at which φ is predicted. In other words, to compute p(φ), start at the highest order o. Reducing the order at each step, take the product of the escape probabilities until the character is positively predicted. Then multiply this product by the probability estimated for the character using the model which first positively predicts it. An example calculation of p(φ) using a small alphabet of six characters, "abcdef", is given in Table IV.

[Table IV. Calculation of partial string match probabilities using method A: counts $c_m(\varphi)$ for each character of the alphabet "abcdef" at orders m = 2, 1, 0, −1, with the corresponding probabilities $p_m(\varphi)$ given in brackets.]

It is possible to extend method B to partial string matching by suitably modifying the definitions of $A_m$, $e_m$, and $p_m$ above.
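The Appendix formulas can be followed step by step in code. The sketch below uses the same assumed counts layout as earlier (counts[ctx] maps characters to $c_m(\varphi)$ for the order-m context ctx); it walks down from order o, multiplying in one escape probability $e_l$ per level that fails to predict the character, exactly as the product above prescribes.

    from fractions import Fraction

    def psm_probability(ch, context, counts, o, alphabet):
        p = Fraction(1)
        excluded = set()                       # union of A_l for l > m
        for m in range(o, -2, -1):
            if m >= 0:
                ctx = context[-m:] if m else ""
                c = counts.get(ctx, {})
            else:
                c = {w: 1 for w in alphabet}   # c_-1(phi) = 1 for every character
            A_m = {w: n for w, n in c.items() if n > 0 and w not in excluded}
            C_m = sum(A_m.values())
            if ch in A_m:
                return p * Fraction(A_m[ch], 1 + C_m)   # p_m(phi) = c_m(phi)/(1 + C_m)
            p *= Fraction(1, 1 + C_m)                   # e_m = 1/(1 + C_m); an empty
            excluded |= A_m.keys()                      # level has e_m = 1 and costs nothing
        # unreachable: the order -1 "model" predicts every character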
ACKNOWLEDGMENT

We would like to thank R. Neal for providing timing data on his fast implementation of arithmetic coding on the VAX 11/780, and an anonymous referee for pointing out the early work on the zero frequency problem by Kant and others.

REFERENCES

[1] L. R. Bahl et al., "Recognition of a continuously read natural corpus," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tulsa, OK, Apr. 1978, pp. 418-424.
[2] J. G. Cleary, "An associative and impressible computer," Ph.D. dissertation, Univ. Canterbury, Christchurch, New Zealand, 1980.
[3] J. G. Cleary, "Compact hash tables," IEEE Trans. Comput., to be published.
[4] J. G. Cleary, "Representing trees and tries without pointers," Dep. Comput. Sci., Univ. Calgary, Calgary, Alta., Canada, Res. Rep. 82/102/21, Sept. 1982.
[5] J. G. Cleary and I. H. Witten, "Arithmetic, enumerative and adaptive coding," IEEE Trans. Inform. Theory, to be published.
[6] T. M. Cover, "Enumerative source encoding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 73-77, Jan. 1973.
[7] M. Guazzo, "A general minimum-redundancy source-coding algorithm," IEEE Trans. Inform. Theory, vol. IT-26, pp. 15-25, Jan. 1980.
[8] C. W. Harrison, "Experiments with linear prediction in television," Bell Syst. Tech. J., pp. 764-783, July 1952.
[9] R. Hunter and A. H. Robinson, "International digital facsimile coding standards," Proc. IEEE, vol. 68, pp. 854-867, July 1980.
[10] C. B. Jones, "An efficient coding system for long source sequences," IEEE Trans. Inform. Theory, vol. IT-27, pp. 280-291, May 1981.
[11] B. W. Kernighan and D. M. Ritchie, The C Programming Language. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[12] G. G. Langdon and J. Rissanen, "Compression of black-white images with arithmetic coding," IEEE Trans. Commun., vol. COM-29, pp. 858-867, June 1981.
[13] G. G. Langdon, "A note on the Ziv-Lempel model for compressing individual sequences," IEEE Trans. Inform. Theory, vol. IT-29, pp. 284-287, Mar. 1983.
[14] M. E. Lesk, "Refer," in Unix Programmer's Manual, Bell Lab., Murray Hill, NJ, 1979.
[15] R. Pasco, "Source coding algorithms for fast data compression," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1976.
[16] C. S. Pierce, "The probability of induction," in The World of Mathematics, vol. 2, J. R. Newman, Ed. New York: Simon and Schuster, 1956, pp. 1341-1354.
[17] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Inform. Theory, vol. IT-13, pp. 536-551, Oct. 1967.
[18] J. J. Rissanen, "Arithmetic codings as number representations," Acta Polytech. Scandinavica, vol. Math. 31, pp. 44-51, 1979.
[19] J. J. Rissanen and G. G. Langdon, "Universal modeling and coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, Jan. 1981.
[20] M. F. Roberts, "Local order estimating Markovian analysis for noiseless source coding and authorship identification," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1982.
[21] H. E. White, "Printed English compression by dictionary encoding," Proc. IEEE, vol. 55, pp. 390-396, 1967.
[22] I. H. Witten, "Approximate, non-deterministic modelling of behaviour sequences," Int. J. General Syst., vol. 5, pp. 1-12, Jan. 1979.
[23] I. H. Witten, Principles of Computer Speech. London, England: Academic, 1982.
[24] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, Sept. 1978.
