Imc14 05 Dictionary Codes

This document discusses dictionary-based coding techniques, in which commonly occurring patterns in the data are encoded compactly as indexes into a dictionary, while uncommon patterns fall back to a default encoding, the goal being a smaller average number of bits per symbol. It covers static and adaptive dictionaries, the LZ77 and LZ78 algorithms, which build the dictionary dynamically from previously encoded data, and variations such as LZW. It also notes the weaknesses of these methods, such as recurring patterns that fall outside the search window and dictionary entries that are still incomplete during decoding.


Dictionary-based Coding Techniques

National Chiao Tung University


Chun-Jen Tsai
10/16/2014
Rationale

In the previous two chapters, we looked at coding
techniques that assume the source generates a
sequence of independent symbols.
Most data sources are correlated, so the coding step is
generally preceded by a de-correlation step (e.g., prediction
based on a model).
Alternatively, we can build a list of commonly
occurring patterns and encode these patterns by
transmitting their index in the list
→ dictionary techniques

2/31
Static vs. Adaptive Dictionary

The dictionary holds a list of strings of symbols and it


may be static or dynamic (adaptive)
Static dictionary – permanent, sometimes allowing
the addition of strings but no deletions
Dynamic dictionary – holding strings previously found
in the input stream, allowing for additions and
deletions of strings as new input symbols are being
read

3/31
Basic Idea of Dictionary Coding

Given an input source, we want to


Identify frequent symbol patterns
Encode those more efficiently
Use a default (less efficient) encoding for the rest
Hopefully, the average bits per symbol gets smaller
In general, dictionary-based techniques work well
for highly correlated data (e.g. text), but are less efficient
for data with low correlation (e.g. i.i.d. sources)

4/31
Motivating Example

Consider an ‘English’ source with 26 letters & six


punctuation marks
Single-symbol fixed-length code: 5 bits per symbol (32 symbols)
Four-symbol patterns with a fixed-length code: 20 bits per pattern
(32^4 = 1,048,576 possible patterns)
If we assume an uneven distribution of the symbols
Pick a dictionary which contains the 256 most frequent four-symbol
patterns (total probability p) and encode them with 8 bits
Encode the rest with 20 bits
Use a 1-bit prefix to distinguish the two cases
then, the average rate is 9p + 21(1 – p) = 21 – 12p bits per pattern.
This beats 20 bits whenever 21 – 12p < 20, i.e. p > 1/12 ≈ 0.084.

5/31
Static Dictionary

Using a static dictionary is less complex, but the
probability p of a hit depends heavily on the
application
For student records in a university, a static dictionary is probably fine
The key to success is that the most common
patterns are a small subset of all possible messages
Out of over 100,000 English words, fewer than 2,000
words are used in most writing

6/31
Digram Coding
The dictionary is composed of
All letters from the alphabet
As many digrams (pairs of letters) as possible

For example, if we want to encode pure ASCII text


documents, we can design a dictionary of size 256
entries, and
Source alphabet: 95 printable ASCII symbols
Digrams: 161 most common pairs

7/31
Simple Digram Coding Example

The source alphabet A = {a, b, c, d, r}


Dictionary (3-bit codes for the five letters and three common digrams):

Code  Entry    Code  Entry
000   a        100   r
001   b        101   ab
010   c        110   ac
011   d        111   ad

Try to code the sequence abracadabra; the output is
101100110111101100000 (= 101 100 110 111 101 100 000,
i.e. ab, r, ac, ad, ab, r, a: 21 bits instead of 33 bits with
a 3-bit single-letter code).
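
A minimal Python sketch of this greedy digram coder (the code table below simply transcribes the dictionary above):

# 3-bit codes for single letters and the chosen digrams
codes = {"a": "000", "b": "001", "c": "010", "d": "011", "r": "100",
         "ab": "101", "ac": "110", "ad": "111"}

def digram_encode(text):
    # Greedy rule: take two symbols if the pair is in the dictionary,
    # otherwise fall back to the single-symbol code.
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in codes:
            out.append(codes[text[i:i + 2]])
            i += 2
        else:
            out.append(codes[text[i]])
            i += 1
    return "".join(out)

print(digram_encode("abracadabra"))   # -> 101100110111101100000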

8/31
Problem: Which Digrams to Use?
The most frequent digrams differ by source type. Source 1: LaTeX documents; Source 2: C programs

9/31
Adaptive Dictionary Technique

Original ideas published by Jacob Ziv and Abraham


Lempel in 1977 (LZ77/LZ1) and 1978 (LZ78/LZ2)
The best-known dictionary-based technique,
LZW, is a modification of LZ78 published by
Terry Welch in 1984

10/31
LZ77 (1/2)

General approach
Dictionary is a portion of the previously encoded sequence
Use a sliding window for compression
Mechanism
Find the longest match in the search buffer for the string at the
beginning of the look-ahead buffer, and encode it
Rationale
If patterns tend to repeat locally, we should be able to get
more efficient representation

11/31
LZ77 (2/2)
Sliding window is composed of a search buffer and a look-
ahead buffer (note: window size W = S + LA)
Match pointer Search pointer

a _ _ a b r a _ a d a b r a r r a r r a _

Search buffer Look-ahead buffer


(size S = 8) (size LA = 7)

Offset = search pointer – match pointer (o = 7)
Length of match = number of consecutive letters matched (l = 4)
Codeword c = C(r), where C(x) is the codeword for symbol x
Encoding triple: <o, l, c> = <7, 4, C(r)>
If FLC is used and the alphabet size is |A|, <o, l, c> can be
encoded with ⌈log2 S⌉ + ⌈log2 W⌉ + ⌈log2 |A|⌉ bits (the match may
extend into the look-ahead buffer, so the length is coded relative
to W rather than S).

12/31
Possible Cases for Triples

Three different cases may be encountered during
the coding process:
No match in the window for the next character to be encoded
There is a match
The matched string extends into the look-ahead buffer
For each of these cases, we have a triple to signal
the case to the decoder

13/31
LZ77 Encoding Example
Sequence |cadabrar|rarrad|
cabracadabrarrarrad |cadabrar|rarrad|
W = 13, S = 7 |cadabrar|rarrad|
|cabraca|dabrar|rarrad send <3, 3, C(r)>
no match for d Could we do better?
send <0, 0, C(d)> send <3, 5, C(d)> instead
|abracad|abrarr|arrad
|abracad|abrarr|arrad
|abracad|abrarr|arrad
|abracad|abrarr|arrad
send <7, 4, C(r)>
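
A rough Python sketch of this longest-match search (the buffer sizes S = 7, LA = 6 and the start position mirror the example above; a real coder would emit the codeword C(x) rather than the raw symbol). Note that searching for the longest match directly produces the improved triple <3, 5, C(d)>:

def lz77_encode(seq, S, LA, start=0):
    # Emit <offset, length, next-symbol> triples; S is the search-buffer size,
    # LA the look-ahead size, and everything before 'start' is assumed already sent.
    i, triples = start, []
    while i < len(seq):
        best_off, best_len = 0, 0
        max_len = min(LA, len(seq) - i - 1)   # keep one symbol for the 'c' field
        for off in range(1, min(S, i) + 1):
            length = 0
            # a match may run past the search buffer into the look-ahead buffer
            while length < max_len and seq[i - off + length] == seq[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        triples.append((best_off, best_len, seq[i + best_len]))
        i += best_len + 1
    return triples

# Encode the rest of "cabracadabrarrarrad", given that "cabraca" was sent earlier:
print(lz77_encode("cabracadabrarrarrad", S=7, LA=6, start=7))
# -> [(0, 0, 'd'), (7, 4, 'r'), (3, 5, 'd')]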

14/31
LZ77 Decoding Example

Current input: <0, 0, C(d)> <7, 4, C(r)> <3, 5, C(d)>


Current output: cabraca
Decode: <0, 0, C(d)>
Decode C(d): c|abracad|
Decode: <7, 4, C(r)>
Start with the first ‘a’, copy four letters: cabra|cadabra|
Decode C(r): cabrac|adabrar|
Decode: <3, 5, C(d)>
Start with the first ‘r’, copy three letters: cabracada|brarrar|
Copy two more letters: cabracadabr|arrarra|
Decode C(d): cabracadabrarrarrad
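
The copy-back step, including the case where the copy overlaps the data just written, can be sketched as follows (the prefix argument stands for the output that already exists):

def lz77_decode(triples, prefix=""):
    out = list(prefix)
    for offset, length, symbol in triples:
        start = len(out) - offset
        for k in range(length):      # the copy may overlap the region being written
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)

print(lz77_decode([(0, 0, 'd'), (7, 4, 'r'), (3, 5, 'd')], prefix="cabraca"))
# -> cabracadabrarrarrad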

15/31
LZ77 Variants

For LZ77, we have


Adaptive scheme, no prior knowledge
Asymptotically approaches the source statistics
Assumes that recurring patterns occur close to each other
Possible improvements
Variable-bit encoding: PKZip, zip, gzip, …, etc., uses a
variable-length coder to encode <o, l, c>.
Variable buffer size: a larger buffer captures more recurrences but requires faster searching
Elimination of <0, 0, C(x)>
LZSS sends a flag bit to signal whether the next “token” is an
<o, l> pair or the codeword of a symbol

16/31
Problems with LZ77

If a recurring pattern repeats with a period larger
than the search buffer, the performance is bad: the earlier
occurrence has already slid out of the window by the time
the pattern recurs
Example: a sequence that repeats a block of distinct symbols
with a period longer than the search buffer never finds a match

17/31
LZ78

LZ78 improvements over LZ77


No search buffer – explicit dictionary instead
Encoder/decoder must build dictionary in sync
Encoding: <i, c>
i = index in the dictionary, i = 0 for symbols not in the dictionary
c = code of the following character
Example: encode the following contents (␢ denotes the blank symbol)
wabba␢wabba␢wabba␢wabba␢woo␢woo␢woo
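
A small Python sketch of this <i, c> encoding (the function name and the '␢' character are only illustrative; the next slide tabulates the first few steps):

def lz78_encode(data):
    # Emit <index, next-symbol> pairs; index 0 means the phrase is not yet in the dictionary.
    dictionary = {}          # phrase -> index; the dictionary starts empty
    pairs, p = [], ""
    for a in data:
        if p + a in dictionary:
            p = p + a                            # keep extending a known phrase
        else:
            pairs.append((dictionary.get(p, 0), a))
            dictionary[p + a] = len(dictionary) + 1
            p = ""
    if p:                                        # input ended inside a known phrase
        pairs.append((dictionary[p], ""))
    return pairs

print(lz78_encode("wabba␢wabba␢wabba␢wabba␢woo␢woo␢woo")[:6])
# -> [(0, 'w'), (0, 'a'), (0, 'b'), (3, 'a'), (0, '␢'), (1, 'a')]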

18/31
LZ78 Example

Input: wabba␢wabba␢wabba␢wabba␢woo␢woo␢woo
Dictionaries:

initial dictionary: empty

dictionary after encoding w, a, b:

Encoder output   Index   Entry
<0, C(w)>        1       w
<0, C(a)>        2       a
<0, C(b)>        3       b

final dictionary: built up as encoding continues
(<3, C(a)> adds 4 ba, <0, C(␢)> adds 5 ␢, <1, C(a)> adds 6 wa, …)

19/31
Remarks on LZ78

Observation
If we keep on encoding, the dictionary will keep on growing
Possible solutions
Stop growing the dictionary
Effectively switch to a static dictionary
Prune it
Based on usage statistics
Reset it
Start all over again
The best solution depends on the knowledge of the
source

20/31
LZ78 Variants: LZW
Invented by Terry Welch in 1984
Idea
Instead of <i, c>, encode i only
Algorithm
Initial dictionary contains all alphabet letters, p = null
while (!done)
    read next symbol into a
    if (p*a) is in the dictionary    // Note: '*' stands for concatenation
        p = p*a
    else
        send out index of p
        add p*a to the dictionary
        p = a
end
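
A direct Python transcription of this pseudocode (with an explicit flush of the final p, which the pseudocode leaves implicit); the alphabet order below is chosen to match the dictionary on the next slide, and '␢' again stands for the blank symbol:

def lzw_encode(data, alphabet):
    # Dictionary indices start at 1 to match the example's numbering.
    dictionary = {ch: i for i, ch in enumerate(alphabet, start=1)}
    next_index = len(dictionary) + 1
    p, output = "", []
    for a in data:                        # read next symbol into a
        if p + a in dictionary:           # p*a already known: extend the phrase
            p = p + a
        else:
            output.append(dictionary[p])      # send out index of p
            dictionary[p + a] = next_index    # add p*a to the dictionary
            next_index += 1
            p = a
    if p:
        output.append(dictionary[p])          # flush the last phrase
    return output

text = "wabba␢" * 4 + "woo␢woo␢woo"
print(lzw_encode(text, "␢abow"))
# -> [5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5, 4, 4, 11, 21, 23, 4]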

21/31
Example: LZW Encoding

Input: wabba␢wabba␢wabba␢wabba␢woo␢woo␢woo
Dictionaries:
initial dictionary (source alphabet):

Index   Entry
1       ␢
2       a
3       b
4       o
5       w

final dictionary: grows to 25 entries as encoding proceeds
(6 wa, 7 ab, 8 bb, 9 ba, 10 a␢, 11 ␢w, 12 wab, …)
Output: 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4
22/31
Problems with LZW Decoding

Decoding of LZW is simple, in general


Output symbols from the dictionary as indexed by the inputs
Construct the dictionary on-the-fly as the encoder does

However, if the message contains a pattern cScS …,
where c is a character and S is a string, the decoder may
be asked to use a dictionary entry that is still only
partially constructed
Solution: the entry under construction is held in p, so we
should allow reading the partial data out of p during decoding;
its missing last character must equal the first character of p
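
A minimal Python sketch of a decoder that handles this case: when a code refers to the entry still under construction, that entry must be the previous string plus its own first character:

def lzw_decode(codes, alphabet):
    dictionary = {i: ch for i, ch in enumerate(alphabet, start=1)}
    next_index = len(dictionary) + 1
    prev = dictionary[codes[0]]              # the first code is always a single symbol
    output = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                # code == next_index: entry not finished yet
            entry = prev + prev[0]
        output.append(entry)
        dictionary[next_index] = prev + entry[0]   # complete the pending entry
        next_index += 1
        prev = entry
    return "".join(output)

# The special case of the next slide: alphabet {a, b}, input abababab
# (the complete encoder output for abababab works out to 1 2 3 5 2)
print(lzw_decode([1, 2, 3, 5, 2], "ab"))     # -> abababab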

23/31
Example: Special Case in Decoding

Alphabet A = {a, b}, input is abababab, encoder output
is 1 2 3 5 …
Decoding dictionaries:

initial dictionary: 1 a, 2 b
intermediate dictionary (after decoding 1 2 3): adds 3 ab and 4 ba;
entry 5 (= ab?) is still under construction

When we reach the decoding of 5, we only have p = ab?: we do not
have the complete output! (Using the rule above, the missing character
is the first character of p, so entry 5 = aba.)

24/31
Application: Compress

An early implementation of LZW


Adaptive dictionary, starts with 2^9 = 512 entries
User can configure the maximum codeword length b_max = 9~16 bits
The dictionary doubles in size each time it fills, up to 2^b_max entries
When the dictionary reaches 2^b_max entries, it becomes a static
dictionary encoder
If the compression ratio falls below a threshold, the dictionary
is reset

25/31
Application: GIF Images

LZW scheme, similar to compress:


Clear code is used to reset the encoder/decoder; for
b bits/pixel images, 2^b is used as the clear code
Dictionary size is initially 2^(b+1)
Dictionary size can grow up to 4096 entries
Format:
Codewords are stored in blocks of 8-bit characters
Each block begins with a header giving a byte count of up to 255,
and ends with a block terminator symbol (8 zero bits)
The last block has an end-of-information code, 2^b + 1, before
the block terminator

26/31
GIF Performance

GIF vs. arithmetic coding

27/31
Application: PNG Images

Based on LZ77, patent-free alternative to GIF


Designed specifically for lossless image compression
Modes: true color, grayscale, 8-bit palette
Two autonomous compression components
Deflate (RFC 1951) — LZ77-style dictionary compression
algorithm plus Huffman coding
Filtering — lossless transformations of byte-level image data

28/31
PNG – Deflate

Deflate = LZ77 + Huffman


Three types of data blocks
Uncompressed, LZ77 + fixed Huffman, LZ77 + adaptive
Huffman
Match lengths are between 3 and 258 bytes
Strings of at least 3 bytes inside the sliding window are searched
for a match; if no match is found, the first byte is encoded as a
literal and the window slides by one byte
At each step, LZ77 outputs either a codeword for a literal or
a paired value of <match_length, offset>
Match length is encoded by an index code (257~285) plus a
selector code (0~5 extra bits)
Offset (1~32768) is encoded using a Huffman code
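
Python's built-in zlib module produces this Deflate bitstream, so a quick round-trip can show the LZ77 + Huffman stage on its own (a sketch only; PNG's filtering step is not included):

import zlib

data = b"abracadabra " * 100
# wbits = -15 selects a raw Deflate stream with the maximum 32 KB window
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
stream = comp.compress(data) + comp.flush()
restored = zlib.decompress(stream, -15)
print(len(data), "->", len(stream), "bytes, lossless:", restored == data)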
29/31
PNG – Filtering

Filters are applied on a scanline-by-scanline basis


All algorithms applied to bytes (not pixels)
Filter types:
None: unmodified value
Sub: difference from previous byte value (mod 256)
Up: difference from the byte value above
Average: subtract average of the left and the above bytes
Paeth:
Compute initial estimate by left + above – upper_left
The value of left, above, or upper_left that is closest to the
initial estimate is used as the estimate
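
A small Python sketch of the Paeth predictor described above (ties are resolved in the order left, above, upper_left, as in the PNG specification):

def paeth_predictor(left, above, upper_left):
    # Initial estimate, then pick whichever neighbour is closest to it.
    p = left + above - upper_left
    pa, pb, pc = abs(p - left), abs(p - above), abs(p - upper_left)
    if pa <= pb and pa <= pc:
        return left
    if pb <= pc:
        return above
    return upper_left

# The filtered byte is then (raw_byte - paeth_predictor(left, above, upper_left)) mod 256
print(paeth_predictor(120, 125, 118))   # -> 125 (the estimate 127 is closest to 'above')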

30/31
PNG: Performance

PNG vs. GIF vs. arithmetic coding

31/31
