A Software Implementation of the Shannon-Fano Coding Algorithm

Student authors: Đorđe K. Manoilov1 and Daniel S. Dimitrov1
Mentors: Radomir Stanković2 and Dušan Gajić2

1 Đorđe Manoilov and Daniel Dimitrov are with the Faculty of Electronic Engineering, Aleksandra Medvedeva 14, 18000 Niš, Serbia, E-mails: [email protected], [email protected].
2 Radomir Stanković and Dušan Gajić are with the University of Niš, Faculty of Electronic Engineering, Aleksandra Medvedeva 14, 18000 Niš, Serbia, E-mails: [email protected], [email protected].
Abstract – The Shannon-Fano coding technique is one of the earliest algorithms that produce code words with low redundancy, and it serves as a basis for several more recent methods. In this paper, we present a C# implementation of the Shannon-Fano encoding method for data compression. We conducted various experiments with different inputs provided to the application and recorded the compression ratios and algorithm running times. The presented solution features a graphical user interface and has solid real-world performance, but it was developed primarily as an educational tool that can help students to better understand this encoding technique.

Keywords – Shannon-Fano encoding, C# programming solution, text compression.
I. INTRODUCTION

Data compression is a mathematical method, an algorithm used to decrease the number of bits in a file that are necessary for storing, sending, or transferring electronic information. In other words, compression decreases the size of a file or of a group of files, so the space needed for storing the information becomes smaller.

Some compression methods lose data, but here we discuss only compression that occurs without loss. The benefit is that the compressed data decompress into exactly the same form (the data are recovered into their initial state); the drawback is that an error of even a single bit can be fatal. Lossless compression can be realized with different algorithms, such as RLE (Run-Length Encoding), algorithms that remove runs of zeros, the Shannon-Fano algorithm, and the Huffman algorithm [1].

We discuss Shannon-Fano compression, which is based on an algorithm that uses prefix coding [1]. In this paper, we present its implementation and include test results for different textual files [7]. In Section II we describe the theoretical basis of Shannon-Fano coding. Next, in Section III we present a software solution for data compression using the Shannon-Fano algorithm, realized in the C# programming language. This application is developed mainly for educational purposes. In Section IV we give experimental results for the data compression ratio, running time, and number of different characters. We close the paper with some conclusions in the final section.
II. SHANNON-FANO ALGORITHM

A. Theoretical basis and the algorithm

Shannon-Fano coding was developed by Claude Elwood Shannon and Robert Fano [1]. It is a technique which uses prefix encoding and is based on a set of symbols and their probabilities.

A prefix code is a code system characterized by the prefix property: no valid code word in the system is a prefix (start) of any other valid code word in the set. Using a prefix code, a message can be transmitted as a sequence of concatenated code words, without any extra markers to frame the words in the message. The recipient decodes the message by repeatedly searching for prefixes that form valid code words; a decoding sketch illustrating this is given below. Such decoding is not possible with codes that lack the prefix property.

Shannon-Fano coding starts with the set of symbols, with elements arranged in order from most probable to least probable. The set is then divided into two sets whose total probabilities are as close as possible to being equal. All symbols then have the first digits of their codes assigned: symbols in the first set receive "0" and symbols in the second set receive "1". Shannon-Fano coding thus builds a binary tree structure. As long as any set with more than one member remains, the same process is repeated on it. When a set has been reduced to one symbol, that symbol's code is complete and will not form the prefix of any other symbol's code.

The algorithm produces codes of variable and fairly efficient length. When the two smaller sets produced by a partitioning have exactly equal probabilities, the one bit of information used to distinguish them is used most efficiently. However, the Shannon-Fano algorithm does not always produce codes of optimal length: for the set of probabilities {0.35, 0.17, 0.17, 0.16, 0.15}, for example, Shannon-Fano coding does not give the optimal-length code.
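The following sketch shows how the prefix property makes such decoding possible. It is a minimal illustration in the C# used throughout the paper, not the application's own decoder; the dictionary symbolFor, which maps code words back to symbols, is an assumed helper.

    using System.Collections.Generic;
    using System.Text;

    // Scans the bit string and emits a symbol as soon as the accumulated
    // bits match a valid code word. No markers are needed because no code
    // word is a prefix of another one.
    static string Decode(string bits, Dictionary<string, char> symbolFor)
    {
        var output = new StringBuilder();
        string current = "";
        foreach (char b in bits)
        {
            current += b;
            if (symbolFor.ContainsKey(current))
            {
                output.Append(symbolFor[current]);  // complete code word found
                current = "";
            }
        }
        return output.ToString();
    }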
The Shannon-Fano compression uses a binary tree as its data structure, where the encoded symbols are placed in the leaves of the tree. The tree is constructed in a specific way in order to define an effective code table. The actual algorithm is simple (a C# sketch of these steps follows the list):

1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts, so that each symbol's relative frequency of occurrence is known.
2. Sort the list of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.
3. Divide the list into two parts, with the total frequency count of the left half being as close as possible to the total of the right half.
4. Assign the binary digit 0 to the left half of the list and the digit 1 to the right half. This means that the codes for the symbols in the first half will all start with 0, and the codes in the second half will all start with 1.
5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
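The sketch below implements the five steps under our own naming (the class ShannonFano and its methods are illustrative and are not taken from the application described in Section III). Frequencies are counted and sorted first; the recursive Assign then splits each range as evenly as possible and appends one code bit per split.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class ShannonFano
    {
        // Steps 3-5: split the range [lo, hi] of the frequency-sorted list so
        // that the two halves' totals are as close as possible, append 0/1 to
        // the codes, and recurse until every range holds a single symbol.
        static void Assign(List<KeyValuePair<char, int>> symbols,
                           Dictionary<char, string> codes, int lo, int hi)
        {
            if (lo >= hi) return;  // one symbol left: its code is complete

            int total = 0;
            for (int i = lo; i <= hi; i++) total += symbols[i].Value;

            int split = lo, running = 0, bestDiff = int.MaxValue;
            for (int i = lo; i < hi; i++)  // find the most even split point
            {
                running += symbols[i].Value;
                int diff = Math.Abs(total - 2 * running);
                if (diff < bestDiff) { bestDiff = diff; split = i; }
            }

            for (int i = lo; i <= split; i++) codes[symbols[i].Key] += "0";
            for (int i = split + 1; i <= hi; i++) codes[symbols[i].Key] += "1";

            Assign(symbols, codes, lo, split);
            Assign(symbols, codes, split + 1, hi);
        }

        // Steps 1-2: count the frequency of each character, sort in
        // descending order, then build the code table recursively.
        public static Dictionary<char, string> BuildCodeTable(string text)
        {
            var freq = new Dictionary<char, int>();
            foreach (char c in text)
                freq[c] = freq.ContainsKey(c) ? freq[c] + 1 : 1;

            var sorted = freq.OrderByDescending(p => p.Value).ToList();

            var codes = new Dictionary<char, string>();
            foreach (var p in sorted) codes[p.Key] = "";

            Assign(sorted, codes, 0, sorted.Count - 1);
            return codes;
        }
    }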
B. The field of use

Shannon-Fano coding is used in the IMPLODE compression method [2], which is part of the ZIP file format. The Huffman algorithm [1], an improved version of the Shannon-Fano algorithm, is used to compress music files in the MP3 format and for JPEG picture compression [8].

III. ARCHITECTURE OF THE APPLICATION AND THE PROGRAMMING IMPLEMENTATION

The application is developed in Visual C# .NET 3.5 and can only be used on the Microsoft Windows operating system.

The application consists of four forms (Fig. 2). The "Main form" is used for selecting a file for coding or for manual input of the text to be coded; it also displays the symbols and their respective codes (Fig. 1). It is possible to save the coded text to a desired location on disk or on another medium. The "Manual form" offers a brief user manual. The "Statistics form" (Fig. 3) shows the degree of compression for the selected text. The "Information form" contains information about the authors of the application.

The text to be compressed is placed into a string variable. The application contains a function that separates the different nodes and calculates their probabilities of occurrence. The probability of occurrence of a symbol is calculated as the ratio of the number of occurrences of that symbol to the total number of symbols in the file. For the purposes of the algorithm, these symbols must be arranged in ascending or descending order. After sorting, the symbols are encoded by calling the Shannon-Fano algorithm implementation: every symbol in the text is replaced with its code via the library function String.Replace, and the result is put into a new string. A sketch of this encoding step is given below.
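As a minimal sketch of this step (Encode is our illustrative name, not the application's actual function; it uses a StringBuilder rather than repeated string concatenation, which also avoids the memory problem discussed in Section IV):

    using System.Collections.Generic;
    using System.Text;

    // Replaces every symbol of the input text with its code word and
    // returns the concatenated string of '0' and '1' characters.
    static string Encode(string text, Dictionary<char, string> codes)
    {
        var bits = new StringBuilder();
        foreach (char c in text)
            bits.Append(codes[c]);
        return bits.ToString();
    }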
C# does not support working directly at the level of bits. Therefore, before writing to a binary file, each sequence of 8 code characters is stored in a buffer the size of one byte. A 0 is entered into the buffer by shifting its contents to the left (shift-left); a 1 is entered using a shift-left followed by a logical OR with 0x01. This works directly only if the length of the coded text is divisible by 8, so it is necessary to pad the buffer with additional 0 bits for the last entry in the file. The padding increases the encoded file by up to 7 bits, but it allows bit-level operations to be simulated in C#, as the following sketch shows.
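The sketch below illustrates this buffering scheme (WriteBits is our illustrative name; the shift and OR operations are the ones described above):

    using System.IO;

    // Packs a string of '0'/'1' characters into bytes and writes them to a
    // binary file; the final, partial byte is padded with 0 bits.
    static void WriteBits(string bits, string path)
    {
        using (var writer = new BinaryWriter(File.Open(path, FileMode.Create)))
        {
            int buffer = 0, count = 0;
            foreach (char bit in bits)
            {
                buffer <<= 1;                    // shift-left makes room for the new bit
                if (bit == '1') buffer |= 0x01;  // OR sets the bit when it is a 1
                if (++count == 8)
                {
                    writer.Write((byte)buffer);  // a full byte goes to the file
                    buffer = 0;
                    count = 0;
                }
            }
            if (count > 0)                       // pad the last byte with up to 7 zero bits
                writer.Write((byte)(buffer << (8 - count)));
        }
    }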
IV. EXPERIMENTAL RESULTS

The application was tested on various input files in order to measure the compression time and the compression percentage; all input files are textual. All of the experiments were performed on a laptop PC with an Intel Core 2 Duo T5450 processor and 3 GB of RAM, running the Windows XP Service Pack 2 operating system. The duration of compression depends on the computer's hardware and on the current utilization of its resources. The test results for plain text are shown in Table I, and the test results for source code are shown in Table II. Tables I and II show that the compression speed and the compression ratio depend on the number of different characters and on the file size. For a small number of different characters, encoding is fast regardless of the size of the file, because each character is encoded with a small number of bits and the string operations complete quickly. For very large files (about 60 MB) the application reports a "Memory error"; the problem is caused by the use of immutable strings and can be solved by using a StringBuilder. The coding time for a normal text file, such as source code or a book, is at most a few seconds. The compression ratio is about 50%, but it goes up to 80% when a file contains many repetitions of the same characters. From the presented results we can also conclude that the compression ratio for source code is lower than for plain text.
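The compression ratio cr reported in Tables I and II is the percentage reduction in file size; the one-line helper below (our own, for illustration) reproduces the tabulated values, e.g. 39 B -> 12 B gives (1 - 12/39) * 100 ≈ 69.2 %.

    // Percentage reduction in size, matching the cr column of Tables I and II.
    static double CompressionRatio(long originalBytes, long compressedBytes)
    {
        return (1.0 - (double)compressedBytes / originalBytes) * 100.0;
    }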
TABLE I
DIFFERENT TEXT FILES: n – NUMBER OF DIFFERENT CHARACTERS, t1 – CODING TIME, t2 – RECORDING TIME, fs – FILE SIZE, cr – COMPRESSION RATIO

 n    t1        t2          fs                     cr
 4    0 ms      15.625 ms   10 B -> 3 B            70 %
 10   0 ms      0 ms        10 B -> 5 B            50 %
 15   0 ms      15.625 ms   21 B -> 11 B           47.6 %
 5    0 ms      15.625 ms   39 B -> 12 B           69.2 %
 47   0 ms      15.625 ms   1.88 KB -> 1006 B      47.7 %
 116  656.2 ms  703.12 ms   100.9 KB -> 49.8 KB    50.6 %
 116  31.9 s    29 s        4847 KB -> 2394 KB     50.6 %
 3    10.01 s   3.625 s     23523 KB -> 4324 KB    81.8 %
 3    27 s      Mem. error  61.2 MB -> ?           ?
 79   5.112 s   3.718 s     889 KB -> 501,B        43.26 %
TABLE II
SOURCE CODE FILES: n – NUMBER OF DIFFERENT CHARACTERS, t1 – CODING TIME, t2 – RECORDING TIME, fs – FILE SIZE, cr – COMPRESSION RATIO

 n    t1      t2      fs                    cr
 92   807 ms  620 ms  126 KB -> 55.1 KB     56.2 %
 95   186 ms  144 ms  29.2 KB -> 14.2 KB    51.13 %
 71   44 ms   36 ms   8.56 KB -> 4.55 KB    46.88 %
 90   187 ms  118 ms  29.7 KB -> 12.8 KB    56.99 %
 90   153 ms  116 ms  21.3 KB -> 13.2 KB    38.14 %
 71   19 ms   17 ms   3.1 KB -> 1.9 KB      38.71 %
 58   5 ms    7 ms    1 KB -> 647 B         41.5 %
 65   28 ms   22 ms   5.4 KB -> 3.15 KB     41.9 %
 63   13 ms   12 ms   2.3 KB -> 1.35 KB     41.61 %

V. CONCLUSION

Through the experiments performed with our implementation of the Shannon-Fano algorithm we reached the following conclusions:
- The most common characters have the shortest code words, and vice versa.
- For the same number of different characters, the algorithm achieves the same compression ratio.
- For two files of the same size but with a different number of unique characters, the file with the smaller number of different characters has the higher compression ratio.
- The time required for encoding and recording increases with the size of the input file.

The application we have developed cannot compete with existing commercial data compression applications. It was developed primarily as an educational tool that can help students better understand this encoding technique, which serves as the basis of more recent compression methods.

REFERENCES

[1] D. Salomon, Data Compression: The Complete Reference, 3rd Edition, Springer, 2004, ISBN 0-387-40697-2.
[2] http://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding, website last visited on 14/04/2011.
[3] http://www.ustudy.in/node/6409, website last visited on 15/12/2010.
[4] http://www.binaryessence.com/dct/en000041.htm, website last visited on 14/04/2011.
[5] http://cppgm.blogspot.com/2008/01/shano-fano-code.html, website last visited on 14/04/2011.
[6] http://www.dotnetspark.com/Forum/169-how-to-open-one-chm-help-file-c-sharp-windows.aspx, website last visited on 14/04/2011.
[7] http://www.onlinehowto.net/Why-compress-/2, website last visited on 14/04/2011.
[8] http://en.wikipedia.org/wiki/Huffman_coding, website last visited on 14/04/2011.