

A Novel Approach of Data Compression for Dynamic Data

Rahul Gupta, Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, India ([email protected])
Ashutosh Gupta, Department of Computer Science and Engineering, IERT, Allahabad, India ([email protected])
Suneeta Agarwal, Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, India ([email protected])

Abstract - This paper presents a compression algorithm for dynamic data, the size of which keeps on increasing frequently. It is an efficient data compression technique comprising a block approach that keeps the data in compressed form as long as possible and enables data to be appended to the already compressed text. The approach requires only a minimal decompression for supporting update of data. The algorithm reduces to a minimum the unnecessary time spent in the compression-decompression approaches for dynamic documents designed till now. Further, the text document can be modified as required without decompressing and again compressing the whole document.

Keywords: LZW Compression, Text Compression, Dynamic Data, Block Data Structure.

1 Introduction

Data compression is widely used in office automation systems to save storage space and network bandwidth. Many compression methods have been proposed till date which reduce the size of these documents by various degrees [1], [2], [3]. One important characteristic of office documents is that their size keeps on increasing at a very fast pace. Moreover, office automation systems require frequent modification of already stored text [4]. Hence, if the data is kept in compressed form, it needs to undergo repeated cycles of compression and decompression with each update.

In this paper, we focus on dynamic documents that are appended with new data very frequently. We have developed a new compression method, inspired by Lempel-Ziv compression [5], for such documents. The algorithm creates blocks of input text data while compressing it. This novel algorithm keeps the data in compressed state as long as possible and decompresses only the block which needs to be modified or appended with new data. This saves the repeated compression-decompression cycles required for each modification by other approaches till date. Moreover, the algorithm allows a word to be directly located in the compressed code. The experimental results presented in this paper show that the algorithm gives a large processing time improvement over the standard LZW compression algorithm.

2 Related work

Most of the data compression algorithms based on adaptive dictionaries have their roots in two algorithms developed in 1977 and 1978, known as LZ77 [6] and LZ78 [7] respectively. Various improved variants of the LZ78 algorithm have been designed. The most important modification was done by Terry Welch, published in 1984 and known as LZW [5].

Many variants of the LZW algorithm also exist today [1], [2], [3], [8], [9], [10], many of which are specific to particular applications. Relatively more advanced methods based on coding words as the basic unit [12], [13], [14] have been designed. These compression techniques use the word as the unit placed in the dictionary that aids in compression. Experiments with word-based LZW have been done by Horspool and Cormack [11]. Further improved approaches for database compression based on the structure of stored data have been proposed [15]. There has also been research on efficient database compression and data retrieval systems which retrieve the required data from compressed text efficiently [16], [17], [18]. New techniques that create blocks of data have also been proposed [17]. Moffat has described a technique for dynamic data compression [19].

In this paper, we have proposed a new technique of data compression that creates blocks of data while compressing the input text. This block structure forms the base for the efficiency of the data append and update algorithm.

3 Improved dynamic documents algorithm

In this section, we outline the algorithm designed by us for dynamic documents. The algorithm is inspired by the LZW compression algorithm. The main idea behind our algorithm is to keep the data in compressed state as long as possible and to decompress only the minimal part with each update. The algorithm works by creating blocks of text as it parses the input for compression. It creates an index table of blocks which is used by the data update and modification algorithm.

3.1 Blocks and index table creation

The key point of the algorithm is the creation of blocks of input text while compressing it. The compression algorithm adds every new substring of characters it sees to the lexicon, which is implemented using a trie data structure.

Figure 2. Index structure with the corresponding entries per block.
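For concreteness, here is a minimal Python sketch of how the per-block index entries described above could be represented. The field names are our own illustration, not taken from the paper:

    from dataclasses import dataclass

    @dataclass
    class BlockIndexEntry:
        """Preprocessing information recorded for one compressed block."""
        block_id: int
        word_count: int   # number of words seen in this block of input text
        char_count: int   # number of characters seen in this block
        start_bit: int    # offset of the block's first bit in the compressed stream
        end_bit: int      # offset of the block's last bit in the compressed stream

    # The index table is simply the list of entries, one per block,
    # embedded alongside the compressed text.
    index_table: list[BlockIndexEntry] = []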

3.2 Compression algorithm

The algorithm uses a trie data structure for storing the lexicon. The lexicon is initialized with all one-character strings at the start of the algorithm. The compression algorithm outputs a single code for a string of characters, and for each compressed code generated, the corresponding substring is entered into the lexicon. When the trie gets filled up, it is removed, a new trie is made afresh, and an entry is made into the index table.

Figure 1. Lexicon implementation by trie data structure.

The algorithm assigns an integer to every new node added to the lexicon, and these integers form the compressed codes for the corresponding substrings. If each compressed code is of x bits, there can be 2^x different nodes in the trie. After the creation of 2^x nodes, the trie is completely filled and no more nodes can be added.

No. of bits per compressed code: x
No. of nodes in lexicon: 2^x                  (1)

Our algorithm considers the amount of text which fills up the trie once as a block. When a block of data is compressed, the algorithm creates an entry corresponding to the block in an index table, along with the number of words and characters encountered in the block. The algorithm adds starting-pointer and ending-pointer entries associated with each block in the index table, corresponding respectively to the first bit and the last bit of that block in the compressed stream. This creates a data structure with preprocessing information which is used by the update and modification algorithm.

Figure 3. Compression algorithm.
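To make the compression loop concrete, the following is a minimal Python sketch of the block-forming scheme described above. It is an illustration under our own assumptions, not the paper's implementation: a plain dictionary stands in for the trie, the lexicon is seeded with the 256 single-byte strings, and each index entry records only (start bit, end bit, character count), omitting the word count for brevity.

    def compress_blocks(text, x=12):
        # LZW-style compression that starts a new block whenever the
        # lexicon fills up (Eq. 1: at most 2^x nodes per block).
        # Assumes x > 8 and ord(c) < 256 for every input character.
        max_nodes = 1 << x
        codes, index_table = [], []
        block_start = 0          # index of the first code of the current block
        char_count = 0           # characters consumed in the current block
        lexicon = {chr(i): i for i in range(256)}   # all one-character strings
        w = ""
        for c in text:
            char_count += 1
            wc = w + c
            if wc in lexicon:
                w = wc
            else:
                codes.append(lexicon[w])            # emit code for longest match
                lexicon[wc] = len(lexicon)          # new node gets the next integer
                w = c
                if len(lexicon) == max_nodes:       # trie full: close this block
                    codes.append(lexicon[w])        # flush the pending string
                    index_table.append((block_start * x, len(codes) * x - 1, char_count))
                    block_start, char_count, w = len(codes), 0, ""
                    lexicon = {chr(i): i for i in range(256)}
        if w:
            codes.append(lexicon[w])                # flush at end of input
        if len(codes) > block_start:                # record the final, partial block
            index_table.append((block_start * x, len(codes) * x - 1, char_count))
        return codes, index_table

Each block in this sketch is self-contained: it begins with a fresh lexicon, so it can later be decompressed without touching any other block, which is exactly what the index table makes cheap to do.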
Finally, the index table is embedded with the compressed text. This requires extra space, but since the index table is very small compared to the input text, the overhead is small.

3.3 Decompression algorithm

The decompression algorithm does not require the lexicon created by the compression algorithm to be passed along with the compressed text. The lexicon is built exactly as done during compression: for each code a substring is generated and a new entry is made in the lexicon. The decompression algorithm generates text in blocks. The algorithm also supports individual block decompression and does not require the whole compressed code to be decompressed. The index table passed with the compressed text allows the required block to be located and individually decompressed using the following decompression algorithm.

Figure 4. Decompression algorithm.

This block-wise decompression approach is highly useful for dynamic documents, as described in the following section. The partial block-wise decompression supported by our algorithm saves the repeated compression-decompression cycles of the whole text with each modification. Moreover, new data can be appended to the already compressed text with minimal decompression overhead.

3.4 Data update and modification

The block feature of our algorithm speeds up modification and update of compressed text. This new algorithm decompresses only the block having the required data, not the whole compressed text, and saves the repeated compression-decompression cycles of the same text. To handle new data to be appended, the following steps are followed in order:

1. The starting pointer corresponding to the last block is fetched from the index table.
2. This block is decompressed by the decompression algorithm described in section 3.3. This creates the lexicon corresponding to the last block in memory.
3. The text obtained in step 2, along with the new appended text, is compressed using the compression algorithm described in section 3.2.
4. The entries corresponding to this block are modified in the index table when the lexicon implemented using the trie data structure gets filled up or the new input text finishes, whichever occurs first.
5. The entries corresponding to new blocks are appended to the already created index table.
6. The whole index table is embedded with the total compressed text.

The above algorithm requires only the preprocessing of the index table and keeps the compressed text open for further addition of any amount of data without requiring decompression of the whole compressed text.

The algorithm is also capable of handling any modification with minimal overhead. To handle each modification most effectively, the following steps are followed in order (a minimal sketch of block-wise decompression and the append path is given after this list):

1. Based on the word number or character number, the appropriate block corresponding to the data to be modified is identified.
2. The starting pointer and the ending pointer for the block are fetched from the index table.
3. This block is decompressed by the decompression algorithm described in section 3.3. This creates the lexicon corresponding to this block in memory.
4. The decompressed text obtained is allowed to be modified. When the modification is done, the modified text is compressed using the algorithm described in section 3.2.
5. The entries corresponding to this block are modified in the index when the lexicon implemented using the trie data structure gets filled up or the new input text finishes, whichever occurs first.
6. Any modification may require adding a new block at the end of the block used above. If this occurs, new blocks of data are created and added to the stream of compressed text.
7. The entries corresponding to these new blocks are added to the already created index table.
8. The compressed code of the rest of the blocks is appended to the above stream without any decompression or compression, along with their corresponding entries to the index table.
9. Finally, the whole index table is embedded with the total compressed text.
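As a rough illustration of sections 3.3 and 3.4 together, the sketch below decodes one self-contained block and uses it to implement the append path. It builds on the hypothetical compress_blocks sketch from section 3.2 and keeps the compressed codes as a Python list rather than a real bit stream:

    def decompress_block(codes):
        # Standard LZW decoding of one block: the lexicon is rebuilt from
        # the codes themselves, so it is never stored with the text.
        if not codes:
            return ""
        lexicon = {i: chr(i) for i in range(256)}
        w = lexicon[codes[0]]
        out = [w]
        for code in codes[1:]:
            entry = lexicon[code] if code in lexicon else w + w[0]  # KwKwK case
            out.append(entry)
            lexicon[len(lexicon)] = w + entry[0]    # mirror the compressor's adds
            w = entry
        return "".join(out)

    def append_text(codes, index_table, new_text, x=12):
        # Append path of section 3.4: decompress only the last block,
        # recompress it together with the new text, and splice the results.
        start_bit, _, _ = index_table[-1]
        cut = start_bit // x                        # first code of the last block
        tail = decompress_block(codes[cut:])
        new_codes, new_entries = compress_blocks(tail + new_text, x)
        offset = cut * x                            # rebase the new bit offsets
        rebased = [(s + offset, e + offset, n) for (s, e, n) in new_entries]
        return codes[:cut] + new_codes, index_table[:-1] + rebased

The modification steps follow the same pattern, except that the target block is located by word or character number via the per-block counts in the index, and the untouched trailing blocks are copied over verbatim.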

The modification may require addition of one or more new blocks of data in between. If a new block is created from a small amount of text, that particular block has a poor compression ratio, since the compression in our algorithm improves as the lexicon table grows. But since the block is small, the portion of the text with a low compression ratio is also small. If the amount of data present in the new block is large, it has as good a compression ratio as the other blocks, because large data creates a large lexicon table, which improves the compression ratio.

4 Implementation details and performance results

The index data structure designed in section 3.1 leads directly to one block of compressed code, so we need not decompress the whole compressed text to handle an update. The algorithm creates blocks of input text depending on the number of bits per compressed code. For x bits per compressed code, the lexicon table can hold 2^x nodes. Hence, the number of words per block keeps on increasing with the number of bits per compressed code; for example, moving from 12-bit to 13-bit codes doubles the lexicon capacity from 4096 to 8192 nodes.

For examining and comparing the efficiency of our proposed algorithm, we have used the Canterbury Corpus [20]. The collection is considered the main benchmark for comparing compression methods. The following pattern was observed on text files of the Canterbury Corpus by testing our algorithm with a varying number of bits per compressed code.

Figure 5. The variation in the number of words per block with the number of bits per compressed code.

Figure 6. The variation in the number of characters per block with the number of bits per compressed code.

The above graphs show the general pattern followed by text files: the number of words per block of compressed text keeps on increasing exponentially with the number of bits per compressed code. The more bits per compressed code, the more nodes can be accommodated in the lexicon, and thus the more words of the original text can be compressed per block by the compression algorithm of section 3.2. The number of characters per compressed block follows the same pattern as the number of words.

The proposed algorithm uses the lexicon for storing the already encountered substrings. The more bits per compressed code, the more nodes can be accommodated in the lexicon, and increasing the number of nodes in the lexicon lets a single compressed code encode a longer string. Hence, increasing the number of bits in the compressed code improves the compression ratio.

Figure 7. The variation in compression ratio of various files of the Canterbury Corpus with the number of bits per compressed code.

Compression Ratio = No. of bits in original text / No. of bits in compressed text    (2)
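As a usage illustration of Eq. (2), assuming the hypothetical compress_blocks sketch from section 3.2 and an 8-bit input alphabet (the file name is just an example):

    with open("alice29.txt", encoding="latin-1") as f:   # any Canterbury Corpus text file
        text = f.read()

    codes, table = compress_blocks(text, x=12)
    ratio = (len(text) * 8) / (len(codes) * 12)          # Eq. (2): original bits / compressed bits
    print(f"{len(table)} blocks, compression ratio {ratio:.3f}")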

Authorized licensed use limited to: INDIAN SCHOOL OF MINES. Downloaded on December 7, 2009 at 04:38 from IEEE Xplore. Restrictions apply.
The above graph shows the compression ratio for different text files of the Canterbury Corpus. The data obtained shows that the compression ratio improves with an increasing number of bits per compressed code. Since increasing the number of bits per compressed code allows more nodes to be accommodated in the lexicon, this improves the compression ratio.

The graph also shows that for smaller files, after a certain length of compressed code, the compression ratio decreases as the number of bits per compressed code increases. This happens because the higher bits are not utilized and are 0 for all the compressed codes, thus increasing the code length unnecessarily. Hence, the correct behavior is shown by text which is larger than one block, thus utilizing the MSB (most significant bit) in the compressed codes. The general pattern followed by large files is demonstrated by lcet10.txt and plrabn12.txt of the corpus. The results obtained are shown in the following table:

Table 1. Variation in average compression ratio for large text files (lcet10.txt and plrabn12.txt) of the Canterbury Corpus with the number of bits per compressed code.

No. of bits per     No. of nodes in     Compression
compressed code     the lexicon         Ratio (avg.)
 9                     512                1.326
10                    1024                1.529
11                    2048                1.678
12                    4096                1.809
13                    8192                1.933
14                   16384                2.050
15                   32768                2.172
16                   65536                2.243

To demonstrate the large improvement in processing time made by our data update algorithm over the standard LZW algorithm, we use the files of the Canterbury Corpus. For large files, which show a big improvement over the standard algorithm, we use the Large Corpus.

To demonstrate the data append algorithm, we append the data of one file after another to the already compressed text using the algorithm described in section 3.4. The comparison of the processing time between the standard LZW algorithm and our algorithm on the same machine, with the same operating system, under identical load conditions, is shown. For the demonstration, 12 bits per compressed code have been used.

Table 2. Comparison of processing time between the standard LZW compression algorithm and our proposed algorithm. The data of text files of the Canterbury Corpus and Large Corpus are appended.

Name of file    Size of file    File size before    Processing time by    Processing time by
appended        appended        appending data      standard LZW          our proposed
                (bytes)         (bytes)             algorithm (sec.)      algorithm (sec.)
alice29.txt       152089              0                1.117165              1.117165
asyoulik.txt      125179         152089                1.602373              0.969255
fields.c           11150         277268                1.346438              0.138950
lcet10.txt        426754         288418                4.305077              3.039200
plrabn12.txt      481861         715172                5.994635              3.320711
bible.txt        4047392        1197033               27.192912             22.922099
world192.txt     2473400        5244425               32.930046             16.187454

Figure 8. The difference in processing time of the standard LZW compression algorithm and the algorithm proposed in this paper.

The above graph shows that the proposed algorithm is highly efficient in comparison to the standard LZW algorithm for dynamic documents. As the size of the document keeps on increasing, the difference between the processing time of the standard algorithm and our proposed algorithm keeps on increasing; for the last append (world192.txt), the proposed algorithm takes about half the time of the standard one. This difference arises because the processing time of the standard LZW algorithm depends on the decompression time of the whole compressed text plus the compression time of the new text, while the processing time of our proposed algorithm depends on the decompression time of one block plus the compression time of the new text. Thus, the algorithm is highly useful for applications where the amount of data in the same document keeps on increasing frequently.

5 Conclusions

Compression methods suitable for dynamic documents must have certain special properties. These compression methods must allow update and modification of documents with minimum overhead. Further, these methods must allow data to be appended to the already compressed text with minimum decompression of the already compressed text. The algorithm proposed by us supports the dynamic update of such documents without repeated decompression-compression cycles. The update of these documents requires decompression of only the block containing the data to be updated. Experimental implementation of the algorithm and comparison of results proved this to be a major improvement over the standard LZW compression algorithm.

References

[1] J.A. Storer and M. Cohn, editors, Proc. 1999 IEEE Data Compression Conference, Los Alamitos, California: IEEE Computer Society Press, 1999.

[2] J.A. Storer and M. Cohn, editors, Proc. 2000 IEEE Data Compression Conference, Los Alamitos, California: IEEE Computer Society Press, 2000.

[3] D. Salomon, Data Compression, Springer Verlag, 1998.

[4] I.H. Witten, A. Moffat and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, 1994.

[5] T.A. Welch, "A technique for high-performance data compression", IEEE Computer, Vol. 17, No. 6, pp. 8-19, 1984.

[6] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Transactions on Information Theory, Vol. 23, Issue 3, pp. 337-343, 1977.

[7] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Transactions on Information Theory, Vol. IT-24, Issue 5, pp. 530-536, 1978.

[8] Hu Yuanfu and Wu Xunsen, "The methods of improving the compression ratio of LZ77 family data compression algorithms", 3rd International Conference on Signal Processing, Vol. 1, pp. 698-701, 1996.

[9] R.N. Williams, "An extremely fast Ziv-Lempel data compression algorithm", Data Compression Conference, pp. 362-371, 1991.

[10] M.S. Pinho, W.A. Finamore and W.A. Pearlman, "Fast multi-match Lempel-Ziv", Data Compression Conference, p. 545, 1999.

[11] R.N. Horspool and G.V. Cormack, "Constructing word-based text compression algorithms", IEEE Data Compression Conference, 1992.

[12] D. Pirckl, "Word-based LZW compression", Master thesis, Palacky University, Olomouc, Czech Republic, 1998.

[13] J. Yang and S.A. Savari, "Dictionary-based English text compression using word endings", Data Compression Conference, 2007.

[14] K.S. Ng, L.M. Cheng and C.H. Wong, "Dynamic word based text compression", Proceedings of the Fourth International Conference on Document Analysis and Recognition, Vol. 1, pp. 412-416, 1997.

[15] M. Tilgner, M. Ishida and T. Yamaguchi, "Recursive block structured data compression", Data Compression Conference, 1997.

[16] T.C. Bell, "Data compression in full-text retrieval systems", Journal of the American Society for Information Science, Vol. 44, No. 9, pp. 508-531, 1993.

[17] W.K. Ng and C.V. Ravishankar, "Block-oriented compression technique for large statistical databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 2, pp. 314-328, 1997.

[18] A. Moffat and R. Wan, "RE-store: A system for compressing, browsing and searching large documents", Proc. 8th Intl. Symp. on String Processing and Information Retrieval, pp. 162-174, 2001.

[19] A. Moffat, N. Sharman and J. Zobel, "Static compression for dynamic texts", IEEE Data Compression Conference, pp. 126-135, 1994.

[20] I.H. Witten and T. Bell, The Calgary/Canterbury text compression corpus. Anonymous ftp from ftp.cpsc.ucalgary.ca:/pub/text.compression.corpus/text.compression.corpus.tar.Z.
