A Novel Approach of Data Compression For Dynamic Data
Storing the index table requires extra space, but since the index table is very small compared to the total input text, the overhead is small.

3.3 Decompression algorithm

The decompression algorithm does not require the lexicon created by the compression algorithm to be passed along with the compressed text. The lexicon is built exactly as it was during compression: for each code, a substring is generated and a new entry is made in the lexicon. The decompression algorithm generates text in blocks. The algorithm also supports individual block decompression and does not require the whole compressed code to be decompressed. The index table passed with the compressed text allows the required block to be located and individually decompressed.
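As a rough illustration, a minimal block-wise LZW decoder might look like the following Python sketch. The helper names, the list-of-integers code stream, and the flat index-table layout (one starting offset per block) are assumptions made for the sketch, not the paper's exact data structures; in particular, the paper builds its lexicon on a trie, which is simplified to a dictionary here.

```python
def decompress_block(codes, code_bits=12):
    """Decode one block of LZW codes. The lexicon is rebuilt exactly as
    the compressor built it, so no lexicon travels with the text."""
    if not codes:
        return b""
    max_entries = 1 << code_bits                    # n-bit codes address 2**n entries
    lexicon = {i: bytes([i]) for i in range(256)}   # seeded with all single bytes
    prev = lexicon[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in lexicon:
            entry = lexicon[code]
        elif code == len(lexicon):                  # the classic KwKwK corner case
            entry = prev + prev[:1]
        else:
            raise ValueError("corrupt code stream")
        out += entry
        if len(lexicon) < max_entries:              # grow until this block's lexicon is full
            lexicon[len(lexicon)] = prev + entry[:1]
        prev = entry
    return bytes(out)

def decompress_one_block(index_table, codes, block_no, code_bits=12):
    """Locate a single block via the index table and decode only it,
    leaving the rest of the compressed text untouched."""
    start = index_table[block_no]
    end = (index_table[block_no + 1]
           if block_no + 1 < len(index_table) else len(codes))
    return decompress_block(codes[start:end], code_bits)
```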
3.4 Data append algorithm

The data append algorithm adds new data to the already compressed text without repeated compression-decompression cycles of the same text. To handle the new data to be appended, the following steps are performed in order (a sketch in code follows the list):

1. The starting pointer corresponding to the last block is fetched from the index table.
2. This block is decompressed by the decompression algorithm described in section 3.3. This creates the lexicon corresponding to the last block in memory.
3. The text obtained in step 2, along with the newly appended text, is compressed using the compression algorithm described in section 3.2.
4. The entries corresponding to this block are modified in the index table when the lexicon, implemented using a trie data structure, fills up or the new input text finishes, whichever occurs first.
5. The entries corresponding to new blocks are appended to the already created index table.
6. The whole index table is embedded with the total compressed text.
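Continuing the Python sketch above, the six steps might be realized as follows. The function compress_blocks stands in for the compression algorithm of section 3.2, simplified to a dictionary-based lexicon rather than the paper's trie; a block is closed whenever the lexicon fills, as the paper describes, and all text is bytes in this sketch.

```python
def compress_blocks(text, code_bits=12):
    """LZW-compress text, starting a fresh lexicon (i.e. a new block)
    whenever the current lexicon fills up. Returns the emitted codes
    and the offset at which each block starts."""
    max_entries = 1 << code_bits
    codes, block_starts = [], [0]
    lexicon = {bytes([i]): i for i in range(256)}
    w = b""
    for byte in text:
        wc = w + bytes([byte])
        if wc in lexicon:
            w = wc
            continue
        codes.append(lexicon[w])
        if len(lexicon) < max_entries:
            lexicon[wc] = len(lexicon)        # extend the current block's lexicon
        else:                                 # lexicon full: close block, open a new one
            block_starts.append(len(codes))
            lexicon = {bytes([i]): i for i in range(256)}
        w = bytes([byte])
    if w:
        codes.append(lexicon[w])
    return codes, block_starts

def append_data(index_table, codes, new_text, code_bits=12):
    """Steps 1-6: rework only the last block to absorb new_text."""
    last_start = index_table[-1]                             # step 1
    tail = decompress_block(codes[last_start:], code_bits)   # step 2
    new_codes, starts = compress_blocks(tail + new_text, code_bits)   # step 3
    codes = codes[:last_start] + new_codes                   # step 4
    index_table = index_table[:-1] + [last_start + s for s in starts] # step 5
    return index_table, codes    # step 6: caller embeds the table with the codes
```

Note that only the last block is ever decompressed; everything before its starting pointer is copied through untouched, which is the source of the processing-time savings reported in section 4.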
decompression or compression, along with their corresponding entries to the index table.
9. Finally, the whole index table is embedded with the total compressed text.
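The paper does not spell out how the index table is embedded with the compressed text. One plausible layout, purely an assumption for illustration, stores the codes first, then the table, then a fixed-size trailer giving the table's position, so a reader can locate the table by seeking to the end of the file:

```python
import struct

def embed(index_table, codes, path):
    """Write codes, then the index table, then a trailer holding the
    table's offset and entry count. Layout is illustrative only."""
    with open(path, "wb") as f:
        for c in codes:
            f.write(struct.pack("<H", c))   # 16-bit words for simplicity;
                                            # the paper's demos use 12-bit codes
        table_offset = f.tell()
        for start in index_table:           # one starting pointer per block
            f.write(struct.pack("<I", start))
        f.write(struct.pack("<QI", table_offset, len(index_table)))
```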
Figure 5. The variation in the number of words per block with the number of bits per compressed code.

Compression Ratio = No. of bits in original text / No. of bits in compressed text        (2)
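For example, under this definition a text of 4,000 bits that compresses to 1,000 bits has a compression ratio of 4; values above 1 indicate compression, and higher values are better.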
The above graph shows the compression ratio for different text files of the Canterbury Corpus. The data obtained show that the compression ratio improves as the number of bits per compressed code increases, since a longer compressed code allows more nodes to be accommodated in the lexicon.

The graph also shows that for smaller files, beyond a certain compressed-code length, the compression ratio decreases as the number of bits per compressed code increases. This happens because the higher bits are not utilized and are 0 for all the compressed codes, lengthening each code unnecessarily. Hence, the correct behavior is shown by text that is larger than one block and therefore utilizes the MSB (most significant bit) of the compressed codes. The general pattern followed by large files is demonstrated by lcet10.txt and plrabn12.txt of the corpus. The results obtained are shown in the following table:
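For instance, n-bit codes can address a lexicon of at most 2^n entries, so 16-bit codes allow 65,536 lexicon nodes where 12-bit codes allow only 4,096, but each emitted code also costs 4 more bits; a file too small to ever fill the 12-bit lexicon pays that cost with no benefit.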
Table 1. Variation in average compression ratio for large text files (lcet10.txt and plrabn12.txt) of the Canterbury Corpus with the number of bits per compressed code.
To demonstrate the large improvement in processing time made by our data update algorithm over the standard LZW algorithm, we use the files of the Canterbury Corpus. For large files, which show a big improvement over the standard algorithm, we use the Large Corpus.

To demonstrate the data append algorithm, we append the data of one file after another to the already compressed text using the algorithm described in section 3.4. The comparison of processing time between the standard LZW algorithm and our algorithm, on the same machine with the same operating system under identical load conditions, is shown below. For demonstration, 12 bits per compressed code have been used.

Table 2. Comparison of processing time between the standard LZW compression algorithm and our proposed algorithm. The data of text files of the Canterbury Corpus and the Large Corpus are appended one after another.

Name of file appended | Size of text file appended (bytes) | File size before appending data (bytes) | Processing time, standard LZW (sec.) | Processing time, proposed algorithm (sec.)
alice29.txt  | 152089  | 0       | 1.117165  | 1.117165
asyoulik.txt | 125179  | 152089  | 1.602373  | 0.969255
fields.c     | 11150   | 277268  | 1.346438  | 0.138950
lcet10.txt   | 426754  | 288418  | 4.305077  | 3.039200
plrabn12.txt | 481861  | 715172  | 5.994635  | 3.320711
bible.txt    | 4047392 | 1197033 | 27.192912 | 22.922099
world192.txt | 2473400 | 5244425 | 32.930046 | 16.187454

Figure 8. The difference in processing time of the standard LZW compression algorithm and the algorithm proposed in this paper.

The above graph shows that the proposed algorithm is highly efficient in comparison to the standard LZW algorithm for dynamic documents. As the size of the document keeps increasing, the difference between the processing times of the standard algorithm and our proposed algorithm keeps increasing. This difference arises because the processing time of the standard LZW algorithm depends on the decompression time of the whole compressed text plus the compression time of the new text, while the processing time of our proposed algorithm depends on the decompression time of one block plus the compression time of the new text. Thus, the algorithm is highly useful for applications where the amount of data in the same document increases frequently.
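In rough terms (the notation here is ours, not the paper's), if N is the size of the already compressed document, m the size of the appended text, and B the size of a single block, then

T_standard ≈ T_dec(N) + T_comp(N + m),    T_proposed ≈ T_dec(B) + T_comp(m),

so the gap grows with the document size N, while the proposed algorithm's cost is bounded by the block size and the size of the new text.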
5 Conclusions

Compression methods suitable for dynamic documents must have certain special properties. They must allow update and modification of documents with minimum overhead. Further, they must allow data to be appended to the already compressed text with minimum decompression of that text. The algorithm proposed here supports the dynamic update of such documents without repeated decompression-compression cycles: an update requires decompression of only the block containing the data to be updated. Experimental implementation of the algorithm and comparison of results showed this to be a major improvement over the standard LZW compression algorithm.