MINI PROJECT
2021188029
Karpagam K
Ph. No.: 8939245646
Email ID: [email protected]
ABSTRACT
1 INTRODUCTION
2 PROBLEM STATEMENT AND SOLUTION
3 ALGORITHM
4 IMPLEMENTATION
4.1 Code
5 RESULT & PERFORMANCE
6 REFERENCES
INTRODUCTION
Suppose we need to store a string of length 1000 that comprises only the characters a, e, n, and z.
Storing it as 1-byte characters requires 1000 bytes (8000 bits) of space. If the symbols in the
string are instead encoded as (00=a, 01=e, 10=n, 11=z), the 1000 symbols can be stored in 2000
bits, saving 6000 bits of memory.
The number of occurrences of a symbol in a string is called its frequency. When there is
considerable difference between the frequencies of the symbols in a string, variable-length codes
can be assigned to the symbols based on their relative frequencies: the most common symbols
are represented by shorter codes than the less common ones. The greater the variation in the
relative frequencies of the symbols, the more advantageous variable-length codes are for
reducing the size of the coded string.
Since the codes are of variable length, it is necessary that no code is a prefix of another, so that
the codes can be decoded unambiguously. Such codes are called prefix codes (sometimes "prefix-
free codes"): the code representing one symbol is never a prefix of the code representing any
other symbol. Huffman coding is so widely used for creating prefix codes that the term
"Huffman code" is sometimes used as a synonym for "prefix code", even when the code is not
produced by Huffman's algorithm.
Huffman was able to design the most efficient compression method of this type: no other
mapping of individual source symbols to unique bit strings (i.e. codes) requires less space
for storing a piece of text, provided the actual symbol frequencies agree with those used to
create the code.
BASIC TECHNIQUE
In Huffman coding, the complete set of codes can be represented as a binary tree, known as a
Huffman tree. A Huffman tree is a coding tree, i.e. a full binary tree in which each leaf is an
encoded symbol and the path from the root to that leaf is its code word. By convention, bit '0'
means following the left child and bit '1' means following the right child, so each level of the
tree contributes one code bit. Thus more frequent symbols lie near the root and are coded with
few bits, while rare symbols lie far from the root and are coded with many bits.
First, the source symbols along with their frequencies of occurrence are stored as leaf nodes
in a regular array, whose size depends on the number of symbols, n. A finished tree has up to
n leaf nodes and n − 1 internal nodes.
PROBLEM DEFINITION:-
Given
A set of symbols and their weights (usually proportional to probabilities or equal to their frequencies).
Find
A prefix-free binary code (a set of code words) with
minimum expected code word length (equivalently, a tree with minimum weighted
path length from the root).
The program is written in C++, and the expected code word length is computed.
The simplest construction algorithm uses a priority queue in which the node with the lowest
probability is given the highest priority:
Step 1:- Create a leaf node for each symbol and add it to the priority queue (i.e. create a
min heap of binary trees and heapify it).
Step 2:- While there is more than one node in the queue (i.e. the min heap):
i. Remove the two nodes of highest priority (lowest probability or frequency)
from the queue.
ii. Create a new internal node with these two nodes as children and with
probability equal to the sum of the two nodes' probabilities (frequencies).
iii. Add the new node to the queue.
Step 3:- The remaining node is the root node and the Huffman tree is complete.
Joining trees by frequency is the same as merging sequences by length in an optimal merge.
Since a node with only one child is never optimal, any Huffman code corresponds to a full
binary tree.
Definition of optimal merge: let D = {n1, ..., nk} be the set of lengths of the sequences to be
merged. Take the two shortest sequences, ni, nj ∈ D, i.e. ni ≤ n and nj ≤ n for all other n ∈ D.
Merge these two sequences. The new set is D' = (D − {ni, nj}) ∪ {ni + nj}. Repeat until only
one sequence remains.
Since efficient priority queue data structures require O(log n) time per insertion, and a tree
with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time.
The worst case for Huffman coding (or, equivalently, the longest Huffman coding for a
set of characters) is when the distribution of frequencies follows the Fibonacci
numbers.
If the estimated probabilities of occurrence of all symbols are the same and the number of
symbols is a power of two, Huffman coding is the same as simple binary block encoding, e.g.,
ASCII coding.
Although Huffman's original algorithm is optimal for symbol-by-symbol coding (i.e. a
stream of unrelated symbols) with a known input probability distribution, it is not optimal when
the symbol-by-symbol restriction is dropped, or when the probability mass functions are
unknown, not identically distributed, or not independent (e.g., "cat" is more common than "cta").
PROGRAM CODE
#include <iostream>
#include <cmath>
#include <cstring>
#include <algorithm>
using namespace std;

struct node
{
    char info;     // the symbol
    int freq;      // its frequency
    char *code;    // the Huffman code assigned to the symbol
    node *Llink;
    node *Rlink;
};
class HuffmanCode
{
private:
    BinaryTree HuffmanTree; // a minimum weighted external path length tree
public:
    HuffmanCode();
};
HuffmanCode::HuffmanCode()
{
    minHeap Heap;
    // The Huffman tree is built from the bottom up: the symbols with the
    // lowest frequencies end up at the bottom of the tree, which leads to
    // longer codes for low-frequency symbols and hence shorter codes for
    // high-frequency symbols, giving an OPTIMAL expected code length.
    while (Heap.T[0].root->freq > 1)   // T[0].root->freq holds the heap size
    {
        // The first two trees with minimum priority (i.e. frequency) are taken,
        BinaryTree l = Heap.dequeue();
        cout << "\nAfter dequeueing " << l.root->freq << endl;
        BinaryTree r = Heap.dequeue();
        cout << "\nAfter dequeueing " << r.root->freq << endl;
        // merged into a single tree and enqueued back into the heap.
        BinaryTree combined;
        combined.root = new node;
        combined.root->info = 0;
        combined.root->freq = l.root->freq + r.root->freq;
        combined.root->Llink = l.root;
        combined.root->Rlink = r.root;
        Heap.enqueue(combined);
    }
    // The process continues till only one tree is left in the heap.
    cout << "\nThe process is completed and Huffman Tree is obtained\n";
    HuffmanTree = Heap.T[1];   // this tree is our Huffman tree used for coding
    delete [] Heap.T;
    cout << "Traversal of Huffman Tree\n\n";
    HuffmanTree.print();
    cout << "\nThe symbols with their codes are as follows\n";
    HuffmanTree.assign_code(0);   // codes are assigned to the symbols
    cout << "Enter the string to be encoded by Huffman Coding: ";
    char *str = new char[30];
    cin >> str;
    HuffmanTree.encode(str);
    cout << "Enter the code to be decoded by Huffman Coding: ";
    char *cd = new char[50];
    cin >> cd;
    cout << "Enter its code length: ";
    int length;
    cin >> length;
    HuffmanTree.decode(cd, length);
}
minHeap::minHeap()
{
    cout << "Enter no. of symbols:";
    cin >> n;
    T = new BinaryTree[n + 1];
    // T[0] is a sentinel whose freq field stores the current heap size.
    T[0].root = new node;
    T[0].root->freq = n;
    for (int j = 1; j <= n; j++)   // read the n symbols and their frequencies
    {
        T[j].root = new node;
        cout << "Enter symbol " << j << ": ";
        cin >> T[j].root->info;
        cout << "Enter its frequency: ";
        cin >> T[j].root->freq;
        T[j].root->Llink = T[j].root->Rlink = NULL;
    }
    int i = (int)(n / 2);   // heapification is started from the PARENT element
                            // of the last ('n th') element in the heap
    cout << "\nAs elements are entered\n";
    print();
    while (i > 0)
    {
        heapify(i);
        i--;
    }
}
void minHeap::heapify(int i)
{
    // Sift the tree at index i down until the min-heap property holds.
    // T[0].root->freq holds the number of trees currently in the heap.
    while (1)
    {
        if (2*i > T[0].root->freq)       // no children: nothing to do
            return;
        if (2*i + 1 > T[0].root->freq)   // only a left child
        {
            if (T[2*i].root->freq <= T[i].root->freq)
                swap(T[2*i], T[i]);
            return;
        }
        int m = min(T[2*i].root->freq, T[2*i+1].root->freq); // smaller child
        if (T[i].root->freq <= m)
            return;
        if (T[2*i].root->freq <= T[2*i+1].root->freq)
        { swap(T[2*i], T[i]); i = 2*i; }
        else
        { swap(T[2*i+1], T[i]); i = 2*i + 1; }
    }
}
void minHeap::enqueue(BinaryTree b)
{
    // Place the new tree at the end of the heap and restore the heap
    // property by heapifying every ancestor on the path up to the root.
    T[0].root->freq++;
    T[T[0].root->freq] = b;
    int i = (int)(T[0].root->freq / 2);
    while (i > 0)
    {
        heapify(i);
        i = (int)(i / 2);
    }
}
void BinaryTree::assign_code(int i)
{
    if (root == NULL)
        return;
    if (isleaf(root))
    {
        root->code[i] = '\0';
        cout << root->info << "\t" << root->code << "\n";
        return;
    }
    // Extend the parent's partial code by one bit: '0' for the left
    // subtree and '1' for the right subtree.
    BinaryTree l, r;
    l.root = root->Llink;
    r.root = root->Rlink;
    l.root->code = new char[i + 2];
    r.root->code = new char[i + 2];
    if (i > 0)
    {
        strncpy(l.root->code, root->code, i);
        strncpy(r.root->code, root->code, i);
    }
    l.root->code[i] = '0';
    r.root->code[i] = '1';
    l.assign_code(i + 1);
    r.assign_code(i + 1);
}
void BinaryTree::print_code(char c)
{
    // Search the tree for the leaf holding symbol c and print its code.
    if (root == NULL)
        return;
    if (isleaf(root))
    {
        if (root->info == c)
            cout << root->code;
        return;
    }
    BinaryTree l, r;
    l.root = root->Llink;
    l.print_code(c);
    r.root = root->Rlink;
    r.print_code(c);
}
void BinaryTree::print()
{
    // Preorder traversal: print each node's symbol and frequency.
    if (root == NULL)
        return;
    cout << root->info << "\t" << root->freq << "\n";
    if (isleaf(root))
        return;
    BinaryTree l, r;
    l.root = root->Llink;
    r.root = root->Rlink;
    l.print();
    r.print();
}
int ispowerof2(int i)
{
    // Returns 1 if i is a power of two (1, 2, 4, 8, ...), else 0.
    if (i <= 0)
        return 0;
    while (i > 1)
    {
        if (i % 2 != 0)
            return 0;
        i = i / 2;
    }
    return 1;
}
int fn(int l)
{
    // fn(l) = 2*fn(l-1) + 1 with fn(0) = fn(1) = 0,
    // i.e. fn(l) = 2^(l-1) - 1 for l >= 1.
    if (l == 1 || l == 0)
        return 0;
    return 2 * fn(l - 1) + 1;
}
REFERENCES
http://www.itl.nist.gov/div897/sqg/dads/HTML/codingTree.html
http://encyclopedia2.thefreedictionary.com/Huffman+tree
http://en.wikipedia.org/wiki/Huffman_coding
The following table provides definitions for terms relevant to this document.

Term         Definition
Binary tree  A data structure in which information is stored as nodes, each of
             which can have at most two children, i.e. nodes that it points to.
Queue        An ordered list in which insert operations are performed at one
             end, called the REAR, and delete operations at the other end,
             called the FRONT. A queue is therefore a First In, First Out
             (FIFO) list.