
DCSE, CEG, ANNA UNIVERSITY CHENNAI - 600025

MINI PROJECT

ME – CSE BIG DATA

HUFFMAN CODES AND ITS IMPLEMENTATION

2021188029
Karpagam K
Ph. No.: 8939245646
Email ID: [email protected]

INSTRUCTOR: Dr AROCKIA XAVIER ANNIE R


TABLE OF CONTENTS

CH. TITLE

ABSTRACT
1 INTRODUCTION
2 PROBLEM STATEMENT AND SOLUTION
3 ALGORITHM
4 IMPLEMENTATION
4.1 Code
5 RESULT & PERFORMANCE
6 REFERENCES
APPENDIX A: PROJECT APPROVAL
APPENDIX B: KEY TERMS

ABSTRACT

Huffman coding is an approach to text compression originally developed by David A. Huffman while he was a Ph.D. student at MIT and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". In computer science and information theory, it is one of many lossless data compression algorithms. It is a statistical compression method that converts characters into variable-length bit strings and produces a prefix code: the most frequently occurring characters are converted into the shortest bit strings and the least frequent into the longest.

CHAPTER 1

INTRODUCTION
Let us suppose we need to store a string of length 1000 that comprises only the characters a, e, n, and z. Storing it as 1-byte characters will require 1000 bytes (or 8000 bits) of space. If the symbols in the string are encoded as (00 = a, 01 = e, 10 = n, 11 = z), then the 1000 symbols can be stored in 2000 bits, saving 6000 bits of memory.

The number of occurrences of a symbol in a string is called its frequency. When there is a considerable difference in the frequencies of different symbols in a string, variable-length codes can be assigned to the symbols based on their relative frequencies: the most common characters are represented using shorter codes than those used for less common symbols. The greater the variation in the relative frequencies of the symbols, the more advantageous it is to use variable-length codes to reduce the size of the coded string.
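For example, suppose (purely as an illustration) that the frequencies in the 1000-character string above are a = 500, e = 300, n = 150 and z = 50. Assigning code lengths of 1, 2, 3 and 3 bits respectively gives a total of 500·1 + 300·2 + 150·3 + 50·3 = 1700 bits, compared with 2000 bits for the fixed 2-bit encoding.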

Since the codes are of variable length, it is necessary that no code is a prefix of another so that the codes can be decoded unambiguously. Such codes are called prefix codes (sometimes "prefix-free codes"): the code representing any particular symbol is never a prefix of the code representing any other symbol. Huffman coding is so widely used for creating prefix codes that the term "Huffman code" is sometimes used as a synonym for "prefix code", even when such a code is not produced by Huffman's algorithm.

Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits (i.e. codes) requires less space to store a piece of text when the actual symbol frequencies agree with those used to create the code.

BASIC TECHNIQUE
In Huffman coding, the complete set of codes can be represented as a binary tree, known as a Huffman tree. A Huffman tree is a coding tree, i.e. a full binary tree in which each leaf is an encoded symbol and the path from the root to a leaf is its code word. By convention, bit '0' represents following the left child and bit '1' represents following the right child, so each level of the tree contributes one code bit. Thus the more frequent characters are near the root and are coded with few bits, while rare characters are far from the root and are coded with many bits.
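As a sketch, continuing the illustrative frequencies a = 500, e = 300, n = 150, z = 50 from above (with one valid choice of left/right placement), the tree and the resulting code words are:

(1000)
|-- 0 : a (500)
`-- 1 : (500)
    |-- 0 : e (300)
    `-- 1 : (200)
        |-- 0 : n (150)
        `-- 1 : z (50)

giving a = 0, e = 10, n = 110 and z = 111.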

HUFFMAN TREE

First of all, the source symbols along with their frequencies of occurrence are stored as
leaf nodes in a regular array, the size of which depends on the number of symbols, n. A finished
tree has up to n leaf nodes and n − 1 internal nodes.
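For the four-symbol example above, the finished tree therefore has 4 leaf nodes and 3 internal nodes, i.e. 2·4 − 1 = 7 nodes in total.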

CHAPTER 2

PROBLEM STATEMENT AND SOLUTION

PROBLEM DEFINITION:-

Given
A set of symbols and their weights (usually proportional to probabilities or equal to their frequencies).

Find
A prefix-free binary code (a set of code words) with minimum expected code word length (equivalently, a tree with minimum weighted path length from the root).
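Expressed as a formula: if symbol i has weight w(i) and code word c(i) of length len(c(i)), the quantity to be minimised is

L(C) = w(1)·len(c(1)) + w(2)·len(c(2)) + ... + w(n)·len(c(n)),

which equals the weighted path length of the coding tree, since len(c(i)) is the depth of the leaf holding symbol i. For the illustrative frequencies used in Chapter 1, L(C) = 500·1 + 300·2 + 150·3 + 50·3 = 1700 bits.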

The program is written in C++, and the expected code word length is computed.

CHAPTER 3
ALGORITHM

The simplest construction algorithm uses a priority queue where the node with lowest probability
is given highest priority:

Step 1:- Create a leaf node for each symbol and add it to the priority queue (i.e. create a min heap of binary trees and heapify it).

Step 2:- While there is more than one node in the queue (i.e. in the min heap):

i. Remove the two nodes of highest priority (lowest probability or lowest frequency) from the queue.
ii. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities (frequencies).
iii. Add the new node to the queue.

Step 3:- The remaining node is the root node and the Huffman tree is complete.
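As a quick trace on the illustrative frequencies from Chapter 1 (a = 500, e = 300, n = 150, z = 50): the queue starts as {50, 150, 300, 500}; the first iteration merges 50 and 150 into a node of frequency 200, the second merges 200 and 300 into 500, and the third merges 500 and 500 into the root of frequency 1000, leaving a single tree.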
Joining trees by frequency is the same as merging sequences by length in an optimal merge. Since a node with only one child is not optimal, any Huffman code corresponds to a full binary tree.

Definition of optimal merge: Let D = {n1, ..., nk} be the set of lengths of the sequences to be merged. Take the two shortest sequences ni, nj ∈ D, such that n ≥ ni and n ≥ nj for all n ∈ D. Merge these two sequences. The new set is D' = (D − {ni, nj}) ∪ {ni + nj}. Repeat until there is only one sequence.

Since efficient priority queue data structures require O(log n) time per insertion, and a tree
with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time.
The worst case for Huffman coding (or, equivalently, the longest Huffman code word for a set of characters) occurs when the distribution of frequencies follows the Fibonacci numbers; for example, the frequencies 1, 1, 2, 3, 5, 8, 13 produce a completely skewed tree in which the longest code word has n − 1 bits.
If the estimated probabilities of occurrence of all the symbols are the same and the number of symbols is a power of two, Huffman coding is the same as simple fixed-length binary block encoding (e.g., ASCII coding).
Although Huffman's original algorithm is optimal for symbol-by-symbol coding (i.e. a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent (e.g., "cat" is more common than "cta").
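The construction in Steps 1-3 can also be sketched compactly with the C++ standard library. The sketch below is illustrative only: the node type, the function names and the sample frequencies are assumptions made here, and it is separate from the full implementation given in Chapter 4.

#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

struct Node
{
    int freq;              // weight of this subtree
    char sym;              // symbol ('\0' for internal nodes)
    Node *left, *right;
};

struct Compare             // lowest frequency = highest priority
{
    bool operator()(const Node *a, const Node *b) const { return a->freq > b->freq; }
};

// Step 1: one leaf per symbol; Step 2: repeatedly merge the two smallest trees;
// Step 3: the single remaining tree is the Huffman tree.
Node* build_huffman(const vector< pair<char,int> > &symbols)
{
    priority_queue<Node*, vector<Node*>, Compare> pq;
    for (size_t k = 0; k < symbols.size(); k++)
        pq.push(new Node{symbols[k].second, symbols[k].first, nullptr, nullptr});
    while (pq.size() > 1)
    {
        Node *l = pq.top(); pq.pop();
        Node *r = pq.top(); pq.pop();
        pq.push(new Node{l->freq + r->freq, '\0', l, r});   // new internal node
    }
    return pq.top();
}

// Print every leaf's code word ('0' = left child, '1' = right child).
void print_codes(const Node *t, const string &code)
{
    if (t == nullptr) return;
    if (t->left == nullptr && t->right == nullptr)
    {
        cout << t->sym << '\t' << code << '\n';
        return;
    }
    print_codes(t->left, code + "0");
    print_codes(t->right, code + "1");
}

int main()
{
    vector< pair<char,int> > symbols = { {'a',500}, {'e',300}, {'n',150}, {'z',50} };
    print_codes(build_huffman(symbols), "");
    return 0;
}

With std::priority_queue giving O(log n) time per push and pop, this sketch follows the O(n log n) bound noted above.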

CHAPTER 4
IMPLEMENTATION

PROGRAM CODE

#include <iostream>
#include <cmath>
#include <cstdlib>   // for system()
using namespace std;

struct node
{
    char info;
    int freq;
    char *code;
    node *Llink;
    node *Rlink;
};

class BinaryTree   // Coding Tree
{
private:
    node *root;
public:
    BinaryTree() { root=NULL; }
    void print();
    void assign_code(int i);
    void print_code(char c);
    void encode(const char str[]);
    void print_symbol(char cd[], int &f, int length);
    void decode(char cd[], int size);
    friend class minHeap;
    friend class HuffmanCode;
};

class minHeap
{
private:
    BinaryTree *T;   // Array of Binary Trees
    int n;           // Number of symbols
public:
    minHeap();
    void heapify(int i);
    BinaryTree dequeue();         // Returns the first Binary Tree of the min heap and
                                  // then heapifies the array of Binary Trees in order of the
                                  // frequencies of their root nodes.
    void enqueue(BinaryTree b);   // To insert another Binary Tree
                                  // and then heapify the array of Binary Trees
    void print();
    friend class HuffmanCode;
};

class HuffmanCode
{
private:
    BinaryTree HuffmanTree;   // (a minimum weighted external path length tree)
public:
    HuffmanCode();
};

HuffmanCode::HuffmanCode()
{
    minHeap Heap;
    // The Huffman Tree is built from bottom to top.
    // The symbols with the lowest frequency are at the bottom of the tree,
    // which leads to longer codes for lower frequency symbols and hence
    // shorter codes for higher frequency symbols, giving OPTIMAL code length.
    while (Heap.T[0].root->freq>1)
    {
        // The first two trees with min. priority (i.e. frequency) are taken and
        BinaryTree l=Heap.dequeue();
        cout<<"\nAfter dequeueing "<<l.root->freq<<endl;
        Heap.print();
        BinaryTree r=Heap.dequeue();
        cout<<"\nAfter dequeueing "<<r.root->freq<<endl;
        Heap.print();
        // a new tree is constructed taking the above trees as left and right sub-trees
        // with the frequency of the root node as the sum of frequencies of left & right child.
        HuffmanTree.root=new node;
        HuffmanTree.root->info='\0';
        HuffmanTree.root->freq=l.root->freq + r.root->freq;
        HuffmanTree.root->Llink=l.root;
        HuffmanTree.root->Rlink=r.root;
        // then it is inserted in the array and the array is heapified again.
        // Deletion and insertion at an intermediate step is facilitated as in heap-sort.
        Heap.enqueue(HuffmanTree);
        cout<<"\nAfter enqueueing "<<l.root->freq<<"+"<<r.root->freq<<"="<<HuffmanTree.root->freq<<endl;
        Heap.print();
    }

    // The process continues till only one tree is left in the array of the heap.
    cout<<"\nThe process is completed and Huffman Tree is obtained\n";
    HuffmanTree=Heap.T[1];   // This tree is our Huffman Tree used for coding
    delete []Heap.T;
    cout<<"Traversal of Huffman Tree\n\n";
    HuffmanTree.print();
    cout<<"\nThe symbols with their codes are as follows\n";
    HuffmanTree.assign_code(0);   // Codes are assigned to the symbols
    cout<<"Enter the string to be encoded by Huffman Coding: ";
    char *str;
    str=new char[30];
    cin>>str;
    HuffmanTree.encode(str);
    cout<<"Enter the code to be decoded by Huffman Coding: ";
    char *cd;
    cd=new char[50];
    cin>>cd;
    int length;
    cout<<"Enter its code length: ";
    cin>>length;
    HuffmanTree.decode(cd,length);
    delete []cd;
    delete []str;
}

minHeap::minHeap()
{
    cout<<"Enter no. of symbols:";
    cin>>n;
    T=new BinaryTree [n+1];
    T[0].root=new node;
    T[0].root->freq=n;   // Number of elements in the min heap at any time is stored in the
                         // zeroth element of the heap
    for (int i=1; i<=n; i++)
    {
        T[i].root=new node;
        cout<<"Enter characters of string :- ";
        cin>>T[i].root->info;
        cout<<"and their frequency of occurrence in the string:- ";
        cin>>T[i].root->freq;
        T[i].root->code=NULL;
        T[i].root->Llink=NULL;
        T[i].root->Rlink=NULL;
        // Initially, all the nodes are leaf nodes and stored as an array of trees.
    }
    cout<<endl;

    int i=(int)(n / 2);   // Heapification will be started from the PARENT element of
                          // the last ('n th') element in the heap.
    cout<<"\nAs elements are entered\n";
    print();
    while (i>0)
    {
        heapify(i);
        i--;
    }

cout<<"\nAfter heapification \n";
print();
}

int min(node *a, node *b)
{
    if (a->freq <= b->freq) return a->freq;
    else return b->freq;
}

void swap(BinaryTree &a, BinaryTree &b)
{
    BinaryTree c=a;
    a=b;
    b=c;
}

void minHeap::heapify(int i)
{
    while (1)
    {
        if (2*i > T[0].root->freq)
            return;
        if (2*i+1 > T[0].root->freq)
        {
            if (T[2*i].root->freq <= T[i].root->freq)
                swap(T[2*i],T[i]);
            return;
        }
        int m=min(T[2*i].root,T[2*i+1].root);
        if (T[i].root->freq <= m)
            return;
        if (T[2*i].root->freq <= T[2*i+1].root->freq)
        { swap(T[2*i],T[i]); i=2*i; }
        else
        { swap(T[2*i+1],T[i]); i=2*i+1; }
    }
}

BinaryTree minHeap::dequeue()
{
    BinaryTree b=T[1];
    T[1]=T[T[0].root->freq];
    T[0].root->freq--;
    if (T[0].root->freq!=1)
        heapify(1);
    return b;
}

void minHeap::enqueue(BinaryTree b)
{
    T[0].root->freq++;
    T[T[0].root->freq]=b;
    int i=(int)(T[0].root->freq / 2);
    while (i>0)
    {
        heapify(i);
        i=(int)(i / 2);
    }
}

int isleaf(node *nd)
{
    if (nd->info=='\0') return 0;
    else return 1;
}

void BinaryTree::assign_code(int i)
{
    if (root==NULL)
        return;
    if (isleaf(root))
    {
        root->code[i]='\0';
        cout<<root->info<<"\t"<<root->code<<"\n";
        return;
    }
    BinaryTree l,r;
    l.root=root->Llink;
    r.root=root->Rlink;

    l.root->code=new char[i+2];   // i bits copied from the parent, one new bit, and room for '\0' at the leaf
    r.root->code=new char[i+2];
    for (int k=0; k<i; k++)
    {
        l.root->code[k]=root->code[k];
        r.root->code[k]=root->code[k];
    }
    l.root->code[i]='0';
    r.root->code[i]='1';
    i++;
    l.assign_code(i);
    r.assign_code(i);
}

void BinaryTree::encode(const char str[])
{
    if (root==NULL) return;
    int i=0;
    cout<<"Encoded code for the input string '"<<str<<"' is\n";
    while (1)
    {
        if (str[i]=='\0')
        {
            cout<<endl;
            return;
        }
        print_code(str[i]);
        i++;
    }
}

void BinaryTree::print_code(char c)
{
    int f=0;
    if (isleaf(root))
    {
        if (c==root->info)
        { f=1; cout<<root->code; }
        return;
    }
    BinaryTree l,r;
    l.root=root->Llink;
    if (f!=1) l.print_code(c);
    r.root=root->Rlink;
    if (f!=1) r.print_code(c);
}

int isequal(const char a[], const char b[], int length)
{
    int i=0;
    while (i<length)
    {
        if (b[i]!=a[i])
            return 0;
        i++;
    }
    if (a[i]!='\0')
        return 0;
    return 1;
}

void BinaryTree::decode(char cd[], int size)
{
    if (root==NULL)
        return;
    int i=0;
    int length=0;
    int f;
    char *s;
    cout<<"Decoded string for the input code '"<<cd<<"' is\n";
    while (i<size)
    {
        f=0;
        s=&cd[i];
        while (f==0)
        {
            length++;
            print_symbol(s,f,length);
        }
        i=i+length;
        length=0;
    }
    cout<<endl;
}

void BinaryTree::print_symbol(char cd[], int &f, int length)
{
    if (isleaf(root))
    {
        if (isequal(root->code, cd, length))
        {
            f=1;
            cout<<root->info;
        }
        return;
    }
    BinaryTree l,r;
    l.root=root->Llink;
    if (f!=1)
        l.print_symbol(cd,f,length);
    r.root=root->Rlink;
    if (f!=1)
        r.print_symbol(cd,f,length);
}

void BinaryTree::print()
{
    if (root==NULL)
        return;
    cout<<root->info<<"\t"<<root->freq<<"\n";
    if (isleaf(root))
        return;

    BinaryTree l,r;
    l.root=root->Llink;
    r.root=root->Rlink;
    l.print();
    r.print();
}

int power(int i, int j)
{
    int n=1;
    for (int k=1; k<=j; k++)
        n=n*i;
    return n;
}

int ispowerof2(int i)
{
    if (i==1)
        return 0;
    if (i==0)
        return 1;
    while (i>2)
    {
        if (i%2!=0)
            return 0;
        i=i/2;
    }
    return 1;
}

int fn(int l)
{
    if (l==1||l==0)
        return 0;
    return 2*fn(l-1)+1;
}

void minHeap::print()
{
    cout<<"The Heap showing the root frequencies of the Binary Trees are:\n";
    if (T[0].root->freq==0)
    {
        cout<<endl;
        return;
    }
    int level=1;
    while (T[0].root->freq >= power(2,level))   // 2^n - 1 is the max. no. of nodes
                                                // in a complete tree of n levels
        level++;
    if (level==1)
    {
        cout<<T[1].root->freq<<"\n";
        return;
    }
    for (int i=1; i<=T[0].root->freq; i++)
    {
        if (ispowerof2(i))
        { cout<<"\n"; level--; }
        for (int k=1; k<=fn(level); k++)
            cout<<" ";
        cout<<T[i].root->freq<<" ";
        for (int k=1; k<=fn(level); k++)
            cout<<" ";
    }
    cout<<endl;
}

int main()
{
    HuffmanCode c;
    system("pause");
    return 0;
}
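To try the program, one possibility (the file name and compiler invocation are assumptions, not part of the original listing) is to save the code as huffman.cpp, compile it with "g++ huffman.cpp -o huffman" and run the resulting executable. Note that system("pause") is Windows-specific; on other platforms it can be removed or replaced with, for example, cin.get().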

CHAPTER 5
RESULT & PERFORMANCE

CHAPTER 6

REFERENCES

• Sartaj Sahni: Data Structures, Algorithms, and Applications in C++
• http://www.itl.nist.gov/div897/sqg/dads/HTML/codingTree.html
• http://encyclopedia2.thefreedictionary.com/Huffman+tree
• http://en.wikipedia.org/wiki/Huffman_coding

Appendix A: Project Approval
The undersigned acknowledge that they have completed the project HUFFMAN CODES AND ITS IMPLEMENTATION and agree with the approach it presents.

Signature: Date: 16-05-2022

Name: KARPAGAM K

Appendix B: Key Terms

The following table provides definitions for terms relevant to this document.
Binary tree: A type of data structure in which information is stored as nodes, and each node can have at most two children, i.e. nodes that it points to.

Queue: An ordered list in which insert operations are performed at one end, called the REAR, and delete operations at the other end, called the FRONT. A queue is referred to as a First In First Out (FIFO) list.

