11 Hashtable-1
1
Hash Tables
• We’ll discuss the hash table ADT, which supports only a subset of the operations allowed by binary search trees.
• The implementation of hash tables is called hashing.
• Hashing is a technique used for performing insertions, deletions and finds in constant average time (i.e. O(1)).
– Worst-case time is O(n).
• This data structure, however, is not efficient for operations that require any ordering information among the elements, such as findMin, findMax and printing the entire table in sorted order.
2
General Idea
• The ideal hash table structure is merely an array of some fixed
size, containing the items.
• A stored item needs to have a data member, called key, that will
be used in computing the index value for the item.
– Key could be an integer, a string, etc
– e.g. a name or Id that is a part of a large employee structure
• The size of the array is TableSize.
• The items that are stored in the hash table are indexed by values
from 0 to TableSize – 1.
• Each key is mapped into some number in the range 0 to TableSize – 1; this number is called the hash value.
• The mapping is called a hash function.
3
Example Hash Table
[Figure: each item is a (key, salary) record — john 25000, phil 31250, dave 27500, mary 28200. The key is fed to the hash function, which produces a hash value indexing a 10-cell table (0–9): john lands in cell 3, phil in cell 4, dave in cell 6, mary in cell 7.]
4
Hash Function
• The hash function:
– must be simple to compute.
– must distribute the keys evenly among the cells.
• If we know which keys will occur in
advance we can write perfect hash
functions, but we don’t.
5
Hash function
Problems:
• Keys may not be numeric.
• Number of possible keys is much larger than the
space available in table.
• Different keys may map to the same location.
– The hash function is not one-to-one => collision.
– If there are too many collisions, the performance of the hash table will suffer dramatically.
6
Hash Functions
• If the input keys are integers then simply
Key mod TableSize is a general strategy.
– Unless key happens to have some undesirable
properties. (e.g. all keys end in 0 and we use
mod 10)
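A quick illustration of that pitfall, assuming integer keys and TableSize = 10: every key ending in 0 lands in cell 0, so the table degenerates into one long chain.

```cpp
#include <cassert>

// Key mod TableSize: simple, but degenerate when the keys share a
// common factor with the table size (e.g. keys ending in 0, mod 10).
int modHash( int key, int tableSize )
{
    return key % tableSize;
}
```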
• If the keys are strings, hash function needs
more care.
– First convert it into a numeric value.
7
Some methods
• Truncation:
– e.g. 123456789 map to a table of 1000 addresses by
picking the last 3 digits of the key: H(IDNum) = IDNum % 1000 =
hash value
• Folding:
– e.g. 123|456|789: add them and take mod.
• Key mod N:
– N is the size of the table, better if it is prime.
• Squaring:
– Square the key and then truncate
• Radix conversion:
– e.g. treat the digits 1 2 3 4 as a number in base 11; truncate if necessary.
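The listed methods can be sketched for a 9-digit key such as 123456789. The 1000-slot table and the "middle digits" choice for squaring are illustrative assumptions, not prescribed by the slides.

```cpp
#include <cassert>

// Truncation: keep only the last 3 digits of the key.
int truncationHash( long long key )
{
    return key % 1000;
}

// Folding: split 123|456|789 into pieces, add them, take mod.
int foldingHash( long long key )
{
    long long sum = key / 1000000 + ( key / 1000 ) % 1000 + key % 1000;
    return sum % 1000;
}

// Squaring: square the key, then truncate (here: take middle digits).
int squaringHash( long long key )
{
    long long sq = key * key;
    return ( sq / 1000 ) % 1000;
}

// Radix conversion: treat the digits as a base-11 number.
long long radixHash( int d1, int d2, int d3, int d4 )
{
    return ( ( (long long)d1 * 11 + d2 ) * 11 + d3 ) * 11 + d4;
}
```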
8
Hash Function 1
• Add up the ASCII values of all characters of the key.

int hash( const string & key, int tableSize )
{
    int hashVal = 0;

    for( int i = 0; i < key.length( ); i++ )
        hashVal += key[ i ];

    return hashVal % tableSize;
}
10
Hash Function 3
• Compute a polynomial of the key's characters, with multiplier 37:

hash(key) = Σ_{i=0}^{KeySize−1} Key[KeySize − i − 1] · 37^i

int hash( const string & key, int tableSize )
{
    int hashVal = 0;

    for( int i = 0; i < key.length( ); i++ )
        hashVal = 37 * hashVal + key[ i ];

    hashVal %= tableSize;
    if( hashVal < 0 )  /* in case overflow occurs */
        hashVal += tableSize;

    return hashVal;
};
11
Hash function for strings:
key = "ali": key[0] = 'a' (ASCII 97), key[1] = 'l' (108), key[2] = 'i' (105); KeySize = 3.
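Tracing the polynomial hash on this key with Horner's rule: hashVal builds up as 97, then 97·37 + 108 = 3697, then 3697·37 + 105 = 136894; with a hypothetical tableSize of 101 this gives 136894 mod 101 = 39. A self-contained sketch:

```cpp
#include <cassert>
#include <string>

// Polynomial string hash from the slides (Horner's rule with
// multiplier 37), reduced mod tableSize.
int stringHash( const std::string & key, int tableSize )
{
    int hashVal = 0;
    for( std::string::size_type i = 0; i < key.length( ); i++ )
        hashVal = 37 * hashVal + key[ i ];

    hashVal %= tableSize;
    if( hashVal < 0 )       // overflow may make hashVal negative
        hashVal += tableSize;
    return hashVal;
}
```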
12
Collision Resolution
• If, when an element is inserted, it hashes to the
same value as an already inserted element, then we
have a collision and need to resolve it.
• There are several methods for dealing with this:
– Separate chaining
– Open addressing
• Linear Probing
• Quadratic Probing
• Double Hashing
13
Separate Chaining
• The idea is to keep a list of all elements that hash
to the same value.
– The array elements are pointers to the first nodes of the
lists.
– A new item is inserted to the front of the list.
• Advantages:
– Better space utilization for large items.
– Simple collision handling: searching linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: deletion from the linked
list.
14
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
0: 0
1: 81 → 1
2: (empty)
3: (empty)
4: 64 → 4
5: 25
6: 36 → 16
7: (empty)
8: (empty)
9: 49 → 9
15
Operations
• Initialization: all entries are set to NULL
• Find:
– locate the cell using hash function.
– sequential search on the linked list in that cell.
• Insertion:
– Locate the cell using hash function.
– (If the item does not exist) insert it as the first item in
the list.
• Deletion:
– Locate the cell using hash function.
– Delete the item from the linked list.
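These three operations can be sketched with standard containers — a minimal int-only version, not the full HashTable class shown next, with std::list playing the role of the per-cell linked list (non-negative keys assumed):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Separate chaining: each cell holds a linked list of all the
// keys that hash to it; new items are inserted at the front.
class ChainedHashTable
{
  public:
    explicit ChainedHashTable( int size = 101 ) : lists( size ) { }

    bool contains( int x ) const
    {
        const std::list<int> & l = lists[ hash( x ) ];
        return std::find( l.begin( ), l.end( ), x ) != l.end( );
    }

    void insert( int x )
    {
        if( !contains( x ) )                 // only insert if absent
            lists[ hash( x ) ].push_front( x );
    }

    void remove( int x )
    {
        lists[ hash( x ) ].remove( x );      // delete from the cell's list
    }

  private:
    std::vector< std::list<int> > lists;
    int hash( int x ) const { return x % (int)lists.size( ); }
};
```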
16
Hash Table Class for separate chaining
template <class HashedObj>
class HashTable
{
  public:
    HashTable( const HashedObj & notFound, int size = 101 );
    HashTable( const HashTable & rhs )
      : ITEM_NOT_FOUND( rhs.ITEM_NOT_FOUND ),
        theLists( rhs.theLists ) { }

    const HashedObj & find( const HashedObj & x ) const;

    void makeEmpty( );
    void insert( const HashedObj & x );
    void remove( const HashedObj & x );

  private:
    const HashedObj ITEM_NOT_FOUND;
    vector< List<HashedObj> > theLists;  // the array of linked lists
};

Insert routine

template <class HashedObj>
void HashTable<HashedObj>::insert( const HashedObj & x )
{
    List<HashedObj> & whichList = theLists[ hash( x, theLists.size( ) ) ];
    HashedObj * p = whichList.find( x );
    if( p == NULL )
        whichList.insert( x, whichList.zeroth( ) );
}
18
Remove routine
/**
* Remove item x from the hash table.
*/
template <class HashedObj>
void HashTable<HashedObj>::remove( const HashedObj & x )
{
theLists[hash(x, theLists.size())].remove( x );
}
19
Find routine
/**
* Find item x in the hash table.
* Return the matching item or ITEM_NOT_FOUND if not found
*/
template <class HashedObj>
const HashedObj & HashTable<HashedObj>::find( const HashedObj & x ) const
{
    HashedObj * itr = theLists[ hash( x, theLists.size( ) ) ].find( x );
    if( itr == NULL )
        return ITEM_NOT_FOUND;
    else
        return *itr;
}
20
Analysis of Separate Chaining
• Collisions are very likely.
– How likely and what is the average length of
lists?
• Load factor (λ) definition:
– Ratio of the number of elements (N) in a hash table to the TableSize,
– i.e. λ = N/TableSize.
– The average length of a list is also λ.
– For chaining, λ is not bounded by 1; it can be > 1.
21
Cost of searching
• Cost = Constant time to evaluate the hash function
+ time to traverse the list.
• Unsuccessful search:
– We have to traverse the entire list, so we need to examine λ nodes on average.
• Successful search:
– The list contains the one node that stores the searched item + 0 or more other nodes.
– Expected # of other nodes = x = (N – 1)/M, which is essentially λ since M is presumed large.
– On average, we need to check half of the other nodes while searching for a certain element.
– Thus average search cost = 1 + λ/2.
22
Summary
• The analysis shows us that the table size is
not really important, but the load factor is.
• TableSize should be as large as the number
of expected elements in the hash table.
– To keep load factor around 1.
• TableSize should be prime for even
distribution of keys to hash table cells.
23
Hashing: Open Addressing
24
Collision Resolution with
Open Addressing
• Separate chaining has the disadvantage of
using linked lists.
– Requires the implementation of a second data
structure.
• In an open addressing hashing system, all
the data go inside the table.
– Thus, a bigger table is needed.
• Generally the load factor should be below 0.5.
– If a collision occurs, alternative cells are tried
until an empty cell is found.
25
Open Addressing
• More formally:
– Cells h0(x), h1(x), h2(x), …are tried in succession where
hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– The function f is the collision resolution strategy.
• There are three common collision resolution
strategies:
– Linear Probing
– Quadratic probing
– Double hashing
26
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. f is a linear function of i, typically f(i)= i.
• Example:
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
• f(i) = i;
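The example insertions can be traced with a minimal sketch (int keys only; -1 is an assumed empty marker, and deletion is omitted since it needs the lazy scheme discussed later). Inserting 89, 18, 49, 58, 9 leaves 49 in cell 0, 58 in cell 1 and 9 in cell 2.

```cpp
#include <cassert>
#include <vector>

// Open addressing with linear probing, f(i) = i: on a collision,
// scan forward sequentially (with wraparound) for an empty cell.
class LinearProbingTable
{
  public:
    explicit LinearProbingTable( int size = 10 ) : cells( size, -1 ) { }

    void insert( int x )
    {
        int i = x % (int)cells.size( );
        while( cells[ i ] != -1 )               // assumes the table is not full
            i = ( i + 1 ) % cells.size( );
        cells[ i ] = x;
    }

    int positionOf( int x ) const               // -1 if absent
    {
        int i = x % (int)cells.size( );
        while( cells[ i ] != -1 )               // same probe sequence as insert
        {
            if( cells[ i ] == x )
                return i;
            i = ( i + 1 ) % cells.size( );
        }
        return -1;                              // stopped at an empty cell
    }

  private:
    std::vector<int> cells;
};
```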
27
Figure 20.4
Linear probing
hash table after
each insertion
28
Find and Delete
• The find algorithm follows the same probe
sequence as the insert algorithm.
– A find for 58 would involve 4 probes.
– A find for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking items as
deleted)
– Standard deletion (i.e. physically removing the
item) cannot be performed.
• When an item is deleted, the location must be marked in a special way, so that the
searches know that the spot used to have something in it.
31
Analysis of Find
• An unsuccessful search costs the same as
insertion.
• The cost of a successful search of X is equal to the
cost of inserting X at the time X was inserted.
• For λ = 0.5 the average cost of insertion is 2.5.
The average cost of finding the newly inserted
item will be 2.5 no matter how many insertions
follow.
• Thus the average cost of a successful search is an
average of the insertion costs over all smaller load
factors.
32
Average cost of find
• The average number of cells that are examined in an unsuccessful search using linear probing is roughly (1 + 1/(1 – λ)²) / 2.
• The average number of cells that are examined in a successful search is approximately (1 + 1/(1 – λ)) / 2.
– Derived by averaging the insertion cost over all load factors from 0 to λ:

(1/λ) ∫₀^λ (1/2) (1 + 1/(1 – x)²) dx
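Plugging λ = 0.5 into these formulas gives 2.5 probes for an unsuccessful search (matching the insertion cost quoted on the previous slide) and 1.5 for a successful one; a quick check:

```cpp
#include <cassert>
#include <cmath>

// The two linear-probing cost formulas as functions of the load factor.
double unsuccessfulProbes( double lambda )
{
    return ( 1.0 + 1.0 / ( ( 1.0 - lambda ) * ( 1.0 - lambda ) ) ) / 2.0;
}

double successfulProbes( double lambda )
{
    return ( 1.0 + 1.0 / ( 1.0 - lambda ) ) / 2.0;
}
```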
33
Linear Probing – Analysis – Example
• What is the average number of probes for a successful search and an unsuccessful search for this hash table?
– Hash function: h(x) = x mod 11
– Table contents (cell: key): 0: 9, 1: empty, 2: 2, 3: 13, 4: 25, 5: 24, 6: empty, 7: empty, 8: 30, 9: 20, 10: 10
• Successful search (cells probed per key):
– 20: 9 -- 30: 8 -- 2: 2 -- 13: 2, 3 -- 25: 3, 4
– 24: 2, 3, 4, 5 -- 10: 10 -- 9: 9, 10, 0
– Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
• Unsuccessful search (cells probed per starting cell):
– We assume that the hash function uniformly distributes the keys.
– 0: 0, 1 -- 1: 1 -- 2: 2, 3, 4, 5, 6 -- 3: 3, 4, 5, 6
– 4: 4, 5, 6 -- 5: 5, 6 -- 6: 6 -- 7: 7 -- 8: 8, 9, 10, 0, 1
– 9: 9, 10, 0, 1 -- 10: 10, 0, 1
– Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
34
Quadratic Probing
• Quadratic Probing eliminates primary clustering
problem of linear probing.
• Collision function is quadratic.
– The popular choice is f(i) = i2.
• If the hash function evaluates to h and a search in cell h is inconclusive, we try cells h + 1², h + 2², …, h + i².
– i.e. it examines cells 1, 4, 9, and so on away from the original probe.
• Remember that subsequent probe points are a
quadratic number of positions from the original
probe point.
35
Figure 20.6
A quadratic probing
hash table after each
insertion (note that
the table size was
poorly chosen
because it is not a
prime number).
36
Quadratic Probing
• Problem:
– We may not be sure that we will probe all locations in
the table (i.e. there is no guarantee to find an empty cell
if table is more than half full.)
– If the hash table size is not prime, this problem becomes much more severe.
• However, there is a theorem stating that:
– If the table size is prime and load factor is not larger
than 0.5, all probes will be to different locations and an
item can always be inserted.
37
Theorem
• If quadratic probing is used, and the table
size is prime, then a new element can
always be inserted if the table is at least half
empty.
38
Some considerations
• How efficient is calculating the quadratic
probes?
– Linear probing is easily implemented.
Quadratic probing appears to require * and %
operations.
– However, this cost can be avoided with the following trick (using i² = (i – 1)² + 2i – 1):
• H_i = H_{i–1} + 2i – 1 (mod M)
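A sketch of the trick in code, using a minimal int-only table with -1 as an assumed empty marker: each probe adds 2i – 1 to the previous position, so no multiplication or squaring is needed.

```cpp
#include <cassert>
#include <vector>

// Quadratic probing insert via the incremental trick
// H_i = H_{i-1} + 2i - 1 (mod M), which visits h, h+1, h+4, h+9, ...
class QuadraticProbingTable
{
  public:
    explicit QuadraticProbingTable( int size = 11 ) : cells( size, -1 ) { }

    // Assumes prime table size and load factor <= 0.5, so by the
    // theorem an empty cell is guaranteed to be found.
    int insert( int x )                        // returns the cell used
    {
        int M = (int)cells.size( );
        int pos = x % M;
        for( int i = 1; cells[ pos ] != -1; i++ )
            pos = ( pos + 2 * i - 1 ) % M;     // H_i = H_{i-1} + 2i - 1
        cells[ pos ] = x;
        return pos;
    }

  private:
    std::vector<int> cells;
};
```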
39
Some Considerations
• What happens if the load factor gets too high?
– Dynamically expand the table as soon as the load factor reaches 0.5; this is called rehashing.
– Always double the size to a prime number.
– When expanding the hash table, reinsert every element into the new table using the new hash function.
40
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically
analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known
as secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
41
Double Hashing
• A second hash function is used to drive the
collision resolution.
– f(i) = i * hash2(x)
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
• The function hash2(x) must never evaluate to zero.
– e.g. with hash2(x) = x mod 9, trying to insert 99 into the previous example fails, since 99 mod 9 = 0 gives a probe distance of zero.
• A function such as hash2(x) = R – (x mod R), with R a prime smaller than TableSize, will work well.
– e.g. try R = 7 for the previous example: hash2(x) = 7 – (x mod 7).
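A sketch reusing the running example (table size 10, keys 89, 18, 49, 58, 9, hash2(x) = 7 – (x mod 7), with -1 as an assumed empty marker): the colliding keys 49, 58 and 9 end up in cells 6, 3 and 4 instead of clustering.

```cpp
#include <cassert>
#include <vector>

// Double hashing: the probe distance is i * hash2(x), so keys that
// collide at the same cell still follow different probe sequences.
class DoubleHashingTable
{
  public:
    explicit DoubleHashingTable( int size = 10 ) : cells( size, -1 ) { }

    int insert( int x )                        // returns the cell used
    {
        int M = (int)cells.size( );
        int step = 7 - x % 7;                  // hash2(x); never zero
        int pos = x % M;
        while( cells[ pos ] != -1 )            // assumes a free cell is reachable
            pos = ( pos + step ) % M;
        cells[ pos ] = x;
        return pos;
    }

  private:
    std::vector<int> cells;
};
```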
42
The relative efficiency of
four collision-resolution methods
43
Hashing Applications
• Compilers use hash tables to implement the
symbol table (a data structure to keep track
of declared variables).
• Game programs use hash tables to keep track of positions they have encountered (transposition tables).
• Online spelling checkers.
44
Largest Subset w/ Consecutive Numbers
Input: 1,3,8,14,4,10,2,11 Output: 1,2,3,4
45
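One hash-based way to solve this (an assumed approach; the slides only state the problem): store the numbers in an unordered_set, then count each run starting only from its smallest element, so every run is scanned once and the average time is O(n).

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Length of the largest subset of consecutive numbers.
int longestConsecutiveRun( const std::vector<int> & nums )
{
    std::unordered_set<int> s( nums.begin( ), nums.end( ) );
    int best = 0;
    for( int x : s )
        if( s.count( x - 1 ) == 0 )            // x is the start of a run
        {
            int len = 1;
            while( s.count( x + len ) )        // extend the run upward
                ++len;
            if( len > best )
                best = len;
        }
    return best;
}
```

For the input 1, 3, 8, 14, 4, 10, 2, 11, the run 1, 2, 3, 4 has length 4.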