DS 8
Shyan-Ming Yuan
CS Department, NYCU
The Basic Idea of Hashing
A function $h: X \to Y$ maps an element in an input set $X$ to an element in
an output set $Y$.
Assume that all keys in a list of $n$ records are from the set $X$.
If $|X| \gg |Y|$ and $|Y| \ge n$, then the function $h$ can be used to map keys of records
into elements in $Y$.
The function $h$ is then called a hash function.
Without loss of generality, let $0, 1, \dots, b-1$ be the elements in $Y$.
If no two keys are mapped into the same element in $Y$, then an
array $ht[0:b-1]$ can be used to store the list in such a way that
$ht[h(k)]$ holds the record with key $k$ (storing all $n$ records takes O(n) time).
To retrieve a record by its key $K$ (O(1) time):
If $ht[h(K)]$ is not empty, output $ht[h(K)]$;
else output “no record has key=K”
Static Hashing
Records are stored in a fixed-size table called a hash table.
The address of a record with key $k$, $h(k)$, is obtained by computing an
arithmetic function $h()$.
The hash table is usually represented by an array $ht[0:b-1]$ of $b$ buckets.
Each bucket can hold $s$ records.
The hash function $h$ maps the set of possible keys onto the integers $0, 1, \dots, b-1$.
The key density of a hash table is the ratio $n/T$,
where $n$ is the number of keys in the table and $T$ is the total number
of possible keys.
The loading factor of a hash table is $\alpha = n/(s \cdot b)$.
In general, $n \ll T$ and $b \ll T$.
The home bucket of a record with key $k$ is the bucket $ht[h(k)]$.
Collision and Overflow
Two keys, $k_1$ and $k_2$, are said to be synonyms with respect to
$h(x)$ if $h(k_1) = h(k_2)$.
In other words, the records with keys $k_1$ and $k_2$ have the same home bucket.
A collision occurs when inserting a new record into a hash
table while the home bucket is not empty.
An overflow occurs when inserting a new record into a
hash table while the home bucket is full.
When each bucket has exactly 1 slot (i.e. s==1), collision
and overflow occur simultaneously.
A Hash Table with b=26 and s=2
Assume there are n=10 distinct keys and that each key begins with a letter.
The loading factor is $\alpha = n/(s \cdot b) = 10/52 \approx 0.19$.
Let $h(k)$ = the 1st character of k, where A=0, B=1, …, Z=25.
The home buckets for the 10 keys GA, D, A, G, L, A2, A1, A3, A4, and E
are 6, 3, 0, 6, 11, 0, 0, 0, 0, 4.
The keys A, A2, A1, A3, and A4 are synonyms.
The keys GA and G are also synonyms.
After inserting GA, D, A, G, L (in bucket 11), and A2:
Bucket  Slot 1  Slot 2
0       A       A2
1
2
3       D
4
5
6       GA      G
. . .
25
On inserting A1, an overflow occurs because bucket 0 is full.
Where to insert A1 now?
Hash Functions
Basic desired properties of a hash function are:
Easy to compute
The number of collisions is minimized
A good hash function
should depend on every character of a key.
A hash function is called a uniform hash function if
For every randomly selected key k from the key space, the
probability that h(k) = i is 1/b for every bucket i.
In other words, a random key has an equal chance of
hashing into any of the buckets.
Some generally used hash functions are:
Division, Mid-Square, Folding, and Digit Analysis.
The Division Hash Function
When keys are nonnegative integers,
the home bucket can be obtained by using the modulo (%)
operator.
$h(k) = k \% D$, where D=b.
The bucket addresses are in the range of 0 through D-1.
If D is a power of 2, then $h(k)$ depends only on the least
significant bits of $k$. It does not depend on all bits of a key.
In general, if D has small prime factors such as 2, 3, 5, 7,
and so on, the distribution of home buckets is biased.
The degree of bias decreases as the smallest prime factor of D
increases. In practice, the smallest prime factor of D should
be at least 23.
When D is a prime number, $h$ can be a very uniform hash function.
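The effect of the divisor can be checked with a small sketch. The function name `divisionHash` and the sample keys are illustrative assumptions, not part of the slides.

```cpp
#include <cassert>

// Division hash: home bucket of a nonnegative integer key k
// in a table with D buckets.
unsigned divisionHash(unsigned k, unsigned D) {
    assert(D > 0);
    return k % D;
}
```

With D=16 the keys 0x1234 and 0xABC4 share a home bucket because only their 4 least significant bits are used, while the prime divisor D=23 separates them.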
The Mid-Square Hash Function
The home bucket $h(k)$ is obtained by
1st squaring the key and then using an appropriate number of
bits from the middle of the square as the bucket address.
$h(k)$ = the middle $r$ bits of $k^2$, where $b = 2^r$.
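A minimal sketch of the mid-square idea follows. The choice of a 32-bit key, a 64-bit square, and a window centered in that square are assumptions made for illustration; the slides do not fix which middle bits are used.

```cpp
#include <cassert>
#include <cstdint>

// Mid-square hash: square the key and take r bits from the middle of the
// 64-bit square. Assumes a 32-bit key and 1 <= r <= 32 (b = 2^r buckets).
std::uint32_t midSquareHash(std::uint32_t k, int r) {
    assert(r >= 1 && r <= 32);
    std::uint64_t sq = static_cast<std::uint64_t>(k) * k;
    int shift = (64 - r) / 2;                 // bits below the middle window
    std::uint64_t mask = (1ull << r) - 1;     // keep exactly r bits
    return static_cast<std::uint32_t>((sq >> shift) & mask);
}
```

Because the window sits in the middle of the square, both the high and the low bits of the key influence the bucket address.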
The Folding Hash Function
A key $k$ is partitioned into several parts, all but possibly the
last being of the same length. Then these parts are added
together to obtain the hash address for $k$.
There are two variations of folding hash functions.
The shift folding hash is the summation of these parts.
The folding at the boundaries hash reverses the even-numbered parts at
the partition boundaries and then sums all parts.
For example, let k=12320324111220 and every 3 digits
form a partition, giving the parts 123, 203, 241, 112, and 20.
Shift folding: 123 + 203 + 241 + 112 + 20 = 699.
Folding at the boundaries: 123 + 302 + 241 + 211 + 20 = 897.
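Both folding variants can be sketched in a few lines. The function name `foldingHash` and the representation of the key as a decimal string are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Sum the parts of a decimal key string, partLen digits per part.
// If foldAtBoundaries is true, every even-numbered part (2nd, 4th, ...)
// is reversed before it is added.
long foldingHash(const std::string& digits, std::size_t partLen,
                 bool foldAtBoundaries) {
    assert(partLen > 0);
    long sum = 0;
    for (std::size_t pos = 0, part = 1; pos < digits.size();
         pos += partLen, ++part) {
        std::string p = digits.substr(pos, partLen);  // last part may be shorter
        if (foldAtBoundaries && part % 2 == 0) std::reverse(p.begin(), p.end());
        sum += std::stol(p);
    }
    return sum;
}
```

For k=12320324111220 with 3-digit parts, shift folding gives 699 and folding at the boundaries gives 897. In a real table the sum would still have to be reduced to the range of bucket addresses.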
The Digit Analysis Hash Function
It is very useful when all keys are known in advance.
The Digit Analysis Actions:
Each key is interpreted as a number using some radix r.
The distribution of each digit of these radix-r numbers is
examined.
Delete the digit with the most skewed distribution, one at a time,
until the number of remaining digits is small enough to fit in the
hash table.
For example, the keys in radix-10 form are:
1234, 1567, 2318, and 1647
Digit distributions are:
digit 1: {1,1,2,1}, digit 2: {2,5,3,6}, digit 3: {3,6,1,4}, digit 4: {4,7,8,7}
If the hash table has buckets 0~99, then digit 1 and digit 4 are
deleted, leaving digits 2 and 3 as the bucket address.
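The deletion loop can be sketched as follows. The measure of skew used here (the highest count of any single digit value at a position) and the name `keptPositions` are assumptions; the slides do not define how skew is measured. Keys are assumed to be decimal strings of equal length.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the digit positions (0-based from the left) that survive after
// repeatedly deleting the most skewed position until `keep` remain.
std::vector<std::size_t> keptPositions(const std::vector<std::string>& keys,
                                       std::size_t keep) {
    assert(!keys.empty() && keep <= keys[0].size());
    std::vector<std::size_t> pos;  // surviving positions, left to right
    std::vector<int> skew;         // skew[i] belongs to pos[i]
    for (std::size_t p = 0; p < keys[0].size(); ++p) {
        std::array<int, 10> count{};
        for (const std::string& k : keys) ++count[k[p] - '0'];
        pos.push_back(p);
        skew.push_back(*std::max_element(count.begin(), count.end()));
    }
    while (pos.size() > keep) {
        // Ties are broken in favor of deleting the leftmost position.
        std::size_t worst = static_cast<std::size_t>(
            std::max_element(skew.begin(), skew.end()) - skew.begin());
        pos.erase(pos.begin() + worst);
        skew.erase(skew.begin() + worst);
    }
    return pos;
}
```

For the keys 1234, 1567, 2318, and 1647 with two digits to keep, the sketch keeps positions 1 and 2 (the 2nd and 3rd digits), so the bucket addresses are 23, 56, 31, and 64.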
Overflow Handling
The two common methods for handling overflows are:
Open addressing generates a new bucket address for an
overflowed record.
Linear probing uses the next non-full bucket as the new address for the
overflowed record.
Search the hash table buckets in the order $ht[(h(k)+i) \% b]$, $0 \le i \le b-1$, until the 1st unfilled
bucket $ht[j]$ is found, and insert the record in $ht[j]$.
If no such bucket exists, then the size of the hash table needs to
increase.
Quadratic probing
Random probing
Rehashing
Chaining uses a linked list for each overflowed bucket.
The Linear Probing
template <class K, class E>
pair<K, E>* LinearProbing<K, E>::Get(const K& k) {
    // Search the linear probing hash table ht (with s=1) for key k.
    // If a pair with this key k is found, return a pointer to this pair;
    // otherwise, return 0.
    int i = h(k);  // home bucket for k
    int j = i;
    while (ht[j] && ht[j]->first != k) {
        j = (j + 1) % b;       // treat the table as circular
        if (j == i) return 0;  // back at the home bucket: table is full, k is absent
    }
    // The loop stopped at an empty bucket (k is absent) or at the pair with key k.
    if (ht[j] && ht[j]->first == k) return ht[j];
    return 0;
}
When using linear probing with a uniform hash function, the expected
average number of key comparisons to look up a key is approximately
(2-α)/(2-2α), where α is the loading factor.
Hash Table with Linear Probing
Assume that b=26 and s=1.
The home buckets for the 12 keys GA,
D, A, G, L, A2, A1, A3, A4, Z, ZA, and E
are 6, 3, 0, 6, 11, 0, 0, 0, 0, 25, 25, 4.
Bucket  Key
0       A
1       A2
2       A1
3       D
4       A3
5       A4
6       GA
7       G
8       ZA
9       E
10
11      L
…
25      Z
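The table above can be reproduced with a short insertion sketch. The names `homeBucket` and `insertLinear`, and the use of an empty string to mark an empty bucket, are assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Home bucket: first letter of the key, A=0, ..., Z=25.
// Assumption: every key starts with an uppercase letter, as in the example.
int homeBucket(const std::string& k) {
    assert(!k.empty() && k[0] >= 'A' && k[0] <= 'Z');
    return k[0] - 'A';
}

// Insert k into ht (s=1, empty string = empty bucket) using linear probing.
// Returns the bucket that was used, or -1 if the table is full.
int insertLinear(std::vector<std::string>& ht, const std::string& k) {
    int b = static_cast<int>(ht.size());
    int i = homeBucket(k) % b;
    for (int n = 0; n < b; ++n) {
        int j = (i + n) % b;  // wrap around at the end of the table
        if (ht[j].empty()) { ht[j] = k; return j; }
    }
    return -1;
}
```

Inserting the 12 keys in the given order places ZA in bucket 8 (it wraps around from bucket 25 and skips buckets 0 through 7) and E in bucket 9, which shows how one cluster pushes unrelated keys far from their home buckets.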
The Quadratic Probing
One of the problems of linear probing is that it tends to
create clusters of keys.
When more keys are entered, these clusters tend to merge
together, resulting in bigger clusters.
It becomes difficult to find an unused bucket.
A quadratic probing scheme reduces the growth of
clusters.
A quadratic function of $i$ is used as the increment when
searching through buckets.
A commonly used quadratic probing scheme examines buckets
$h(k)$, $(h(k) + i^2) \% b$, and $(h(k) - i^2) \% b$ for $1 \le i \le (b-1)/2$, where $b$ is a prime number of the form $4j+3$ for some integer $j$.
It can be shown that the search then examines all buckets.
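The coverage claim can be checked numerically. This sketch assumes the probe sequence home, (home + i^2) % b, (home - i^2) % b for 1 <= i <= (b-1)/2; the function name `bucketsCovered` is hypothetical.

```cpp
#include <cassert>
#include <set>

// Number of distinct buckets visited by the probe sequence
// home, (home + i^2) % b, (home - i^2) % b for 1 <= i <= (b-1)/2.
int bucketsCovered(int b, int home) {
    assert(b > 0 && home >= 0 && home < b);
    std::set<int> seen{home};
    for (int i = 1; i <= (b - 1) / 2; ++i) {
        int sq = (i * i) % b;
        seen.insert((home + sq) % b);
        seen.insert((home - sq + b) % b);  // add b so the result is never negative
    }
    return static_cast<int>(seen.size());
}
```

For b = 7, 11, and 19 (all primes of the form 4j+3) every bucket is reached. For b = 13, a prime of the form 4j+1, the same sequence reaches only 7 of the 13 buckets, which is why the form of b matters.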
The Random Probing
In random probing, when an overflow occurs, the
algorithm uses a random number generator to choose a
new index to probe, until an empty slot is found.
The random number generator must be able to cover all
buckets.
Pro:
It can help to evenly distribute keys across the hash table,
reducing the likelihood of subsequent overflows and
improving the overall performance of the hash table.
Con:
It may be less efficient than other probing methods, because
it requires the use of a random number generator and may
result in more probes to find an empty slot.
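One way to make a random probe sequence cover all buckets is to shuffle the offsets 1 to b-1. The sketch below assumes that the generator is seeded from the key, so that a later search replays the same sequence; this seeding rule and the name `randomProbeOrder` are assumptions, not something the slides specify.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Probe order for a key whose home bucket is `home` in a table of b buckets.
// Assumption: `seed` is derived from the key, so searching for the key later
// replays exactly the same sequence.
std::vector<int> randomProbeOrder(int b, int home, unsigned seed) {
    assert(b > 0 && home >= 0 && home < b);
    std::vector<int> offsets(b - 1);
    std::iota(offsets.begin(), offsets.end(), 1);  // 1, 2, ..., b-1
    std::mt19937 gen(seed);
    std::shuffle(offsets.begin(), offsets.end(), gen);
    std::vector<int> order{home};                  // always try the home bucket first
    for (int off : offsets) order.push_back((home + off) % b);
    return order;
}
```

Every bucket appears exactly once in the returned order, so an empty slot is found whenever one exists.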
The Rehashing and Double Hashing
Another method to control the growth of clusters is to use a
series of hash functions $h_1, h_2, \dots, h_m$. This is called rehashing.
Buckets $h_i(k)$, $1 \le i \le m$, are examined in that order.
Double hashing uses 2 hash functions $h_1$ and $h_2$.
The search sequence is $(h_1(k) + i \cdot h_2(k)) \% b$ for $i = 0, 1, \dots, b-1$.
If $h_2(k)$ and $b$ are relatively prime, then it can be shown that the
search sequence covers all buckets.
For example:
$h_2(k)$ takes values in $[1, b-1]$ and $b$ is a prime number.
Since $h_2(k)$ is in $[1, b-1]$ and $b$ is a prime, they are relatively prime.
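The coverage property of double hashing can be tried out with a concrete pair of functions. The choices h1(k) = k % b and h2(k) = 1 + k % (b-1) are assumed examples that keep h2(k) in [1, b-1]; they are not taken from the slides.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Probe sequence (h1(k) + i * h2(k)) % b for i = 0 .. b-1.
// Assumed example functions, not taken from the text:
//   h1(k) = k % b and h2(k) = 1 + k % (b - 1), so h2(k) is in [1, b-1].
std::vector<int> doubleHashProbes(int k, int b) {
    assert(k >= 0 && b >= 2);
    int h1 = k % b;
    int h2 = 1 + k % (b - 1);
    std::vector<int> probes;
    for (int i = 0; i < b; ++i) probes.push_back((h1 + i * h2) % b);
    return probes;
}

// Number of distinct buckets in a probe sequence.
int distinctBuckets(const std::vector<int>& probes) {
    return static_cast<int>(std::set<int>(probes.begin(), probes.end()).size());
}
```

With the prime b=11 and k=25 the sequence 3, 9, 4, 10, 5, 0, 6, 1, 7, 2, 8 visits every bucket. With b=12 and k=3 the increment is 4, which shares a factor with 12, and only 3 buckets are ever visited.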
Chaining for Overflow Handling
Linear probing and other open addressing methods can
perform poorly because the search for a key involves
comparisons with keys that have different hash values.
Unnecessary comparisons can be avoided if all the
synonyms are chained in the same linked list, with one list
per bucket.
In general, an array ht[] of type ChainNode<pair<K, E>>* is used such
that ht[i] points to the first node of the chain for bucket i.
template <class K, class E>
pair<K, E>* Chaining<K, E>::Get(const K& k) {
    // Search the chained hash table ht for k.
    // If a pair with this key is found, return a pointer to this pair; otherwise, return 0.
    int i = h(k);  // home bucket for k
    for (ChainNode<pair<K, E>>* current = ht[i]; current; current = current->link)
        if (current->data.first == k) return &current->data;
    return 0;
}
The Hash Chain Example
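The home buckets for the 12 keys GA,
D, A, G, L, A2, A1, A3, A4, Z, ZA, and E
are 6, 3, 0, 6, 11, 0, 0, 0, 0, 25, 25, 4.
Each new key is inserted at the front of its chain; 0 marks the end of a chain.
ht
0    A4 -> A3 -> A1 -> A2 -> A -> 0
1    0
2    0
3    D -> 0
4    E -> 0
5    0
6    G -> GA -> 0
7    0
8    0
9    0
10   0
11   L -> 0
…
25   ZA -> Z -> 0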
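The chains in this example can be rebuilt with a small sketch. It uses `std::forward_list` instead of a hand-written ChainNode, and the names `ChainTable`, `insertChained`, and `getChained` are assumptions for illustration.

```cpp
#include <cassert>
#include <forward_list>
#include <string>
#include <vector>

using ChainTable = std::vector<std::forward_list<std::string>>;

// Insert k at the front of the chain of its home bucket
// (first letter, A=0 ... Z=25).
// Assumes ht has 26 buckets and k is not already in the table.
void insertChained(ChainTable& ht, const std::string& k) {
    assert(ht.size() == 26 && !k.empty() && k[0] >= 'A' && k[0] <= 'Z');
    ht[k[0] - 'A'].push_front(k);
}

// Search the chain of k's home bucket; only synonyms of k are compared.
const std::string* getChained(const ChainTable& ht, const std::string& k) {
    assert(ht.size() == 26 && !k.empty() && k[0] >= 'A' && k[0] <= 'Z');
    for (const std::string& s : ht[k[0] - 'A'])
        if (s == k) return &s;
    return nullptr;
}
```

After inserting the 12 keys in order, bucket 0 holds A4, A3, A1, A2, A from front to back, and a search for E compares against a single key instead of walking through a cluster.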
Performance of Overflow Techniques
In general, hashing gives good average-case performance
compared with conventional techniques such as balanced trees.
However, the worst-case performance of hashing can be
O(n).
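Theorem 8.1: let $\alpha$ be the loading factor of a hash table
using a uniform hashing function $h$. Then
For linear probing: $U_n \approx \frac{1}{2}\left[1 + \frac{1}{(1-\alpha)^2}\right]$, $S_n \approx \frac{1}{2}\left[1 + \frac{1}{1-\alpha}\right]$
For rehashing, quadratic and random probing: $U_n \approx \frac{1}{1-\alpha}$, $S_n \approx -\frac{1}{\alpha}\ln(1-\alpha)$
For chaining: $U_n \approx \alpha$, $S_n \approx 1 + \frac{\alpha}{2}$
where $U_n$ is the expected number of key comparisons for searching a
key not in the hash table and $S_n$ is the average number of comparisons for keys in
the table.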
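To get a feel for these estimates, they can be evaluated for a few loading factors. The sketch assumes the approximations usually quoted for this theorem (written out in the comments); the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// U: expected comparisons for a key not in the table.
// S: average comparisons for a key in the table.
struct Comparisons { double U; double S; };

// Linear probing: U = (1/2)[1 + 1/(1-a)^2], S = (1/2)[1 + 1/(1-a)].
Comparisons linearProbing(double a) {
    assert(a >= 0 && a < 1);
    return {0.5 * (1 + 1 / ((1 - a) * (1 - a))), 0.5 * (1 + 1 / (1 - a))};
}

// Rehashing, quadratic and random probing: U = 1/(1-a), S = -(1/a) ln(1-a).
Comparisons randomProbing(double a) {
    assert(a > 0 && a < 1);
    return {1 / (1 - a), -std::log(1 - a) / a};
}

// Chaining: U = a, S = 1 + a/2.
Comparisons chaining(double a) {
    assert(a >= 0);
    return {a, 1 + a / 2};
}
```

At α=0.5, linear probing needs about 2.5 comparisons for an unsuccessful search and 1.5 for a successful one, while chaining needs about 0.5 and 1.25. The successful-search value for linear probing agrees with (2-α)/(2-2α).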
Dynamic Hashing
If all slots of a hash table are filled on inserting a new record,
simply increasing the table size requires rehashing all existing
records.
This causes a lot of data movement, and the hash table may become
inaccessible for a long period of time.
Dynamic hashing (also known as extendible hashing) ensures
that growing the table adds only O(n) total time to a
sequence of n hash table operations.
Select a family of refinable hash functions $h_0, h_1, h_2, \dots$,
where $h_{i+1}$ refines $h_i$ and
the output domain of $h_i$ is a subset of the output domain of $h_{i+1}$.
A function $h_{i+1}$ refines $h_i$ if $h_{i+1}(x) = h_{i+1}(y)$ implies $h_i(x) = h_i(y)$.
A Refinable hash function family
Any hash function mapping keys to positive integers can
be used as the base hash function.
If $b_0$ is selected as the initial table size (number of buckets), then
let $b_i = 2^i b_0$ and $h_i(k) = h(k) \% b_i$.
On overflow, the table size can be doubled and the next
function $h_{i+1}$ is then used for the subsequent insertions.
Here, we define $h(k, p)$ as the binary value formed by the $p$ least
significant bits of a good enough hash function $h(k)$.
Then $\{h(k, p) : p \ge 0\}$ is a refinable hash function family.
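The refinement property of the least-significant-bits family can be checked directly. In this sketch `hashBits(hv, p)` plays the role of h(k, p) applied to an already computed hash value hv; both function names are assumptions.

```cpp
#include <cassert>

// h(k, p): the value of the p least significant bits of the hash value hv.
unsigned hashBits(unsigned hv, int p) {
    assert(p >= 0 && p < 32);
    return hv & ((1u << p) - 1);
}

// True if the refinement condition holds for the pair (x, y):
// equal values under p+1 bits must imply equal values under p bits.
bool refinesOn(unsigned x, unsigned y, int p) {
    return hashBits(x, p + 1) != hashBits(y, p + 1) ||
           hashBits(x, p) == hashBits(y, p);
}
```

For the hash value 110 101 (decimal 53), 2 bits give 01 and 3 bits give 101. Two keys that agree on p+1 low bits always agree on p low bits, so using one more bit only ever splits a bucket and never merges two.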
Dynamic Hashing Using Directories
When a bucket overflows, split only that bucket in two.
Conceptually the hash table is doubled.
It can be achieved by using a directory for indexing to
buckets.
In addition, keep track of how many bits are actually in use:
the depth of the directory (global depth, gd) and the depth of each bucket (local depth, ld).
[Figure: a directory with gd=2 (entries 00 to 11, every bucket with ld=2) before an overflow, and the doubled directory with gd=3 (entries 000 to 111) after the split. The split bucket and the new bucket have ld=3; all other buckets keep ld=2.]
Rule of Bucket Splitting
On overflow of bucket $m$, whose keys share the $ld(m)$ least significant bits $x$:
If ($gd > ld(m)$) then
  Split bucket $m$ into bucket $m$ and bucket $m'$
  Set the directory entry of $x + 2^{ld(m)}$ to the address of bucket $m'$
  For all other entries $j > x + 2^{ld(m)}$ pointing to bucket $m$, in increasing order of $j$,
    Copy the value of the directory entry $d[j - t]$ to the entry $d[j]$
    where $t$ is the largest integer not greater than $j$ that is a power of 2.
  Update local depths: $ld(m)$++ and $ld(m')$++
If ($gd == ld(m)$) then
  Double the directory and copy the 1st half of the directory to the 2nd half
  Update the global depth: $gd$++
  Split bucket $m$ into bucket $m$ and bucket $m'$
  Set the directory entry of $x + 2^{ld(m)}$ to the address of bucket $m'$
  Update local depths: $ld(m)$++ and $ld(m')$++
The example of Dynamic Hashing Using Directories
Let the hash function $h(k)$ be given by the table below.
k    h(k)
A0   100 000
A1   100 001
B0   101 000
B1   101 001
C1   110 001
C2   110 010
C3   110 011
C5   110 101
Assume that each bucket has 2 slots.
A directory, $d$, is used to point to buckets.
The size of the directory depends on the number of
bits of $h(k)$ used to index into the directory.
Initially, if $h(k, 2)$ is used, then each bucket is indexed
by the 2-bit binaries 00, 01, 10, 11.
Let the keys be inserted in the order
A0, B0, A1, B1, C2, and C3.
The directory and buckets are then (gd=2):
d    bucket   ld
00   A0, B0   2
01   A1, B1   2
10   C2       2
11   C3       2
Insertion of C5 (110 101)
Since $gd = 2$ and $h(C5, 2) = 01$ and the bucket $m$ pointed by $d[01]$ is filled by A1 and
B1, an overflow occurs.
According to the 2nd bucket splitting rule,
Double the directory, copy 1st half to 2nd half, set $gd = 3$,
Split $m$ into 2 buckets of 001(A1, B1) and 101(C5)
Set $d[001]$ pointing to bucket (A1, B1)
Set $d[101]$ pointing to bucket (C5)
Set $ld(d[001]) = 3$ and $ld(d[101]) = 3$
The directory and buckets after the insertion (gd=3):
d     bucket   ld
000   A0, B0   2
001   A1, B1   3
010   C2       2
011   C3       2
101   C5       3
The remaining entries share buckets: 100 with 000, 110 with 010, and 111 with 011.
Insertion of C1 (110 001) after C5
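Since $gd = 3$ and $h(C1, 3) = 001$ and the bucket $m$ pointed by $d[001]$ is filled by A1 and B1, an
overflow occurs.
According to the 2nd splitting rule,
Double the directory, copy 1st half to 2nd half, set $gd = 4$,
Split $m$ into 2 buckets of 0001(A1, C1) and 1001(B1)
Set $d[0001]$ pointing to (A1, C1)
Set $d[1001]$ pointing to (B1)
Set $ld(d[0001]) = 4$ and $ld(d[1001]) = 4$
The directory and buckets after the insertion (gd=4):
d      bucket   ld
0000   A0, B0   2
0001   A1, C1   4
0010   C2       2
0011   C3       2
0101   C5       3
1001   B1       4
The remaining entries share buckets: 0100, 1000, 1100 with 0000; 0110, 1010, 1110 with 0010; 0111, 1011, 1111 with 0011; 1101 with 0101.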
Insertion of C1 (110 001) after C3
Since $gd = 2$ and $h(C1, 2) = 01$ and the
bucket $m$ pointed by $d[01]$ is filled by A1
and B1, an overflow occurs.
According to the 2nd bucket splitting rule,
Double the directory, copy 1st half to 2nd half,
set $gd = 3$,
Because h(C1,3)=h(A1,3)=h(B1,3)=001,
splitting cannot resolve the overflow,
d[101] points to an empty bucket
The directory needs to double again, $gd = 4$.
Now $m$ can be divided in such a way that
d[0001] points to 0001(A1, C1)
d[1001] points to 1001(B1)
The directory and buckets after the insertion (gd=4):
d      bucket    ld
0000   A0, B0    2
0001   A1, C1    4
0010   C2        2
0011   C3        2
0101   (empty)   3
1001   B1        4
The remaining entries share buckets: 0100, 1000, 1100 with 0000; 0110, 1010, 1110 with 0010; 0111, 1011, 1111 with 0011; 1101 with 0101.
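The directory examples can be replayed with a compact sketch. It stores hash values directly (A0 = 100 000 = 32, C5 = 110 101 = 53, and so on), finds the entries to redirect by scanning the whole directory instead of using the copy rule, and retries after each split. The class and member names are hypothetical, and all inserted hash values are assumed to be distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

const std::size_t kSlots = 2;  // slots per bucket, as in the example

struct Bucket {
    int ld = 0;                  // local depth
    std::vector<unsigned> keys;  // stored hash values (at most kSlots)
};

class ExtendibleTable {
public:
    explicit ExtendibleTable(int gd) : gd_(gd) {
        assert(gd >= 0 && gd < 20);
        for (unsigned i = 0; i < (1u << gd); ++i) {
            auto b = std::make_shared<Bucket>();
            b->ld = gd;
            d_.push_back(b);
        }
    }

    // Insert a hash value; all inserted values are assumed to be distinct.
    void insert(unsigned hv) {
        for (;;) {
            std::shared_ptr<Bucket> m = d_[low(hv, gd_)];
            if (m->keys.size() < kSlots) { m->keys.push_back(hv); return; }
            if (m->ld == gd_) {  // 2nd rule: double the directory first
                assert(gd_ < 20);
                std::vector<std::shared_ptr<Bucket>> firstHalf = d_;
                d_.insert(d_.end(), firstHalf.begin(), firstHalf.end());
                ++gd_;
            }
            split(m);  // then retry: one split may not resolve the overflow
        }
    }

    int gd() const { return gd_; }
    const Bucket& bucket(unsigned i) const { return *d_[i]; }
    bool sameBucket(unsigned i, unsigned j) const { return d_[i] == d_[j]; }

private:
    static unsigned low(unsigned hv, int p) { return hv & ((1u << p) - 1); }

    void split(const std::shared_ptr<Bucket>& m) {
        unsigned bit = 1u << m->ld;  // the bit that now separates m from m'
        auto m2 = std::make_shared<Bucket>();
        m2->ld = ++m->ld;
        std::vector<unsigned> stay;
        for (unsigned hv : m->keys) ((hv & bit) ? m2->keys : stay).push_back(hv);
        m->keys = stay;
        for (std::size_t i = 0; i < d_.size(); ++i)
            if (d_[i] == m && (i & bit)) d_[i] = m2;  // redirect half the entries
    }

    int gd_;
    std::vector<std::shared_ptr<Bucket>> d_;
};
```

Starting from gd=2 and inserting 32, 40, 33, 41, 50, 51 (A0, B0, A1, B1, C2, C3), the insertion of 53 (C5) doubles the directory to gd=3, and the insertion of 49 (C1) doubles it again to gd=4 with A1 and C1 under 0001 and B1 under 1001.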
Directoryless Dynamic Hashing
It is possible to use a huge hash table whose size cannot be
changed dynamically, but in which more buckets are activated
dynamically.
Two control variables $q$ and $r$ are used to keep track of the active
buckets, where $0 \le q < 2^r$.
At any time, only buckets 0 ~ $2^r + q - 1$ are active.
Each active bucket is the start of a chain of buckets.
The other buckets on a chain are called overflow buckets.
The active buckets 0 ~ $q-1$ and $2^r$ ~ $2^r + q - 1$ are indexed using $h(k, r+1)$.
The other active buckets, $q$ ~ $2^r - 1$, are indexed using $h(k, r)$.
A hash table ht with r=2 and q=0
Assume the hash function $h(k)$ is given by the table below.
k    h(k)
A0   100 000
A1   100 001
B0   101 000
B1   101 001
B4   101 100
B5   101 101
C1   110 001
C2   110 010
C3   110 011
C5   110 101
(a) r=2, q=0: the active buckets are 00 ~ 11.
bucket  slots
00      B4, A0
01      A1, B5
10      C2, -
11      C3, -
(b) Insert C5: h(C5,2)=01, overflow.
Activate bucket 100 and rehash the keys in bucket 00 by h(k,3): A0 to 000 and B4 to 100.
C5 goes to an overflow bucket of 001.
Now r=2, q=1:
bucket  slots    overflow
000     A0, -
001     A1, B5   C5
010     C2, -
011     C3, -
100     B4, -    (new active bucket)
Active buckets are 000 ~ 100:
000, 100 use h(k,3)
001~011 use h(k,2)
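The rule for choosing between h(k, r) and h(k, r+1) fits in one small function. The name `activeBucket` is hypothetical, and hash values are passed as integers (A0 = 100 000 = 32, B4 = 101 100 = 44, C5 = 110 101 = 53).

```cpp
#include <cassert>

// Active bucket for a key with hash value hv when buckets
// 0 .. 2^r + q - 1 are active.
unsigned activeBucket(unsigned hv, int r, unsigned q) {
    assert(r >= 0 && r < 31 && q < (1u << r));
    unsigned i = hv & ((1u << r) - 1);          // h(k, r)
    if (i < q) i = hv & ((1u << (r + 1)) - 1);  // bucket i was split: use h(k, r+1)
    return i;
}
```

With r=2 and q=0, A0 and B4 both land in bucket 00. With r=2 and q=1, A0 stays in 000, B4 moves to 100, and C5 still maps to 001, where it has to go into an overflow bucket.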
Insertion of C1 (110 001)