11 Hashing
• If we look at the function’s inputs and outputs, they probably won’t “make
sense”
• This function is called a hash function because it “makes hash” of its inputs
Hash Function
• Hash function h:
• Mapping from U to the slots of a hash table T[0..m–1].
h : U → {0, 1, …, m–1}
[Figure: keys k1..k4 from the universe of keys U (actual keys K) hashed to slots h(k1)..h(k4) of table T[0..m–1]; here no two elements share a slot, i.e., the keys are distinct and the slot range is 0 to m–1.]
• A hash function maps the set A of all search-key values to the addresses where the actual records are placed: T[i] = k if the key of data k is i, and T[i] = NULL otherwise.
Finding the Hash Function
• How can we come up with this magic function?
A prime not too close to an exact power of 2 is often a good choice for m.
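As a sketch of this division method in Python, with an illustrative prime m = 701 (the value is an assumption, not prescribed by the text):

```python
# Division-method hash: h(k) = k mod m, where m is a prime
# not too close to an exact power of 2. m = 701 is illustrative.
def h(k, m=701):
    return k % m

# Keys land in slots 0..m-1.
print(h(123456))  # 123456 = 176*701 + 80, so slot 80
```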
Keys as Natural Numbers
• Hash functions assume that the keys are natural numbers.
• Collisions are normally treated as “first come, first served”—the first value
that hashes to the location gets it
• We have to find something to do with the second and subsequent values that
hash to this same location
Collisions
[Figure: keys k2 and k5 from the universe U collide, i.e., h(k2) = h(k5). Example: with m = 10, keys 75 and 25 both hash to slot 5.]
Handling Collisions
• Chaining
• Store all elements that hash to the same slot in a linked list.
• Store a pointer to the head of the linked list in the hash table slot.
[Figure: collisions resolved by chaining — h(k1) = h(k4), h(k2) = h(k5) = h(k6), h(k3) = h(k7); each colliding slot (marked X) holds the head of a chain.]
Chaining (cont’d)
[Figure: the chains themselves — slot h(k1) holds the list k1 → k4, slot h(k2) holds k5 → k2 → k6, slot h(k3) holds k7 → k3, and k8 occupies its own slot.]
Dictionary Operations with Chaining
• Chained-Hash-Insert (T, x)
• Insert x at the head of list T[h(key[x])].
• Worst-case complexity – O(1).
• Chained-Hash-Delete (T, x)
• Delete x from the list T[h(key[x])].
• Worst-case complexity – proportional to length of list with singly-linked lists. O(1)
with doubly-linked lists.
• Chained-Hash-Search (T, k)
• Search an element with key k in list T[h(k)].
• Worst-case complexity – proportional to length of list.
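A minimal Python sketch of these three operations, assuming integer keys and a division-method hash; a plain Python list stands in for the linked list at each slot:

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds the chain of keys mapped to it."""

    def __init__(self, m=10):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m

    def insert(self, key):
        # Insert at the head of the chain: O(1) worst case.
        self.slots[self._h(key)].insert(0, key)

    def search(self, key):
        # Proportional to the length of the chain.
        return key in self.slots[self._h(key)]

    def delete(self, key):
        # Proportional to chain length (singly-linked behaviour).
        self.slots[self._h(key)].remove(key)

# With m = 10, keys 75 and 25 collide (both hash to slot 5) but chain together.
t = ChainedHashTable(m=10)
t.insert(75)
t.insert(25)
print(t.slots[5])  # [25, 75]
```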
Analysis of Hashing with Chaining:
Worst Case
• Worst case:
• All n keys hash to the same slot, forming one long chain.
• Worst-case time to search is Θ(n), plus the time to compute the hash function.
• The main objective is therefore to choose a hash function that minimizes collisions.
Open Addressing
• If collision occurs, open addressing scheme probes for some other empty
(or open) location in which to place the item.
• We wrap around from the last table location to the first table location if
necessary.
Linear Probing: Example
• Table size is 11 (0..10).
• The more collisions there are, the more probes are required to find a free location, and the worse the performance.
• Clusters can get close to one another, and merge into a larger cluster.
• Thus, the one part of the table might be quite dense, even though another part has relatively few items.
• Primary clustering causes long probe searches and therefore decreases the overall efficiency.
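A sketch of linear-probing insertion in Python (assumed conventions: `None` marks a free slot; table size 11 as in the example above):

```python
def linear_probe_insert(table, key):
    """Place key in the first free slot at or after key mod len(table)."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m      # wrap around from the last slot to the first
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("table is full")

table = [None] * 11
print(linear_probe_insert(table, 20))  # 20 mod 11 = 9 -> slot 9
print(linear_probe_insert(table, 31))  # 31 mod 11 = 9, occupied -> slot 10
print(linear_probe_insert(table, 42))  # 42 mod 11 = 9, wraps around -> slot 0
```

The three keys hashing to the same home slot illustrate how a cluster forms and grows at the end of the table.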
Quadratic Probing
• Primary clustering problem can be almost eliminated if we use quadratic
probing scheme.
• In quadratic probing,
• We start from the original hash location i.
• If the location is free, the value is stored in it; otherwise, subsequent locations probed are offset by amounts that depend in a quadratic manner on the probe number.
• If location i is occupied, we check the locations i+1², i+2², i+3², i+4², ...
• We wrap around from the last table location to the first table location if necessary.
Quadratic Probing: Example
• Table size is 11 (0..10)
• Hash function: h(x) = x mod 11
• Insert keys:
• 20 mod 11 = 9
• 30 mod 11 = 8
• 2 mod 11 = 2
• 13 mod 11 = 2 → collision: 2+1² = 3
• 25 mod 11 = 3 → collision: 3+1² = 4
• 24 mod 11 = 2 → collision: 2+1², then 2+2² = 6
• 10 mod 11 = 10
• 9 mod 11 = 9 → collision: 9+1², (9+2²) mod 11, then (9+3²) mod 11 = 7
• Final table: T[2]=2, T[3]=13, T[4]=25, T[6]=24, T[7]=9, T[8]=30, T[9]=20, T[10]=10
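The probe sequence above can be reproduced with a short Python sketch (`None` marks a free slot; the function name is illustrative):

```python
def quadratic_probe_insert(table, key):
    """Probe slots h, h+1^2, h+2^2, ... (mod table size) until a free slot is found."""
    m = len(table)
    h = key % m
    for i in range(m):
        slot = (h + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot reached")

table = [None] * 11
for k in [20, 30, 2, 13, 25, 24, 10, 9]:
    quadratic_probe_insert(table, k)
print(table)  # key 9 ends up in slot (9 + 3*3) mod 11 = 7
```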
Double Hashing
• Although quadratic probing is free from primary clustering, it is still liable to what is known as secondary clustering:
• If there is a collision between two keys, then the same probe sequence will be followed for both.
• Double hashing also reduces clustering. In double hashing, we use two hash functions rather than a single function.
• In linear probing and quadratic probing, the probe sequences are independent of the key (beyond the initial slot).
• We can select the increments used during probing with a second hash function. The second hash function h2 must satisfy h2(key) ≠ 0.
• Example: h1(x) = x mod 11, table size 11; here h2(14) = h2(91) = 7.
• Insert keys:
• x = 58: h1(58) = 58 mod 11 = 3 → slot 3
• x = 14: h1(14) = 14 mod 11 = 3 → collision; h1(14) + h2(14) = 3 + 7 = 10 → slot 10
• x = 91: h1(91) = 91 mod 11 = 3 → collision; h1(91) + h2(91) = 3 + 7 = 10 → occupied; (h1(91) + 2·h2(91)) mod 11 = (3 + 14) mod 11 = 6 → slot 6
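The slides do not spell out h2, but the values h2(14) = h2(91) = 7 are consistent with the common textbook choice h2(x) = 7 − (x mod 7), which this Python sketch assumes:

```python
def double_hash_insert(table, key):
    """Probe h1, h1+h2, h1+2*h2, ... (mod table size)."""
    m = len(table)
    h1 = key % m            # first hash: x mod 11 in the example
    h2 = 7 - (key % 7)      # assumed second hash; never 0
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot reached")

table = [None] * 11
for k in [58, 14, 91]:
    double_hash_insert(table, k)
print(table[3], table[10], table[6])  # 58 14 91
```

Because the step size depends on the key, two keys with the same home slot usually follow different probe sequences, which is what breaks secondary clustering.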
Open Addressing: Retrieval &
Deletion
• In open addressing, to find an item with a given key:
• We probe the locations (in the same order as insertion) until we find the desired item or we reach an empty location.
• We CANNOT simply delete an item from the hash table, because the newly emptied location would cause later retrievals to stop prematurely, incorrectly indicating a failure.
• Solution: distinguish three kinds of locations in the hash table: Occupied, Empty, and Deleted.
• A deleted location is treated as occupied during retrieval (the probe continues past it), but it may be reused during insertion.
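A sketch of the three slot states in Python, using linear probing; `EMPTY` and `DELETED` are sentinel objects (the names are assumptions, not from the slides):

```python
EMPTY, DELETED = object(), object()   # sentinel markers for the two non-occupied states

class OpenAddressTable:
    """Linear-probing table whose slots are Occupied, EMPTY, or DELETED."""

    def __init__(self, m=11):
        self.slots = [EMPTY] * m

    def _probe(self, key):
        m = len(self.slots)
        for i in range(m):
            yield (key % m + i) % m   # wrap around if necessary

    def insert(self, key):
        for s in self._probe(key):
            if self.slots[s] in (EMPTY, DELETED):  # deleted slots are reusable
                self.slots[s] = key
                return s
        raise OverflowError("table is full")

    def search(self, key):
        for s in self._probe(key):
            if self.slots[s] is EMPTY:  # a truly empty slot ends the search
                return None
            if self.slots[s] == key:    # DELETED slots are walked past
                return s
        return None

    def delete(self, key):
        s = self.search(key)
        if s is not None:
            self.slots[s] = DELETED     # mark, don't empty

t = OpenAddressTable(m=11)
t.insert(3)          # slot 3
t.insert(14)         # 14 mod 11 = 3, probes on to slot 4
t.delete(3)          # slot 3 becomes DELETED, not EMPTY
print(t.search(14))  # 4 -- the probe walks past the deleted slot
print(t.insert(25))  # 3 -- insertion reuses the deleted slot
```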
Hashing Techniques
• Static Hashing
• One of the problems with static hashing is that we need to know how many records are
going to be stored in the index. If over time a large number of records are added,
resulting in far more records than buckets, lookups would have to search through a large
number of records stored in a single bucket, or in one or more overflow buckets, and
would thus become inefficient.
• Dynamic Hashing: the hash index can be rebuilt with an increased number of buckets. For example, if the number of records becomes twice the number of buckets, the index can be rebuilt with twice as many buckets as before.
• Linear
• Extendible
Static Hashing
• With hash based indexing, we assume that we have a function h, which tells us
where to place any given record.
• E.g., page_number = h(value) mod N, where N should be a prime number
[Figure: N data pages, Page 1 through Page N, addressed directly by the hash function.]
Static Hashing
• A bucket is a unit of storage containing one or more records (a bucket is typically a
disk block).
• In a hash file organization we obtain the bucket of a record directly from its search-
key value using a hash function.
• Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
• Hash function is used to locate records for access, insertion as well as deletion.
• Records with different search-key values may be mapped to the same bucket; thus
entire bucket has to be searched sequentially to locate a record.
Static Hashing: Overflow
• Insertion may cause overflow. The solution is to create chains of overflow pages.
• The worst hash function maps all search-key values to the same bucket.
[Figure: primary bucket pages, Page 1 .. Page N, allocated sequentially and never de-allocated; overflow pages are allocated as needed when the corresponding buckets become full — here a chain of overflow pages hangs off bucket 3. Long overflow chains degrade performance!]
Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a fixed set of B bucket
addresses.
• Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
• If file size at some point in the future is anticipated and number of buckets allocated accordingly,
significant amount of space will be wasted initially.
• One option is periodic re-organization of the file with a new hash function, but it is very
expensive.
• These problems can be avoided by using techniques that allow the number of buckets
to be modified dynamically.
Dynamic Hashing
• Dynamic = Changing number of Buckets B dynamically
• Two methods
• Extendible (or Extensible) Hashing: Grow B by doubling it
• Linear Hashing: Grow B by incrementing it by 1
• Must avoid oscillations when deletions and insertions are both common.
Extendible Hashing
• Idea: Use directory of pointers to buckets,
• double # of buckets B by doubling the directory, splitting just the
bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper.
• Only one page of data entries is split. No overflow blocks.
• Trick lies in how hash function is adjusted!
Extendible Hashing Overview
• The result of applying a hash function h is treated as a binary number, and the last d bits are interpreted as an offset into the directory.
• d is referred to as the global depth of the hash file and is kept as part of the header of the file.
• The (last) d bits are used as an index into a directory array containing 2^d entries, which usually resides in primary memory.
• The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts.
Example
[Figure: directory of size 4 (global depth d = 2) with entries 00, 01, 10, 11 pointing to the data pages: Bucket A = {4*, 12*, 32*, 16*}, Bucket B = {1*, 5*, 21*, 13*}, Bucket C = {10*}, Bucket D = {15*, 7*, 19*}. To look up 13 = 1101, the last 2 bits 01 lead to Bucket B; likewise for 5 = 101, the last 2 bits 01 lead to Bucket B.]
Example
• The directory is an array of size 4; the directory has a global depth and each bucket has a local depth.
• To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r).
• If h(r) = 5 = binary 101, r is in the bucket pointed to by directory entry 01.
• Global depth of the directory: the maximum number of bits needed to tell which bucket an entry belongs to.
[Figure: inserting 6* (binary 110) — the last 2 bits are 10, so 6* joins Bucket C, which becomes {10*, 6*}; the bucket is not full, so the directory and all local depths are unchanged.]
Insert h(r) = 20 (Causes Doubling)
• 20 = binary 10100; the last 2 bits (00) point to Bucket A, which is full.
• Bucket A's local depth equals the global depth, so the directory must double: the global depth becomes 3 and entries 000..111 are used.
• Only the overflowing bucket is split, on its third-from-last bit: A = {32*, 16*} and its `split image' A2 = {4*, 12*, 20*}.
[Figure: the size-8 directory after doubling. A subsequent insert of 9* (9 = binary 01001) splits Bucket B = {1*, 5*, 21*, 13*} into B = {1*, 9*} and its `split image' B2 = {5*, 21*, 13*}, since 5 = 00101, 13 = 01101, and 21 = 10101 all end in 101.]
Extendible Hashing: Inserting Entries
• Find the appropriate bucket (as in search), split the bucket if it is full, double the directory if necessary, and insert the given entry.
• Example: inserting 9* (9 = binary 1001, directory entry 001) finds Bucket B full; B splits (local depth 2 → 3) into B = {1*, 9*} and B2 = {5*, 21*, 13*}. The global depth is already 3, so the directory does not double.
• Delete: If removal of a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, we can halve the directory (this is rare in practice).
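A compact Python sketch of this insert procedure, assuming integer keys used directly as hash values and a bucket capacity of 4 as in the figures (all names are illustrative):

```python
class ExtendibleHash:
    """Extendible hashing sketch: a directory of 2**d entries points to buckets."""

    def __init__(self, bucket_cap=4):
        self.cap = bucket_cap
        self.d = 1                                # global depth
        self.dir = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

    def _idx(self, key):
        return key & ((1 << self.d) - 1)          # last d bits of the key

    def insert(self, key):
        b = self.dir[self._idx(key)]
        if len(b["keys"]) < self.cap:
            b["keys"].append(key)
            return
        if b["depth"] == self.d:                  # can't split further: double directory
            self.dir = self.dir + self.dir
            self.d += 1
        b["depth"] += 1                           # split only the overflowing bucket
        bit = 1 << (b["depth"] - 1)
        image = {"depth": b["depth"],
                 "keys": [k for k in b["keys"] if k & bit]}    # the `split image'
        b["keys"] = [k for k in b["keys"] if not (k & bit)]
        for i in range(len(self.dir)):            # repoint half the entries to the image
            if self.dir[i] is b and (i & bit):
                self.dir[i] = image
        self.insert(key)                          # retry; may split again

# Reproduce the figures: inserting 20 doubles the directory to 8 entries and
# splits bucket A = {4, 12, 32, 16} into {32, 16} and split image {4, 12, 20}.
eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20]:
    eh.insert(k)
print(eh.d, sorted(eh.dir[4]["keys"]))  # 3 [4, 12, 20]
```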
Linear Hashing
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
• LH handles the problem of long overflow chains without using a directory, and
handles duplicates.
• Idea: Use a family of hash functions h0, h1, h2, ...
• hi(key) = h(key) mod (2^i · N); N = initial # of buckets
• h is some hash function (its range is not limited to 0 to N–1)
• If N = 2^d0 for some d0, then hi consists of applying h and looking at the last di bits, where di = d0 + i.
• hi+1 doubles the range of hi (similar to directory doubling)
Linear Hashing: Bucket Split
• When the first overflow occurs (it can occur in any bucket), bucket 0, which is
pointed by p, is split (rehashed) into two buckets:
• The original bucket 0 and a new bucket m.
• A new empty page is also added in the overflown bucket to accommodate the
overflow.
• The search values originally mapped into bucket 0 (using function h0) are now
distributed between buckets 0 and m using a new hashing function h1.
Linear Hashing: Insertion
• Locate bucket to insert
• If bucket to insert into is full:
• Add overflow page and insert data entry.
• (Maybe) Split Next bucket and increment Next.
• A split occurs in case of overflow
• Since buckets are split round-robin, long overflow chains don’t develop!
• Round-robin: the Next pointer advances one bucket per split; once every bucket of the current level has been split, the level increases by one, Next returns to bucket 0, and one more bit of the hash value is used.
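A Python sketch of round-robin splitting, assuming a split is triggered whenever an insert makes any bucket exceed its capacity (one possible policy; others trigger on load factor). Overflow keys simply stay in the bucket's list, standing in for an overflow page:

```python
class LinearHash:
    """Linear hashing sketch: no directory; buckets are split in round-robin order."""

    def __init__(self, n=4, cap=4):
        self.n0, self.cap = n, cap    # initial bucket count, bucket capacity
        self.level, self.next = 0, 0  # current round, next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _addr(self, key):
        h = key % (self.n0 << self.level)             # h_level
        if h < self.next:                             # bucket already split this round:
            h = key % (self.n0 << (self.level + 1))   # use h_{level+1} instead
        return h

    def insert(self, key):
        b = self._addr(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > self.cap:           # overflow triggers a split...
            self._split()                             # ...of bucket Next, not of b!

    def _split(self):
        hi = self.n0 << (self.level + 1)              # h_{level+1}
        old = self.buckets[self.next]
        self.buckets.append([k for k in old if k % hi != self.next])
        self.buckets[self.next] = [k for k in old if k % hi == self.next]
        self.next += 1
        if self.next == self.n0 << self.level:        # end of round
            self.level += 1
            self.next = 0

lh = LinearHash(n=4, cap=4)
for k in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11, 43]:
    lh.insert(k)
# 43 overflows bucket 3, but bucket 0 (where Next points) is the one that splits.
print(lh.next, lh.buckets[0], lh.buckets[4])  # 1 [32] [44, 36]
```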
Linear Hashing: Background
• We have seen what it means to split a bucket, and what it means to add an overflow page to a bucket. Example: insert 20.
[Figure: splitting — before: one bucket (2 bits) holds 4*, 12*, 32*, 16*; after: two buckets (3 bits), one holding 32*, 16* and the other 4*, 12*, 20*.]
[Figure: adding an overflow page — before: the full bucket 4*, 12*, 32*, 16*; after: the same bucket with 20* in an attached overflow page.]
Triggering Splits
• Let l denote the Linear Hashing scheme’s load factor, i.e., l = S ∕ b where S is
the total number of records and b is the number of buckets used.
[Figure: the Next pointer marks the bucket to be split; buckets before Next are addressed with h_{Level+1}, buckets from Next onward with h_Level.]
Linear Hashing: Example
• 43 = 101011 (h_Level uses the last 2 bits, h_{Level+1} the last 3 bits)
• 32 = 100000, 9 = 001001, 44 = 101100, 36 = 100100
• Note: the bucket that is split may not be the same as the one that overflowed!
Linear Hashing: Example
Insert 29 (00011101)
Linear Hashing: Example
Insert 22 (00010110)
Example: End of Round
• All buckets before the Next pointer have already been split.
• The level increases by one and Next moves back to the first bucket.
• Splits in the next round use one more bit. (50 = 110010)
Summary
• Hash-based indexes: best for equality searches, cannot support range
searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a
new data entry is to be added to it. (Duplicates may require overflow pages.)
• Directory to keep track of buckets, doubles periodically.
• Can get large with skewed data; additional I/O if this does not fit in main
memory.
• Linear Hashing avoids a directory by splitting buckets round-robin, and using
overflow pages.