
11 Hashing

The document discusses hashing as a method for efficient data retrieval, highlighting the importance of hash functions that map keys to specific slots in a hash table. It covers various collision handling techniques such as chaining and open addressing, including linear and quadratic probing, as well as double hashing. Additionally, it addresses the challenges of static and dynamic hashing, emphasizing the need for effective hash functions to minimize collisions and improve search efficiency.

Hashing

BBM371 Data Management


Motivation
• Consider the problem of searching an array for a given value

• If the array is not sorted, the search requires O(n) time


• If the value isn’t there, we need to search all n elements
• If the value is there, we search n/2 elements on average

• If the array is sorted, we can do a binary search


• A binary search requires O(log n) time
• About equally fast whether the element is found or not

• It doesn’t seem like we could do much better


• How about an O(1), that is, constant time search?
• We can do it if the array is organized in a particular way
Hashing
• Suppose we were to come up with a “magic function” that, given a value to
search for, would tell us exactly where in the array to look
• If it’s in that location, it’s in the array
• If it’s not in that location, it’s not in the array

• This function would have no other purpose

• If we look at the function’s inputs and outputs, they probably won’t “make
sense”

• This function is called a hash function because it “makes hash” of its inputs
Hash Function
• Hash function h:
• Mapping from U to the slots of a hash table T[0..m–1].
h : U → {0, 1, …, m–1}

• With arrays, key k maps to slot A[k].

• With hash tables, key k maps or “hashes” to slot T[h(k)].

• h(k) is the hash value of key k.


Hashing (cont’d)
• Bucket: a unit of storage that can store one or more records, stored in a linked list of index entries or records.
• U: the universal set of keys — all possible values that can be used as keys in your domain.
• K: the set of actual keys; no two elements have the same key, i.e., keys are distinct.
• T: the set of all bucket addresses, slots 0 to m–1.
• Hash function: a mapping function which maps the set of search keys to the addresses where the actual records are placed:
      T[i] = k, if the key of data k is i
      T[i] = NULL, otherwise
Finding the Hash Function
• How can we come up with this magic function?

• In general, we cannot: there is no such magic function


• In a few specific cases, where all the possible values are known in advance, it has
been possible to compute a perfect hash function

• What is the next best thing?


• A perfect hash function would tell us exactly where to look
• In general, the best we can do is a function that tells us where to start looking!
• The hash function itself should be efficient, so that it can be evaluated in constant time.
Example Hash Function
• A hash function is a mathematical formula which, when applied to a key, produces an
integer which can be used as an index for the key in the hash table.
• The main aim of a hash function is that elements should be distributed relatively randomly and
uniformly.
• Map a key k into one of the m slots by taking the remainder of k divided by m
(Division Method). That is,
h(k) = k mod m

h(ssn) = ssn mod 100 (i.e., the last two digits)

e.g., if ssn = 10123411 then h(10123411) = 11

A prime not too close to an exact power of 2 is often a good choice for m.
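The division method above can be sketched as follows (a minimal illustration; the keys and table sizes are taken from the slide's example):

```python
def h(k: int, m: int) -> int:
    """Division-method hash: map key k to one of m slots."""
    return k % m

# Last two digits of an SSN, as in the example
print(h(10123411, 100))  # -> 11

# A prime not too close to a power of 2 is often a better choice for m
print(h(10123411, 97))
```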
Keys as Natural Numbers
• Hash functions assume that the keys are natural numbers.

• When they are not, we have to interpret them as natural numbers.


• Keys could be strings or large objects, for example.

• Example: Interpret a character string as an integer expressed in some


radix notation. Suppose the string is CLRS:

• ASCII values: C=67, L=76, R=82, S=83.


• There are 128 basic ASCII values.
• So, CLRS = 67·128³ + 76·128² + 82·128¹ + 83·128⁰ = 141,764,947.
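The radix interpretation above can be sketched as a short function (a minimal illustration; radix 128 corresponds to the basic ASCII values):

```python
def string_to_key(s: str, radix: int = 128) -> int:
    """Interpret a character string as an integer in the given radix."""
    key = 0
    for ch in s:
        key = key * radix + ord(ch)  # shift left one "digit", add next char code
    return key

print(string_to_key("CLRS"))  # -> 141764947
```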
Collision
• When two values hash to the same array location, this is called a collision

• There is no hash function that eliminates collisions completely. A good hash
function can only minimize the number of collisions by spreading the
elements uniformly throughout the array.

• Collisions are normally treated as “first come, first served”—the first value
that hashes to the location gets it

• We have to find something to do with the second and subsequent values that
hash to this same location
Collisions

[Figure: two actual keys k2 and k5 from the universe U hash to the same slot, h(k2) = h(k5), causing a collision. E.g., with h(k) = k mod 10, keys 75 and 25 collide.]
Handling Collisions
• Chaining
  • Store all elements that hash to the same slot in a linked list.
  • Store a pointer to the head of the linked list in the hash table slot.

• Open Addressing (closed hashing)
  • All elements are stored in the hash table itself.
  • When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table: a probe sequence.

[Figure: colliding keys k1, k4 and k5, k2, k6 stored in per-slot linked lists.]
Chaining

[Figure: keys k1 and k4 collide (h(k1) = h(k4)) and are stored in one linked list; k2, k5, k6 share another list (h(k2) = h(k5) = h(k6)); k3 and k7 share a third (h(k3) = h(k7)); k8 occupies a slot of its own.]
Dictionary Operations with Chaining
• Chained-Hash-Insert (T, x)
• Insert x at the head of list T[h(key[x])].
• Worst-case complexity – O(1).

• Chained-Hash-Delete (T, x)
• Delete x from the list T[h(key[x])].
• Worst-case complexity – proportional to length of list with singly-linked lists. O(1)
with doubly-linked lists.

• Chained-Hash-Search (T, k)
• Search an element with key k in list T[h(k)].
• Worst-case complexity – proportional to length of list.
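The three chained operations above can be sketched as a small class (a minimal illustration; the table size and keys below are assumptions chosen so that all three keys collide):

```python
class ChainedHashTable:
    """Collision handling by chaining: each slot holds a list of keys."""
    def __init__(self, m: int):
        self.m = m
        self.table = [[] for _ in range(m)]

    def insert(self, key: int) -> None:
        self.table[key % self.m].insert(0, key)   # insert at head of list: O(1)

    def search(self, key: int) -> bool:
        return key in self.table[key % self.m]    # scan one chain only

    def delete(self, key: int) -> None:
        self.table[key % self.m].remove(key)      # O(chain length) with singly-linked lists

t = ChainedHashTable(11)
for k in (20, 31, 42):      # all three hash to slot 9 (mod 11)
    t.insert(k)
print(t.search(31))  # -> True
```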
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to search for an element with a given key?

• Worst case:
  • All n keys hash to the same slot, forming one long chain.
  • Worst-case search time is Θ(n), plus the time to compute the hash function.
• The main objective is to provide a hash function that minimizes collisions.
Open Addressing
• If collision occurs, open addressing scheme probes for some other empty
(or open) location in which to place the item.

• The sequence of locations that we examine is called the probe sequence.

• The process of examining memory locations in the hash table is called


probing.
• There are different open-addressing schemes:
• Linear Probing
• Quadratic Probing
• Double Hashing
Linear Probing
• In linear probing, we search the hash table sequentially starting from
the original hash location.

• If a location is occupied, we check the next location

• We wrap around from the last table location to the first table location if
necessary.
Linear Probing: Example
• Table size is 11 (0..10)

• Hash function: h(x) = x mod 11

• Insert keys:
  • 20 mod 11 = 9
  • 30 mod 11 = 8
  • 2 mod 11 = 2
  • 13 mod 11 = 2 → 2+1 = 3
  • 25 mod 11 = 3 → 3+1 = 4
  • 24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5
  • 10 mod 11 = 10
  • 9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0

• Resulting table: 0:9, 2:2, 3:13, 4:25, 5:24, 8:30, 9:20, 10:10
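The linear-probing insertion walked through above can be sketched as (a minimal illustration; it assumes the table never fills completely):

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key by linear probing; return the slot used."""
    m = len(table)
    i = key % m
    while table[i] is not None:   # occupied: check the next location,
        i = (i + 1) % m           # wrapping around if necessary
    table[i] = key
    return i

table = [None] * 11
for k in (20, 30, 2, 13, 25, 24, 10, 9):
    linear_probe_insert(table, k)
print(table)  # -> [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]
```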
Linear Probing: Clustering Problem
• One of the problems with linear probing is that table items tend to cluster together in the
hash table.
• there is a higher risk of more collisions where one collision has already taken place.
• This means that the table contains groups of consecutively occupied locations.

• The more collisions there are, the more probes are required to find a free location,
and the lower the performance.

• This phenomenon is called primary clustering.

• Clusters can get close to one another, and merge into a larger cluster.
• Thus, the one part of the table might be quite dense, even though another part has relatively few items.

• Primary clustering causes long probe searches and therefore decreases the overall efficiency.
Quadratic Probing
• The primary clustering problem can be almost eliminated if we use a quadratic
probing scheme.

• In quadratic probing,
  • We start from the original hash location i.
  • If the location is free, the value is stored in it; otherwise, subsequent probed locations are offset from i by amounts that depend quadratically on the probe number.
  • If a location is occupied, we check the locations i+1², i+2², i+3², i+4², ...
  • We wrap around from the last table location to the first table location if necessary.
Quadratic Probing: Example
• Table size is 11 (0..10)

• Hash function: h(x) = x mod 11

• Insert keys:
  • 20 mod 11 = 9
  • 30 mod 11 = 8
  • 2 mod 11 = 2
  • 13 mod 11 = 2 → 2+1² = 3
  • 25 mod 11 = 3 → 3+1² = 4
  • 24 mod 11 = 2 → 2+1², 2+2² = 6
  • 10 mod 11 = 10
  • 9 mod 11 = 9 → 9+1², (9+2²) mod 11, (9+3²) mod 11 = 7

• Resulting table: 2:2, 3:13, 4:25, 6:24, 7:9, 8:30, 9:20, 10:10
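The quadratic-probing example above can be sketched as (a minimal illustration; note that, unlike linear probing, quadratic probing is not guaranteed to find a free slot in a nearly full table):

```python
def quadratic_probe_insert(table: list, key: int) -> int:
    """Insert key by quadratic probing; return the slot used."""
    m = len(table)
    h = key % m
    j = 0
    while True:
        i = (h + j * j) % m       # offsets 0, 1, 4, 9, ... from the home slot
        if table[i] is None:
            table[i] = key
            return i
        j += 1

table = [None] * 11
for k in (20, 30, 2, 13, 25, 24, 10, 9):
    quadratic_probe_insert(table, k)
print(table)  # -> [None, None, 2, 13, 25, None, 24, 9, 30, 20, 10]
```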
Double Hashing
• Although quadratic probing is free from primary clustering, it is still liable to what is known as secondary
clustering.
• It means that if there is a collision between two keys, then the same probe sequence will be followed for both.
• Double hashing also reduces clustering. In double hashing, we use two hash functions rather than a single
function.

• In linear probing and quadratic probing, the probe increments are independent of the key.

• We can select the increments used during probing with a second hash function. The second hash
function h2 should satisfy:
      h2(key) ≠ 0
      h2 ≠ h1

• We first probe the location h1(key)


• If the location is occupied, we probe the locations h1(key)+h2(key), h1(key)+2·h2(key), ...
Double Hashing: Example
• Table size is 11 (0..10)

• Hash functions: h1(x) = x mod 11
                  h2(x) = 7 – (x mod 7)

• Insert keys:
  • x = 58: h1(58) = 58 mod 11 = 3
  • x = 14:
    • h1(14) = 14 mod 11 = 3 (occupied)
    • h1(14) + h2(14) = 3 + 7 = 10
  • x = 91:
    • h1(91) = 91 mod 11 = 3 (occupied)
    • h1(91) + h2(91) = 3 + 7 = 10 (occupied)
    • (h1(91) + 2·h2(91)) mod 11 = (3 + 14) mod 11 = 6

• Resulting table: 3:58, 6:91, 10:14
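The double-hashing example above can be sketched as (a minimal illustration; the second hash function is the one from the example, and the loop assumes a free slot is eventually found):

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key by double hashing; return the slot used."""
    m = len(table)
    h1 = key % m
    h2 = 7 - (key % 7)            # second hash from the example; never zero
    j = 0
    while True:
        i = (h1 + j * h2) % m     # probe h1, h1+h2, h1+2*h2, ...
        if table[i] is None:
            table[i] = key
            return i
        j += 1

table = [None] * 11
for k in (58, 14, 91):
    double_hash_insert(table, k)
print(table)  # -> [None, None, None, 58, None, None, 91, None, None, None, 14]
```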
Open Addressing: Retrieval & Deletion
• In open addressing, to find an item with a given key:

• We probe the locations (same as insertion) until we find the desired item or we reach to an
empty location.

• Deletions in open addressing cause complications:

• We CANNOT simply mark a deleted slot as empty, because such a slot would stop a later
retrieval prematurely, incorrectly indicating a failure.

• Solution: We have to have three kinds of locations in a hash table: Occupied, Empty, and Deleted.

• A deleted location is treated as an occupied location during retrieval (probing continues past it), but it can be reused during insertion.
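The three-state scheme above can be sketched with linear probing and tombstone markers (a minimal illustration; the sentinel objects and helper names are assumptions for this sketch):

```python
EMPTY, DELETED = object(), object()  # sentinel markers for the slot states

def probe_slots(table, key):
    """Yield the linear-probe sequence for key."""
    m = len(table)
    for j in range(m):
        yield (key % m + j) % m

def search(table, key):
    for i in probe_slots(table, key):
        if table[i] is EMPTY:        # truly empty: the key cannot be further on
            return None
        if table[i] == key:          # DELETED slots are simply probed past
            return i
    return None

def insert(table, key):
    for i in probe_slots(table, key):
        if table[i] is EMPTY or table[i] is DELETED:  # tombstones are reusable
            table[i] = key
            return i

def delete(table, key):
    i = search(table, key)
    if i is not None:
        table[i] = DELETED           # tombstone, NOT empty

table = [EMPTY] * 11
insert(table, 2); insert(table, 13)  # both hash to slot 2; 13 lands in slot 3
delete(table, 2)
print(search(table, 13))  # -> 3 (the probe passes the deleted slot 2)
```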
Hashing Techniques
• Static Hashing
• One of the problems with static hashing is that we need to know how many records are
going to be stored in the index. If over time a large number of records are added,
resulting in far more records than buckets, lookups would have to search through a large
number of records stored in a single bucket, or in one or more overflow buckets, and
would thus become inefficient.
• Dynamic Hashing: the hash index can be rebuilt with an increased number of
buckets. For example, if the number of records becomes twice the number of
buckets, the index can be rebuilt with twice as many buckets as before.
• Linear
• Extendable
Static Hashing
• With hash based indexing, we assume that we have a function h, which tells us
where to place any given record.
• E.g., page_number = h(value) mod N, where N should be prime.

[Figure: data pages 1, 2, 3, …, N–1, N; h(value) selects the page for each record.]
Static Hashing
• A bucket is a unit of storage containing one or more records (a bucket is typically a
disk block).

• In a hash file organization we obtain the bucket of a record directly from its search-
key value using a hash function.

• Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.

• Hash function is used to locate records for access, insertion as well as deletion.

• Records with different search-key values may be mapped to the same bucket; thus
entire bucket has to be searched sequentially to locate a record.
Static Hashing: Overflow
• Insertion may cause overflow. The solution is to create chains of overflow pages.
• The worst hash function maps all search-key values to the same bucket.

• Primary bucket pages are allocated sequentially and never de-allocated; overflow pages are allocated (as needed) when the corresponding buckets become full.

• Long overflow chains degrade performance!

[Figure: primary bucket pages 1, 2, 3, …, N–1, N, with a chain of overflow pages hanging off page 3.]
Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a fixed set of B bucket
addresses.

• Databases grow with time. If the initial number of buckets is too small, performance will degrade
due to too many overflows.

• If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly,
a significant amount of space will be wasted initially.

• If database shrinks, again space will be wasted.

• One option is periodic re-organization of the file with a new hash function, but it is very
expensive.

• These problems can be avoided by using techniques that allow the number of buckets
to be modified dynamically.
Dynamic Hashing
• Dynamic = Changing number of Buckets B dynamically

• Two methods
• Extendible (or Extensible) Hashing: Grow B by doubling it
• Linear Hashing: Grow B by incrementing it by 1

• To save storage space both methods can choose to shrink B dynamically

• Must avoid oscillations when removals and additions are both common.
Extendible Hashing
• Idea: Use directory of pointers to buckets,
• double # of buckets B by doubling the directory, splitting just the
bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper.
• Only one page of data entries is split. No overflow blocks.
• Trick lies in how hash function is adjusted!
Extendible Hashing Overview
• The result of applying a hash function h is treated as a binary number, and the last d bits are interpreted as an offset into the directory.

• d is referred to as the global depth of the hash file and is kept as part of the header of the file.

• To search for a data entry, apply the hash function h to the key and take the last d bits of its binary representation to get the bucket number.
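Extracting the last d bits can be sketched with a bit mask (a minimal illustration; the key values match the directory example that follows):

```python
def bucket_number(h_value: int, d: int) -> int:
    """Take the last d bits of the hash value as the directory offset."""
    return h_value & ((1 << d) - 1)   # mask keeps the low-order d bits

# With global depth d = 2, key 13 = binary 1101 -> last 2 bits are 01
print(bucket_number(13, 2))  # -> 1
# After the directory doubles to d = 3, the same key lands in entry 101
print(bucket_number(13, 3))  # -> 5
```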
Extendible Hashing
• h(k) maps keys to a fixed address space

• File pointers point to blocks of records known as buckets:

• an entire bucket is read in one physical data transfer, and buckets may be added to or
removed from the file dynamically.

• The last d bits are used as an index into a directory array, which usually resides in primary memory.

• The value d, the directory size (2^d), and the number of buckets change
automatically as the file expands and contracts.
Example

[Figure: a directory of size 4 (entries 00, 01, 10, 11, so d = 2) pointing to data pages (buckets). Key 13 = binary 1101 maps to entry 01, Bucket B; key 5 = binary 101 also maps to entry 01. Bucket A holds 4*, 12*, 32*, 16*; Bucket B holds 1*, 5*, 21*, 13*; Bucket C holds 10*; Bucket D holds 15*, 7*, 19*.]
Example
• The directory is an array of size 4.

• To find the bucket for a record r, take the last `global depth' # of bits of h(r); we denote r by h(r).
  • If h(r) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.

• Global depth of the directory: max # of bits needed to tell which bucket an entry belongs to.

• Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
  • Local depth is always less than or equal to global depth.

[Figure: directory entries 00–11 with global depth 2; Buckets A (4*, 12*, 32*, 16*), B (1*, 5*, 21*, 13*), C (10*), D (15*, 7*, 19*), each with local depth 2.]
Insert an Item
• Locate the bucket
• If there is space in bucket, insert the item.
• If bucket is full, split it (allocate new page, re-distribute).

• If necessary, double the directory.


• If insert causes local depth to become > global depth
• directory is doubled by copying it over and `fixing’ pointer to split image page.
Insert h(r) = 6 (The Easy Case)
6 = binary 00110

[Figure: the last 2 bits of 6 are 10, so 6* goes to Bucket C, which has room: Bucket C changes from {10*} to {10*, 6*}. No split and no directory change is needed.]
Extendible Hashing: Inserting Entries
• Find the appropriate bucket (as in search), split the bucket if full, double the directory if necessary, and insert the given entry.

• Example: insert 13*
  • 13 = binary 1101, so 13* maps to directory entry 01 (Bucket B), which has room.

[Figure: directory 00–11; Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.]
Insert h(r) = 20 (Causes Doubling)
20 = binary 10100

• Bucket A (4*, 12*, 32*, 16*) is FULL, hence split and redistribute!
• The third bit distinguishes between the two resulting buckets:
  • Bucket A (local depth 3): 32* = 10000, 16* = 01000
  • Bucket A2, the `split image' of Bucket A (local depth 3): 4* = 00100, 12* = 01100, 20* = 10100
Insert h(r) = 20 (Causes Doubling, cont’d)
20 = binary 10100

• Double the directory and increase the global depth from 2 to 3.

[Figure: the directory grows from entries 00–11 to 000–111; entry 000 points to Bucket A (32*, 16*) and entry 100 to its split image A2 (4*, 12*, 20*); Buckets B, C, D keep local depth 2, each pointed to by two directory entries.]
Extendible Hashing: Inserting Entries
• Find the appropriate bucket (as in search), split the bucket if full, double the directory if necessary, and insert the given entry.

• Example: insert 9*
  • 9 = binary 1001, so 9* maps to Bucket B (1*, 5*, 21*, 13*), which is FULL, hence split!

[Figure: directory 000–111; Bucket A: 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*; Bucket A2 (`split image' of A): 4*, 12*, 20*.]
Extendible Hashing: Inserting Entries (cont’d)
• Example: insert 9* (continued). The third bit distinguishes the entries of the split bucket:
  • Bucket B: 1* = 00001, 9* = 01001
  • Bucket B2, the `split image' of B: 5* = 00101, 13* = 01101, 21* = 10101

Almost there…

[Figure: directory 000–111; entry 001 points to Bucket B (1*, 9*) and entry 101 to Bucket B2 (5*, 21*, 13*); Buckets A, C, D, A2 are unchanged.]
Extendible Hashing: Inserting Entries (cont’d)
• Example: insert 9* (continued)

• There was no need to double the directory!
  • The local depth of Buckets B and B2 becomes 3, which does not exceed the global depth of 3.

• When do we NOT double the directory? When the split bucket’s new local depth is still less than or equal to the global depth.

[Figure: directory 000–111 (global depth 3); Bucket A: 32*, 16* (local depth 3); Bucket B: 1*, 9* (local depth 3); Bucket C: 10* (local depth 2); Bucket D: 15*, 7*, 19* (local depth 2); Bucket A2: 4*, 12*, 20* (local depth 3); Bucket B2: 5*, 21*, 13* (local depth 3).]
Comments on Extendible Hashing
• If directory fits in memory, equality search answered with one disk access; else
two.
• 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and
25,000 directory elements; chances are high that directory will fit in memory.
• Directory grows in spurts, and, if the distribution of hash values is skewed, directory can
grow large.

• Delete: If removal of data entry makes bucket empty, can be merged with `split
image’. If each directory element points to same bucket as its split image, we
can halve directory (this is rare in practice).
Linear Hashing
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
• LH handles the problem of long overflow chains without using a directory, and
handles duplicates.
• Idea: Use a family of hash functions h0, h1, h2, ...
  • h_i(key) = h(key) mod (2^i · N); N = initial # of buckets
  • h is some hash function (its range is not just 0 to N–1)
  • If N = 2^d0 for some d0, then h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
  • h_{i+1} doubles the range of h_i (similar to directory doubling)
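The family of hash functions can be sketched as follows (a minimal illustration; using the identity as the base hash h is an assumption made for readability, and the keys match the later example):

```python
N = 4           # initial number of buckets, a power of 2: N = 2^d0
d0 = 2

def h(key: int) -> int:
    """Some base hash function; the identity here, purely for illustration."""
    return key

def h_i(key: int, i: int) -> int:
    """The i-th function of the family: h_i(key) = h(key) mod (2^i * N)."""
    return h(key) % (2 ** i * N)

# h_i looks at the last d0 + i bits; h_{i+1} doubles the range of h_i
print(h_i(43, 0))  # 43 mod 4  -> 3  (last 2 bits of 101011)
print(h_i(43, 1))  # 43 mod 8  -> 3  (last 3 bits: 011)
print(h_i(44, 1))  # 44 mod 8  -> 4  (last 3 bits of 101100: 100)
```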
Linear Hashing: Bucket Split
• When the first overflow occurs (it can occur in any bucket), bucket 0, which is
pointed to by the Next pointer p, is split (rehashed) into two buckets:
  • the original bucket 0 and a new bucket m.

• A new empty page is also added in the overflown bucket to accommodate the
overflow.

• The search values originally mapped into bucket 0 (using function h0) are now
distributed between buckets 0 and m using a new hashing function h1.
Linear Hashing: Insertion
• Locate bucket to insert
• If bucket to insert into is full:
• Add overflow page and insert data entry.
• (Maybe) Split Next bucket and increment Next.
• A split occurs in case of overflow
• Since buckets are split round-robin, long overflow chains don’t develop!
• Round-robin: the Next pointer advances through the buckets of the current level; when a
round completes, the level increases and splitting uses one more bit.
Linear Hashing: Background (Insert 20)

• We have seen what it means to split a bucket:
  • Before: one bucket {4*, 12*, 32*, 16*} with local depth 2.
  • After: two buckets with depth 3, {32*, 16*} and {4*, 12*, 20*}.

• We have seen what it means to add an overflow page to a bucket:
  • Before: {4*, 12*, 32*, 16*}.
  • After: the same bucket with an overflow page holding 20* chained to it.
Triggering Splits

• A split performed whenever a bucket overflow occurs is an uncontrolled split.

• Let l denote the Linear Hashing scheme’s load factor, i.e., l = S ∕ b where S is
the total number of records and b is the number of buckets used.

• The load factor achieved by uncontrolled splits is usually between 50–70%,


depending on the page size and the search value distribution.

• In practice, higher storage utilization is achieved if a split is triggered not by


an overflow, but when the load factor l becomes greater than some upper
threshold, which is called controlled split.
Overview of Splitting as Rounds
• Splits occur in a round-robin fashion, i.e., as rounds.

• The buckets that existed at the beginning of this round form the range of h_Level. The Next pointer marks the bucket to be split; the buckets before Next have already been split in this round.

• `Split image' buckets are created (through splitting of other buckets) in this round.

• If h_Level(search key value) falls in the already-split range, we must use h_Level+1(search key value) to decide whether the entry is in a `split image' bucket.
Linear Hashing: Example
• Insert 43 = binary 101011: before a split, h0 uses the last 2 bits; after a split, h1 uses the last 3 bits.
  • 32 = 100000, 9 = 001001, 44 = 101100, 36 = 100100
• Note: the bucket that is split may NOT be the same as the one that overflowed!
Linear Hashing: Example
Insert 29 (00011101)
Linear Hashing: Example
Insert 22 (00010110)
Example: End of Round
• All buckets before the Next pointer have already been split in this round.
• We increase the level by one and move Next back to the beginning.
• In the next round, splitting uses one more bit.
  • 50 = binary 110010
Summary
• Hash-based indexes: best for equality searches, cannot support range
searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a
new data entry is to be added to it. (Duplicates may require overflow pages.)
• Directory to keep track of buckets, doubles periodically.
• Can get large with skewed data; additional I/O if this does not fit in main
memory.
• Linear Hashing avoids a directory by splitting buckets round-robin, and using
overflow pages.
