Data and File Structures: Hashing

Hashing provides O(1) access time by mapping keys to addresses with a hash function, so a record can be retrieved in the same small number of seeks regardless of file size. Collisions occur when different keys map to the same address. Progressive overflow resolves collisions by placing records in the next available address, wrapping around the address space. Buckets store multiple records per address to reduce the effect of collisions. Deletions use tombstones to mark deleted records so that later searches are not hindered and freed slots can be reused. Variations such as double hashing, chaining, and scatter tables further improve performance.


Data and File Structures

Chapter 11

Hashing

Motivation
• Sequential searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file.
• B-Trees improve on this greatly, providing O(log_k N) access, where k is a measure of the leaf size
– i.e., the number of records that can be stored in a leaf.
• However, what we would like to achieve is O(1) access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks.
• Hashing techniques can achieve such performance, provided that the file does not grow substantially over time.

What is Hashing?
• A hash function is a function h(K) that transforms a key K into an address.
• Hashing is like indexing in that it involves associating a key with a relative record address.
• However, hashing is different from indexing in two important ways:
– With hashing, there is no obvious connection between the key and the location.
– With hashing, two different keys may be transformed to the same address.
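A minimal sketch of such a hash function in Python. The fold-and-add scheme, the table size, and the sample key below are illustrative assumptions, not taken from the slides:

TABLE_SIZE = 1000  # number of available addresses (a hypothetical choice)

def h(key):
    # Transform a key K (a string) into an address by folding:
    # break the key into two-character chunks, turn each chunk into
    # a number from its character codes, add the chunks together,
    # and take the sum modulo the number of addresses.
    total = 0
    for i in range(0, len(key), 2):
        value = 0
        for ch in key[i:i + 2]:
            value = value * 256 + ord(ch)
        total += value
    return total % TABLE_SIZE

print(h("LOWELL"))  # some address in 0..999; a different key may land here too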
Example – 1/2

Example – 2/2

Collisions
• When two different keys produce the same address, there is a collision.
– The keys involved are called synonyms.
• Coming up with a hash function that avoids collisions is extremely difficult.
– It is best to simply find ways to deal with them.
• Possible solutions:
– Spread out the records
– Use extra memory
– Put more than one record at a single address
Distribution of Records among Addresses – 1/2
• Records can be distributed among addresses in different ways; there may be:
– (a) no synonyms (uniform distribution);
– (b) only synonyms (worst case);
– (c) a few synonyms (typical of random distributions).

Distribution of Records among Addresses – 2/2
• Purely uniform distributions are difficult to obtain.
• Random distributions can be derived easily, but they are not perfect, since they may generate a fair number of synonyms.
• We want better hashing methods.

Better than Random Distribution
• Here are some methods that are potentially better than random (two of them are sketched after this list):
– Examine keys for a pattern
• Ex.: f(year) = (year − 1970) mod (2012 − 1970 + 1)
– Fold parts of the key
• Folding means extracting digits from a key and adding the parts together, as in the previous example.
– Divide the key by a prime number and use the remainder
– Square the key and take the middle
• Ex.: key = 453 → 453² = 205209 → middle digits: 52
– Radix transformation
• Transform the number into another base, then take the remainder after dividing by the maximum address
• Ex.: key = 453 → base 11: 382 → 382 mod 99 → 85
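A minimal sketch in Python of two of these methods, mid-square and radix transformation, reproducing the numeric examples above (the function names and defaults are illustrative):

def mid_square(key, digits=2):
    # Square the key and take its middle `digits` digits as the address.
    squared = str(key * key)
    mid = len(squared) // 2
    return int(squared[mid - digits // 2 : mid + (digits + 1) // 2])

def radix_transform(key, base=11, max_address=99):
    # Rewrite the key in the new base, read those digits as a decimal
    # number, and reduce it modulo the maximum address.
    # (For this sketch we assume every base-`base` digit is below 10.)
    digits = ""
    n = key
    while n > 0:
        digits = str(n % base) + digits
        n //= base
    return int(digits) % max_address

print(mid_square(453))       # 52, as in the example above
print(radix_transform(453))  # 85, as in the example above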
Collision Resolution: Progressive Overflow
• How do we deal with records that cannot fit into their home address? A simple approach: progressive overflow, also known as linear probing.
• If a key, k1, hashes to the same address, a1, as another key, k2, then look for the first available address, a2, and place k1 in a2.
– If the end of the address space is reached, then wrap around to the beginning.
• When searching for a key that is not in the file: if the address space is not full, then either an empty address will be reached or the search will come back to where it began; in both cases the search can stop. (A sketch follows.)
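A minimal sketch of progressive overflow in Python, assuming an address space of N single-record slots (N and the helper names are illustrative choices, not from the slides):

N = 11                 # illustrative size of the address space
table = [None] * N     # one record per address

def insert(key, h):
    # Try the home address first; on a collision, probe the following
    # addresses in order, wrapping around the end of the space.
    for step in range(N):
        a = (h(key) + step) % N
        if table[a] is None:
            table[a] = key
            return a
    raise RuntimeError("address space is full")

def search(key, h):
    # Probe from the home address; an empty address means the key is
    # not in the file, and after N probes we are back where we began.
    for step in range(N):
        a = (h(key) + step) % N
        if table[a] is None:
            return None
        if table[a] == key:
            return a
    return None

insert(20, lambda k: k % N)   # home address 9
insert(31, lambda k: k % N)   # also hashes to 9; placed at 10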
Example

Search Length when using Progressive Overflow
• Progressive overflow causes extra probes and thus extra disk accesses.
• If there are many collisions, then many records will end up far from “home”.
• Definition: the search length is the number of accesses required to retrieve a record from secondary memory.
– The average search length is the average number of times you can expect to have to access the disk to retrieve a record:
• Average search length = (total search length) / (total number of records)
• Ex.: if five records require 1, 1, 2, 4, and 2 accesses, the average search length is 10/5 = 2.

Example

Storing More than One Record per Address: Buckets
• Definition: a bucket is a block of records sharing the same address that is retrieved in one disk access.
• When a record is to be stored or retrieved, its home bucket address is determined by hashing.
• When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less often than when each address can hold only one record. (A sketch follows.)

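A minimal sketch of bucket addressing in Python, assuming N bucket addresses holding b records each (all names and sizes below are illustrative):

N = 5                             # number of bucket addresses
b = 3                             # bucket size: records per disk access
buckets = [[] for _ in range(N)]

def store(key, h):
    # Hashing determines the home bucket; the whole bucket is read or
    # written in one disk access.
    bucket = buckets[h(key) % N]
    if len(bucket) < b:
        bucket.append(key)
        return True
    return False                  # bucket full: overflow must be handled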
Example (bucket overflow)
Effect of Buckets on Performance
• To compute how densely packed a file is, we need to consider:
– 1) the number of addresses, N (the number of buckets),
– 2) the number of records we can put at each address, b (the bucket size), and
– 3) the number of records, r.
– Then, packing density = r/(bN). (A worked instance follows this slide.)
• Though the packing density does not change when halving the number of addresses and doubling the size of the buckets, the expected number of overflows decreases dramatically.
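As a worked instance of the formula, take the numbers from the following example slide (N = 1000 addresses, r = 750 records): with bucket size b = 1, the packing density is 750/(1 × 1000) = 75%; with N = 500 and b = 2 it is 750/(2 × 500) = 75% as well, yet the larger buckets are expected to produce far fewer overflows.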
Example (N=1000, r=750)

Making Deletions
• Deleting a record from a hashed file is more complicated than adding one, for two reasons:
– The slot freed by the deletion must not be allowed to hinder later searches.
– It should be possible to reuse the freed slot for later additions.
• In order to deal with deletions we use tombstones,
– i.e., markers indicating that a record once lived there but no longer does.
– Tombstones solve both of the problems caused by deletion.
• Insertion of records is slightly different when using tombstones. (A sketch follows.)
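A minimal sketch of deletion with tombstones, adapting the progressive-overflow sketch above (the tombstone marker and names are illustrative):

N = 11
TOMBSTONE = "#"        # marks a slot where a record once lived
table = [None] * N

def insert(key, h):
    # A tombstoned slot counts as free, so freed slots are reused.
    for step in range(N):
        a = (h(key) + step) % N
        if table[a] is None or table[a] == TOMBSTONE:
            table[a] = key
            return a
    raise RuntimeError("address space is full")

def search(key, h):
    # A tombstone must not stop the probe: the sought record may sit
    # beyond it, placed there before the deletion happened.
    for step in range(N):
        a = (h(key) + step) % N
        if table[a] is None:
            return None
        if table[a] == key:
            return a
    return None

def delete(key, h):
    a = search(key, h)
    if a is not None:
        table[a] = TOMBSTONE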
Example (Progressive Overflow)

Other Collision Resolution Techniques
• There are a few variations on random hashing that may improve performance:
– Double Hashing: When an overflow occurs, use a second hash function to map the record to its overflow location. (A sketch follows this list.)
– Chained Progressive Overflow: Like progressive overflow, except that synonyms are linked together with pointers.
– Chaining with a Separate Overflow Area: Like chained progressive overflow, except that overflow records are kept in a separate area rather than occupying home addresses.
– Scatter Tables: The hash file contains no records, only pointers to records; i.e., it is an index.
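A minimal sketch of double hashing in Python (the particular pair of hash functions is an illustrative choice, not from the slides):

N = 11   # a prime address-space size lets the probes reach every slot

def h1(key):
    return key % N                # home address

def h2(key):
    return 1 + key % (N - 1)      # overflow step size, never zero

def probe_sequence(key):
    # Addresses tried for `key`: the home address first, then jumps
    # of h2(key), wrapping around the address space.
    return [(h1(key) + i * h2(key)) % N for i in range(N)]

print(probe_sequence(20))  # starts at 9, then steps of 1 around the table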
Example (Double Hashing)

Example (Chained Progressive Overflow)

Example (Chaining with a Separate Overflow Area)

Example (Scatter Tables)

