Unit 6.2 Indexing and Hashing
-Ashu Mehta
Database systems
Indexing
• Indexing is a data structure technique for efficiently
retrieving records from database files, based on the
attributes on which the index has been built.
• An index in a database system works like the index
at the back of a book.
• Two basic kinds of indices:
– Ordered indices: search keys are stored in sorted order
– Hash indices: search keys are distributed
uniformly across a range of buckets using a “hash
function”.
Purpose of Indexing
• An index is a data structure that is added to a file to
provide faster access to the data.
• It reduces the number of blocks that the
DBMS has to check.
Index Evaluation Metrics
• Access types: the types of access that are supported efficiently, e.g.,
– records with a specified value in the attribute
– or records with an attribute value falling in a specified range of values.
• Access time: Time it takes to find a particular data item, or set
of items, using the technique
• Insertion time: Time it takes to insert a new data item. This
value includes the time it takes to find the correct place to
insert the new data item, as well as the time it takes to update
the index structure
• Deletion time: Time it takes to delete a data item. This value
includes the time it takes to find the item to be deleted, as well
as the time it takes to update the index structure
• Space overhead: The additional space occupied by an index
structure.
Basic Concepts
• Search key: an attribute or set of attributes
used to look up records in a file.
• An index file consists of records (called index
entries) of the form (search-key, pointer).
Ordered Indices
• Each index structure is associated with a
particular search key.
• An ordered index stores the values of the
search keys in sorted order and associates
with each search key the records that contain
it. E.g., index of a book, library catalog.
• A file may have several indices, on different
search keys.
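A minimal Python sketch of an ordered index, assuming an in-memory list stands in for the file and list positions stand in for record pointers. The names records, build_ordered_index and lookup, and the sample data, are illustrative assumptions.

```python
import bisect

# Hypothetical file: (account_number, branch_name, balance) records.
records = [
    ("A-217", "Brighton", 750),
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-102", "Perryridge", 400),
]

def build_ordered_index(records, key_pos):
    """Return index entries (search_key, record_position), sorted by search key."""
    return sorted((rec[key_pos], pos) for pos, rec in enumerate(records))

def lookup(index, key):
    """Binary-search the sorted entries; return positions of all matching records."""
    # (key,) sorts before every (key, pos), so this finds the first entry for key.
    i = bisect.bisect_left(index, (key,))
    matches = []
    while i < len(index) and index[i][0] == key:
        matches.append(index[i][1])
        i += 1
    return matches

# Two indices on the same file, each on a different search key.
by_branch = build_ordered_index(records, 1)
by_account = build_ordered_index(records, 0)
```

Here lookup(by_branch, "Downtown") returns the positions of both Downtown records, while lookup(by_account, "A-215") returns a single position.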
Ordered Indices
• Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the file.
– Also called clustering index
– The search key of a primary index is usually but not necessarily
the primary key.
• Secondary index: an index whose search key specifies an
order different from the sequential order of the file.
– Also called non-clustering index.
• Index-sequential file: ordered sequential file with a primary
index.
Dense Index Files
Sparse Index Files
Multilevel Index
Secondary Indices
• Secondary indices must be dense, with an
index entry for every search-key value, and a
pointer to every record in the file.
• A primary index may be sparse, storing only
some of the search-key values, since it is
always possible to find records with
intermediate search-key values by a
sequential access to a part of the file.
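A sparse primary-index lookup can be sketched as follows. This is an illustration under assumptions: the file is an in-memory list sorted on the search key, there is one index entry per block of 3 records, and the scan starts at the last entry whose key is strictly smaller than the target so that duplicates spanning a block boundary are not missed.

```python
import bisect

# Hypothetical sequential file, sorted on the search key (branch name).
sorted_file = [
    ("Brighton", 750),
    ("Downtown", 500),
    ("Downtown", 600),
    ("Mianus", 700),
    ("Perryridge", 400),
    ("Perryridge", 900),
    ("Redwood", 700),
    ("Round Hill", 350),
]

BLOCK = 3  # records per block (an assumption for this sketch)

# Sparse index: one (search_key, position) entry for the first record of each block.
sparse_index = [(sorted_file[i][0], i) for i in range(0, len(sorted_file), BLOCK)]
sparse_keys = [k for k, _ in sparse_index]

def sparse_lookup(key):
    """Jump to a nearby block through the index, then scan the file sequentially."""
    i = bisect.bisect_left(sparse_keys, key) - 1
    pos = sparse_index[max(i, 0)][1]
    matches = []
    while pos < len(sorted_file) and sorted_file[pos][0] <= key:
        if sorted_file[pos][0] == key:
            matches.append(pos)
        pos += 1
    return matches
```

Only three index entries cover eight records; the price is a short sequential scan after the index probe.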
Secondary Indices
• If the search key of a secondary index is not a candidate key,
it is not enough to point to just the first record with each
search-key value. The remaining records with the same
search-key value could be anywhere in the file, since the
records are ordered by the search key of the primary index,
rather than by the search key of the secondary index.
• Therefore, a secondary index must contain pointers to all the
records.
• An extra level of indirection is used to implement secondary
indices on search keys that are not candidate keys.
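The extra level of indirection can be sketched as a bucket of record pointers per search-key value. This is an illustration under assumptions: the file is an in-memory list ordered by branch name, list positions act as pointers, and account_file and build_secondary_index are hypothetical names.

```python
from collections import defaultdict

# Hypothetical file, ordered by branch name (the primary-index search key).
account_file = [
    ("Brighton", "A-217", 750),
    ("Downtown", "A-101", 500),
    ("Downtown", "A-110", 600),
    ("Mianus", "A-215", 700),
    ("Perryridge", "A-102", 400),
    ("Perryridge", "A-201", 900),
    ("Redwood", "A-222", 700),
]

def build_secondary_index(file, key_pos):
    """Map each search-key value to a bucket holding pointers to ALL matching records."""
    buckets = defaultdict(list)
    for pos, rec in enumerate(file):
        buckets[rec[key_pos]].append(pos)
    # Keep index entries in search-key order, as in any ordered index.
    return dict(sorted(buckets.items()))

# Secondary index on balance, which is not a candidate key here.
balance_index = build_secondary_index(account_file, 2)
```

The two records with balance 700 sit far apart in the file, so the entry for 700 must carry a pointer to each of them.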
EXAMPLE
Hashing
• One disadvantage of sequential file organization
is that we must access an index structure to
locate data, or must use binary search, and that
results in more I/O operations.
• File organizations based on the technique of
hashing allow us to avoid accessing an index
structure.
• Hashing also provides a way of constructing
indices.
Example
• Hash file organization of account file, using
branch_name as key
• There are 10 buckets.
• The binary representation of the ith letter of the
alphabet is assumed to be the integer i.
• The hash function returns the sum of the binary
representations of the characters modulo 10
– E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
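The hash function of this example can be written as a short Python sketch. It assumes that letter case is ignored and that non-letter characters such as the space in "Round Hill" contribute nothing; both assumptions are needed to reproduce the values on the slide.

```python
def h(branch_name: str, n_buckets: int = 10) -> int:
    """Sum the alphabet positions of the letters (a=1, ..., z=26), modulo n_buckets."""
    total = 0
    for ch in branch_name.lower():
        if "a" <= ch <= "z":          # skip spaces and other non-letters
            total += ord(ch) - ord("a") + 1
    return total % n_buckets

for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

For Perryridge the letter values sum to 125, for Round Hill to 113 and for Brighton to 93, so the buckets are 5, 3 and 3.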
Hashing
• A bucket is a unit of storage containing one or more records
(a bucket is typically a disk block).
• In a hash file organization we obtain the bucket of a record
directly from its search-key value using a hash function.
• A hash function h is a function from the set of all search-key
values K to the set of all bucket addresses B.
• The hash function is used to locate records for access, insertion,
and deletion.
• Records with different search-key values may be mapped to
the same bucket; thus the entire bucket has to be searched
sequentially to locate a record.
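A minimal sketch of a hash file organization in Python, assuming in-memory lists as buckets and a toy hash function that sums character codes. The function names and N_BUCKETS = 5 are assumptions for illustration.

```python
N_BUCKETS = 5

def h(key: str) -> int:
    # Toy hash function (an assumption): sum of character codes modulo the bucket count.
    return sum(ord(c) for c in key) % N_BUCKETS

buckets = [[] for _ in range(N_BUCKETS)]

def insert(key, record):
    """The hash function picks the bucket; the record is appended there."""
    buckets[h(key)].append((key, record))

def lookup(key):
    """Search the whole bucket sequentially: other keys may share it."""
    return [rec for k, rec in buckets[h(key)] if k == key]

def delete(key):
    """Remove every record with this key; return how many were removed."""
    bucket = buckets[h(key)]
    before = len(bucket)
    bucket[:] = [(k, rec) for k, rec in bucket if k != key]
    return before - len(bucket)
```

All three operations start by computing h(key), so no separate index structure is consulted.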
Static Hashing
• In static hashing, when a search-key value is
provided the hash function always computes
the same address.
• For example, if a mod-4 hash function is used
then it generates only 4 values (0, 1, 2, 3). The
output address is always the same for a given
key, and the number of buckets provided
remains the same at all times.
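A small Python sketch of the mod-4 case, assuming integer search keys. The constant N_BUCKETS and the range of sample keys are assumptions.

```python
N_BUCKETS = 4  # fixed for the lifetime of the file

def h(key: int) -> int:
    """Static hash function: the same key always maps to the same address."""
    return key % N_BUCKETS

# Whatever keys arrive, only the addresses 0..3 are ever produced.
addresses = {h(k) for k in range(1000)}
print(sorted(addresses))
```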
Hash Function
• The worst hash function maps all search-key values to the same
bucket; this makes access time proportional to the number
of search-key values in the file.
• An ideal hash function has the following properties:
– The distribution is uniform. That is, the hash function assigns
each bucket the same number of search-key values from the
set of all possible search-key values.
– The distribution is random. That is, in the average case, each
bucket will have nearly the same number of values assigned
to it, regardless of the actual distribution of search-key
values.
Handling of Bucket Overflows
• If the bucket does not have enough space, a bucket
overflow is said to occur.
• Bucket overflow can occur for several reasons:
– Insufficient buckets: The number of buckets, denoted by nB ,
must be chosen such that nB > nr /fr, where nr denotes the total
number of records that will be stored and fr denotes the
number of records that will fit in a bucket.
– Skew: Some buckets are assigned more records than are
others, so a bucket may overflow even when other buckets
still have space. This situation is called bucket skew. This can
occur due to two reasons:
• multiple records have same search-key value
• chosen hash function produces non-uniform distribution of key
values
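Both causes of overflow can be illustrated with a short Python sketch. The helper names min_buckets and bucket_loads and the sample figures are assumptions; nB, nr and fr appear as n_b, n_r and f_r.

```python
from collections import Counter

def min_buckets(n_r: int, f_r: int) -> int:
    """Smallest integer n_b satisfying n_b > n_r / f_r."""
    return n_r // f_r + 1

def bucket_loads(keys, hash_fn, n_b):
    """Count how many keys the hash function sends to each bucket."""
    counts = Counter(hash_fn(k) % n_b for k in keys)
    return [counts[b] for b in range(n_b)]

# Enough buckets in total: 100 records, 20 per bucket, 8 buckets (minimum is 6).
keys = range(0, 400, 4)                      # 100 keys, all multiples of 4
loads = bucket_loads(keys, lambda k: k, 8)   # identity hash, then modulo 8
overflowing = [b for b, n in enumerate(loads) if n > 20]
```

Total capacity is 160 records for 100 keys, yet every key lands in bucket 0 or bucket 4, so both overflow while six buckets stay empty. This is skew caused by a hash function that interacts badly with the key distribution.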
Handling of Bucket Overflows
• Although the probability of bucket overflow can be reduced, it
cannot be eliminated; it is handled by using overflow buckets.
• Overflow chaining – the overflow buckets of a given bucket
are chained together in a linked list.
• The above scheme is called closed hashing.
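Overflow chaining can be sketched in Python as a linked list of buckets. This is an illustration under assumptions: integer keys, key modulo the number of buckets as the hash function, and a capacity of 2 records per bucket. The class names are hypothetical.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []       # (key, value) pairs stored in this bucket
        self.overflow = None    # next overflow bucket in the chain, if any

class ChainedHashFile:
    def __init__(self, n_buckets=4, capacity=2):
        self.capacity = capacity
        self.buckets = [Bucket(capacity) for _ in range(n_buckets)]

    def _home(self, key):
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, value):
        b = self._home(key)
        # Walk the chain to the first bucket with free space, extending it if needed.
        while len(b.records) >= b.capacity:
            if b.overflow is None:
                b.overflow = Bucket(self.capacity)
            b = b.overflow
        b.records.append((key, value))

    def lookup(self, key):
        b, found = self._home(key), []
        while b is not None:    # the home bucket and every overflow bucket are searched
            found.extend(v for k, v in b.records if k == key)
            b = b.overflow
        return found

    def chain_length(self, key):
        n, b = 0, self._home(key)
        while b is not None:
            n, b = n + 1, b.overflow
        return n
```

Inserting the keys 1, 5, 9, 13 and 17 into a 4-bucket file sends all of them to bucket 1, which then needs two overflow buckets.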
Handling of Bucket Overflows
• Linear probing: when the hash function generates an address at
which data is already stored, the record is placed in the next
free bucket. This mechanism is called open hashing.
• Open hashing does not use overflow buckets and is not suitable
for database applications, because deletion is troublesome.
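A minimal Python sketch of linear probing, assuming integer keys, one record per slot and a fixed table of 8 slots. The class name and methods are hypothetical.

```python
class LinearProbingTable:
    def __init__(self, n_slots=8):
        self.slots = [None] * n_slots   # each slot holds one (key, value) or None

    def insert(self, key, value):
        """Place the record in its home slot, or in the next free slot after it."""
        n = len(self.slots)
        start = key % n
        for step in range(n):
            i = (start + step) % n      # wrap around at the end of the table
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
        raise RuntimeError("table is full")

    def lookup(self, key):
        """Probe from the home slot until the key or an empty slot is found."""
        n = len(self.slots)
        start = key % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```

A lookup stops at the first empty slot, so simply emptying a slot on deletion would break the probe sequence for records stored after it.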
Hash Indices
• Hashing can be used not only for file organization, but also
for index-structure creation.
• A hash index organizes the search keys, with their
associated record pointers, into a hash file structure.
• A hash index is constructed as follows:
– Apply a hash function to each search key to identify a bucket, and
store the key and its associated pointers in that bucket
• Strictly speaking, hash indices are always secondary indices
– if the file itself is organized using hashing, a separate primary
hash index on it using the same search-key is unnecessary.
– However, we use the term hash index to refer to both secondary
index structures and hash organized files.
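A hash index differs from a hash-organized file in that its buckets hold (search key, pointer) pairs rather than the records themselves. A minimal Python sketch, assuming list positions as pointers and a toy character-sum hash function; all names and data are illustrative.

```python
# Hypothetical account file; the records themselves are not stored in hash order.
account_file = [
    ("A-217", "Brighton", 750),
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
]

N_BUCKETS = 4

def h(key: str) -> int:
    # Toy hash function (an assumption): sum of character codes modulo the bucket count.
    return sum(ord(c) for c in key) % N_BUCKETS

def build_hash_index(file, key_pos):
    """Each bucket stores (search_key, pointer) pairs, not the records."""
    index = [[] for _ in range(N_BUCKETS)]
    for pos, rec in enumerate(file):
        index[h(rec[key_pos])].append((rec[key_pos], pos))
    return index

def index_lookup(index, file, key):
    """Hash to a bucket, scan it for the key, then follow the pointers into the file."""
    return [file[pos] for k, pos in index[h(key)] if k == key]

account_index = build_hash_index(account_file, 0)
branch_index = build_hash_index(account_file, 1)
```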
Example of Hash Index
Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a fixed
set B of bucket addresses. Databases grow or shrink with
time.
– If the initial number of buckets is too small, and the file grows, and the hash
function is chosen based on the current file size, performance will
degrade due to too many overflows.
– If space is allocated for anticipated growth, a significant amount of
space will be wasted initially (and buckets will be underfull).
– If database shrinks, again space will be wasted.
• One solution: periodic re-organization of the file with a new
hash function
– Expensive, disrupts normal operations
• Better solution: allow the number of buckets to be modified
dynamically.
Dynamic Hashing
• Dynamic hashing provides a mechanism in
which data buckets are added and removed
dynamically and on-demand.
• One widely used form of dynamic hashing is
extendable hashing.
• Hash function, in dynamic hashing, is made to
produce large number of values and only a
few are used initially.
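The idea of using only a few bits of a large hash value at first can be sketched with a small extendable-hashing table in Python. This is an illustration under stated assumptions: keys are non-negative integers and serve as their own hash values, the low-order bits index the directory (textbook presentations often use a high-order prefix instead), buckets hold 2 records, and only growth is shown (no bucket merging on deletion). All names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # local depth: number of hash bits this bucket distinguishes
        self.items = {}      # key -> value

class ExtendableHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 0            # number of hash bits currently in use
        self.directory = [Bucket(0)]     # 2**global_depth bucket pointers

    def _slot(self, key):
        # Use only the low global_depth bits of the (assumed) hash value.
        return key & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        while True:
            b = self.directory[self._slot(key)]
            if key in b.items or len(b.items) < self.capacity:
                b.items[key] = value
                return
            self._split(b)               # bucket full: split it and retry

    def _split(self, b):
        if b.depth == self.global_depth:
            # Double the directory; slot i + 2**d initially shares slot i's bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        b.depth += 1
        new = Bucket(b.depth)
        bit = 1 << (b.depth - 1)         # the newly significant hash bit
        for i, slot in enumerate(self.directory):
            if slot is b and i & bit:
                self.directory[i] = new
        for k in list(b.items):          # move records whose new bit is 1
            if k & bit:
                new.items[k] = b.items.pop(k)
```

Only the overflowing bucket is split; the directory doubles only when that bucket already uses every bit currently in play, so growth is incremental rather than a full reorganization.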
Hashing Practice Problems
Problem 1