
Database Indexing and Hashing Techniques

The document discusses indexing and hashing in database systems, explaining how indexing improves data retrieval efficiency through ordered and hash indices. It outlines the purpose of indexing, evaluation metrics, and basic concepts such as search keys and index types, including primary, secondary, dense, and sparse indices. Additionally, it covers hashing techniques, including static and dynamic hashing, and addresses issues like bucket overflows and the implementation of hash indices.

Uploaded by

sauravyadv31
Copyright
© © All Rights Reserved

Indexing and Hashing

Database systems
Indexing
• Indexing is a data-structure technique for efficiently
retrieving records from database files based on the
attributes on which the index has been built.
• Indexing in database systems is similar to the index
found at the back of a book.
• Two basic kinds of indices:
– Ordered indices: search keys are stored in sorted order
– Hash indices: search keys are distributed uniformly
across a range of buckets using a “hash function”.
Purpose of Indexing
• An index is a data structure that is added to a file
to provide faster access to the data.
• It reduces the number of blocks that the
DBMS has to check.
Index Evaluation Metrics
• Access types: the types of access that are supported efficiently, e.g.,
– records with a specified value in the attribute
– records with an attribute value falling in a specified range of
values.
• Access time: Time it takes to find a particular data item, or set
of items, using the technique
• Insertion time: Time it takes to insert a new data item. This
value includes the time it takes to find the correct place to
insert the new data item, as well as the time it takes to update
the index structure
• Deletion time: Time it takes to delete a data item. This value
includes the time it takes to find the item to be deleted, as well
as the time it takes to update the index structure
• Space overhead: The additional space occupied by an index
structure.
Basic Concepts
• Search key: an attribute or set of attributes
used to look up records in a file.
• An index file consists of records (called index
entries) of the form (search-key, pointer).
Ordered Indices
• Each index structure is associated with a
particular search key.
• An ordered index stores the values of the
search keys in sorted order and associates
with each search key the records that contain
it. E.g., index of a book, library catalog.
• A file may have several indices, on different
search keys.
Ordered Indices
• Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the file.
– Also called clustering index
– The search key of a primary index is usually but not necessarily
the primary key.
• Secondary index: an index whose search key specifies an
order different from the sequential order of the file.
– Also called non-clustering index.
• Index-sequential file: ordered sequential file with a
primary index.
Dense Index Files
• Dense index: has an index entry for every
search-key value (and hence every record) in
the database file. A dense index can be built
on ordered as well as unordered fields of the
database file.
Sparse Index Files
• Sparse index: has index entries for only some of the
search-key values/records in the database file.
Multilevel Index
Secondary Indices
• Secondary indices must be dense, with an
index entry for every search-key value, and a
pointer to every record in the file.
• A primary index may be sparse, storing only
some of the search-key values, since it is
always possible to find records with
intermediate search-key values by a
sequential access to a part of the file.
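The sparse lookup described above can be sketched as follows. This is a minimal illustration, not a real storage layout: the `blocks` list, its integer keys, and the function name `sparse_lookup` are assumptions made for the example. The index holds one entry per block (the first key in the block), and a lookup finds the last index entry not greater than the search key, then scans that block sequentially.

```python
from bisect import bisect_right

# Hypothetical file: blocks of (key, payload) records, sorted by search key.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]   # [10, 40, 70]

def sparse_lookup(key):
    """Return the payload stored under key, or None if the key is absent."""
    # Last index entry whose key is <= the search key.
    pos = bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None            # key is smaller than every indexed key
    # Sequential scan of the one block that could hold the key.
    for k, payload in blocks[pos]:
        if k == key:
            return payload
    return None

print(sparse_lookup(50))   # E
print(sparse_lookup(55))   # None
```

A dense index would instead hold all nine keys, trading extra space for a lookup that needs no block scan.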
Secondary Indices
• If the search key of a secondary index is not a candidate
key, it is not enough to point to just the first record with
each search-key value. The remaining records with the
same search-key value could be anywhere in the file, since
the records are ordered by the search key of the primary
index, rather than by the search key of the secondary index.
• Therefore, a secondary index must contain pointers to all
the records.
• An extra level of indirection is used to implement
secondary indices on search keys that are not candidate
keys.
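A small sketch of the extra level of indirection, under assumed sample data: the `records` list, the branch names, and `find_by_branch` are hypothetical. The file is ordered by account number, so records sharing a branch name are scattered, and each index entry points to a bucket holding pointers to all matching records. A real ordered index would also keep its keys sorted; a plain dictionary is used here only to show the indirection.

```python
# Hypothetical account file, stored in order of its primary search key
# (account number). List positions stand in for record pointers.
records = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-110", "Downtown", 600),
    ("A-201", "Perryridge", 900),
    ("A-215", "Mianus", 700),
]

# Secondary index on branch name: each search-key value maps to a bucket
# of pointers, one pointer per record with that value.
secondary = {}
for ptr, (_, branch, _) in enumerate(records):
    secondary.setdefault(branch, []).append(ptr)

def find_by_branch(branch):
    """Follow the index entry to its pointer bucket, then to the records."""
    return [records[ptr] for ptr in secondary.get(branch, [])]

print(secondary["Perryridge"])        # [1, 3]
print(find_by_branch("Downtown"))
```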
EXAMPLE
Hashing
• One disadvantage of sequential file
organization is that we must access an index
structure to locate data, or must use binary
search, and that results in more I/O operations.
• File organizations based on the technique of
hashing allow us to avoid accessing an index
structure.
• Hashing also provides a way of constructing
indices.
Example
• Hash file organization of account file, using
branch_name as key
• There are 10 buckets.
• The binary representation of the ith character is
assumed to be the integer i (that is, the ith letter of
the alphabet is treated as the integer i).
• The hash function returns the sum of the binary
representations of the characters modulo 10
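The hash function in this example can be sketched as below. The branch names are sample values chosen for illustration, and two details are assumptions not stated on the slide: letters are treated case-insensitively, and non-letter characters such as spaces are skipped.

```python
NUM_BUCKETS = 10

def h(branch_name):
    """Sum of alphabet positions (a=1, ..., z=26) modulo the bucket count."""
    total = 0
    for ch in branch_name.lower():
        if "a" <= ch <= "z":      # assumption: spaces and punctuation are ignored
            total += ord(ch) - ord("a") + 1
    return total % NUM_BUCKETS

for name in ["Perryridge", "Round Hill", "Brighton"]:
    print(name, "->", h(name))
# Perryridge -> 5
# Round Hill -> 3
# Brighton -> 3
```

Two different branch names landing in bucket 3 shows why a bucket must be searched sequentially after hashing.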
Hashing
• A bucket is a unit of storage containing one or more records
(a bucket is typically a disk block).
• In a hash file organization we obtain the bucket of a record
directly from its search-key value using a hash function.
• Hash function h is a function from the set of all search-key
values K to the set of all bucket addresses B.
• Hash function is used to locate records for access, insertion
as well as deletion.
• Records with different search-key values may be mapped to
the same bucket; thus the entire bucket has to be searched
sequentially to locate a record.
Static Hashing
• In static hashing, the hash function always computes
the same address for a given search-key value.
• For example, a mod-4 hash function generates only
4 values (0, 1, 2 and 3). The output address is always
the same for a given key, and the number of buckets
remains the same at all times.
Hash Function
• The worst hash function maps all search-key values to the
same bucket; this makes access time proportional to the
number of search-key values in the file.
• An ideal hash function has the following properties:
– The distribution is uniform. That is, the hash function
assigns each bucket the same number of search-key values
from the set of all possible search-key values.
– The distribution is random. That is, in the average case,
each bucket will have nearly the same number of values
assigned to it, regardless of the actual distribution of
search-key values.
Handling of Bucket Overflows
• If the bucket does not have enough space, a bucket
overflow is said to occur.
• Bucket overflow can occur for several reasons:
– Insufficient buckets: The number of buckets, denoted by
nB , must be chosen such that nB > nr /fr, where nr denotes the
total number of records that will be stored and fr denotes
the number of records that will fit in a bucket.
– Skew: Some buckets are assigned more records than are
others, so a bucket may overflow even when other buckets
still have space. This situation is called bucket skew. This
can occur due to two reasons:
• multiple records have same search-key value
• chosen hash function produces non-uniform distribution of key
values
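As a worked instance of the condition nB > nr / fr, the helper below returns the smallest bucket count that satisfies it. The function name `min_buckets` and the sample figures are assumptions for illustration only.

```python
def min_buckets(n_records, records_per_bucket):
    """Smallest integer n_B with n_B > n_r / f_r (strict inequality)."""
    return n_records // records_per_bucket + 1

# 10,000 records at 20 records per bucket: n_r / f_r = 500, so n_B >= 501.
print(min_buckets(10_000, 20))   # 501
```

Meeting this bound only rules out the "insufficient buckets" case; skew can still overflow individual buckets.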
Handling of Bucket Overflows
• Although the probability of bucket overflow can be reduced, it
cannot be eliminated; it is handled by using overflow buckets.
• Overflow chaining – the overflow buckets of a given bucket
are chained together in a linked list.
• The above scheme is called closed hashing.
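Overflow chaining can be sketched as follows. The bucket capacity, the bucket count, the `key % NUM_BUCKETS` hash, and the class and function names are all assumptions chosen to keep the example small; a real system would use disk blocks rather than Python objects.

```python
BUCKET_CAPACITY = 2   # assumed records per bucket (f_r)
NUM_BUCKETS = 4       # assumed number of primary buckets (n_B)

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None      # link to the next overflow bucket, if any

def insert(table, key):
    bucket = table[key % NUM_BUCKETS]
    # Walk the chain until a bucket with free space is found,
    # allocating a new overflow bucket at the end if needed.
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append(key)

def lookup(table, key):
    # The primary bucket and every overflow bucket in its chain are searched.
    bucket = table[key % NUM_BUCKETS]
    while bucket is not None:
        if key in bucket.records:
            return True
        bucket = bucket.overflow
    return False

table = [Bucket() for _ in range(NUM_BUCKETS)]
for k in [4, 8, 12, 16, 5]:
    insert(table, k)

print(table[0].records, table[0].overflow.records)   # [4, 8] [12, 16]
print(lookup(table, 16), lookup(table, 20))          # True False
```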
Handling of Bucket Overflows
• Linear probing: when the hash function generates an address at
which data is already stored, the record is placed in the next
free bucket. This mechanism is called open hashing.
• Open hashing does not use overflow buckets and is not suitable
for database applications.
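A minimal sketch of linear probing, assuming one key per slot and a table that wraps around at the end. The function name `insert_linear_probing`, the table size of 5, and the sample keys are illustrative choices.

```python
def insert_linear_probing(table, key, h):
    """Place key in the first free slot at or after h(key), wrapping around."""
    n = len(table)
    start = h(key)
    for step in range(n):
        slot = (start + step) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 5
for k in [7, 12, 4, 9]:
    insert_linear_probing(table, k, lambda key: key % 5)

print(table)   # [9, None, 7, 12, 4]
```

Key 12 collides with 7 at slot 2 and moves to slot 3; key 9 collides with 4 at slot 4 and wraps around to slot 0.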
Hash Indices
• Hashing can be used not only for file organization, but
also for index-structure creation.
• A hash index organizes the search keys, with their
associated record pointers, into a hash file structure.
• A hash index is constructed as follows:
– Apply the hash function to a search key to identify a bucket, and
store the key and its associated pointers in that bucket
• Strictly speaking, hash indices are always secondary
indices
– if the file itself is organized using hashing, a separate primary
hash index on it using the same search-key is unnecessary.
– However, we use the term hash index to refer to both
secondary index structures and hash organized files.
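The construction of a hash index can be sketched as below. Everything specific here is an assumption for illustration: the sample `records`, the choice of 7 buckets, and a hash that sums the digits of the account number. Unlike a hash file organization, the buckets hold (search key, pointer) entries rather than the records themselves.

```python
NUM_BUCKETS = 7   # assumed bucket count

# Hypothetical records; list positions stand in for record pointers.
records = [
    ("A-217", "Brighton", 750),
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
]

def h(account_number):
    # Assumed hash: sum of the digits of the account number, modulo 7.
    return sum(int(c) for c in account_number if c.isdigit()) % NUM_BUCKETS

# Build the index: hash each search key and store (key, pointer) in its bucket.
index = [[] for _ in range(NUM_BUCKETS)]
for ptr, (acct, _, _) in enumerate(records):
    index[h(acct)].append((acct, ptr))

def lookup(acct):
    """Hash the key, scan its bucket, and follow matching pointers."""
    return [records[ptr] for key, ptr in index[h(acct)] if key == acct]

print(index[2])          # [('A-101', 1), ('A-110', 2)]
print(lookup("A-110"))   # [('A-110', 'Downtown', 600)]
```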
Example of Hash Index
Deficiencies of Static Hashing
• In static hashing, function h maps search-key values to a
fixed set B of bucket addresses. Databases grow or shrink
with time.
– If the hash function is chosen based on the current file size
and the initial number of buckets is too small, performance
will degrade due to too many overflows as the file grows.
– If space is allocated for anticipated growth, a significant amount
of space will be wasted initially (and buckets will be underfull).
– If the database shrinks, space will again be wasted.
• One solution: periodic re-organization of the file with a new
hash function
– Expensive, disrupts normal operations
• Better solution: allow the number of buckets to be modified
dynamically.
Dynamic Hashing
• Dynamic hashing provides a mechanism in
which data buckets are added and removed
dynamically and on-demand.
• One common form of dynamic hashing is extendable
hashing (also written extended hashing).
• In dynamic hashing, the hash function is made to
produce a large number of values, of which only a
few are used initially.
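One way to realize this idea is extendable hashing, sketched below under several assumptions: integer keys, a bucket capacity of 2, and a directory indexed by the low-order bits of the hash value (some presentations use the high-order bits instead). The class and method names are illustrative. Only growth is shown; merging buckets on deletion is omitted, and the sketch assumes no more than `capacity` keys share an identical hash value.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth   # number of hash bits this bucket depends on
        self.capacity = capacity
        self.keys = []

class ExtendableHash:
    def __init__(self, capacity=2):
        self.global_depth = 0            # directory has 2**global_depth entries
        self.capacity = capacity
        self.directory = [Bucket(0, capacity)]

    def _slot(self, key):
        # Only the low-order global_depth bits of the hash value are used.
        return hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._slot(key)]
            if len(bucket.keys) < bucket.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)          # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; the new half points at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Entries whose newly significant bit is set now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old bucket's keys between the two buckets.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)

eh = ExtendableHash(capacity=2)
for k in [0, 1, 2, 3, 4]:
    eh.insert(k)

print(eh.global_depth, [b.keys for b in eh.directory])
# 2 [[0, 4], [1, 3], [2], [1, 3]]
```

Directory entries 1 and 3 share one bucket: that bucket has not overflowed yet, so it still depends on only one hash bit.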
Dynamic Hashing
Hashing Practice Problems
Problem 1

• Consider a hash table of size seven, with starting index
zero, and a hash function (3x + 4)mod7. Assuming the
hash table is initially empty, which of the following is
the contents of the table when the sequence 1, 3, 8,
10 is inserted into the table using Open hashing? Note
that ‘_’ denotes an empty location in the table.
(A) 8, _, _, _, _, _, 10
(B) 1, 8, 10, _, _, _, 3
(C) 1, _, _, _, _, _,3
(D) 1, 10, 8, _, _, _, 3
• 1 => (3x+4) mod 7 = 7 mod 7 = 0
• 3 => (3x+4) mod 7 = 13 mod 7 = 6
• 8 => (3x+4) mod 7 = 28 mod 7 = 0
Because address ‘0’ is not empty, store 8 at the next
empty data bucket, ‘1’
• 10 => (3x+4) mod 7 = 34 mod 7 = 6
Because address ‘6’ is not empty, probing wraps around to addresses
‘0’ and ‘1’, which are also occupied, so 10 is stored at the next
empty data bucket, ‘2’
Correct option is B
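The working for Problem 1 can be checked with a short simulation, assuming linear probing with wrap-around as used in the solution. `None` plays the role of ‘_’.

```python
def h(x):
    return (3 * x + 4) % 7

table = [None] * 7
for key in [1, 3, 8, 10]:
    slot = h(key)
    # Step forward, wrapping past the last slot, until a free slot is found.
    while table[slot] is not None:
        slot = (slot + 1) % 7
    table[slot] = key

print(table)   # [1, 8, 10, None, None, None, 3]
```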
Problem 2
• The keys 12, 18, 13, 2, 3, 23, 5 and 15 are
inserted into an initially empty hash table of
length 10 using open addressing with hash
function h(k) = k mod 10 and linear probing.
What is the resultant hash table?
• h(k) = k mod 10
• 12 => 12 mod 10 = 2
• 18 => 18 mod 10 = 8
• 13 => 13 mod 10 = 3
• 2 => 2 mod 10 = 2, not empty, next available = 4
• 3 => 3 mod 10 = 3, not empty, next available = 5
• 23 => 23 mod 10 = 3, not empty, next available = 6
• 5 => 5 mod 10 = 5, not empty, next available = 7
• 15 => 15 mod 10 = 5, not empty (6, 7 and 8 are also occupied), next available = 9
Correct option is C
Problem 3
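• For Problem 2, what would be the correct option if the method
used is closed hashing?
• h(k) = k mod 10
• 12 => 12 mod 10 = 2
• 18 => 18 mod 10 = 8
• 13 => 13 mod 10 = 3
• 2 => 2 mod 10 = 2 (chained in bucket 2 with 12)
• 3 => 3 mod 10 = 3 (chained in bucket 3 with 13)
• 23 => 23 mod 10 = 3 (chained in bucket 3 with 13 and 3)
• 5 => 5 mod 10 = 5
• 15 => 15 mod 10 = 5 (chained in bucket 5 with 5)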
Correct option is D
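The chains for Problem 3 can be reproduced with the sketch below. It assumes each new key is appended to the tail of its chain; an implementation that inserts at the head would hold the same keys in each chain in reverse order.

```python
keys = [12, 18, 13, 2, 3, 23, 5, 15]

# One chain per slot of the length-10 table.
chains = {slot: [] for slot in range(10)}
for k in keys:
    chains[k % 10].append(k)

for slot, chain in chains.items():
    if chain:
        print(slot, chain)
# 2 [12, 2]
# 3 [13, 3, 23]
# 5 [5, 15]
# 8 [18]
```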
Problem 4
• A hash table of length 10 uses open
addressing with hash function h(k)=k mod 10,
and linear probing. After inserting 6 values
into an empty hash table, the table is as
shown below.
Which one of the following choices gives a
possible order in which the key values could
have been inserted in the table?
(A) 46, 42, 34, 52, 23, 33
(B) 34, 42, 23, 52, 33, 46
(C) 46, 34, 42, 23, 52, 33
(D) 42, 46, 33, 23, 34, 52
• Solution: We check whether the sequence given in option A can
lead to the hash table given in the question. Option A inserts 46, 42, 34,
52, 23, 33 as follows:
• For key 46, h(46) is 46%10 = 6. Therefore, 46 is placed at index 6.
• For key 42, h(42) is 42%10 = 2. Therefore, 42 is placed at index 2.
• For key 34, h(34) is 34%10 = 4. Therefore, 34 is placed at index 4.
• For key 52, h(52) is 52%10 = 2. However, index 2 is occupied by
42. Therefore, 52 is placed at index 3 in the hash table. But in the
given hash table, 52 is placed at index 5. Therefore, the sequence in
option A can’t generate the hash table given in the question.
• In a similar way, we can check the other options, which
leads to the answer (C).
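The same check can be run for all four options at once. The sketch below builds the table each insertion order would produce under linear probing and prints slots 2 to 7, the only slots any option fills; each row can then be compared against the table given in the question. The function name `build` is an illustrative choice.

```python
def build(keys, size=10):
    """Insert keys in order using h(k) = k mod size with linear probing."""
    table = [None] * size
    for k in keys:
        slot = k % size
        while table[slot] is not None:
            slot = (slot + 1) % size
        table[slot] = k
    return table

options = {
    "A": [46, 42, 34, 52, 23, 33],
    "B": [34, 42, 23, 52, 33, 46],
    "C": [46, 34, 42, 23, 52, 33],
    "D": [42, 46, 33, 23, 34, 52],
}

for name, keys in options.items():
    print(name, build(keys)[2:8])
# A [42, 52, 34, 23, 46, 33]
# B [42, 23, 34, 52, 33, 46]
# C [42, 23, 34, 52, 46, 33]
# D [42, 33, 23, 34, 46, 52]
```

Options B and C both place 52 at index 5, so the positions of 46 and 33 are what separate them.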
