hashing-2 (1)

The document discusses hashing techniques for efficient data retrieval, focusing on equality searches that can be performed in a single disk access. It covers types of hashing, hash functions, collision resolution methods, and the design considerations for effective hashing functions. Additionally, it includes examples of indexing in databases like Oracle and DB2, highlighting the importance of choosing the right index for different types of queries.

Uploaded by

bhaivipin283

Hashing

• Given a search key, can we guess its location in the file?
• Goal:
– Support equality searches in one disk access!
• Method: hash keys into addresses (key → page)

Types of Hashing
• What does H(K) point to:
– A cell of a table in memory where K* is stored (internal hashing)
– A bucket on disk where K* is stored (external hashing)
• A bucket consists of 1 or more pages.
• Hash file maintenance:
– Static hashing
• File size is fixed
– Dynamic & extensible hashing
• File size can grow

Hashing to a File
[Figure: H(K) maps the key of each of the r records to one of N slots*]

*Slots store either the actual records (clustered index) or (key, ptr) pairs (unclustered index)

Hash Function

• Input: a field of a record; usually its key K (student id, name, …)
• Compute index function H(K)
H(K): K → A
to find the address of K*.
H(K) = A is the address of the record (or index entry) with key K

Hash Function 1
Student id   Name    Address
0234134      John    4
0349423      Mary    3
0428421      Jean    1
1324532      Sandy   2
2374734      Randy   4

Let some digits of the key, for example the last digit of the student id, represent the location.
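
The digit-selection idea above can be sketched in a few lines of Python. The function name `last_digit_hash` and the dictionary layout are illustrative assumptions, not part of the slides; the ids and names are the ones in the table.

```python
def last_digit_hash(student_id: str) -> int:
    """Use the last digit of the id as the address (0-9)."""
    return int(student_id[-1])

# Student ids and names taken from the table above.
students = {
    "0234134": "John",
    "0349423": "Mary",
    "0428421": "Jean",
    "1324532": "Sandy",
    "2374734": "Randy",
}

addresses = {name: last_digit_hash(sid) for sid, name in students.items()}
print(addresses)
# {'John': 4, 'Mary': 3, 'Jean': 1, 'Sandy': 2, 'Randy': 4}
```

Note that John and Randy both land on address 4, so even this five-row table already contains a collision.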

Hash Function 2
• Key is student id (six digits), we have 100,000 record positions (0 – 99,999)
• H(K): student_id mod 99999 (this yields addresses 0 – 99,998, so position 99,999 is never used)
085768 → 085768 mod 99999 = 85768
134281 → 134281 mod 99999 = 34282
101004 → 101004 mod 99999 = 1005
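
A minimal Python sketch of this division-remainder function, reproducing the three worked values. The name `mod_hash` and the choice to pass ids as strings (to keep the leading zero) are assumptions made for illustration.

```python
def mod_hash(student_id: str, m: int = 99999) -> int:
    """Division-remainder hashing: address = key mod m."""
    return int(student_id) % m

for sid in ("085768", "134281", "101004"):
    print(sid, "->", mod_hash(sid))
# 085768 -> 85768
# 134281 -> 34282
# 101004 -> 1005
```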

More Hash Functions
• Folding
– Replace the key by numeric code
• ALBERT = 01 12 02 05 18 20
– Fold and Add
• 0112 + 0205 + 1820 = 2137
– Take the modulo relative to the size of address space
• 2137 mod 101 = 16
• Midsquare: Square key and take middle
– (453)^2 = 205209 → 52
• Radix Transformation
– (453)_10 = (382)_11 → 382 mod 99 = 85
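
The three techniques can be sketched in Python as follows. All function names are illustrative assumptions. The letter coding uses alphabet positions A = 01 through Z = 26 (so L = 12), under which ALBERT folds to 2137 and hashes to 16 modulo 101.

```python
def letter_codes(word: str) -> str:
    """Replace each letter by its two-digit alphabet position (A=01 ... Z=26)."""
    return "".join(f"{ord(c) - ord('A') + 1:02d}" for c in word.upper())

def fold_and_add(digits: str, part_len: int = 4) -> int:
    """Cut the digit string into parts of part_len digits and add them."""
    return sum(int(digits[i:i + part_len]) for i in range(0, len(digits), part_len))

def midsquare(key: int, n_digits: int = 2) -> int:
    """Square the key and return its middle n_digits digits."""
    squared = str(key * key)
    start = (len(squared) - n_digits) // 2
    return int(squared[start:start + n_digits])

def radix_transform(key: int, base: int = 11) -> int:
    """Write key in `base`, then re-read the digit sequence as a decimal number.
    Assumption: a digit value of 10 or more simply carries into the next place."""
    result, place = 0, 1
    while key > 0:
        key, digit = divmod(key, base)
        result += digit * place
        place *= 10
    return result

print(letter_codes("ALBERT"))                       # 011202051820
print(fold_and_add(letter_codes("ALBERT")))         # 2137
print(fold_and_add(letter_codes("ALBERT")) % 101)   # 16
print(midsquare(453))                               # 52
print(radix_transform(453), radix_transform(453) % 99)  # 382 85
```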

Hashing Function 3
• Concatenate the alphabetic positions of all letters, partition the result into equal parts, multiply each part by its position, fold and add, divide the result by the size of the address space (a prime number) and take the remainder.
Name    Letter positions   Address
John    10 15 08 14        (1015*1 + 0814*2) mod 43 = 20
Mary    13 01 18 25        (1301*1 + 1825*2) mod 43 = 6
Jean    10 05 01 14        (1005*1 + 0114*2) mod 43 = 29
Sandy   19 01 14 04 25     (1901*1 + 1404*2 + 0025*3) mod 43 = 11
Randy   18 01 14 04 25     (1801*1 + 1404*2 + 0025*3) mod 43 = 40
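
A short Python sketch of the procedure, assuming four-digit parts and an address space of 43 as in the table. The name `name_hash` and its parameters are illustrative.

```python
def name_hash(name: str, m: int = 43, part_len: int = 4) -> int:
    """Weighted fold-and-add over the alphabet positions of the letters."""
    # Concatenate two-digit alphabet positions: J -> 10, O -> 15, ...
    digits = "".join(f"{ord(c) - ord('A') + 1:02d}" for c in name.upper())
    # Partition into parts of part_len digits (the last part may be shorter).
    parts = [digits[i:i + part_len] for i in range(0, len(digits), part_len)]
    # Multiply each part by its 1-based position, add, and take the remainder.
    total = sum(int(part) * pos for pos, part in enumerate(parts, start=1))
    return total % m

for name in ("John", "Mary", "Jean", "Sandy", "Randy"):
    print(name, name_hash(name))
# John 20
# Mary 6
# Jean 29
# Sandy 11
# Randy 40
```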

Hash Function Design Issues
• Key space
– The set of all possible values for keys
• Address space (N)
– The set of all storage units
– Physical location of file
• In general
– Address space must accommodate all records in file
– Address space is usually much smaller than key space

Features of Hashing
• Randomizing
– Records are randomly spread over the whole storage space
• Collision
– Two different keys may be hashed into the same address (synonyms)
– To deal with it, two ways:
• choose hashing functions that reduce collisions
• rearrange the storage of records to reduce collisions

Good and Bad Functions

[Figure: three mappings of keys A–G onto addresses 1–10, labeled Best, Worst, and Acceptable]

Choice of Hash Function


• Perfect hash function
– One-to-one: No synonyms
– Onto: Key space = Address space
– Not feasible for large and active files
• Desirable hashing function
– Minimize collisions
– Relatively small address space
• Tradeoff
– The larger the address space, the easier it is to avoid collisions
– The larger the address space, the worse the storage utilization becomes
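
The tradeoff can be observed with a small experiment. This is a sketch under stated assumptions: keys are 50 pseudo-random integers drawn with a fixed seed, the hash is K mod N, each address holds one record, and utilization is taken as r/N. The function name and the candidate values of N are arbitrary choices for illustration.

```python
import random

def collisions_and_utilization(keys, n):
    """Hash keys with K mod n; return (# keys hitting an occupied address, r/n)."""
    occupied = set()
    collisions = 0
    for k in keys:
        address = k % n
        if address in occupied:
            collisions += 1
        occupied.add(address)
    return collisions, len(keys) / n

rng = random.Random(0)                     # fixed seed, arbitrary choice
keys = rng.sample(range(1_000_000), 50)    # 50 distinct hypothetical keys

for n in (53, 101, 503, 5003, 1_000_000):
    collisions, utilization = collisions_and_utilization(keys, n)
    print(f"N={n:>7}  collisions={collisions:>2}  utilization={utilization:.4f}")
```

With N equal to the whole key range there can be no collisions, but almost every address stays empty. Shrinking N raises utilization at the price of more collisions.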

A Hashing Function
1. Convert the key to a number (if it is not)
key → K
2. Compute an address from the number
address = K mod M
• Suggestion: Choose M to be a prime number (why?).
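
One way to see why a prime M helps: when keys follow a pattern that shares a factor with M, only a fraction of the addresses is ever used. The sketch below assumes keys that are all multiples of 10 (a hypothetical pattern) and compares M = 100 with the prime M = 101.

```python
def distinct_addresses(keys, m):
    """Number of different addresses produced by K mod m."""
    return len({k % m for k in keys})

keys = range(0, 1000, 10)      # 100 hypothetical keys: 0, 10, 20, ..., 990

print(distinct_addresses(keys, 100))   # 10
print(distinct_addresses(keys, 101))   # 100
```

With M = 100 the 100 keys crowd into 10 addresses. With M = 101, which shares no factor with the key spacing, every key gets its own address.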

Collisions
• A key is mapped to an address that is full.
• Collision Resolution: Where to store the overflow key?
– Static methods
• Linear probing
• Double hashing
• Separate overflow
– Dynamic methods
• Extendable hashing
• Linear hashing

Linear Probing
• For each key, generate a sequence of addresses A_0, A_1, A_2, …
A_0 = hash(key) mod M
A_{i+1} = [A_i + step] mod M

M: file size (max # of addresses)
step: a constant
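
The recurrence can be written as a small Python generator. The name `probe_sequence` is an assumption for illustration; the values M = 7 and step = 1 are chosen only to keep the output short.

```python
from itertools import islice

def probe_sequence(a0: int, m: int, step: int = 1):
    """Yield A_0, A_1, A_2, ... with A_{i+1} = (A_i + step) mod m."""
    address = a0 % m
    while True:
        yield address
        address = (address + step) % m

print(list(islice(probe_sequence(5, 7), 5)))   # [5, 6, 0, 1, 2]
print(list(islice(probe_sequence(3, 7), 5)))   # [3, 4, 5, 6, 0]
```

The sequence visits every address only when step and M have no common factor, which always holds for step = 1.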

Example
Key           hash(key) = A0   A1   A2   A3   A4
Mozart        1                2    3    4    5
Tchaikovsky   1                2    3    4    5
Ravel         3                4    5    6    0
Beethoven     5                6    0    1    2
Mendelssohn   5                6    0    1    2
Bach          3                4    5    6    0
Grieg         3                4    5    6    0

M = 7, step = 1
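
A sketch of inserting these keys with linear probing. It assumes one record per address and that the keys arrive in the order listed in the table; the function name `insert` and the list-based table are illustrative choices.

```python
def insert(table, key, a0, step=1):
    """Store key at the first free address of its probe sequence; return that address."""
    m = len(table)
    address = a0 % m
    for _ in range(m):                 # at most m probes
        if table[address] is None:
            table[address] = key
            return address
        address = (address + step) % m
    raise RuntimeError("table is full")

# (key, home address A0) pairs from the example, in table order.
home = [("Mozart", 1), ("Tchaikovsky", 1), ("Ravel", 3), ("Beethoven", 5),
        ("Mendelssohn", 5), ("Bach", 3), ("Grieg", 3)]

table = [None] * 7                     # M = 7
for key, a0 in home:
    insert(table, key, a0)

for address, key in enumerate(table):
    print(address, key)
# 0 Grieg
# 1 Mozart
# 2 Tchaikovsky
# 3 Ravel
# 4 Bach
# 5 Beethoven
# 6 Mendelssohn
```

The last key needs five probes (3, 4, 5, 6, 0) before it finds a free address, which illustrates how probe chains grow as the file fills.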

Linear Probing - Problems
• Performance degradation as more rows are added.
• Waste of space as more rows are deleted.
• These are problems for all static methods
• Solutions
– Reorganization
– Use a dynamic method

Extendable Hashing
• The address space is changed dynamically.
• The hash function is adjusted to accommodate the change.
• A common family of hash functions
– h_k(key) = h(key) mod 2^k (use the last k bits of h(key))
– At any given time a unique hash, h_k, is used
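
Taking the last k bits is a one-line operation. In the sketch below the name `h_k` follows the slide notation, while the sample value 11010 (binary) is just an arbitrary 5-bit hash chosen for illustration.

```python
def h_k(h: int, k: int) -> int:
    """Keep the last k bits of h; equivalent to h & ((1 << k) - 1)."""
    return h % (1 << k)

h = 0b11010
print(format(h_k(h, 2), "02b"))   # 10
print(format(h_k(h, 3), "03b"))   # 010
```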

Extendable Hashing - Example
v       h(v)
pete    11010
mary    00000
jane    11110
bill    00000
john    01001
vince   10101
karen   10111

• Location mechanism: a directory whose entries point to buckets
• The size of the directory corresponds to the currently active hash function h_k
h_k(key) = h(key) mod 2^k
k = 2, directory size = 2^2 = 4 (use last k = 2 bits of h(key))
Directory entries: 00, 01, 10, 11

Example (con’t)
Next action: insert ‘sol’, where h(sol) = 10001.

directory (h_2)   bucket
00                B0: mary, bill
01                B1: john, vince
10                B2: pete, jane
11                B3: karen

v       h(v)
pete    11010
mary    00000
jane    11110
bill    00000
john    01001
vince   10101
karen   10111
sol     10001

sol can’t be stored in B1 since the bucket is full

Example (con’t)
Solution:
1. Split the overfilled bucket
2. Switch to h_3 (double the directory)
h_k(key) = h(key) mod 2^k
k = 3, directory size = 2^3 = 8 (use last k = 3 bits of h(key))
3. Update the pointers

directory   bucket
000         B0: mary, bill
001         B1: john, sol
010         B2: pete, jane
011         B3: karen
100         B0
101         B4: vince
110         B2
111         B3

Current hash = 3 (current_hash identifies the current hash function)

v       h(v)
pete    11010
mary    00000
jane    11110
bill    00000
john    01001
vince   10101
karen   10111
sol     10001

Example (con’t)
• Next action: Insert judy, where h(judy) = 00110
• B2 overflows, but directory need not be extended

directory   bucket
000         B0: mary, bill
001         B1: john, sol
010         B2: pete, jane
011         B3: karen
100         B0
101         B4: vince
110         B2
111         B3

Current hash = 3

Need a mechanism for deciding whether the directory has to be doubled.

Example (con’t)
directory   bucket             bucket level
000         B0: mary, bill     2
001         B1: john, sol      3
010         B2: pete, jane     2
011         B3: karen          2
100         B0
101         B4: vince          3
110         B2
111         B3

Current hash = 3

Add a bucket level: if current_hash > bucket_level[i], then do not enlarge directory

Example (con’t)
directory   bucket             bucket level
000         B0: mary, bill     2
001         B1: john, sol      3
010         B2: pete           3 (was 2)
011         B3: karen          2
100         B0
101         B4: vince          3
110         B5: judy, jane     3
111         B3

Current hash = 3

v       h(v)
pete    11010
mary    00000
jane    11110
bill    00000
john    01001
vince   10101
karen   10111
sol     10001
judy    00110
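
The whole example can be replayed with a compact Python sketch of extendable hashing. It is a minimal illustration under stated assumptions, not a definitive implementation: buckets hold at most 2 keys, the class and attribute names (`Bucket`, `ExtendableHash`, `level`, `k`) are hypothetical, and the hash values are the h(v) column from the example.

```python
class Bucket:
    def __init__(self, level):
        self.level = level          # bucket level: number of hash bits it distinguishes
        self.keys = []

class ExtendableHash:
    def __init__(self, h, k=2, capacity=2):
        self.h = h                  # base hash function: key -> int
        self.k = k                  # current hash: directory is indexed by the last k bits
        self.capacity = capacity    # assumed bucket capacity
        self.directory = [Bucket(k) for _ in range(1 << k)]

    def insert(self, key):
        bucket = self.directory[self.h(key) % (1 << self.k)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.level == self.k:
            # No spare bit left: double the directory; entries i and i + 2^k
            # initially point to the same bucket.
            self.directory = self.directory + self.directory
            self.k += 1
        # Split the bucket on one more bit and update the pointers.
        bucket.level += 1
        new_bucket = Bucket(bucket.level)
        high_bit = 1 << (bucket.level - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Redistribute. Caveat: more than `capacity` keys with identical
        # hash values would keep splitting forever in this sketch.
        pending, bucket.keys = bucket.keys + [key], []
        for x in pending:
            self.insert(x)

H = {"pete": 0b11010, "mary": 0b00000, "jane": 0b11110, "bill": 0b00000,
     "john": 0b01001, "vince": 0b10101, "karen": 0b10111,
     "sol": 0b10001, "judy": 0b00110}

table = ExtendableHash(H.__getitem__)
for name in H:                      # dicts keep insertion order
    table.insert(name)

print("current hash:", table.k)
for i, b in enumerate(table.directory):
    print(format(i, "03b"), b.keys, "level", b.level)
```

Inserting sol doubles the directory because its bucket's level equals the current hash. Inserting judy only splits a bucket, because that bucket's level (2) is below the current hash (3).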


Hash Indices - Summary


• Range search is not supported.
– Since adjacent elements in range might hash to different buckets
• Partial key search is not supported.
– Entire key must be provided
• But, an equality search on average takes only 1 disk access
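
A tiny illustration of the first point, assuming integer keys, 7 buckets, and a K mod 7 hash (all hypothetical choices).

```python
def bucket_of(key: int, n_buckets: int = 7) -> int:
    return key % n_buckets

# Equality search: exactly one bucket has to be read.
print(bucket_of(102))                                     # 4

# Range search 100 <= key <= 104: the keys are spread over many buckets,
# and each candidate value has to be hashed and probed separately.
print(sorted({bucket_of(k) for k in range(100, 105)}))    # [2, 3, 4, 5, 6]
```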

Indexing in Oracle
(un-clustered index)
• Create an un-clustered index on author:
CREATE TABLE book (
callno char(10),
author char(20),
title char(30),
year char(4),
PRIMARY KEY (callno)
);

CREATE INDEX authidx ON book (author);

• Result: an un-clustered dense index on author.

Indexing in Oracle
(clustered index on primary key)
• Create a clustered index on callno:
CREATE TABLE book (
callno char(10),
author char(20),
title char(30),
year char(4),
PRIMARY KEY (callno)
)
ORGANIZATION INDEX;
• This syntax allows a clustered index on the primary key of the table only.

Indexing in Oracle
(clustered index on non-primary key columns)
• Create a clustered index on author:
-- The cluster must exist before a table can be placed in it:
CREATE CLUSTER authcl (author char(20));

CREATE TABLE book (
callno char(10),
author char(20),
title char(30),
year char(4),
PRIMARY KEY (callno)
)
cluster authcl(author);

CREATE INDEX authidx on cluster authcl;


• An Oracle cluster may contain rows from more than one table.

Indexing in DB2
• Create un-clustered indexes on callno and author:

CREATE INDEX callno_idx ON book (callno)

CREATE INDEX auth_idx ON book (author)

• Can make (only) one index clustered:

CREATE INDEX auth_idx ON book (author) CLUSTER

Data must be (preferably) sorted on clustering column(s) in the OS file.

Choosing an Index

Ex 1 SELECT E.Id
FROM Employee E
WHERE E.Salary < :upper AND E.Salary > :lower

- a range search on Salary.
- Suppose the primary key is employee id; it is likely that there is a main, clustered index on that attribute that is of no use for this query.
- Choose a secondary, B+ tree index with search key Salary

Choosing an Index
Ex 2 SELECT T.StudId
FROM Transcript T
WHERE T.Grade = :grade

- an equality search on Grade.
- Suppose the primary key is (StudId, Semester, CrsCode); it is likely that there is a main, clustered index on these attributes that is of no use for this query.
- Choose a secondary, B+ tree or hash index with search key Grade

Choosing an Index
Ex 3 SELECT T.CrsCode, T.Grade
FROM Transcript T
WHERE T.StudId = :id AND T.Semester = 'F2000'

- Equality search on StudId and Semester.
- If the primary key is (StudId, Semester, CrsCode) it is likely that there is a main, clustered index on this sequence of attributes.
- If the main index is a B+ tree it can be used for this search, since (StudId, Semester) is a prefix of the search key.
- If the main index is a hash it cannot be used for this search, since the entire key must be provided and CrsCode is not given. Choose B+ tree or hash with search key StudId or (StudId, Semester)
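
The difference between the two index types on a key prefix can be imitated in Python. This is only an analogy under stated assumptions: a sorted list stands in for a B+ tree, a dict keyed on the full tuple stands in for a hash index, and the rows are made-up sample data.

```python
from bisect import bisect_left

# Hypothetical Transcript rows: (StudId, Semester, CrsCode) -> Grade
rows = {
    (111, "F2000", "CS305"): "A",
    (111, "F2000", "MA101"): "B",
    (111, "S2001", "CS306"): "A",
    (222, "F2000", "CS305"): "C",
}

# Ordered index: entries sorted by the full key, so all entries that share
# the prefix (StudId, Semester) sit next to each other.
ordered = sorted(rows)

def prefix_search(stud_id, semester):
    """Return (CrsCode, Grade) pairs for one student and semester."""
    start = bisect_left(ordered, (stud_id, semester))
    result = []
    for key in ordered[start:]:
        if key[:2] != (stud_id, semester):
            break
        result.append((key[2], rows[key]))
    return result

print(prefix_search(111, "F2000"))    # [('CS305', 'A'), ('MA101', 'B')]

# Hash-style index on the full key: a lookup needs all three attributes,
# so the pair (StudId, Semester) alone finds nothing.
print((111, "F2000") in rows)         # False
```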

Choosing an Index
Ex 3 (con’t)
SELECT T.CrsCode, T.Grade
FROM Transcript T
WHERE T.StudId = :id AND T.Semester = 'F2000'

- Suppose Transcript has primary key (CrsCode, StudId, Semester). Can this index be useful (independent of being hash or B+ tree)?
