06 Hash Tables

Intro to Database Systems
15-445/15-645
Fall 2019

Andy Pavlo
Computer Science
Carnegie Mellon University

ADMINISTRIVIA

Project #1 is due Fri Sept 27th @ 11:59pm

Homework #2 is due Mon Sept 30th @ 11:59pm

CMU 15-445/645 (Fall 2019)


COURSE STATUS

We are now going to talk about how to support the DBMS's execution engine to read/write data from pages.

Two types of data structures:
→ Hash Tables
→ Trees

Figure: DBMS layers, top to bottom: Query Planning, Operator Execution, Access Methods, Buffer Pool Manager, Disk Manager.


DATA STRUCTURES

Internal Meta-data
Core Data Storage
Temporary Data Structures
Table Indexes


DESIGN DECISIONS

Data Organization
→ How we lay out the data structure in memory/pages and what information to store to support efficient access.

Concurrency
→ How to enable multiple threads to access the data structure at the same time without causing problems.


HASH TABLES

A hash table implements an unordered associative array that maps keys to values.

It uses a hash function to compute an offset into the array for a given key, from which the desired value can be found.

Space Complexity: O(n)

Operation Complexity:
→ Average: O(1)
→ Worst: O(n)

Note: Money cares about constants!

STATIC HASH TABLE

Allocate a giant array that has one slot for every element you need to store.

To find an entry, mod the key by the number of elements to find the offset in the array.

Figure: hash(key) selects one of the slots 0 to n. In the first build, slot 0 holds abc, slot 1 is empty (Ø), slot 2 holds def, and slot n holds xyz. In the second build, the slots are shown alongside the longer keys abcdefghi, xyz123, and defghijk.
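
The static table described above can be sketched in a few lines of Python. This is only an illustration under the slide's own assumptions (known element count, unique keys, no collisions); the class name, method names, and the use of non-negative integer keys are assumptions made for the example.

```python
class StaticHashTable:
    """One slot per element; offset = key mod number of slots."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def _offset(self, key):
        # Assumption: keys are non-negative integers, so the key is its own hash.
        return key % len(self.slots)

    def put(self, key, value):
        # No collision handling: a second key with the same offset overwrites.
        self.slots[self._offset(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._offset(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None


table = StaticHashTable(8)
table.put(10, "abc")   # 10 % 8 = 2
table.put(3, "def")    # 3 % 8 = 3
print(table.get(10))   # abc
print(table.get(4))    # None
```

Storing the key next to the value is what lets `get` tell a real hit from a different key that happens to map to the same offset.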


ASSUMPTIONS

You know the number of elements ahead of time.

Each key is unique.

Perfect hash function.
→ If key1≠key2, then hash(key1)≠hash(key2)


HASH TABLE

Design Decision #1: Hash Function
→ How to map a large key space into a smaller domain.
→ Trade-off between being fast vs. collision rate.

Design Decision #2: Hashing Scheme
→ How to handle key collisions after hashing.
→ Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.


TODAY'S AGENDA

Hash Functions
Static Hashing Schemes
Dynamic Hashing Schemes


HASH FUNCTIONS

For any input key, return an integer representation of that key.

We do not want to use a cryptographic hash function for DBMS hash tables.

We want something that is fast and has a low collision rate.


HASH FUNCTIONS

CRC-64 (1975)
→ Used in networking for error detection.
MurmurHash (2008)
→ Designed to be a fast, general-purpose hash function.
Google CityHash (2011)
→ Designed to be faster for short keys (<64 bytes).
Facebook XXHash (2012)
→ From the creator of zstd compression.
Google FarmHash (2014)
→ Newer version of CityHash with better collision rates.
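
To make the role of a non-cryptographic hash function concrete, here is a small Python sketch. Python's standard library does not ship the functions listed above, so the example substitutes `zlib.crc32` (CRC-32, not CRC-64) purely as a stand-in; `slot_for` is a hypothetical helper name.

```python
import zlib


def slot_for(key: str, num_slots: int) -> int:
    """Map a string key to a slot, using CRC-32 as a stand-in hash function."""
    # zlib.crc32 takes bytes and returns an unsigned 32-bit integer.
    return zlib.crc32(key.encode("utf-8")) % num_slots


for key in ("abc", "def", "xyz"):
    print(key, slot_for(key, 16))
```

The same key always maps to the same slot, which is all a hash table needs from the function; speed and a low collision rate are what separate the candidates.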

HASH FUNCTION BENCHMARK

Figure: throughput (MB/sec) versus key size (bytes) for crc64, std::hash, MurmurHash3, CityHash, FarmHash, and XXHash3 on an Intel Core i7-8700K @ 3.70GHz. The first plot covers key sizes 1 to 8 bytes (throughput axis 0 to 4000 MB/sec). The second plot covers key sizes 1 to 251 bytes (throughput axis 0 to 28000 MB/sec), with key sizes 32, 64, 128, and 192 marked.
Source: Fredrik Widlund

STATIC HASHING SCHEMES

Approach #1: Linear Probe Hashing

Approach #2: Robin Hood Hashing

Approach #3: Cuckoo Hashing


LINEAR PROBE HASHING

Single giant table of slots.

Resolve collisions by linearly searching for the next free slot in the table.
→ To determine whether an element is present, hash to a location in the index and scan for it.
→ Have to store the key in the index to know when to stop scanning.
→ Insertions and deletions are generalizations of lookups.
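
The insert and lookup rules above can be sketched as follows. This is a minimal Python illustration, not the course's implementation: the class name is made up, the table never resizes, and the `fake_hash` values are hypothetical, chosen only so that one key collides with another.

```python
class LinearProbeTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots   # each slot: None or (key, value)
        self.hash_fn = hash_fn

    def _probe(self, key):
        """Yield slot indexes from the hashed slot onward, wrapping around once."""
        start = self.hash_fn(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
        raise RuntimeError("table is full")

    def find(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:      # empty slot: the key cannot be further on
                return None
            if self.slots[i][0] == key:    # the stored key tells us when to stop
                return self.slots[i][1]
        return None


# Hypothetical hash values chosen so that C collides with A.
fake_hash = {"A": 2, "B": 0, "C": 2}.__getitem__
t = LinearProbeTable(8, hash_fn=fake_hash)
print(t.insert("A", 1))  # 2
print(t.insert("B", 2))  # 0
print(t.insert("C", 3))  # 3 (slot 2 is taken, so C takes the next free slot)
print(t.find("C"))       # 3
```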


LINEAR PROBE HASHING

Figure: keys A to F are hashed and inserted one at a time, and each slot stores <key>|<value>. A key whose hashed slot is already taken is placed in the next free slot below it. Final table, top to bottom: B | val, A | val, C | val, D | val, E | val, F | val.


LINEAR PROBE HASHING DELETES

Figure: the table holds B | val, A | val, C | val, D | val, E | val, F | val, top to bottom. Delete C empties C's slot. A later Find D starts its scan at that now-empty slot and would stop there, even though D | val sits in the next slot.

Approach #1: Tombstone
→ Mark the deleted slot with a tombstone so lookups keep scanning past it; a later insert can reuse the slot.

Approach #2: Movement
→ Shift the following entries up to fill the gap. In the figure, D | val, E | val, and F | val each move up one slot.
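
A hedged Python sketch of the tombstone approach follows. The `TOMBSTONE` marker, the class name, and the `fake_hash` values are all assumptions for illustration; the insert path also assumes the key is not already present, to keep the example short.

```python
EMPTY = None
TOMBSTONE = object()   # marker for a slot whose entry was deleted


class ProbeTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [EMPTY] * num_slots
        self.hash_fn = hash_fn

    def _probe(self, key):
        start = self.hash_fn(key) % len(self.slots)
        return [(start + s) % len(self.slots) for s in range(len(self.slots))]

    def insert(self, key, value):
        # Simplification: assumes the key is not already in the table,
        # so the first empty or tombstoned slot can be reused directly.
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def find(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return None                 # a truly empty slot ends the scan
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE   # keeps later keys reachable
                return True
        return False


# Hypothetical hash values: C and D both hash to slot 3.
fake_hash = {"C": 3, "D": 3}.__getitem__
t = ProbeTable(8, hash_fn=fake_hash)
t.insert("C", "c-val")
t.insert("D", "d-val")   # lands in slot 4
t.delete("C")
print(t.find("D"))       # d-val: the scan steps over the tombstone in slot 3
```

The trade-off is that tombstones accumulate and lengthen scans until they are reused or the table is rebuilt, which is what the movement approach avoids at the cost of shifting entries on every delete.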


NON-UNIQUE KEYS

Choice #1: Separate Linked List
→ Store values in separate storage area for each key.

Choice #2: Redundant Keys
→ Store duplicate key entries together in the hash table.

Figure: for Choice #1, each key points to its own value list (XYZ: value1, value2, value3; ABC: value1, value2). For Choice #2, the table itself holds XYZ|value1, ABC|value1, XYZ|value2, XYZ|value3, ABC|value2.
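
The two choices can be contrasted with a short Python sketch that uses the same keys and values as the figure. Python's built-in `dict` and `list` stand in for the hash table and its storage here; `lookup_redundant` is a hypothetical helper, and a real table would only scan the entries near the hashed slot rather than the whole list.

```python
from collections import defaultdict

pairs = [("XYZ", "value1"), ("ABC", "value1"), ("XYZ", "value2"),
         ("XYZ", "value3"), ("ABC", "value2")]

# Choice #1: each key maps to a separate list of its values.
value_lists = defaultdict(list)
for key, value in pairs:
    value_lists[key].append(value)

# Choice #2: the table stores one (key, value) entry per duplicate.
redundant_keys = list(pairs)


def lookup_redundant(entries, key):
    """Collect every value stored under key by scanning the entries."""
    return [v for k, v in entries if k == key]


print(value_lists["XYZ"])                       # ['value1', 'value2', 'value3']
print(lookup_redundant(redundant_keys, "ABC"))  # ['value1', 'value2']
```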


ROBIN HOOD HASHING

Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys.
→ Each key tracks the number of positions it is from its optimal position in the table.
→ On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.
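
The insert rule above can be sketched in Python as follows. This is an illustration only: the class name is made up, keys are assumed unique, and the `fake_hash` values are hypothetical positions chosen so that the run reproduces the A to F sequence used in this lecture's figure.

```python
class RobinHoodTable:
    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots   # each slot: None or [key, value, dist]
        self.hash_fn = hash_fn

    def insert(self, key, value):
        # Assumption: keys are unique, so no check for an existing key.
        n = len(self.slots)
        pos = self.hash_fn(key) % n
        entry = [key, value, 0]           # dist = jumps from the optimal slot
        for _ in range(n):
            resident = self.slots[pos]
            if resident is None:
                self.slots[pos] = entry
                return
            if resident[2] < entry[2]:
                # The resident is "richer" (closer to its optimal slot):
                # take its slot and carry the resident onward instead.
                self.slots[pos], entry = entry, resident
            pos = (pos + 1) % n
            entry[2] += 1
        raise RuntimeError("table is full")


# Hypothetical optimal slots for each key.
fake_hash = {"A": 1, "B": 0, "C": 1, "D": 2, "E": 1, "F": 4}.__getitem__
t = RobinHoodTable(8, hash_fn=fake_hash)
for key in "ABCDEF":
    t.insert(key, "val")
layout = [(s[0], s[2]) if s else None for s in t.slots]
print(layout)
# [('B', 0), ('A', 0), ('C', 1), ('E', 2), ('D', 2), ('F', 1), None, None]
```

Note that E displaces D because E has already travelled two slots while D has travelled one; ties leave the resident in place.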


ROBIN HOOD HASHING

Figure: each entry is stored as key | val [d], where d is the number of "jumps" from the key's first position. Keys A to F are inserted in order:
→ A and B land in their first positions: A | val [0], B | val [0].
→ C hashes to A's slot. A[0] == C[0], so A stays and C takes the next slot as C | val [1].
→ D hashes to C's slot. C[1] > D[0], so C stays and D takes the next slot as D | val [1].
→ E hashes to A's slot. A[0] == E[0] and C[1] == E[1], so E keeps moving. At D's slot, D[1] < E[2], so E takes the slot as E | val [2] and D moves down one slot to become D | val [2].
→ F hashes to D's new slot. D[2] > F[0], so D stays and F takes the next slot as F | val [1].

Final table, top to bottom: B | val [0], A | val [0], C | val [1], E | val [2], D | val [2], F | val [1].


CUCKOO HASHING

Use multiple hash tables with different hash function seeds.
→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.

Look-ups and deletions are always O(1) because only one location per hash table is checked.
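
A two-table version of the scheme above can be sketched in Python. Everything named here is an assumption for illustration: the class, the `max_evictions` cap, the choice to start evicting from the second table, and the hypothetical `h1`/`h2` values that force a chain of evictions. Keys are assumed not to be present already.

```python
class CuckooTable:
    def __init__(self, num_slots, hash1, hash2, max_evictions=16):
        self.tables = [[None] * num_slots, [None] * num_slots]
        self.hashes = [hash1, hash2]
        self.num_slots = num_slots
        self.max_evictions = max_evictions

    def _slot(self, which, key):
        return self.hashes[which](key) % self.num_slots

    def find(self, key):
        # Only one location per table is checked.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        # First look for a free slot in either table.
        for which in (0, 1):
            i = self._slot(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = entry
                return
        # Both are taken: evict, and let each evicted entry try the other table.
        which = 1
        for _ in range(self.max_evictions):
            i = self._slot(which, entry[0])
            self.tables[which][i], entry = entry, self.tables[which][i]
            if entry is None:
                return
            which = 1 - which
        # A real table would rebuild with new seeds and re-insert `entry` here.
        raise RuntimeError("eviction cycle; table needs a rebuild")


# Hypothetical hash values: all three keys collide in table #1.
h1 = {"A": 1, "B": 1, "C": 1}.__getitem__
h2 = {"A": 3, "B": 0, "C": 0}.__getitem__
t = CuckooTable(4, h1, h2)
for key in "ABC":
    t.insert(key, key.lower())
print(t.tables[0])  # [None, ('B', 'b'), None, None]
print(t.tables[1])  # [('C', 'c'), None, None, ('A', 'a')]
```

Inserting C evicts B, which in turn evicts A, which finally finds a free slot in the second table.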


CUCKOO HASHING

Figure: two tables, Hash Table #1 and Hash Table #2.
→ Insert A: hash1(A) and hash2(A) both point to free slots; A|val is stored in Hash Table #1.
→ Insert B: hash1(B) points to the slot holding A, so B|val is stored at hash2(B) in Hash Table #2.
→ Insert C: hash1(C) and hash2(C) are both occupied. C|val takes the slot at hash2(C) and evicts B. B is re-hashed with hash1(B), takes A's slot in Hash Table #1, and evicts A. A is re-hashed with hash2(A) and stored in Hash Table #2.

Final state: Hash Table #1 holds B|val; Hash Table #2 holds C|val and A|val.

OBSERVATION

The previous hash tables require the DBMS to know the number of elements it wants to store.
→ Otherwise it has to rebuild the table if it needs to grow/shrink in size.

Dynamic hash tables resize themselves on demand.
→ Chained Hashing
→ Extendible Hashing
→ Linear Hashing


CHAINED HASHING

Maintain a linked list of buckets for each slot in the hash table.

Resolve collisions by placing all elements with the same hash key into the same bucket.
→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
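
As a rough sketch of the scheme above, the following Python class keeps one chain per slot. A Python list stands in for the linked list of buckets, and the class and method names are illustrative assumptions.

```python
class ChainedHashTable:
    def __init__(self, num_slots):
        # One chain per slot; a Python list stands in for the linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def find(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # overwrite an existing key
                return
        chain.append((key, value))

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


t = ChainedHashTable(4)
for key in (1, 5, 9):          # small ints hash to themselves, so all map to slot 1
    t.insert(key, f"val{key}")
print(len(t.slots[1]))         # 3
t.delete(5)
print(t.find(5), t.find(9))    # None val9
```

The chain can grow without bound, which is why this scheme never needs a rebuild but can degrade to a linear scan.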


CHAINED HASHING

Figure: hash(key) selects a slot in the table, and each slot points to its chain of buckets.


EXTENDIBLE HASHING

Chained-hashing approach where we split buckets instead of letting the linked list grow forever.

Multiple slot locations can point to the same bucket chain.

Reshuffle bucket entries on split and increase the number of bits to examine.
→ Data movement is localized to just the split chain.
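
The split logic can be sketched in Python. This is an illustration under several assumptions: keys are already 5-bit hash values, the directory is indexed by the leading bits, each bucket holds three entries, and the table starts at global depth 1. The class and method names are hypothetical. The hash values are the ones from this lecture's example.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}              # hash value -> stored value


class ExtendibleHashTable:
    def __init__(self, bits=5, capacity=3):
        self.bits = bits             # width of a hash value, in bits
        self.capacity = capacity     # entries per bucket
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, h):
        # The leading global_depth bits of the hash index the directory.
        return h >> (self.bits - self.global_depth)

    def find(self, h):
        return self.directory[self._index(h)].items.get(h)

    def insert(self, h, value):
        while True:
            bucket = self.directory[self._index(h)]
            if h in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[h] = value
                return
            self._split(bucket)      # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            if self.global_depth == self.bits:
                raise RuntimeError("out of hash bits")
            # Double the directory: old entry i becomes entries 2i and 2i+1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        depth = bucket.local_depth + 1
        halves = (Bucket(depth), Bucket(depth))
        for h, v in bucket.items.items():
            halves[(h >> (self.bits - depth)) & 1].items[h] = v
        # Repoint only the directory entries that referenced the split bucket.
        for i, b in enumerate(self.directory):
            if b is bucket:
                self.directory[i] = halves[(i >> (self.global_depth - depth)) & 1]


t = ExtendibleHashTable()
for h in (0b00010, 0b01110, 0b10101, 0b10011, 0b11010):
    t.insert(h, "val")
t.insert(0b10111, "B")   # bucket for prefix 1 is full: split, global depth 1 -> 2
t.insert(0b10100, "C")   # bucket for prefix 10 is full: split, global depth 2 -> 3
print(t.global_depth)                      # 3
print(sorted(t.directory[0b101].items))    # [20, 21, 23]
```

Only the overflowing bucket's entries move on a split; the bucket shared by the prefixes starting with 0 keeps local depth 1 and is simply referenced by more directory entries.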


EXTENDIBLE HASHING

Figure: a directory indexed by the leading bits of the hash, with a global depth counter and a local depth counter on each bucket.

Start (global depth 2, directory entries 00…, 01…, 10…, 11…):
→ Bucket with local depth 1: 00010…, 01110…
→ Bucket with local depth 2: 10101…, 10011…
→ Bucket with local depth 2: 11010…

Find A: hash(A) = 01110…
→ The first two bits select entry 01…, which points to the bucket holding 01110….

Insert B: hash(B) = 10111…
→ Entry 10… points to a bucket with room, which now holds 10101…, 10011…, 10111….

Insert C: hash(C) = 10100…
→ The bucket for 10… is full, so it is split. The global depth goes from 2 to 3 and the directory doubles to entries 000… through 111….
→ The split bucket's entries are redistributed on the third bit: 10011… goes to one bucket (local depth 3), while 10101… and 10111… go to the other (local depth 3), where 10100… is then inserted.
→ The other buckets are untouched and keep local depths 1 and 2.


LINEAR HASHING

The hash table maintains a pointer that tracks the next bucket to split.
→ When any bucket overflows, split the bucket at the pointer location.

Use multiple hashes to find the right bucket for a given key.

Can use different overflow criteria:
→ Space Utilization
→ Average Length of Overflow Chains
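
A minimal Python sketch of the split-pointer mechanism follows. It assumes integer keys with hash1(key) = key % n and hash2(key) = key % 2n, and a bucket capacity of three. A Python list longer than the capacity stands in for a bucket plus its overflow chain. The class and method names are illustrative, and deletes are left out.

```python
class LinearHashTable:
    def __init__(self, n=4, capacity=3):
        self.n = n                  # number of slots at the start of this round
        self.capacity = capacity    # entries per bucket before it overflows
        self.split = 0              # split pointer: next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _slot(self, key):
        slot = key % self.n                # hash1
        if slot < self.split:              # that bucket has already been split
            slot = key % (2 * self.n)      # hash2
        return slot

    def find(self, key):
        return key in self.buckets[self._slot(key)]

    def insert(self, key):
        bucket = self.buckets[self._slot(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:    # overflow anywhere triggers a split
            self._split_next()

    def _split_next(self):
        # Split the bucket at the pointer, not necessarily the one that overflowed.
        old = self.buckets[self.split]
        stay = [k for k in old if k % (2 * self.n) == self.split]
        move = [k for k in old if k % (2 * self.n) != self.split]
        self.buckets[self.split] = stay
        self.buckets.append(move)          # new slot index is split + n
        self.split += 1
        if self.split == self.n:           # every original bucket has been split
            self.n *= 2                    # hash2 becomes the new hash1
            self.split = 0


t = LinearHashTable(n=4, capacity=3)
for key in (8, 20, 5, 9, 13, 6, 7, 11):
    t.insert(key)
t.insert(17)        # slot 1 overflows, so slot 0 (at the split pointer) is split
print(t.buckets)    # [[8], [5, 9, 13, 17], [6], [7, 11], [20]]
print(t.find(20))   # True
```

The overflowing bucket (slot 1) is not the one that gets split; it keeps its overflow until the split pointer reaches it.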

LINEAR HASHING

Figure: a table with slots 0 to 3 (n = 4), hash1(key) = key % n, and the split pointer at slot 0.
→ Slot 0: 8, 20
→ Slot 1: 5, 9, 13
→ Slot 2: 6
→ Slot 3: 7, 11

Find 6: hash1(6) = 6 % 4 = 2

Insert 17: hash1(17) = 17 % 4 = 1
→ The bucket for slot 1 is full, so 17 goes into an overflow bucket. Overflow!
→ Split the bucket at the split pointer (slot 0): add slot 4 and a second hash function, hash2(key) = key % 2n. 8 stays in slot 0 and 20 moves to slot 4. The split pointer advances to slot 1.

Find 20: hash1(20) = 20 % 4 = 0
→ Slot 0 is above the split pointer and has already been split, so use hash2(20) = 20 % 8 = 4.

Find 9: hash1(9) = 9 % 4 = 1
→ Slot 1 is at the split pointer and has not been split, so hash1 is enough.

LINEAR HASHING

Splitting buckets based on the split pointer will eventually get to all overflowed buckets.
→ When the pointer reaches the last slot, delete the first hash function and move back to the beginning.

The pointer can also move backwards when buckets are empty.


LINEAR HASHING DELETES

Figure: the table after the split, with slots 0 to 4, hash1(key) = key % n, and hash2(key) = key % 2n.

Delete 20: hash1(20) = 20 % 4 = 0, then hash2(20) = 20 % 8 = 4
→ 20 is removed from slot 4, which is now empty.
→ The empty slot 4 is dropped along with hash2, and the split pointer moves back to slot 0.

Insert 21: hash1(21) = 21 % 4 = 1
→ Slot 1 already holds 5, 9, 13 with 17 in its overflow bucket; 21 joins the overflow bucket. Overflow!


CONCLUSION

Fast data structures that support O(1) look-ups that are used all throughout the DBMS internals.
→ Trade-off between speed and flexibility.

Hash tables are usually not what you want to use for a table index…


NEXT CLASS

B+Trees
→ aka "The Greatest Data Structure of All Time!"