Tree-Structured Indexes: R & G Chapter 9
Lecture 17
R & G Chapter 9
Abraham Lincoln
Administrivia
• Homeworks re-arranged
         I      II       III    IV         V            VI     VII
         Query  Sorting  Joins  Query Opt  Buffer Mgmt  FDs    ER     Total  Adjust  w/EC
Total    24     10       6      10         10           20     20     100
Avg      12.5   7.5      2.2    2.4        8.1          12.6   11.1   56.3   60.0
Min      4.0    0.0      0.0    0.0        0.0          0.0    3.0    23.5   26.1    26.1
Max      21.0   10.0     6.0    8.5        10.0         20.0   19.0   82.0   86.7    90.7
StdDev   3.5    2.9      1.6    2.1        2.1          4.4    3.7    10.6   10.8
Review
• Last week:
– How Internet Apps use Databases
– How to access Databases from Programs
– How to extend Databases with Programs
• This week:
– Tree Indexes (HW4 or 5?)
– Hash Indexes
• Next week:
– Transactions
– Concurrency Control
Today: B-Tree Indexes
• Upshot
– Don’t brag about being an ISAM expert on your resume
– Do understand how ISAM indexes work, and their tradeoffs with B+-trees
Range Searches
• "Find all students with gpa > 3.0"
– If data is in sorted file, do binary search to find first such
student, then scan to find others.
– Cost of binary search can be quite high: roughly log2 N page
reads for a file of N pages, each a potential random I/O.
• Simple idea: Create an "index" file.
– Level of indirection again!
[Figure: index file holding keys k1, k2, ..., kN]
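The "binary search, then scan" plan above can be sketched in a few lines. This is a minimal in-memory illustration, not how a DBMS lays out pages; the record list and the name `students_with_gpa_above` are hypothetical.

```python
from bisect import bisect_right

def students_with_gpa_above(records, threshold):
    """records: list of (gpa, name) tuples, already sorted by gpa."""
    gpas = [gpa for gpa, _ in records]       # the sort key of each record
    first = bisect_right(gpas, threshold)    # binary search: first gpa > threshold
    return records[first:]                   # sequential scan of the remainder

# Hypothetical data, sorted by gpa.
students = [(2.1, "Ann"), (3.0, "Bob"), (3.2, "Cy"), (3.9, "Di")]
print(students_with_gpa_above(students, 3.0))
```

On disk, each probe of the binary search touches a different page, which is why the slide looks for a cheaper way to locate the first qualifying record.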
ISAM
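[Figure: index entry layout: P0 K1 P1 K2 P2 ... Km Pm]
[Figure: ISAM structure. Non-leaf pages sit above the leaf pages; the leaf pages consist of primary pages and overflow pages.]
[Figure: example ISAM tree. Index pages: 20 33 51 63. Leaf pages: 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*]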
Comments on ISAM
Data Pages
After Inserting 23*, 48*, 41*, 42* ...
[Figure: Root: 40. Index pages: 20 33 51 63. Primary leaf pages: 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*]
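[Figure: the tree after later deletions. Index pages: 20 33 51 63. Leaf pages: 10* 15* 20* 27* 33* 37* 40* 46* 55* 63*. Key 51 still appears in the index although 51* is no longer in the leaves.]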
• Pros
– ????
• Cons
– ????
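To make the ISAM behaviour concrete, here is a minimal sketch under simplifying assumptions: the index is flattened to a single list of separator keys (the slides show a two-level index), each primary page holds two entries, and overflow chains are plain lists. The class `ISAM` and its method names are hypothetical.

```python
from bisect import bisect_right

class ISAM:
    def __init__(self, sorted_keys, page_size=2):
        self.page_size = page_size
        # Primary leaf pages are allocated once, at file creation.
        self.primary = [sorted_keys[i:i + page_size]
                        for i in range(0, len(sorted_keys), page_size)]
        self.overflow = [[] for _ in self.primary]
        # Static index: first key of every primary page except the first.
        self.separators = [page[0] for page in self.primary[1:]]

    def _page(self, key):
        return bisect_right(self.separators, key)

    def insert(self, key):
        i = self._page(key)
        if len(self.primary[i]) < self.page_size:
            self.primary[i].append(key)
            self.primary[i].sort()
        else:
            self.overflow[i].append(key)   # index pages are never modified

    def delete(self, key):
        i = self._page(key)
        for page in (self.primary[i], self.overflow[i]):
            if key in page:
                page.remove(key)
                return True
        return False

    def search(self, key):
        i = self._page(key)
        return key in self.primary[i] or key in self.overflow[i]

tree = ISAM([10, 15, 20, 27, 33, 37, 40, 46, 51, 55, 63, 97])
for k in (23, 48, 41, 42):
    tree.insert(k)
for k in (42, 51, 97):
    tree.delete(k)
```

In this sketch the separator 51 stays in the index after 51* is deleted, and the overflow list behind the page holding 40* and 46* only shrinks when its own entries are deleted. Both effects are worth weighing when filling in the pros and cons.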
B+ Tree: The Most Widely Used Index
• Insert/delete at log_F N cost; keep tree height-balanced.
(F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for root). Each
node contains d <= m <= 2d entries. The
parameter d is called the order of the tree.
• Supports equality and range-searches efficiently.
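A quick calculation shows why a log_F N cost is attractive. The sketch below counts the index levels needed above N leaf pages for a given fanout; the function name and the sample numbers (one million leaf pages, fanout 133) are illustrative assumptions, not figures from the slides.

```python
def index_levels(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages (integer arithmetic)."""
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels

print(index_levels(1_000_000, 133))  # levels with a wide fanout
print(index_levels(1_000_000, 2))    # levels if each node had only two children
```

With fanout 133 the loop stops after three multiplications (133^3 = 2,352,637), while a fanout of 2 needs twenty, which is the gap between a B+ tree descent and a binary search over the same pages.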
[Figure: index entries (direct search) above the data entries ("sequence set")]
Example B+ Tree
[Figure: Root: 13 17 24 30. Leaf entries: 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*]
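Equality and range search on the example tree can be sketched as follows. The grouping of leaf entries into pages is an assumption inferred from the root keys 13, 17, 24, 30, and the names `ROOT_KEYS`, `LEAVES`, `find_leaf`, `search` and `range_search` are hypothetical.

```python
from bisect import bisect_right

ROOT_KEYS = [13, 17, 24, 30]
LEAVES = [                      # assumed page boundaries, left to right
    [2, 3, 5, 7],
    [14, 16],
    [19, 20, 22],
    [24, 27, 29],
    [33, 34, 38, 39],
]

def find_leaf(key):
    """Child i holds keys k with ROOT_KEYS[i-1] <= k < ROOT_KEYS[i]."""
    return bisect_right(ROOT_KEYS, key)

def search(key):
    return key in LEAVES[find_leaf(key)]

def range_search(low):
    """All data entries >= low: descend once, then walk the leaf sequence."""
    i = find_leaf(low)
    result = [k for k in LEAVES[i] if k >= low]
    for leaf in LEAVES[i + 1:]:
        result.extend(leaf)
    return result

print(search(5), search(15), range_search(24))
```

The range query pays for one root-to-leaf descent and then reads leaves sequentially, which is what the sequence set at the leaf level is for.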
Example B+ Tree - Inserting 8*
[Figure: tree after inserting 8*. Root: 17. Index pages: 5 13 and 24 30. Leaf entries: 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*]
• Minimum occupancy is guaranteed in both leaf and index page splits.
• Note difference between copy-up and push-up; be sure you understand the reasons for this.
[Figure: leaf split. 5 is the entry to be inserted in the parent node. (Note that 5 is copied up and continues to appear in the leaf.) Leaf entries shown: 2* 3* 5* 7* 8*]
[Figure: index split. 17 is the entry to be inserted in the parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.) Index entries shown: 5 13 24 30]
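The difference between copy-up and push-up is easy to see in code. This is a minimal sketch that works on bare key lists of an overfull node (2d + 1 entries) and ignores child pointers; the function names are hypothetical.

```python
def split_leaf(entries, d):
    """Split an overfull leaf; the middle key is COPIED up and stays in the right leaf."""
    entries = sorted(entries)
    left, right = entries[:d], entries[d:]
    return left, right[0], right

def split_index(keys, d):
    """Split an overfull index node; the middle key is PUSHED up and kept in neither half."""
    keys = sorted(keys)
    return keys[:d], keys[d], keys[d + 1:]

# Inserting 8* into the leaf 2* 3* 5* 7* of a tree of order d = 2.
print(split_leaf([2, 3, 5, 7, 8], 2))
# The copied-up 5 then overflows the index node 13 17 24 30.
print(split_index([5, 13, 17, 24, 30], 2))
```

A leaf must keep every data entry, so its middle key is duplicated in the parent; an index key only directs traffic, so it can move up without leaving a copy behind. Both halves end up with at least d entries in either case.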
Deleting a Data Entry from a B+ Tree
• Start at root, find leaf L where entry belongs.
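[Figure: tree before deletion. Root: 17. Index pages: 5 13 and 24 30. Leaf entries: 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*]
[Figure: tree with 19* and 20* deleted. Root: 17. Index pages: 5 13 and 27 30. Leaf entries: 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*]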
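The change of the index key from 24 to 27 in the figures above is consistent with a leaf borrowing an entry from its right sibling after it underflows. Here is a minimal sketch of that one case, assuming order d = 2 and leaf contents inferred from the figures; the function name is hypothetical, and merging is deliberately left out.

```python
def delete_from_leaf(leaf, right_sibling, separator, key, d):
    """Delete key from leaf; return the separator key the parent should hold afterwards."""
    leaf.remove(key)
    if len(leaf) >= d:
        return separator                     # still at least half full
    if len(right_sibling) > d:
        leaf.append(right_sibling.pop(0))    # borrow the sibling's smallest entry
        return right_sibling[0]              # copy up the sibling's new low key
    raise NotImplementedError("merge case is not covered in this sketch")

leaf, sibling = [19, 20, 22], [24, 27, 29]   # assumed leaf contents
sep = delete_from_leaf(leaf, sibling, 24, 19, d=2)
sep = delete_from_leaf(leaf, sibling, sep, 20, d=2)
print(leaf, sibling, sep)
```

Deleting 19* leaves the page half full, so nothing else changes. Deleting 20* drops it below d entries, 24* moves over from the sibling, and the parent's separator becomes 27.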
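[Figure: a tree whose root holds 5 13 17 30]
[Figure: tree before re-distribution. Root: 22. Index pages: 5 13 17 20 and 30. Leaf entries: 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*]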
After Re-distribution
• Intuitively, entries are re-distributed by "pushing
through" the splitting entry in the parent node.
• It suffices to re-distribute index entry with key 20;
we’ve re-distributed 17 as well for illustration.
[Figure: Root: 17. Index pages: 5 13 and 20 22 30. Leaf entries: 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*]
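Pushing entries through the parent can be sketched as a rotation between two sibling index nodes. The function name `push_through` and the child labels A to G are hypothetical; the keys are those of the re-distribution example (left node 5 13 17 20, parent key 22, right node 30).

```python
def push_through(left_keys, left_children, separator,
                 right_keys, right_children, count=1):
    """Move `count` entries from the left index node to the right one via the parent key."""
    for _ in range(count):
        right_keys.insert(0, separator)                  # parent key moves down to the right
        right_children.insert(0, left_children.pop())    # its subtree follows
        separator = left_keys.pop()                      # left's largest key moves up
    return separator

left_keys, left_children = [5, 13, 17, 20], ["A", "B", "C", "D", "E"]
right_keys, right_children = [30], ["F", "G"]
new_root_key = push_through(left_keys, left_children, 22,
                            right_keys, right_children, count=2)
print(left_keys, new_root_key, right_keys)
```

With `count=1` only key 20 moves, which already restores minimum occupancy on the right; `count=2` also moves 17, matching the example.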
Prefix Key Compression
• Important to increase fan-out. (Why?)
• Key values in index entries only "direct traffic";
can often compress them.
– E.g., If we have adjacent index entries with search
key values Dannon Yogurt, David Smith and
Devarakonda Murthy, we can abbreviate David Smith
to Dav. (The other keys can be compressed too ...)
• Is this correct? Not quite! What if there is a data entry
Davey Jones? (Can only compress David Smith to Davi)
• In general, while compressing, must leave each index entry
greater than every key value (in any subtree) to its left.
• Insert/delete must be suitably modified.
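The compression rule can be sketched as "take the shortest prefix of the index key that is still greater than the largest key to its left". This is a minimal illustration using plain string comparison; the function name `shortest_separator` is hypothetical.

```python
def shortest_separator(left_max, key):
    """Shortest prefix of `key` that is strictly greater than `left_max`.

    left_max is the largest key value in any subtree to the left of the
    index entry. Any prefix of `key` is <= `key`, so keys to the right
    are still routed correctly.
    """
    for n in range(1, len(key) + 1):
        prefix = key[:n]
        if prefix > left_max:
            return prefix
    return key  # no shorter prefix works (or the input was not ordered)

print(shortest_separator("Dannon Yogurt", "David Smith"))
print(shortest_separator("Davey Jones", "David Smith"))
```

With only Dannon Yogurt to the left, three letters suffice; once Davey Jones is present in the left subtree, a fourth letter is needed.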
Bulk Loading of a B+ Tree
• If we have a large collection of records, and we
want to create a B+ tree on some field, doing so
by repeatedly inserting records is very slow.
– Also leads to minimal leaf utilization --- why?
• Bulk Loading can be done much more efficiently.
• Initialization: Sort all data entries, insert pointer
to first (leaf) page in a new (root) page.
[Figure: an empty root page pointing to the first of the sorted pages of data entries, which are not yet in the B+ tree: 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*]
Bulk Loading (Contd.)
• (A split may go up the right-most path to the root.)
• Much faster than repeated inserts, especially when one
considers locking!
[Figure: Root: 10 20]
[Figure: Root: 20. Second level: 10 35. Third level: 6 12 23 38. Data entry pages not yet in B+ tree. Leaf entries: 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*]
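Bulk loading can be sketched as "sort once, pack the leaves, then build the index on top". Note the swap: the slide describes inserting index entries along the right-most path, whereas this simplified sketch builds one whole level at a time, so it does not reproduce the intermediate trees in the figures. The names, the dict-based node layout, the leaf capacity of 2 and the fanout of 3 are all assumptions, and the input is assumed to be non-empty.

```python
def bulk_load(keys, leaf_capacity=2, fanout=3):
    """Build a tree bottom-up from a non-empty key list; returns the root node."""
    keys = sorted(keys)
    # Each level is a list of (smallest key in subtree, node) pairs.
    level = [(keys[i], {"entries": keys[i:i + leaf_capacity]})
             for i in range(0, len(keys), leaf_capacity)]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            node = {
                "keys": [low for low, _ in group[1:]],      # one separator per extra child
                "children": [child for _, child in group],
            }
            parents.append((group[0][0], node))
        level = parents
    return level[0][1]

def leaf_entries(node):
    """Collect data entries left to right, to check that the order is preserved."""
    if "entries" in node:
        return list(node["entries"])
    out = []
    for child in node["children"]:
        out.extend(leaf_entries(child))
    return out

data = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
root = bulk_load(data)
print(root["keys"])
```

Every page is written once and leaves are filled in order, in contrast to repeated inserts, each of which descends from the root. The last node on each level may end up underfull in this sketch.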
Summary of Bulk Loading