Indexing - II

Uploaded by

f20211140

Index structures/files

• Secondary Index
• Multi-Level Index
• B+-Tree
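Index Structure
• The location mechanism facilitates finding the index entry for a search-key value S.
• Once the index entry is found, the row can be directly accessed.
[Figure: search-key value S enters the location mechanism, which leads to the index entry for S among the index entries; that entry points to the row (S, ...).]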
Dense indexes
• Every key from the data file is represented in the index
• Entries are in the same order as the records of the file
• Binary search can be used to find the required <key, pointer> entry
• Number of blocks searched is about log2 n instead of n/2 on average (n = number of index blocks)
• Example: 1,000,000 tuples, 10 tuples per 4096-byte block, key field 30 bytes, pointer 8 bytes
  o Data file takes 100,000 blocks, about 400 MB
  o Index file takes 10,000 blocks at about 100 entries/block
  o Binary search involves about log2 10,000 ≈ 13 index blocks, which may be held in main memory (MM)
• Memory use can be reduced by keeping only the most frequently searched blocks in memory
• Hence a record can be retrieved with fewer than 14 disk I/Os
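The numbers in this example can be reproduced with a short Python sketch. The function name and its parameters are illustrative assumptions, not part of the slides; the entries-per-block value is passed in explicitly because the slide rounds 4096 / (30 + 8) ≈ 107 down to 100.

```python
import math

def dense_index_estimate(num_records, records_per_block, block_bytes,
                         entries_per_block):
    """Estimate file sizes for a dense index (one index entry per record)."""
    data_blocks = math.ceil(num_records / records_per_block)
    index_blocks = math.ceil(num_records / entries_per_block)
    return {
        "data_blocks": data_blocks,
        "data_bytes": data_blocks * block_bytes,
        "index_blocks": index_blocks,
        # Binary search over the index inspects about log2(index_blocks) blocks.
        "binary_search_blocks": math.log2(index_blocks),
    }

# The slide's example: 1,000,000 tuples, 10 per 4096-byte block, ~100 entries/block.
est = dense_index_estimate(1_000_000, 10, 4096, entries_per_block=100)
```

Here `est["binary_search_blocks"]` falls between 13 and 14, which is where the slide's "about 13 index blocks plus one data block" estimate comes from.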
Sparse indexes
• Useful if a dense index is too large
• Uses less space at the cost of possibly more search time
• Generally one record per block, usually the first, is represented
• Sparse index for the previous example would take only 1,000 blocks, 4 MB
• But it cannot quickly answer the query "does there exist a record with key value K?"
  o That requires one disk I/O plus a search within the data block
• Search for K: find the index entry with the largest key ≤ K
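The search rule above (largest indexed key ≤ K, then look inside that block) can be sketched in Python with the standard `bisect` module. The in-memory lists standing in for the index and the data blocks are assumptions made for illustration.

```python
from bisect import bisect_right

def sparse_lookup(index_keys, blocks, k):
    """Look up key k through a sparse index.

    index_keys[i] is the first key stored in blocks[i]; both are sorted.
    Returns (block_number, found); block_number is None when k is smaller
    than every indexed key.
    """
    pos = bisect_right(index_keys, k) - 1   # entry with the largest key <= k
    if pos < 0:
        return None, False
    # Reading blocks[pos] corresponds to the single data-block I/O on the slide.
    return pos, k in blocks[pos]

# Hypothetical data file: three blocks of three keys each.
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
index_keys = [block[0] for block in blocks]
```

Note that the existence question for a key such as 55 can only be answered after the block has been read, which is the drawback the slide mentions.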
Sparse vs. Dense Index
• Dense index: one index entry for each data record
  o An unclustered index must be dense
  o A clustered index need not be dense
• Sparse index: one index entry for each block of the data file
[Figure: a data file with columns Id, Name, Dept, sorted on Id; a sparse, clustered index sorted on Id; and a dense, unclustered index sorted on Name.]
Clustered vs. Unclustered Index

• Clustered (main/primary) index: index entries and rows are ordered in the same way
  o An integrated storage structure is always clustered
  o There can be at most one clustered index on a table
• Unclustered (secondary) index: index entries and rows are not sorted on the same search key
  o There can be many secondary indices on a table
Clustering and Non-clustering
• Non-clustering indices have to be dense.
• Indices offer substantial benefits when searching for records.
• When a file is modified, every index on the file must be updated. Updating indices imposes overhead on database modification.
• A sequential scan using a clustering index is efficient, but a sequential scan using a non-clustering index is expensive: each record access may fetch a new block from disk.
  o A block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for memory access
Clustered Index
• Good for range searches
  o Use the location mechanism to locate the index entry at the start of the range
  o This locates the first data record.
  o Subsequent data records are contiguous if the index is clustered (not so if unclustered)
• Minimizes page transfers and maximizes likelihood of cache hits
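A rough way to see why clustering matters for range searches is to count data-block reads. The Python sketch below is an assumption-laden estimate, not a formula from the slides: it assumes the range starts on a block boundary for the clustered case, and takes the worst case of one block fetch per record for the unclustered case.

```python
import math

def range_scan_block_reads(matching_records, records_per_block, clustered):
    """Estimate data-block reads for a range search (index reads not counted)."""
    if clustered:
        # Matching records sit next to each other, so blocks are read once each.
        return math.ceil(matching_records / records_per_block)
    # Worst case: every matching record lives in a different block.
    return matching_records

clustered_reads = range_scan_block_reads(1_000, 10, clustered=True)
unclustered_reads = range_scan_block_reads(1_000, 10, clustered=False)
```

With 10 records per block, the same 1,000-record range costs 100 block reads through a clustered index and up to 1,000 through an unclustered one.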
Types of Single-Level Indexes
• Secondary Index
  o A secondary index provides a secondary means of accessing a file for which some primary access already exists.
  o The secondary index may be on a field which is a candidate key and has a unique value in every record, or on a nonkey field with duplicate values.
  o The index is an ordered file with two fields.
    - The first field is of the same data type as some nonordering field of the data file that is an indexing field.
    - The second field is either a block pointer or a record pointer.
  o There can be many secondary indexes (and hence, indexing fields) for the same file.
  o Includes one entry for each record in the data file; hence, it is a dense index
[Figure: a dense secondary index (with block pointers) on a nonordering key field of a file.]
Secondary indexes
• Example query:
    SELECT name, address
    FROM MovieStar
    WHERE birthdate = '1952-01-01';
• An index that supports it:
    CREATE INDEX BDIndex ON MovieStar(birthdate);
• Secondary indexes are always dense
• A second-level index built on top of one could be sparse
• Secondary indexes usually contain duplicate key values
Secondary Indices Example

• Secondary index on balance field of account
• Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
[Figure: a secondary index with sorted key values from 10 to 60, duplicates included, whose pointers lead to records scattered across the blocks of the unsorted data file.]
• Pointers in one index block may refer to multiple data blocks
  o Results in a larger number of disk I/Os
  o Unavoidable problem
• Using a 'bucket file' between index file and data file
  o Single entry <k,p> for each value 'k', where p points to the location in the bucket file containing the pointers to all records with value 'k'
  o Avoids wasting space on storing the same value 'k' multiple times
Definition of Bucket

• Bucket: another form of storage unit that can store one or more records of information.
• Buckets are used if the search key value cannot form a candidate key, or if the file is not stored in search key order.
[Figure: the index file (key values 10 to 60, one entry per value) points into the bucket file, whose pointers lead to the matching records in the data file.]

• Application of 'bucket file'
  o It can help answer queries efficiently using intersection of pointer sets
  o Example
      SELECT title
      FROM Movie
      WHERE StudioName = 'Disney' AND year = 1995;
  o This reduces the number of disk I/Os
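[Figure: Movie tuples in the middle; the studio index leads to the buckets for studio (Disney) and the year index leads to the buckets for year (1995); both sets of buckets point into the Movie tuples.]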
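The pointer-set intersection used for the Disney/1995 query can be sketched in Python as follows. The dictionaries standing in for the two bucket files, and the integer "pointers", are hypothetical illustration data.

```python
def intersect_buckets(studio_buckets, year_buckets, studio, year):
    """Return the record pointers that appear in both buckets.

    Only the two bucket files are consulted; the data file is read
    afterwards, and only for the pointers returned here.
    """
    studio_ptrs = set(studio_buckets.get(studio, []))
    year_ptrs = set(year_buckets.get(year, []))
    return sorted(studio_ptrs & year_ptrs)

# Hypothetical buckets: key value -> list of record pointers.
studio_buckets = {"Disney": [3, 7, 12, 20], "Fox": [1, 5]}
year_buckets = {1995: [5, 7, 20, 31], 1994: [3]}
result = intersect_buckets(studio_buckets, year_buckets, "Disney", 1995)
```

In this made-up data only two of the four Disney records also belong to 1995, so two data-file reads are needed instead of four.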


Multi-level indexes
• Used when an index is so large that even binary search takes too many disk I/Os
• Define a second-level index: an index on the index
• This can continue to a multi-level index structure
• Second and higher level indexes must be sparse
• A second-level index on the sparse index of the previous example (1,000 blocks) would take only 10 blocks, 40 KB
• Search involves 2 disk I/Os and searching in the block
Multilevel Index

• If an index does not fit in memory, access becomes expensive.
• To reduce the number of disk accesses to index records, treat the index kept on disk as a sequential file and construct a sparse index on it.
  o outer index: a sparse index on the main index
  o inner index: the main index file
• If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion or deletion from the file.
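How many levels are needed, and how large each one is, follows from repeated division by the number of entries per block. A minimal Python sketch, with a hypothetical function name and the 100 entries/block figure taken from the earlier example:

```python
import math

def index_levels(first_level_blocks, entries_per_block):
    """Block counts per level, from the first-level index up to a one-block top.

    Each higher level is a sparse index holding one entry per block of the
    level below it.
    """
    levels = [first_level_blocks]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / entries_per_block))
    return levels

sparse_case = index_levels(1_000, 100)    # first level = sparse index
dense_case = index_levels(10_000, 100)    # first level = dense index
```

With 4 KB blocks, the second level comes to 10 blocks (40 KB) over the sparse index and 100 blocks (400 KB) over the dense index.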
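Multilevel Index (Cont.)
[Figure: a two-level index. Entries of the outer index point to blocks of the inner index (index block 0, index block 1, ...), whose entries point to the data blocks (data block 0, data block 1, ...).]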
Multi-level indexes
• A second-level index on the dense index of the earlier example (10,000 blocks) would take only 100 blocks, 400 KB
• Search involves 6 disk I/Os and searching in the block
A Two-level Primary Index
Estimating Costs
• For simplicity we estimate the cost of an operation by counting the number of blocks that are read from or written to disk.
• We ignore the possibility of blocked access, which could significantly lower the cost of I/O.
• We assume that each relation is stored in a separate file with B blocks and R records per block.


Choosing Indexing Technique
• Five factors are involved when choosing the indexing technique:
  o access type
  o access time
  o insertion time
  o deletion time
  o space overhead
Indexing Definitions
• Access type: the kind of access supported, e.g. finding records with a specified value or with values in a specified range.
• Access time: time required to locate the data.
• Insertion time: time required to insert the new data.
• Deletion time: time required to delete the data.
• Space overhead: the additional space occupied by the added data structure.
Index Evaluation Metrics
• Access time for:
  o Equality searches: records with a specified value in an attribute
  o Range searches: records with an attribute value falling within a specified range
• Insertion time
• Deletion time
• Space overhead
B+-Tree Index
A B+-tree is a rooted tree satisfying the following properties:
o All paths from root to leaf are of the same length
o Each node that is not a root or a leaf has between ⌈n/2⌉ and n children. [Non-leaf node]
o A leaf node has between ⌈(n–1)/2⌉ and n–1 values
o Special cases:
  o If the root is not a leaf, it has at least 2 children.
  o If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
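The occupancy rules can be tabulated for a given n with a few lines of Python. The function and the dictionary keys are names chosen here for illustration only.

```python
import math

def bplus_bounds(n):
    """(min, max) occupancy per node type for a B+-tree with parameter n."""
    return {
        "internal_children": (math.ceil(n / 2), n),
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }

bounds_n4 = bplus_bounds(4)
bounds_n5 = bplus_bounds(5)
```

For n = 4, for instance, an internal node has 2 to 4 children and a leaf holds 2 to 3 values.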


B+-Tree Node Structure
• Typical node: P1, K1, P2, ..., Pn–1, Kn–1, Pn
  o Ki are the search-key values
  o Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search-keys in a node are ordered
  K1 < K2 < K3 < . . . < Kn–1
Example of B+-Tree
B+ Tree: Most Widely Used Index
• Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for root). Each node contains m entries, with n <= m <= 2n. The parameter n is called the order of the tree.
• Supports equality and range searches efficiently.
[Figure: index entries (direct search) on top of data entries (the "sequence set").]
Dynamic Multilevel Indexes Using B-Trees and B+-Trees

• Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries
• These data structures are variations of search trees that allow efficient insertion and deletion of search values.
• In B-tree and B+-tree data structures, each node corresponds to a disk block
• Each node is kept between half-full and completely full
Dynamic Multilevel Indexes Using B-Trees and B+-Trees (contd.)

• An insertion into a node that is not full is quite efficient; if a node is full, the insertion causes a split into two nodes
• Splitting may propagate to other tree levels
• A deletion is quite efficient if a node does not become less than half full
• If a deletion causes a node to become less than half full, it must be merged with neighboring nodes
Difference between B-tree and B+-tree

• In a B-tree, pointers to data records exist at all levels of the tree
• In a B+-tree, all pointers to data records exist at the leaf-level nodes
• A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree
B-tree structures. (a) A node in a B-tree with q – 1 search
values. (b) A B-tree of order p = 3. The values were
inserted in the order 8, 5, 1, 7, 3, 12, 9, 6.
The nodes of a B+-tree. (a) Internal node of a B+-tree with
q –1 search values. (b) Leaf node of a B+-tree with q – 1
search values and q – 1 data pointers.
Observations about B+-trees
o Since the inter-node connections are done by pointers, "logically" close blocks need not be "physically" close.
o The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
o The B+-tree contains a relatively small number of levels
  o Level below root has at least 2 * ⌈n/2⌉ values
  o Next level has at least 2 * ⌈n/2⌉ * ⌈n/2⌉ values
  o ... etc.
  o If there are K search-key values in the file, the tree height is no more than ⌈log_⌈n/2⌉(K)⌉
  o Thus searches can be conducted efficiently.
o Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time.
Queries on B+-Trees
• Find all records with a search-key value of k.
  1. Set N = root
  2. Repeat
     a. Examine N for the smallest search-key value > k.
     b. If such a value exists, assume it is Ki. Then set N = Pi
     c. Otherwise k ≥ Kn–1. Set N = Pn
     Until N is a leaf node
  3. If for some i, key Ki = k, follow pointer Pi to the desired record or bucket.
  4. Else no record with search-key value k exists.
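The lookup procedure above can be written out as a small Python sketch. The `Node` class, the `leaf` helper and the `"rec…"` strings standing in for record pointers are assumptions for illustration; leaf-to-leaf sibling pointers are left out. The sample tree uses the keys from the "Example B+ Tree" slide.

```python
from dataclasses import dataclass

@dataclass
class Node:
    keys: list       # K1 < K2 < ... in sorted order
    pointers: list   # children (non-leaf) or record/bucket pointers (leaf)
    is_leaf: bool = False

def find(root, k):
    """Return the record pointer for key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        for i, key in enumerate(node.keys):
            if key > k:                      # smallest search-key value > k
                node = node.pointers[i]      # follow Pi
                break
        else:
            node = node.pointers[len(node.keys)]   # k >= last key: last pointer
    for i, key in enumerate(node.keys):
        if key == k:
            return node.pointers[i]
    return None

def leaf(*keys):
    """Build a leaf whose pointers are placeholder record names."""
    return Node(list(keys), [f"rec{k}" for k in keys], is_leaf=True)

root = Node([13, 17, 24, 30],
            [leaf(2, 3, 5, 7), leaf(14, 16), leaf(19, 20, 22),
             leaf(24, 27, 29), leaf(33, 34, 38, 39)])
```

Searching this tree for 15 ends in the leaf holding 14 and 16, so the search reports that 15 is absent without visiting any other leaf.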
Example B+ Tree
• Search begins at root, and key comparisons direct it to a leaf.
• Search for 5*, 15*, all data entries >= 24* ...

Root: 13 17 24 30
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Based on the search for 15*, we know it is not in the tree!


Query on B+ Trees
• In processing a query, a path is traversed in the tree from the root to some leaf node.
• If there are K search-key values in the file, the path is no longer than ⌈log_⌈n/2⌉(K)⌉.
• A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry).
• With 1 million search key values and n = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million search key values: around 20 nodes are accessed in a lookup
  o The difference is significant since every node access may need a disk I/O, costing around 20 milliseconds!
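The path-length figures quoted here can be checked numerically. The helper below is a hypothetical name; it simply evaluates the ceiling of the logarithm with base ⌈n/2⌉.

```python
import math

def max_path_length(num_keys, n):
    """Upper bound on nodes visited per lookup: ceil(log base ceil(n/2) of K)."""
    return math.ceil(math.log(num_keys, math.ceil(n / 2)))

bplus_nodes = max_path_length(1_000_000, 100)        # base 50
binary_nodes = math.ceil(math.log2(1_000_000))       # balanced binary tree
```

At roughly 20 milliseconds per disk I/O, 4 node accesses versus 20 is the difference between about 80 ms and about 400 ms per lookup in the worst case.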
B+ Trees in Practice
• Typical order: 100. Typical fill-factor: 67%.
  o average fanout = 133
• Typical capacities:
  o Height 4: 133^4 = 312,900,721 records
  o Height 3: 133^3 = 2,352,637 records
• Can often hold top levels in buffer pool:
  o Level 1 = 1 page = 8 KBytes
  o Level 2 = 133 pages ≈ 1 MByte
  o Level 3 = 17,689 pages ≈ 133 MBytes
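These capacities are plain powers of the fanout, which a short Python sketch makes explicit. The constant and function names are illustrative; the fanout of 133 and the 8 KByte page size are the values quoted on the slide.

```python
FANOUT = 133     # average fanout quoted on the slide
PAGE_KIB = 8     # page size quoted on the slide, in KiB

def capacity(height):
    """Number of records reachable through a tree of the given height."""
    return FANOUT ** height

def pages_at_level(level):
    """Number of pages at a level, counting the root as level 1."""
    return FANOUT ** (level - 1)

def level_size_kib(level):
    """Buffer-pool space needed to cache one whole level, in KiB."""
    return pages_at_level(level) * PAGE_KIB
```

Computed at exactly 8 KiB per page, level 2 needs 1,064 KiB and level 3 about 138 MiB, close to the slide's rounded figures.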
Inserting a Data Entry into a B+ Tree

• Find correct leaf L.
• Put data entry onto L.
  o If L has enough space, done!
  o Else, must split L (into L and a new node L2)
    - Redistribute entries evenly, copy up middle key.
    - Insert index entry pointing to L2 into parent of L.
• This can happen recursively
  o To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)
• Splits "grow" the tree; a root split increases height.
  o Tree growth: gets wider or one level taller at top.
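The contrast between copy-up (leaf split) and push-up (index split) is easiest to see on plain lists. The two Python helpers below are a sketch under the assumption that the overfull node is passed in already sorted and already containing the new entry; they are not a full insertion routine.

```python
def split_leaf(entries):
    """Split an overfull leaf; return (left, right, key_copied_up).

    The middle key is COPIED up: it still appears in the right leaf.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]

def split_index(keys, children):
    """Split an overfull index node; return (left, right, key_pushed_up).

    The middle key is PUSHED up: it appears in neither half afterwards.
    """
    mid = len(keys) // 2
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, right, keys[mid]

# Inserting 8* into the leaf 2* 3* 5* 7* of the example tree:
leaf_result = split_leaf([2, 3, 5, 7, 8])
# The parent 13 17 24 30 then receives 5 and overflows as well:
index_result = split_index([5, 13, 17, 24, 30],
                           ["c0", "c1", "c2", "c3", "c4", "c5"])
```

The leaf split keeps 5 in the right-hand leaf while also sending it to the parent; the index split sends 17 upward and drops it from both halves.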
Updates on B+-Trees: Insertion

B+-Tree before and after insertion of “Clearview”


Inserting 8* into Example B+ Tree

• Observe how minimum occupancy is guaranteed in both leaf and index page splits.
• Note the difference between copy-up and push-up; be sure you understand the reasons for this.
[Figure, leaf split: the leaf 2* 3* 5* 7* plus the new entry 8* splits into 2* 3* and 5* 7* 8*. Entry to be inserted in parent node: 5. (Note that 5 is copied up and continues to appear in the leaf.)]
[Figure, index split: the index node 5 13 17 24 30 splits into 5 13 and 24 30. Entry to be inserted in parent node: 17. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)]
Example B+ Tree After Inserting 8*

Root: 17
Index nodes: [5 13] [24 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Notice that the root was split, leading to an increase in height.
• In this example, we could avoid the split by redistributing entries; however, this is usually not done in practice.
An example of insertion in a B+-tree with q = 3 and p_leaf = 2, inserting the values 8, 5, 1, 7, 3, 12, 9, 6 in that order.
Deleting a Data Entry from a B+ Tree

• Start at root, find leaf L where entry belongs.
• Remove the entry.
  o If L is at least half-full, done!
  o If L has only d-1 entries (d is the minimum occupancy, i.e. the order of the tree),
    - Try to redistribute, borrowing from sibling (adjacent node with same parent as L).
    - If redistribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.
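The choice between redistribution and merging at the leaf level can be sketched as below. This is a simplified illustration, not a complete deletion algorithm: it assumes the underfull leaf has a right sibling, handles leaves only, and leaves the update of the parent to the caller. The function name and return format are made up for the example.

```python
def fix_leaf_underflow(leaf, right_sibling, d):
    """Repair a leaf that dropped to d-1 entries, using its right sibling.

    Returns (action, left_entries, right_entries, new_separator).
    After a merge there is no right node and no separator (both None),
    and the caller must delete the separator entry from the parent.
    """
    if len(right_sibling) > d:
        # Sibling can spare entries: redistribute evenly across both leaves.
        combined = leaf + right_sibling
        mid = len(combined) // 2
        left, right = combined[:mid], combined[mid:]
        return "redistribute", left, right, right[0]   # right[0] is copied up
    # Sibling is at minimum occupancy: merge the two leaves into one.
    return "merge", leaf + right_sibling, None, None

# Example tree with d = 2: after deleting 19* and 20*, the leaf holds only 22*.
after_20 = fix_leaf_underflow([22], [24, 27, 29], d=2)
# Deleting 24* next leaves 22* again, but now the sibling 27* 29* cannot lend.
after_24 = fix_leaf_underflow([22], [27, 29], d=2)
```

The first call reproduces the redistribution in which 27 becomes the new separator; the second reproduces the merge into a single leaf 22* 27* 29*.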
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...

Root: 17
Index nodes: [5 13] [27 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

• Deleting 19* is easy.
• Deleting 20* is done with redistribution. Notice how the middle key (27) is copied up.
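... And Then Deleting 24*

• Must merge.
• Observe the 'toss' of the index entry (on right), and the 'pull down' of the index entry (below).
[Figure, right: after the leaf merge, the index node holds only 30, above the leaves [22* 27* 29*] and [33* 34* 38* 39*].]
[Figure, below: the resulting tree.]

Root: 5 13 17 30
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]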


An example of deletion from a B+-tree.
