Dr Renata Borovica-Gajic
Lecture 11
Query Processing Part I
[Figure: DBMS architecture. The query processing module (parser/compiler, optimizer, executor: today and next time) sits on top of the storage module (buffer pool manager, disk space manager), the concurrency control module (lock manager), and the crash recovery module (log manager).]
[Figure: query processing pipeline. A query such as SELECT * FROM Blah B WHERE B.blah = "foo" goes through the Query Parser, then the Query Optimizer (which consults the schema and statistics; covered next week), and the resulting query plan is executed by the Plan Evaluator.]
Note: usually there is a heuristics-based rewriting step before the cost-based steps.
Relational Operations
• Sailors (S):
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages
– N = NPages(S) = 500, pS = NTuplesPerPage(S) = 80
– NTuples(S) = 500*80 = 40000
• Reserves (R):
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages
– M = NPages(R) = 1000, pR = NTuplesPerPage(R) = 100
– NTuples(R) = 1000*100 = 100000
• Example: let's say that 10% of Reserves tuples qualify, and that the index occupies 50 pages
• RF = 10% = 0.1, NPages(I) = 50, NPages(R) = 1000, NTuplesPerPage(R) = 100
• Cost:
1. Clustered index:
Cost = (NPages(I) + NPages(R)) * RF
Cost = (50 + 1000) * 0.1 = 105 (I/O)  (cheapest access path)
2. Unclustered index:
Cost = (NPages(I) + NTuples(R)) * RF
Cost = (50 + 100000) * 0.1 = 10005 (I/O)
3. Heap scan:
Cost = NPages(R) = 1000 (I/O)
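These three formulas are easy to check in code. A minimal sketch, assuming nothing beyond the statistics above (the function names are illustrative, not part of any DBMS API):

```python
def clustered_index_cost(n_index_pages, n_pages, rf):
    # index pages and data pages touched shrink with the reduction factor
    return (n_index_pages + n_pages) * rf

def unclustered_index_cost(n_index_pages, n_tuples, rf):
    # worst case: one page I/O per qualifying tuple
    return (n_index_pages + n_tuples) * rf

def heap_scan_cost(n_pages):
    # full scan reads every page regardless of RF
    return n_pages

# Reserves: NPages(I) = 50, NPages(R) = 1000, NTuples(R) = 100000, RF = 0.1
print(clustered_index_cost(50, 1000, 0.1))       # 105.0 I/Os (cheapest access path)
print(unclustered_index_cost(50, 100000, 0.1))   # 10005.0 I/Os
print(heap_scan_cost(1000))                      # 1000 I/Os
```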
General Selection Conditions
3. Apply the predicates that don't match the index (if any) later on
– These predicates are used to discard some retrieved tuples, but do not affect the number of tuples/pages fetched (nor the total cost)
– In this case, selection over the other predicates is said to be done "on-the-fly" (a minimal sketch follows below)
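A small sketch of the idea, with hypothetical helper names: the index handles its matching predicate, and the remaining predicates are applied in memory to each retrieved tuple, so they add no I/O.

```python
def select_on_the_fly(index_lookup, index_pred, other_preds):
    """index_lookup is whatever access path was chosen; other_preds are
    arbitrary boolean functions checked on the fly."""
    for t in index_lookup(index_pred):            # all I/O happens here
        if all(pred(t) for pred in other_preds):  # no extra I/O
            yield t
```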
Agenda
• Overview
• Selections
• Projections
[Figure: sorted projected tuples (11,80), (12,10), (12,10), (12,75), (13,20), (13,20), (13,75): after sorting, duplicates sit next to each other and are easy to discard.]
[Figure: merging with B memory buffers: B-1 input buffers (INPUT 1 ... INPUT B-1) stream sorted runs in from disk, and one OUTPUT buffer streams the merged result back to disk.]
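As a rough illustration of that merge step, the sketch below uses Python's heapq.merge as a stand-in for the page-at-a-time buffer management; each sorted run plays the role of one input buffer, and the generator's output plays the role of the single output buffer.

```python
import heapq

def merge_runs(*runs):
    """Merge any number of sorted runs into one sorted stream ((B-1)-way merge)."""
    return heapq.merge(*runs)

# Usage with the three sorted 4-page runs from the example below:
run1 = [2, 3, 4, 4, 6, 7, 8, 9]
run2 = [1, 1, 2, 3, 5, 6, 6, 9]
run3 = [2, 3, 3, 4, 5, 5, 6, 8]
print(list(merge_runs(run1, run2, run3)))  # 1, 1, 2, 2, 2, 3, 3, ... fully sorted
```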
Readings: Chapter 13, Ramakrishnan & Gehrke, Database Management Systems
External Merge Sort: Example
[Figure: animation of a two-pass external merge sort over a 12-page input file, 2 values per page: 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 9,2 | 6,1 | 8,2 | 3,4 | 5,5 | 6,3]
• Pass 1 (B = 4 buffers): read 4 pages at a time, sort them in main memory, and write them back as sorted 4-page runs:
– Run 1: 2,3 | 4,4 | 6,7 | 8,9
– Run 2: 1,1 | 2,3 | 5,6 | 6,9
– Run 3: 2,3 | 3,4 | 5,5 | 6,8
• Pass 2: merge the three runs using B-1 = 3 input buffers and 1 output buffer; whenever the output buffer fills, write it to disk, and whenever an input buffer empties, refill it from its run
• Result: the fully sorted file 1,1 | 2,2 | 2,3 | 3,3 | 3,4 | 4,4 | 5,5 | 5,6 | 6,6 | 6,7 | 8,8 | 9,9
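To check the pass counts used in this lecture, here is a small sketch based on the standard external merge sort formula, NumPasses = 1 + ceil(log_(B-1)(ceil(N/B))) with total I/O = 2 * N * NumPasses; the formula itself comes from the textbook reading rather than this slide.

```python
import math

def num_passes(n_pages, b_buffers):
    """One run-forming pass, then (B-1)-way merge passes until one run remains."""
    runs = math.ceil(n_pages / b_buffers)   # sorted runs after pass 1
    if runs <= 1:
        return 1
    return 1 + math.ceil(math.log(runs, b_buffers - 1))

def sort_io_cost(n_pages, b_buffers):
    # every pass reads and writes every page
    return 2 * n_pages * num_passes(n_pages, b_buffers)

print(num_passes(12, 4), sort_io_cost(12, 4))  # 2 passes, 48 I/Os for the 12-page example
print(num_passes(250, 20))                     # 2 passes for the projection example below
```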
Cost = ReadTable + WriteProjectedPages + SortingCost + ReadProjectedPages
– ReadTable: read the entire table and keep only the projected attributes
– WriteProjectedPages: write the pages containing only the projected attributes to disk
– SortingCost: sort the projected pages with external merge sort
– ReadProjectedPages: read the sorted projected pages to discard adjacent duplicates
Where:
– WriteProjectedPages = ReadProjectedPages = NPages(R) * PF
– PF: the projection factor, i.e. how much we are projecting as a ratio with respect to all attributes (e.g. keeping ¼ of the attributes, or 10% of all attributes)
– SortingCost = 2 * NumPasses * ReadProjectedPages, since every pass reads and writes every projected page
• Example: Let’s say that we project ¼ of all attributes, and let’s say
that we have 20 pages in memory
• PF = 1/4 = 0.25, NPages(R) = 1000
• With 20 memory pages we can sort in 2 passes
Cost = ReadTable +
WriteProjectedPages +
SortingCost +
ReadProjectedPages
= 1000 + 0.25 * 1000 + 2*2*250 + 250 = 2500 (I/O)
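The same arithmetic as a small sketch (passes = 2 comes from the external sort formula above; the names are illustrative):

```python
def sort_projection_cost(n_pages, pf, passes):
    """I/O cost of sort-based projection with duplicate elimination."""
    projected = n_pages * pf          # pages left after dropping unwanted attributes
    return (n_pages                   # ReadTable
            + projected               # WriteProjectedPages
            + 2 * passes * projected  # SortingCost: read + write every page per pass
            + projected)              # ReadProjectedPages: scan to drop adjacent duplicates

print(sort_projection_cost(1000, 0.25, passes=2))  # 2500.0 I/Os, as above
```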
• Hashing-based projection
– 1. Scan R, extract only the needed attributes
– 2. Hash the data into buckets
• Apply hash function h1 to choose one of B-1 output buffers
– 3. Remove duplicates within each bucket
• Two tuples from different partitions are guaranteed to be distinct
[Figure: partitioning phase. The original relation is read through one INPUT buffer; hash function h1 sends each projected tuple to one of B-1 output buffers, producing B-1 partitions on disk.]
1. Partitioning phase:
– Read R using one input buffer
– For each tuple:
• Discard unwanted fields
• Apply hash function h1 to choose one of B-1 output buffers
– Result is B-1 partitions (of tuples with no unwanted fields)
• Two tuples from different partitions are guaranteed to be distinct
2. Duplicate elimination phase (a sketch of both phases follows below):
– For each partition:
• Read it and build an in-memory hash table, using hash function h2 (≠ h1) on all fields, discarding duplicates as you go
– If a partition does not fit in memory:
• Apply the hash-based projection algorithm recursively to that partition (we will not do this…)
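A minimal sketch of both phases in Python, assuming every partition fits in memory after one level of partitioning (the recursive case is omitted, as on the slide); a Python set stands in for the in-memory hash table, so it uses the built-in hash rather than a genuinely different h2.

```python
def hash_projection(scan_r, project, num_partitions):
    """Hash-based projection with duplicate elimination (single level)."""
    # Phase 1: partition the projected tuples using hash function h1
    partitions = [[] for _ in range(num_partitions)]          # the B-1 "output buffers"
    for tup in scan_r:
        proj = project(tup)                                   # discard unwanted fields
        partitions[hash(proj) % num_partitions].append(proj)  # h1
    # Phase 2: eliminate duplicates within each partition
    for part in partitions:
        seen = set()                                          # in-memory hash table
        for proj in part:
            if proj not in seen:
                seen.add(proj)
                yield proj

# Usage: project each row to its first field, then deduplicate
rows = [(1, 'a'), (1, 'b'), (2, 'c'), (1, 'a')]
print(list(hash_projection(rows, lambda t: t[0], num_partitions=3)))  # [1, 2]
```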
Our example:
Cost = ReadTable +
WriteProjectedPages +
ReadProjectedPages
= 1000 + 0.25 * 1000 + 250 = 1500 (I/O)
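And the corresponding cost check, mirroring the sort-based version:

```python
def hash_projection_cost(n_pages, pf):
    """I/O cost of hash-based projection (no recursive partitioning needed)."""
    projected = n_pages * pf
    # read R + write projected partitions + read partitions to deduplicate
    return n_pages + projected + projected

print(hash_projection_cost(1000, 0.25))  # 1500.0 I/Os, matching the slide
```

With these statistics the hash-based approach avoids the sorting passes, so it beats the sort-based projection (1500 vs 2500 I/Os).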