Course08 - RelEval
Students:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Courses:
Each tuple is 50 bytes long, 80 tuples per page, 100 pages.
Evaluations:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Equality Joins With One Join Column
SELECT *
FROM Evaluations R, Students S
WHERE R.sid=S.sid
Techniques to Implement Join
Iteration: Simple/Page-Oriented Nested Loops
Indexing: Index Nested Loops
Partition: Sort Merge Join, Hash Join
Simple Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S do
        if ri == sj then add <r, s> to result
For each tuple in the outer relation R, we scan the
entire inner relation S.
Cost: M + pR * M * N = 1000 + 100*1000*500 I/Os,
where M = pages of R, pR = tuples per page of R, N = pages of S.
Page-oriented Nested Loops join: For each page of
R, get each page of S, and write out matching pairs
of tuples <r, s>, where r is in the R-page and s is in the S-page.
Cost: M + M*N = 1000 + 1000*500 = 501,000 I/Os.
If the smaller relation (S) is outer, cost = 500 + 500*1000 = 500,500 I/Os.
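The cost formulas above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the page and tuple counts are the ones given for Evaluations (R) and Students (S) at the start of these slides.

```python
# Sizes from the schema: Evaluations (R) and Students (S).
M = 1000     # pages of R
P_R = 100    # tuples per page of R
N = 500      # pages of S


def simple_nl_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """Scan the outer once, then scan the whole inner once per outer tuple."""
    return outer_pages + outer_tuples_per_page * outer_pages * inner_pages


def page_nl_cost(outer_pages, inner_pages):
    """Scan the outer once, then scan the whole inner once per outer page."""
    return outer_pages + outer_pages * inner_pages


print(simple_nl_cost(M, P_R, N))  # 50001000
print(page_nl_cost(M, N))         # 501000 (R as outer)
print(page_nl_cost(N, M))         # 500500 (S as outer)
```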
Block Nested Loops Join
Use one page as an input buffer for scanning the inner S,
one page as the output buffer, and use all remaining pages
to hold a "block" of outer R.
For each matching tuple r in R-block, s in S-page, add
<r, s> to result. Then read next R-block, scan S, etc.
[Figure: buffer layout for block nested loops. R & S are read into a block of R and an input buffer for S; an output buffer collects the join result.]
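A minimal in-memory sketch of the block nested loops idea, assuming pages are modelled as Python lists of tuples and the join column is addressed by position. The function name and the small sample pages are illustrative only; a real system would read pages through a buffer manager.

```python
def block_nested_loops_join(r_pages, s_pages, block_pages, r_key, s_key):
    """Join R and S, holding `block_pages` pages of the outer R at a time.

    Returns the matching pairs and the number of scans of the inner S.
    """
    result = []
    s_scans = 0
    for start in range(0, len(r_pages), block_pages):
        # Load the next block of R (several pages) into memory.
        block = [r for page in r_pages[start:start + block_pages] for r in page]
        s_scans += 1
        # One full scan of S per R-block, one S-page at a time.
        for s_page in s_pages:
            for s in s_page:
                for r in block:
                    if r[r_key] == s[s_key]:
                        result.append((r, s))
    return result, s_scans


# Illustrative pages: R = (sid, cid), S = (sid, sname, age).
r_pages = [[(28, 101), (28, 102)], [(31, 101), (31, 102)], [(31, 103), (58, 101)]]
s_pages = [[(22, "dustin", 20), (28, "yuppy", 21)],
           [(31, "johnny", 20), (44, "guppy", 22)],
           [(58, "rusty", 21)]]

pairs, scans = block_nested_loops_join(r_pages, s_pages, 2, 0, 0)
print(len(pairs), scans)  # 6 2
```

With three R-pages and a two-page block, S is scanned twice instead of three times (page-oriented) or six times (tuple-at-a-time).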
Examples of Block Nested Loops
Cost: Scan of outer + #outer blocks * scan of inner
#outer blocks = ceiling(no. of pages of outer / block size)
With Evaluations (R) as outer, and 100-pages block of R:
Cost of scanning R is 1000 I/Os; a total of 10 blocks.
Per block of R, we scan Students (S); 10*500 I/Os.
If space for just 90 pages of R, we would scan S 12
times.
With 100-page block of Students as outer:
Cost of scanning S is 500 I/Os; a total of 5 blocks.
Per block of S, we scan Evaluations; 5*1000 I/Os.
With sequential reads considered, analysis changes: may be
best to divide buffers evenly between R and S.
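The three cases on this slide follow from one formula. A small sketch, with a hypothetical helper name, that rounds the number of outer blocks up to a whole block:

```python
import math


def block_nl_cost(outer_pages, inner_pages, block_pages):
    """Scan of outer + (number of outer blocks) * scan of inner."""
    outer_blocks = math.ceil(outer_pages / block_pages)
    return outer_pages + outer_blocks * inner_pages


print(block_nl_cost(1000, 500, 100))  # 6000: Evaluations outer, 10 blocks
print(block_nl_cost(1000, 500, 90))   # 7000: 12 blocks, so S is scanned 12 times
print(block_nl_cost(500, 1000, 100))  # 5500: Students outer, 5 blocks
```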
Index Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S where ri == sj do
        add <r, s> to result
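A sketch of the same loop in Python, assuming the index on the inner relation is stood in for by an in-memory dictionary from join value to matching tuples. The helper names and sample rows are illustrative; a disk-based hash or B+ tree index would replace the dictionary.

```python
from collections import defaultdict


def build_hash_index(tuples, key):
    """Stand-in for an index on the inner relation: join value -> tuples."""
    index = defaultdict(list)
    for t in tuples:
        index[t[key]].append(t)
    return index


def index_nested_loops_join(outer, inner_index, outer_key):
    """For each outer tuple, probe the index instead of scanning the inner."""
    return [(r, s) for r in outer for s in inner_index.get(r[outer_key], [])]


# Illustrative rows: Evaluations = (sid, cid), Students = (sid, sname, age).
evaluations = [(28, 101), (28, 102), (31, 101), (31, 102), (31, 103), (58, 101)]
students = [(22, "dustin", 20), (28, "yuppy", 21), (31, "johnny", 20),
            (44, "guppy", 22), (58, "rusty", 21)]

students_by_sid = build_hash_index(students, 0)
joined = index_nested_loops_join(evaluations, students_by_sid, 0)
print(len(joined))  # 6
```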
Examples of Index Nested Loops
Hash-index (Alt. 2) on sid of Students (as inner):
Scan Evaluations: 1000 page I/Os, 100*1000 tuples.
For each Evaluations tuple: 1.2 I/Os to get data entry in
index, plus 1 I/O to get (the exactly one) matching
Students tuple. Probe cost: 100,000 * 2.2 = 220,000. Total: 221,000 I/Os.
Hash-index (Alt. 2) on sid of Evaluations (as inner):
Scan Students: 500 page I/Os, 80*500 tuples.
For each Students tuple: 1.2 I/Os to find index page
with data entries, plus cost of retrieving matching
Evaluations tuples. Assuming uniform distribution, 2.5
evaluations per student (100,000 / 40,000). Cost of
retrieving them is 1 or 2.5 I/Os depending on whether
the index is clustered. Total: from 88,500 to 148,500 I/Os.
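The totals on this slide can be reproduced as scan cost plus per-tuple probe cost. A sketch with a hypothetical helper; the 1.2 I/Os per index probe and the 1 or 2.5 I/Os per fetch are the figures used on the slide.

```python
def index_nl_cost(outer_pages, outer_tuples, probe_ios, fetch_ios):
    """Scan the outer, then pay (index probe + data fetch) per outer tuple."""
    # round() guards against floating-point noise in e.g. 1.2 + 1.
    return outer_pages + round(outer_tuples * (probe_ios + fetch_ios))


# Evaluations outer, index on Students.sid: one matching student per probe.
print(index_nl_cost(1000, 100 * 1000, 1.2, 1))   # 221000

# Students outer, index on Evaluations.sid: 2.5 matches per student.
print(index_nl_cost(500, 80 * 500, 1.2, 1))      # 88500 (clustered)
print(index_nl_cost(500, 80 * 500, 1.2, 2.5))    # 148500 (unclustered)
```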
Sort-Merge Join (R ⋈_{i=j} S)
Sort R and S on the join column, then scan them to do a
"merge" (on join col.), and output result tuples.
Advance scan of R until current R-tuple >= current S
tuple, then advance scan of S until current S-tuple >=
current R tuple; do this until current R tuple = current S
tuple.
At this point, all R tuples with same value in Ri (current
R group) and all S tuples with same value in Sj (current S
group) match; output <r, s> for all pairs of such tuples.
Then resume scanning R and S.
R is scanned once; each S group is scanned once per
matching R tuple. (Multiple scans of an S group are likely
to find needed pages in buffer.)
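The merge step described above can be sketched in memory as follows. This is an illustration, not an external-sort implementation: both inputs are sorted with Python's `sorted`, the function name is made up, and the sample rows are a small Evaluations/Students instance chosen for the example.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join column, then merge matching groups."""
    r = sorted(r, key=lambda t: t[r_key])
    s = sorted(s, key=lambda t: t[s_key])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][r_key] < s[j][s_key]:
            i += 1                      # advance scan of R
        elif r[i][r_key] > s[j][s_key]:
            j += 1                      # advance scan of S
        else:
            value = r[i][r_key]
            # Find the end of the current S group.
            j_end = j
            while j_end < len(s) and s[j_end][s_key] == value:
                j_end += 1
            # R is scanned once; the S group is rescanned per matching R tuple.
            while i < len(r) and r[i][r_key] == value:
                for k in range(j, j_end):
                    result.append((r[i], s[k]))
                i += 1
            j = j_end
    return result


# R = Evaluations (sid, cid, day, grade), S = Students (sid, sname, age).
evaluations = [(31, 102, "22/6/04", 10), (28, 101, "15/6/04", 8),
               (58, 101, "16/6/04", 7), (28, 102, "22/6/04", 8),
               (31, 101, "15/6/04", 9), (31, 103, "30/6/04", 10)]
students = [(44, "guppy", 22), (22, "dustin", 20), (28, "yuppy", 21),
            (58, "rusty", 21), (31, "johnny", 20)]

merged = sort_merge_join(evaluations, students, 0, 0)
print([r[0] for r, s in merged])  # [28, 28, 31, 31, 31, 58]
```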
Example of Sort-Merge Join
Students (S)            Evaluations (R)
sid  sname   age        sid  cid  day      grade
22   dustin  20         28   101  15/6/04  8
28   yuppy   21         28   102  22/6/04  8
31   johnny  20         31   101  15/6/04  9
44   guppy   22         31   102  22/6/04  10
58   rusty   21         31   103  30/6/04  10
                        58   101  16/6/04  7
Cost: M log2 M + N log2 N + (M+N)
The cost of scanning, M+N, could be M*N (very unlikely!)
With 35, 100 or 300 buffer pages, both Evaluations and
Students can be sorted in 2 passes; total join cost: 7500.
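The 7500 figure can be reproduced by counting passes. The sketch below assumes a standard external merge sort (Pass 0 produces runs of B pages, later passes merge B-1 runs at a time, and every pass reads and writes each page once); the function names are hypothetical.

```python
import math


def sort_passes(pages, buffers):
    """Number of passes of external merge sort with `buffers` buffer pages."""
    runs = math.ceil(pages / buffers)   # Pass 0: sorted runs of `buffers` pages
    passes = 1
    while runs > 1:                     # each merge pass combines buffers-1 runs
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes


def sort_merge_cost(m, n, buffers):
    """Sort both relations (read + write per pass), then one merging scan."""
    sort_r = 2 * m * sort_passes(m, buffers)
    sort_s = 2 * n * sort_passes(n, buffers)
    return sort_r + sort_s + (m + n)


for b in (35, 100, 300):
    print(b, sort_passes(1000, b), sort_passes(500, b), sort_merge_cost(1000, 500, b))
# Each line shows 2 passes for both relations and a total of 7500 I/Os.
```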
Refinement of Sort-Merge Join
We can combine the merging phases in the sorting
of R and S with the merging required for the join.
With B > sqrt(L), where L is the size of the larger relation, using
the sorting refinement that produces runs of length 2B in
Pass 0, number of runs of each relation is < B/2.
Allocate 1 page per run of each relation, and "merge" while
checking the join condition.
Cost: read+write each relation in Pass 0 + read each relation
in (only) merging pass (+ writing of result tuples).
In example, cost goes down from 7500 to 4500 I/Os.
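A short check of the refined cost, as a sketch with hypothetical helper names. The buffer condition is taken here as B greater than the square root of L, which is what makes the "fewer than B/2 runs per relation" claim hold when Pass 0 produces runs of length 2B.

```python
import math


def refinement_applies(larger_pages, buffers):
    """True if B > sqrt(L), so all runs of both relations fit in one merge."""
    return buffers > math.sqrt(larger_pages)


def runs_after_pass0(pages, buffers):
    """Runs per relation when Pass 0 produces runs of length 2B."""
    return math.ceil(pages / (2 * buffers))


def refined_sort_merge_cost(m, n):
    """Read + write each relation in Pass 0, then read each once to merge."""
    return 3 * (m + n)


B = 35
print(refinement_applies(1000, B))                         # True
print(runs_after_pass0(1000, B), runs_after_pass0(500, B))  # 15 8
print(refined_sort_merge_cost(1000, 500))                   # 4500
```

With B = 35, the 15 + 8 = 23 runs need one buffer page each, which fits in the available buffers.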
[Figure: hash join. Partitions of R & S; read in a partition Ri of R (k < B-1 pages) and hash it using a hash fn into a hash table for the partition; matching tuples form the join result.]
Cost of Hash-Join
General Join Conditions
Equalities over several attributes (e.g., R.sid=S.sid AND
R.rname=S.sname):
For Index NL, build index on <sid, sname> (if S is
inner); or use existing indexes on sid or sname.
For Sort-Merge and Hash Join, sort/partition on
combination of the two join columns.
Inequality conditions (e.g., R.rname < S.sname):
For Index NL, need (clustered!) B+ tree index.
• Range probes on inner; number of matches likely
to be much higher than for equality joins.
Hash Join, Sort Merge Join not applicable.
Block NL quite likely to be the best join method here.
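Nested loops methods work for any join condition because the condition is just a test applied to each pair. A minimal sketch, assuming tuples of the form (sid, name) and a hypothetical `nested_loops_theta_join` helper that takes the condition as a function:

```python
def nested_loops_theta_join(r, s, predicate):
    """Tuple-at-a-time nested loops with an arbitrary join condition."""
    return [(x, y) for x in r for y in s if predicate(x, y)]


# Illustrative rows: R = (sid, rname), S = (sid, sname).
r_rows = [(28, "yuppy"), (31, "bob")]
s_rows = [(28, "yuppy"), (31, "johnny")]

# Equality over several attributes: R.sid = S.sid AND R.rname = S.sname.
both_equal = nested_loops_theta_join(
    r_rows, s_rows, lambda x, y: x[0] == y[0] and x[1] == y[1])
print(both_equal)  # [((28, 'yuppy'), (28, 'yuppy'))]

# Inequality condition: R.rname < S.sname.
name_less = nested_loops_theta_join(r_rows, s_rows, lambda x, y: x[1] < y[1])
print(len(name_less))  # 2
```

For the multi-attribute equality case, a hash or sort-based method would use the pair (sid, name) as its key; for the inequality case there is no such key, which is why those methods do not apply.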