External Sorting
SK YAKOOB
Associate Professor
Why Sort?
A classic problem in computer science!
Data requested in sorted order
e.g., find students in increasing gpa order
Sorting is the first step in bulk loading a B+ tree index.
Sorting is useful for eliminating duplicate copies in a collection of records.
The sort-merge join algorithm involves sorting.
Overview on Merge-Sort
Divide: 81 94 11 96 12 35 17 99 is split into 81 94 11 96 and 12 35 17 99 (can recursively divide again and again)
Sort: the two halves become 11 81 94 96 and 12 17 35 99
Overview on Merge-Sort (Cont’d)
Merge: merge two sorted lists by repeatedly choosing the smaller of the two “heads” of the lists
Divide: 81 94 11 96 12 35 17 99 is split into 81 94 11 96 and 12 35 17 99 (can recursively divide again and again)
Sort: 11 81 94 96 and 12 17 35 99
Merge: 11 12 17 35 81 94 96 99
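The merge step can be sketched in a few lines of Python. This is a minimal in-memory illustration, not code from the slides; the function names `merge` and `merge_sort` are chosen here for the example.

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller head."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Recursively divide, sort each half, then merge the halves."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


print(merge([11, 81, 94, 96], [12, 17, 35, 99]))
# [11, 12, 17, 35, 81, 94, 96, 99]
```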
2-Way Sort: Requires 3 Buffers
Phase 1: PREPARE.
Read a page, sort it, write it.
Only one buffer page is used.
Pass i > 0:
Only three memory blocks are needed (two INPUT buffers and one OUTPUT buffer).
Two-Way External Merge Sort
Costs for one pass: all pages
# of passes: height of tree
Total cost: product of above
Notice: We ignored the CPU cost to sort a block in memory or merge two blocks
Example (pages separated by spaces, runs separated by |):
Input file: 3,4 6,2 9,4 8,7 5,6 3,1 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs): 1,2 2,3 3,4 4,5 6,6 7,8 9
Two-Way External Merge Sort
Each pass we read + write each page in file.
N pages in file => 2N page I/Os per pass
Number of passes: $\lceil \log_2 N \rceil + 1$
So total cost is: $2N \left( \lceil \log_2 N \rceil + 1 \right)$
Example: the 7-page input file 3,4 6,2 9,4 8,7 5,6 3,1 2 takes 4 passes (PASS 0 to PASS 3, producing 1-page, 2-page, 4-page, and finally 8-page runs).
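The pass count and I/O cost can be checked with a small simulation. This is a sketch under simplifying assumptions: pages are Python lists held in memory, a run is stored as one flat sorted list, and the name `two_way_external_sort` is made up for this example.

```python
from heapq import merge
from math import ceil, log2


def two_way_external_sort(pages):
    """Simulate two-way external merge sort on a non-empty list of pages.

    Returns (sorted_records, passes, page_ios).
    """
    n = len(pages)
    # Pass 0: read each page, sort it, write it back as a 1-page run.
    runs = [sorted(page) for page in pages]
    passes, page_ios = 1, 2 * n
    # Later passes: merge runs two at a time until a single run remains.
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
        page_ios += 2 * n  # every page is read once and written once per pass
    return runs[0], passes, page_ios


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
records, passes, page_ios = two_way_external_sort(pages)
n = len(pages)
print(passes, page_ios, 2 * n * (ceil(log2(n)) + 1))
# 4 56 56
```

For the 7-page input file the simulation needs 4 passes and 56 page I/Os, which agrees with 2N(⌈log2 N⌉ + 1).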
External Merge Sort
Phase 1 : Prepare
[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B) -> Disk]
Phase 2 : Merge
[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1, OUTPUT) -> Disk]
Merge as many sorted sublists as possible into one long sorted list.
Keep 1 buffer for the output.
Use B-1 buffers to read from B-1 lists.
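A multi-way merge can be sketched with a min-heap holding the current head of each run. This is an in-memory illustration only: real input buffers would be refilled from disk one page at a time, and `k_way_merge` is a name chosen for this example.

```python
import heapq


def k_way_merge(runs):
    """Merge any number of sorted runs into one sorted list.

    The heap holds (value, run index, position) for the head of each run,
    so each step picks the smallest head across all runs.
    """
    heap = [(run[0], run_id, 0) for run_id, run in enumerate(runs) if run]
    heapq.heapify(heap)
    output = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)
        output.append(value)
        # Advance within the run that just supplied the smallest record.
        if pos + 1 < len(runs[run_id]):
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return output


print(k_way_merge([[2, 3, 4, 6], [4, 7, 8, 9], [1, 3, 5, 6], [2]]))
# [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
```

With B buffers, at most B-1 runs would be passed to such a merge at once, since one buffer is reserved for output.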
General External Merge Sort
How can we utilize more than 3 buffer pages?
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages.
Produce $\lceil N/B \rceil$ sorted runs of B pages each.
Pass 1, 2, …, etc.: merge B-1 runs.
[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1, OUTPUT) -> Disk]
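The two kinds of passes can be put together in a short simulation. This is a sketch that assumes the whole file fits in a Python list of pages; `external_merge_sort` and `num_buffers` are illustrative names, and Python's `heapq.merge` stands in for the (B-1)-way merge.

```python
import heapq


def external_merge_sort(pages, num_buffers):
    """Simulate external merge sort with `num_buffers` buffer pages.

    Returns (sorted_records, run_counts), where run_counts[i] is the
    number of runs that exist after pass i.
    """
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: load B pages at a time, sort them in memory, write one run.
    runs = []
    for i in range(0, len(pages), num_buffers):
        chunk = [rec for page in pages[i:i + num_buffers] for rec in page]
        runs.append(sorted(chunk))
    run_counts = [len(runs)]
    # Passes 1, 2, ...: merge B-1 runs at a time (one buffer is the output).
    fan_in = num_buffers - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        run_counts.append(len(runs))
    return (runs[0] if runs else []), run_counts


# Hypothetical input: 108 one-record pages in descending order, 5 buffers.
pages = [[value] for value in range(108, 0, -1)]
records, run_counts = external_merge_sort(pages, 5)
print(run_counts)
# [22, 6, 2, 1]
```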
Cost of External Merge Sort
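The body of this slide did not survive extraction. What follows is the standard textbook cost analysis for a file of N pages and B buffer pages, offered as a likely reading rather than as recovered slide text:

```latex
\text{Number of passes} = 1 + \left\lceil \log_{B-1} \left\lceil N/B \right\rceil \right\rceil
\qquad
\text{Total cost} = 2N \times (\text{number of passes})
```

Pass 0 leaves $\lceil N/B \rceil$ runs, and each later pass divides the number of runs by the fan-in B-1. For N = 108 and B = 5 this gives $1 + \lceil \log_4 22 \rceil = 4$ passes and $2 \cdot 108 \cdot 4 = 864$ page I/Os.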
Example
Buffer: 5 buffer pages
File to sort: 108 pages
Pass 0:
• Size of each run?
• Number of runs?
Pass 1:
• Size of each run?
• Number of runs?
Pass 2: ???
Example
Buffer: 5 buffer pages
File to sort: 108 pages
Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)
Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages
Number of passes = 4 (Pass 0 through Pass 3)
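The run sizes in this example can be traced with plain arithmetic. This is a small sketch; `run_sizes_per_pass` is a name introduced here. Counting the passes it prints (Pass 0 through Pass 3) gives 4 passes in total.

```python
def run_sizes_per_pass(num_pages, num_buffers):
    """Return one list of run sizes (in pages) per pass, starting at pass 0."""
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: runs of B pages; the last run may be shorter.
    sizes = [min(num_buffers, num_pages - start)
             for start in range(0, num_pages, num_buffers)]
    history = [sizes]
    fan_in = num_buffers - 1
    # Each later pass merges B-1 consecutive runs into one longer run.
    while len(sizes) > 1:
        sizes = [sum(sizes[i:i + fan_in]) for i in range(0, len(sizes), fan_in)]
        history.append(sizes)
    return history


for pass_no, sizes in enumerate(run_sizes_per_pass(108, 5)):
    print(f"Pass {pass_no}: {len(sizes)} run(s), "
          f"largest {max(sizes)} pages, last {sizes[-1]} pages")
# Pass 0: 22 run(s), largest 5 pages, last 3 pages
# Pass 1: 6 run(s), largest 20 pages, last 8 pages
# Pass 2: 2 run(s), largest 80 pages, last 28 pages
# Pass 3: 1 run(s), largest 108 pages, last 108 pages
```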
Example: 5-Way Merge for 20 Runs
Number of passes = 2 (20 runs -> 4 runs -> 1 run)
Number of Passes of External Sort
[Figure: number of passes vs. # buffers available in main memory]
- gain of utilizing all available buffers
- importance of a high fan-in during merging
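The original table or chart is not recoverable, but the effect of the buffer count on the number of passes can be recomputed. This is a sketch; the file sizes and buffer counts below are illustrative choices, not values taken from the slide.

```python
def num_passes(num_pages, num_buffers):
    """Passes needed: pass 0 plus repeated (B-1)-way merging."""
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Integer ceiling division avoids floating-point rounding surprises.
    runs = (num_pages + num_buffers - 1) // num_buffers
    passes = 1
    fan_in = num_buffers - 1
    while runs > 1:
        runs = (runs + fan_in - 1) // fan_in
        passes += 1
    return passes


buffer_counts = [3, 5, 9, 17, 129, 257]          # hypothetical values of B
file_sizes = [100, 1_000, 10_000, 100_000, 1_000_000]  # hypothetical N
print("N".rjust(9), *(f"B={b}".rjust(6) for b in buffer_counts))
for n in file_sizes:
    print(str(n).rjust(9),
          *(str(num_passes(n, b)).rjust(6) for b in buffer_counts))
```

Reading along any row shows how quickly the number of passes drops as the fan-in B-1 grows.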
Internal Main-Memory Sorting
Quicksort is a fast way to sort in memory.
An alternative is “heapsort”
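Both options can be sketched briefly. The `quicksort` below is a simple, non-in-place variant chosen for readability, and `heapsort` leans on Python's standard `heapq` module. Neither is meant as what a DBMS would actually ship.

```python
import heapq


def quicksort(records):
    """Quicksort using the middle element as pivot (returns a new list)."""
    if len(records) <= 1:
        return list(records)
    pivot = records[len(records) // 2]
    smaller = [r for r in records if r < pivot]
    equal = [r for r in records if r == pivot]
    larger = [r for r in records if r > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


def heapsort(records):
    """Heapsort: build a min-heap, then pop the smallest record repeatedly."""
    heap = list(records)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


data = [81, 94, 11, 96, 12, 35, 17, 99]
print(quicksort(data))
print(heapsort(data))
# both print [11, 12, 17, 35, 81, 94, 96, 99]
```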
Summary
External sorting is important; a DBMS may dedicate part of the buffer pool to sorting!
External merge sort minimizes disk I/O costs:
Two-Way External Sorting
• Only 3 memory buffers are needed
Multi-Way External Sorting
• Depends on the number of memory buffers available
Summary (Cont’d)
External merge sort minimizes disk I/O costs:
Pass 0: Produces sorted runs of size B (# buffer pages).
Later passes: merge runs.
# of runs merged at a time depends on B and block size.
Larger block size means less I/O cost per page.
Larger block size means smaller # of runs merged.
In practice, # of passes is rarely more than 2 or 3.
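One way to see the block-size trade-off is to assume (this is not stated on the slide) that each input run and the output each get one block of b pages. B buffer pages then allow ⌊B/b⌋ - 1 runs to be merged at once. The buffer and block sizes below are hypothetical.

```python
def runs_merged_at_once(num_buffers, block_size):
    """Fan-in when every input run and the output use one block of pages."""
    # Assumption: one block is reserved for output, the rest are input blocks.
    return num_buffers // block_size - 1


for block_size in (1, 10, 32):  # hypothetical block sizes, in pages
    print(block_size, runs_merged_at_once(100, block_size))
# 1 99
# 10 9
# 32 2
```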