External Sorting

External sorting is a technique used to sort large datasets that cannot fit into main memory, involving a sorting phase and a merge phase. The external merge sort method sorts data stored in intermediate files and merges them to create a sorted output. Efficient use of buffer pages during the sorting process minimizes disk I/O costs, with the number of passes required depending on the size of the data and the available buffers.

UNIT-IV

External Sorting

SK YAKOOB
Associate Professor
Why Sort?
• A classic problem in computer science!
• Data is often requested in sorted order
  - e.g., find students in increasing GPA order
• Sorting is the first step in bulk loading a B+ tree index.
• Sorting is useful for eliminating duplicate copies in a collection of records.
• The sort-merge join algorithm involves sorting.

• Problem: sort 10GB of data with 1MB of RAM.


External Sorting
• External sorting is a technique for data stored on secondary memory: the data is loaded into main memory part by part, and each part is sorted there. The sorted parts are then written to intermediate files.
• Finally, these files are merged to produce the sorted data.
• With external sorting, a dataset much larger than main memory can be sorted.
• It consists of two phases:
  - Sorting phase: memory-sized parts of the data are sorted and written to intermediate files.
  - Merge phase: the sorted files are combined into a single larger sorted file.
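
The two phases can be sketched in Python as follows. This is a minimal illustration, not a DBMS implementation: the function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line run files are assumptions made for the example.

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size):
    """Sort integers while holding at most chunk_size of them in memory (hypothetical helper)."""
    run_paths = []

    def write_run(chunk):
        # Sorting phase: sort one memory-sized chunk and write it to an intermediate file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            write_run(chunk)
            chunk = []
    if chunk:
        write_run(chunk)

    # Merge phase: stream every run file through one lazy merge.
    files = [open(p) for p in run_paths]
    try:
        # Collected into a list only for the demo; a real sort would write to an output file.
        return [int(line) for line in heapq.merge(*files, key=int)]
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

print(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

Here `heapq.merge` reads the run files lazily, so only one line per run needs to be in memory during the merge phase.
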
External Merge Sort
• External merge sort is a technique in which the data is split into intermediate files, each intermediate file is sorted independently, and the sorted files are then merged to produce the sorted data.
• For example: suppose 10,000 records have to be sorted with the external merge sort method. Suppose main memory can hold 500 records at a time and each block holds 100 records, so memory holds 5 blocks. Sorting 500 records at a time produces 10,000 / 500 = 20 sorted runs, which are then merged.
Two-Way Merge Sort
• Two-way merge sort is a technique that works in two stages:
  - Stage 1: break the records into blocks and sort each block individually, with the help of two input tapes.
  - Stage 2: merge the sorted blocks into a single sorted file, with the help of two output tapes.
Two-Way Merge Sort
Algorithm for Two-Way Merge Sort:
Step 1) Divide the elements into blocks of size M. Sort each block and write it to disk as a run.
Step 2) Merge two runs:
  Read the first value of each of the two runs.
  Compare them and write the smaller record to the output tape.
  Advance in the run that record came from, and repeat until both runs are exhausted.
Step 3) Repeat step 2 to get longer and longer runs on alternate tapes. In the end, we get a single sorted list.
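
The three steps can be sketched in Python with in-memory lists standing in for tapes. The names `merge_two_runs` and `two_way_merge_sort` and the `block_size` parameter (the M of Step 1) are assumptions for illustration.

```python
def merge_two_runs(left, right):
    """Step 2: merge two sorted runs by repeatedly taking the smaller head."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One run is exhausted; copy whatever remains of the other.
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def two_way_merge_sort(records, block_size):
    # Step 1: cut the input into blocks of size M and sort each block.
    runs = [sorted(records[k:k + block_size])
            for k in range(0, len(records), block_size)]
    # Step 3: keep merging runs pairwise until a single run remains.
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge_two_runs(runs[k], runs[k + 1]))
            else:
                next_runs.append(runs[k])  # odd run out is carried to the next pass
        runs = next_runs
    return runs[0] if runs else []

print(two_way_merge_sort([9, 4, 7, 1, 8, 2, 6], block_size=2))
```
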
Overview on Merge-Sort
• Merge: merge two sorted lists by repeatedly choosing the smaller of the two "heads" of the lists.
• Merge Sort: divide the records into two parts, merge-sort those recursively, and then merge the lists.

  Input:           81 94 11 96 12 35 17 99
  Divide:          81 94 11 96 | 12 35 17 99   (can recursively divide again and again)
  Sort each half:  11 81 94 96 | 12 17 35 99
  Merge:           11 12 17 35 81 94 96 99
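
A minimal recursive sketch of in-memory merge sort on the slide's example values. The function name `merge_sort` is an assumption, and the merge step uses the standard library's `heapq.merge` rather than a hand-written loop.

```python
from heapq import merge

def merge_sort(records):
    # A list of zero or one records is already sorted.
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = merge_sort(records[:mid])    # merge-sort the first part recursively
    right = merge_sort(records[mid:])   # merge-sort the second part recursively
    return list(merge(left, right))     # merge the two sorted lists

halves = (merge_sort([81, 94, 11, 96]), merge_sort([12, 35, 17, 99]))
result = merge_sort([81, 94, 11, 96, 12, 35, 17, 99])
print(halves)
print(result)
```
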
2-Way Sort: Requires 3 Buffers
• Phase 1: PREPARE.
  - Read a page, sort it, write it.
  - Only one buffer page is used.
• Phase 2, 3, …, etc.: MERGE.
  - Three buffer pages are used.

[Figure: Disk -> main memory buffers (INPUT 1, INPUT 2, OUTPUT; 1 buffer each) -> Disk]
Two-Way External Merge Sort
• Idea: divide and conquer: sort subfiles and merge them into larger sorted runs.
• Pass 0: only one memory block is needed.
• Pass i > 0: only three memory blocks are needed.

  Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
  PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
  PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
  PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
  PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9
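
The pass-by-pass example can be reproduced with a small simulation. This is a sketch under assumptions: the names `paginate` and `two_way_external_sort` are hypothetical, pages hold two values as in the figure, and whole runs are held in memory for brevity, whereas a real implementation would stream one page per input buffer.

```python
from heapq import merge

def paginate(values, page_size):
    """Cut a flat list of values into pages of page_size values."""
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]

def two_way_external_sort(pages, page_size=2):
    """Return the runs after each pass; a run is a list of pages."""
    # Pass 0: sort each page on its own, giving 1-page runs.
    runs = [[sorted(p)] for p in pages]
    history = [runs]
    # Pass i > 0: merge runs two at a time.
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            pair = runs[k:k + 2]
            flat = list(merge(*[[v for page in run for v in page] for run in pair]))
            merged.append(paginate(flat, page_size))
        runs = merged
        history.append(runs)
    return history

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
history = two_way_external_sort(pages)
for number, runs in enumerate(history):
    print(f"PASS {number}: {runs}")
```
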
Two-Way External Merge Sort
• Cost of one pass: all pages are read and written.
• # of passes: height of the merge tree.
• Total cost: product of the above.
• Notice: we ignored the CPU cost to sort a block in memory or merge two blocks.

[Figure: the 7-page example repeated, input file through PASS 0 to PASS 3]
Two-Way External Merge Sort
• In each pass we read and write each page in the file.
• N pages in the file => 2N page I/Os per pass.
• Number of passes: $\lceil \log_2 N \rceil + 1$
• So the total cost is: $2N \left( \lceil \log_2 N \rceil + 1 \right)$

[Figure: the 7-page example repeated, input file through PASS 0 to PASS 3]
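
As a worked instance of the cost formula, take the 7-page input file from the example (N = 7):

$$
\text{passes} = \lceil \log_2 7 \rceil + 1 = 3 + 1 = 4, \qquad \text{cost} = 2N \cdot \text{passes} = 2 \cdot 7 \cdot 4 = 56 \text{ page I/Os}
$$

The four passes correspond to PASS 0 through PASS 3 of the example.
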
External Merge Sort
• What if we had more buffer pages?
• How do we utilize them wisely?
  - Two main ideas!
Phase 1: Prepare
• Construct starter lists (initial sorted runs) that are as large as possible.
• This reduces the number of passes needed.

[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B) -> Disk]
Phase 2: Merge
• Merge as many sorted sublists as possible into one long sorted list.
• Keep 1 buffer for the output.
• Use B-1 buffers to read from B-1 lists.

[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1, OUTPUT) -> Disk]
General External Merge Sort
How can we utilize more than 3 buffer pages?
• To sort a file with N pages using B buffer pages:
  - Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.
  - Pass 1, 2, …, etc.: merge B-1 runs.

[Figure: Disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1, OUTPUT) -> Disk]
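
A sketch of the general algorithm as a simulation over in-memory pages. The name `external_merge_sort` and its return shape are assumptions for the example. Runs are kept as flat lists for brevity, so the buffer limits are modeled only through the run size in pass 0 and the fan-in of later passes.

```python
from heapq import merge

def external_merge_sort(pages, num_buffers):
    """Simulate the sort with B = num_buffers; returns (sorted_values, number_of_passes)."""
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: read B pages at a time, sort them together, emit one run of up to B pages.
    runs = []
    for k in range(0, len(pages), num_buffers):
        group = pages[k:k + num_buffers]
        runs.append(sorted(v for page in group for v in page))
    passes = 1
    # Pass 1, 2, ...: merge B-1 runs at a time (one buffer is kept for the output).
    fan_in = num_buffers - 1
    while len(runs) > 1:
        runs = [list(merge(*runs[k:k + fan_in]))
                for k in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
print(external_merge_sort(pages, num_buffers=3))
print(external_merge_sort(pages, num_buffers=4))
```

With 4 buffers instead of 3, the same 7-page file needs one pass fewer, because pass 0 produces fewer runs and each merge pass combines more of them.
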
Cost of External Merge Sort

• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$

• Cost = 2N * (# of passes)
Example
• Buffer: 5 buffer pages
• File to sort: 108 pages

• Pass 0:
  - Size of each run?
  - Number of runs?

• Pass 1:
  - Size of each run?
  - Number of runs?

• Pass 2: ???
Example
• Buffer: 5 buffer pages
• File to sort: 108 pages
• Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)
• Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)
• Pass 2: 2 sorted runs, 80 pages and 28 pages
• Pass 3: sorted file of 108 pages

• Total I/O costs: 2*N * (4 passes) = 2 * 108 * 4 = 864 page I/Os
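
The run sizes in this example can be checked with a few lines of Python. The function name `run_sizes_per_pass` is an assumption, and only page counts are tracked, not the data itself.

```python
def run_sizes_per_pass(num_pages, num_buffers):
    """Return, for each pass, the sizes in pages of the runs that pass produces."""
    # Pass 0: runs of B pages each; the last run may be shorter.
    sizes = [min(num_buffers, num_pages - start)
             for start in range(0, num_pages, num_buffers)]
    passes = [sizes]
    fan_in = num_buffers - 1  # B-1 input buffers, 1 output buffer
    while len(sizes) > 1:
        # Each merge combines up to B-1 consecutive runs into one longer run.
        sizes = [sum(sizes[k:k + fan_in]) for k in range(0, len(sizes), fan_in)]
        passes.append(sizes)
    return passes

passes = run_sizes_per_pass(108, 5)
total_io = 2 * 108 * len(passes)  # every pass reads and writes all N pages
for number, sizes in enumerate(passes):
    print(f"Pass {number}: {len(sizes)} runs, sizes {sizes}")
print("Total I/O:", total_io)
```
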
Example: 2-Way Merge for 20 Runs

Number of passes = 5

Example: 5-Way Merge for 20 Runs

Number of passes = 2
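
Both counts follow from the fact that merging R runs k at a time takes $\lceil \log_k R \rceil$ merge passes, assuming the counts on these two slides refer to merge passes over the 20 initial runs:

$$
\lceil \log_2 20 \rceil = 5 \quad (2^4 = 16 < 20 \le 32 = 2^5), \qquad \lceil \log_5 20 \rceil = 2 \quad (5^1 = 5 < 20 \le 25 = 5^2)
$$
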
Number of Passes of External Sort
- gain of utilizing all available buffers
- importance of a high fan-in during merging

N = # pages in file; B = # buffers available in main memory

N                B=3   B=5   B=9   B=17   B=129   B=257
100                7     4     3      2       1       1
1,000             10     5     4      3       2       2
10,000            13     7     5      4       2       2
100,000           17     9     6      5       3       3
1,000,000         20    10     7      5       3       3
10,000,000        23    12     8      6       4       3
100,000,000       26    14     9      7       4       4
1,000,000,000     30    15    10      8       5       4
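
The table entries can be recomputed from the pass-count formula. The sketch below uses integer ceiling division instead of floating-point logarithms to avoid rounding problems at exact powers. The name `num_passes` is an assumption.

```python
def num_passes(num_pages, num_buffers):
    """1 + ceil(log_{B-1}(ceil(N / B))), computed with integer arithmetic."""
    runs = -(-num_pages // num_buffers)   # ceil(N / B) runs after pass 0
    passes = 1
    fan_in = num_buffers - 1
    while runs > 1:
        runs = -(-runs // fan_in)         # each merge pass divides the run count by B-1
        passes += 1
    return passes

buffer_counts = [3, 5, 9, 17, 129, 257]
table = {n: [num_passes(n, b) for b in buffer_counts]
         for n in (100, 1_000, 10_000, 1_000_000_000)}
for n, row in table.items():
    print(f"{n:>13,}", row)
```
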
Internal Main-Memory Sorting
• Quicksort is a fast way to sort in memory.

• An alternative is "heapsort".
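
For illustration, a heap-based in-memory sort can be sketched with Python's standard `heapq` module. This is a simplified variant that copies the input rather than the classic in-place heapsort, and the function name `heapsort` is an assumption.

```python
import heapq

def heapsort(records):
    heap = list(records)      # copy so the caller's list is left untouched
    heapq.heapify(heap)       # build a min-heap in O(n)
    # Pop the smallest remaining record n times, O(log n) each.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsort([81, 94, 11, 96, 12, 35, 17, 99]))
```
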
Summary
• External sorting is important; a DBMS may dedicate part of its buffer pool to sorting!
• External merge sort minimizes disk I/O costs:
  - Two-Way External Sorting
    • Only 3 memory buffers are needed
  - Multi-Way External Sorting
    • Depends on the number of memory buffers available
Summary (Cont'd)
• External merge sort minimizes disk I/O costs:
  - Pass 0: produces sorted runs of size B (# buffer pages).
  - Later passes: merge runs.
  - # of runs merged at a time depends on B and on the block size.
  - Larger block size means less I/O cost per page.
  - Larger block size means a smaller # of runs merged at a time.
  - In practice, the # of passes is rarely more than 2 or 3.