ADS Unit-1
SORTING
Introduction to Sorting
Sorting is the storage of data in sorted order, either ascending or descending. Sorting is closely tied to searching: in real life there are many things we need to search for, such as a particular record in a database, a roll number in a merit list, a particular telephone number, or a particular page in a book. Sorting arranges data in a sequence that makes searching easier. Every record to be sorted contains one key, and the records are sorted based on this key. For example, suppose we have a file of student records, where every record contains the following data:
Roll No.
Name
Age
Class
Here the student roll number can be taken as the key for sorting the records in ascending or descending order. Now suppose we have to search for the student with roll number 15. We do not need to search the complete file; we can simply search among the students with roll numbers 10 to 20.
Sorting Efficiency
There are many techniques for sorting, and the choice of a particular technique depends on the situation. The efficiency of a sorting technique depends mainly on two parameters. The first is the execution time, i.e., the time taken by the program to run. The second is the space, i.e., the memory taken by the program.
Types of Sorting Techniques
Sorting can be classified in two ways:
I. Internal Sorting:
This method uses only primary memory during the sorting process: all data items are held in main memory, and no secondary memory is required for this sorting process. If all the data to be sorted can be accommodated in main memory at one time, the sort is called an internal sort. Internal sorting has a limitation: it can only process relatively small lists, due to memory constraints. The internal sorting methods include:
1. Selection Sort: selection sort algorithm, heap sort algorithm.
2. Insertion Sort:
3. Exchange Sort:
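As a concrete illustration of the first category, here is a minimal selection sort in C. This is a generic sketch rather than code from the course material; the array contents are illustrative.

```c
#include <stdio.h>

/* Selection sort: on each pass, find the smallest remaining key and
   swap it into the next position of the sorted prefix. */
void selection_sort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (a[j] < a[min])
                min = j;
        int t = a[i]; a[i] = a[min]; a[min] = t;
    }
}

int main(void) {
    int roll[] = {23, 15, 42, 8, 16, 4};   /* keys, e.g. student roll numbers */
    int n = sizeof roll / sizeof roll[0];

    selection_sort(roll, n);
    for (int i = 0; i < n; i++)
        printf("%d ", roll[i]);            /* prints: 4 8 15 16 23 42 */
    printf("\n");
    return 0;
}
```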
II. External Sorting:
Sorting a large amount of data requires external (secondary) memory. This process uses external memory, such as a hard disk, to store the data that does not fit into main memory; primary memory holds only the portion of the data currently being sorted. All external sorts are based on the process of merging: different parts of the data are sorted separately and then merged together.
Merge Sort
For a disk, there are three factors contributing to the read/write time:
(1) Seek time: time taken to position the read/write heads to the correct cylinder. This will
depend on the number of cylinders across which the heads have to move.
(2) Latency time: time until the right sector of the track is under the read/write head.
(3) Transmission time: time to transmit the block of data to/from the disk.
The most popular method for sorting on external storage devices is merge sort. This method
consists of two distinct phases.
First, segments of the input list are sorted using a good internal sort method. These sorted
segments, known as runs, are written onto external storage as they are generated.
Second, the runs generated in phase one are merged together until only one run is left.
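The basic operation in phase two is merging two sorted runs into one. The sketch below (in C, with illustrative data) merges two in-memory runs; in the external setting the same logic reads the runs block by block from disk and writes the merged run out as it is produced.

```c
#include <stdio.h>

/* Merge sorted runs a[0..m-1] and b[0..n-1] into out[0..m+n-1]. */
void merge_runs(const int a[], int m, const int b[], int n, int out[]) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < m) out[k++] = a[i++];   /* copy the tail of run a */
    while (j < n) out[k++] = b[j++];   /* copy the tail of run b */
}

int main(void) {
    int r1[] = {1, 3, 5, 7, 8, 9};      /* first sorted run */
    int r2[] = {2, 4, 6, 15, 20, 25};   /* second sorted run */
    int out[12];

    merge_runs(r1, 6, r2, 6, out);
    for (int k = 0; k < 12; k++)
        printf("%d ", out[k]);          /* 1 2 3 4 5 6 7 8 9 15 20 25 */
    printf("\n");
    return 0;
}
```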
Example: A list containing 4500 records is to be sorted using a computer with an internal
memory capable of sorting at most 750 records. The input list is maintained on disk and has
a block length of 250 records. We have available another disk that may be used as a scratch
pad. The input disk is not to be written on. One way to accomplish the sort using the general method outlined above is to:
(1) Internally sort three blocks at a time (i.e., 750 records) to obtain six runs R1 to R6.
(2) Set aside three blocks of internal memory, each capable of holding 250 records. Two of these blocks are used as input buffers and the third as an output buffer. Merge runs R1 and R2: this merge is carried out by first reading one block of each of these runs into the input buffers.
We shall assume that each time a block is read from or written onto the disk, the maximum seek and latency times are experienced. Although this is not true in general, it simplifies the analysis. The computing times for the various operations in our 4500-record example are given in the accompanying figure (omitted here).
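To make the merge strategy outlined above concrete, the record counts work out as follows (this is arithmetic implied by the example, not taken from the omitted figure):

Phase 1: 4500 / 750 = 6 internally sorted runs of 750 records each (R1 to R6).
Merge pass 1: merge (R1, R2), (R3, R4), and (R5, R6) to get three runs of 1500 records.
Merge pass 2: merge two of the 1500-record runs to get one run of 3000 records.
Merge pass 3: merge the 3000-record run with the remaining 1500-record run to obtain the sorted list of 4500 records.

Every record is read and written once in the sort phase and once in each merge pass in which it participates, which is why reducing the number of merge passes (see k-way merging below) reduces the total input/output time.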
k-Way Merging
The two-way merge function merge (Program 7.7) is almost identical to the merge function just described (Figure 7.20). In general, if we start with m runs, the merge tree corresponding to Figure 7.20 will have ⌈log2 m⌉ + 1 levels, for a total of ⌈log2 m⌉ passes over the data list. The number of passes over the data can be reduced by using a higher-order merge, in which k ≥ 2 runs are merged at a time.
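Below is a minimal k-way merge in C that selects the next output record by a linear scan over the k run heads; the winner and loser trees described later reduce this selection step from O(k) to O(log k) comparisons. The runs are illustrative.

```c
#include <stdio.h>
#include <limits.h>

#define K 3   /* merge order: number of runs merged at a time */

int main(void) {
    int run0[] = {1, 5, 9}, run1[] = {2, 6, 15}, run2[] = {3, 4, 25};
    int *runs[K] = {run0, run1, run2};
    int len[K]   = {3, 3, 3};
    int pos[K]   = {0, 0, 0};   /* next unread record of each run */

    for (;;) {
        /* Linear scan for the run whose head record has the smallest key. */
        int best = -1, bestKey = INT_MAX;
        for (int i = 0; i < K; i++)
            if (pos[i] < len[i] && runs[i][pos[i]] < bestKey) {
                best = i;
                bestKey = runs[i][pos[i]];
            }
        if (best < 0) break;        /* all runs exhausted */
        printf("%d ", bestKey);     /* prints: 1 2 3 4 5 6 9 15 25 */
        pos[best]++;
    }
    printf("\n");
    return 0;
}
```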
With two output buffers we can overlap disk output with computation: while one buffer is being written to the disk, the merge fills the other. If the buffer lengths are chosen correctly, then the time to output one buffer will be the same as the CPU time needed to fill the second buffer.
Example: Assume that a two-way merge is carried out using four input buffers and two output buffers, ou[0] and ou[1]. Each buffer is capable of holding two records. The first few records of run 0 have key values 1, 3, 5, 7, 8, 9. The first few records of run 1 have key values 2, 4, 6, 15, 20, 25.
Buffering Algorithm:
Step 1: Input the first block of each of the k runs, setting up k linked queues, each having one
block of data. Put the remaining k input blocks into a linked stack of free input blocks. Set ou
to 0.
Step 2: Let LastKey[i] be the last key input from run i. Let NextRun be the run for which LastKey is minimum. If LastKey[NextRun] ≠ +∞, then initiate the input of the next block from run NextRun.
Step 3: Use a k-way merge function to merge records from the k input queues into the output buffer ou. Merging continues until either the output buffer gets full or a record with key +∞ is merged into ou. If, during this merge, an input buffer becomes empty before the output buffer gets full or before +∞ is merged into ou, the k-way merge advances to the next buffer on the same queue and returns the empty buffer to the stack of empty buffers. However, if an input buffer becomes empty at the same time as the output buffer gets full or +∞ is merged into ou, the empty buffer is left on the queue, and the k-way merge does not advance to the next buffer on the queue; rather, the merge terminates.
Step 4: Wait for any ongoing disk input/output to complete.
Step 5: If an input buffer has been read, add it to the queue for the appropriate
run. Determine the next run to read from by determining NextRun such that LastKey
[NextRun ] is minimum.
Step 6: If LastKey[NextRun] ≠ +∞, then initiate reading the next block from run NextRun into a free input buffer.
Step 7: Initiate the writing of output buffer ou. Set ou to 1 - ou.
Step 8: If a record with key +∞ has not been merged into the output buffer, go back to Step 3. Otherwise, wait for the ongoing write to complete and then terminate.
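The heart of Steps 4 through 8 is the alternation between the two output buffers. The toy sketch below shows only that toggle; start_write and wait_for_io are hypothetical stand-ins for asynchronous disk routines and merely print what a real implementation would initiate.

```c
#include <stdio.h>

#define BUFLEN 2                 /* records per output buffer */
int outbuf[2][BUFLEN];           /* the two output buffers ou[0] and ou[1] */

/* Hypothetical stand-ins for asynchronous disk I/O. */
void start_write(int which) { printf("initiate write of output buffer %d\n", which); }
void wait_for_io(void)      { printf("wait for ongoing I/O to complete\n"); }

int main(void) {
    int ou = 0;                           /* Step 1: set ou to 0 */
    for (int pass = 0; pass < 3; pass++) {
        for (int i = 0; i < BUFLEN; i++)  /* Step 3 (simulated): fill outbuf[ou] */
            outbuf[ou][i] = pass * BUFLEN + i;
        wait_for_io();                    /* Step 4 */
        start_write(ou);                  /* Step 7: initiate the write ... */
        ou = 1 - ou;                      /* ... and switch output buffers  */
    }
    wait_for_io();                        /* Step 8: wait for the final write */
    return 0;
}
```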
Ex: Each run consists of four blocks of two records each; the last key in the fourth block of each of these three runs is +∞. We have six input buffers and two output buffers. Figure 7.25 shows the status of the input buffer queues, the run from which the next block is being read, and the output buffer being output at the beginning of each iteration of the loop of Steps 3 through 8 of the buffering algorithm.
Run Generation:
With a conventional internal sort, the runs generated are only as large as the number of records that can be held in internal memory at one time. Using a loser tree (replacement selection, described below), it is possible to generate runs that are, on average, about twice that size.
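Below is a minimal replacement-selection sketch in C using a binary min-heap ordered first by run number and then by key; a loser tree plays the same role with fewer comparisons. The capacity and data are illustrative. A record smaller than the last record output cannot extend the current run, so it is tagged with the next run number.

```c
#include <stdio.h>

#define CAP 3                      /* records held in internal memory at a time */

typedef struct { int run, key; } Rec;
static Rec heap[CAP];
static int hn;                     /* current heap size */

static int before(Rec a, Rec b) {  /* order by run number, then by key */
    return a.run != b.run ? a.run < b.run : a.key < b.key;
}

static void siftdown(int i) {
    for (;;) {
        int c = 2 * i + 1;
        if (c >= hn) return;
        if (c + 1 < hn && before(heap[c + 1], heap[c])) c++;
        if (!before(heap[c], heap[i])) return;
        Rec t = heap[i]; heap[i] = heap[c]; heap[c] = t;
        i = c;
    }
}

int main(void) {
    int input[] = {17, 3, 25, 1, 8, 30, 2, 12, 5};
    int n = sizeof input / sizeof input[0], next = 0;

    for (hn = 0; hn < CAP && next < n; hn++)   /* fill the workspace; */
        heap[hn] = (Rec){1, input[next++]};    /* all records start in run 1 */
    for (int i = hn / 2 - 1; i >= 0; i--) siftdown(i);

    while (hn > 0) {
        Rec out = heap[0];                     /* smallest record of current run */
        printf("run %d: %d\n", out.run, out.key);
        if (next < n) {
            /* Replace the root with the next input record; if it cannot
               extend the current run, tag it for the next run. */
            int k = input[next++];
            heap[0] = (Rec){k >= out.key ? out.run : out.run + 1, k};
        } else {
            heap[0] = heap[--hn];              /* input exhausted: shrink heap */
        }
        siftdown(0);
    }
    return 0;
}
```

With a workspace of only 3 records, this input produces a first run of 4 records (3, 17, 25, 30), illustrating how replacement selection generates runs longer than the memory capacity.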
Winner Trees
A winner tree is a complete binary tree in which each internal node represents a match played between its two children; the winner of the match (e.g., the record with the smaller key) is stored at the internal node. The overall winner, the smallest record among all the run heads, appears at the root.
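A minimal winner tree for a k-way merge in C. The tree is stored in an array: internal node i holds the index of the winning player, and player p occupies the implicit leaf K + p. This is an illustrative sketch (K fixed to a power of two); an exhausted run is represented by the key +∞, as in the buffering algorithm above.

```c
#include <stdio.h>
#include <limits.h>

#define K 4                        /* number of players (runs); a power of 2 */
#define INF INT_MAX                /* stands in for the key +infinity */

static int key[K];                 /* current key of each player */
static int winner[K];              /* winner[i]: winning player at internal node i */

static int player_at(int i) { return i >= K ? i - K : winner[i]; }

/* Replay the matches on the path from player p's leaf to the root. */
static void replay(int p) {
    for (int i = (K + p) / 2; i >= 1; i /= 2) {
        int l = player_at(2 * i), r = player_at(2 * i + 1);
        winner[i] = (key[l] <= key[r]) ? l : r;
    }
}

int main(void) {
    int run0[] = {1, 5, 9, INF}, run1[] = {2, 6, INF, INF},
        run2[] = {3, 4, INF, INF}, run3[] = {7, 8, INF, INF};
    int *runs[K] = {run0, run1, run2, run3};
    int pos[K] = {0, 0, 0, 0};

    for (int p = 0; p < K; p++) key[p] = runs[p][pos[p]];
    for (int i = K - 1; i >= 1; i--)           /* initialize: O(K) matches */
        winner[i] = (key[player_at(2 * i)] <= key[player_at(2 * i + 1)])
                        ? player_at(2 * i) : player_at(2 * i + 1);

    for (;;) {
        int w = winner[1];                     /* get the winner: O(1) */
        if (key[w] == INF) break;              /* all runs exhausted */
        printf("%d ", key[w]);                 /* prints: 1 2 3 4 5 6 7 8 9 */
        key[w] = runs[w][++pos[w]];            /* replace the winner ... */
        replay(w);                             /* ... and replay: O(log K) */
    }
    printf("\n");
    return 0;
}
```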
Time to Sort (using a winner tree)
• Initialize the winner tree: O(n) time.
• Get the winner: O(1) time.
• Replace the winner and replay the matches: O(log n) time.
• Total time to sort n records this way: O(n log n).
Loser Tree: Each match node stores the loser of the match rather than the winner; the overall winner is recorded separately, above the root. The advantage is that, when the winner is replaced and the matches are replayed, each node on the path to the root is compared only against the loser already stored there.
Replace and Replay: After the overall winner is output, it is replaced by the next record from its input, and the matches along the path from that leaf to the root are replayed.
Analysis of runs: When the input list is already sorted, only one run is generated. On average, the run size is almost 2k. The time required to generate all the runs for an n-record list is O(n log k), since it takes O(log k) time to adjust the loser tree each time a record is output.
The circular nodes represent a two-way merge using as input the data of the children nodes.
The square nodes represent the initial runs. We shall refer to the circular nodes as internal
nodes and the square ones as external nodes. Each figure is a merge tree.
In the first merge tree, we begin by merging the runs of size 2 and 4 to get one of size 6; next this is merged with the run of size 5 to get a run of size 11; finally, this run of size 11 is merged with the run of size 15 to get the desired sorted run of size 26. When merging is done using the first merge tree, some records are involved in only one merge, while others are involved in up to three merges. In the second merge tree, each record is involved in exactly two merges. Since the time for a merge is linear in the number of records being merged, the total merge time is obtained by summing the products of the run lengths and the distances of the corresponding external nodes from the root. This sum is called the weighted external path length.
For the two trees of Figure 7.26, the respective weighted external path lengths are
2·3 + 4·3 + 5·2 + 15·1 = 43
and
2·2 + 4·2 + 5·2 + 15·2 = 52.
There is another application for binary trees with minimum weighted external path length. Suppose we wish to obtain an optimal set of codes for messages M1, ..., Mn+1. Each code is a binary string that will be used for transmission of the corresponding message. At the receiving end, the code will be decoded using a decode tree. A decode tree is a binary tree in which external nodes represent messages.
The binary bits in the code word for a message determine the branching needed at each
level of the decode tree to reach the correct external node.
If we interpret a zero as a left branch and a one as a right branch, then the decode tree of the figure above corresponds to the codes 000, 001, 01, and 1 for the messages M1, M2, M3, and M4, respectively. These codes are called Huffman codes.
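A small sketch in C of decoding with such a tree, hardcoding the decode tree for the codes 000, 001, 01, and 1 above; the structure and names are illustrative.

```c
#include <stdio.h>

/* A decode-tree node: external nodes carry a message label; internal
   nodes branch left on a 0 bit and right on a 1 bit. */
typedef struct Node { const char *msg; struct Node *left, *right; } Node;

int main(void) {
    /* Decode tree for M1 = 000, M2 = 001, M3 = 01, M4 = 1. */
    Node m1 = {"M1", 0, 0}, m2 = {"M2", 0, 0}, m3 = {"M3", 0, 0}, m4 = {"M4", 0, 0};
    Node a = {0, &m1, &m2};          /* subtree reached by the prefix 00 */
    Node b = {0, &a, &m3};           /* subtree reached by the prefix 0  */
    Node root = {0, &b, &m4};

    const char *bits = "0100011";    /* encodes M3, M1, M4, M4 */
    Node *p = &root;
    for (const char *c = bits; *c; c++) {
        p = (*c == '0') ? p->left : p->right;
        if (p->msg) {                /* reached an external node */
            printf("%s ", p->msg);   /* prints: M3 M1 M4 M4 */
            p = &root;
        }
    }
    printf("\n");
    return 0;
}
```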
A very nice solution to the problem of finding a binary tree with minimum
weighted external path length has been given by D. Huffman.
Ex: Suppose we have the weights q1 = 2, q2 = 3, q3 = 5, q4 = 7, q5 = 9, and q6 = 13. Construct a Huffman tree and find its weighted external path length.
NOTE: The number in a circular (internal) node represents the sum of the weights of the external nodes in its subtree.
The weighted external path length for the above tree is:
2·4 + 3·4 + 5·3 + 13·2 + 7·2 + 9·2 = 93
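The weighted external path length can be computed without building the tree explicitly: each time Huffman's method combines the two smallest weights, the combined weight is exactly the contribution of that internal node to the sum, so the weighted external path length equals the total of all combined weights. The sketch below uses an O(n^2) minimum scan to stay short; the heap-based version analyzed next achieves O(n log n).

```c
#include <stdio.h>

/* Index of the smallest element of w[0..n-1], skipping index `skip`. */
static int argmin(const int w[], int n, int skip) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (i != skip && (best < 0 || w[i] < w[best]))
            best = i;
    return best;
}

int main(void) {
    int w[] = {2, 3, 5, 7, 9, 13};   /* external-node weights q1..q6 */
    int n = sizeof w / sizeof w[0];
    int wepl = 0;                     /* weighted external path length */

    while (n > 1) {
        int i = argmin(w, n, -1);     /* smallest weight */
        int j = argmin(w, n, i);      /* second smallest weight */
        int combined = w[i] + w[j];
        printf("combine %d + %d -> %d\n", w[i], w[j], combined);
        wepl += combined;             /* each merge adds its weight to the sum */
        w[i] = combined;              /* keep the combined node ... */
        w[j] = w[--n];                /* ... and delete the other one */
    }
    printf("weighted external path length = %d\n", wepl);   /* prints 93 */
    return 0;
}
```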
Analysis of the Huffman tree algorithm:
Heap initialization takes O(n) time.
Each push and pop requires only O(log n) time.
Therefore, the asymptotic time for the algorithm is O(n log n).