HASH SORT: A LINEAR TIME COMPLEXITY

MULTIPLE-DIMENSIONAL SORT ALGORITHM



ORIGINALLY ENTITLED ”MAKING A HASH OF SORTS”
WILLIAM F. GILREATH
( [email protected] )
arXiv:cs/0408040v1 [cs.DS] 17 Aug 2004

Abstract.

Sorting and hashing are two completely different concepts in computer science, and appear
mutually exclusive to one another. Hashing is a search method using the data as a key to map to the
location within memory, and is used for rapid storage and retrieval. Sorting is a process of organizing
data from a random permutation into an ordered arrangement, and is a common activity performed
frequently in a variety of applications.

Almost all conventional sorting algorithms work by comparison, and in doing so have a linearithmic greatest lower bound on the algorithmic time complexity. Any improvement in the theoretical time complexity of a sorting algorithm can result in overall larger gains in implementation performance. A gain in algorithmic performance leads to much larger gains in speed for the application that uses the sort algorithm. To exceed the linearithmic time complexity boundary on algorithmic performance, such a sort algorithm needs to use an alternative method to comparison for ordering the data.
The hash sort is a general purpose non-comparison based sorting algorithm by hashing, which has
some interesting features not found in conventional sorting algorithms. The hash sort asymptotically
outperforms the fastest traditional sorting algorithm, the quick sort. The hash sort algorithm has a
linear time complexity factor – even in the worst case. The hash sort opens an area for further work
and investigation into alternative means of sorting.

1. Theory.
1.1. Sorting.
Sorting is a common processing activity for computers to perform. Sorted data is arranged so that data items have increasing value – ascending order – or decreasing value – descending order. Regardless of the arrangement form, sorting establishes an ordering of the data. The arrangement of data from some random configuration into an ordered one is often necessary in many algorithms, applications, and programs. The need for sorting to arrange and order data makes the space and temporal complexity of the algorithm used paramount.

A bad choice of sorting algorithm by a designer or programmer can result in mediocre performance in the end. With the large need for sorting, and the important concern of performance, many different types and kinds of algorithms for sorting have been devised. Some algorithms, such as quick sort and bubble sort, are very widespread and often used. Other algorithms, such as bin sort and pigeonhole sort, are not as widely known.

The plethora of sorting algorithms available does not change the paramount question of how fast it is possible to sort. This is a question of temporal efficiency, and is the most significant criterion for a sort algorithm. Along with it, what affects the temporal efficiency and why is just as important a concern. Still another important concern, though more subtle, is the space requirements for using the algorithm – space efficiency. This is a matter of algorithm overhead and resource requirements involved in the sorting process.

∗ Special thanks and appreciation to Dr. Michael Mascagni for the opportunity to present and publish this paper.

For the most part, the temporal efficiency of sort algorithms, and the causes for changes in it, are well understood. This limit or barrier to sorting speed is a greatest lower bound of O(N log N). In essence, the reasoning behind this is that sorting is a comparative or decision-making process over N items, which forms a binary decision tree of depth log N. For each data item, a decision must be made about where to move it to maintain the desired ordering property. With N data items, and log N decision time, the minimum time possible is the product, N log N. This lower bound is rarely reached in practice; implementations are usually a multiplied or added constant away from the theoretical minimum.
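A standard way to make this decision-tree argument precise (a sketch added here, stated in the usual textbook form): a comparison sort must distinguish all N! possible input orderings, so its decision tree has at least N! leaves and hence depth at least

\[
\log_2(N!) \;\ge\; \frac{N}{2}\,\log_2\frac{N}{2} \;=\; \Omega(N \log N).
\]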

There is no theoretical greatest lower bound for space efficiency, and this complexity measure often characterizes a sort algorithm. The space requirements are highly dependent on the underlying method used in the sort algorithm, so space efficiency directly reflects this. While no theoretical limit is available, an optimal sort algorithm will require N + 1 storage: N data items, and one additional storage location used by the sort algorithm. The bubble sort algorithm has this optimal storage efficiency. However, optimal space efficiency is subordinate to temporal efficiency in a sorting algorithm. The bubble sort, while having optimal space efficiency, is well reputed to be very poor in performance, far above the theoretical lower bound.

1.2. Hashing.
Hashing is a process of searching through data, using each data item itself as a
key in the search. Hashing does not order the data items with respect to each other.
Hashing organizes the data according to a mapping function which is used to hash
each data item. Hashing is an efficient method to organize and search data quickly.
The temporal efficiency or speed of the hashing process is determined by the hash
function and its complexity.

Hash functions are mathematically based, and a common hash function uses the
remainder mod operation to map data items. Other hash functions are based on
mathematical formulas and equations. The construction of a hash function is usu-
ally from multiplicative, divisional, additive operations, or some mix of them. The
choice of hash function follows from the data items and involves temporal and space
organization compromises.

Hashing is not a direct mathematical mapping of the data items into a spatial organization. Hash functions usually exhibit the phenomenon of a hash collision, or clash, where two data items map to the same spatial location. This is detrimental to the hash function, and many techniques for handling hash collisions exist. Hash collisions introduce temporal inefficiency into a hash algorithm, as the handling of a collision represents additional time to re-map the data item.

A special type of hash function, a perfect hash function, exists and is perfect in the sense that it will not have collisions. These types of perfect hash functions are often available only for narrow applications, for example the reserved words in a compiler symbol table. Perfect hash functions usually involve restricted data items and do not fit into a general-purpose hash method. While perfect hashing is possible, often a regular hash function is used with data in which the number of collisions is few or the collision resolution method is simple.
2. Algorithm.
2.1. Super-Hash Function.
The heart of the hash sort algorithm is the concept of combining hashing along
with sorting. This key concept is embodied in the notion of a super-hash function.
A super-hash function is ”super” in that it is a hash function that has a consistent
ordering, and is a perfect hash function. A hash function of this type will order
the data by hashing, but maintain a consistent ordering and not scatter the data.
This hash function is perfect so that as the data is ordered it is uniquely placed when
hashed, so that no data items ambiguously share a location in the ordered data. Along
with that, the necessity for hash collision resolution can be avoided altogether.

The super-hash function is a generalized function, in that it is mathematically extendible, and is not a specialized type of hash function. The super-hash function operates on a data set of positive integer values. With a super-hash function, the restriction on the data set as the domain is that it is within a bounded range, between a minimum and maximum value. The only other restriction is that each integer value be unique – no duplicate data values. This highly restricted set of data values as positive integers is necessary to build a preliminary super-hash function. Once a mathematically proven super-hash function is formulated, then other less restricted data sets can be explored.

A super-hash function as described, and within these parameters, is not a complex or rigorous function to define. A super-hash function uses the standard hash function based on the modulus or residue operator. This operator, integer remainder, is called mod and is a common hash function in the form (x mod n). The other part of the super-hash function is another hash function called a mash function, for modified hash or magnitude hash. The mash function uses the integer division operator. This operator, called div, is used in a form similar to the hash, (x div n). Both of these functions, the mash function and the hash function, together form the super-hash function. This super-hash function is mathematically based and extensible, and is also a perfect hash function. For the super-hash function to be perfect, it must be an injective mapping.

The super-hash function works using a combination of a regular hash function and the mash function. Together both of these functions make a super-hash function, but not as the composition of the two. Both the hash function and the mash function are sub-functions of a super-hash function. The regular hash function (x mod n) works using the remainder or residues of values. So numbers are of the form c · x + r, where r is the remainder obtained using the regular hash function. When hashing by a value n, the resulting hashes map onto the range 0 to n − 1. In essence, a set of values is formed so that each value in the set is {c · x + 0, c · x + 1, . . . , c · x + (n − 2), c · x + (n − 1)}.

A hash function provides some distinction among the values, using the remainder or residue of the value. However, regular hashing experiences collisions, where values hash to the same result. The problem is that values of the same remainder are indistinguishable to the regular hash function. Given a particular remainder r, all values which are multiples of c are equivalent, so a set of the form {c1 · x + r1, c2 · x + r1, . . . , cn−1 · x + r1, cn · x + r1} is formed. So for n = 10, r = 1 the following set of values are equivalent under the regular hash function: {1, 11, 21, 31, . . . , 1001, 10001, c · x + 1}. It is the equivalence of the values under regular hashing which causes collisions, as values map to the same hash result. There is no distinction among larger and smaller values within the same hash result. Some partitioning or separation of the values is obtained, but what is needed is another hash function to distinguish the hash values further by their magnitude relative to one another in the same set.

This is what the mash function is for, a magnitude hash. A mash function is of the same form as a regular hash function, only it uses div rather than mod, as (x div n). The div operator on a value of the form c · x + r gives the value of c, where x is the base (usually decimal, base 10). The mash function maps values into a set where the values mash to the same result, based upon the magnitude. So the mash function shares the same problem as a regular hash, that all values are mapped into an equivalent set. So a set of the form {c1 · x + r1, c1 · x + r2, . . . , c1 · x + rn−1, c1 · x + rn} has all values mashed to the same result. With n = 10, c = 3 the following set of values are equivalent under the mash function: {30, 31, 32, 33, 34, 35, 36, 37, 38, 39}. With the mash function, some partitioning of the values is obtained, but there is no distinction among those values which is unique to each value.

Together, however, a hash function and a mash function can both distinguish values, by magnitude and by residue. The minimal form of this association between the two functions is an ordinal pair of the form (c, r), where c is the multiple of the base obtained with the mash function, and r is the remainder of the value obtained with the hash function. Essentially an ordinal pair forms a unique representation of the value, using the magnitude constant and the residue. Further, each pair distinguishes larger and smaller values using the mash result, and equal magnitudes are distinguished from each other using the residue. So the mapping is a perfect hash, as all values are uniquely mapped to a result, and is ordering, since the magnitude of the values is preserved in the mapping. A formal proof of this property, an injective mapping from one set to a resulting set, is given.
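In compressed form, the injectivity can be sketched from the division algorithm (this is a sketch of the argument, not the paper's formal proof): the ordinal pair determines the value uniquely, since

\[
x \;=\; n \cdot (x \,\operatorname{div}\, n) \;+\; (x \bmod n), \qquad 0 \le (x \bmod n) < n,
\]

so two values producing the same pair (c, r) both equal n · c + r and are therefore the same value; distinct values must differ in the mash result or in the hash result.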

The values for n in the hash function and mash function are determined by the
range of values involved. The proof gives the validation of the injective nature of the
mapping, and the mathematical properties of the parameters used, but no general
guidelines for determining the parameters. The mathematical proof only indicates
that the same mapping value must be used in the hash and mash functions.

Multiple iterations of mapping functions can be used, so multiple mapping values can be used to form a set of mapping pairs. The choice of the mapping values depends on the number of dimensions to be mapped into, and the range of values that the mapping pairs can take on. Numerous, smaller mapping values will produce large mapping ordinates for the mash function, and small values for the hash function. This would be a diverse mix of values, but it depends upon the use of the hash sort and what is desired by the user of the algorithm.

The organization of the data elements in the matrix can be of row-major or column-major order. The mapping into a column and row by the hash and mash functions determines whether the matrix mapping is row major or column major. A row-major order mapping would have the rows mapped by the mash function, and the columns mapped by the hash function. A column-major order mapping would interchange the mash and hash functions for the rows and columns respectively.
2.2. Construction of the Super-Hash Function.
2.2.1. Method of Construction. The super-hash function consists of a regular
hash function using the mod operator, and a mash function using the div operator.
Together these form ordinal pairs which uniquely map the data element into a square
matrix. The important component of the super-hash function to determine is the
mapping constant Θ.

To determine Θ, it must be calculated from the range of the values that are to
be mapped using the super-hash function. The range determines the dimensionality
or size of the matrix, which is square for this super-hash function.

Given a range R[i, j] where i is the lower-bound, and j is the upper-bound, then:

1. Compute the length L of the range where L = (j − i) + 1 .


2. Determine the nearest square integer to L. The nearest square is calculated by:

Θ = ⌈√L⌉

The final value computed is the nearest square to L, which is the length of the
range of values. In essence the values are being constructed in the form of a number
which is tailored to the length of the range of values.

value( dx , mx ) = dx · Θ + mx

where Θ is determined by the range of the values to be mapped.


2.2.2. Example of Constructing a Super-Hash Function. As an example,
suppose you have a range of values from 13 to 123. The length of the range is 123 -
13 + 1. The length of this range is 111.

The nearest square is then calculated as follows:



Θ = ⌈√111⌉

which evaluates as:

Θ = ⌈10.53565375 . . .⌉

Θ = 11

Part of the mapping involves subtracting the lower bound of the range, so that
all the values are initially mapped to the range 0..(j − i). So the super-hash function
for this range of data values is:

F(x) ≡ (d, m), where d = (x − 13) div 11 and m = (x − 13) mod 11

For the lowest value, 13 maps to (0,0). The largest value 123 maps to (10,0).

Reconstructing the values from the ordinal pairs is of the form:

value( dx , mx ) = dx · Θ + mx + i
so that for (0,0)

value( 0, 0) = (0 · 11 + 0) + 13 = 13

and for (10,0)

value( 10, 0) = (10 · 11 + 0) + 13 = 123

One final point is that constructing a super-hash function requires knowledge of the range of possible values of the data elements. All data types in computer languages have a specific range of values for the different types, such as int, float, unsigned int, long, to list types from the C programming language. While these types are often used without needing to know the range of values, such a property does exist on the values and variables defined of that particular type in a program.
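As a concrete sketch of the construction above (the helper names compute_theta, superhash, and unhash are mine, not the paper's), the super-hash for a bounded range can be expressed in C:

#include <math.h>

/* Compute the mapping constant theta = ceiling of the square root of the
   range length, as described in section 2.2.1. */
int compute_theta(int lower, int upper)
{
    int length = (upper - lower) + 1;
    return (int)ceil(sqrt((double)length));
}

/* Super-hash: map a value x in [lower, upper] to an ordinal pair (d, m). */
void superhash(int x, int lower, int theta, int *d, int *m)
{
    *d = (x - lower) / theta;   /* mash sub-function: magnitude hash (div) */
    *m = (x - lower) % theta;   /* hash sub-function: residue (mod) */
}

/* Reconstruct the original value from its ordinal pair. */
int unhash(int d, int m, int lower, int theta)
{
    return d * theta + m + lower;
}

For the worked example, compute_theta(13, 123) yields 11, superhash(123, 13, 11, &d, &m) yields the pair (10, 0), and unhash(10, 0, 13, 11) recovers 123.
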
2.3. In-situ Hash Sort Algorithm.
The hash sort algorithm uses the super-hash function iteratively on an entire data set within the range of the super-hash function. This is the in-situ version of the hash sort, which works ”in site” – in place. Before the iterative process, an initialization is performed. A source value is retrieved, and is mapped by the super-hash function to another location. At the destination location, the destination value is exchanged with the source one, which is stored at that location. The exchanged value now becomes the source value. The algorithm then repeats iteratively for a source and destination value, and terminates at the end of the list.

Pseudo-code illustrating a very generalized form of the in-situ hash sort is:

(m1 , m2 , · · · , mn−1 , mn ) ←− initialize;

WHILE NOT ( end of list ) DO

temp ←− get(m1 , m2 , · · · , mn−1 , mn );

value −→ put(m1 , m2 , · · · , mn−1 , mn );

value = temp ;

( m1 , m2 , · · · , mn−1 , mn ) ←− superhash(temp);

END WHILE ;

Pascal Version:

Procedure hash_sort(var list: data_list; n: integer);
(* data_list is assumed to be a two-dimensional array type,
   e.g. array[0..D-1, 0..D-1] of integer *)

Const
  D = 10;  (* D is the dimension size for a 10 x 10 matrix *)

Var
  x_c, h_c: integer;  (* exchange count and hysteresis count *)
  where: integer;
  value: integer;

Begin
  x_c := 0;    (* set the counts to zero *)
  h_c := 0;
  where := 0;  (* set the initial starting point to zero *)

  (* loop until the exchange count or the hysteresis count reaches the data size *)
  while (x_c < n) and (h_c < n) do
  begin
    value := list[where div D, where mod D];  (* get a value *)
    if value = where then                     (* check for hysteresis *)
    begin
      where := where + 1;  (* on hysteresis move where to the next position *)
      h_c := h_c + 1;      (* on hysteresis increment the hysteresis count *)
    end
    else
    begin
      (* if no hysteresis, swap values and increment the exchange count *)
      list[where div D, where mod D] := list[value div D, value mod D];
      list[value div D, value mod D] := value;
      x_c := x_c + 1;
    end;
  end;
End;

C version:

#define DIM 10 /* dimension size for the matrix */

void hash_sort(int list[DIM][DIM], int n){

    int x_c = 0, h_c = 0;   /* exchange count and hysteresis count */
    int where = 0;
    int value;

    /* loop until the exchange count or the hysteresis count reaches n */
    while( (x_c < n) && (h_c < n) ){

        value = list[where / DIM][where % DIM];
        if (value == where){
            where++;  /* on hysteresis move where to the next position */
            h_c++;    /* on hysteresis increment the hysteresis count */
        } else {
            /* if no hysteresis, swap values and increment the exchange count */
            list[where / DIM][where % DIM] = list[value / DIM][value % DIM];
            list[value / DIM][value % DIM] = value;
            x_c++;
        }
    }
}
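
A brief usage sketch (mine, not from the paper): the routine above performs no range offset, so it assumes the values are exactly the set 0 .. DIM·DIM − 1, each value equal to its row-major index once sorted.

#include <stdio.h>

int main(void)
{
    int matrix[DIM][DIM];
    int i, j;

    /* fill the matrix with the values 99 down to 0 -- a fully reversed list */
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            matrix[i][j] = (DIM * DIM - 1) - (i * DIM + j);

    hash_sort(matrix, DIM * DIM);

    /* print the sorted matrix in row-major order */
    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++)
            printf("%3d ", matrix[i][j]);
        printf("\n");
    }
    return 0;
}
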
Generic pseudo-code of the operations involved in the in-situ version of the algorithm:

PROCEDURE HASH_SORT;

    Initialize variables;

    WHILE ( Counts < Data Size ) DO

        Mathematically process value;
        Compute destination in array;

        IF hysteresis THEN
            Move where to next location;
            Increment hysteresis count;
        ELSE
            Exchange value with destination;
            Increment exchange count;
        END IF;

    END WHILE;

END PROCEDURE;

The hash sort is a simple algorithm; the complexity is in the super-hash function, which is the key to the algorithm. The remaining components of the algorithm handle the exchange of values, and the continuous iteration until a termination condition is met.

The hash sort is a linear time complexity sorting algorithm. This linear time
complexity stems from the mapping nature of the algorithm. The hash sort uses
the intrinsic nature of the values to map them in order. The mapping function is
applied repeatedly in an iterative manner until the entire list of values is ordered.
The mathematical expression for the run-time complexity is:

F (time) = 2 · c · N

where c ≥ 1 , and N is the size of the list of data values.

The function for the run-time complexity is derived from the complexity of the mapping function, and the iterative nature of the algorithm. The mapping function is a composite function, with two sub-functions. Multiple applications of the mapping function are possible because of the extendible nature of the hash sort, but at least one mapping is required. Hence, the constant c is a positive integer value, which represents the number of sub-mappings within the mapping function.

The mapping function uses two sub-functions as a composite, so the overall time for the mapping is the product of two multiplied by the number of sub-mappings. This product is always greater than one, and corresponds to the dimension of the hash sort. The constant c is not dependent on the data values or the size of the data values; it is an implementation constant, and once the constant is chosen it is unaltered.

The mapping function must be applied iteratively to the range of values forming the data list. This makes the overall time complexity of the hash sort the product of the complexity of the mapping function multiplied by the size of the range of the list of values. The hash sort remains multiple-dimensional, and linear in time complexity. The time complexity of the hash sort is then:

F (time) = O(N )

The space complexity of the hash sort considers the storage of the data values, not the variables used in the mapping process. The space complexity of the hash sort is dependent upon which variant of the hash sort is used. The in-situ hash sort, which uses the same multi-dimensional data structure as it maps the values from the initial location to the final location, requires (N + 1) storage space. The data structure is the size of the range of values, which is N. One additional storage space is required as a temporary location as values are exchanged when mapped. The space complexity for the in-situ hash sort is:

F (space) = O(N )

The direct hash sort maps from a one-dimensional data structure to a multi-dimensional data structure of the same size. No temporary storage is required, as no values are exchanged in the mapping process; the values are directly mapped to their final location within the multi-dimensional data structure. The two data structures are organized differently, but are of the same size N. Thus, the direct hash sort requires 2 · N in terms of space complexity. The space complexity for the direct hash sort is:

F (space) = O(N )

Each variation of the hash sort has different properties, but the space complexity
is (N + 1) for the in-situ hash sort, and 2 · N for the direct hash sort. For each
variation, the space complexity is linearly related to the size of the list of data values.

The hash sort asymptotically outperforms conventional sorting algorithms (such as quick sort), which have N log N time complexity. This is readily apparent from a simple inequality involving the ratio of the two algorithms. As the size of the data N increases without bound, the ratio between the hash sort and the quick sort should be less than one.

If the ratio is greater than one, the time complexity of the hash sort is greater than that of the quick sort. If the ratio is exactly one then the two sort algorithms perform with the same complexity. A ratio of less than one indicates the time complexity of the hash sort is less than that of the quick sort, therefore outperforming it.
lim_{N → +∞} (2 · k · N + c) / (N · log N) < 1.0

which simplifies to:

lim_{N → +∞} (2 · k + c/N) / log N < 1.0

taking the limit as N → +∞, the ratio then becomes:

(2 · k + c/∞) / log ∞ < 1.0

which simplifies to:

2 · k / ∞ < 1.0

which then reduces to:

0.0 < 1.0

showing that the ratio of the two algorithms is indeed less than one.

Therefore this means the hash sort will asymptotically outperform the quick sort.
2.4. Variations of the Hash Sort. There are two primary versions of the hash sort, the in-situ hash sort and the direct hash sort. Variations upon the hash sort build upon these two primary types. The in-situ hash sort is the basic form of the hash sort, which works by exchanging values as explained previously. The in-situ hash sort has a problem which can increase its time complexity, but the algorithm remains linear. The problem arises with data values that map to their current location – in essence, the data value is already in site, where it belongs. Since the in-situ hash sort determines where a value belongs and then exchanges, such a value would cause the algorithm to stall. To remedy this, another iterative mechanism keeps the in-situ hash sort going by relocating the current location to the next one. When the current location and destination location are the same, the in-situ hash sort has to be forced to proceed.

This forcing of the in-situ hash sort does add more time complexity, but as a linear multiplied constant. The number of data elements that map back to the current location they are at is the amount of hysteresis present in the data. The term hysteresis is borrowed from electrical theory, meaning to lag; hysteresis in the in-situ hash sort causes a lag, which the algorithm must be forced out of. The worst case for hysteresis is that all of the data elements map to the location they are at. In this case, the in-situ hash sort would have to be pushed along until it is through the data elements. In this worst case, the time of the algorithm becomes double the linear time complexity, increasing the time, but linearly, to 2 · (2 · k) · N, or 4 · k · N. The in-situ hash sort is more space efficient, using only the space to store the data elements, and a temporary storage location for sorting the data elements, or N + 1.

The direct hash sort is a variation upon the in-situ hash sort that avoids the problem with hysteresis. The direct hash sort uses a temporary data array to hold the data elements. The data elements are then mapped from the single-dimension array into the multiple-dimension array. No element can map to its current location, as it is being mapped from a one-dimensional array into a multiple-dimensional one. However, the storage requirements for the direct hash sort require twice the storage, or 2 · N. The tradeoff for time efficiency is a worsening of the space efficiency. The time complexity is 2 · k · N, as the problem with hysteresis never surfaces.

The variations in the hash sort can be applied to the two primary forms, the in-situ and the direct hash sort. The variations are in dimensionality, and in relaxing a restriction imposed upon the data set that is sorted. The version of the hash sort described is two-dimensional, but the hash sort can be of any dimension d, where d ≥ 2. The mapping by hashing would then be multiple applications of the hash scheme. For a d-dimensional hash sort, the time will be of the form 2 · k · N, where k is the dimensionality of the hashing structure. Note that k is an implementation constant, so once decided upon, it remains a constant multiple of 2. Either primary type of hash sort algorithm, the in-situ hash sort or the direct hash sort, can be extended into higher dimensionality.

The other variation upon the primary forms of the hash sort concerns the restriction of unique data values. This restriction upon the data set can be relaxed, but with each location in the hash structure a count is required of the number of data elements that hash to that location. For large amounts of values in the data set, the hash sort will ”compress” them into the hash structure. When the original sorted data set is required, each data element would be enumerated by the number of hashes to its location. The flexibility of the hash sort to accommodate non-unique data values is inefficient if there are few repeated data values. In such a case, nearly half of the hash structure will be used only for single data values.

Both the in-situ hash sort and the direct hash sort have another problem, which is inherent in either variant of the hash sort algorithm. If the data values within the range are not all present, then the data set is sparse. The range is used to determine the hash structure size, and the time to run the algorithm. The hash sort presumes all data values in the data set within the range are present. If not, then the hash sort will still proceed sorting for the range size, not the data set size. So for the data set size Nd and the range size Nr, if Nd ≥ Nr, the hash sort algorithm performs as expected, or better.

If Nd < Nr, then the sparsity problem surfaces. The hash sort will sort on a range of values, of which some are non-existent. The smaller data set size relative to the range size will not be reflected in the time complexity of the hash sort, or in the required hash structure; so when the cardinality of the data set and the cardinality of the range of values within the data set are inconsistent, the hash sort does not fail, but it becomes inefficient. The empty spaces within the hash structure then must have some sentinel value outside the data range to distinguish them from actual data values. The direct hash sort alleviates the sparsity problem somewhat for the hash sort time, but is still inefficient in the hash structure, as some of the locations will be blank or empty.
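
A minimal sketch (my own, assuming non-negative data so that -1 can serve as the sentinel, and reusing the DIM constant from the C code below) of preparing the hash structure for a sparse range:

#define EMPTY (-1)   /* assumed sentinel value, outside the data range */

/* Mark every slot of the hash structure as empty before mapping a sparse
   data set, so unused slots can be recognized and skipped afterwards. */
void init_matrix(int matrix[DIM][DIM])
{
    int i, j;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            matrix[i][j] = EMPTY;
}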

Code for the direct hash sort algorithm is:

C version:

#define DIM 10 /* dimension size of the matrix */

typedef struct tag {
    int count;
    int value;
} element;

element M[DIM][DIM];  /* hash structure; static storage, so counts start at zero */

void hash_sort(int L[], element M[][DIM], int size)
{
    int value, x;
    int row, col;

    for (x = 0; x < size; x++)
    {
        /* get the value and determine its row, column location */
        value = L[x];
        row = value % DIM;
        col = value / DIM;

        if (M[col][row].count > 0 && M[col][row].value == value)
            /* if the value is already here, increment the count */
            M[col][row].count++;
        else {
            /* store the value for the first time, initialize the count to 1 */
            M[col][row].count = 1;
            M[col][row].value = value;
        }
    }
}

Pascal version:

const
  d = 10;
  size = 100;

type
  tag = record
    count: integer;
    value: integer;
  end;

  matrix = array[0..d-1, 0..d-1] of tag;
  list = array[1..size] of integer;

procedure hash_sort(var M: matrix; L: list; n: integer);
var
  value: integer;
  row, col: integer;
  x: integer;
begin
  (* go through the list and map the values into the matrix;
     the matrix counts are assumed to start at zero *)
  for x := 1 to n do
  begin
    (* get the current value, determine its row and column location *)
    value := L[x];
    col := value div d;
    row := value mod d;

    if (M[col, row].count > 0) and (M[col, row].value = value) then
      (* if the value is already here, increment the count *)
      M[col, row].count := M[col, row].count + 1
    else
    begin
      (* if it maps for the first time, store the value, initialize the count to 1 *)
      M[col, row].count := 1;
      M[col, row].value := value;
    end; (*if*)
  end; (*for*)
end; (*hash_sort*)

2.5. Differences between Hash Sort Variants.

2.5.1. Distinction of Concepts. There are three distinct concepts involved in the hash sort which need to be distinguished from one another. Each has an important significance relating to the hash sort, and is explained as it is identified.

These three concepts are:

1. Number of data elements N
2. Range of the data values from a lower to upper bound R
3. Square matrix which is the data structure M

The number of data elements N is the total data size to be mapped by the hash sort. The number of data elements determines the time complexity of the hash sort, the amount of time being linearly proportional to the number of data elements.

For the in-situ hash sort, the number of data elements is less than or equal to the size of the square matrix M. For the direct hash sort, the number of elements can be of any size. The only requirement is that the data elements fall within the range R.

The square matrix M is constructed around the range R of the data values so that
all possible values within the range R do map. The super-hash function maps data
elements within the range R into the square matrix M . Therefore, the storage space
M is formed from the range R of the data values, not the number of data elements
N.

The time complexity is dependent upon the number of data elements, the size N. The in-situ hash sort can have fewer elements N than the matrix capacity M, but the hash sort algorithm must go through the entire matrix M. Hence sparsity of data values within the matrix is highly inefficient. The direct hash sort is different, since the mapping is from a list into the matrix M. The time complexity is again linear in the size N of the list of elements.

In both cases, there is linear time complexity; only with the in-situ variant, it depends more upon the size of the matrix M than on the amount of data in the matrix. Thus the time complexity for the in-situ variant is more dependent upon the range of values than on the number of them. No more data elements can be stored in the matrix than its size M permits, which is related to the range R.
2.5.2. Analysis of Hash Sort Variants. Depending upon the variant used,
the hash sort can have a time complexity linearly proportional to either the size of the
data structure which is a matrix M , or the size of the data list L. In each variant of
the hash sort, the linear nature of the algorithm is proportionate to a greatest lower
bound.

The in-situ variant of the hash sort is linearly proportional to the size of the data structure, the matrix M. The matrix M has a size determined in part by the super-hash function. The size of the matrix places a least upper bound on the possible size of the list L, and in doing so forms a least upper bound of O(M) time complexity. This least upper bound constraint stems from the fact that the data list L and the data structure, the matrix M, are the same entity.

The direct variant of the hash sort is linearly proportional to the size of the data list L, as the data structure, the matrix M, and the list L are separate entities. The time complexity is independent of the data structure in the direct hash sort. This independence of entities permits a greatest lower bound dependent on the size of the data list L and not on the data structure, the matrix M.

For each variant of the hash sort, the time complexity of the algorithm is linear. The linearity is constrained in relation to two separate determining factors for each variant of the hash sort. In one variant, the in-situ version of the hash sort, the time complexity has a least upper bound determined by the matrix M. The size of the data list L in M can be less than, but no more than, the size of the matrix M, because the data list L and the data structure M are the same entity. The direct hash sort variant separates these two entities and in doing so the constraint is a greatest lower bound dependent upon the size of the data list L.

A table summarizing these distinctions is given below:

Variant    Big-Oh    Constraint              Comment
In-situ    O(M)      Data Structure Size     O(N) ≤ O(M)
Direct     O(N)      Data List Size          0 ≤ O(N)

2.5.3. Example Walk-through of Hash Sort Variants.

In-situ hash sort example with 2-dimensional 3 x 3 matrix

Matrix:

m = 0 1 2
d
0 * * *
1 * * *
2 * * *

The values in the matrix must be mapped to the range 0 .. 8; this range of values forms an ordinal pair of the form (d, m) from (0,0) to (2,2). Values in the matrix increase from left to right, m = 0 .. 2, and top to bottom, d = 0 .. 2. This is how the ordering should place values once they are sorted.

Any range of values can be used, but for simplicity, the range of values in the
matrix is from 1 to 9. The hash sort algorithm used will be in-situ (or in site) so the
problem of hysteresis is present. Hysteresis is the occurrence where a value is in its
correct location, so the algorithm maps it back on to itself.

The hash sort will involve starting at some initial point in the matrix, then traversing the matrix and mapping each value to its correct location. As each value is mapped, order is preserved (a larger value will map to a ”higher” position than a ”lower” one), and there are no collisions (excepting self-collisions, which are hysteresis). The values are mapped by subtracting one, then applying the mod and div operators.

The value at the where position – the one currently being handled – is shown in parentheses; the computed destination is shown in brackets. The value used in the computation is given, along with the computed ordinal pair. A before and after illustration shows the exchange of the two values which is computed, then done by the hash sort.

The super-hash function for this data set is:

d = (x − 1) div 3 ; m = (x − 1) mod 3

m = 0 1 2
d
0 5 8 1
1 9 7 2
2 4 6 3

Matrix Initial Configuration

m = 0 1 2 m = 0 1 2
d d
0 (5) 8 1 0 (7) 8 1
1 9 [7] 2 1 9 5 2
2 4 6 3 2 4 6 3

Before After

The start position is initially at (0,0). The value is 5, subtracting 1 is 4. The mapping
(illustrated once) is d = (4 div 3) = 1, m = (4 mod 3) = 1. Thus the computed
destination for where the value goes is (1,1).

m = 0 1 2 m = 0 1 2
d d
0 (7) 8 1 0 (4) 8 1
1 9 5 2 1 9 5 2
2 [4] 6 3 2 7 6 3

Before After

The next value is 7, subtracting 1 is 6. The computed destination for where the value
goes is (2,0).

m = 0 1 2 m = 0 1 2
d d
0 (4) 8 1 0 (9) 8 1
1 [9] 5 2 1 4 5 2
2 7 6 3 2 7 6 3

Before After

The next value is 4, subtracting 1 is 3. The computed destination for where the value
goes is (1,0).

m = 0 1 2 m = 0 1 2
d d
0 (9) 8 1 0 (3) 8 1
1 4 5 2 1 4 5 2
2 7 6 [3] 2 7 6 9

Before After

The next value is 9, subtracting 1 is 8. The computed destination for where the value
goes is (2,2).

m = 0 1 2 m = 0 1 2
d d
0 (3) 8 [1] 0 (1) 8 3
1 4 5 2 1 4 5 2
2 7 6 9 2 7 6 9

Before After

The next value is 3, subtracting 1 is 2. The computed destination for where the value
goes is (0,2).

m = 0 1 2 m = 0 1 2
d d
0 <1> 8 3 0 1 (8) 3
1 4 5 2 1 4 5 2
2 7 6 9 2 7 6 9

Before After

The next value is 1, subtracting 1 is 0. The computed destination for where the value
goes is (0,0).

Here is an example of hysteresis noted by the angular brackets; the start and final
positions are equal, so the value has mapped back onto itself. The start position is
’forced’ to the next position or the algorithm would be stuck in an infinite loop and
so would ”stall”.

m = 0 1 2 m = 0 1 2
d d
0 1 (8) 3 0 1 (6) 3
1 4 5 2 1 4 5 2
2 7 [6] 9 2 7 8 9

Before After

The next value is 8, subtracting 1 is 7. The computed destination for where the value
goes is (2,1).

m = 0 1 2 m = 0 1 2
d d
0 1 (6) 3 0 1 (2) 3
1 4 5 [2] 1 4 5 6
2 7 8 9 2 7 8 9

Before After

The next value is 6, subtracting 1 is 5. The computed destination for where the value
goes is (1,2).

m = 0 1 2 m = 0 1 2
d d
0 1 <2> 3 0 1 2 3
1 4 5 2 1 4 5 6
2 7 8 9 2 7 8 9

Before After

The next value is 2, subtracting 1 is 1. The computed destination for where the value
goes is (0,1).

Here is another example of hysteresis; the start and final positions are equal, so the
value has mapped back onto itself. The start position is ’forced’ to the next position
or the algorithm would remain stuck.
The matrix is now sorted, but the algorithm has no way of knowing this. It will continue through the values 3 to 9 until the hysteresis count equals the size of the data N. The remaining sorted data has caused ”hysteresis” by causing the algorithm to lag in a sense, getting tripped up over already sorted data. In an ideal arrangement of the data, no hysteresis would occur; then the algorithm knows it has finished sorting when the exchange count equals the size of the data N. But unfortunately, this is an ideal situation which will most likely never occur in practice.

Direct Hash Sort Example with 2 x 2 Matrix

Matrix

m = 0 1
d
0 <*,*> <*,*>

1 <*,*> <*,*>

The values of the data set are from 7 to 10. A list L of size 8 of data elements to be mapped will be used. Each location in the matrix holds a value and an associated count of the number of data elements mapped to that location. Once all the values from the list L are mapped, the hash sort algorithm will terminate.

The hash function is:

d = (x − 7) div 2 ; m = (x − 7) mod 2

As each value is mapped, a brief explanation of the process and what happens to the matrix will be given. As the list L is mapped, its representation will become progressively smaller through the walk-through. Once the list size is zero, the hash sort will be complete. The left-most element in the list L will be the one being mapped by the hash sort algorithm.

L = { 7, 8, 7, 9, 10, 7, 8, 8 }

Matrix

m = 0 1
d
0 <*,*> <*,*>

1 <*,*> <*,*>

Matrix Initial Configuration


Step 0

The first value is 7. The hash sort maps the value to the location in the matrix
as d = (7 -7) div 2, m = (7-7) mod 2, which is (0,0)

L = { 8, 7, 9, 10, 7, 8, 8 }

Matrix

m = 0 1
d
0 <7,1> <*,*>

1 <*,*> <*,*>

Step 1

The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).

L = { 7, 9, 10, 7, 8, 8 }

Matrix

m = 0 1
d
0 <7,1> <8,1>

1 <*,*> <*,*>

Step 2

The next value is 7. The hash sort maps the value to the location in the matrix
as d = (7-7) div 2, m = (7-7) mod 2, which is (0,0).

L = { 9, 10, 7, 8, 8 }

Matrix

m = 0 1
d
0 <7,2> <8,1>

1 <*,*> <*,*>

Step 3

The next value is 9. The hash sort maps the value to the location in the matrix
as d = (9-7) div 2, m = (9-7) mod 2, which is (1,0)

L = { 10, 7, 8, 8 }
Matrix

m = 0 1
d
0 <7,2> <8,1>

1 <9,1> <*,*>

Step 4

The next value is 10. The hash sort maps the value to the location in the matrix
as d = (10-7) div 2, m = (10-7) mod 2, which is (1,1).

L = { 7, 8, 8 }

Matrix

m = 0 1
d
0 <7,2> <8,1>

1 <9,1> <10,1>

Step 5

The next value is 7. The hash sort maps the value to the location in the matrix
as d = (7-7) div 2, m = (7-7) mod 2, which is (0,0).

L = { 8, 8 }

Matrix

m = 0 1
d
0 <7,3> <8,1>

1 <9,1> <10,1>

Step 6

The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).

L = { 8 }

Matrix

m = 0 1
d
0 <7,3> <8,2>

1 <9,1> <10,1>

Step 7

The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).

L = { }

Matrix

m = 0 1
d
0 <7,3> <8,3>

1 <9,1> <10,1>

Matrix Final Configuration

All 8 data elements have been mapped by the hash sort into the matrix. There is no hysteresis with the direct hash sort, as all data elements are mapped by value into the appropriate location. Moreover, the time of the direct hash sort is linearly proportional to the size of the data list L.
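
A short sketch (mine, reusing the element struct and DIM constant from the direct hash sort code above, and assuming the matrix counts were zero before sorting) of how the sorted list is read back out of the count matrix:

/* Walk the matrix in row-major (d, m) order and emit each stored value as
   many times as it was hashed there, yielding the sorted list; returns the
   number of elements written to out. */
int enumerate_sorted(element M[][DIM], int out[])
{
    int d, m, k, n = 0;
    for (d = 0; d < DIM; d++)
        for (m = 0; m < DIM; m++)
            for (k = 0; k < M[d][m].count; k++)
                out[n++] = M[d][m].value;
    return n;
}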

2.6. Other Similar Algorithms.


There are three other algorithms that have very strong similarities to the hash
sort algorithm. These algorithms are: address calculation sort, bin sort, and the radix
sort. While similar, these algorithms are distinct from the hash sort.

2.6.1. Address Calculation Sort:.


The address calculation sort is very similar to the hash sort. The address calculation sort is sometimes referred to as sorting by hashing. The address calculation sort uses a hashing method that is order-preserving, similar to the hash sort. However, the address calculation sort has the problem that if the distribution of the data is not uniform, then it degenerates into an O(N²) time complexity. This corresponds to the sparsity problem with the hash sort, but sparsity does not lead to such an extreme degeneration in the hash sort as it does with the address calculation sort. Another variant of the address calculation sort is the pigeonhole sort, in which the data list of elements is subdivided into bins, and then within each bin the sub data list is sorted.
2.6.2. Bin Sort:.
Bin sort is similar to hash sort in that data is stored in a ”bin” to which it is mapped. The bin sort has multiple distinct values mapping to the same bin. Unlike the hash sort, where only redundant data values map to the same location, the bin sort has distinct elements possibly mapping to the same bin. So the bin sort has multiple data elements within each bin. If there are N elements and M bins, then the bin sort is linear time O(N + M). However, if the number of bins is N², then the bin sort will degenerate into a worst case of O(N²).
2.6.3. Radix Sort:.
The radix sort is similar to the hash sort in that the digits or sub-elements of each data value are used in the sort. The algorithm uses the digits of the data element to map it to its unique location. The hash sort does this indirectly, not by each sub-element, but by mathematical mapping. Radix sort for M-sized data elements with N elements has a time complexity of O(M · N). If the sub-data elements become very dense, then M approaches log N, and the radix sort degenerates to an O(N · log N) algorithm. Hence, the radix sort depends on M being much less than N by a sizable ratio.
2.6.4. Summary of Similar Algorithms.
The similarity between the address calculation sort, bin sort, and radix sort is that a non-comparative method for sorting is used. However, all three algorithms degenerate in the worst case into, at the very least, a comparative sort algorithm, or a polynomial time algorithm. This worst case occurs at the extremes of data density, either too sparse or too dense, which then overloads the algorithm. The hash sort, while not incapable of degenerating, only worsens in the worst case to another linear time constant. So the hash sort is not as sensitive to extreme cases, and is more robust.
2.7. Features of Hash Sort. The hash sort has the following strengths:

• Linear time complexity, even in the worst case; for the in-situ proportional to
the data structure size, and for the direct proportional to the data list size.

• The hash sort puts data elements in their correct position and does not move them afterward – data quiescence

• Data independence – data elements in the data set are mapped by their unique
value, and do not depend on the predecessor or successor in the data set

• High speed lookup is possible once the data is sorted – faster than binary search; a key hashes directly to its location, or alternatively to the approximate location, within the data structure (see the sketch at the end of this section)

The hash sort has the following weaknesses:

• Sparsity of data values in range – wasteful of space

• Multi-dimensional data structure is required – square planar matrices are inconsistent with the underlying one-dimensional linear memory

• Works only with numeric values, requires conversion for non-numeric values

• The data range of values must be known for the algorithm to work effectively
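
To illustrate the high-speed lookup strength listed above, a hedged C sketch (the function name and parameters are mine): membership of a key is checked in constant time by hashing the key to its slot, assuming the matrix was built by the direct hash sort with zeroed counts, and that lower and theta match the super-hash used to build it.

/* Constant-time membership test: super-hash the key to its (d, m) slot
   and compare against the stored value. */
int lookup(element M[][DIM], int key, int lower, int theta)
{
    int d = (key - lower) / theta;
    int m = (key - lower) % theta;
    if (d < 0 || d >= DIM || m < 0 || m >= DIM)
        return 0;   /* key falls outside the mapped range */
    return (M[d][m].count > 0) && (M[d][m].value == key);
}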

3. Testing.
3.1. Testing Methodology.
The testing methodology of the sort algorithm involves two perspectives, an empirical or quantitative viewpoint, and a mathematical or qualitative view. The mathematical or qualitative point of view looks at the hash sort algorithm alone. The algorithm is tested for its characteristic behavior; this testing cannot be exhaustive, as there are an infinite number of sizes of test cases to test the algorithm with. The mathematical methodology, then, tests for data size, and data arrangement or structuring. These form tests for which the size of the test data list is increasing, and tests in which there are partially sorted sublists within the overall test data list. For simplicity, the testing emphasis was placed on the in-situ version of the hash sort.

The testing approach for determining the behavior of the hash sort algorithm focuses on the algorithmic time complexity. Testing is non-exhaustive, as all possible test cases cannot be generated or tested. The algorithmic behavior is tested on different test cases to avoid the possibility of an anomalous ”best” case or extreme ”worst” case scenario. The hash sort algorithm is tested on different sizes and different permutations of data lists to evaluate the algorithmic performance. Complete data lists, either unsorted or fully sorted, are used, along with partially sorted data lists. The hash sort is also compared to other sorting algorithms, to give relative comparisons and contrasts for better assessment.

The two other sorting algorithms used to compare and contrast with the hash sort are the bubble sort and the quick sort. Each algorithm represents an extreme in algorithmic performance. The bubble sort is an O(N²) algorithm, but has excellent, linear O(N) performance on partially sorted data lists. The quick sort is known as the best and fastest sorting algorithm, with O(N log N) performance. However, the quick sort does falter on already sorted data lists, degenerating into an O(N²) time complexity algorithm. Thus, the bubble sort is best when the quick sort is at its worst, and vice-versa. Again, the extremes of algorithmic performance in pre-existing algorithms have been selected for comparison and contrast with the hash sort algorithm.
3.2. Test Cases.
Testing on data size looks at increasing data sizes and the rate of growth in the
time complexity of the algorithm. Testing on data size is on ”complete” lists of data
elements, lists which are fully sorted or are unsorted. These types of ”complete” data
test cases form the extremes of the other variations in the data test case. The sorted
lists are either fully sorted in ascending or descending fashion, and the unsorted list
is the median between the two extremes. Size testing looks at the effects of increasing
sizes on these three types of ”complete” sorted data test cases.
3.3. Test Program.
The test program has to handle two important issues, dealing with the program code and the test platform. On the test platform, or the computer system the code is run on, the test program must not be biased by any hardware or platform architecture enhancements. So a general-purpose computer, without any specific enhancements or features, is used. For further platform independence, different computers should be used, of different sizes and performance, to be sure of the independence of the results obtained.

The code must be written in a ”neutral” computer language to avoid any esoteric features of a language, and be readable. Similar to platform independence, the tests must be independent of the language implementing the test programs. Any programming language features which give a biased advantage in the generated code being optimized or better suited are to be avoided.

The code must be readable, so a programming language which is expressive, along with good programming style, must be used. Unreadable code will hamper duplication of results by others and cast doubt on the efficacy and credibility of the tests. The code must be portable, so that independent platform testing can be conducted, and require minimum effort to re-write and re-work the code for continued testing.

The test case generator and the test sort programs are integrated together in one program. The test case generator generates the particular test case in internal storage within the program, and the hash sort, bubble sort, and quick sort all process the data. The time to process the data is noted and written to an output data file. The time in microseconds, and the size of the data set, is recorded by the program.
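
A hedged sketch (not the author's actual test program) of the kind of timing harness described above: each sort is wrapped behind a common (int *, int) signature, timed with gettimeofday, and the data size and elapsed microseconds are appended to the output data file.

#include <stdio.h>
#include <sys/time.h>

/* Time one sort run in microseconds on a freshly generated test case. */
static long time_sort(void (*sortfn)(int *, int), int *data, int n)
{
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    sortfn(data, n);
    gettimeofday(&stop, NULL);
    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}

/* Append one result line -- data size and elapsed microseconds -- to the
   output data file, as described in the text above. */
static void record(FILE *out, int n, long usec)
{
    fprintf(out, "%d %ld\n", n, usec);
}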

3.3.1. Test Platform Specifications. The table summarizes the different com-
puter systems which were used to test the hash sort.

Processor:           Sparc             Sparc            Cyrix i686-P200
Hardware:            sun4u             sun4m            vt 5099A
Operating System:    SunOS 5.5.1       SunOS 5.5.1      Linux 2.0.33
Processors:          12                4                1
Total Memory:        3072 Megabytes    256 Megabytes    64 Megabytes
GNU gcc Version:     2.7.2.3           2.7.2.3          2.7.33

The original Pascal implementations were converted to C using the p2c converter. The generated C code was hand optimized to eliminate inefficiencies and to follow strict ANSI C compliance. The code was modified for the UNIX system environment, in particular the system timing calls. Some of the routines, such as write_matrix and write_list, were originally used to check that the output was a sorted list or array – verifying the algorithms were working. Once this had been checked, the calls to these procedures were commented out, as the actual test program dumped its output to a data file. These routines are still present in the source code used for the final timing runs.

4. Analysis of Results.

4.1. Test Expectations.


The expectations of the tests are for the quick sort and bubble sort to perform as outlined. Both algorithms have been extensively studied over time, so any change in these algorithms' performance and behavior would be a shocking surprise. The hash sort is expected to remain linear, even as the bubble sort and quick sort fall apart in their worst case scenarios. The hash sort will falter, but will not degenerate as badly as the bubble sort and quick sort do in the extreme cases. Nor is the hash sort expected to have any special cases of superior performance.

The hash sort will have linear time complexity performance, with a Big-Oh of O(N) in relation to the test case size. This consistency and stability of the algorithm through different cases, where the comparison algorithms degenerate or accelerate in performance, is expected to be verified in the tests. The worst case of the algorithm's performance will be a constant multiple of linear time, c · N, but it will remain linear time complexity.

4.2. Test Case Results.

The table of the run-time performance of each algorithm in the appendix shows an increasing time with increasing size of the data set. The bubble sort ”explodes” as expected, and is soon surpassed by the hash sort and quick sort. Both the hash sort time and the quick sort time increase, but what is of interest is the ratio of the two. The hash sort does not immediately surpass the quick sort; in fact it is not until the data set size N ≥ 146 that this occurs. A steady trend of the hash sort gaining occurs up to data set size N < 120. When the data set is in the range 120 ≤ N < 146, the hash sort to quick sort performance ratio is occasionally less than one. The performance fluctuates until the data set size N > 145, when the ratio of the hash sort to the quick sort stays below 1.0, meaning the hash sort is performing faster than the quick sort. A decreasing ratio was expected, but the hash sort did not make gains on the quick sort until the data set was much larger.

5. Conclusions.

5.1. Testing Conclusions.


The hash sort performed as expected, as did the bubble sort and the quick sort algorithms. What was surprising initially is that the hash sort does not have an immediate performance lead over the quick sort. The bubble sort was soon surpassed by the hash sort as expected, but the hash sort did not exceed the performance of the quick sort as soon as it theoretically should have.

The reason for this seemingly strange performance of the algorithm is the theoretical assumption that the hash sort and the quick sort operate using the same types of operations with the same underlying machine code clock cycles. This of course is not correct: the quick sort is using a compare-based machine instruction, whereas the hash sort is using an integer divide machine instruction.

A comparison of the clock cycle times for the compare and integer divide instructions on an Intel 486 processor gives some indication of this (Brey pp. 723 - 729).

Opcode    Addressing Mode      Clock Cycles
CMP       register-register    1
CMP       memory-register      3
DIV       register             40
DIV       memory               40
IDIV      register             43
IDIV      memory               44

The ratio of a compare instruction to an integer divide is 1:42 for register-to-register based machine instructions, and 1:21 for memory-based machine instructions. The summary of this is that the quick sort and the hash sort are utilizing different machine instructions to implement the algorithm; theoretically they are equivalent at an abstract level, but at the implementation level this is not a valid assumption to make. A simple analysis of the algorithms in this context shows this.
F (Quick sort) > F (Hash sort)

c2 · N · log2 N > c1 · 2 · N          dividing through by N

c2 · log2 N > c1 · 2                  dividing through by c2

log2 N > (c1 · 2) / c2                substituting N = 146²

log2 (146²) > (c1 · 2) / c2           moving the square down

2 · log2 (146) > (c1 · 2) / c2        dividing through by 2

log2 (146) > c1 / c2                  evaluating the log

14.38 > c1 / c2

This rough analysis is an approximation of the arithmetic instruction time to comparison instruction time. It does not directly correspond to the ratios of clock cycles for the Intel instructions. There are other factors, such as compiler optimization and memory manipulation overhead, to be considered. But this rough analysis does give some insight into the ”critical mass” that the hash sort must reach before it can exceed a comparison based algorithm – in this case the quick sort.
5.2. Evaluation Criteria. The three algorithms used are to be evaluated on
the basis of the following criteria:

• coding complexity – how difficult or easy is it to code the algorithm?

• time complexity – how fast is the algorithm in performance?

• space complexity – how efficient is the algorithm in using memory?

To rank each algorithm on each of the criteria, a simple grading system is used.
Rankings are: excellent, good, fair, poor, bad. A comment explaining why each
ranking was judged to rate such a score is given.
5.3. Evaluation of the Algorithm. The hash sort has the following rankings:

• coding complexity: excellent; the hash sort is a short algorithm to encode,
using only a simple iterative loop and array manipulation.

• time complexity: good; the hash sort is linear in theory, but in practice it must
reach a certain critical mass, so its observed performance is worse than the
theory behind the algorithm would indicate.

• space complexity: fair; the hash sort has some overhead in the algorithm,
though most of it consists of counters which are incremented as the algorithm
progresses. However, the algorithm maps an n-dimensional array onto one-
dimensional memory, which is somewhat awkward and introduces memory
manipulation overhead.

5.4. Comparison to Test Algorithms.


Criteria Hash Sort Quick Sort Bubble Sort
Coding Complexity Excellent Fair Excellent
Time Complexity Excellent Good Poor
Space Complexity Fair Good Good

5.5. Summary of Sort Algorithms.
The hash sort compared to the bubble sort is similar in coding complexity: very
short code which is simple and easy to follow. However, it is much faster than the
bubble sort, although it is not as efficient in using memory. The bubble sort is much
more efficient in using memory, but it does much data movement as it sorts the
data.

The quick sort is much more complex to code, even in its recursive version; a
non-recursive version of the quick sort becomes quite complex and ugly in coding.
The quick sort does outperform the hash sort initially, but it is still a linearithmic
algorithm, and it does have a degenerate worst case, although it is rare in practice. The
quick sort is good at managing memory, but it is a recursive algorithm, which has
somewhat more overhead than an iterative algorithm. Because of its partitioning and
sub-dividing approach, the amount of data movement is less than with the bubble sort.

5.6. Further Work.


Although the initial development and investigation into the properties, performance,
and promise of the hash sort seem thorough, there are still other avenues of research.
Further research on the hash sort can be done in terms of investigating a recursive
version of the algorithm, seeing if it can be parallelized, remedying the problem of
data sparsity, and finding other mapping functions.

5.6.1. Recursive implementation:.


The hash sort has been implemented as an iterative algorithm, but some sorts,
such as the quick sort, are inherently recursive. A recursive version of the hash sort
may be possible, but it may not translate well into a recursive definition or imple-
mentation. The possibility of implementing the hash sort as smaller mappings of the
same data set is nonetheless interesting.

5.6.2. Parallelization of algorithm:.


The hash sort algorithm has been implemented as a serial algorithm on a unipro-
cessor system, although two of the systems it was tested on had multiple processors.
It is possible that the hash sort could be parallelized to increase its performance be-
yond linear time. The tradeoffs of such a parallel algorithm, and the issues involved
are another possibility for research.
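
One possible starting point, shown here only as a hypothetical sketch in C (the
function name, the mapping value n, and the use of OpenMP are assumptions, not
part of the original implementation), relies on the data independence of the hash sort:
each element's destination depends only on its own value, so the placement pass of
the direct form of the algorithm can be divided among threads with no ordering
dependence between iterations.

    #include <stddef.h>

    /* Hypothetical parallel placement pass for the direct hash sort.
     * Each source element x is mapped to its own (row, column) address by
     * the super-hash pair (x div n, x mod n); because the values are unique,
     * no two iterations write to the same destination, and the loop can be
     * split across threads.
     */
    void parallel_place(const unsigned long *src, size_t count,
                        unsigned long *matrix, size_t n)
    {
        #pragma omp parallel for    /* requires OpenMP; otherwise runs serially */
        for (size_t i = 0; i < count; i++) {
            unsigned long x = src[i];
            size_t p = (size_t)(x / n);   /* mash function: row index    */
            size_t q = (size_t)(x % n);   /* hash function: column index */
            matrix[p * n + q] = x;
        }
    }

Whether the memory traffic of such scattered writes erodes the gain is exactly the
kind of tradeoff mentioned above.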

5.6.3. Sparsity of data:.


Sparse data within the range of the hash sort is a problem for the algorithm.
There may be ways of handling sparse data so that space can be utilized more
efficiently than the algorithm implemented here does. If the problem with data
sparsity can be resolved, then the hash sort could become an order of magnitude more
usable in applications.

5.6.4. Other mapping functions:.


The heart of the hash sort is the super-hash functions, which are perfect hash
functions that preserve ordering. This mapping function uses the classical hash
function along with a mash function to hash magnitude. Other mapping functions
which have the same properties would be very interesting functions in themselves,
and could be the basis for another variant of the hash sort. A less computationally
complex mapping function that did not use the time-consuming div and mod
operations would be an obvious improvement to the hash sort, as in the sketch below.
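
One candidate direction, shown only as a hypothetical sketch in C (the names and
the power-of-two restriction are assumptions, not part of the original algorithm), is
to restrict the matrix dimension n to a power of two, so that the div and mod of the
super-hash function reduce to a shift and a bit mask:

    #include <stdint.h>

    /* Hypothetical power-of-two variant of the super-hash mapping.
     * shift = log2(n) and mask = n - 1 are precomputed once for the matrix;
     * the row is x div n computed as a right shift, and the column is
     * x mod n computed as a bit mask.
     */
    typedef struct { uint32_t row; uint32_t col; } address_t;

    static inline address_t super_hash_pow2(uint32_t x, unsigned shift,
                                             uint32_t mask)
    {
        address_t a;
        a.row = x >> shift;   /* mash function: x div n */
        a.col = x & mask;     /* hash function: x mod n */
        return a;
    }

Whether restricting n in this way preserves the generality of the div/mod form is
itself a question for the further work outlined here.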
5.6.5. Machine code optimization:.
The difference in the machine instructions used to implement the underlying al-
gorithms means the hash sort must reach a sufficient "critical mass" in data set size
before it meets its theoretical performance expectations. Optimizations of the machine-
generated code to reduce the size of the critical mass required are an interesting
possibility. Such optimizations could be compiler-based, or based strictly upon the
properties of the hash sort algorithm.

5.6.6. Alternative data structure:.


The current version of the hash sort uses multiple N × N square matrices for the
data structure being mapped to by the super-hash function. One possibility to be
explored is using non-square matrices, and possibly other data structures that are
not square or rectangular. A change in the data structure mapped to implies a change
in the mapping function, as the mapping function is determined by the data structure
being mapped to by the super-hash function.

6. Application.

6.1. Criteria for using Hash Sort.


The hash sort, although a general-purpose algorithm, is not an algorithm that
meets all needs for applications. Because of its properties and features, there are
several criteria which guide using it. These criteria are:

• data set is within a known range

• data set is numeric or inherently manipulatable as numbers

• the data needs to be ordered quickly, then accessed frequently

• the data set is large, with heavy density within the data range

6.2. Applications for Hash Sort.


The applications suited for the hash sort can be surmised from the criteria
mentioned previously, but some applications include:

• database systems – for high speed lookup and retrieval

• data mining – organizing and searching enormous quantities of data

• operating system – file organization and access

6.3. Example Application with Hash Sort.


One interesting application of the hash sort is in data communications. This
example involves data communication on an X.25 packet switched network. In this
method of data communication, data is disassembled into packets, which are then
sent through the network. At the reception point, the packets are reassembled into the
original data. One problem with the X.25 packet switched network is that the packets
can be out of order when received from the sender, or "...may arrive at the destination
node out of sequence." (Stallings 1985 p. 249) That is, data can arrive in the sequence
sent, but it may not. This is because "...datagrams are considered to be self-contained,
the network makes no attempt to check or preserve entry sequence." (Rosner 1982 p.
117)
There is a need to sort the packets once they are received. This is not so straight-
forward, because there is the offhand chance that the packets may already be in order,
so sorting may be unnecessary. Using a quick sort on a sorted sequence would be the
worst-case degenerate example: the quick sort would become an O(N²) algorithm –
as bad as bubble sort.

Even worse, all the packets must be received before they can be sorted, if sorting
is required. To use a quick sort for this application, a separate algorithm to check
if the packets are already sorted is required; then, if they are not, the quick sort
algorithm is used.

Overall this adds additional time overhead in terms of:

Total time = Time(receive) + Time(check sorted) + Time(quick sort)

Total time = Time(r) + Time(N) + Time(N · log N)

N is the number of packets received, and to check that the packets are in order
requires a linear search O(N) to examine all packets, although the check can
terminate early when they are out of order.

The hash sort is a better algorithm to use, although the packets would have to be
stored in a matrix structure rather than a linear list. A check for the packets being
received in order is unnecessary, because the hash sort will not degenerate on sorted
data – it would be an extreme form of hysteresis, but the hash sort remains linear in
such cases.

More interesting is that the hash sort does not have to wait for the last packet
to be received – it can begin to sort the packets as they are received. The hash sort
maps each packet to where it uniquely belongs. Since no two packets share the
same location, they are unique. So the hash sort can determine, from the packet
itself, where it belongs in the matrix. As packets are received, they are placed
into their appropriate locations. By the time of the last packet's arrival, all the data is
in place.
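
A minimal sketch of this idea in C (the matrix dimension, payload size, and the
use of a packet sequence number as the sort key are illustrative assumptions, not
details from the original paper):

    #include <string.h>

    #define N_DIM    64                    /* hypothetical matrix dimension      */
    #define PAYLOAD  128                   /* hypothetical packet payload bytes  */

    typedef struct {
        unsigned      seq;                 /* sequence number in [0, N_DIM*N_DIM) */
        unsigned char data[PAYLOAD];
    } packet_t;

    static unsigned char assembled[N_DIM][N_DIM][PAYLOAD];

    /* Place a packet the moment it arrives: the super-hash pair
     * (seq div n, seq mod n) gives each packet a unique slot, so no separate
     * sort pass is needed once the last packet has been received.
     */
    void on_packet_received(const packet_t *pkt)
    {
        unsigned row = pkt->seq / N_DIM;   /* mash function */
        unsigned col = pkt->seq % N_DIM;   /* hash function */
        memcpy(assembled[row][col], pkt->data, PAYLOAD);
    }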

The hash sort may be slower than the data reception time, in which case the
algorithm will continue to sort the packets as they are stored. The total time in this
case would be:

Total time = Time(receive) + Time(hash sort)

Total time = Time(r) + Time(N)

This time is the same as in the quick sort implementation when the data is received
already in order and only the algorithm that checks that the data is in order is used.
This example shows the power of data independence in the hash sort algorithm – a
data element does not depend on its predecessor or successor for its location; it is
determined uniquely by the data element itself.
This property of data independence, along with the speed of the hash sort and its
robustness in the degenerate case, makes the hash sort well suited to such applications
where it is applicable. One drawback, however, is that the number of packets sent must
be sufficiently large to achieve the "critical mass" needed by the hash sort.
7. Mathematical Proofs.
7.1. Proof of Injective Mapping Function.

Proof for a general super-hash function with an n-tuple of 2

Theorem:

Given a set S of unique integers in Z such that S = {v1 ≠ v2 . . . vn−1 ≠ vn}, and
given a function F(x) such that:

F(x) ≡ (p, q), where p = x div n (the mash function) and q = x mod n (the hash function)

where n > 0, then the mapping from the pre-image set S to the resulting image set M
of ordered pairs, M = {(p1, q1), (p2, q2), . . . , (pn−1, qn−1), (pn, qn)}, is an injective
mapping F : S → M.
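
As a small illustration (this example is not part of the original proof), take n = 10
and S = {12, 25, 37}; then F maps 12 → (1, 2), 25 → (2, 5), and 37 → (3, 7), and the
resulting ordered pairs are pairwise distinct, as the theorem asserts.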

Proof:

Proof by Contradiction: Proceed with the assumption that, given S as described
before, ∃(x, y) ∈ S such that F(x) = F(y) when x ≠ y; that is, that the function
F : S → M is a non-injective mapping.

The definition of injective (one-to-one) is "A function f is one-to-one if and only
if f(x) = f(y) implies that x = y for all x and y in the domain of the function."

[∀(x, y) ∈ F : X → Y, where (f(x) = f(y)) ⊃ (x = y) is injective]

By taking the contrapositive of this definition, it can be restated as "A function
f is one-to-one if and only if f(x) ≠ f(y) whenever x ≠ y in the domain of the
function." (Rosen 1991, p.57) Formally:

[∀(x, y) ∈ F : X → Y, where ¬(x = y) ⊃ ¬(f(x) = f(y)) is injective]

rewritten as

[∀(x, y) ∈ F : X → Y, where (x ≠ y) ⊃ (f(x) ≠ f(y)) is injective]

Using the contrapositive of the definition of an injective function, it is readily
clear that the mapping F : S → M is not injective if there are at least two integers i1
and i2 such that, by the mapping function F, (p1, q1) = (p2, q2) in M. This is assumed
to be true, i.e. that F is a non-injective mapping function.

By the definition of S, all the integers are unique in Z, so the integers have the
property that for any two integers x, x′ ∈ S, x ≠ x′.
Take the case for x:
px = x div n                             qx = x mod n
px = dx where dx ≥ 0                     qx = mx where 0 ≤ mx < n

Take x′ = x + c where 0 < c < n:

px′ = x′ div n                           qx′ = x′ mod n
px′ = (x + c) div n                      qx′ = (x + c) mod n
px′ = (x div n) + (c div n)              qx′ = (x mod n) + (c mod n)
px′ = dx + k where dx, k ≥ 0             qx′ = mx + c where 0 ≤ mx < n

so that for x ≠ x′, F(x) ≠ F(x′). But this is for x′ = x + c where 0 < c < n.
Take the case when c ≥ n. Using the same definition for x, take x′ = x + c where
c = n.

px′ = x′ div n                           qx′ = x′ mod n
px′ = (x + n) div n                      qx′ = (x + n) mod n
px′ = (x div n) + (n div n)              qx′ = (x mod n) + (n mod n)
px′ = dx + 1 where dx ≥ 0                qx′ = mx + 0 where 0 ≤ mx < n
px′ = dx + 1 where dx ≥ 0                qx′ = mx where 0 ≤ mx < n

so that for x ≠ x′, F(x) ≠ F(x′).

For each case of the form x′ = x + c, where c < n and c ≥ n, if x ≠ x′ then
F(x) ≠ F(x′), which contradicts the original assumption that ∃(x, y) ∈ S such that
F(x) = F(y) when x ≠ y.

So the occurrence of dx = dx′ and mx = mx′ for x ≠ x′ is never true and
will never occur, which is the only possibility for the mapping F : S → M to be
non-injective.

Thus it follows that no two ordered pairs in M formed from any two integers in S
by F would ever be equal. Since this is the only counter-example for the definition of
F(x) to be an injective mapping, the only conclusion is that F : S → M must
be an injective mapping under F(x).

Q.E.D.
7.2. Proof of Injective Multiple Mappings.

Proof using general super-hash function

Theorem:

For an injective mapping F : S → M , using a set of mapping value n ∈ N , the set


of unique mappings values used in the super-hash function, that a function composed
of multiple sub-mappings of F with n ∈ N is a complete mapping function overall.
Therefore such a constructed function would be an injective mapping.

Proof:

Proof by Demonstration: Define a function G which non-redundantly selects
nx ∈ N, where x ≤ ‖N‖, the cardinality of the set of elements in N. Therefore, each
selected element is guaranteed by the definition to be distinct from the predecessor
and successor elements, so that no element, once selected as a mapping value, will be
selected again. So the function G selects an element once and only once from N.

The selection function G is an injective function by its definition. For each nk ∈ N
there is an integer 1 ≤ x ≤ ‖N‖ which is uniquely associated with the particular
element selected, n. The definition of an injective function is "A function f is one-to-
one if and only if f(x) = f(y) implies that x = y for all x and y in the domain of the
function."

For the particular selection or iteration k of G, there is a unique element
nk, selected only once by G from N. Thus, iterations k − 1, k, and k + 1 will not select
the same element, so G(k) = nk, and k is uniquely associated with nk for the set N
under the selection function G. Hence the definition and construction of G make it an
injective function.

The mapping function F : S → M has been proven to be an injective mapping


function previously for n, where n is the mapping value used in the mash and hash
sub-functions. Composing a new function using the mapping function F with the
selection function G, the resulting composed function H is an injective mapping.

F : S uses a mapping value n, selected from N. So F : S could be rewritten
as:

F(n) : S

going one step further,

F(G) : S

and moreover, since G uses N as its domain,

F(G : N) : S

The notation is rapidly becoming cumbersome, so in general, using a mapping
function F along with a selection function G, we have a function H = F ◦ G, meaning
the function H is a composition of the functions F and G. With this in mind, the
proof that H is an injective function is very straightforward, and relies on the property
that F and G are injective functions. Therefore, the composite function H is also an
injective function.

The proof that the composite of two injective functions is also injective is:

If f : X → Y and g : Y → Z are injections (injective functions), then so is the
composite g ◦ f : X → Z. Suppose gf(x) = gf(x′). Since g(f(x)) = g(f(x′)), and g
is an injection, we must have f(x) = f(x′), and since f is an injection, x = x′. Hence
g ◦ f is an injection. (Biggs 1985, p. 31)

The function F uses a particular mapping value n, where G selects n ∈ N.
So it follows that G(k) : N → n for a particular iteration or selection k. F(n) : S →
M has already been proved to be an injective mapping in and of itself. For multiple
iterations k, F(G(k) : N) : S → M is an injective mapping.

The power of F to map from one mapping instance to multiple k iterations is
expanded using a mapping value set N and a function G to uniquely select particular
elements as mapping values. Hence the injective mapping property of F composed
with G has been generalized into multiple mappings.

Q.E.D.

REFERENCES

[1] Aho, Hopcroft, and Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, Mass.,
1983.
[2] Amble, O. and Knuth, D.E., "Ordered Hash Tables", Computer Journal, vol. 17, no. 2, pp.
135-142, 1974.
[3] Amstel J.J. Van and Poirters J.A., The Design of Data Structures and Algorithms, Prentice-Hall,
Englewood Cliffs, N.J., 1989.
[4] Baase, Sarah. Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley,
Reading, Mass., 1978.
[5] Barron, D.W. and Bishop, J.M., Advanced Programming : A Practical Course., John Wiley &
Sons, New York, 1984.
[6] Brey, Barry B., The Intel Microprocessors, Macmillan Publishing, New York, 1994.
[7] Ellzey, Roy S., Data Structures for Computer Information Systems, Science Research Associates,
Chicago, IL, 1982.
[8] Harrison, Malcolm C., Data Structures and Programming, Scott, Foresman, and Co., 1973.
[9] Horowitz, E. and Sahni, S., Fundamentals of Computer Algorithms., Potomac, Md., Computer
Science Press, 1978.
[10] Horowitz, E. and Sahni, S., Fundamentals of Data Structures, Computer Science Press, Woodland
Hills, Calif., 1976.
[11] Knuth, Donald. E. The Art of Computer Programming. Volume 3: Searching and Sorting.
Addison-Wesley, Reading, Mass., 1973.
[12] Kronsjo, Lydia I., Algorithms: Their Complexity and Efficiency, John Wiley & Sons, New
York, NY., 1979.
[13] Kruse, Robert L., Data Structures and Program Design. Prentice-Hall, Englewood Cliffs, N.J.,
1984.
[14] Lewis, T.G. and Smith, M.Z., Applying Data Structures., Houghton-Mifflin, Boston, Mass.,
1976.
[15] Lorin, Harold, Sorting and Sort Systems, Addison-Wesley, Reading, Mass., 1975.
[16] Morris, Robert, ”Scatter Storage Techniques”, Communications of the ACM, vol. 11, no. 1, pp.
38-44, 1968.
[17] Peterson, W.W. ”Addressing for Random-Access Storage”, IBM Journal of Research and De-
velopment, vol. 1, no. 2, pp. 130-146, 1957.
[18] Rich, Robert P. Internal Sorting Methods Illustrated with PL/1 Programs, Prentice-Hall, En-
glewood Cliffs, N.J., 1972.
[19] Tremblay, Jean-Paul and Sorenson, Paul G., An Introduction to Data Structures with Applica-
tions. 2nd ed., McGraw-Hill, New York, NY., 1984.
[20] Wirth, Niklaus. Data Structures + Algorithms = Programs. Prentice-Hall, Englewood Cliffs,
NJ., 1976.
