Hash Sorting
Abstract.
Sorting and hashing are two completely different concepts in computer science, and appear
mutually exclusive to one another. Hashing is a search method using the data as a key to map to the
location within memory, and is used for rapid storage and retrieval. Sorting is a process of organizing
data from a random permutation into an ordered arrangement, and is a common activity performed
frequently in a variety of applications.
Almost all conventional sorting algorithms work by comparison, and in doing so have a linearithmic
greatest lower bound on their algorithmic time complexity. Any improvement in the theoretical
time complexity of a sorting algorithm translates into much larger gains in speed for the applications
that use it. To exceed the linearithmic time complexity boundary, a sort algorithm must order the
data by some means other than comparison.
The hash sort is a general purpose non-comparison based sorting algorithm by hashing, which has
some interesting features not found in conventional sorting algorithms. The hash sort asymptotically
outperforms the fastest traditional sorting algorithm, the quick sort. The hash sort algorithm has a
linear time complexity factor – even in the worst case. The hash sort opens an area for further work
and investigation into alternative means of sorting.
1. Theory.
1.1. Sorting.
Sorting is a common processing activity for computers to perform. Sorted data is
arranged so that the data items are either of increasing value (ascending) or of
decreasing value (descending). Regardless of the arrangement form, sorting
establishes an ordering of the data. The arrangement of data from some random
configuration into an ordered one is often necessary in many algorithms,
applications, and programs. The need for sorting to arrange and order data makes
the space and temporal complexity of the algorithm used paramount.
The plethora of sorting algorithms available does not change the fundamental
question of how fast it is possible to sort. This is a question of temporal efficiency,
and is the most significant criterion for a sort algorithm. Along with it, what
affects the temporal efficiency, and why, is just as important a concern. Still
another important concern, though more subtle, is the space requirement of the
sort algorithm.
For most sort algorithms, the temporal efficiency and the causes for changes in it
are well understood. The limit or barrier to sorting speed is a greatest
lower bound of O(N log N ). In essence, the reasoning is that sorting
is a comparative, decision-making process over N items, which forms a binary tree
of depth log N . For each data item, a decision must be made about where to move it to
maintain the desired ordering property. With N data items, and log N decision time per item,
the minimum time possible is the product, which is N log N . This lower bound is
rarely reached in practice; an implementation is usually a multiplied or added constant away
from the theoretical minimum.
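As a concrete illustration, sorting N = 1,024 items by comparison requires on the order of 1,024 · log2 1,024 = 1,024 · 10 = 10,240 decisions, while doubling the input to N = 2,048 raises this to roughly 2,048 · 11 = 22,528.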
There is no theoretical greatest lower bound for space efficiency, and this
complexity measure often characterizes a sort algorithm. The space requirements are highly
dependent on the underlying method used in the sort algorithm, so the space efficiency
directly reflects that method. While no theoretical limit is available, an optimal sort algorithm
requires N + 1 storage: N locations for the data items, and one additional location used by
the sort algorithm. The bubble sort has this optimal storage efficiency. However, space
efficiency is subordinate to temporal efficiency in a sorting algorithm. The bubble sort, while
optimal in space, is well known to be very poor in performance, far above the
theoretical lower bound on time.
1.2. Hashing.
Hashing is a process of searching through data, using each data item itself as a
key in the search. Hashing does not order the data items with respect to each other.
Hashing organizes the data according to a mapping function which is used to hash
each data item. Hashing is an efficient method to organize and search data quickly.
The temporal efficiency or speed of the hashing process is determined by the hash
function and its complexity.
Hash functions are mathematically based; a common hash function uses the
remainder (mod) operation to map data items. Other hash functions are based on
other mathematical formulas and equations. A hash function is usually constructed
from multiplicative, divisional, or additive operations, or some mix of them. The
choice of hash function follows from the data items and involves compromises between
temporal efficiency and space organization.
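As an illustration (not code from the original text), a remainder-based hash function of this kind might look like the following in C; the function name and the table size n are chosen for the example.

#include <stdio.h>

/* A simple remainder (mod) hash: maps a key into a table of n slots.
   The name and parameters here are illustrative, not from the paper. */
static int hash(int key, int n)
{
    return key % n;
}

int main(void)
{
    int n = 10;
    int keys[] = { 13, 27, 42, 101 };
    for (int i = 0; i < 4; i++)
        printf("key %d hashes to slot %d\n", keys[i], hash(keys[i], n));
    return 0;
}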
Hashing is not a direct mathematical mapping of the data items into a spatial
organization. Hash functions usually exhibit the phenomenon of a hash collision or
clash, where two data items map to the same spatial location. This is detrimental
to the performance of the hash function, and many techniques for handling hash collisions
exist. Hash collisions introduce temporal inefficiency into a hash algorithm, as the handling
of a collision represents additional time to re-map the data item.
A special type of hash function, a perfect hash function, exists and is perfect in
the sense that it will not have collisions. Such perfect hash functions are often
available only for narrow applications, for example the reserved words in a compiler
symbol table. Perfect hash functions usually involve restricted data items and do not fit
into a general purpose hash method. While perfect hashing is possible, often a regular
hash function is used with data for which the number of collisions is small or the collision
resolution method is simple.
2. Algorithm.
2.1. Super-Hash Function.
The heart of the hash sort algorithm is the concept of combining hashing along
with sorting. This key concept is embodied in the notion of a super-hash function.
A super-hash function is "super" in that it is a hash function that has a consistent
ordering, and is a perfect hash function. A hash function of this type will order
the data by hashing, but maintain a consistent ordering and not scatter the data.
This hash function is perfect so that as the data is ordered it is uniquely placed when
hashed, so that no data items ambiguously share a location in the ordered data. Along
with that, the necessity for hash collision resolution can be avoided altogether.
A hash function provides some distinction among values, using the remainder
or residue of the value. However, regular hashing experiences collisions, where different
values hash to the same result. The problem is that values with the same remainder
are indistinguishable to the regular hash function. Given a particular remainder r, all
values which are multiples of the base plus r are equivalent, so a set of the form
{c1 · x + r, c2 · x + r, . . . , cn−1 · x + r, cn · x + r} maps to a single hash result.
So for n = 10, r = 1, the following set of values
is equivalent under the regular hash function: {1, 11, 21, 31, . . . , 1001, 10001, c · x + 1}.
It is the equivalence of the values under regular hashing which causes collisions, as
values map to the same hash result. There is no distinction among larger and smaller
values within the same hash result. Some partitioning or separation of the values is
obtained, but what is needed is another hash function to distinguish the hash values
further by their magnitude relative to one another in the same set.
This is what the mash function is for: a magnitude hash. A mash function is
of the same form as a regular hash function, only it uses div rather than mod.
Applied to a value of the form c · x + r, the div operator gives the value of c,
where x is the base (usually decimal, base 10). The mash function maps values that share
the same magnitude to the same result. So the mash function shares the same problem as the
regular hash, in that a whole set of values is mapped to one equivalent result: a set of the
form {c1 · x + r1, c1 · x + r2, . . . , c1 · x + rn−1, c1 · x + rn} has
all values mashed to the same result. With n = 10, c = 3, the following set of values
is equivalent under the mash function: {30, 31, 32, 33, 34, 35, 36, 37, 38, 39}. With the
mash function, some partitioning of the values is obtained, but there is no distinction
among those values which is unique to each value.
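To make the distinction concrete, the following sketch (an illustration, not code from the paper) applies the hash (mod) and mash (div) functions with n = 10 to the two example sets: values sharing a residue collide under the hash, and values sharing a magnitude collide under the mash.

#include <stdio.h>

/* hash: residue of the value; mash: magnitude of the value.
   Both use the same parameter n, as the text requires. */
static int hash(int x, int n) { return x % n; }
static int mash(int x, int n) { return x / n; }

int main(void)
{
    int n = 10;
    int same_residue[]   = { 1, 11, 21, 31, 1001 };   /* all hash to 1 */
    int same_magnitude[] = { 30, 31, 35, 39 };        /* all mash to 3 */

    for (int i = 0; i < 5; i++)
        printf("%4d: hash = %d, mash = %d\n",
               same_residue[i], hash(same_residue[i], n), mash(same_residue[i], n));
    for (int i = 0; i < 4; i++)
        printf("%4d: hash = %d, mash = %d\n",
               same_magnitude[i], hash(same_magnitude[i], n), mash(same_magnitude[i], n));
    return 0;
}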
Together, however, a hash function and a mash function can distinguish values
both by magnitude and by residue. The minimal form of this association between the
two functions is an ordinal pair of the form (c, r), where c is the multiple of the base
obtained with the mash function, and r is the remainder of the value obtained with the
hash function. Essentially, an ordinal pair forms a unique representation of the value,
using the magnitude constant and the residue. Further, the pairs distinguish larger
and smaller values using the mash result, and equal magnitudes are distinguished
from each other using the residue. So the mapping is a perfect hash, as all values
are uniquely mapped to a result, and it is ordering, since the magnitude of the values
is preserved in the mapping. A formal proof of this property, an injective mapping
from one set to a resulting set, is given in section 7.
The values for n in the hash function and mash function are determined by the
range of values involved. The proof gives the validation of the injective nature of the
mapping, and the mathematical properties of the parameters used, but no general
guidelines for determining the parameters. The mathematical proof only indicates
that the same mapping value must be used in the hash and mash functions.
To determine Θ, it must be calculated from the range of the values that are to
be mapped using the super-hash function. The range determines the dimensionality
or size of the matrix, which is square for this super-hash function.
Given a range R[i, j], where i is the lower bound and j is the upper bound, the length
of the range is L = j − i + 1. The value Θ is then the nearest square to L; that is, Θ is the
smallest integer whose square covers L, so Θ = ⌈√L⌉. In essence the values are being
constructed in the form of a number which is tailored to the length of the range of values.

For example, for the range R[13, 123], L = 123 − 13 + 1 = 111, so that:

Θ = ⌈√111⌉ = ⌈10.53565375 . . .⌉ = 11

Part of the mapping involves subtracting the lower bound of the range, so that
all the values are initially mapped to the range 0..(j − i). So the super-hash function
for this range of data values is:

F (x) ≡ ( d = (x − 13) div 11 , m = (x − 13) mod 11 )

For the lowest value, 13 maps to (0,0); the largest value, 123, maps to (10,0). The inverse
mapping recovers a value from its ordinal pair:

value( dx , mx ) = dx · Θ + mx + i

so that for (0,0):

value( 0, 0) = (0 · 11 + 0) + 13 = 13
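A small C sketch of this particular super-hash function (illustrative; the function and variable names are not from the paper) for the range R[13, 123] with Θ = 11, together with the inverse value() mapping:

#include <stdio.h>

#define THETA 11   /* ceil(sqrt(123 - 13 + 1)) = ceil(sqrt(111)) = 11 */
#define LOW   13   /* lower bound i of the range R[13, 123] */

/* Super-hash: map a value x in [13, 123] to the ordinal pair (d, m). */
static void superhash(int x, int *d, int *m)
{
    *d = (x - LOW) / THETA;   /* mash: magnitude */
    *m = (x - LOW) % THETA;   /* hash: residue   */
}

/* Inverse mapping: recover the value from its ordinal pair. */
static int value(int d, int m)
{
    return d * THETA + m + LOW;
}

int main(void)
{
    int d, m;
    superhash(13, &d, &m);
    printf("13  -> (%d,%d) -> %d\n", d, m, value(d, m));   /* (0,0)  -> 13  */
    superhash(123, &d, &m);
    printf("123 -> (%d,%d) -> %d\n", d, m, value(d, m));   /* (10,0) -> 123 */
    return 0;
}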
Pseudo-code illustrating a very generalized form of the in-situ hash sort is:

PROCEDURE in_situ_hash_sort
    x_c := 0;    (* set the exchange count to zero *)
    h_c := 0;    (* set the hysteresis count to zero *)
    where := 0;  (* set the initial starting point to zero *)
    WHILE (x_c < N) AND (h_c < N) DO
        temp := value at location where;
        (m1, m2, . . ., mn−1, mn) ← superhash(temp);
        IF destination = where THEN          (* hysteresis *)
            Move where to next location;
            Increment hysteresis count h_c;
        ELSE
            Exchange value with destination;
            Increment exchange count x_c;
        END IF;
    END WHILE;
END PROCEDURE;
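Drawing these pieces together, the following C sketch (an illustration, not the paper's original listing) implements the in-situ hash sort for the 3-by-3 matrix of values 1..9 used in the worked example later in the text. The termination test, stopping once either the exchange count or the hysteresis count reaches N, is one reading of the behaviour described in the text and is an assumption of this sketch.

#include <stdio.h>

#define D 3                 /* matrix dimension                    */
#define N (D * D)           /* number of data elements, range 1..9 */

static void in_situ_hash_sort(int M[D][D])
{
    int x_c = 0, h_c = 0;        /* exchange and hysteresis counts */
    int where = 0;               /* current location, 0..N-1       */

    while (x_c < N && h_c < N) {
        int row = where / D, col = where % D;
        int v = M[row][col];
        int d = (v - 1) / D;     /* mash: destination row    */
        int m = (v - 1) % D;     /* hash: destination column */
        if (d == row && m == col) {
            where = (where + 1) % N;   /* hysteresis: value already in place */
            h_c++;
        } else {
            M[row][col] = M[d][m];     /* exchange with the destination */
            M[d][m] = v;
            x_c++;
        }
    }
}

int main(void)
{
    int M[D][D] = { {5, 8, 1}, {9, 7, 2}, {4, 6, 3} };
    in_situ_hash_sort(M);
    for (int r = 0; r < D; r++)
        printf("%d %d %d\n", M[r][0], M[r][1], M[r][2]);
    return 0;
}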
The hash sort is a simple algorithm; the complexity is in the super-hash function,
which is the key to the algorithm. The remaining components of the algorithm handle
the exchange of values and the continuous iteration until a termination condition is reached.
The hash sort is a linear time complexity sorting algorithm. This linear time
complexity stems from the mapping nature of the algorithm. The hash sort uses
the intrinsic nature of the values to map them in order. The mapping function is
applied repeatedly in an iterative manner until the entire list of values is ordered.
The mathematical expression for the run-time complexity is:
F (time) = 2 · c · N
The function for the run-time complexity is derived from the complexity of the
mapping function, and the iterative nature of the algorithm. The mapping function is
a composite function, with two sub-functions. Multiple applications of the mapping
function are possible because of the extendible nature of the hash sort, but at least
one mapping is required. Hence, the constant c is a positive integer value, which
represents the number of sub-mappings within the mapping function.
The mapping function uses two sub-functions as a composite, so the overall time
for the mapping is two multiplied by the number of sub-mappings. This product is
always greater than one, and it reflects the dimension of the hash sort. The constant c
is not dependent on the data values or the size of the data set; it is an implementation
constant, and once the constant is chosen it is unaltered.
The mapping function must be applied iteratively to the range of values forming
the data list. This makes the overall time complexity of the hash sort the product of
the complexity of the mapping function multiplied by the size of the range of the list
of values. The hash sort remains multiple dimension, and linear in time complexity.
The time complexity of the hash sort is then:
F (time) = O(N )
The space complexity of the hash sort considers the storage of the data
values, not the variables used in the mapping process. The space complexity
of the hash sort depends upon which variant of the hash sort is used. The in-situ
hash sort, which uses the same multi-dimensional data structure as it maps the values
from the initial location to the final location, requires (N + 1) storage space. The data
structure is the size of the range of values, which is N . One additional storage location
is required as temporary storage as values are exchanged when mapped.
The space complexity for the in-situ hash sort is:
F (space) = O(N )
The direct hash sort maps from a one-dimensional data structure to a multi-
dimensional data structure of the same size. No temporary storage is required: as no
values are exchanged in the mapping process, the values are mapped directly to their
final location within the multi-dimensional data structure. The two data structures
are organized differently, but are of the same size N . Thus, the direct hash sort
requires 2 · N in terms of space complexity. The space complexity for the direct hash
sort is:
F (space) = O(N )
Each variation of the hash sort has different properties, but the space complexity
is (N + 1) for the in-situ hash sort, and 2 · N for the direct hash sort. For each
variation, the space complexity is linearly related to the size of the list of data values.
If the ratio is greater than one, the time complexity of the hash sort is greater
than that of the quick sort. If the ratio is exactly one, the two sort algorithms perform
with the same complexity. A ratio of less than one indicates the hash sort has a lower
time complexity than the quick sort, therefore outperforming it.
Taking the ratio of the hash sort's time complexity to that of the quick sort, the claim is:

lim (N → +∞)  (2 · k · N + c) / (N · log N)  <  1.0

which simplifies to:

lim (N → +∞)  (2 · k + c/N) / log N  <  1.0

taking the limit as N → +∞, the ratio then becomes:

(2 · k + c/∞) / log ∞  <  1.0

which simplifies to:

(2 · k) / ∞  <  1.0

which then reduces to:

0  <  1.0

showing that the ratio of the two algorithms is indeed less than one.
Therefore the hash sort will asymptotically outperform the quick sort.
2.4. Variations of the Hash Sort. There are two primary versions of the
hash sort: the in-situ hash sort and the direct hash sort. Variations upon the hash
sort build upon these two primary types. The in-situ hash sort is the basic form of the
hash sort, which works by exchanging values as explained previously. The in-situ hash
sort has a problem which can increase its time complexity, although the algorithm
remains linear: data values that map to their current location. In essence, such a data
value is already in place, where it belongs. Since the in-situ hash sort determines where
a value belongs and then exchanges, a value that already resides at its destination would
cause the algorithm to stall. To remedy this, another iterative mechanism keeps
the in-situ hash sort algorithm going by relocating the current location to the next
one. When the current location and the destination location are the same, the in-situ
hash sort has to be forced to proceed.
This forcing of the in-situ hash sort does add more time complexity, but only as a
linear multiplied constant. The number of data elements that map back to the location
they already occupy is the amount of hysteresis present in the data. The term
hysteresis is borrowed from electrical theory, meaning to lag; hysteresis in the in-situ
hash sort causes a lag, which the algorithm must be forced out of. The worst case
for hysteresis is that all of the data elements map to the location they are at. In this
case, the in-situ hash sort has to be pushed along until it is through all the data
elements. In this worst case, the time of the algorithm becomes double the linear
time complexity, increasing the time, but linearly, to 2 · (2 · k) · N , or 4 · k · N . The in-situ
hash sort is more space efficient, using only the space to store the data elements and
a temporary storage location for exchanging the data elements, or N + 1.
The direct hash sort is a variation upon the in-situ version to avoid the problem with
hysteresis. The direct hash sort uses a temporary data array to hold the data elements.
The data elements are then mapped from the single-dimension array into the
multiple-dimension array. No element can map to its current location, as it is being
mapped from a one-dimensional array into a multiple-dimensional one. However, the
storage requirement for the direct hash sort is twice the storage, or 2 · N . The
tradeoff for time efficiency is a worsening of the space efficiency. The time complexity
is 2 · k · N , as the problem with hysteresis never surfaces.
The variations in the hash sort can be applied to the two primary forms, the
in-situ and the direct hash sort. The variations are in dimensionality, and relaxing
a restriction imposed upon the data set that is sorted. The version of the hash sort
mentioned is two-dimensional, but the hash sort can be of any dimension d, where
d ≥ 2.
The map by hashing would then be multiple applications of the hash scheme. For
a d-dimensional hash sort, it will be of the form 2 · k · N , where k is the dimension-
ality of the hashing structure. Note that k is an implementation constant, so once
decided upon, remains a constant multiple of 2. Either primary type of hash sort
algorithm, the in-situ hash sort, or the direct hash sort, can be extended into higher
dimensionality.
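As an illustration of the higher-dimensional extension (a sketch, not code from the paper), a three-dimensional super-hash applies div twice and mod once, producing an ordinal triple whose lexicographic order preserves the order of the values:

#include <stdio.h>

/* Three-dimensional super-hash sketch (illustrative, not from the paper):
   the value is decomposed into an ordinal triple (d2, d1, m) by applying
   div twice and mod once with a common parameter n. */
static void superhash3(int x, int n, int *d2, int *d1, int *m)
{
    *m  = x % n;          /* residue                */
    *d1 = (x / n) % n;    /* first magnitude digit  */
    *d2 = (x / n) / n;    /* second magnitude digit */
}

int main(void)
{
    int d2, d1, m;
    superhash3(345, 10, &d2, &d1, &m);
    printf("345 -> (%d,%d,%d)\n", d2, d1, m);   /* prints (3,4,5) */
    return 0;
}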
The other variation upon the primary forms of the hash sort concerns the restric-
tion of unique data values. This restriction upon the data set can be relaxed, but with
each location in the hash structure a count is required of the number of data elements
that hash to that location. For large amounts of values in the data set, the hash
sort will "compress" them into the hash structure. When the original sorted data set
is required, each data element would be enumerated by the number of hashes to its
location. The flexibility of the hash sort to accommodate non-unique data values is
inefficient if there are few repeated data values. In such a case, nearly half of the hash
structure will be used only for single data values.
Both the in-situ hash sort and the direct hash sort have another problem, which
is inherent in either variant of the hash sort algorithm. If the data values within the
range are not all there, then the data set is sparse. The range is used to determine
the hash structure size, and the time to run the algorithm. The hash sort presumes
all data values in the data set within the range are present. If not, then the hash
sort will still proceed sorting for the range size, not the data set size. So for the data
set size Nd , and the range size Nr , if Nd ≥ Nr , the hash sort algorithm performs as
expected, or better.
If Nd < Nr , then the sparsity problem surfaces. The hash sort will sort on a
range of values, of which some are non-existent. The smaller data set size to range
size will not be reflected in the time-complexity of the hash sort, or in the required
hash structure; so when the cardinality of the data set and cardinality of the range of
values within the data set are inconsistent, the hash sort does not fail, but it becomes
inefficient. The empty spaces within the hash structure must then hold some sentinel
value outside the data range to distinguish them from actual data values. The direct
hash sort alleviates the sparsity problem somewhat for the hash sort time, but still is
inefficient in the hash structure as some of the locations will be blank or empty.
C version:

/* M is a matrix of {value, count} records; low is the lower bound of
   the data range and d the matrix dimension (declared elsewhere). */
void hash_sort(int list[], int n)
{
    int i, col, row, value;
    for (i = 0; i < n; i++) {
        value = list[i];
        col = (value - low) / d;    /* mash -- magnitude coordinate */
        row = (value - low) % d;    /* hash -- residue coordinate   */
        if (M[col][row].value == value)
            /* if the value is already here, increment count */
            M[col][row].count++;
        else {
            /* store the value, initialize the count to 1 */
            M[col][row].count = 1;
            M[col][row].value = value;
        }
    }
}
Pascal version:

const
  d = 10;
  size = 100;

type
  tag = record
    count: integer;
    value: integer;
  end;

  (* ... the mapping loop computes col and row from each value ... *)

  if M[col,row].value = value then
    (* if value already here *)
    M[col,row].count := M[col,row].count + 1
  else
  begin
    (* if maps first time, store value, initialize count to 1 *)
    M[col,row].count := 1;
    M[col,row].value := value;
  end; (*if*)
end; (*for*)
end; (*hash_sort*)
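For completeness, a self-contained usage sketch of the direct hash sort with counts (illustrative names, not the paper's code), using the range 7..10 and the data list from the walk-through later in the text; the original sorted order is recovered by walking the matrix and enumerating each location's count.

#include <stdio.h>

#define DIM 2     /* matrix dimension              */
#define LOW 7     /* lower bound of the data range */

struct slot { int value; int count; };

static struct slot M[DIM][DIM];   /* hash structure, zero-initialised */

/* Direct hash sort with counts: map each list element into the matrix. */
static void direct_hash_sort(const int list[], int n)
{
    for (int i = 0; i < n; i++) {
        int col = (list[i] - LOW) / DIM;   /* mash */
        int row = (list[i] - LOW) % DIM;   /* hash */
        if (M[col][row].count > 0)
            M[col][row].count++;           /* value already mapped here */
        else {
            M[col][row].value = list[i];
            M[col][row].count = 1;
        }
    }
}

int main(void)
{
    int list[] = { 7, 8, 7, 9, 10, 7, 8, 8 };
    direct_hash_sort(list, 8);

    /* Read the sorted order back by walking the matrix and expanding counts. */
    for (int col = 0; col < DIM; col++)
        for (int row = 0; row < DIM; row++)
            for (int k = 0; k < M[col][row].count; k++)
                printf("%d ", M[col][row].value);
    printf("\n");   /* prints: 7 7 7 8 8 8 9 10 */
    return 0;
}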
The number of data elements N is the total data size to be mapped by the hash
sort. The number of data elements determines the time-complexity of the hash sort,
the amount of time being linearly proportional to the number of data elements.
For the in-situ hash sort, the number of data elements is less than or equal to the size of
the square matrix M . For the direct hash sort, the number of elements can be of any
size. The only requirement is that the data elements fall within the range R.
The square matrix M is constructed around the range R of the data values so that
all possible values within the range R do map. The super-hash function maps data
elements within the range R into the square matrix M . Therefore, the storage space
M is formed from the range R of the data values, not the number of data elements
N.
The time complexity is dependent upon the number of data elements, the size N .
The in-situ hash sort can have fewer elements N than the matrix capacity M , but the
hash sort algorithm must go through the entire matrix M . Hence sparsity of data
values within the matrix is highly inefficient. The direct hash sort is different, since
the mapping is from a list into the matrix M ; the time complexity is again linear in
the size N of the list of elements.
In both cases, there is linear time complexity; only with the in-situ variant, it depends
more upon the size of the matrix M than the amount of data in the matrix. Thus
the time complexity of the in-situ variant depends more upon the range of values
than the number of them. No more data elements can be stored in the matrix than
its size M permits, which is related to the range R.
2.5.2. Analysis of Hash Sort Variants. Depending upon the variant used,
the hash sort can have a time complexity linearly proportional to either the size of the
data structure which is a matrix M , or the size of the data list L. In each variant of
the hash sort, the linear nature of the algorithm is proportionate to a greatest lower
bound.
The in-situ variant of the hash sort is linearly proportional to the size of the data
structure, the matrix M . The matrix M has a size determined in part by the super-
hash function. The size of the matrix places a least upper bound on the possible size
of the list L, and in doing so forms a least upper bound of O(M ) time complexity.
This least upper bound constraint stems from the fact that the data list L and the
data structure the matrix M are the same entity.
The direct variant of the hash sort is linearly proportional to the size of the data
list L, as the data structure the matrix M and the list L are separate entities. The
time complexity is independent of the data structure in the direct hash sort. This
independence of entities permits a greatest lower bound dependent on the size of the
data list L and not the data structure the matrix M .
For each variant of the hash sort, the time complexity of the algorithm is linear.
The linearity is constrained in relation to two separate determining factors for each
variant of the hash sort. In one variant, the in-situ version of the hash sort, the time
complexity has a least upper bound determined by the matrix M . The size of the
data list L in M can be less than, but no more than the size of the matrix M because
the data list L and the data structure M are the same entity. The direct hash sort
variant separates these two entities, and in doing so the constraint is a greatest lower
bound dependent upon the size of the data list L.
Matrix:
      m =  0    1    2
   d
   0       *    *    *
   1       *    *    *
   2       *    *    *
The values in the matrix must be mapped to the range 0 .. 8; this range of values
forms an ordinal pair of the form (d, m) from (0,0) to (2,2). Values in the matrix
increase from left to right, m = 0 .. 2, and from top to bottom, d = 0 .. 2. This is how
the ordering should place values once they are sorted.
Any range of values can be used, but for simplicity, the range of values in the
matrix is from 1 to 9. The hash sort algorithm used will be the in-situ (in place)
variant, so the problem of hysteresis is present. Hysteresis is the occurrence where a
value is already in its correct location, so the algorithm maps it back onto itself.
The hash sort involves starting at some initial point in the matrix, then traversing
the matrix and mapping each value to its correct location. As each value is
mapped, order is preserved (a larger value maps to a "higher" position than a
smaller one), and there are no collisions (excepting self-collisions, which are
hysteresis). The values are mapped by subtracting one, then applying the mod and div
operators.
The current position (the value being handled) is shown in parentheses, and the
computed destination is shown in brackets. The value used in the computation is given,
along with the computed ordinal pair. A before-and-after illustration shows the exchange
of the two values that is computed and then performed by the hash sort.
d = (x − 1) div 3 ; m = (x − 1) mod 3
      m =  0    1    2
   d
   0       5    8    1
   1       9    7    2
   2       4    6    3
      m =  0    1    2              m =  0    1    2
   d                             d
   0      (5)   8    1           0      (7)   8    1
   1       9   [7]   2           1       9    5    2
   2       4    6    3           2       4    6    3
          Before                         After
The start position is initially at (0,0). The value is 5, subtracting 1 is 4. The mapping
(illustrated once) is d = (4 div 3) = 1, m = (4 mod 3) = 1. Thus the computed
destination for where the value goes is (1,1).
      m =  0    1    2              m =  0    1    2
   d                             d
   0      (7)   8    1           0      (4)   8    1
   1       9    5    2           1       9    5    2
   2      [4]   6    3           2       7    6    3
          Before                         After
The next value is 7, subtracting 1 is 6. The computed destination for where the value
goes is (2,0).
      m =  0    1    2              m =  0    1    2
   d                             d
   0      (4)   8    1           0      (9)   8    1
   1      [9]   5    2           1       4    5    2
   2       7    6    3           2       7    6    3
          Before                         After
The next value is 4, subtracting 1 is 3. The computed destination for where the value
goes is (1,0).
      m =  0    1    2              m =  0    1    2
   d                             d
   0      (9)   8    1           0      (3)   8    1
   1       4    5    2           1       4    5    2
   2       7    6   [3]          2       7    6    9
          Before                         After
The next value is 9, subtracting 1 is 8. The computed destination for where the value
goes is (2,2).
      m =  0    1    2              m =  0    1    2
   d                             d
   0      (3)   8   [1]          0      (1)   8    3
   1       4    5    2           1       4    5    2
   2       7    6    9           2       7    6    9
          Before                         After
The next value is 3, subtracting 1 is 2. The computed destination for where the value
goes is (0,2).
      m =  0    1    2              m =  0    1    2
   d                             d
   0      <1>   8    3           0       1   (8)   3
   1       4    5    2           1       4    5    2
   2       7    6    9           2       7    6    9
          Before                         After
The next value is 1, subtracting 1 is 0. The computed destination for where the value
goes is (0,0).
Here is an example of hysteresis noted by the angular brackets; the start and final
positions are equal, so the value has mapped back onto itself. The start position is
'forced' to the next position, or the algorithm would be stuck in an infinite loop and
so would "stall".
      m =  0    1    2              m =  0    1    2
   d                             d
   0       1   (8)   3           0       1   (6)   3
   1       4    5    2           1       4    5    2
   2       7   [6]   9           2       7    8    9
          Before                         After
The next value is 8, subtracting 1 is 7. The computed destination for where the value
goes is (2,1).
      m =  0    1    2              m =  0    1    2
   d                             d
   0       1   (6)   3           0       1   (2)   3
   1       4    5   [2]          1       4    5    6
   2       7    8    9           2       7    8    9
          Before                         After
The next value is 6, subtracting 1 is 5. The computed destination for where the value
goes is (1,2).
      m =  0    1    2              m =  0    1    2
   d                             d
   0       1   <2>   3           0       1    2    3
   1       4    5    6           1       4    5    6
   2       7    8    9           2       7    8    9
          Before                         After
The next value is 2, subtracting 1 is 1. The computed destination for where the value
goes is (0,1).
Here is another example of hysteresis; the start and final positions are equal, so the
value has mapped back onto itself. The start position is 'forced' to the next position
or the algorithm would remain stuck.
The matrix is now sorted, but the algorithm has no way of knowing this. It will
continue through positions 3 to 9 until the hysteresis count reaches the size of the
data, N . The remaining sorted data causes "hysteresis", making the algorithm lag, in a
sense tripping over already sorted data. In an ideal arrangement of the data, no
hysteresis would occur; the algorithm would then know it has finished sorting when the
exchange count equals the size of the data, N . Unfortunately, this is an ideal situation
which will most likely never occur in practice.
Matrix
      m =    0       1
   d
   0      <*,*>   <*,*>
   1      <*,*>   <*,*>
The values of the data set are from 7 to 10. A list L of eight data elements to
be mapped will be used. Each location in the matrix holds a value and an associated
count of the number of data elements mapped to that location. Once all the values from
the list L are mapped, the hash sort algorithm will terminate.
d = (x − 7) div 2 ; m = (x − 7) mod 2
As each value is mapped, a brief explanation of the process and what happens to
the matrix will be given. As the list L is mapped, it will become progressively smaller
through the walk-through. Once the list size is zero, the hash sort
will be complete. The left-most element in the list L is the one being mapped
by the hash sort algorithm.
L = { 7, 8, 7, 9, 10, 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <*,*>   <*,*>
   1      <*,*>   <*,*>
The first value is 7. The hash sort maps the value to the location in the matrix
as d = (7 − 7) div 2, m = (7 − 7) mod 2, which is (0,0).
L = { 8, 7, 9, 10, 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,1>   <*,*>
   1      <*,*>   <*,*>
Step 1
The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).
L = { 7, 9, 10, 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,1>   <8,1>
   1      <*,*>   <*,*>
Step 2
The next value is 7. The hash sort maps the value to the location in the matrix
as d = (7-7) div 2, m = (7-7) mod 2, which is (0,0).
L = { 9, 10, 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,2>   <8,1>
   1      <*,*>   <*,*>
Step 3
The next value is 9. The hash sort maps the value to the location in the matrix
as d = (9-7) div 2, m = (9-7) mod 2, which is (1,0).
L = { 10, 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,2>   <8,1>
   1      <9,1>   <*,*>
Step 4
The next value is 10. The hash sort maps the value to the location in the matrix
as d = (10-7) div 2, m = (10-7) mod 2, which is (1,1).
L = { 7, 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,2>   <8,1>
   1      <9,1>   <10,1>
Step 5
The next value is 7. The hash sort maps the value to the location in the matrix
as d = (7-7) div 2, m = (7-7) mod 2, which is (0,0).
L = { 8, 8 }
Matrix
      m =    0       1
   d
   0      <7,3>   <8,1>
   1      <9,1>   <10,1>
Step 6
The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).
L = { 8 }
Matrix
      m =    0       1
   d
   0      <7,3>   <8,2>
   1      <9,1>   <10,1>
Step 7
The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).
L = { }
Matrix
      m =    0       1
   d
   0      <7,3>   <8,3>
   1      <9,1>   <10,1>
All 8 data elements have been mapped by the hash sort into the matrix. There
is no hysteresis with the direct hash sort, as all data elements are mapped by value
into the appropriate location. Moreover, the time of the direct hash sort is linearly
proportional to the size of the data list L.
In summary, the hash sort has the following characteristics:
• Linear time complexity, even in the worst case; for the in-situ variant this is
proportional to the data structure size, and for the direct variant it is proportional
to the data list size.
• The hash sort puts data elements in the correct position and does not move them
afterward (data quiescence).
• Data independence: data elements in the data set are mapped by their unique
value, and do not depend on the predecessor or successor in the data set.
• High speed lookup is possible once the data is sorted (faster than binary search),
or alternatively a lookup to the approximate location within the data structure
(see the sketch after this list).
• The hash sort works only with numeric values; non-numeric values require conversion.
• The data range of values must be known for the algorithm to work effectively.
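The high-speed lookup claim in the list above can be sketched as follows (hypothetical names, assuming the in-situ layout over the range 1..9 used in the earlier example): the position of a value is computed directly from the value itself, with no comparisons or searching.

#include <stdio.h>

#define DIM 3
#define LOW 1

/* Direct lookup in the sorted hash structure: compute the position of a
   value from the value itself (assumes x lies within the known range). */
static int lookup(int M[DIM][DIM], int x, int *d, int *m)
{
    *d = (x - LOW) / DIM;
    *m = (x - LOW) % DIM;
    return M[*d][*m] == x;    /* 1 if the value is present */
}

int main(void)
{
    int M[DIM][DIM] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    int d, m;
    if (lookup(M, 6, &d, &m))
        printf("6 found at (%d,%d)\n", d, m);   /* prints (1,2) */
    return 0;
}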
3. Testing.
3.1. Testing Methodology.
The testing methodology of the sort algorithm involves two perspectives, an em-
pirical or quantitative viewpoint, and a mathematical or qualitative view. The math-
ematical or qualitative point of view looks at the hash sort algorithm alone. The
algorithm is tested for its characteristic behavior; this testing can be exhaustive, as
there are an infinite number of sizes of test cases to test the algorithm with. The
mathematical methodology then, tests for data size, and data arrangement or struc-
turing. These form tests for which the size of the test data list is increasing, and
tests in which there are partially sorted sublists within the overall test data list. For
simplicity, the testing emphasis was placed on the in-situ version of the hash sort.
The testing approach for determining the behavior of the hash sort algorithm focuses
on the algorithmic time complexity. Testing is non-exhaustive, as all possible test
cases cannot be generated or tested. The algorithmic behavior is tested on
different test cases to avoid the possibility of an anomalous "best" case or extreme
"worst" case scenario. The hash sort algorithm is tested on different sizes and
different permutations of data lists to evaluate the algorithmic performance. Data lists
that are completely unsorted or fully sorted are used, along with partially sorted data
lists. The hash sort is also compared to other sorting algorithms, to give relative
comparisons and contrasts for better assessment.
The two other sorting algorithms used to compare and contrast with the hash sort
are the bubble sort and the quick sort. Each algorithm represents an extreme in
algorithmic performance. The bubble sort is an O(N²) algorithm, but has excellent,
linear O(N ) performance on partially sorted data lists. The quick sort is known as the
best and fastest sorting algorithm, with O(N log N ) performance. However, the quick
sort does falter on already sorted data lists, degenerating into an O(N²) time complexity
algorithm. Thus, the bubble sort is best when the quick sort is at its worst, and vice
versa. Again, the extremes of algorithmic performance in pre-existing algorithms have
been selected for comparison and contrast with the hash sort algorithm.
3.2. Test Cases.
Testing on data size looks at increasing data sizes and the rate of growth in the
time complexity of the algorithm. Testing on data size is on "complete" lists of data
elements, lists which are fully sorted or are unsorted. These types of "complete" data
test cases form the extremes of the other variations in the data test case. The sorted
lists are fully sorted in either ascending or descending fashion, and the unsorted list
lies between these two extremes. Size testing looks at the effects of increasing
sizes on these three types of "complete" sorted data test cases.
3.3. Test Program.
The test program has to handle two important issues: the program code, and
the test platform. As for the test platform, the computer system the code is run on,
the test program must not be biased by any hardware or platform architecture
enhancements. So a general-purpose computer, without any specific enhancements or
features, is used. For further platform independence, different computers should be
used, of different sizes and performance, to be sure of the independence of the results
obtained.
The code must be written in a "neutral" computer language to avoid any esoteric
features of a language, and it must be readable. Similar to platform independence, the
tests must be independent of the language implementing the test programs. Any
programming language feature which would give a biased advantage through better or
more optimized generated code is to be avoided.
The test case generator and the test sort programs are integrated together in one program.
The test case generator generates the particular test case in internal storage within
the program, and the hash sort, bubble sort, and quick sort all process the data. The
time to process the data is noted and written to an output data file. The time in
microseconds and the size of the data set are recorded by the program.
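A minimal timing harness in this spirit (an illustration, with the standard library quick sort standing in for the sort under test; the paper's actual test program is not reproduced here) generates a test case, times the sort, and writes the data size and the elapsed time in microseconds:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Comparator for qsort, which stands in here for the sort under test;
   the actual test program timed the hash sort, bubble sort, and quick
   sort in the same way. */
static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

int main(void)
{
    int n = 100000;
    int *data = malloc(n * sizeof *data);

    /* generate an unsorted test case within a known range */
    for (int i = 0; i < n; i++)
        data[i] = rand() % n + 1;

    clock_t start = clock();
    qsort(data, n, sizeof *data, cmp_int);
    clock_t stop = clock();

    /* record data size and elapsed time in microseconds */
    printf("%d %.0f\n", n, 1.0e6 * (stop - start) / CLOCKS_PER_SEC);

    free(data);
    return 0;
}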
3.3.1. Test Platform Specifications. The table summarizes the different com-
puter systems which were used to test the hash sort.
The original Pascal implementations were converted to C using the p2c converter.
The generated C code was hand optimized to eliminate inefficiencies and
to follow strict ANSI C compliance. The code was modified for the UNIX system
environment, in particular the system timing calls. Some of the routines, such as
write_matrix and write_list, were originally used to check that the output was a sorted
list or array, verifying that the algorithms were working. Once this had been checked,
the calls to these procedures were commented out, as the actual test program dumped
its output to a data file. These routines are still present in the source code used for
the final timing runs.
4. Analysis of Results.
The hash sort is expected to have linear, O(N ), time complexity performance in
relation to the test case size. This consistency and stability of the algorithm across
the different cases, where the comparison algorithms degenerate or accelerate in
performance, is expected to be verified in the tests. The worst case of the
algorithm's performance will be a constant multiple of linear time, c · N , but it will
remain linear time complexity.
5. Conclusions.
The reason for this seemingly strange performance of the algorithm is the theoretical
assumption that the hash sort and the quick sort operate using the same types
of operations with the same underlying machine instruction clock cycles. This of course is
not correct: the quick sort uses a compare-based machine instruction, whereas the
hash sort uses an integer divide machine instruction.
A comparison of the clock cycle times for the compare and integer divide instructions
on an Intel 486 processor gives some indication of this (Brey, pp. 723-729).
To rank the algorithm on each of the criteria, the simple expedient of following
a grading system is used. Rankings are: excellent, good, fair, poor, bad. A comment
as to the ranking and why it was judged to rate such a score is given.
5.3. Evaluation of the Algorithm. The hash sort has the following rankings:
• time complexity: good; the hash sort is linear in theory, but in practice must
reach a certain critical mass, so its overall time complexity is more degenerate
than the theory behind the algorithm would indicate.
• space complexity: fair; the hash sort has some overhead in the algorithm, but
most of it consists of counters which are incremented as the algorithm progresses.
However, the algorithm fits an n-dimensional array onto a one-dimensional
memory, which is somewhat awkward and introduces memory manipulation
overhead.
5.5. Summary of Sort Algorithms.
The hash sort compared to the bubble sort is similar in coding complexity: very
short code which is simple and easy to follow. However, the hash sort is much faster
than the bubble sort, although it is not as efficient in using memory. The bubble sort
is much more efficient in using memory, but it involves much data movement as it sorts
the data.
The quick sort is much more complex to code, even in the recursive version. A
non-recursive version of the quick sort becomes quite complex and ugly in coding.
The quick sort does outperform the hash sort initially, but it is still a linearithmic
algorithm, and it does have a degenerate worst case, although this is rare in practice. The
quick sort is good at managing memory, but it is a recursive algorithm, which has
somewhat more overhead than an iterative algorithm. Because of its partitioning and
sub-dividing approach, the amount of data movement is less than with the bubble sort.
6. Application.
The hash sort is well suited to applications where:
• the data needs to be ordered quickly, then accessed frequently
• the data set is large, with heavy density within the data range
Even worse, all the packets must be received before they can be sorted, if sorting
is required. To use a quick sort for this application, a separate algorithm to check
whether the packets are already sorted is required; if they are not, the quick sort
algorithm is then used.
N is the number of packets received, and checking that the packets are in order
requires a linear scan of O(N ) to examine all the packets, although the scan can stop
early when a packet is found out of order.
The hash sort is a better algorithm to use, although the packets would have to be
stored in a matrix structure rather than a linear list. A check for the packets being
received in order is unnecessary, because the hash sort will not degenerate on sorted
data; such data is an extreme form of hysteresis, but the hash sort remains linear in
such cases.
More interesting is that the hash sort does not have to wait for the last packet
to be received; it can begin to sort the packets as they are received. The hash sort
maps the packets to where each uniquely belongs. Since the packets are unique, no
two packets share the same location. So the hash sort can uniquely determine from the
packet itself where it belongs in the matrix. As packets are received, they are placed
into their appropriate location. By the time of the last packet arrival, all the data is
in place.
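A sketch of this direct-placement idea (the packet structure and names are hypothetical, not from the paper): each packet's sequence number determines its slot in the buffer, so packets are placed as they arrive and no separate sorting pass is needed once the last packet lands.

#include <stdio.h>
#include <string.h>

#define NPACKETS 8
#define PAYLOAD  32

/* Hypothetical packet: a sequence number plus a payload. */
struct packet {
    int  seq;                 /* 0 .. NPACKETS-1 */
    char payload[PAYLOAD];
};

static struct packet buffer[NPACKETS];

/* Place a packet directly into its slot as it arrives; the sequence
   number plays the role of the hash-sort mapping. */
static void receive_packet(const struct packet *p)
{
    buffer[p->seq] = *p;
}

int main(void)
{
    /* packets arriving out of order */
    int arrival[] = { 3, 0, 5, 1, 7, 2, 6, 4 };
    for (int i = 0; i < NPACKETS; i++) {
        struct packet p;
        p.seq = arrival[i];
        snprintf(p.payload, PAYLOAD, "data-%d", arrival[i]);
        receive_packet(&p);
    }

    /* by the time the last packet arrives, the buffer is in order */
    for (int i = 0; i < NPACKETS; i++)
        printf("%d: %s\n", buffer[i].seq, buffer[i].payload);
    return 0;
}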
The hash sort may be slower than the data reception rate, in which case the
algorithm will continue to sort the packets as they are stored. The total time in this
case is the same as if, in the quick sort implementation, the data were received already
in order and the algorithm checking that the data is in order were used.
This example shows the power of data independence in the hash sort algorithm – a
data element does not depend on its predecessor or successor for its location, it is
determined uniquely by the data element itself.
This property of data independence, along with the speed of the hash sort and its
robustness in the degenerate case, makes it well suited to such applications.
One drawback, however, is that the number of packets sent must be large enough
to reach the "critical mass" needed by the hash sort.
7. Mathematical Proofs.
7.1. Proof of Injective Mapping Function.
Theorem:
Given a pre-image set S of unique integers and the mapping F which takes each x ∈ S
to the ordinal pair (x div n, x mod n), where n > 0, the resulting image set of ordinal pairs
M = {(p1 , q1 ), (p2 , q2 ), . . . , (pn−1 , qn−1 ), (pn , qn )} is formed such that F : S → M is an
injective mapping.
Proof:
By the definition of S, all the integers in S are unique in Z, so any two integers
x, x′ ∈ S have the property that x ≠ x′.
Take the case for x:

px = x div n                          qx = x mod n
px = dx where dx ≥ 0                  qx = mx where 0 ≤ mx < n

For x′ = x + c where 0 < c < n, the residue changes (or, if it wraps past n, the magnitude
increases by one), so that for x ≠ x′, F (x) ≠ F (x′). But this covers only x′ = x + c where
0 < c < n.
Take the case when c ≥ n. Using the same definition for x, take x′ = x + c where
c = n:

px′ = x′ div n                        qx′ = x′ mod n
px′ = (x + n) div n                   qx′ = (x + n) mod n
px′ = (x div n) + (n div n)           qx′ = (x mod n) + (n mod n)
px′ = dx + 1 where dx ≥ 0             qx′ = mx + 0 where 0 ≤ mx < n
px′ = dx + 1 where dx ≥ 0             qx′ = mx where 0 ≤ mx < n
So the occurrence of dx = dx′ and mx = mx′ with x ≠ x′ is never true and
will never occur, which is the only possibility for the mapping F : S → M to be
non-injective.
Thus it follows that no two ordinal pairs in M formed from any two distinct integers in S
by F can ever be equal. Since this is the only counter-example to the definition of
F (x) as an injective mapping, the only conclusion is that F : S → M must
be an injective mapping under F (x).
Q.E.D.
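As an informal computational complement to the proof (not part of the original text), a brute-force check over a range of values confirms that no two distinct values produce the same (div, mod) pair:

#include <stdio.h>

/* Brute-force check: over the range [0, limit), no two distinct values
   share the same (x div n, x mod n) pair, i.e. the mapping is injective. */
int main(void)
{
    int n = 11, limit = 1000, clashes = 0;

    for (int x = 0; x < limit; x++)
        for (int y = x + 1; y < limit; y++)
            if (x / n == y / n && x % n == y % n)
                clashes++;

    printf("pairs that clash: %d\n", clashes);   /* prints 0 */
    return 0;
}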
7.2. Proof of Injective Multiple Mappings.
Theorem:
Proof:
F (n): S
F (G): S
F (G : N ) : S
Q.E.D
REFERENCES
[1] Aho, Hopcroft, and Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, Mass.,
1983.
[2] Amble, O. and Knuth, D.E., "Ordered Hash Tables", Computer Journal, vol. 17, no. 2, pp.
135-142, 1974.
[3] Amstel J.J. Van and Poirters J.A., The Design of Data Structures and Algorithms, Prentice-Hall,
Englewood Cliffs, N.J., 1989.
[4] Baase, Sarah. Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley,
Reading, Mass., 1978.
[5] Barron, D.W. and Bishop, J.M., Advanced Programming : A Practical Course., John Wiley &
Sons, New York, 1984.
[6] Brey, Barry B. The Intel Microprocessors, Macmillan Publishing, New York, 1994.
[7] Ellzey, Roy S., Data Structures for Computer Information Systems, Science Research Associates,
Chicago, IL, 1982.
[8] Harrison, Malcolm C., Data Structures and Programming. Scott, Foresman, and Co., 1973.
[9] Horowitz, E. and Sahni, S., Fundamentals of Computer Algorithms., Potomac, Md., Computer
Science Press, 1978.
[10] Horowitz, E. and Sahni, S., Fundamentals of Data Structures. Computer Science Press, Woodland
Hills, Calif., 1976.
[11] Knuth, Donald. E. The Art of Computer Programming. Volume 3: Searching and Sorting.
Addison-Wesley, Reading, Mass., 1973.
[12] Kronsjo, Lydia I., Algorithms: Their Complexity and Efficiency., John Wiley & Sons, New
York, NY., 1979.
[13] Kruse, Robert L., Data Structures and Program Design. Prentice-Hall, Englewood Cliffs, N.J.,
1984.
[14] Lewis, T.G. and Smith, M.Z., Applying Data Structures., Houghton-Mifflin, Boston, Mass.,
1976.
[15] Lorin, Harold. Sorting and Sort Systems, Addison-Wesley, Reading, Mass., 1975.
[16] Morris, Robert, "Scatter Storage Techniques", Communications of the ACM, vol. 11, no. 1, pp.
38-44, 1968.
[17] Peterson, W.W., "Addressing for Random-Access Storage", IBM Journal of Research and De-
velopment, vol. 1, no. 2, pp. 130-146, 1957.
[18] Rich, Robert P. Internal Sorting Methods Illustrated with PL/1 Programs, Prentice-Hall, En-
glewood Cliffs, N.J., 1972.
[19] Tremblay, Jean-Paul and Sorenson, Paul G., An Introduction to Data Structures with Applica-
tions. 2nd ed., McGraw-Hill, New York, NY., 1984.
[20] Wirth, Niklaus. Data Structures + Algorithms = Programs. Prentice-Hall, Englewood Cliffs,
NJ., 1976.