IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 71, NO.

2, FEBRUARY 2024 717

A Low-Cost Pipelined Architecture Based on a Hybrid Sorting Algorithm
You-Rong Chen, Chien-Chia Ho, Wei-Ting Chen, and Pei-Yin Chen, Senior Member, IEEE

Abstract: In this paper, a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed. The proposed architecture is constructed with a bitonic sorter and several cascaded bidirectional insertion sorting units. The bidirectional insertion sorting unit uses the segmented sorted subsequence generated by the bitonic sorter as input, and records the maximum and minimum values of the subsequence. After all segmented subsequences are processed through the cascaded bidirectional insertion sorting units, a sorted sequence is obtained. The proposed architecture is implemented using the Verilog hardware description language (HDL) and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library. The experimental results indicate that the proposed architecture can not only shorten sorting cycles but also reduce hardware area costs. Moreover, sorting cycles can be further shortened by increasing the parallelism of the proposed architecture. Under the configuration in which 2048 32-bit data are to be sorted and 16 data are processed simultaneously, the proposed architecture can improve the throughput-to-gate-count ratio by 16% and the throughput-to-power-consumption ratio by 25% compared to the existing sorting design. The proposed architecture makes the most efficient use of hardware resources.

Index Terms: Sorting architecture, low cost, low latency, pipelined architecture, very large-scale integrated (VLSI) circuit.

Manuscript received 24 September 2023; revised 20 November 2023; accepted 11 December 2023. Date of publication 28 December 2023; date of current version 30 January 2024. This work was supported in part by the National Science and Technology Council, Taiwan, under Grant 110-2221-E-006-164-MY3. This article was recommended by Associate Editor X. S. Zhang. (Corresponding author: Pei-Yin Chen.)
You-Rong Chen and Pei-Yin Chen are with the Digital Integrated Circuit Design Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan (e-mail: [email protected]; [email protected]).
Chien-Chia Ho and Wei-Ting Chen are with MediaTek Inc., Hsinchu 30078, Taiwan (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TCSI.2023.3342929
1549-8328 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

SORTING is a crucial process in many fields, including scientific computing [1], image processing [2], wireless networks [3], [4], [5], and database management [6], [7]. Because of increases in the quantity of data generated in various applications, the development of high-performance sorting algorithms has become critical. Although some studies have adopted parallel sorting methods that involve the use of multicore central processing units [8] or graphics processing units [9], [10], the high resource costs limit their application. On the contrary, high throughput and operating speed can be achieved using dedicated hardware to implement the sorting algorithm [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29]. The hardware implementation can also be easily deployed to devices with limited resources, which makes it a better solution.

Recently, many hardware architectures for sorting algorithms implemented with field programmable gate arrays (FPGAs) [16], [17], [18], [19], [20], [26], [27], [28], [29] or very large-scale integrated (VLSI) circuits [21], [22], [23], [24] have been proposed. Farmahini-Farahani et al. [21] implemented a modular design that comprised hierarchical sorting units, which were optimized for max-set selection. Lin et al. [22] proposed a low-power, high-throughput modular hardware design. Their architecture swaps the index values of the sorted numbers instead of the numbers themselves to reduce power dissipation. Mashimo et al. [26] proposed merge network architectures in which the number of gate levels remains constant as the level of data parallelism increases, preventing a significant decrease in operating frequency. In [27], Saitoh et al. improved the architecture proposed in [26] by eliminating the feedback data paths in the merge logic to shorten the critical path. Qiao et al. [28] presented an analytical framework for modeling the overall performance of the FPGA-accelerated external merge sort system and provided an optimized solution for a specific device. Cho et al. [29] proposed a near-memory radix sort accelerator, which achieves high throughput with a parallel 1-bit radix sorter.

Zuluaga et al. [16] proposed new hardware structures called streaming sorting networks and an accompanying domain-specific hardware generation tool through which users can easily tune the trade-off between area costs and performance. The designs in [19], [20], [23], and [24] developed architectures that do not require comparison operations between the data to be sorted. Ray et al. [19] proposed a comparison-free hardware sorting engine that identifies the largest element from the unsorted data in each iteration, which is accomplished using basic logic gates. Let N be the number of data to be sorted. The sorted result can be obtained after N iterations. Based on the basic blocks proposed in [19], Ray et al. [20] presented a hardware-based parallel comparison-free sorter. The proposed parallel sorter in [20] incorporates concurrent comparison-free clusters to achieve a speed-up over nonparallel architectures. Although the designs in [19] and [20] require only N cycles to obtain sorted results, their operating frequency is significantly dependent on N. When N increases, the throughput of the designs in [19] and [20] will also decrease. Moreover, the hardware resources required by the designs in [19] and [20] increase sharply

when N is larger than 256. Therefore, their designs are not suitable for sorting large datasets. The designs in [23] and [24] aim to develop architectures that are suitable for sorting large datasets. Abdel-Hafeez et al. [23] proposed a hardware-based comparison-free sorting algorithm. The area costs of their hardware design had a low growth rate, and the number of required sorting cycles in their algorithm was linearly proportional to N. Consequently, the algorithm is suitable for sorting large datasets. Chen et al. [24] designed a bidirectional sorting architecture based on a quasi-comparison-free sorting algorithm. The numbers to be sorted are divided into two parts by a threshold, and the sorting process is conducted concurrently for the two parts. The comparison circuits are only used to narrow the range of numbers to be sorted, and no comparison must be conducted in the sorting process; thus, the number of required sorting cycles is effectively reduced.

Batcher [25] proposed a sorting network based on bitonic sort that can process comparisons in parallel and thus achieve high throughput. Chen et al. [17] implemented a bitonic sorting network on FPGA by using an energy- and memory-efficient mapping methodology. Bitonic sort can improve the throughput of hardware designs; however, it requires the use of many comparators for parallel comparisons when N is large, which results in high area costs and is a critical disadvantage for hardware implementation. To prevent sharp increases in area costs, Chen et al. [18] proposed a hybrid pipelined architecture that consists of a bitonic sorter and several cascaded sorting units. The bitonic sorter generates segmented sorted subsequences, and the sorting units then determine the max set of the full sequence. Because the bitonic sorter only processes small segmented subsequences, area costs can be reduced. Their architecture also uses data parallelism to enhance its throughput. However, the area costs of their architecture can be further improved.

Fig. 1. The comparison of gate counts between the proposed sorting architecture and the design in [18] with the data bit width K = 32.

To strike a better balance between hardware area costs and high throughput, a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed in this paper. The proposed architecture can sort N data of bit width K by the cascaded novel bidirectional insertion sorting units (BISUs) and enhance throughput by increasing the level of data parallelism (P). Compared with existing sorting designs, the proposed hardware architecture has the lowest area costs and requires the fewest sorting cycles and the lowest power consumption. Moreover, when compared with the architecture proposed in [18], which also adopts a pipelined architecture, the proposed architecture can produce more sorted data under the same area costs or power consumption. The main contributions of this paper are as follows.

1) Inspired by the previous works [18] and [24], the proposed sorting architecture introduces the concept of bidirectional processing into the cascaded insertion sorting units and designs the novel BISU. The proposed sorting architecture adopts a hybrid design, which is constructed with a bitonic sorter and several BISUs. When the BISU receives the segmented sorted subsequence from the bitonic sorter or the BISU in the previous stage, it will record the maximum and minimum values passed through. If the maximum and minimum values recorded by the BISU are updated, the BISU can insert the original maximum and minimum values into the sorted subsequence simultaneously and keep the subsequence in order. After all segmented subsequences pass through the cascaded BISUs, a sorted sequence is obtained. With the designed BISU, the required sorting cycles for the sorting architecture are significantly reduced.
2) When the amount of data processed in a cycle P and the bit width of data K are fixed, the area costs of the proposed architecture are linearly proportional to the size of input data N. Thus, the proposed architecture is suitable for sorting large datasets. Fig. 1 shows the comparison of hardware area costs between the proposed architecture and the design proposed in [18]. All the architectures in Fig. 1 are synthesized with the same constraint, where the clock frequency is set as 500 MHz. Under the configuration that P is 16, the proposed architecture reduces the gate counts by 12%, 17%, 15%, 13%, and 14% for N of 128, 256, 512, 1024, and 2048, respectively, when compared to the design proposed in [18]. The total amount of reduced gate counts by the proposed architecture is also proportional to N.
3) The designed BISU reduces the power consumption of the proposed architecture. The importance of this characteristic increases as the amount of processing data or the bit width of the data increases.
4) Both of the designs in [18] and the proposed architecture can process multiple data in a cycle. Compared with the design in [18], the proposed architecture requires fewer pipeline stages to complete the sorting process. Thus, the area costs of the proposed architecture are also reduced. Furthermore, the percentage reduction in the area costs of the proposed architecture compared with that of the design in [18] increases with P. The comparison between the proposed architecture and the design in [18] with different P is also illustrated in Fig. 1. Under the configuration that N is 2048, the proposed architecture can reduce the gate counts by 6% when P is 4, 11% when P is 8, and 14% when P is 16, compared to the design in [18]. As a result, the proposed architecture can enhance the throughput with fewer additional hardware area costs.
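Contribution 1 pairs a bitonic sorter with the BISUs. As a rough illustration of what the bitonic sorter computes on one segment, the following Python sketch applies a standard iterative bitonic network to a power-of-two-length list and counts the compare-exchange operations it uses. The function name `bitonic_sort` and the counter `compare_units` are illustrative names, not taken from the paper, and the wiring is the textbook layout, which may differ from the network the authors actually drew. The counts it produces agree with the six and 24 comparison units that Section IV-A reports for four-input and eight-input sorters.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list in ascending order.

    Returns the sorted list and the number of compare-exchange
    operations (one per comparison unit in a hardware network).
    """
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    compare_units = 0
    size = 2
    while size <= n:              # build sorted runs of length `size`
        stride = size // 2
        while stride >= 1:        # one network column per stride
            for i in range(n):
                j = i ^ stride    # partner index in this column
                if j > i:
                    # Blocks alternate direction until the final merge,
                    # where every pair is ordered ascending.
                    ascending = (i & size) == 0
                    compare_units += 1
                    if (data[i] > data[j]) == ascending:
                        data[i], data[j] = data[j], data[i]
            stride //= 2
        size *= 2
    return data, compare_units


if __name__ == "__main__":
    print(bitonic_sort([7, 3, 5, 1]))
    print(bitonic_sort([8, 7, 6, 5, 4, 3, 2, 1]))
```

Each (size, stride) pair corresponds to one column of n/2 comparators, so the count grows as (n/2)·log2(n)·(log2(n)+1)/2. This is why the hybrid design keeps the bitonic sorter at width P rather than N.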

The remainder of this paper is organized as follows. The relevant studies on sorting architecture are introduced in Section II. Section III describes the proposed sorting algorithm. The hardware architecture of the proposed algorithm is introduced in Section IV. Section V presents the experimental results of this paper and comparisons between the performance of the proposed architecture and existing sorting designs. Finally, the conclusion of this paper is provided in Section VI.

II. LITERATURE REVIEW

In this section, the benefits and disadvantages of different
sorting architectures are reviewed.

A. Bitonic Sort Algorithm

Bitonic sort is a sorting method proposed by Batcher [25], which performs parallel comparisons to achieve high throughput. A bitonic sorting network has two stages. The first stage is responsible for merging the input elements into a bitonic sequence, which is composed of two subsequences: an ascending order subsequence and a descending order subsequence. After the unsorted sequence is merged into a bitonic sequence, the second stage recursively divides the bitonic sequence into a larger subsequence and smaller subsequence. An arbitrary element in the larger subsequence is no smaller than that in the smaller subsequence. The operation of the second stage is completed when the length of the larger and smaller subsequences becomes 1.

Although bitonic sort can increase throughput, it must process several parallel comparisons in a cycle. When bitonic sort is implemented in hardware, multiple comparators are required in a single pipelined stage; thus, area costs increase sharply as N increases.

B. Comparison-Free Sorting Algorithm

The comparison-free sorting algorithm was first proposed by Abdel-Hafeez et al. [23]. The algorithm is composed of the write-evaluate phase and the read-sort phase. During the write-evaluate phase, the number of occurrences of each input data element is recorded. Thus, a count array (CA) of length 2^K is required for data of bit width K. After all input elements are counted, the algorithm starts the read-sort phase to generate the sorted data sequence. A sorted array (SA) of length N is used to store the sorting results in order. During the read-sort phase, the CA will be traversed by the index. For the element of CA whose value is not 0, its index value will be stored into SA and its value is decremented by 1. This process repeats until the value of the current element becomes 0, and then the index value is incremented by 1 to examine the value of the next element. Since no comparator is required in the comparison-free sorting algorithm, the growth rate of the hardware area costs of the design in [23] is very low. However, the characteristics of this algorithm limit the ability of parallel processing. It is hard to count multiple input data or examine the value of multiple elements of CA in a cycle for the hardware implementation, which may result in ambiguity in the value of registers. To accelerate the process of the comparison-free sorting, Chen et al. [24] proposed a bidirectional architecture, which is shown in Fig. 2. The architecture proposed in [24] divides the CA into a high-index part and a low-index part, traversing the CA from both directions during the read-sort phase. However, there is still a limit to the number of sorting cycles that can be reduced.

Fig. 2. The architecture of the index counting module proposed in [24].

On the other hand, Ray et al. [19] proposed a comparison-free sorting engine that requires only N cycles to obtain sorted results. The sorting engine proposed in [19] consists of K cascaded blocks, and each of these blocks consists of N cells, each of which is composed of a 2-input AND gate and a multiplexer. An unsorted array (UA) of length N is used to record the indices of unsorted data, with all values in UA initialized to 1. In each iteration, UA is filtered by the cascaded blocks to identify the largest element from the unsorted data, and the value at the corresponding index of UA is set as 0. The i-th cascaded block receives the filtered UA from the previous stage and the i-th significant bit of each data as input. Then, the n-th value of the filtered UA and the i-th significant bit of the n-th data are passed through the AND gate of the n-th cell. The results of the AND operations in all cells are then passed through an N-input OR gate to obtain the selection signal for the multiplexers in all cells. After the UA passes through the cascaded blocks, the index with value 1 will also be the index of the largest element. For hardware implementation, the N-input OR gates in each cascaded block are implemented using a hierarchy of parallel 4-input OR gates. Thus, its propagation delay is directly influenced by N. When N increases, the frequency and throughput of the sorting engine proposed in [19] decrease. Although Ray et al. present a parallel version of the comparison-free sorting engine in [20], which incorporates several blocks into clusters to shorten the critical path of their hardware architecture, the same problem still persists. Moreover, the hardware resources required by the architectures proposed in [19] and [20] increase sharply when N is larger than 256. As a result, the architectures proposed in [19] and [20] are not suitable for sorting large datasets.

C. Hybrid Sorting Algorithm

To achieve high throughput and low latency with acceptable hardware area costs, Chen et al. [18] proposed a hybrid

sorting algorithm that combines bitonic sort and insertion sort. The input data sequence is segmented into subsequences of length P and then sorted with a bitonic sorter. Since P is much smaller than N, it can prevent the hardware area costs of the bitonic sorter from becoming too high. The segmented sorted subsequences are then transmitted to cascaded insertion sorting units, which record the largest values passed through them. The architecture of the sorting logic used to determine the order of the output subsequence and the recorded value in the insertion sorting units is illustrated in Fig. 3. DI[1] to DI[P] and DO[1] to DO[P] represent the input data subsequence and the output data subsequence, respectively. The two subsequences are both in ascending order. RI_MAX and RO_MAX denote the original value and updated value recorded in the register of the insertion sorting unit, respectively. When the INIT_in signal is low, the sorting logic rearranges the data from DI and RI_MAX. If the value of the register of the insertion sorting unit is updated with DI[P], its original value will be inserted into the output data subsequence to keep the subsequence in order. The processed subsequence is then passed to the next insertion sorting unit. After all segmented subsequences pass through the cascaded insertion sorting units, a sorted sequence of length N is obtained.

Fig. 3. The architecture of the sorting logic module proposed in [18].

Although the design proposed in [18] can enhance the throughput by parallel processing, its hardware area costs increase with the level of data parallelism P. Furthermore, there was room for improvement in its latency. Inspired by the previous works [18] and [24], a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed in this paper. The proposed architecture introduces the concept of bidirectional processing into the cascaded insertion sorting units. The designed BISU can record the maximum and minimum values passed through. When the registers of the BISU have to be updated, the BISU can insert their original values into the received subsequence simultaneously. The hardware area costs and required sorting cycles can be effectively reduced with the BISUs. Moreover, the percentage reduction in the area costs of the proposed architecture compared with that of the design in [18] increases with P.

III. PROPOSED SORTING ALGORITHM

The proposed sorting algorithm involves two steps, namely bitonic sort and bidirectional insertion sorting. The following subsections introduce each part of the proposed algorithm under the premise that P is larger than 1 and smaller than N. In the case that P is 1, the step of bitonic sort can be skipped, and the bidirectional insertion sorting can be simplified as comparison and swap operations between the input data and the record values of the BISUs. In the case that P is N, the sorted sequence can be directly obtained by bitonic sort.

A. Bitonic Sort

Initially, the data sequence to be sorted is segmented into subsequences of length P, where P is much smaller than N. A bitonic sorting network is used to generate N/P segmented sorted subsequences, which are then processed by the proposed bidirectional insertion sorting architecture (introduced in Section III-B) to produce a sorted sequence with a full length (N). In this manner, the proposed sorting algorithm can take advantage of the bitonic sorting network to increase the throughput of the architecture, while preventing a considerable increase in hardware resource usage.

B. Bidirectional Insertion Sorting

The proposed bidirectional insertion sorting architecture has two operation modes: the recording mode and insertion mode. Two storage arrays, namely a smaller-value array (SVA) and larger-value array (LVA), are used to record the sorting results. The length of the SVA and LVA is N/2, and their bit width is equal to the bit width of data (K). The sorting architecture can be divided into N/2 stages. The i-th stage is responsible for controlling the comparisons and swapping between the minimum and maximum values of the input segmented sorted subsequence and the values of SVA[i] and LVA[i]. When the architecture is reset, all of the elements in the SVA are set as MAX, and all of the elements in the LVA are set as MIN. The value of MIN is 0, and the value of MAX is determined according to K. Equations (1) and (2) present the formulas for MIN and MAX, respectively.

$$\mathrm{MIN} = 0 \tag{1}$$

$$\mathrm{MAX} = 2^{K} - 1 \tag{2}$$

Let the data sequence input into the i-th stage of the sorting architecture be DI_i, and let the data sequence output by this stage be DO_i. The length of these sequences is P. The data of DI_i are sorted with bitonic sort; thus, DI_i is an ascending-order sequence (from DI_i[1] to DI_i[P]). After the bidirectional insertion sorting operations, DO_i must also be in ascending

order. SVA_i[i] and LVA_i[i] represent the values of SVA[i] and LVA[i] before the bidirectional insertion sorting operations, while SVA_o[i] and LVA_o[i] represent the values of SVA[i] and LVA[i] after the bidirectional insertion sorting operations. Table I shows the cases of bidirectional insertion sorting. There are three judgment conditions in the table, namely whether SVA_i[i] is greater than LVA_i[i], whether SVA_i[i] is greater than DI_i[1], and whether LVA_i[i] is smaller than DI_i[P]. The second row of the table shows the operations in the recording mode, while the subsequent rows show the operations in the insertion mode. Details regarding the recording and insertion modes are provided in the following text.

TABLE I
CASES OF BIDIRECTIONAL INSERTION SORTING (X = DON'T CARE VALUE)

1) Recording Mode: If SVA_i[i] is greater than LVA_i[i], the i-th stage of the sorting architecture operates in the recording mode, and the values of DI_i[1] and DI_i[P] are directly stored into SVA[i] and LVA[i], respectively. When the i-th stage of the sorting architecture is operating in the recording mode, the values of SVA[i] and LVA[i] are not yet replaced by valid data; thus, the SVA_i[i] must be MAX, and LVA_i[i] must be MIN. The invalid data of SVA[i] and LVA[i] are replaced by DI_i[1] and DI_i[P], and then inserted into DO_i[P/2] and DO_i[P/2+1], respectively. The values of DI_i[2] to DI_i[P/2] are then assigned to DO_i[1] to DO_i[P/2-1], and the values of DI_i[P/2+1] to DI_i[P-1] are assigned to DO_i[P/2+2] to DO_i[P].

It is worth noting that when the i-th stage of the sorting architecture is operating in the recording mode, all subsequent cascaded stages of the sorting architecture must also operate in the recording mode. The output sequence DO_i then becomes the input sequence of the (i+1)-th stage of the sorting architecture (i.e., DI_{i+1}). DI_{i+1}[1] and DI_{i+1}[P] will be used to update the values of SVA[i+1] and LVA[i+1], respectively. Assume that an input sequence with P valid data is used to initialize the values of the SVA and LVA of the i-th to (i+P/2-1)-th stages of the sorting architecture. After the sequence is transferred to the (i+P/2)-th stage of the sorting architecture, all of its data become invalid. Because of the replacement rules of the recording mode, all values from DI_{i+P/2}[1] to DI_{i+P/2}[P/2] are MAX, and all values from DI_{i+P/2}[P/2+1] to DI_{i+P/2}[P] are MIN. This design can ensure that the values of the SVA and LVA of the uninitialized stages remain unchanged when the input sequence is completely invalid and that the uninitialized stages remain in the recording mode.

2) Insertion Mode: When LVA_i[i] is greater than or equal to SVA_i[i], the i-th stage of the sorting architecture operates in the insertion mode. When the i-th stage of the sorting architecture is in the insertion mode, comparisons are made between SVA_i[i] and DI_i[1] and between LVA_i[i] and DI_i[P]. According to the comparison results, there are four insertion cases, which are displayed in the third to the sixth rows of Table I.

In the first insertion case, SVA_i[i] is no larger than DI_i[1], and LVA_i[i] is no smaller than DI_i[P]. In this case, SVA[i] and LVA[i] retain their original values, and the values of DI_i are directly assigned to DO_i in sequence.

In the second insertion case, SVA_i[i] is greater than DI_i[1], and LVA_i[i] is no smaller than DI_i[P]. In this case, the value of SVA[i] is updated with DI_i[1], and SVA_i[i] must be inserted into the output sequence DO_i. The value of DO_i[1] is the smaller value between SVA_i[i] and DI_i[2]. The value of DO_i[k], where k is between 2 and P-1, is the median of SVA_i[i], DI_i[k], and DI_i[k+1]. The value of DO_i[P] is the greater value between SVA_i[i] and DI_i[P].

In the third insertion case, SVA_i[i] is no larger than DI_i[1], and LVA_i[i] is smaller than DI_i[P]. In this case, the value of LVA[i] is updated with DI_i[P]. The value of DO_i[1] is the smaller value between LVA_i[i] and DI_i[1]. The value of DO_i[k], where k is between 2 and P-1, is the median of LVA_i[i], DI_i[k-1], and DI_i[k]. The value of DO_i[P] is the greater value between LVA_i[i] and DI_i[P-1].

In the fourth insertion case, SVA_i[i] is greater than DI_i[1], and LVA_i[i] is smaller than DI_i[P]. Therefore, the values of SVA[i] and LVA[i] are updated by those of DI_i[1] and DI_i[P], respectively, and the replaced values must be inserted into the output sequence. The value of DO_i[1] is the smaller value between SVA_i[i] and DI_i[2], and the value of DO_i[P] is the higher value between LVA_i[i] and DI_i[P-1]. Five possible values exist for DO_i[k], where k is between 2 and P-1, namely SVA_i[i], LVA_i[i], DI_i[k-1], DI_i[k], and DI_i[k+1], and DO_i[k] is set as the median of these five values. For hardware implementation, the median searching process is divided into two steps. First, the median between SVA_i[i], LVA_i[i], and DI_i[k] is determined. If the median value is DI_i[k], DO_i[k]

Fig. 4. The hardware architecture of the proposed sorting algorithm.

Fig. 6. An example of the structure of a BSU. (a) The simplified diagram
of a CU, and (b) the structure of a four-input BSU.

Fig. 5. The hardware architecture of CU.

is set as DIi [k] directly. If the median is SVAi [i], DOi [k] is
the smaller value between SVAi [i] and DIi [k+ 1]. Because
SVAi [i] must not be greater than LVAi [i], it must be greater
than the values of DIi [k] and DIi [k- 1] when it is selected as Fig. 7. The hardware architecture of the BISU.
the median. Consequently, the comparison between SVAi [i]
and DIi[k-1] is saved. If the median is LVAi[i], then this value is smaller than DIi[k] and DIi[k+1]. In this case, the value of DOi[k] is the higher value between LVAi[i] and DIi[k-1].

IV. HARDWARE ARCHITECTURE

The hardware architecture of the proposed sorting algorithm is displayed in Fig. 4. This architecture consists of a bitonic sorting unit (BSU) and several BISUs. The input EN is an enable signal, and all sorting units are active only when EN is high. The input OI is an output indication signal. When OI is high, the sorted data in the BISUs are output in sequence. The input data to be sorted are segmented into subsequences of length P (i.e., DI). To enable the proposed hardware architecture to process a new input data sequence without idle time, an inversion signal INV is used to separate two sets of data. The outputs of the proposed architecture are V and DO, which denote the output valid signal and the output data stream of length P, respectively. The data of DO are valid when V is high. In the following subsections, the architectures of the BSU and BISUs are introduced in detail.

A. Bitonic Sorting Unit

The basic building blocks of a BSU are comparison units (CUs), which take two inputs and arrange them in ascending order. The detailed architecture of a CU is shown in Fig. 5. A CU is constructed with a comparator and two multiplexers. L is the minimum between the inputs x and y, while H is the maximum between them. An example of the structure of a BSU is illustrated in Fig. 6. For brevity, a CU is represented with a vertical segment with circles at both ends, which is shown in Fig. 6(a). Fig. 6(b) displays the structure of a four-input BSU, where i1 to i4 represent the input data sequence, and s1 to s4 represent the output sorted sequence. As displayed in Fig. 6(b), six CUs are required in a four-input BSU. Notably, the required number of CUs will grow to 24 for an eight-input BSU. The required number of CUs increases dramatically with the length of the input data sequence. Consequently, the proposed hybrid sorting architecture is adopted to reduce the hardware area costs.

B. Bidirectional Insertion Sorting Unit

Each BISU consists of two data registers (Rdata1 and Rdata2), a bidirectional insertion sorting logic (BISL), and an output control module (OCM). Fig. 7 depicts the hardware design of the BISUs used in the proposed hardware architecture. The terms Rmax and Rmin denote the data registers that store the maximum and minimum values passing through the BISU, respectively. When the INV signal is low, Rdata1 is assigned as Rmin, and Rdata2 is assigned as Rmax. By contrast, when the INV signal is high, Rdata1 is assigned as Rmax, and Rdata2 is assigned as Rmin. This design enables the proposed architecture to process a new set of data before the previous data sequence has been fully output. A comparator is used to determine whether the BISU should operate in the recording mode. The BISL implements the proposed bidirectional insertion sorting algorithm and rearranges the values of Rmin, Rmax, and the input data stream DI to ensure that they are sorted. The minimum and maximum values among the values of Rmin, Rmax, and DI are used to update the values of Rmin and Rmax, respectively, and the remaining values are transmitted to the next BISU. Finally, the OCM is responsible for controlling the output state of the BISU and ensuring that the output sequence is correct. Algorithm 1 shows the operations of the BISU, and the detailed architectures of the BISL and OCM are introduced in the following text.
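The CU counts quoted for the BSU in Section IV-A (six CUs for four inputs, 24 for eight inputs) can be checked with a small software model. The following is a minimal Python sketch and not the authors' Verilog design: the names `compare_unit` and `bitonic_sort` are hypothetical, and the network is written in the common iterative bitonic form, in which descending stages are obtained by swapping the two CU outputs. That wiring is an assumption and may differ from Fig. 6(b).

```python
def compare_unit(x, y):
    """Comparison unit (CU): return (L, H), the minimum and maximum of x and y."""
    return (x, y) if x <= y else (y, x)


def bitonic_sort(data):
    """Sort `data` with a bitonic network and count the CUs it uses.

    The length of `data` must be a power of two.
    Returns (sorted_list, cu_count).
    """
    values = list(data)
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    cu_count = 0
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # distance between the two inputs of a CU
            for i in range(n):
                partner = i ^ j
                if partner > i:    # visit each pair once
                    low, high = compare_unit(values[i], values[partner])
                    if (i & k) == 0:   # ascending block
                        values[i], values[partner] = low, high
                    else:              # descending block: outputs swapped
                        values[i], values[partner] = high, low
                    cu_count += 1
            j //= 2
        k *= 2
    return values, cu_count


print(bitonic_sort([2, 13, 1, 5]))   # ([1, 2, 5, 13], 6)
```

Under these assumptions an n-input network uses (n/2) · log2(n) · (log2(n) + 1) / 2 CUs, which gives 6 for n = 4 and 24 for n = 8, in line with the counts stated above.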

Algorithm 1 Bidirectional Insertion Sorting Unit (BISU)

Input: ENin, Vin, INV, DI
  /* DI is an ascending sequence with P elements */
Output: ENout, Vout, INV, DO
  /* DO is an ascending sequence with P elements */
Registers: Rdata1, Rdata2
Temporary values: RI_MIN, RI_MAX, RO_MIN, RO_MAX, INIT
/* Select RI_MIN and RI_MAX according to the INV signal */
if INV = 0 then RI_MIN ← Rdata1, RI_MAX ← Rdata2;
else RI_MIN ← Rdata2, RI_MAX ← Rdata1;
/* Judge whether the BISU has been initialized */
if RI_MIN > RI_MAX then INIT ← 1 else INIT ← 0;
{RO_MIN, DO, RO_MAX} ← BISL(RI_MIN, DI, RI_MAX, ENin, INIT);
{ENout, Vout} ← OCM(ENin, Vin, INIT);
/* Update Rdata1 and Rdata2 according to the INV signal */
if INV = 0 then Rdata1 ← RO_MIN, Rdata2 ← RO_MAX;
else Rdata1 ← RO_MAX, Rdata2 ← RO_MIN;

Fig. 8. The hardware architecture of the BISL.
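The register update and data forwarding of one BISU can be illustrated with a behavioural model. The sketch below is a simplification under stated assumptions and not the authors' implementation: `bisu_step` is a hypothetical name, the EN, V, and INV signals and the OCM are not modelled, and the insertion mode is expressed with a software sort instead of the comparator and multiplexer logic of the BISL.

```python
def bisu_step(r_min, r_max, di):
    """One clock cycle of a behavioural BISU model.

    r_min, r_max: contents of Rmin and Rmax before the cycle.
    di: input subsequence of even length P (ascending when it holds valid data).
    Returns (new_r_min, new_r_max, do).
    """
    di = list(di)
    p = len(di)
    if r_min > r_max:
        # Recording mode (INIT = 1): keep both ends of DI in the registers
        # and forward the old, invalid register values in the middle of DO.
        half = p // 2
        do = di[1:half] + [r_min, r_max] + di[half:p - 1]
        return di[0], di[-1], do
    # Insertion mode: the overall minimum and maximum stay in the registers,
    # and the remaining P values are forwarded in ascending order.
    merged = sorted([r_min] + di + [r_max])
    return merged[0], merged[-1], merged[1:-1]


MAX, MIN = 15, 0  # 4-bit data, so an uninitialized BISU holds Rmin = 15, Rmax = 0

r_min, r_max, do = bisu_step(MAX, MIN, [1, 2, 5, 13])      # recording mode
print(r_min, r_max, do)                                    # 1 13 [2, 15, 0, 5]
r_min, r_max, do = bisu_step(r_min, r_max, [1, 8, 10, 15])  # insertion mode
print(r_min, r_max, do)                                    # 1 15 [1, 8, 10, 13]
```

The two calls reproduce the behaviour of BISU1 in cycles 1 and 2 of the example in Section IV-C: the first subsequence is recorded, and the second one updates Rmax to 15 while 13 is forwarded.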
1) Bidirectional Insertion Sorting Logic: The hardware architecture of the BISL is displayed in Fig. 8. The BISL uses the EN signal, INIT signal, and the values of Rmin, Rmax, and DI as input. The original values of Rmin and Rmax are denoted as RI_MIN and RI_MAX, and the values used to update Rmin and Rmax are denoted as RO_MIN and RO_MAX, respectively. When the EN signal is low, the BISL simply outputs the input values without changing their order. When the INIT signal is high, the BISU operates in the recording mode. If the EN signal is high and the INIT signal is low, the BISL operates in the insertion mode and rearranges the input sequence to achieve the ascending order. Two comparators are used to compare the values between RI_MIN and DI[1], and between RI_MAX and DI[P]. The values of RO_MIN and RO_MAX are determined based on the comparison results, and the values of the output sequence DO are determined using P Sequence Rearrangement Logics (SRLs). The operations of the BISL are shown in Algorithm 2.

Algorithm 2 Bidirectional Insertion Sorting Logic (BISL)

Input: RI_MIN, DI, RI_MAX, EN, INIT
  /* DI is an ascending sequence with P elements */
Output: RO_MIN, DO, RO_MAX
  /* DO is an ascending sequence with P elements */
for i = 1 to P do
  if i = 1 then
    DO[1] ← SRL1(EN, RI_MIN, ×, DI[1], DI[2], RI_MAX, INIT, DI[2]);
  else if i = P then
    DO[P] ← SRL_P(EN, RI_MIN, DI[P-1], DI[P], ×, RI_MAX, INIT, DI[P-1]);
  else
    if i = P/2 then
      DO[i] ← SRLi(EN, RI_MIN, DI[i-1], DI[i], DI[i+1], RI_MAX, INIT, RI_MIN);
    else if i = P/2 + 1 then
      DO[i] ← SRLi(EN, RI_MIN, DI[i-1], DI[i], DI[i+1], RI_MAX, INIT, RI_MAX);
    else if i < P/2 then
      DO[i] ← SRLi(EN, RI_MIN, DI[i-1], DI[i], DI[i+1], RI_MAX, INIT, DI[i+1]);
    else if i > P/2 + 1 then
      DO[i] ← SRLi(EN, RI_MIN, DI[i-1], DI[i], DI[i+1], RI_MAX, INIT, DI[i-1]);
end do
if EN = 0 then
  RO_MIN/RO_MAX ← RI_MIN/RI_MAX;
else if INIT = 1 then
  RO_MIN/RO_MAX ← DI[1]/DI[P];
else
  if RI_MIN > DI[1] then
    RO_MIN ← DI[1];
  else
    RO_MIN ← RI_MIN;
  if RI_MAX < DI[P] then
    RO_MAX ← DI[P];
  else
    RO_MAX ← RI_MAX;

Let SRLk denote the SRL used to determine the value of DO[k]. When k is between 2 and P-1, DO[k] can have five possible values, namely the values of RI_MIN, RI_MAX, DI[k-1], DI[k], and DI[k+1]. If k is equal to 1 or P, only four possible values exist for DO[k] because DI[0] and DI[P+1] do not exist. Fig. 9 shows the architecture of an SRL with five candidate values. An SRL with five candidate values is constructed using four comparators and five multiplexers, whereas an SRL with four candidate values is constructed using three comparators and four multiplexers. The Mux1 determines whether the SRL is operating. If the EN signal is low, the value of DI[k] is transferred to DO[k] directly. The Mux2 judges whether the BISU has been initialized. When the INIT signal is high, the value of Rmin is greater than that of Rmax; thus, the BISU operates in the recording mode. SRLs assign the values of DO according to their index

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 11:32:01 UTC from IEEE Xplore. Restrictions apply.

Algorithm 3 k-th Sequence Rearrangement Logic (SRLk)

Input: EN, RI_MIN, DI[k-1], DI[k], DI[k+1], RI_MAX, INIT, DI_record
Output: DO[k]
Temporary values: Dhigh, Dlow, Dmid, D_Mux2
/* operation of the Mux4 */
if DI[k-1] > RI_MAX then Dhigh ← DI[k-1] else Dhigh ← RI_MAX;
/* operation of the Mux5 */
if DI[k+1] < RI_MIN then Dlow ← DI[k+1] else Dlow ← RI_MIN;
/* operation of the Mux3 */
if DI[k] > RI_MIN and DI[k] < RI_MAX then
  Dmid ← DI[k];
else if DI[k] < RI_MIN and DI[k] < RI_MAX then
  Dmid ← Dlow;
else if DI[k] > RI_MIN and DI[k] > RI_MAX then
  Dmid ← Dhigh;
/* operation of the Mux2 */
if INIT = 0 then D_Mux2 ← Dmid else D_Mux2 ← DI_record;
/* operation of the Mux1 */
if EN = 1 then DO[k] ← D_Mux2 else DO[k] ← DI[k];

Fig. 9. The hardware architecture of an SRL with five candidate values.
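The multiplexer chain of Algorithm 3 and the per-position wiring of Algorithm 2 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the authors' hardware description: `srl` and `bisl_outputs` are hypothetical names, `None` stands for the nonexistent DI[0] and DI[P+1] (the × entries in Algorithm 2), and a value equal to RI_MIN is folded into the "lower" branch, which is an assumption because Algorithm 3 lists only strict comparisons.

```python
def srl(en, ri_min, di_prev, di_k, di_next, ri_max, init, di_record):
    """k-th Sequence Rearrangement Logic, following Algorithm 3.

    di_prev / di_next are None where DI[k-1] / DI[k+1] do not exist
    (k = 1 or k = P), which corresponds to omitting Mux4 / Mux5.
    """
    # Mux4: higher value between DI[k-1] and RI_MAX
    d_high = di_prev if di_prev is not None and di_prev > ri_max else ri_max
    # Mux5: lower value between DI[k+1] and RI_MIN
    d_low = di_next if di_next is not None and di_next < ri_min else ri_min
    # Mux3: choose among DI[k], Dlow, and Dhigh
    if ri_min < di_k < ri_max:
        d_mid = di_k
    elif di_k <= ri_min:      # assumption: ties with RI_MIN take this branch
        d_mid = d_low
    else:
        d_mid = d_high
    # Mux2: recording mode overrides the Mux3 result
    d_mux2 = d_mid if init == 0 else di_record
    # Mux1: when EN is low, DI[k] passes through unchanged
    return d_mux2 if en == 1 else di_k


def bisl_outputs(ri_min, di, ri_max, en=1, init=0):
    """Build DO with one SRL per position, wired as in Algorithm 2 (P >= 2)."""
    p = len(di)
    do = []
    for i in range(1, p + 1):                 # 1-based index, as in the paper
        di_prev = di[i - 2] if i > 1 else None
        di_next = di[i] if i < p else None
        if i == 1:
            record = di[1]                    # DI[2]
        elif i == p:
            record = di[p - 2]                # DI[P-1]
        elif i == p // 2:
            record = ri_min
        elif i == p // 2 + 1:
            record = ri_max
        elif i < p // 2:
            record = di[i]                    # DI[i+1]
        else:
            record = di[i - 2]                # DI[i-1]
        do.append(srl(en, ri_min, di_prev, di[i - 1], di_next,
                      ri_max, init, record))
    return do


print(bisl_outputs(1, [1, 8, 10, 15], 13))          # [1, 8, 10, 13]
print(bisl_outputs(15, [1, 2, 5, 13], 0, init=1))   # [2, 15, 0, 5]
```

The first call corresponds to the insertion mode with Rmin = 1 and Rmax = 13, and the second call to the recording mode of an uninitialized BISU with Rmin = 15 and Rmax = 0.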

values k. DO[P/2] is assigned a value of MAX, and DO[P/2 + 1] is assigned a value of MIN. When k is between 1 and (P/2 - 1), the value of DI[k+1] is transferred to DO[k]. When k is between (P/2 + 2) and P, DO[k] is assigned the value of DI[k-1]. If the EN signal is high and the INIT signal is low, the Mux3, Mux4, and Mux5 become responsible for selecting the value of DO[k] from the candidate values. In SRL1, the Mux4 can be saved because no need exists to determine whether DI[k-1] should be selected. In the same manner, the Mux5 of SRL_P can be saved. The Mux4 selects the higher value between those of RI_MAX and DI[k-1], whereas the Mux5 selects the lower value between those of RI_MIN and DI[k+1]. The Mux3 determines its output from the values obtained from the Mux4 and Mux5 and the value of DI[k]. If the value of DI[k] is higher than that of RI_MIN and smaller than that of RI_MAX, the value of DI[k] is selected. When the value of DI[k] is smaller than those of RI_MIN and RI_MAX, the output of the Mux5 is selected. When the value of DI[k] is higher than those of RI_MIN and RI_MAX, the Mux3 selects the value obtained from the Mux4 as the output. Finally, when the value of DI[k] is smaller than that of RI_MIN but greater than that of RI_MAX, the value of Rmin is greater than that of Rmax, which indicates that the BISL has not been initialized. In this case, the output of the Mux3 can be ignored because the SRL is operating in the recording mode. Algorithm 3 shows the operations of the SRL. After the sorting process is completed, the sorted data sequence is stored in the Rmin and Rmax registers. The Rmin registers store the first half of the sorted data sequence in ascending order, whereas the Rmax registers store the second half of the sorted data sequence in descending order.

Fig. 10. The hardware architecture of the OCM.

Algorithm 4 Output Control Module belonging to BISUi (OCMi)

Input: ENin, Vin, INIT
Output: ENout, Vout
Registers: R_OS
Parameters: init_value
Condition: init_value is 1 when i is a multiple of P, otherwise init_value is 0
if Vin = 1 and R_OS = 1 then ENout ← 0 else ENout ← ENin;
Vout ← Vin;
/* Update the value of R_OS */
if INIT = 1 then
  R_OS ← init_value;
else if ENin = 1 and Vin = 1 and R_OS = 1 then
  R_OS ← 0;
else
  R_OS ← R_OS; /* Keep original value */

2) Output Control Module: The hardware architecture of the OCM of a BISU is presented in Fig. 10, and the operations of the OCM are shown in Algorithm 4. The OCM has three input signals, namely the enable signal ENin, output valid signal Vin, and initialization signal INIT. Moreover, it contains a register R_OS that records the output state of the BISU that it belongs to. The output signals ENout and Vout are transferred to the next pipelined stage to control the output state of the subsequent BISUs. Let BISUi denote the i-th BISU, and let OCMi denote the OCM belonging to BISUi. The ENin and Vin signals of OCM1 are the system enable signal EN and output indication signal OI, respectively, whereas the ENin and Vin signals of the remaining OCMs are the ENout and Vout signals of the OCM from the previous architecture stage. The INIT signal is 1 if the BISU operates in the recording mode. In this case, the R_OS value of OCMi is set as 1 if i is a multiple

of P; otherwise, this value is set as 0. This design is used to prevent the first half of the sorted output data sequence from being stored into the Rmin registers. During the output process, input data sequences with only MIN values are used to replace the data in the Rmin registers. The output sequence should not be changed after P valid data are collected from the Rmin registers. Consequently, the value of ENout will be pulled down to 0 if both the values of R_OS and Vin are 1. Moreover, the value of R_OS becomes 0 if the values of R_OS, ENin, and Vin are all 1, which indicates that the sorted data in the Rmin registers of the previous P architecture stages have been output. When outputting the second half of the sorted data, input data sequences with only MAX values are used to replace the data in the Rmax registers. The sorted data in the Rmax registers are gradually transferred backward to update the values of the Rmax registers of the subsequent BISUs. The output sequence consisting of data collected from the Rmax registers of BISU_{(N/2)-P+1} to BISU_{N/2} is output in each cycle.

C. Example of the Sorting Process

In this section, an example is presented in Fig. 11 to demonstrate how the proposed BISU sorts an input data sequence and outputs sorted results in ascending order. For brevity, the operations of the BSU and OCM are not depicted in the figure. The example of the sorting process is illustrated with P = 4, N = 8, and K = 4. Equations (3) and (4) present the two sets of input data, A and B, considered in the example. Notably, the inversion (INV) signals of two consecutive sets of input data should be opposite to each other. Therefore, the INV signals of set A are all 0, and those of set B are all 1.

$$
\begin{cases}
DI_A = \{2, 13, 1, 5 \mid 8, 10, 15, 1 \mid 0, 0, 0, 0 \mid 15, 15, 15, 15\} \\
EN_A = \{1 \mid 1 \mid 1 \mid 1\} \\
V_A = \{0 \mid 0 \mid 1 \mid 1\} \\
INV_A = \{0 \mid 0 \mid 0 \mid 0\}
\end{cases}
\tag{3}
$$

$$
\begin{cases}
DI_B = \{7, 11, 5, 3 \mid 12, 2, 14, 6 \mid 0, 0, 0, 0 \mid 15, 15, 15, 15\} \\
EN_B = \{1 \mid 1 \mid 1 \mid 1\} \\
V_B = \{0 \mid 0 \mid 1 \mid 1\} \\
INV_B = \{1 \mid 1 \mid 1 \mid 1\}
\end{cases}
\tag{4}
$$

Fig. 11. Sorting process of the proposed architecture.

In Fig. 11, valid data are denoted in red, and valid outputs are highlighted with a gray background. At cycle 0, the value of Rmin is set as MAX, that is, 15, and the value of Rmax is set as MIN, that is, 0. In cycle 1, the first segmented subsequence {1, 2, 5, 13}, which is sorted by the BSU, is input


to BISU1. In BISU1, the value of Rmin is greater than that of Rmax; thus, BISU1 operates in the recording mode. The input data 1 and 13 are stored into the Rmin and Rmax registers, respectively. The invalid data 15 and 0 are directly inserted into the middle of the output sequence without comparison with the remaining data in the input sequence, and the remaining data in the input sequence are placed at both ends of the output sequence. In cycle 2, the second segmented subsequence {1, 8, 10, 15} is input to BISU1, and the output sequence of BISU1 is transferred to BISU2. BISU2 also operates in the recording mode. The data 2 and 5 are stored into the Rmin and Rmax registers, respectively, and the output sequence of BISU2 is {15, 15, 0, 0}. The aforementioned design ensures that the values maintained in the subsequent cascaded BISUs are not updated until a valid data sequence is input. Because the value of Rmin is not greater than that of Rmax in BISU1, BISU1 operates in the insertion mode. The smallest value in the input sequence is 1, which is equal to the value of Rmin; therefore, Rmin does not need to be updated. The highest value in the input sequence is 15, which is higher than the value of Rmax; therefore, 15 is stored into the Rmax register, and 13 is inserted into the output sequence. The output sequence of BISU1 then becomes {1, 8, 10, 13}, which is still arranged in ascending order.

The first half of the sorted sequence is stored in the Rmin register of each BISU in ascending order. A data sequence that only contains 0s is input to replace the sorted results in Rmin. In cycle 3, the value 0 is used to replace the Rmin value of BISU1, namely 1, and 1 is inserted into the output sequence; thus, the output sequence of BISU1 becomes {0, 0, 0, 1}. In cycle 4, the value 0 is again used to replace the Rmin value of BISU2, namely 1, and 1 is inserted into the output sequence; thus, the output sequence of BISU2 becomes {0, 0, 1, 1}, which is transferred to the subsequent cascaded BISUs to collect valid output values from the Rmin registers.

The second half of the sorted sequence is stored in the Rmax register of each BISU in descending order. To replace the sorted results in Rmax, a data sequence that only contains 15s is input. In cycle 4, the Rmax value of BISU1 is 15; thus, the value of Rmax does not need to be updated. In cycle 5, the value 15 is used to replace the Rmax value of BISU2, namely 13, and 13 is inserted into the output sequence; thus, the output sequence of BISU2 becomes {13, 15, 15, 15}, which is transferred to the subsequent cascaded BISUs to collect valid output values from the Rmax registers.

Meanwhile, the first segmented subsequence {3, 5, 7, 11} of data set B is input into BISU1 in cycle 5. Because the values of INV_B are 1, the registers used as Rmin and Rmax are swapped. This design enables the architecture to output the sorting results and process new input data sequences simultaneously. The Rmin and Rmax values of BISU1 are 15 and 0, respectively. The value of Rmin is greater than that of Rmax; thus, BISU1 operates in the recording mode again. The values 3 and 11 are stored into the Rmin and Rmax registers, respectively. In cycle 6, the second segmented subsequence {2, 6, 12, 14} is input to BISU1, and the output sequence of BISU1 is transferred to BISU2. BISU2 also operates in the recording mode again. The data 5 and 7 are stored into the Rmin and Rmax registers, respectively. Because the value of Rmin is not greater than that of Rmax in BISU1, BISU1 operates in the insertion mode. The smallest value in the input sequence is 2, which is smaller than the value of Rmin. The highest value in the input sequence is 14, which is higher than the value of Rmax. Therefore, 2 and 14 are stored into the Rmin and Rmax registers, respectively, and 3 and 11 are inserted into the output sequence simultaneously. The output sequence of BISU1 then becomes {3, 6, 11, 12}.

Finally, the first half and the second half of the sorted results of data set A, namely {1, 1, 2, 5} and {8, 10, 13, 15}, are obtained in cycles 6 and 7, respectively.

V. PERFORMANCE ANALYSIS

This section describes the performance of the proposed sorting architecture. In Section V-A, the performance of the proposed architecture is compared to the sorting design in [18], which also adopts a hybrid sorting algorithm, as well as the sorting designs in [23] and [24], both of which exhibit a low growth rate in hardware area costs. Because our bidirectional insertion sorting architecture and the architecture proposed in [18] are both pipelined architectures, a detailed comparison between these architectures is provided in Section V-B. The performance of the considered sorting architectures is analyzed in terms of circuit area, sorting cycles, and power consumption. Circuit area is measured in terms of the gate count of the circuit and rounded off to the nearest 1000 gates. The gate count is calculated by dividing the total cell area of the circuit by the area of a two-input NAND gate. All compared architectures are implemented using the Verilog HDL and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library, and the power consumption is measured using the Synopsys PrimeTime PX.

A. Comparison of the Sorting Performance

The performance of the proposed architecture is compared with that of the sorting architectures proposed in [18], [23], and [24]. The architectures proposed in [23] and [24] use comparison-free sorting algorithms, which do not use comparators in the sorting process. The architecture proposed in [23] uses registers to record the number of occurrences of data in an input sequence. The speed complexity and total gate count complexity of the architecture proposed in [23] are of the order of O(N). The architecture proposed in [24] is an improved version of [23]. The architecture proposed in [24] reduces the number of sorting cycles through bidirectional sorting and uses two auxiliary methods to reduce the wastage of clock cycles during the sorting process. The architecture proposed in [18] combines bitonic sort and insertion sort to enhance the throughput with acceptable hardware area costs. Since the architectures proposed in [23] and [24] can only process one input data in a clock cycle, the architectures proposed in [18] and this paper are implemented in a special case that P is 1 for a fair comparison with the architectures proposed in [23] and [24]. The insertion sorting unit in the architecture proposed in [18] and the BISU in our architecture are replaced with simple comparators and multiplexers. The bit

width of the data is fixed as 10 bits; the operating frequency is set as 500 MHz; and the sizes of the input data (N) in the comparison are set as 1024, 2048, 4096, and 8192.

TABLE II
THE COMPARISON OF GATE COUNTS (K GATES)

TABLE III
THE COMPARISON OF SORTING CYCLES

TABLE IV
THE COMPARISON OF POWER CONSUMPTION (W)

TABLE V
THE COMPARISON OF AREA-TIME PRODUCT (K GATES × SORTING TIME)

The gate count of each design is listed in Table II. The architecture proposed in [18] and our architecture exhibit nearly equal gate counts for the four data sizes. For the input sizes 1024, 2048, 4096, and 8192, the gate counts of our architecture are 72.4%, 58.2%, 43.0%, and 29.6% lower than those of the architecture proposed in [23], respectively. Moreover, for the aforementioned input sizes, the gate counts of our architecture are 67.9%, 56.8%, 47.0%, and 39.5% lower than those of the architecture proposed in [24], respectively. As presented in Table II, the architectures proposed in this paper and [18] require fewer hardware resources than do those proposed in [23] and [24].

Table III presents the numbers of sorting cycles for the compared architectures. The number of pipelined stages required in the architecture proposed in [18] is equal to the number of input data (N); therefore, the number of sorting cycles required for this architecture is 2N. Comparison-free sorting algorithms must count the number of occurrences of input data; thus, their numbers of sorting cycles increase with N. Furthermore, the numbers of sorting cycles required by the architectures proposed in [23] and [24] are determined by the distribution of the input data. The numbers of sorting cycles required by the architectures proposed in [23] and [24] range from 2N to 3N and from 1.5N to 2N + (2^K/2) - 2, respectively. The architecture proposed in this paper uses a novel bidirectional insertion sorting algorithm. Hence, compared to the architecture proposed in [18], it reduces the number of required pipelined stages by half and requires only 1.5N sorting cycles. The numbers listed in Table III are obtained from the average experimental results obtained in [24] for 10000 randomly generated test patterns. Among the four compared architectures, the architecture proposed in [23] requires the highest number of sorting cycles, whereas our architecture requires the lowest number of sorting cycles.

The power consumption of each compared architecture is presented in Table IV. The power consumption of our architecture is 17.6%, 15.2%, 16.4%, and 15.8% lower than that of the architecture proposed in [18] for the input sizes of 1024, 2048, 4096, and 8192, respectively. Moreover, the power consumption of our architecture is 67.4%, 52.5%, 37.8%, and 27.3% lower than that of the architecture proposed in [24] for the aforementioned input sizes.

Finally, the area-time product of each compared architecture is presented in Table V to comprehensively evaluate the performance based on gate counts and sorting time. The sorting time is calculated by multiplying the required sorting cycles by the clock period. The clock period of each compared architecture is obtained by taking the reciprocal of its maximum operating frequency. The maximum operating frequencies of the architectures in [18], [23], and [24] and our architecture are 714.3 MHz, 735.3 MHz, 606.1 MHz, and 606.1 MHz, respectively. From Table V, it can be observed that the pipeline-based architectures, i.e., the architecture proposed in [18] and our architecture, strike a better balance between area costs and sorting time. The proposed method effectively reduces area costs and required sorting cycles. Although our architecture has the lowest maximum operating frequency among the compared architectures, it still has the smallest area-time product.

In summary, our architecture has the lowest area costs, shortest sorting cycles, and lowest power consumption when compared to existing sorting designs.

B. Detailed Comparisons With Pipelined Sorting Architecture

To indicate the performance advantages of our architecture over the architecture proposed in [18], detailed comparisons between these architectures are provided in this section. Because both these architectures are pipeline based, six additional metrics are used to analyze their efficiency: operating frequency, latency, throughput, the throughput-to-gate-count ratio, the throughput-to-power-consumption ratio, and the


energy consumption per data byte. The sizes of the input data (N) in the comparison are set as 128, 1024, and 2048.

First, the performance of the aforementioned architectures is analyzed by fixing the level of data parallelism as 4 (P = 4), and varying the bit width of the data (K = 8, 16, 32); the relevant results are presented in Table VI. Second, the performance of these architectures is investigated under a fixed bit width of 32 (K = 32) and different levels of data parallelism (P = 4, 8, 16); the relevant results are presented in Table VII. To sort a set of data with a size of N, the architecture proposed in [18] requires N pipeline stages, with each stage recording one sorted data. In the proposed architecture, only N/2 pipeline stages are required, with each stage recording two sorted data. Thus, in all comparison cases, our architecture requires lower gate counts, fewer sorting cycles, shorter latency, and lower power consumption than does the architecture proposed in [18]. Notably, the gate count reduction compared with that of the architecture proposed in [18] achieved with our architecture increases with P. For example, the proposed architecture can reduce the gate counts by 6% when P is 4, 11% when P is 8, and 14% when P is 16, compared to the architecture proposed in [18] under the configuration that K is 32 and N is 2048. Therefore, the proposed architecture can enhance the throughput by increasing the level of data parallelism with fewer additional hardware area costs.

TABLE VI
THE COMPARISON BETWEEN THE ARCHITECTURE PROPOSED IN [18] AND OUR ARCHITECTURE UNDER VARIOUS BIT WIDTH

TABLE VII
THE COMPARISON BETWEEN THE ARCHITECTURE PROPOSED IN [18] AND OUR ARCHITECTURE UNDER VARIOUS LEVELS OF DATA PARALLELISM

Because the logic design of the proposed BISUs is more complex than that of the insertion sorting units used in [18], the operating frequency of our architecture is marginally lower than that of the architecture proposed in [18]. However, the proposed architecture can still achieve a close throughput with the architecture proposed in [18]. The difference between the

throughputs of our architecture and the architecture proposed in [18] ranges only from 1% to 5%. To further investigate the efficiency of the proposed architecture, the throughput-to-gate-count ratio, throughput-to-power-consumption ratio, and energy consumption per data byte are analyzed. Our architecture has a higher throughput-to-gate-count ratio than does the architecture proposed in [18], which indicates that our design can generate sorted sequences by using hardware resources more efficiently than the design of [18]. When K and N are fixed, the improvement of the throughput-to-gate-count ratio in our architecture against the architecture proposed in [18] increases with P. Taking the configuration that K is 32 and N is 2048 as an example, our architecture can enhance the throughput-to-gate-count ratio by 5% when P is 4, 8% when P is 8, and 16% when P is 16, compared to the architecture proposed in [18]. Our architecture also achieves a higher throughput-to-power-consumption ratio and lower energy consumption per data byte. The improvements in the throughput-to-power-consumption ratio range from 7% to 30%, and the reductions in energy consumption per data byte range from 6% to 23%. Under the same power consumption, our design can produce a higher number of output results.

In summary, while the architecture proposed in [18] can operate at a slightly higher frequency than our architecture, the throughputs of both designs are close. Moreover, our architecture can produce more sorted data under the same area costs or power consumption when compared with the architecture proposed in [18]. Thus, our architecture can sort data more efficiently.

VI. CONCLUSION

In this paper, a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed. With the proposed novel BISU, the number of pipelined stages can be considerably reduced; thus, the required area costs, sorting cycles, and power consumption can also be reduced. The proposed architecture was implemented using the Verilog HDL and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library. The experimental results indicate that the proposed architecture required the lowest gate counts, the fewest sorting cycles, and the lowest power consumption among the compared sorting designs. Furthermore, compared with an existing pipelined sorting architecture, the proposed architecture exhibited higher throughput-to-gate-count and throughput-to-power-consumption ratios and thus more efficient hardware resource usage.

In real-world applications, the usage scenarios are more diverse. A more compact design that is capable of balancing high performance with hardware area costs may be necessary. Future work includes enabling the number of data involved in insertion sorting to be scalable and designing high-performance algorithms that can simplify the logic judgments.

REFERENCES

[1] L. Njejimana et al., “Design of a real-time FPGA-based data acquisition architecture for the LabPET II: An APD-based scanner dedicated to small animal PET imaging,” IEEE Trans. Nucl. Sci., vol. 60, no. 5, pp. 3633–3638, Oct. 2013.
[2] A. Gabiger-Rose, M. Kube, R. Weigel, and R. Rose, “An FPGA-based fully synchronized design of a bilateral filter for real-time image denoising,” IEEE Trans. Ind. Electron., vol. 61, no. 8, pp. 4093–4104, Aug. 2014.
[3] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detector design and VLSI implementation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.
[4] M. Shabany and P. G. Gulak, “A 675 Mbps, 4×4 64-QAM K-best MIMO detector in 0.13 µm CMOS,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 135–147, Jan. 2012.
[5] B. Y. Kong and I.-C. Park, “Improved sorting architecture for K-best MIMO detection,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 9, pp. 1042–1046, Sep. 2017.
[6] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, “GPUTeraSort: High performance graphics co-processor sorting for large database management,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2006, pp. 325–336.
[7] J. Casper and K. Olukotun, “Hardware acceleration of database operations,” in Proc. ACM/SIGDA FPGA, Monterey, CA, USA, 2014, pp. 151–160.
[8] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-sort: A new parallel sorting algorithm for multi-core SIMD processors,” in Proc. 16th Int. Conf. Parallel Archit. Compilation Techn. (PACT), Sep. 2007, pp. 189–198.
[9] D. Merrill and A. Grimshaw, “High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing,” Parallel Process. Lett., vol. 21, no. 2, pp. 245–272, Jun. 2011.
[10] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, “Efficient parallel merge sort for fixed and variable length keys,” in Proc. Innov. Parallel Comput. (InPar), 2012, pp. 1–9.
[11] N. Tsuda, T. Satoh, and T. Kawada, “A pipeline sorting chip,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 1987, pp. 270–271.
[12] R. Marcelino, H. C. Neto, and J. M. P. Cardoso, “Unbalanced FIFO sorting for FPGA-based systems,” in Proc. 16th IEEE Int. Conf. Electron., Circuits Syst., Dec. 2009, pp. 431–434.
[13] D. Koch and J. Torresen, “FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays, Feb. 2011, pp. 45–54.
[14] G. Xiao, M. Martina, G. Masera, and G. Piccinini, “A parallel radix-sort-based VLSI architecture for finding the first W maximum/minimum values,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 890–894, Nov. 2014.
[15] B. Yong Kong, H. Yoo, and I.-C. Park, “Efficient sorting architecture for successive-cancellation-list decoding of polar codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 673–677, Jul. 2016.
[16] M. Zuluaga, P. Milder, and M. Püschel, “Streaming sorting networks,” ACM Trans. Design Autom. Electron. Syst., vol. 21, no. 4, pp. 1–30, Sep. 2016.
[17] R. Chen, S. Siriyal, and V. Prasanna, “Energy and memory efficient mapping of bitonic sorting on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2015, pp. 240–249.
[18] W. Chen, W. Li, and F. Yu, “A hybrid pipelined architecture for high performance top-K sorting on FPGA,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 8, pp. 1449–1453, Aug. 2020.
[19] S. S. Ray, D. Adak, and S. Ghosh, “Worst case O(N) comparison-free hardware sorting engine,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 10, pp. 3332–3345, Oct. 2022.
[20] S. Saha Ray and S. Ghosh, “K-degree parallel comparison-free hardware sorter for complete sorting,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 5, pp. 1438–1449, May 2023.
[21] A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, “Modular design of high-throughput, low-latency sorting units,” IEEE Trans. Comput., vol. 62, no. 7, pp. 1389–1402, Jul. 2013.
[22] S.-H. Lin, P.-Y. Chen, and Y.-N. Lin, “Hardware design of low-power high-throughput sorting unit,” IEEE Trans. Comput., vol. 66, no. 8, pp. 1383–1395, Aug. 2017.
[23] S. Abdel-Hafeez and A. Gordon-Ross, “An efficient O(N) comparison-free sorting algorithm,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 6, pp. 1930–1942, Jun. 2017.


[24] W.-T. Chen, R.-D. Chen, P.-Y. Chen, and Y.-C. Hsiao, “A high-performance bidirectional architecture for the quasi-comparison-free sorting algorithm,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 4, pp. 1493–1506, Apr. 2021.
[25] K. E. Batcher, “Sorting networks and their applications,” in Proc. Spring Joint Comput. Conf.-AFIPS (Spring), vol. 1968, pp. 307–314.
[26] S. Mashimo, T. Van Chu, and K. Kise, “High-performance hardware merge sorter,” in Proc. IEEE 25th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2017, pp. 1–8.
[27] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, and K. Kise, “A high-performance and cost-effective hardware merge sorter without feedback datapath,” in Proc. IEEE 26th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2018, pp. 197–204.
[28] W. Qiao, J. Oh, L. Guo, M. F. Chang, and J. Cong, “FANS: FPGA-accelerated near-storage sorting,” in Proc. IEEE 29th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), May 2021, pp. 106–114.
[29] J. Cho, D. I. Maulana, and W. Jung, “A near-memory radix sort accelerator with parallel 1-bit sorter,” in Proc. IEEE 30th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), May 2022, p. 1.

You-Rong Chen received the B.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2020, where he is currently pursuing the Ph.D. degree. His current research interests include image processing, very large-scale integrated chip design, and embedded systems.

Chien-Chia Ho received the B.S. and M.S. degrees in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2020 and 2022, respectively. He is currently a Senior Engineer with MediaTek Inc. His current research interests include image processing, very large-scale integrated chip design, and embedded systems.

Wei-Ting Chen received the B.S. and Ph.D. degrees from the Departments of Engineering Science and Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, in 2017 and 2021, respectively. He is currently a Senior Engineer with MediaTek Inc. His current research interests include image processing, very large-scale integrated chip design, and embedded systems.

Pei-Yin Chen (Senior Member, IEEE) received the B.S. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1986, the M.S. degree in electrical engineering from Pennsylvania State University, University Park, PA, USA, in 1990, and the Ph.D. degree in electrical engineering from National Cheng Kung University in 1999. He is currently a Professor with the Department of Computer Science and Information Engineering, National Cheng Kung University. His research interests include very large-scale integration chip design, video compression, fuzzy logic control, and gray prediction.
