A Low-Cost Pipelined Architecture Based on a Hybrid Sorting Algorithm
Abstract— In this paper, a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed. The proposed architecture is constructed with a bitonic sorter and several cascaded bidirectional insertion sorting units. The bidirectional insertion sorting unit uses the segmented sorted subsequence generated by the bitonic sorter as input, and records the maximum and minimum values of the subsequence. After all segmented subsequences are processed through the cascaded bidirectional insertion sorting units, a sorted sequence is obtained. The proposed architecture is implemented using the Verilog hardware description language (HDL) and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library. The experimental results indicate that the proposed architecture can not only shorten sorting cycles but also reduce hardware area costs. Moreover, sorting cycles can be further shortened by increasing the parallelism of the proposed architecture. Under the configuration in which 2048 32-bit data are to be sorted and 16 data are processed simultaneously, the proposed architecture can improve the throughput-to-gate-count ratio by 16% and the throughput-to-power-consumption ratio by 25% compared to the existing sorting design. The proposed architecture makes the most efficient use of hardware resources.

Index Terms— Sorting architecture, low cost, low latency, pipelined architecture, very large-scale integrated (VLSI) circuit.

I. INTRODUCTION

SORTING is a crucial process in many fields, including scientific computing [1], image processing [2], wireless networks [3], [4], [5], and database management [6], [7]. Because of increases in the quantity of data generated in various applications, the development of high-performance sorting algorithms has become critical. Although some studies have adopted parallel sorting methods that involve the use of multicore central processing units [8] or graphics processing units [9], [10], the high resource costs limit their application. By contrast, high throughput and operating speed can be achieved using dedicated hardware to implement the sorting algorithm [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29]. The hardware implementation can also be easily deployed to devices with limited resources, which makes it a better solution.

Recently, many hardware architectures for sorting algorithms implemented with field programmable gate arrays (FPGAs) [16], [17], [18], [19], [20], [26], [27], [28], [29] or very large-scale integrated (VLSI) circuits [21], [22], [23], [24] have been proposed. Farmahini-Farahani et al. [21] implemented a modular design that comprised hierarchical sorting units, which were optimized for max-set selection. Lin et al. [22] proposed a low-power, high-throughput modular hardware design. Their architecture swaps the index values of the sorted numbers instead of the numbers themselves to reduce power dissipation. Mashimo et al. [26] proposed merge network architectures in which the number of gate levels remains constant as the level of data parallelism increases, preventing a significant decrease in operating frequency. In [27], Saitoh et al. improved the architecture proposed in [26] by eliminating the feedback data paths in the merge logic to shorten the critical path. Qiao et al. [28] presented an analytical framework for modeling the overall performance of the FPGA-accelerated external merge sort system and provided an optimized solution for a specific device. Cho et al. [29] proposed a near-memory radix sort accelerator, which achieves high throughput with a parallel 1-bit radix sorter.

Zuluaga et al. [16] proposed new hardware structures called streaming sorting networks and an accompanying domain-specific hardware generation tool through which users can easily tune the trade-off between area costs and performance. The designs in [19], [20], [23], and [24] developed architectures that do not require comparison operations between the data to be sorted. Ray et al. [19] proposed a comparison-free hardware sorting engine that identifies the largest element from the unsorted data in each iteration, which is accomplished using basic logic gates. Let N be the number of data to be sorted. The sorted result can be obtained after N iterations. Based on the basic blocks proposed in [19], Ray et al. [20] presented a hardware-based parallel comparison-free sorter. The proposed parallel sorter in [20] incorporates concurrent comparison-free clusters to achieve a speed-up over nonparallel architectures. Although the designs in [19] and [20] require only N cycles to obtain sorted results, their operating frequency is significantly dependent on N. When N increases, the throughput of the designs in [19] and [20] decreases. Moreover, the hardware resources required by the designs in [19] and [20] increase sharply

Manuscript received 24 September 2023; revised 20 November 2023; accepted 11 December 2023. Date of publication 28 December 2023; date of current version 30 January 2024. This work was supported in part by the National Science and Technology Council, Taiwan, under Grant 110-2221-E-006-164-MY3. This article was recommended by Associate Editor X. S. Zhang. (Corresponding author: Pei-Yin Chen.)
You-Rong Chen and Pei-Yin Chen are with the Digital Integrated Circuit Design Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan (e-mail: [email protected]; [email protected]).
Chien-Chia Ho and Wei-Ting Chen are with MediaTek Inc., Hsinchu 30078, Taiwan (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TCSI.2023.3342929
1549-8328 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 11:32:01 UTC from IEEE Xplore. Restrictions apply.
718 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 71, NO. 2, FEBRUARY 2024
CHEN et al.: LOW-COST PIPELINED ARCHITECTURE BASED ON A HYBRID SORTING ALGORITHM 719
sorting algorithm that combines bitonic sort and insertion sort. The input data sequence is segmented into subsequences of length P and then sorted with a bitonic sorter. Since P is much smaller than N, this prevents the hardware area costs of the bitonic sorter from becoming too high. The segmented sorted subsequences are then transmitted to cascaded insertion sorting units, which record the largest values passed through them. The architecture of the sorting logic used to determine the order of the output subsequence and the recorded value in the insertion sorting units is illustrated in Fig. 3. DI[1] to DI[P] and DO[1] to DO[P] represent the input data subsequence and the output data subsequence, respectively. The two subsequences are both in ascending order. RI_MAX and RO_MAX denote the original value and updated value recorded in the register of the insertion sorting unit, respectively. When the INIT_in signal is low, the sorting logic rearranges the data from DI and RI_MAX. If the value of the register of the insertion sorting unit is updated with DI[P], its original value will be inserted into the output data subsequence to keep the subsequence in order. The processed subsequence is then passed to the next insertion sorting unit. After all segmented subsequences pass through the cascaded insertion sorting units, a sorted sequence of length N is obtained.

Fig. 3. The architecture of the sorting logic module proposed in [18].

Although the design proposed in [18] can enhance the throughput by parallel processing, its hardware area costs increase with the level of data parallelism P. Furthermore, there was room for improvement in its latency. Inspired by the previous works [18] and [24], a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed in this paper. The proposed architecture introduces the concept of bidirectional processing into the cascaded insertion sorting units. The designed BISU can record the maximum and minimum

A. Bitonic Sort

Initially, the data sequence to be sorted is segmented into subsequences of length P, where P is much smaller than N. A bitonic sorting network is used to generate N/P segmented sorted subsequences, which are then processed by the proposed bidirectional insertion sorting architecture (introduced in Section III-B) to produce a sorted sequence with a full length (N). In this manner, the proposed sorting algorithm can take advantage of the bitonic sorting network to increase the throughput of the architecture, while preventing a considerable increase in hardware resource usage.

B. Bidirectional Insertion Sorting

The proposed bidirectional insertion sorting architecture has two operation modes: the recording mode and the insertion mode. Two storage arrays, namely a smaller-value array (SVA) and a larger-value array (LVA), are used to record the sorting results. The length of the SVA and LVA is N/2, and their bit width is equal to the bit width of data (K). The sorting architecture can be divided into N/2 stages. The i-th stage is responsible for controlling the comparisons and swapping between the minimum and maximum values of the input segmented sorted subsequence and the values of SVA[i] and LVA[i]. When the architecture is reset, all of the elements in the SVA are set as MAX, and all of the elements in the LVA are set as MIN. The value of MIN is 0, and the value of MAX is determined according to K. Equations (1) and (2) present the formulas for MIN and MAX, respectively.

$$\mathrm{MIN} = 0 \tag{1}$$

$$\mathrm{MAX} = 2^{K} - 1 \tag{2}$$

Let the data sequence input into the i-th stage of the sorting architecture be DI_i, and let the data sequence output by this stage be DO_i. The length of these sequences is P. The data of DI_i are sorted with bitonic sort; thus, DI_i is an ascending-order sequence (from DI_i[1] to DI_i[P]). After the bidirectional insertion sorting operations, DO_i must also be in ascending order. SVA_i[i] and LVA_i[i] represent the values of SVA[i] and LVA[i] before the bidirectional insertion sorting operations, while SVA_o[i] and LVA_o[i] represent the values of SVA[i] and LVA[i] after the bidirectional insertion sorting operations. Table I shows the cases of bidirectional insertion sorting. There are three judgment conditions in the table, namely whether SVA_i[i] is greater than LVA_i[i], whether SVA_i[i] is greater than DI_i[1], and whether LVA_i[i] is smaller than DI_i[P]. The second row of the table shows the operations in the recording mode, while the subsequent rows show the operations in the insertion mode. Details regarding the recording and insertion modes are provided in the following text.

TABLE I
CASES OF BIDIRECTIONAL INSERTION SORTING (X = DON'T CARE VALUE)

1) Recording Mode: If SVA_i[i] is greater than LVA_i[i], the i-th stage of the sorting architecture operates in the recording mode, and the values of DI_i[1] and DI_i[P] are directly stored into SVA[i] and LVA[i], respectively. When the i-th stage of the sorting architecture is operating in the recording mode, the values of SVA[i] and LVA[i] are not yet replaced by valid data; thus, SVA_i[i] must be MAX, and LVA_i[i] must be MIN. The invalid data of SVA[i] and LVA[i] are replaced by DI_i[1] and DI_i[P], and then inserted into DO_i[P/2] and DO_i[P/2 + 1], respectively. The values of DI_i[2] to DI_i[P/2] are then assigned to DO_i[1] to DO_i[P/2 - 1], and the values of DI_i[P/2 + 1] to DI_i[P - 1] are assigned to DO_i[P/2 + 2] to DO_i[P].

It is worth noting that when the i-th stage of the sorting architecture is operating in the recording mode, all subsequent cascaded stages of the sorting architecture must also operate in the recording mode. The output sequence DO_i then becomes the input sequence of the (i + 1)-th stage of the sorting architecture (i.e., DI_{i+1}). DI_{i+1}[1] and DI_{i+1}[P] will be used to update the values of SVA[i + 1] and LVA[i + 1], respectively. Assume that an input sequence with P valid data is used to initialize the values of the SVA and LVA of the i-th to (i + P/2 - 1)-th stages of the sorting architecture. After the sequence is transferred to the (i + P/2)-th stage of the sorting architecture, all of its data become invalid. Because of the replacement rules of the recording mode, all values from DI_{i+P/2}[1] to DI_{i+P/2}[P/2] are MAX, and all values from DI_{i+P/2}[P/2 + 1] to DI_{i+P/2}[P] are MIN. This design can ensure that the values of the SVA and LVA of the uninitialized stages remain unchanged when the input sequence is completely invalid and that the uninitialized stages remain in the recording mode.

2) Insertion Mode: When LVA_i[i] is greater than or equal to SVA_i[i], the i-th stage of the sorting architecture operates in the insertion mode. When the i-th stage of the sorting architecture is in the insertion mode, comparisons are made between SVA_i[i] and DI_i[1] and between LVA_i[i] and DI_i[P]. According to the comparison results, there are four insertion cases, which are displayed in the third to the sixth rows of Table I.

In the first insertion case, SVA_i[i] is no larger than DI_i[1], and LVA_i[i] is no smaller than DI_i[P]. In this case, SVA[i] and LVA[i] retain their original values, and the values of DI_i are directly assigned to DO_i in sequence.

In the second insertion case, SVA_i[i] is greater than DI_i[1], and LVA_i[i] is no smaller than DI_i[P]. In this case, the value of SVA[i] is updated with DI_i[1], and SVA_i[i] must be inserted into the output sequence DO_i. The value of DO_i[1] is the smaller value between SVA_i[i] and DI_i[2]. The value of DO_i[k], where k is between 2 and P - 1, is the median of SVA_i[i], DI_i[k], and DI_i[k + 1]. The value of DO_i[P] is the greater value between SVA_i[i] and DI_i[P].

In the third insertion case, SVA_i[i] is no larger than DI_i[1], and LVA_i[i] is smaller than DI_i[P]. In this case, the value of LVA[i] is updated with DI_i[P]. The value of DO_i[1] is the smaller value between LVA_i[i] and DI_i[1]. The value of DO_i[k], where k is between 2 and P - 1, is the median of LVA_i[i], DI_i[k - 1], and DI_i[k]. The value of DO_i[P] is the greater value between LVA_i[i] and DI_i[P - 1].

In the fourth insertion case, SVA_i[i] is greater than DI_i[1], and LVA_i[i] is smaller than DI_i[P]. Therefore, the values of SVA[i] and LVA[i] are updated by those of DI_i[1] and DI_i[P], respectively, and the replaced values must be inserted into the output sequence. The value of DO_i[1] is the smaller value between SVA_i[i] and DI_i[2], and the value of DO_i[P] is the higher value between LVA_i[i] and DI_i[P - 1]. Five possible values exist for DO_i[k], where k is between 2 and P - 1, namely SVA_i[i], LVA_i[i], DI_i[k - 1], DI_i[k], and DI_i[k + 1], and DO_i[k] is set as the median of these five values. For hardware implementation, the median searching process is divided into two steps. First, the median between SVA_i[i], LVA_i[i], and DI_i[k] is determined. If the median value is DI_i[k], DO_i[k] is set as DI_i[k] directly. If the median is SVA_i[i], DO_i[k] is the smaller value between SVA_i[i] and DI_i[k + 1]. Because SVA_i[i] must not be greater than LVA_i[i], it must be greater than the values of DI_i[k] and DI_i[k - 1] when it is selected as the median. Consequently, the comparison between SVA_i[i] and DI_i[k - 1] is saved. If the median is LVA_i[i], then this value is smaller than DI_i[k] and DI_i[k + 1]. In this case, the value of DO_i[k] is the higher value between LVA_i[i] and DI_i[k - 1].

IV. HARDWARE ARCHITECTURE

The hardware architecture of the proposed sorting algorithm is displayed in Fig. 4. This architecture consists of a bitonic sorting unit (BSU) and several BISUs. The input EN is an enable signal, and all sorting units are active only when EN is high. The input OI is an output indication signal. When OI is high, the sorted data in the BISUs are output in sequence. The input data to be sorted are segmented into subsequences of length P (i.e., DI). To enable the proposed hardware architecture to process a new input data sequence without idle time, an inversion signal INV is used to separate two sets of data. The outputs of the proposed architecture are V and DO, which denote the output valid signal and the output data stream of length P, respectively. The data of DO are valid when V is high. In the following subsections, the architectures of the BSU and BISUs are introduced in detail.

A. Bitonic Sorting Unit

The basic composition blocks of a BSU are comparison units (CUs), which take two inputs and arrange them in ascending order. The detailed architecture of a CU is shown in Fig. 5. A CU is constructed with a comparator and two multiplexers. L is the minimum between the inputs x and y, while H is the maximum between them. An example of the structure of a BSU is illustrated in Fig. 6. For brevity, a CU is represented with a vertical segment with circles at both ends, which is shown in Fig. 6(a). Fig. 6(b) displays the structure of a four-input BSU, where i_1 to i_4 represent the input data sequence, and s_1 to s_4 represent the output sorted sequence. As displayed in Fig. 6(b), six CUs are required in a four-input BSU. Notably, the required number of CUs will grow to 24 for an eight-input BSU. The required number of CUs increases dramatically with the length of the input data sequence. Consequently, the proposed hybrid sorting architecture is adopted to reduce the hardware area costs.

B. Bidirectional Insertion Sorting Unit

BISUs consist of two data registers (R_data1 and R_data2), a bidirectional insertion sorting logic (BISL), and an output control module (OCM). Fig. 7 depicts the hardware design of the BISUs used in the proposed hardware architecture. The terms R_max and R_min denote the data registers that store the maximum and minimum values passing through the BISU, respectively. When the INV signal is low, R_data1 is assigned as R_min, and R_data2 is assigned as R_max. By contrast, when the INV signal is high, R_data1 is assigned as R_max, and R_data2 is assigned as R_min. This design enables the proposed architecture to process a new set of data before the previous data sequence has been fully output. A comparator is used to determine whether the BISU should operate in the recording mode. The BISL implements the proposed bidirectional

Fig. 7. The hardware architecture of the BISU.
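The recording mode and the four insertion cases of Section III-B can be written out as a short behavioral model. The Python sketch below is illustrative only and is not the authors' Verilog implementation: the function name `bisl_stage`, the 0-based list indexing, and the use of `statistics.median` in place of the two-step median search are assumptions made for readability.

```python
from statistics import median


def bisl_stage(di, sva, lva):
    """Behavioral sketch of one stage (hypothetical helper, 0-based indexing).

    di  : input subsequence of even length P (ascending in insertion mode)
    sva : value of SVA[i] before the operation
    lva : value of LVA[i] before the operation
    Returns (do, new_sva, new_lva).
    """
    p = len(di)
    if sva > lva:
        # Recording mode: store DI[1] and DI[P]; the old (invalid) register
        # values go to the middle of the output, the rest moves to both ends.
        half = p // 2
        do = di[1:half] + [sva, lva] + di[half:p - 1]
        return do, di[0], di[-1]

    replace_sva = sva > di[0]    # SVA[i] must be updated with DI[1]
    replace_lva = lva < di[-1]   # LVA[i] must be updated with DI[P]

    if not replace_sva and not replace_lva:
        # First insertion case: registers unchanged, input passed through.
        return list(di), sva, lva

    if replace_sva and not replace_lva:
        # Second insertion case: old SVA is merged into DI[2..P].
        do = [min(sva, di[1])]
        do += [median((sva, di[k], di[k + 1])) for k in range(1, p - 1)]
        do += [max(sva, di[-1])]
        return do, di[0], lva

    if not replace_sva and replace_lva:
        # Third insertion case: old LVA is merged into DI[1..P-1].
        do = [min(lva, di[0])]
        do += [median((lva, di[k - 1], di[k])) for k in range(1, p - 1)]
        do += [max(lva, di[-2])]
        return do, sva, di[-1]

    # Fourth insertion case: both old values are merged into DI[2..P-1];
    # each middle output is the median of five candidates.
    do = [min(sva, di[1])]
    do += [median((sva, lva, di[k - 1], di[k], di[k + 1]))
           for k in range(1, p - 1)]
    do += [max(lva, di[-2])]
    return do, di[0], di[-1]
```

With P = 4, the call `bisl_stage([2, 6, 12, 14], 3, 11)` falls in the fourth insertion case and returns `([3, 6, 11, 12], 2, 14)`, which agrees with the behavior described for BISU_1 in cycle 6 of the example in Section IV-C.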
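The CU counts quoted in Section IV-A (6 for a four-input BSU and 24 for an eight-input BSU) are consistent with the standard bitonic sorting network, which has log2(P)(log2(P) + 1)/2 stages of P/2 CUs each. The sketch below generates one textbook arrangement of compare-exchange pairs and counts them. The function names are hypothetical, and the wiring in Fig. 6 may order or orient the CUs differently.

```python
def bitonic_network(n):
    """Compare-exchange pairs (low_idx, high_idx) of an n-input bitonic sorter.

    n must be a power of two. After a pair (a, b) is applied, position a holds
    the smaller value (CU output L) and position b the larger one (output H).
    """
    pairs = []
    k = 2
    while k <= n:              # size of the bitonic blocks being merged
        j = k // 2
        while j >= 1:          # distance between the two inputs of a CU
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    pairs.append((i, partner) if ascending else (partner, i))
            j //= 2
        k *= 2
    return pairs


def run_network(pairs, data):
    """Apply the CUs in order to a copy of data and return the result."""
    out = list(data)
    for low, high in pairs:
        if out[low] > out[high]:
            out[low], out[high] = out[high], out[low]
    return out


print(len(bitonic_network(4)), len(bitonic_network(8)))
print(run_network(bitonic_network(4), [2, 13, 1, 5]))
```

Under this construction the count is (P/4) log2(P)(log2(P) + 1), which is why the BSU is kept small (P much smaller than N) and the remaining work is left to the cascaded BISUs.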
of P; otherwise, this value is set as 0. This design is used to prevent the first half of the sorted output data sequence from being stored into the R_min registers. During the output process, input data sequences with only MIN values are used to replace the data in the R_min registers. The output sequence should not be changed after P valid data are collected from the R_min registers. Consequently, the value of EN_out will be pulled down to 0 if both the values of R_OS and V_in are 1. Moreover, the value of R_OS becomes 0 if the values of R_OS, EN_in, and V_in are all 1, which indicates that the sorted data in the R_min registers of the previous P architecture stages have been output. When outputting the second half of the sorted data, input data sequences with only MAX values are used to replace the data in the R_max registers. The sorted data in the R_max registers are gradually transferred backward to update the values of the R_max registers of the subsequent BISUs. The output sequence consisting of data collected from the R_max registers of BISU_{(N/2)-P+1} to BISU_{N/2} is output in each cycle.

C. Example of the Sorting Process

In this section, an example is presented in Fig. 11 to demonstrate how the proposed BISU sorts an input data sequence and outputs sorted results in ascending order. For brevity, the operations of the BSU and OCM are not depicted in the figure. The example of the sorting process is illustrated by P = 4, N = 8, and K = 4. Equations (3) and (4) present the two sets of input data considered in the explained example, respectively. Notably, the invert signals from two consecutive sets of input data should be opposite to each other. Therefore, the invert signals of set A are all 0, and those of set B are all 1.

$$
\begin{aligned}
DI_A &= \{2, 13, 1, 5 \mid 8, 10, 15, 1 \mid 0, 0, 0, 0 \mid 15, 15, 15, 15\} \\
EN_A &= \{1 \mid 1 \mid 1 \mid 1\} \\
V_A &= \{0 \mid 0 \mid 1 \mid 1\} \\
INV_A &= \{0 \mid 0 \mid 0 \mid 0\}
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
DI_B &= \{7, 11, 5, 3 \mid 12, 2, 14, 6 \mid 0, 0, 0, 0 \mid 15, 15, 15, 15\} \\
EN_B &= \{1 \mid 1 \mid 1 \mid 1\} \\
V_B &= \{0 \mid 0 \mid 1 \mid 1\} \\
INV_B &= \{1 \mid 1 \mid 1 \mid 1\}
\end{aligned}
\tag{4}
$$

In Fig. 11, valid data are denoted in red color, and valid outputs are highlighted with a gray background. At cycle 0, the value of R_min is set as MAX, that is, 15, and the value of R_max is set as MIN, that is, 0. In cycle 1, the first segmented subsequence {1, 2, 5, 13}, which is sorted by the BSU, is input to BISU_1. In BISU_1, the value of R_min is greater than that of R_max; thus, BISU_1 operates in the recording mode. The input data 1 and 13 are stored into the R_min and R_max registers, respectively. The invalid data 15 and 0 are directly inserted into the middle of the output sequence without comparison with the remaining data in the input sequence, and the remaining data in the input sequence are placed at both ends of the output sequence. In cycle 2, the second segmented subsequence {1, 8, 10, 15} is input to BISU_1, and the output sequence of BISU_1 is transferred to BISU_2. BISU_2 also operates in the recording mode. The data 2 and 5 are stored into the R_min and R_max registers, respectively, and the output sequence of BISU_2 is {15, 15, 0, 0}. The aforementioned design ensures that the values maintained in the subsequent cascaded BISUs are not updated until a valid data sequence is input. Because the value of R_min is not greater than that of R_max in BISU_1, BISU_1 operates in the insertion mode. The smallest value in the input sequence is 1, which is equal to the value of R_min; therefore, R_min does not need to be updated. The highest value in the input sequence is 15, which is higher than the value of R_max; therefore, 15 is stored into the R_max register, and 13 is inserted into the output sequence. The output sequence of BISU_1 then becomes {1, 8, 10, 13}, which is still arranged in ascending order.

The first half of the sorted sequence is stored in the R_min register of each BISU in ascending order. A data sequence that only contains 0s is input to replace the sorted results in R_min. In cycle 3, the value 0 is used to replace the R_min value of BISU_1, namely 1, and 1 is inserted into the output sequence; thus, the output sequence of BISU_1 becomes {0, 0, 0, 1}. In cycle 4, the value 0 is again used to replace the R_min value of BISU_2, namely 1, and 1 is inserted into the output sequence; thus, the output sequence of BISU_2 becomes {0, 0, 1, 1}, which is transferred to the subsequent cascaded BISUs to collect valid output values from the R_min registers.

The second half of the sorted sequence is stored in the R_max register of each BISU in descending order. To replace the sorted results in R_max, a data sequence that only contains 15s is input. In cycle 4, the R_max value of BISU_1 is 15; thus, the value of R_max does not need to be updated. In cycle 5, the value 15 is used to replace the R_max value of BISU_2, namely 13, and 13 is inserted into the output sequence; thus, the output sequence of BISU_2 becomes {13, 15, 15, 15}, which is transferred to the subsequent cascaded BISUs to collect valid output values from the R_max registers.

Meanwhile, the first segmented subsequence {3, 5, 7, 11} of data set B is input into BISU_1 in cycle 5. Because the values of INV_B are 1, the registers used as R_min and R_max are swapped. This design enables the architecture to output the sorting results and process new input data sequences simultaneously. The R_min and R_max values of BISU_1 are 15 and 0, respectively. The value of R_min is greater than that of R_max; thus, BISU_1 operates in the recording mode again. The values 3 and 11 are stored into the R_min and R_max registers, respectively. In cycle 6, the second segmented subsequence {2, 6, 12, 14} is input to BISU_1, and the output sequence of BISU_1 is transferred to BISU_2. BISU_2 also operates in the recording mode again. The data 5 and 7 are stored into the R_min and R_max registers, respectively. Because the value of R_min is not greater than that of R_max in BISU_1, BISU_1 operates in the insertion mode. The smallest value in the input sequence is 2, which is smaller than the value of R_min. The highest value in the input sequence is 14, which is higher than the value of R_max. Therefore, 2 and 14 are stored into the R_min and R_max registers, respectively, and 3 and 11 are inserted into the output sequence simultaneously. The output sequence of BISU_1 then becomes {3, 6, 11, 12}.

Finally, the first half and the second half of the sorted results of data set A, namely {1, 1, 2, 5} and {8, 10, 13, 15}, are obtained in cycles 6 and 7, respectively.

V. PERFORMANCE ANALYSIS

This section describes the performance of the proposed sorting architecture. In Section V-A, the performance of the proposed architecture is compared to the sorting design in [18], which also adopts a hybrid sorting algorithm, as well as the sorting designs in [23] and [24], both of which exhibit a low growth rate in hardware area costs. Because our bidirectional insertion sorting architecture and the architecture proposed in [18] are both pipelined architectures, a detailed comparison between these architectures is provided in Section V-B. The performance of the considered sorting architectures is analyzed in terms of circuit area, sorting cycles, and power consumption. Circuit area is measured in terms of the gate count of the circuit and rounded off to the nearest 1000 gates. The gate count is calculated by dividing the total cell area of the circuit by the area of a two-input NAND gate. All compared architectures are implemented using the Verilog HDL and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library, and the power consumption is measured using the Synopsys PrimeTime PX.

A. Comparison of the Sorting Performance

The performance of the proposed architecture is compared with that of the sorting architectures proposed in [18], [23], and [24]. The architectures proposed in [23] and [24] use comparison-free sorting algorithms, which do not use comparators in the sorting process. The architecture proposed in [23] uses registers to record the number of occurrences of data in an input sequence. The speed complexity and total gate count complexity of the architecture proposed in [23] are of the order of O(N). The architecture proposed in [24] is an improved version of [23]. The architecture proposed in [24] reduces the number of sorting cycles through bidirectional sorting and uses two auxiliary methods to reduce the wastage of clock cycles during the sorting process. The architecture proposed in [18] combines bitonic sort and insertion sort to enhance the throughput with acceptable hardware area costs. Since the architectures proposed in [23] and [24] can only process one input data in a clock cycle, the architectures proposed in [18] and this paper are implemented in the special case in which P is 1 for a fair comparison with the architectures proposed in [23] and [24]. The insertion sorting unit in the architecture proposed in [18] and the BISU in our architecture are replaced with simple comparators and multiplexers. The bit
TABLE II
THE COMPARISON OF GATE COUNTS (K GATES)

TABLE IV
THE COMPARISON OF POWER CONSUMPTION (W)
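Returning to the worked example of Section IV-C, the cycle-by-cycle description for data set A can be cross-checked with a small behavioral model. The sketch below is an assumption-laden simplification, not the authors' design: it ignores the OCM, the EN/V/INV signals, and data set B; it models the BSU with Python's `sorted`; and it replaces the case analysis of Table I with an equivalent compact rule (in the insertion mode, a stage keeps the overall minimum and maximum and forwards the remaining values in ascending order). The names `stage` and `run_cascade` are hypothetical.

```python
def stage(di, sva, lva):
    """Compact behavioral model of one BISU; returns (do, new_sva, new_lva)."""
    p = len(di)
    if sva > lva:
        # Recording mode: keep DI[1] and DI[P]; the old register values
        # (MAX and MIN) are placed in the middle of the output sequence.
        half = p // 2
        return di[1:half] + [sva, lva] + di[half:p - 1], di[0], di[-1]
    # Insertion mode: the registers keep the smallest and largest value seen,
    # and everything else is forwarded in ascending order.
    pool = sorted(di + [sva, lva])
    return pool[1:-1], pool[0], pool[-1]


def run_cascade(blocks, n_stages, k_bits):
    """Feed the blocks through the cascade, one block per cycle.

    Returns a list of (cycle, sequence leaving the last stage), assuming the
    first block enters the first stage in cycle 1 and each stage takes one cycle.
    """
    max_value = (1 << k_bits) - 1          # MAX = 2^K - 1, MIN = 0
    regs = [(max_value, 0)] * n_stages     # reset state: SVA = MAX, LVA = MIN
    results = []
    for t, seq in enumerate(blocks, start=1):
        for i in range(n_stages):
            seq, sva, lva = stage(seq, *regs[i])
            regs[i] = (sva, lva)
        results.append((t + n_stages - 1, seq))
    return results


P, N, K = 4, 8, 4
set_a = [[2, 13, 1, 5], [8, 10, 15, 1]]          # valid part of DI_A
blocks = [sorted(block) for block in set_a]      # BSU modeled by sorted()
blocks += [[0] * P, [2**K - 1] * P]              # all-MIN and all-MAX flush blocks
results = run_cascade(blocks, N // 2, K)
for cycle, seq in results:
    print(cycle, seq)
```

Under these assumptions, the last stage emits {1, 1, 2, 5} in cycle 6 and {8, 10, 13, 15} in cycle 7, which agrees with the result stated at the end of Section IV-C. Processing each block through all stages in turn is equivalent to the pipelined schedule here because every stage still sees the blocks in the same order.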
TABLE VI
T HE C OMPARISON B ETWEEN T HE A RCHITECTURE P ROPOSED IN [18] AND O UR A RCHITECTURE U NDER VARIOUS B IT W IDTH
TABLE VII
T HE C OMPARISON B ETWEEN T HE A RCHITECTURE P ROPOSED IN [18] AND O UR A RCHITECTURE U NDER VARIOUS L EVELS OF DATA PARALLELISM
energy consumption per data byte. The sizes of the input data (N) in the comparison are set as 128, 1024, and 2048.

First, the performance of the aforementioned architectures is analyzed by fixing the level of data parallelism at 4 (P = 4) and varying the bit width of the data (K = 8, 16, 32); the relevant results are presented in Table VI. Second, the performance of these architectures is investigated under a fixed bit width of 32 (K = 32) and different levels of data parallelism (P = 4, 8, 16); the relevant results are presented in Table VII. To sort a set of data with a size of N, the architecture proposed in [18] requires N pipeline stages, with each stage recording one sorted data item. In the proposed architecture, only N/2 pipeline stages are required, with each stage recording two sorted data items. Thus, in all comparison cases, our architecture requires lower gate counts, fewer sorting cycles, shorter latency, and lower power consumption than does the architecture proposed in [18]. Notably, the gate count reduction achieved with our architecture relative to the architecture proposed in [18] increases with P. For example, with K = 32 and N = 2048, the proposed architecture reduces the gate count by 6% when P is 4, 11% when P is 8, and 14% when P is 16, compared with the architecture proposed in [18]. Therefore, the proposed architecture can enhance the throughput by increasing the level of data parallelism at a lower additional hardware area cost.

Because the logic design of the proposed BISUs is more complex than that of the insertion sorting units used in [18], the operating frequency of our architecture is marginally lower than that of the architecture proposed in [18]. However, the proposed architecture can still achieve a throughput close to that of the architecture proposed in [18]. The difference between the
throughputs of our architecture and the architecture proposed in [18] ranges only from 1% to 5%. To further investigate the efficiency of the proposed architecture, the throughput-to-gate-count ratio, throughput-to-power-consumption ratio, and energy consumption per data byte are analyzed. Our architecture has a higher throughput-to-gate-count ratio than does the architecture proposed in [18], which indicates that our design generates sorted sequences by using hardware resources more efficiently than the design of [18]. When K and N are fixed, the improvement in the throughput-to-gate-count ratio of our architecture over the architecture proposed in [18] increases with P. For example, with K = 32 and N = 2048, our architecture improves the throughput-to-gate-count ratio by 5% when P is 4, 8% when P is 8, and 16% when P is 16, compared with the architecture proposed in [18]. Our architecture also achieves a higher throughput-to-power-consumption ratio and lower energy consumption per data byte. The improvements in the throughput-to-power-consumption ratio range from 7% to 30%, and the reductions in energy consumption per data byte range from 6% to 23%. Under the same power consumption, our design can produce a higher number of output results.

In summary, while the architecture proposed in [18] can operate at a slightly higher frequency than our architecture, the throughputs of both designs are close. Moreover, our architecture can produce more sorted data under the same area costs or power consumption when compared with the architecture proposed in [18]. Thus, our architecture can sort data more efficiently.

VI. CONCLUSION

In this paper, a low-cost pipelined architecture based on a hybrid sorting algorithm is proposed. With the proposed novel BISU, the number of pipeline stages can be considerably reduced; thus, the required area cost, number of sorting cycles, and power consumption can also be reduced. The proposed architecture was implemented using the Verilog HDL and synthesized using the Synopsys Design Compiler with a TSMC 90-nm cell library. The experimental results indicate that the proposed architecture required the lowest gate counts, the fewest sorting cycles, and the lowest power consumption among the compared sorting designs. Furthermore, compared with an existing pipelined sorting architecture, the proposed architecture exhibited higher throughput-to-gate-count and throughput-to-power-consumption ratios and thus more efficient hardware resource usage.

In real-world applications, usage scenarios are more diverse. A more compact design that is capable of balancing high performance with hardware area costs may be necessary. Future work includes enabling the number of data involved in insertion sorting to be scalable and designing high-performance algorithms that can simplify the logic judgments.

REFERENCES

[1] L. Njejimana et al., “Design of a real-time FPGA-based data acquisition architecture for the LabPET II: An APD-based scanner dedicated to small animal PET imaging,” IEEE Trans. Nucl. Sci., vol. 60, no. 5, pp. 3633–3638, Oct. 2013.
[2] A. Gabiger-Rose, M. Kube, R. Weigel, and R. Rose, “An FPGA-based fully synchronized design of a bilateral filter for real-time image denoising,” IEEE Trans. Ind. Electron., vol. 61, no. 8, pp. 4093–4104, Aug. 2014.
[3] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detector design and VLSI implementation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.
[4] M. Shabany and P. G. Gulak, “A 675 Mbps, 4×4 64-QAM K-best MIMO detector in 0.13 µm CMOS,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 135–147, Jan. 2012.
[5] B. Y. Kong and I.-C. Park, “Improved sorting architecture for K-best MIMO detection,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 9, pp. 1042–1046, Sep. 2017.
[6] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, “GPUTeraSort: High performance graphics co-processor sorting for large database management,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2006, pp. 325–336.
[7] J. Casper and K. Olukotun, “Hardware acceleration of database operations,” in Proc. ACM/SIGDA FPGA, Monterey, CA, USA, 2014, pp. 151–160.
[8] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-sort: A new parallel sorting algorithm for multi-core SIMD processors,” in Proc. 16th Int. Conf. Parallel Archit. Compilation Techn. (PACT), Sep. 2007, pp. 189–198.
[9] D. Merrill and A. Grimshaw, “High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing,” Parallel Process. Lett., vol. 21, no. 2, pp. 245–272, Jun. 2011.
[10] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, “Efficient parallel merge sort for fixed and variable length keys,” in Proc. Innov. Parallel Comput. (InPar), 2012, pp. 1–9.
[11] N. Tsuda, T. Satoh, and T. Kawada, “A pipeline sorting chip,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 1987, pp. 270–271.
[12] R. Marcelino, H. C. Neto, and J. M. P. Cardoso, “Unbalanced FIFO sorting for FPGA-based systems,” in Proc. 16th IEEE Int. Conf. Electron., Circuits Syst., Dec. 2009, pp. 431–434.
[13] D. Koch and J. Torresen, “FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. 19th ACM/SIGDA Int. Symp. Field Program. Gate Arrays, Feb. 2011, pp. 45–54.
[14] G. Xiao, M. Martina, G. Masera, and G. Piccinini, “A parallel radix-sort-based VLSI architecture for finding the first W maximum/minimum values,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 890–894, Nov. 2014.
[15] B. Y. Kong, H. Yoo, and I.-C. Park, “Efficient sorting architecture for successive-cancellation-list decoding of polar codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 673–677, Jul. 2016.
[16] M. Zuluaga, P. Milder, and M. Püschel, “Streaming sorting networks,” ACM Trans. Design Autom. Electron. Syst., vol. 21, no. 4, pp. 1–30, Sep. 2016.
[17] R. Chen, S. Siriyal, and V. Prasanna, “Energy and memory efficient mapping of bitonic sorting on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2015, pp. 240–249.
[18] W. Chen, W. Li, and F. Yu, “A hybrid pipelined architecture for high performance top-K sorting on FPGA,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 8, pp. 1449–1453, Aug. 2020.
[19] S. S. Ray, D. Adak, and S. Ghosh, “Worst case O(N) comparison-free hardware sorting engine,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 10, pp. 3332–3345, Oct. 2022.
[20] S. Saha Ray and S. Ghosh, “K-degree parallel comparison-free hardware sorter for complete sorting,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 5, pp. 1438–1449, May 2023.
[21] A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, “Modular design of high-throughput, low-latency sorting units,” IEEE Trans. Comput., vol. 62, no. 7, pp. 1389–1402, Jul. 2013.
[22] S.-H. Lin, P.-Y. Chen, and Y.-N. Lin, “Hardware design of low-power high-throughput sorting unit,” IEEE Trans. Comput., vol. 66, no. 8, pp. 1383–1395, Aug. 2017.
[23] S. Abdel-Hafeez and A. Gordon-Ross, “An efficient O(N) comparison-free sorting algorithm,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 6, pp. 1930–1942, Jun. 2017.
[24] W.-T. Chen, R.-D. Chen, P.-Y. Chen, and Y.-C. Hsiao, “A high-performance bidirectional architecture for the quasi-comparison-free sorting algorithm,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 4, pp. 1493–1506, Apr. 2021.
[25] K. E. Batcher, “Sorting networks and their applications,” in Proc. Spring Joint Comput. Conf.-AFIPS (Spring), vol. 1968, pp. 307–314.
[26] S. Mashimo, T. Van Chu, and K. Kise, “High-performance hardware merge sorter,” in Proc. IEEE 25th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2017, pp. 1–8.
[27] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, and K. Kise, “A high-performance and cost-effective hardware merge sorter without feedback datapath,” in Proc. IEEE 26th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2018, pp. 197–204.
[28] W. Qiao, J. Oh, L. Guo, M. F. Chang, and J. Cong, “FANS: FPGA-accelerated near-storage sorting,” in Proc. IEEE 29th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), May 2021, pp. 106–114.
[29] J. Cho, D. I. Maulana, and W. Jung, “A near-memory radix sort accelerator with parallel 1-bit sorter,” in Proc. IEEE 30th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), May 2022, p. 1.

Chien-Chia Ho received the B.S. and M.S. degrees in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2020 and 2022, respectively. He is currently a Senior Engineer with MediaTek Inc. His current research interests include image processing, very large-scale integrated chip design, and embedded systems.

Wei-Ting Chen received the B.S. and Ph.D. degrees from the Departments of Engineering Science and Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, in 2017 and 2021, respectively. He is currently a Senior Engineer with MediaTek Inc. His current research interests include image processing, very large-scale integrated chip design, and embedded systems.