Implementing a Very High-speed Secure Hash Algorithm 3 Accelerator Based on PCI-express
Corresponding Author:
Tan-Phat Dang
University of Science, Vietnam National University
Ho Chi Minh City, Vietnam
Email: [email protected]
1. INTRODUCTION
The increasing demand for massive amounts of real-time data transformation and processing necessitates high-performance data servers. Such a server acquires data from various sources, such as remote devices on the Internet, and then transmits it to dedicated hardware like graphics processing units (GPUs) or hardware accelerators through burst or streaming mechanisms. Peripheral component interconnect express (PCIe) has been utilized to enhance the data transfer rate to dedicated hardware [1], [2]. Operating on a point-to-point topology, PCIe enables devices to communicate directly with other components without sharing bandwidth with other devices on the bus. PCIe utilizes multiple independent lanes, ranging from one to 32, for data transfer between the host and the end device. Each lane comprises two pairs of differential signaling wires, one for transmitting data (Tx) and one for receiving data (Rx). Accordingly, the PCIe operation speed ranges from 2.5 GT/s for Gen1 to 32 GT/s for Gen5. Furthermore, the implementation of direct memory access (DMA) for PCIe to eliminate central processing unit (CPU) intervention has appeared in earlier works [1], [3], which involves transferring data from the main memory of the host device to a temporary DMA local
Int J Reconfigurable & Embedded Syst, Vol. 14, No. 1, March 2025: 1–11
Int J Reconfigurable & Embedded Syst ISSN: 2089-4864 ❒ 3
The remaining sections of the paper are organized as follows. Section 2 provides background information on the SHA-3 algorithms. Our hardware design is comprehensively analyzed in section 3, covering the model connecting the SHA-3 accelerator to the PC through PCIe, the configurable buffers, and the multiple KECCAK-f architecture. Evaluation and comparison of our results with other approaches are presented in section 4. Finally, section 5 concludes the paper.
2. SHA-3 PRELIMINARY
The construction of SHA-3 differs from the Merkle–Damgård design of SHA-1 and SHA-2, instead adopting the SPONGE construction [9], which comprises two main phases, absorbing and squeezing, as shown in Figure 1. Prior to these phases, the input message m of arbitrary length undergoes a padding process. This ensures that the message is expanded to a multiple of r bits (1152, 1088, 832, or 576 bits) by appending the pattern "10*1". However, the SHA-3 hash function additionally requires that the message m append the suffix "01" to support domain separation [? ]. Consequently, the pattern "0110*1" is appended to message m, as illustrated in Figure 1. During the absorbing phase, the padded message m is partitioned into blocks of size r. Each r-sized block is then combined with the c capacity bits to form a 1600-bit block, which is subsequently processed by the KECCAK-f function, block after block, until all blocks are consumed. In the squeezing phase, the length of the output d (224, 256, 384, or 512 bits) varies depending on the selected mode.
Figure 1. The SPONGE construction of SHA-3, comprising the absorbing and squeezing phases built on the KECCAK-f permutation (f), where r = 1152/1088/832/576 bits and d = 224/256/384/512 bits for SHA3-224/256/384/512, and c = b − r with b = 1600 bits
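As a concrete illustration of the padding just described, the following Python sketch (an illustrative reading of the rule, assuming byte-aligned messages and the LSB-first bit convention of FIPS 202) appends the "01" suffix followed by the 10*1 pattern so the result is a multiple of the rate:

```python
def sha3_pad(message: bytes, rate_bytes: int) -> bytes:
    # "01" suffix + pad10*1 in LSB-first byte form: 0x06 ... 0x00 ... 0x80,
    # collapsing to the single byte 0x86 when only one pad byte fits.
    padded = bytearray(message)
    pad_len = rate_bytes - (len(message) % rate_bytes)
    if pad_len == 1:
        padded.append(0x86)          # suffix and final 1-bit share one byte
    else:
        padded.append(0x06)          # "01" suffix + first padding 1-bit
        padded.extend([0x00] * (pad_len - 2))
        padded.append(0x80)          # final 1-bit of pad10*1
    return bytes(padded)
```

For SHA3-256 the rate is 1088 bits (136 bytes), so an empty message pads to exactly one 136-byte block.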
The KECCAK-f function is a fundamental component of SHA-3 and is utilized in both the absorbing and squeezing phases. The input message is converted into a three-dimensional array, indexed by x, y, and z, forming a 5 × 5 × 64 state array. The KECCAK-f function operates on this state array over 24 rounds of a round function (Rnd), with each round consisting of five step mappings: θ (theta), ρ (rho), π (pi), χ (chi), and ι (iota).
Implementing a very high-speed secure hash algorithm 3 accelerator based on PCI ... (Huu-Thuan Huynh)
Figure 2. Overall architecture and data flow for the SHA-3 accelerator at the system level: (a) system architecture, (b) data flow at the system level, and (c) the three stages involved in transferring data from the system memory to the SHA-3 accelerator
Figure 3. The proposed SHA-3 architecture in detail: (a) the buffer and multiple KECCAK-f architectures, (b) the two Rnd architectures (re-scheduled and sub-pipelined), and (c) the θ and (ρ - π - χ - ι) architectures
3.2. Buffer
To facilitate flexibility in handling different modes, we introduce a configurable buffer capable of switching between BI and BO based on the selected mode. Each mode requires a distinct block size r, as illustrated in Figure ??. For SHA3-224 mode, with a data width of 256 bits, the input block size is 1152 bits, corresponding to five BIs. Similarly, the SHA3-256/384/512 modes require 5/4/3 BIs, respectively. Conversely, the output size for the SHA3-224/256/384/512 modes on the 256-bit basis is 1/1/2/2 BOs. To optimize buffer utilization, we propose four BIs (BI 0 to BI 3), one BO, and one BIO, as depicted in Figure 3(a). The buffers cascade into each other, with the output of the preceding buffer serving as the input of the subsequent one. BIs only accept new input when the valid in signal is activated; otherwise, they retain
the current value. BIO is slightly more complex than BI: it receives data from the preceding buffer when the sel signal is triggered; otherwise, it functions as a BO, receiving hash values from the multiple KECCAK-f units. BO is responsible for retrieving hash values and serially pushing them to RAM Out. For the SHA3-224/256/384/512 modes, it takes 1/1/2/2 clock cycles to complete writing data to RAM Out. To streamline complexity, the receiving process in BI takes 4/4/5/5 clock cycles for the SHA3-224/256/384/512 modes, respectively. Each datum loaded into BI requires one clock cycle. Therefore, if a mode does not provide sufficient data within those clock cycles, zero inputs are inserted. For example, in the case of SHA3-512, which requires three blocks of 256 bits, the subsequent two blocks consist of zeros.
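The per-mode buffer counts above follow from simple ceiling divisions of the block size r and the digest size d by the 256-bit data width. A small sketch (the names `RATE`, `BUS`, and `beats` are illustrative, not taken from the design):

```python
import math

RATE = {224: 1152, 256: 1088, 384: 832, 512: 576}  # r bits per SHA-3 mode
BUS = 256                                           # data-path width in bits

def beats(mode):
    """Return (input buffers, output buffers) of 256 bits needed per mode."""
    n_in = math.ceil(RATE[mode] / BUS)   # BIs (incl. BIO) to hold one r-bit block
    n_out = math.ceil(mode / BUS)        # BOs to hold one d-bit digest
    return n_in, n_out
```

This reproduces the 5/5/4/3 input and 1/1/2/2 output counts stated for SHA3-224/256/384/512.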
3.3. Multiple KECCAK-f
The multiple KECCAK-f module comprises a Mapping block and three KECCAK-f instances, as illustrated in Figure 3(a). The 1152-bit data from the preceding phase is fed into the Mapping block, which appends zeros to expand it to 1600 bits. In our design, we opt for three KECCAK-f instances to reduce the input-data interval to eight clock cycles. The output of each KECCAK-f instance is 512 bits, and depending on the selected mode, truncation is applied to the output data.
In this work, we introduce two architectures for KECCAK-f: the re-scheduled and the sub-pipelined architecture, depicted in Figure 3(b). In a conventional architecture, the sequence of steps is θ - ρ - π - χ - ι, with a register placed at the end of the ι step to mark the completion of one round [? ]. Our re-scheduled architecture reorders these steps to ρ - π - χ - ι - θ by inserting a register between the θ and ρ steps. As a result, the re-scheduled architecture requires 25 repetitions to complete the hash value, one more than the base architecture. During the first repetition, only the θ step is executed, while each remaining repetition executes all steps in the sequence ρ - π - χ - ι - θ. The re-scheduled architecture offers higher efficiency than the conventional architecture. This is proven via synthesis results on the Stratix 10 device: the re-scheduled architecture achieves a frequency of 336.36 MHz, surpassing the conventional architecture's 321.85 MHz by 4.31%. Moreover, the re-scheduled architecture utilizes fewer resources, reducing adaptive logic module (ALM) utilization by 16.67% (4214 ALMs compared to 5057 ALMs in the conventional architecture).
Unlike previous works [28], [29], where the sub-pipelined technique typically inserts two registers, one between the π and χ steps or between the θ and ρ steps and another at the end of the ι step, our sub-pipelined architecture places one register between the θ and ρ steps and another register before the θ step. This decision is based on the observation that the critical path of the θ step is longer than that of the ρ, π, χ, and ι steps. Specifically, the θ step requires at least four XOR gate levels to complete, while the remaining steps need only one AND and two XOR gates, as shown in Figure 3(c). By isolating the θ step, we improve the delay of KECCAK-f. However, adding the register inside the round doubles the number of clock cycles required, to 48. To mitigate this increase, our design handles two messages simultaneously at two different stages. For example, if message 1 is processed in the θ stage, message 2 is processed in the (ρ - π - χ - ι) stage; in the next clock cycle, message 1 moves to the (ρ - π - χ - ι) stage while message 2 transitions to the θ stage. Thus, the average time to generate one hash value remains 24 clock cycles. The advantage of our sub-pipelined architecture is that it increases the frequency while keeping the effective number of clock cycles at 24, thereby increasing throughput.
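A toy model of this two-message interleaving (purely illustrative; the dictionary keys and stage names stand in for the actual hardware) shows that 48 clock cycles complete two hashes, i.e., 24 cycles per hash on average:

```python
# Toy model of the sub-pipelined schedule: two messages alternate between
# the isolated theta stage and the merged (rho-pi-chi-iota) stage each
# clock, so both pipeline stages stay busy every cycle.
def simulate(rounds=24):
    progress = {"msg1": 0, "msg2": 0}   # rounds completed per message
    in_theta, in_rest = "msg1", "msg2"
    clocks = 0
    while min(progress.values()) < rounds:
        progress[in_rest] += 1          # leaving (rho-pi-chi-iota) ends a round
        in_theta, in_rest = in_rest, in_theta
        clocks += 1
    return clocks, progress
```

Running `simulate()` reports 48 clocks for two finished messages, matching the 24-cycle average stated above.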
In both the re-scheduled and sub-pipelined architectures, the five steps are consistently grouped into two parts: θ and (ρ - π - χ - ι). The formulation of the θ step is optimized by combining C[x] and D[x], as indicated by the red area in the θ part of Figure 3(c), denoted as CD[x] in (1). CD[x] serves as the shared element utilized by every A[x, y], so only two levels of XOR operations are needed, reducing the delay of the θ step.
CD[x] = A[x − 1, 0] ⊕ A[x − 1, 1] ⊕ A[x − 1, 2] ⊕ A[x − 1, 3] ⊕ A[x − 1, 4] ⊕ ROT(A[x + 1, 0], 1) ⊕ ROT(A[x + 1, 1], 1) ⊕ ROT(A[x + 1, 2], 1) ⊕ ROT(A[x + 1, 3], 1) ⊕ ROT(A[x + 1, 4], 1) (1)

A[x, y] = A[x, y] ⊕ CD[x]
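In software terms, (1) computes one shared CD[x] per column index and XORs it into the five lanes of column x. A compact Python sketch over a 5 × 5 array of 64-bit lanes (written directly from the equation, not from the RTL):

```python
def rot(v, n):
    # 64-bit left rotation of a lane
    return ((v << n) | (v >> (64 - n))) & ((1 << 64) - 1)

def theta(A):
    # A[x][y] is a 64-bit lane; C[x] is the parity of column x.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # CD[x] merges C[x-1] and ROT(C[x+1], 1), as in (1).
    CD = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ CD[x] for y in range(5)] for x in range(5)]
```

Because CD[x] is shared by all five lanes of a column, each lane sees only two XOR levels after the column parities are formed.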
The hardware implementation of the combined (ρ + π) steps uses only net connections, which require no additional resources or delay, following the combination of the (ρ + π) steps illustrated in [? ]. The combination of the (ρ - π - χ - ι) steps is depicted in Figure 3(c). Unlike previous works [? ], which use the full 64-bit RC, we simplify this process by storing only the bits of RC that can be non-zero. Therefore, only bit positions 0, 1, 3, 7, 15, 31, and 63 are stored, effectively reducing resource usage.
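Since every KECCAK round constant is non-zero only at bit positions of the form 2^j − 1, each 64-bit RC compresses to seven stored bits. A sketch of the re-expansion (the packing order chosen here is an assumption for illustration):

```python
# Round constants are non-zero only at bit positions 2^j - 1 for j = 0..6.
NONZERO_POS = [0, 1, 3, 7, 15, 31, 63]

def expand_rc(packed: int) -> int:
    """Rebuild a 64-bit round constant from its 7 stored bits."""
    rc = 0
    for j, pos in enumerate(NONZERO_POS):
        if (packed >> j) & 1:
            rc |= 1 << pos
    return rc
```

For example, seven stored bits suffice to recover 0x0000000000008082, the constant whose 1-bits sit at positions 1, 7, and 15.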
TP = (#bit × Fmax) / #clock (2)

Eff. = TP / Area (3)
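For instance, substituting the MRS figures for SHA3-224 from Table 1 into (3):

```latex
\mathrm{Eff.} = \frac{TP}{Area}
             = \frac{35.55\ \text{Gbps}}{8273\ \text{ALM}}
             = \frac{35550\ \text{Mbps}}{8273\ \text{ALM}}
             \approx 4.30\ \text{Mbps/ALM}
```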
The throughput measurement results of the two architectures, MRS and MSS, on the DE10-Pro are presented in Table 1. To determine the number of clock cycles (#clock), each architecture is equipped with a counter that records the elapsed clocks, starting immediately when the core begins operation and stopping upon completion of the process. The data size for throughput measurement of the MRS and MSS architectures is tested up to 32 KB. Combining the data length with the operating frequencies of MRS and MSS, which are 280 MHz and 380 MHz, respectively, we compute the throughput for each mode. Specifically, the throughput of MRS is 35.55 Gbps, 33.60 Gbps, 27.69 Gbps, and 19.23 Gbps for the SHA3-224, SHA3-256, SHA3-384, and SHA3-512 modes, respectively. Similarly, for MSS, the throughput is 43.12 Gbps, 41.20 Gbps, 36.27 Gbps, and 25.11 Gbps for the same modes. The resources utilized by MRS and MSS are obtained from the Quartus tool: MRS utilizes 8273 ALMs and 8374 registers, while MSS utilizes 9485 ALMs and 12832 registers. As a result, the efficiency of MRS and MSS across all modes is as follows: 4.30 Mbps/ALM, 4.06 Mbps/ALM, 3.35 Mbps/ALM, and 2.32 Mbps/ALM for MRS, and 4.55 Mbps/ALM, 4.34 Mbps/ALM, 3.82 Mbps/ALM, and 2.65 Mbps/ALM for MSS.
Figure 4. The experimental results of MRS and MSS across all modes: (a) the relationship between throughput and different RAM In sizes, (b) the data flow timing chart of the multiple KECCAK-f units, and (c) the relationship between throughput and different numbers of KECCAK-f units
Table 1. The implementation results of the MRS and MSS architectures on DE10-Pro

Architecture | Freq. (MHz) | Area (ALM) | Reg. | TP (Gbps) 224/256/384/512 | Eff. (Mbps/ALM) 224/256/384/512
MRS | 280 | 8273 | 8374 | 35.55 / 33.60 / 27.69 / 19.23 | 4.30 / 4.06 / 3.35 / 2.32
MSS | 380 | 9485 | 12832 | 43.12 / 41.20 / 36.27 / 25.11 | 4.55 / 4.34 / 3.82 / 2.65
Table 2. The comparison of KECCAK-f computation architectures between our proposals and FPGA-based works on Virtex 7

Reference | [? ], 2022 | [? ], 2023 | Our MRK | Our MSK
Approach | Dual Rnd | Unrolling factor of 2 | - | -
Fmax (MHz) | - | 378.73 | 380.95 | 485.67
Area (slice) | 1521 | 1375 | 3203 | 2917
Register | - | - | 4831 | 9669
#clock/hash | 12 | 12 | 8 | 8
TP (Gbps)* | 22.90 | 18.18 | 27.43 | 34.97
Eff. (Mbps/slice)* | 15.11 | 13.22 | 8.56 | 11.99
* For SHA3-512 mode
Our MRK architecture requires 3203 slices and 4831 registers, operating at a maximum frequency of 380.95 MHz and achieving a throughput of 27.43 Gbps with an efficiency of 8.56 Mbps/slice. The MSK architecture, aimed at reducing the critical path of Rnd, utilizes more registers than MRK (9669 vs. 4831). However, MSK outperforms MRK in both throughput and efficiency, achieving 34.97 Gbps and 11.99 Mbps/slice, respectively.
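The Table 2 throughputs follow directly from (2) with #bit = r = 576 bits (SHA3-512) and #clock = 8; a quick numeric check:

```python
# Throughput from (2) for SHA3-512: r = 576 bits absorbed every 8 clocks.
def tp_gbps(fmax_mhz, bits=576, clocks=8):
    return bits * fmax_mhz / clocks / 1000  # bits x MHz / clocks -> Gbps

print(round(tp_gbps(380.95), 2))  # MRK: 27.43
print(round(tp_gbps(485.67), 2))  # MSK: 34.97
```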
Sravani and Durai [? ] proposed the dual Rnd architecture, which cascades two Rnd units, each consisting of the five steps (θ - ρ - π - χ - ι) followed by a register, to halve the number of clock cycles (#clock/hash = 12), achieving a throughput of 22.90 Gbps. The throughputs of our two architectures, MRK and MSK, are 1.20 times (27.43 vs. 22.90) and 1.53 times (34.97 vs. 22.90) higher than that of the dual Rnd architecture. While our architectures prioritize high performance, their efficiency is somewhat lower than that of the dual Rnd architecture of Sravani and Durai [? ], with MRK at 0.56 times (8.56 vs. 15.11) and MSK at 0.79 times (11.99 vs. 15.11). Nevertheless, the throughput gain of our MSK architecture (53%) surpasses the efficiency gain of their dual Rnd architecture (26%).
When comparing our proposals with that of Sideris et al. [? ], who applied an unrolling factor of 2 to halve the number of clock cycles (#clock/hash = 12), we observe significant throughput improvements for both our MRK and MSK architectures. Specifically, our MRK architecture achieves a throughput 1.51 times higher (27.43 vs. 18.18), and our MSK architecture a throughput 1.92 times higher (34.97 vs. 18.18), than the proposal of Sideris et al. [? ]. Despite these substantial throughput improvements, our efficiency is slightly lower, with MRK at 0.65 times (8.56 vs. 13.22) and MSK at 0.91 times (11.99 vs. 13.22). Nonetheless, this decrease in efficiency is modest compared to the notable throughput accelerations of 51% and 92% for MRK and MSK, respectively.
5. CONCLUSION
The demand for high-performance hash functions in modern applications has emerged, especially for the latest hashing standard, SHA-3. This paper proposes improvements to SHA-3 throughput. Specifically, a full SHA-3 architecture is presented, from the buffers and the arithmetic core, KECCAK-f, to integration at the system level. The proposed architectures are implemented on an FPGA platform connected to a PC via PCIe, which boosts the data transfer rate and is widely used in modern applications. The issue of data transfer in PCIe's DMA is analyzed and resolved through the implementation of ping-pong memory and the selection of appropriate memory sizes. Furthermore, the configurable BIO is presented to support multiple SHA-3 modes and minimize the number of buffer instances; this benefits modern applications that require various output lengths of the hash values. The proposed architectures, MRS and MSS, achieve high throughputs of up to 35.55 Gbps and 43.12 Gbps, respectively, thanks to the multiple KECCAK-f units combined with either the re-scheduled or the sub-pipelined architecture. MSS demonstrates greater efficiency than MRS, for instance 4.55 Mbps/ALM vs. 4.30 Mbps/ALM for the SHA3-224 mode. In addition, our
MRK and MSK achieve 27.43 Gbps and 34.97 Gbps, respectively, for SHA3-512 mode when implemented on Virtex 7.
REFERENCES
[1] L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber, “A PCIe DMA architecture for multi-gigabyte per second data transmission,” IEEE Transactions on Nuclear Science, vol. 62, no. 3, pp. 972–976, 2015, doi: 10.1109/TNS.2015.2426877.
[2] J. Liu, J. Wang, Y. Zhou, and F. Liu, “A cloud server oriented FPGA accelerator for LSTM recurrent neural network,” IEEE Access, vol. 7, pp. 122408–122418, 2019, doi: 10.1109/ACCESS.2019.2938234.
[3] H. Kavianipour, S. Muschter, and C. Bohm, “High performance FPGA-based DMA interface for PCIe,” IEEE Transactions on Nuclear Science, vol. 61, no. 2, pp. 745–749, 2014, doi: 10.1109/RTC.2012.6418352.
[4] J.-S. Ng, J. Chen, K.-S. Chong, J. S. Chang, and B.-H. Gwee, “A highly secure FPGA-based dual-hiding asynchronous-logic AES accelerator against side-channel attacks,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 9, pp. 1144–1157, 2022, doi: 10.1109/TVLSI.2022.3175180.
[5] M. Zeghid, H. Y. Ahmed, A. Chehri, and A. Sghaier, “Speed/area-efficient ECC processor implementation over GF(2^m) on FPGA via novel algorithm-architecture co-design,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 8, pp. 1192–1203, 2023, doi: 10.1109/TVLSI.2023.3268999.
[6] S. Shin and T. Kwon, “A privacy-preserving authentication, authorization, and key agreement scheme for wireless sensor networks in 5G-integrated internet of things,” IEEE Access, vol. 8, pp. 67555–67571, 2020, doi: 10.1109/ACCESS.2020.2985719.
[7] S. Jiang, X. Zhu, and L. Wang, “An efficient anonymous batch authentication scheme based on HMAC for VANETs,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2193–2204, 2016, doi: 10.1109/TITS.2016.2517603.
[8] L. Zhou, C. Su, and K.-H. Yeh, “A lightweight cryptographic protocol with certificateless signature for the internet of things,” ACM
Transactions on Embedded Computing Systems (TECS), vol. 18, no. 3, pp. 1–10, 2019, doi: 10.1145/3301306.
[9] Federal Information Processing Standards Publication, “SHA-3 standard: permutation-based hash and extendable-output functions,”
Aug. 2015, doi: 10.6028/NIST.FIPS.202.
[10] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, “The first collision for full SHA-1,” in Advances in Cryptology – CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20–24, 2017, Proceedings, Springer, 2017, pp. 570–596, doi: 10.1007/978-3-319-63688-7_19.
[11] X. Zhang, Z. Zhou, and Y. Niu, “An image encryption method based on the Feistel network and dynamic DNA encoding,” IEEE Photonics Journal, vol. 10, no. 4, pp. 1–14, 2018, doi: 10.1109/JPHOT.2018.2859257.
[12] C. Zhu and K. Sun, “Cryptanalyzing and improving a novel color image encryption algorithm using RT-enhanced chaotic tent maps,” IEEE Access, vol. 6, pp. 18759–18770, 2018, doi: 10.1109/ACCESS.2018.2817600.
[13] W.-K. Lee, R. C.-W. Phan, B.-M. Goi, L. Chen, X. Zhang, and N. N. Xiong, “Parallel and high speed hashing in GPU for
telemedicine applications,” IEEE Access, vol. 6, pp. 37 991–38 002, 2018, doi: 10.1109/ACCESS.2018.2849439.
[14] M. Sravani and S. A. Durai, “Bio-hash secured hardware e-health record system,” IEEE Transactions on Biomedical Circuits and
Systems, 2023, doi: 10.1109/TBCAS.2023.3263177.
[15] M. De Donno, K. Tange, and N. Dragoni, “Foundations and evolution of modern computing paradigms: Cloud, IoT, edge, and fog,”
IEEE Access, vol. 7, pp. 150 936–150 948, 2019, doi: 10.1109/ACCESS.2019.2947652.
[16] T.-Y. Wu, Z. Lee, M. S. Obaidat, S. Kumari, S. Kumar, and C.-M. Chen, “An authenticated key exchange protocol for multi-server architecture in 5G networks,” IEEE Access, vol. 8, pp. 28096–28108, 2020, doi: 10.1109/ACCESS.2020.2969986.
[17] W.-K. Lee, K. Jang, G. Song, H. Kim, S. O. Hwang, and H. Seo, “Efficient implementation of lightweight hash functions on GPU and quantum computers for IoT applications,” IEEE Access, vol. 10, pp. 59661–59674, 2022, doi: 10.1109/ACCESS.2022.3179970.
[18] Z. Liu, L. Ren, Y. Feng, S. Wang, and J. Wei, “Data integrity audit scheme based on quad Merkle tree and blockchain,” IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3240066.
[19] S. Islam, M. J. Islam, M. Hossain, S. Noor, K.-S. Kwak, and S. R. Islam, “A survey on consensus algorithms in blockchain-based
applications: architecture, taxonomy, and operational issues,” IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3267047.
[20] H. Cho, “ASIC-resistance of multi-hash proof-of-work mechanisms for blockchain consensus protocols,” IEEE Access, vol. 6, pp. 66210–66222, 2018, doi: 10.1109/ACCESS.2018.2878895.
[21] H. Choi and S. C. Seo, “Fast implementation of SHA-3 in GPU environment,” IEEE Access, vol. 9, pp. 144574–144586, 2021, doi: 10.1109/ACCESS.2021.3122466.
[22] H. Bensalem, Y. Blaquière, and Y. Savaria, “An efficient OpenCL-based implementation of a SHA-3 co-processor on an FPGA-centric platform,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 3, pp. 1144–1148, 2022, doi: 10.1109/TCSII.2022.3223179.
[23] M. M. Sravani and S. A. Durai, “On efficiency enhancement of SHA-3 for FPGA-based multimodal biometric authentication,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 4, pp. 488–501, 2022, doi: 10.1109/TVLSI.2022.3148275.
[24] S. El Moumni, M. Fettach, and A. Tragha, “High throughput implementation of SHA3 hash algorithm on field programmable gate array (FPGA),” Microelectronics Journal, vol. 93, p. 104615, 2019, doi: 10.1016/j.mejo.2019.104615.
[25] B. Li, Y. Yan, Y. Wei, and H. Han, “Scalable and parallel optimization of the number theoretic transform based on FPGA,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 2023, doi: 10.1109/TVLSI.2023.3312423.
[26] A. Sideris, T. Sanida, and M. Dasygenis, “Hardware acceleration design of the SHA-3 for high throughput and low area on FPGA,”
Journal of Cryptographic Engineering, pp. 1–13, 2023, doi: 10.1007/s13389-023-00334-0.
[27] H. E. Michail, L. Ioannou, and A. G. Voyiatzis, “Pipelined SHA-3 implementations on FPGA: architecture and performance
analysis,” in Proceedings of the Second Workshop on Cryptography and Security in Computing Systems, 2015, pp. 13–18, doi:
10.1145/2694805.2694808.
[28] G. S. Athanasiou, G.-P. Makkas, and G. Theodoridis, “High throughput pipelined FPGA implementation of the new SHA-3 cryptographic hash algorithm,” in 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP), IEEE, 2014, pp. 538–541, doi: 10.1109/ISCCSP.2014.6877931.
[29] M. M. Wong, J. Haj-Yahya, S. Sau, and A. Chattopadhyay, “A new high throughput and area efficient SHA-3 implementation,” in
2018 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2018, pp. 1–5, doi: 10.1109/ISCAS.2018.8351649.
[30] Intel, L-Tile and H-Tile Avalon® Memory-Mapped Intel® FPGA IP for PCI Express* User Guide (version 23.4), 2024. [Online]. Available: https://www.intel.com/content/www/us/en/docs/programmable/683667/23-4/introduction.html (accessed Apr. 10).
[31] H. Mestiri and I. Barraj, “High-speed hardware architecture based on error detection for Keccak,” Micromachines, vol. 14, no. 6, p. 1129, 2023, doi: 10.3390/mi14061129.
[32] A. Arshad, D.-e.-S. Kundi, and A. Aziz, “Compact implementation of SHA3-512 on FPGA,” in 2014 Conference on Information
Assurance and Cyber Security (CIACS), 2014, pp. 29–33, doi: 10.1109/CIACS.2014.6861327.
BIOGRAPHIES OF AUTHORS
Huu-Thuan Huynh received the B.S., M.S., and Ph.D. degrees in radio physics and electronics from the University of Science, Ho Chi Minh City (HCMUS), in 1997, 2001, and 2010, respectively. Since 2006, he has been with the Faculty of Electronics and Telecommunications (FETEL), HCMUS. His current research interests are SoC FPGA-based real-time digital signal processing. He can be contacted at email: [email protected].
Tuan-Kiet Tran received a bachelor of science (B.Sc.) degree and a master of science
(M.S.) degree in electronics telecommunications engineering and electronics engineering, awarded
by the University of Science, Ho Chi Minh City (HCMUS) in 2017 and 2019, respectively. He
currently serves as a faculty member at the Faculty of Electronics and Telecommunications (FETEL)
at HCMUS, Vietnam. His research interests include hardware accelerators for cryptography and
parallel processing in AI. He can be contacted at email: [email protected].
Tan-Phat Dang received a bachelor of science (B.Sc.) degree and a master of science
(M.S.) degree in electronics telecommunications engineering and electronics engineering, awarded
by the University of Science, Ho Chi Minh City (HCMUS) in 2018 and 2023, respectively. Presently,
he is an active member of the Faculty of Electronics and Telecommunications (FETEL) at HCMUS,
Vietnam. His primary focus lies in FPGA-based hardware accelerators for cryptography, digital
signal processing, and video compressing. He can be contacted at email: [email protected].