
A High-Throughput Hardware Implementation of SHA-256 Algorithm

Yimeng Chen
Institute of Microelectronics, Tsinghua University, Beijing, China
Email: [email protected]

Shuguo Li
Institute of Microelectronics, Tsinghua University, Beijing, China
Email: [email protected]

Abstract—The SHA-256 algorithm is widely used in the field of security. In this paper, we propose a rescheduling method for the SHA-256 round computation. Based on the proposed rescheduling, we propose a design for SHA-256 in which the critical path is reduced. Our design is implemented on a Xilinx Virtex-4 FPGA. It achieves a throughput of 1984 Mbps with an area of 979 slices. Compared with other designs on FPGA, our design shows better performance in terms of throughput.

Index Terms—SHA-256, Round Computation, Rescheduling, FPGA

I. INTRODUCTION

Hash functions compress a message into a string of fixed length, which is called the message digest. The SHA-2 hash functions were issued by the National Institute of Standards and Technology (NIST) in 2002. All algorithms of the SHA-2 family are iterative [1]. In this paper, we focus on SHA-256, one representative algorithm of the SHA-2 family.

The SHA-256 algorithm is widely used in the field of security, for example in blockchains and message authentication codes (MAC). High-throughput hardware implementations of SHA-256 are therefore in high demand. Several techniques are widely used in the published work:

• Unrolled architecture [4], [7], [8]. It is widely used in implementations of iterative cryptographic algorithms; it improves the throughput but consumes more area.
• Parallel counters [2]. The round computation of SHA-256 can be regarded as the addition of multiple operands, so parallel counters can be used to reduce the area cost and the critical path.
• Pipelining [2]–[6]. Although a pipeline structure can improve the throughput, paper [4] emphasizes that registers cannot be added at will in a design for SHA-256, due to the feedback.
• Delay balancing [2]. This technique is used to balance the delays of different paths.

In this paper, we propose a rescheduling method for the SHA-256 round computation. Based on our rescheduling, we propose a design for SHA-256 in which some intermediate values required in the next round can be precalculated in the current round; the critical path is thereby reduced. This paper is organized as follows. Section II describes the background of the SHA-256 algorithm. Section III presents our design for SHA-256. Section IV shows the implementation results of our design on FPGA. Finally, Section V concludes this paper.

II. SHA-256 ALGORITHM

The SHA-256 algorithm consists of two stages: preprocessing and hash computation [1].

A. Preprocessing

The first step of preprocessing is message padding, which ensures that the length of the padded message is a multiple of 512 bits. Suppose that the message length is l bits before padding. The bit "1" and k zero bits are appended to the end of the message. According to the rules in [1], the value k is the smallest solution to the equation l + k + 1 ≡ 448 mod 512. Finally, the 64-bit binary representation of l is appended, which brings the padded length to a multiple of 512 bits. The padded message is then expressed as N 512-bit blocks, denoted M(1), M(2), ..., M(N).

B. Hash Computation

In the hash computation, the message blocks are processed in order. Each message block is processed as follows.

The block M(i) is expressed as sixteen 32-bit words, denoted M0(i), M1(i), ..., M15(i), and it is expanded into sixty-four 32-bit words as shown in (1) and (2). The details of the functions σ0 and σ1 are given in [1].

For 0 ≤ t ≤ 15,
    Wt = Mt(i)                                                          (1)

For 16 ≤ t ≤ 63,
    Wt = σ1(Wt−2) + Wt−7 + σ0(Wt−15) + Wt−16                            (2)

The word Wt is used in the round computation, as shown in Fig. 1. The round computation is performed 64 times, and the new intermediate hash values are computed by:

    H0(i) = A64 + H0(i−1)
    H1(i) = B64 + H1(i−1)
    ...
    H7(i) = H64 + H7(i−1)                                               (3)

When the next block M(i+1) is processed, the values H0(i) − H7(i) are used as the initial values of A − H. The functions σ0, σ1, Σ0, Σ1, Maj and Ch are given in [1].

After the final block has been processed, the 256-bit message digest is obtained.

Fig. 1. SHA-256 round computation

III. PROPOSED DESIGN FOR SHA-256

In each round of SHA-256, the computation of the variables A and E can be described as:

    At+1 = Σ0(At) + Maj(At, Bt, Ct) + Σ1(Et) + Ch(Et, Ft, Gt) + Kt + Wt + Ht
    Et+1 = Σ1(Et) + Ch(Et, Ft, Gt) + Kt + Wt + Ht + Dt                  (4)

The remaining variables can be obtained directly as:

    Bt+1 = At    Ct+1 = Bt    Dt+1 = Ct
    Ft+1 = Et    Gt+1 = Ft    Ht+1 = Gt                                 (5)

A. Basic Rescheduling

According to the work in [2], [3], [11], (4) can be rewritten as:

    At+1 = Σ0(At) + Maj(At, Bt, Ct) + Σ1(Et) + Ch(Et, Ft, Gt) + Pt
    Et+1 = Σ1(Et) + Ch(Et, Ft, Gt) + Qt                                 (6)

where Pt and Qt are computed by:

    Pt = Kt + Wt + Ht = Kt + Wt + Gt−1
    Qt = Kt + Wt + Ht + Dt = Kt + Wt + Gt−1 + Ct−1                      (7)

As shown in (2), the computation of Wt is independent of the variables A − H, and Kt (0 ≤ t ≤ 63) is a round constant. So Pt and Qt can be precalculated in the previous clock cycle. As a result, the round computation of SHA-256 is divided into two stages: the values Pt and Qt are precalculated in the first stage, and the values At+1 and Et+1 are computed in the second stage.

B. Proposed Rescheduling

In this paper, we propose a rescheduling method for the SHA-256 round computation. The values St and Nt are used to save the results of Σ1(Et) and Ch(Et, Ft, Gt), as shown in (8). The values Pt and Qt continue to be used.

    St = Σ1(Et)
    Nt = Ch(Et, Ft, Gt)                                                 (8)

So the value Et can be computed by:

    Et = St−1 + Nt−1 + Qt−1                                             (9)

Because Ft and Gt are given directly by Et−1 and Ft−1 respectively, the values St and Nt can be computed by:

    St = Σ1(St−1 + Nt−1 + Qt−1)
    Nt = Ch(St−1 + Nt−1 + Qt−1, Et−1, Ft−1)                             (10)

This means that the values St and Nt can also be precalculated in the previous clock cycle. So the round computation in (6) can be rewritten as:

    At+1 = Σ0(At) + Maj(At, Bt, Ct) + St + Nt + Pt
    Et+1 = St + Nt + Qt
    St+1 = Σ1(St + Nt + Qt)
    Nt+1 = Ch(St + Nt + Qt, Et, Ft)                                     (11)

Also, the first values of St and Nt can be computed by:

    S0 = Σ1(E0)
    N0 = Ch(E0, F0, G0)                                                 (12)

Further, the value At+1 can be computed by (13). The values M1 and M2 are the sum and carry vectors of a carry-save addition. This addition can be performed in parallel with the operations Σ0(At) and Maj(At, Bt, Ct). So the value At+1 can be regarded as the addition of 4 intermediate values.

    At+1 = Σ0(At) + Maj(At, Bt, Ct) + M1 + M2
    M1 + M2 = St + Nt + Pt                                              (13)

According to (7), (11), (12) and (13), a hardware structure of SHA-256 is proposed, as shown in Fig. 2. The values M1 and M2 are generated by a CSA (carry-save adder), and the addition of the 4 operands in (13) is implemented by a 4-2 compressor and an adder. The 4-2 compressor is introduced in [12] and its critical path is equal to three XORs.

The round computation is divided into two pipeline stages by the registers P and Q. Before the registers P and Q get their first values (denoted P0 and Q0 respectively), the pipeline is empty, so one clock cycle is needed to fill the pipeline before the first round computation. According to (7), the values D0 and H0 are necessary for the computation of P0 and Q0. When the pipeline is full, the values Ct and Gt are used to precalculate the values Pt and Qt. So Dt and Ht are only used in the first clock cycle of processing one block.
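As a sanity check on the rescheduling, the following Python sketch (helper names are ours; the inputs are random 32-bit words, so the real round constants are not needed) runs the original recurrence (4)-(5) and the rescheduled recurrences (7)-(12), written in the combined form (11), side by side and asserts that both produce identical sequences of A and E. It models only the word-level arithmetic; the pipeline timing and the carry-save split of (13) are not represented.

```python
import random

M32 = 0xFFFFFFFF
rotr   = lambda x, n: ((x >> n) | (x << (32 - n))) & M32
Sigma0 = lambda x: rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)
Sigma1 = lambda x: rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)
Ch     = lambda e, f, g: (e & f) ^ (~e & g)
Maj    = lambda a, b, c: (a & b) ^ (a & c) ^ (b & c)

def rounds_original(state, W, K):
    """Direct round computation, equations (4) and (5)."""
    a, b, c, d, e, f, g, h = state
    trace = []
    for t in range(64):
        a_next = (Sigma0(a) + Maj(a, b, c) + Sigma1(e) + Ch(e, f, g) + K[t] + W[t] + h) & M32
        e_next = (Sigma1(e) + Ch(e, f, g) + K[t] + W[t] + h + d) & M32
        a, b, c, d, e, f, g, h = a_next, a, b, c, e_next, e, f, g
        trace.append((a, e))
    return trace

def rounds_rescheduled(state, W, K):
    """Rescheduled form: Pt, Qt as in (7); St, Nt as in (8)-(12); updates as in (11)."""
    a, b, c, d, e, f, g, h = state
    s, n = Sigma1(e), Ch(e, f, g)                # (12): S0 and N0 from E0, F0, G0
    trace = []
    for t in range(64):
        p = (K[t] + W[t] + h) & M32              # (7): Pt (precalculated one cycle early in hardware)
        q = (p + d) & M32                        # (7): Qt
        a_next = (Sigma0(a) + Maj(a, b, c) + s + n + p) & M32   # (11)
        e_next = (s + n + q) & M32                              # (11), i.e. (9) shifted by one round
        s, n = Sigma1(e_next), Ch(e_next, e, f)                 # (11): uses Et, Ft before the shift
        a, b, c, d, e, f, g, h = a_next, a, b, c, e_next, e, f, g
        trace.append((a, e))
    return trace

rng   = random.Random(0)
state = [rng.getrandbits(32) for _ in range(8)]
W     = [rng.getrandbits(32) for _ in range(64)]
K     = [rng.getrandbits(32) for _ in range(64)]
assert rounds_original(state, W, K) == rounds_rescheduled(state, W, K)
```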

Fig. 2. Structure of SHA-256 with rescheduling

In Fig. 2, there are 6 registers, which store the variables A, B, C, E, F and G. Because Dt and Ht are only used at the beginning, the registers storing the variables D and H are removed in order to reduce the area cost. This design works as follows:

• In the first clock cycle, the values D0 and H0 are required. The registers C and G are loaded with the values D0 and H0 respectively. The values P0 and Q0 are computed in this cycle.
• In the second clock cycle, the values of D and H are not required. The registers A, B, C, E, F and G are loaded with their first values A0, B0, C0, E0, F0 and G0 respectively. Meanwhile, the registers S and N are loaded with their first values S0 and N0; as shown in (12), S0 and N0 can be computed by using E0, F0 and G0. Also, the registers P and Q get their first values P0 and Q0. The values A0 − H0 are the standard constants or the previous hash values.
• From the third clock cycle, the registers get their new values from the round computation. Considering that Dt+1 = Ct and Ht+1 = Gt, the final values of the variables D and H can be obtained in advance, although there are no registers for D and H in Fig. 2.

According to [9], the functions Maj and Ch can be expressed as:

    Maj(a, b, c) = (a ∧ b) ⊕ (a ∧ c) ⊕ (b ∧ c) = (a ∧ (b ∨ c)) ∨ (b ∧ c)
    Ch(e, f, g) = (e ∧ f) ⊕ (¬e ∧ g) = (e ∧ f) ∨ (¬e ∧ g)               (14)

Furthermore, the delay of Σ0 or Σ1 is comparable to the delay of a CSA. So the critical path is the computation of the next value of A, as shown in the gray region in Fig. 2. The delay of the critical path is equal to: delay(CSA) + delay(4-2 compressor) + delay(adder) + delay(multiplexer).

C. Further Optimization

The structure inside the gray region in Fig. 2 is further optimized; the remaining region in Fig. 2 is unchanged. The details of the optimization are as follows:

• Firstly, an adder is used to merge the vectors M1 and M2, so the 4-2 compressor can be replaced by a CSA, denoted csa0. But this leaves two adders in the critical path.
• Then the delay balancing technique introduced in [2] is used. The sum and carry vectors that csa0 outputs are stored in two registers As and Ac respectively, as shown in Fig. 3. So an adder is moved out of the critical path.

Fig. 3. Optimization of the structure inside the gray region in Fig. 2

So the delay of the critical path is reduced to: delay(CSA) + delay(adder) + delay(CSA) + delay(multiplexer).
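The optimization above relies on carry-save arithmetic: a CSA (or a 4-2 compressor) reduces several operands to a sum vector and a carry vector without propagating carries, and the expensive carry-propagate addition can be deferred, which is what registering the csa0 outputs in As and Ac achieves. The small Python sketch below (helper names are ours; a 4-2 compressor is modeled only functionally, as two cascaded CSAs, ignoring the gate-level delay difference) checks the property the datapath relies on, including M1 + M2 = St + Nt + Pt from (13).

```python
import random

M32 = 0xFFFFFFFF

def csa(x, y, z):
    """3:2 carry-save adder on 32-bit words: returns (sum_vec, carry_vec) such that
    sum_vec + carry_vec == x + y + z (mod 2^32), with no carry propagation."""
    sum_vec = x ^ y ^ z
    carry_vec = (((x & y) | (x & z) | (y & z)) << 1) & M32
    return sum_vec, carry_vec

def compress_4_to_2(w, x, y, z):
    """Four operands -> two vectors (functional model of a 4-2 compressor)."""
    s1, c1 = csa(w, x, y)
    return csa(s1, c1, z)

rng = random.Random(1)
for _ in range(1000):
    s, n, p = (rng.getrandbits(32) for _ in range(3))
    m1, m2 = csa(s, n, p)                        # M1 + M2 = St + Nt + Pt, as in (13)
    assert (m1 + m2) & M32 == (s + n + p) & M32  # the merge can be done later by a single adder
    a, b, c, d = (rng.getrandbits(32) for _ in range(4))
    u, v = compress_4_to_2(a, b, c, d)
    assert (u + v) & M32 == (a + b + c + d) & M32
```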

Several steps of processing one block change accordingly: when the register B is loaded with its first value B0, the register As is loaded with A0 and Ac is loaded with zero; after the final round, the values of As and Ac are added to the old A0, and this addition is implemented by a CSA and an adder.

IV. IMPLEMENTATION RESULTS AND COMPARISON

Our design for SHA-256 is implemented on a Xilinx Virtex-4 XC4VLX100-12 FPGA, using the Xilinx ISE 14.7 tool. The whole implementation also contains a message expander. The performance comparison of our design with some published designs is shown in Table I. The throughput and efficiency are computed by:

    Throughput = (Block size × Clock frequency) / (Clock cycles per block)
    Efficiency = Throughput / (Number of slices)                        (15)

                              TABLE I
                  COMPARISON OF DIFFERENT DESIGNS

Design             Device     Clock frequency  Clock cycles  Throughput  Area      Efficiency
                              (MHz)            per block     (Mbps)      (Slices)  (Mbps/Slice)
[4] (2x-unrolled)  Virtex-II  73.975           38            996.7       2032      0.491
[6]                Virtex-4   170.75           65            1344.98     610       2.2
[10]               Virtex-4   50.06            280           91.53       422       0.22
[13] (Case I)      Virtex-4   238              321           379.6       382       0.99
[13] (Case II)     Virtex-4   222              129           881.1       485       1.82
Our design         Virtex-4   255.7            66            1984        979       2.02

The block size for SHA-256 is 512 bits. Our design requires 66 clock cycles to process a 512-bit block: one for the computation of P0 and Q0, one for the final addition, and 64 for the round computation. With a clock frequency of 255.7 MHz, (15) gives a throughput of 512 × 255.7 / 66 ≈ 1984 Mbps.

The 2x-unrolled design in [4] combines the pipeline and unrolling techniques, which allows two rounds to be performed in one clock cycle. The design in [6] is based on the pipeline architecture, and carry-skip adders are used to improve the performance of the hardware core. The design in [10] is a compact FPGA processor for SHA-256. The designs in [13] use the architectural folding technique: Case I is folded by a factor of 5 and has the lowest area cost; Case II is folded by a factor of 2 and has a better balance between area and throughput than Case I.

Among the designs in Table I, our design has the highest throughput. The design in [6] has a higher efficiency than ours, but the throughput of our design is 47% higher than that of the design in [6].

V. CONCLUSION

Our rescheduling method is effective for reducing the critical path in the SHA-256 hardware implementation. The implementation results show that our design based on the proposed rescheduling achieves a significant improvement in throughput. Further, the structure in our design for SHA-256 can be extended to implementations of all the SHA-2 family.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grant No. 61974083 and No. 61674086. The corresponding author is Shuguo Li.

REFERENCES

[1] National Institute of Standards and Technology (NIST), "Announcing the secure hash standard," Federal Information Processing Standards Publication 180-2 (FIPS PUB 180-2), August 2002.
[2] L. Dadda, M. Macchetti, and J. Owen, "The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)," in Proceedings Design, Automation and Test in Europe Conference and Exhibition, Feb 2004, pp. 70–75.
[3] L. Dadda, M. Macchetti, and J. Owen, "An ASIC design for a high speed implementation of the hash function SHA-256 (384, 512)," in ACM Great Lakes Symposium on VLSI, 2004, pp. 421–425.
[4] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 Family of Hash Functions on FPGAs," in IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06), March 2006, pp. 317–322.
[5] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, "Cost-Efficient SHA Hardware Accelerators," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 8, pp. 999–1008, Aug 2008.
[6] M. Padhi and R. Chaudhari, "An optimized pipelined architecture of SHA-256 hash function," in 2017 7th International Symposium on Embedded Computing and System Design (ISED), Dec 2017, pp. 1–4.
[7] H. Michail, A. Kakarountas, A. Milidonis, and C. Goutis, "A Top-Down Design Methodology for Ultrahigh-Performance Hashing Cores," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 4, pp. 255–268, Oct 2009.
[8] H. E. Michail, G. S. Athanasiou, V. Kelefouras, G. Theodoridis, and C. E. Goutis, "On the Exploitation of a High-Throughput SHA-256 FPGA Design for HMAC," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 5, no. 1, pp. 1–28, 2012.
[9] F. Opritoiu, S. L. Jurj, and M. Vladutiu, "Technological solutions for throughput improvement of a Secure Hash Algorithm-256 engine," in 2017 IEEE 23rd International Symposium for Design and Technology in Electronic Packaging (SIITME), Oct 2017, pp. 159–164.
[10] R. García, I. Algredo-Badillo, M. Morales-Sandoval, C. Feregrino-Uribe, and R. Cumplido, "A compact FPGA-based processor for the Secure Hash Algorithm SHA-256," Computers & Electrical Engineering, vol. 40, no. 1, pp. 194–202, 2014.
[11] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Morales-Sandoval, "FPGA-based implementation alternatives for the inner loop of the Secure Hash Algorithm SHA-256," Microprocessors and Microsystems, vol. 37, no. 6, pp. 750–757, 2013.
[12] K. Prasad and K. K. Parhi, "Low-power 4-2 and 5-2 compressors," in Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 1, Nov 2001, pp. 129–133.
[13] M. M. Wong, V. Pudi, and A. Chattopadhyay, "Lightweight and High Performance SHA-256 using Architectural Folding and 4-2 Adder Compressor," in 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Oct 2018, pp. 95–100.
family.

