Optimization of Advanced Encryption Standard (AES) Using Vivado High Level Synthesis (HLS)
Electrical and Computer Engineering
Boise State University, Boise, ID 83725
Abstract
Advanced Encryption Standard (AES) is a fundamental building block of many network
security protocols, ensuring data confidentiality in applications ranging from data servers
to low-power embedded hardware systems. To optimize such hardware implementations,
High-Level Synthesis (HLS) provides flexibility in designing and rapidly optimizing
dedicated hardware to meet design constraints. In this paper, we present the implementation
of an AES encryption processor on an FPGA using Xilinx Vivado HLS. The AES architecture
was analyzed and designed using loop unrolling and inner-round and outer-round pipelining
techniques to achieve a maximum throughput of up to 1290 Mbps (megabits per second)
while using only 3.24% of the FPGA's slices, achieving 3 Mbps per slice.
keywords: Advanced Encryption Standard, AES, High Level Synthesis, HLS, Optimization,
High throughput, Low area resources, Zynq, FPGA.
1 Introduction
Advanced Encryption Standard (AES) is a standardized algorithm approved by the National
Institute of Standards and Technology (NIST) [16]. It has been adopted by numerous
applications, ranging from data servers to low-power embedded hardware systems, to ensure data
secrecy and privacy. However, the AES block cipher is computationally intensive and time
consuming when implemented in software on general-purpose processors, which motivates hardware
acceleration of the AES algorithm on application-specific integrated circuits (ASICs) or
reconfigurable hardware devices such as field-programmable gate arrays (FPGAs). Current
embedded systems may therefore depend on dedicated hardware accelerators for data encryption
and decryption.
The AES encryption algorithm consists of several rounds of encryption and each round
is comprised of three main layers to apply data confusion through nonlinear transform and
data diffusion by mixing the data state. The algorithm for each round takes the state array
and, after applying a round encryption, returns an updated state array. The implementation
G. Lee and Y. Jin (eds.), CATA 2019 (EPiC Series in Computing, vol. 58), pp. 36–44
Optimization of AES using Vivado HLS Daoud, Hussein and Rafla
and optimization of such complex functions in Hardware Description Languages (HDLs) are time
consuming and not straightforward. In order to achieve an efficient design with less effort,
High-Level Synthesis (HLS) procedures are applied. HLS is an automated process that accepts
a system design created in a high level language, such as C or C++, and then generates a
Register Transfer Level (RTL) design describing the behavior of the system. HLS plays a vital
role in the design process by reducing the effort of HDL design and debugging, and providing
flexibility in the final hardware implementation to meet design constraints set by the developer.
In this paper, we present the implementation of AES using Vivado High Level Synthesis [19]
and evaluate the performance of the proposed architecture. The design is implemented on the
Xilinx Zynq-7000 SoC FPGA chip of the ZedBoard prototyping board [20]. The standard
AES-128 block cipher consists of a full 10 rounds of data permutation and mixing. Each
round was optimized and pipelined to achieve high throughput with minimum area cost. Our
proposed AES design was implemented using only look-up tables (LUTs) and flip-flops (FFs),
without any block RAM (BRAM) or DSP slices of the FPGA. Therefore, our design
may be appealing for low-cost, high-throughput applications. Additionally, our proposed
HLS implementation of the standard AES-128 is compared to previous implementations on
FPGAs. The rest of this paper is organized as follows: background and related work are
explored in Section 2; an overview of HLS and relevant techniques is presented in Section 3;
the implementation of the AES block cipher algorithm and its optimization is described in
Section 4, where it is also compared to previous work. Finally, Section 5 concludes the work
and gives suggestions for future work.
cation in GF(2^8)
AddRoundKey: bit-wise XORs the round key (sub-key) with the current state. Each round key
is derived from the previous sub-key, which requires the encryption algorithm to schedule
the key for each round.
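The two operations named above, multiplication in GF(2^8) and the AddRoundKey XOR, can be sketched in C as follows. This is an illustrative sketch rather than the paper's code: the helper names xtime, mix_column and add_round_key and the byte layout are assumptions, and the MixColumns coefficients (2, 3, 1, 1) are those of the AES standard [16].

```c
#include <stdint.h>

/* Multiply by x (i.e., by 2) in GF(2^8), reducing modulo
 * x^8 + x^4 + x^3 + x + 1 (0x11b) when the top bit overflows. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on a single 4-byte column: each output byte is a
 * GF(2^8) combination of the inputs with coefficients 2, 3, 1, 1.
 * Multiplication by 3 is computed as xtime(a) ^ a. */
void mix_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* AddRoundKey: XOR the 16-byte state with the 16-byte round key. */
void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
    for (int i = 0; i < 16; i++) {
        state[i] ^= round_key[i];
    }
}
```

Because both operations reduce to XORs and fixed shifts, they map naturally onto LUT-only logic, which is consistent with a design that avoids BRAM and DSP slices.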
The key schedule takes the original input key and derives a sub-key for each round. For
AES-128, the number of rounds is 10, and 11 sub-keys are derived, each of 128 bits. The
sub-key derivation is computed recursively, and the first sub-key is the original input key.
The AES key schedule is word-oriented, with a word size of 32 bits. Figure 2 shows the AES
key schedule for a 128-bit key, where the function g() serves both to add non-linearity to the
key schedule and to remove symmetry in the AES, and RC[i] is the round coefficient, which
varies from round to round.
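The recursive, word-oriented derivation described above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the names gf_mul, sbox, next_round_key and key_expansion are hypothetical, and the S-box is computed arithmetically (field inverse followed by the affine map of the AES standard [16]) purely to keep the example short; a practical design would more likely use a precomputed table.

```c
#include <stdint.h>
#include <string.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n) {
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box: multiplicative inverse (x^254), then the affine transform. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 0;
    if (x != 0) {
        inv = 1;
        for (int i = 0; i < 254; i++) inv = gf_mul(inv, x);
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Derive one 128-bit sub-key from the previous one.
 * g(): rotate the last word by one byte, substitute each byte,
 * and add the round coefficient rc to the first byte. */
static void next_round_key(const uint8_t prev[16], uint8_t next[16], uint8_t rc) {
    uint8_t g[4] = {
        (uint8_t)(sbox(prev[13]) ^ rc), sbox(prev[14]),
        sbox(prev[15]), sbox(prev[12])
    };
    for (int b = 0; b < 4; b++) next[b] = prev[b] ^ g[b];
    for (int b = 4; b < 16; b++) next[b] = prev[b] ^ next[b - 4];
}

/* Expand a 128-bit key into 11 sub-keys (176 bytes); sub-key 0 is the key. */
void key_expansion(const uint8_t key[16], uint8_t expanded[176]) {
    uint8_t rc = 0x01;
    memcpy(expanded, key, 16);
    for (int r = 1; r <= 10; r++) {
        next_round_key(expanded + 16 * (r - 1), expanded + 16 * r, rc);
        rc = gf_mul(rc, 0x02); /* RC[i+1] = 2 * RC[i] in GF(2^8) */
    }
}
```

For the FIPS-197 example key 2b7e1516 28aed2a6 abf71588 09cf4f3c, the first derived word is a0fafe17. Since each sub-key depends only on the previous one, the per-round step can also be evaluated one round at a time instead of storing all 176 bytes.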
In the standard AES-128, the initial round applies AddRoundKey, i.e., XORs the key with
the input data. It is followed by 10 rounds, each of which performs SubBytes, ShiftRows,
MixColumns and AddRoundKey, except for the final round, which omits the MixColumns
function.
based implementations. They focused on time performance and the encryption throughput.
Authors in [2] presented a hardware implementation of the AES algorithm developed for an
external data storage unit in a dependable application, and optimized the encryption
algorithm to meet the needs of the target application. In [7], the authors focused on
optimizing the AES algorithm to suit small embedded applications or low-power devices. They
achieved a throughput of 121 Mbps at a maximum frequency of 153 MHz, targeting a small-area
design and lower energy consumption per processed block.
HLS has also been used to optimize hardware implementations of cryptographic protocols
[8, 11]. However, only a few works use HLS to implement the AES algorithm on
FPGAs [1, 9, 12, 18].
In [1], the authors investigated various optimizations of a C-based AES implementation in
hardware using the C2R [3] methodology for co-processor synthesis. These implementations
included a baseline hardware design, a BRAM-based architecture, a pipelined scheme, and an
architecture optimized for performance and area. In [12], the authors explored different
hardware implementations of the AES using HLS directives and memory partitioning optimizations.
In [18], the authors explored four different implementation methods of the AES using Vivado HLS
$$T_p = \frac{\text{Block Size}}{N_C \times T_{CLK}} \qquad (1)$$

and the throughput-to-area ratio is calculated as:

$$K = \frac{T_p}{A} \qquad (2)$$

where Block Size is the size of a block in bits, i.e., 128 bits, $N_C$ is the number of clock cycles
necessary to encrypt a single block, $T_{CLK}$ is the clock period, set by the maximum delay path,
and $A$, the area, is the number of slices from the Vivado utilization report.
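As a worked check of Equations (1) and (2), the sketch below plugs in the figures this paper reports for the throughput-based design: 19 clock cycles per block, a 192 MHz maximum clock, and 431 slices. The function names are hypothetical and serve only to illustrate the arithmetic.

```c
/* Equation (1): throughput in Mbps. With the clock frequency f = 1 / T_CLK
 * expressed in MHz, block_bits / (cycles * T_CLK) equals
 * block_bits * f_mhz / cycles. */
double throughput_mbps(double block_bits, double cycles, double f_mhz) {
    return block_bits * f_mhz / cycles;
}

/* Equation (2): throughput-to-area ratio in Mbps per slice. */
double throughput_per_slice(double tp_mbps, double slices) {
    return tp_mbps / slices;
}
```

With 128 bits, 19 cycles and 192 MHz, this gives roughly 1293 Mbps, and dividing by 431 slices gives roughly 3.0 Mbps per slice, in line with the 1290 Mbps and 3 Mbps per slice quoted in the abstract.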
tool allows us to choose the handshaking protocol to be implemented onto the I/O ports of the
designed block(s). I/O ports can be implemented as streaming data to/from a FIFO, or as
reading/writing data to/from a memory. Other handshaking protocols can be implemented
as the design requires. In our implementation, the design was synthesized to receive a
stream of input data, and the output is synthesized to communicate with a dual-port
memory.
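The port-level choice described above can be expressed with Vivado HLS INTERFACE directives. The following is a minimal sketch, not the authors' code: the function name aes_io_sketch, the port names, and the choice of the ap_fifo and ap_memory modes are assumptions made for illustration, and the loop body is only a placeholder for the encryption. A standard C compiler ignores the pragmas, so the function can still be exercised in software.

```c
#include <stdint.h>

/* Hypothetical top-level wrapper: input arrives as a FIFO stream,
 * output is written through a memory-style port. */
void aes_io_sketch(const uint8_t msg_in[16], uint8_t mem_out[16]) {
#pragma HLS INTERFACE ap_fifo port=msg_in
#pragma HLS INTERFACE ap_memory port=mem_out
    for (int i = 0; i < 16; i++) {
        /* A FIFO port must be read strictly in order, once per element.
         * Placeholder: copy the input where the cipher would be applied. */
        mem_out[i] = msg_in[i];
    }
}
```

The sequential, single-pass access pattern is what makes a FIFO port legal here; a design that revisits input bytes would need a memory-style port on the input as well.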
SW-based Implementation: This scheme is the baseline AES algorithm designed for software
implementation, where the key expansion is executed before the encryption process starts.
The purpose of the key expansion module is to generate the 11 round keys, each 128 bits
in size. In this version, all loops are rolled and no optimization is applied. This
implementation encrypts one block in 2556 clock cycles using 154 slices of the FPGA.
// SW-based Encryption Function
void AES_Encrypt(unsigned char state[16], unsigned char msg[16], unsigned char key[16]) {
    for (int i = 0; i < 16; i++)
        state[i] = msg[i];
    // Round key generation
    unsigned char expanded_key[176];
    key_expansion(key, expanded_key);
    add_round_key(state, key);
    for (int j = 0; j < round_cnt; j++) {
        sub_bytes(state);
        shift_rows(state);
        mix_columns(state);
        add_round_key(state, (expanded_key + (16 * (j + 1))));
    }
    // Final round
    sub_bytes(state);
    shift_rows(state);
    add_round_key(state, (expanded_key + 160));
}
// Throughput-based Encryption Function
void AES_Encrypt(unsigned char state[16], unsigned char msg[16], unsigned char key[16]) {
    unsigned char ExtendedKey[16];
#pragma HLS ARRAY_PARTITION variable=ExtendedKey complete dim=1
    for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
        state[i] = msg[i];
        ExtendedKey[i] = key[i];
    }
    for (int j = 0; j <= round_cnt; j++) {
#pragma HLS PIPELINE
        add_round_key(state, ExtendedKey);
        // create a region to set false dependence of the extended key
        {
            key_expansion(ExtendedKey, j);
#pragma HLS UNROLL
#pragma HLS DEPENDENCE variable=ExtendedKey inter false
        }
        if (j != round_cnt)
            mix_columns(state);
        else
            add_round_key(state, ExtendedKey);
    }
}
This solution encrypts one block within 19 clock cycles using 431 slices (3.24%) of the
FPGA. The higher throughput comes at the expense of additional FPGA resources.
The throughput difference between the throughput-based implementation and the other two
implementations is vast. This indicates that implementing high-level code in an HLS tool
and then optimizing it toward the design constraints can shorten development time. Thus,
by using HLS, rapid optimization can be accomplished with results similar to dedicated
HDL designs.
The proposed optimization was compared to previous HLS implementations in the literature.
The area utilization in slices, maximum achieved frequency in MHz, throughput in
Mbps, and throughput-to-area ratio in Mbps/slice are shown in Table 1 for the
throughput-based implementation and the previous works. Our proposed HLS optimization
achieved a higher throughput per area and requires fewer slices to implement the proposed
architecture.
The synthesized RTL design of the throughput-based optimization was exported to Vivado
and design implementation was completed for further analysis. The experiments were conducted
using the Xilinx Zynq-7000 SoC on the ZedBoard (XC7Z020-1CLG484C) [20], along with the Xilinx
Vivado Design Suite and SDK 17.4. The FPGA fabric normally runs at 100 MHz (used in this
context), but can be configured up to 192 MHz, the maximum frequency of the encryption module.
5 Conclusion
In this work, we explored the standard AES encryption and its implementation on a Xilinx
ZedBoard with the Zynq-7000 SoC. This work focused on the encryption aspect of AES-128,
but the decryption part could easily be implemented and tested as well. The AES was
initially coded in a high-level language and was then implemented with Xilinx Vivado High Level
Synthesis. The Xilinx HLS tool enabled us to quickly realize our design and make optimizations
which greatly increased throughput of the AES algorithm. HLS also offers the potential to allow
for hardware benchmarking in early design stages and for in-depth analysis of a design’s resource
usage versus high-level code placement.
The most successful optimizations implemented in our design were pipelining the function's
for-loops, unrolling loops, and computing the extended key on the fly during the
encryption process, which reduced the initiation interval and allowed concurrent execution
of operations within loops and functions.
The encryption throughput of the proposed AES in HLS was observed to be 1290 Mbps. This
rapid development and optimization of HLS-ready code shows that HLS can increase
a designer's productivity by applying directives such as pipelining, array shaping, and port
mapping to new and existing designs. A designer is thus able to see a moderate improvement
without the need to design RTL with traditional, time-consuming HDLs.
Future work would include further optimization of the AES algorithm. Adaptation for
HLS can be achieved by writing optimized code, both with standard HDL and with manual
implementation of a pipelined structure. The future work would also be implemented on
the same prototyping board for fair comparison. Additional future work may include
using the AXI interface available in the Xilinx Zynq-7000 series of FPGAs.
The on-board processor is able to use the FPGA for hardware acceleration, as opposed to
a complete implementation of the AES algorithm in the FPGA.
As future work, different AES modes of operation for encrypting successive blocks of
data, including counter (CTR), cipher block chaining (CBC), cipher feedback (CFB), and
output feedback (OFB), can be implemented and optimized using HLS and compared to their
RTL counterparts on FPGA.
References
[1] Sumit Ahuja, Swathi T Gurumani, Chad Spackman, and Sandeep K Shukla. Hardware coprocessor
synthesis from an ANSI C specification. IEEE Design & Test of Computers, 26(4):58–67, 2009.
[2] Marko Mali, Franc Novak, and Anton Biasizzo. Hardware implementation of AES algorithm. Journal
of Electrical Engineering, 56(9-10):265–269, 2005.
[3] C2R Compiler. C2R compiler, 2018. http://www.cebatech.com/.
[4] Andreas Dandalis, Viktor K Prasanna, and Jose DP Rolim. A comparative study of performance
of AES final candidates using FPGAs. In International Workshop on Cryptographic Hardware and
Embedded Systems, pages 125–140. Springer, 2000.
[5] Ashwini M Deshpande, Mangesh S Deshpande, and Devendra N Kayatanavar. FPGA implementation
of AES encryption and decryption. In Control, Automation, Communication and Energy
Conservation, 2009. INCACEC 2009. 2009 International Conference on, pages 1–6. IEEE, 2009.
[6] Saar Drimer, Tim Güneysu, and Christof Paar. DSPs, BRAMs and a pinch of logic: new recipes
for AES on FPGAs. In Field-Programmable Custom Computing Machines, 2008. FCCM'08. 16th
International Symposium on, pages 99–108. IEEE, 2008.
[7] Panu Hamalainen, Timo Alho, Marko Hannikainen, and Timo D Hamalainen. Design and
implementation of low-area and low-power AES encryption hardware core. In Digital System Design:
Architectures, Methods and Tools, 2006. DSD 2006. 9th EUROMICRO Conference on, pages 577–583.
IEEE, 2006.
[8] HS Jacinto, Luka Daoud, and Nader Rafla. High level synthesis using Vivado HLS for optimizations
of SHA-3. In IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS),
pages 563–566. IEEE, 2017.
[9] Makoto Kotegawa, Keisuke Iwai, Hidema Tanaka, and Takakazu Kurokawa. Optimization of
hardware implementations with high-level synthesis of authenticated encryption. Bulletin of
Networking, Computing, Systems, and Software, 5(1):26–33, 2016.
[10] L. Daoud, D. Zydek, and H. Selvaraj. A Survey of High Level Synthesis Languages, Tools, and
Compilers for Reconfigurable High Performance Computing. In Advances in Systems Science,
pages 483–492. Springer, 2014. DOI: 10.1007/978-3-319-01857-7_47.
[11] Muhammad Latif, HS Jacinto, Luka Daoud, and Nader Rafla. Optimization of a quantum-secure
sponge-based hash message authentication protocol. In IEEE 61st International Midwest
Symposium on Circuits and Systems (MWSCAS), pages 984–987. IEEE, Aug 2018.
[12] Rodrigo Schmitt Meurer, Tiago Rogerio Muck, and Antonio Augusto Frohlich. An implementation
of the AES cipher using HLS. In Computing Systems Engineering (SBESC), 2013 III Brazilian
Symposium on, pages 113–118. IEEE, 2013.
[13] Sumio Morioka and Akashi Satoh. An optimized S-box circuit architecture for low power AES
design. In International Workshop on Cryptographic Hardware and Embedded Systems, pages
172–186. Springer, 2002.
[14] Sumio Morioka and Akashi Satoh. A 10-Gbps full-AES crypto design with a twisted BDD S-box
architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(7):686–691,
2004.
[15] Christof Paar and Jan Pelzl. Understanding cryptography: a textbook for students and practitioners.
Springer Science & Business Media, 2009.
[16] NIST FIPS Pub. 197: Advanced encryption standard (AES). Federal information processing
standards publication, 197(441):0311, 2001.
[17] Ingrid Verbauwhede, Patrick Schaumont, and Henry Kuo. Design and performance testing of a
2.29-Gb/s Rijndael processor. IEEE Journal of Solid-State Circuits, 38(3):569–572, 2003.
[18] Masashi Watanabe, Keisuke Iwai, Hidema Tanaka, and Takakazu Kurokawa. High-speed
implementation of encryption circuit using a high-level synthesis tool. Bulletin of Networking,
Computing, Systems, and Software, 3(1):63–66, 2014.
[19] Xilinx Inc. Vivado Design Suite: High-Level Synthesis, July, 2018.
[20] Xilinx Inc. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 User Guide, June, 2018.