2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

Different Implementations of AES Cryptographic Algorithm


Guang-liang Guo, Quan Qian*, Rui Zhang
School of Computer Engineering & Science
Shanghai University, Shanghai 200444, China
*Corresponding email: [email protected]

Abstract—AES is currently regarded as the most popular symmetric cryptographic algorithm. Developing high-performance AES implementations is important for further broadening its applications. This paper studies different optimized designs and implementations of the AES algorithm. Firstly, it tests the fast implementation of AES, whose performance is about 50 times that of the standard AES algorithm. Secondly, using the Intel AES-NI extended instruction set, the performance is improved by about 50 times compared with the fast implementation of AES. Finally, using CUDA and a GPU to execute AES in parallel improves the performance by about 18 times compared with the fast implementation of AES.

Keywords-High Performance AES; AES-NI; CUDA based AES

I. INTRODUCTION
In 2000, NIST announced that the Rijndael algorithm from Belgium had been selected as the Advanced Encryption Standard (AES). Since then, AES has attracted attention from many fields because it provides a high level of security and can be implemented easily [1]. Moreover, with the spread of big data applications, both the security and the performance of AES need to be improved.
So far, many optimizations of AES have been proposed. The AES fast implementation algorithm proposed by Daemen et al., which takes 32-bit data as the basic unit of operation, greatly improves execution performance [2]. There are also many hardware implementations of AES [4-7]. Feng et al. realize AES with an extended instruction set that can be used on some embedded devices, but the improvement is insignificant [6]. Intel has put forward a solution that realizes AES with the AES-NI extended instruction set [3], which greatly improves the performance of AES. Moreover, with the steady progress of parallel computing, many researchers have tried to parallelize AES [8,13], improving performance through parallel processing on multiple processors. Duta et al. [8] parallelized AES using CUDA, OpenMP and OpenCL, and their results show that CUDA gives the best performance.
The key point of this paper is to implement and test each of these optimizations, using a unified standard to compare their performance. The implementations compared are the AES standard algorithm [1], the AES fast implementation algorithm, AES realized with the AES-NI extended instruction set, and AES realized with CUDA.
The paper is organized as follows. Section II introduces the structural features of the AES algorithm, and Section III discusses the AES fast implementation algorithm. AES-NI based AES and GPU parallelized AES are presented in Sections IV and V, respectively. Section VI gives the experimental results, and Section VII summarizes the paper.

II. STRUCTURAL FEATURES OF AES ALGORITHM
The AES algorithm has three kinds of layers. Each layer operates on the whole 128-bit intermediate state, which is treated as a 4×4 matrix of 8-bit elements.

A. Layer of Add Round Key
In this layer, the state is XORed with the round key (the round key is obtained from the secret key by key expansion). This layer makes the relationship between the key and the ciphertext more complicated, in order to satisfy the confusion principle.

B. Layer of Bytes Substitution (using S-box)
In this layer, each byte of the state is replaced by a value obtained from the substitution box. This is done to increase security according to Shannon's diffusion and confusion principles for the design of cryptographic algorithms.

C. Diffusion layer
This layer provides diffusion over the whole state. It contains two sub-layers, which together ensure a high degree of diffusion after several rounds.
• Sub-layer of Rows Shifting: This is a linear transformation. The bytes of each row of the state matrix are rotated left. The number of positions differs from row to row and is determined by the row number.

• Sub-layer of Columns Mixing: This is also a linear transformation. It mixes each column of the state matrix, so that every byte affects the three other bytes in the same column.
The encryption and decryption flows of AES are shown in Figure 1. The plaintext and the ciphertext are both 128-bit blocks, and the decryption algorithm is the inverse process of the encryption algorithm.
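The Add Round Key layer and the two diffusion sub-layers of Section II can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the `state[r][c]` (row, column) layout are assumptions made here, and the byte substitution layer is left out because it needs the S-box table.

```python
# Minimal sketch of Add Round Key, Rows Shifting and Columns Mixing.
# Assumption: state[r][c] holds the byte in row r, column c of the 4x4 state.

def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def add_round_key(state, round_key):
    """XOR every state byte with the round-key byte at the same position."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]


def shift_rows(state):
    """Rotate row r left by r positions (row 0 is left unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_single_column(col):
    """Multiply one column by the matrix with first row (02 03 01 01)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_single_column to each of the four columns."""
    cols = [mix_single_column([state[r][c] for r in range(4)])
            for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]
```

For instance, `mix_single_column([0xdb, 0x13, 0x53, 0x45])` returns `[0x8e, 0x4d, 0xa1, 0xbc]`, and applying `add_round_key` twice with the same key restores the original state, since XOR is its own inverse.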

978-1-4799-8937-9/15 $31.00 © 2015 IEEE
DOI 10.1109/HPCC-CSS-ICESS.2015.215
Figure 1. Algorithm flow of encryption and decryption of AES

III. AES FAST IMPLEMENTATION ALGORITHM

A. Operation of Merging Round Transformation
The most complicated layer of the AES algorithm is the Columns Mixing layer. Because it requires polynomial multiplication over the finite field GF(2^8), it has a high degree of complexity [10-11]. The shift, substitution and XOR operations on the state in the other layers have a low degree of complexity, so AES can be optimized by reducing the complexity of the Columns Mixing layer. This is the main idea of the AES fast implementation algorithm. Taking the encryption algorithm of AES-128 as an example, the optimization is realized as follows.
Assume the input data is matrix a, and matrix b is obtained by substituting bytes. The substitution of bytes can be expressed as:

$$ b_{i,j} = \mathrm{SBox}[a_{i,j}] \quad (0 \le i < 4,\ 0 \le j < 4) \tag{1} $$

The rows shifting operation shifts each row of matrix b, that is:

$$ \begin{bmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{bmatrix} = \begin{bmatrix} b_{0,j+step_0} \\ b_{1,j+step_1} \\ b_{2,j+step_2} \\ b_{3,j+step_3} \end{bmatrix} \quad (0 \le j < 4,\ step_i:\ \text{number of shifts of row } i) \tag{2} $$

After columns mixing of the state matrix c, the result d can be expressed as:

$$ \begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \times \begin{bmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{bmatrix} \quad (0 \le j < 4) \tag{3} $$

Merging (1), (2) and (3), we get:

$$ \begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \times \begin{bmatrix} \mathrm{SBox}[a_{0,j+step_0}] \\ \mathrm{SBox}[a_{1,j+step_1}] \\ \mathrm{SBox}[a_{2,j+step_2}] \\ \mathrm{SBox}[a_{3,j+step_3}] \end{bmatrix} \tag{4} $$

Eq. (4) can be transformed into:

$$ \begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix} = \begin{bmatrix} 02 \\ 01 \\ 01 \\ 03 \end{bmatrix} \times \mathrm{SBox}[a_{0,j+step_0}] \oplus \begin{bmatrix} 03 \\ 02 \\ 01 \\ 01 \end{bmatrix} \times \mathrm{SBox}[a_{1,j+step_1}] \oplus \begin{bmatrix} 01 \\ 03 \\ 02 \\ 01 \end{bmatrix} \times \mathrm{SBox}[a_{2,j+step_2}] \oplus \begin{bmatrix} 01 \\ 01 \\ 03 \\ 02 \end{bmatrix} \times \mathrm{SBox}[a_{3,j+step_3}] \tag{5} $$

From Eq. (5), we obtain four lookup tables:

$$ Te_0[a] = \begin{bmatrix} 02 \\ 01 \\ 01 \\ 03 \end{bmatrix} \times \mathrm{SBox}[a]; \quad Te_1[a] = \begin{bmatrix} 03 \\ 02 \\ 01 \\ 01 \end{bmatrix} \times \mathrm{SBox}[a]; \quad Te_2[a] = \begin{bmatrix} 01 \\ 03 \\ 02 \\ 01 \end{bmatrix} \times \mathrm{SBox}[a]; \quad Te_3[a] = \begin{bmatrix} 01 \\ 01 \\ 03 \\ 02 \end{bmatrix} \times \mathrm{SBox}[a] \tag{6} $$

Thus the matrix d produced by one round of encryption (excluding the add round key transformation) can be expressed as:

$$ \begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix} = Te_0[a_{0,j+step_0}] \oplus Te_1[a_{1,j+step_1}] \oplus Te_2[a_{2,j+step_2}] \oplus Te_3[a_{3,j+step_3}] \quad (0 \le j < 4) \tag{7} $$

With this optimization, the add round key transformation and the operation above can be merged as Eq. (8):

$$ \begin{bmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{bmatrix} = Te_0[a_{0,j+step_0}] \oplus Te_1[a_{1,j+step_1}] \oplus Te_2[a_{2,j+step_2}] \oplus Te_3[a_{3,j+step_3}] \oplus \mathrm{roundKey}[r] \quad (0 \le j < 4,\ r:\ \text{current round number}) \tag{8} $$

From the transformations above, it can be concluded that the four transformation operations of data

processing can be merged into 16 table look-ups and 16 XOR operations on the Te tables (Te0, Te1, Te2 and Te3) within one round. This reduces the complexity of the algorithm. Each Te table holds 256 entries of 32 bits, so the four Te tables occupy about 4KB in total. The space consumption is very small.

B. Optimization of Decryption Algorithm
The decryption algorithm is the inverse process of the encryption algorithm. Although encryption and decryption use the same key expansion, the order of the transformations in decryption differs from that in encryption. Hence, unless the order of operations within the decryption algorithm is changed, the optimization of the encryption algorithm cannot be applied to decryption. The operations in the decryption algorithm are discussed below.
Firstly, we need to exchange the Reverse Shift Rows layer and the Reverse Bytes Substitution layer. The Reverse Shift Rows layer only affects the order of the bytes in the state matrix and does not change their contents. The Reverse Bytes Substitution layer only affects the contents of the bytes in the state matrix and does not change their order. The operations in these two layers are therefore independent of each other, so their order can be exchanged.
Secondly, we need to exchange the order of the Add Round Key layer and the Reverse Columns Mixing layer in the decryption algorithm. The calculation details are as follows, where + denotes addition in GF(2^8), i.e. bitwise XOR.
Obtaining matrix b by adding the round key to matrix a can be expressed as Eq. (9):

$$ \begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} = \begin{bmatrix} a_{0,j} + \mathrm{roundKey}_{round,0,j} \\ a_{1,j} + \mathrm{roundKey}_{round,1,j} \\ a_{2,j} + \mathrm{roundKey}_{round,2,j} \\ a_{3,j} + \mathrm{roundKey}_{round,3,j} \end{bmatrix} \quad (0 \le j < 4,\ 0 \le round < 11) \tag{9} $$

Transforming matrix b into matrix c by reverse mix column can be expressed as Eq. (10):

$$ \begin{bmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \times \begin{bmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{bmatrix} \quad (0 \le j < 4) \tag{10} $$

Combining (9) and (10) gives Eq. (11):

$$ \begin{bmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \times \begin{bmatrix} a_{0,j} + \mathrm{roundKey}_{round,0,j} \\ a_{1,j} + \mathrm{roundKey}_{round,1,j} \\ a_{2,j} + \mathrm{roundKey}_{round,2,j} \\ a_{3,j} + \mathrm{roundKey}_{round,3,j} \end{bmatrix} \tag{11} $$

From Eq. (11), we obtain:

$$ \begin{bmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{bmatrix} = X \times \begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix} + X \times \begin{bmatrix} \mathrm{roundKey}_{round,0,j} \\ \mathrm{roundKey}_{round,1,j} \\ \mathrm{roundKey}_{round,2,j} \\ \mathrm{roundKey}_{round,3,j} \end{bmatrix} \tag{12} $$

In Eq. (12),

$$ X = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} $$

Eq. (12) shows that if the Reverse Mix Column operation is applied to the round key in advance, the order of the mix column layer and the add round key layer can be exchanged, which gives decryption the same order of operations as encryption. After this change, the optimization used for encryption can also be used for decryption.

IV. USING AES-NI EXTENDED INSTRUCTION SET

A. Extended Instruction and AES-NI Instruction Structure
In general, the instruction set of a CPU is designed mainly to enhance the processing capacity of the CPU. AES-NI is an instruction set proposed by Intel that realizes some steps of AES in CPU hardware, which enhances the encryption capability of the CPU.
An AES-NI instruction has one or two input operands (typically of 128 bits), and the new instructions have the form:

"Instruction xmm1, xmm2/m128"

Here, xmm1 and xmm2 stand for two arbitrary xmm registers (xmm registers hold 128 bits of data). The result of the instruction is stored in the xmm1 register, and m128 denotes a 128-bit memory operand.
The AES-NI instruction set contains several instructions, such as AESENC and AESDEC, which support the encryption and decryption of AES. This paper mainly introduces the AESENC instruction.
AESENC performs one round of the encryption operation [12]. The pseudo-code for the execution of AESENC is shown in Fig. 2.

Figure 2. Pseudo-codes for execution of AESENC

Each instruction in AES-NI is supported directly by the CPU hardware, so AES-NI obtains a great improvement in performance.
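Intel's documentation describes AESENC as ShiftRows, SubBytes, MixColumns, then an XOR with the round key, which is exactly the round that Section III merges into table look-ups. The Python sketch below models that round in two ways, layer by layer and with the four Te tables of Eq. (6) combined as in Eq. (8), so that the two can be compared. It is an illustration under stated assumptions, not the authors' implementation and not how the hardware works internally: all names are hypothetical, the S-box is computed instead of tabulated, the 16-byte state is stored column by column as in FIPS-197, and step_i = i.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return p


def _sbox_entry(x):
    """S-box value: multiplicative inverse (x^254) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gmul(inv, x)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def _word(b0, b1, b2, b3):
    """Pack a 4-byte column into a 32-bit word, row 0 in the top byte."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


# Eq. (6): each table entry is one matrix column scaled by SBox[a].
TE0 = [_word(gmul(2, s), s, s, gmul(3, s)) for s in SBOX]
TE1 = [_word(gmul(3, s), gmul(2, s), s, s) for s in SBOX]
TE2 = [_word(s, gmul(3, s), gmul(2, s), s) for s in SBOX]
TE3 = [_word(s, s, gmul(3, s), gmul(2, s)) for s in SBOX]


def round_layered(block, round_key):
    """One round as separate layers (the order documented for AESENC)."""
    out = bytearray(16)
    for j in range(4):
        # ShiftRows + SubBytes: row r of column j is taken from column j + r.
        c = [SBOX[block[r + 4 * ((j + r) % 4)]] for r in range(4)]
        mixed = [
            gmul(2, c[0]) ^ gmul(3, c[1]) ^ c[2] ^ c[3],
            c[0] ^ gmul(2, c[1]) ^ gmul(3, c[2]) ^ c[3],
            c[0] ^ c[1] ^ gmul(2, c[2]) ^ gmul(3, c[3]),
            gmul(3, c[0]) ^ c[1] ^ c[2] ^ gmul(2, c[3]),
        ]
        for r in range(4):
            out[r + 4 * j] = mixed[r] ^ round_key[r + 4 * j]
    return bytes(out)


def round_ttable(block, round_key):
    """The same round as in Eq. (8): four look-ups and four XORs per column."""
    out = bytearray(16)
    for j in range(4):
        w = (TE0[block[4 * j]]
             ^ TE1[block[1 + 4 * ((j + 1) % 4)]]
             ^ TE2[block[2 + 4 * ((j + 2) % 4)]]
             ^ TE3[block[3 + 4 * ((j + 3) % 4)]]
             ^ int.from_bytes(round_key[4 * j:4 * j + 4], "big"))
        out[4 * j:4 * j + 4] = w.to_bytes(4, "big")
    return bytes(out)
```

Both functions should return the same 16 bytes for any state and round key. The table version does 16 look-ups and 16 XORs per round, at the cost of 4 × 256 × 4 bytes = 4KB of tables.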

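The exchange of the Add Round Key and Reverse Columns Mixing layers in Section III.B rests on the linearity stated in Eq. (12). The short check below confirms it numerically for one random column and one random round-key column. The names are hypothetical and the code is a sketch for illustration only.

```python
import os


def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return p


# Matrix X of Eq. (10) and Eq. (12).
X = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(col):
    """Multiply one 4-byte column by X over GF(2^8)."""
    return [
        gmul(X[r][0], col[0]) ^ gmul(X[r][1], col[1])
        ^ gmul(X[r][2], col[2]) ^ gmul(X[r][3], col[3])
        for r in range(4)
    ]


def xor_columns(u, v):
    """Addition in GF(2^8) is a byte-wise XOR."""
    return [p ^ q for p, q in zip(u, v)]


a = list(os.urandom(4))   # one column of the state
k = list(os.urandom(4))   # the matching column of the round key

# Left side of Eq. (12): add the round key first, then reverse mix column.
lhs = inv_mix_column(xor_columns(a, k))
# Right side of Eq. (12): transform state and key separately, then add.
rhs = xor_columns(inv_mix_column(a), inv_mix_column(k))
```

Here `lhs` and `rhs` are equal for every choice of `a` and `k`, which is why the round keys can be passed through Reverse Mix Column once, during key expansion, rather than in every decryption round.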
V. GPU PARALLELIZED AES ALGORITHM

A. AES Parallelized Algorithm
When operating in ECB (Electronic Code Book) mode, AES encrypts and decrypts each data block independently, so the blocks do not affect each other and parallel operation is suitable. Since the processing capacity and memory bandwidth of a GPU can exceed 10 times those of a CPU, the highly efficient processing capacity of the GPU can be used to run AES in parallel.
In this paper, the parallel AES algorithm divides the input data into data blocks of the same length and encrypts or decrypts each data block in parallel in the GPU kernel, using the optimized AES fast implementation algorithm. The flow of AES CUDA programming is shown in Fig. 3.

Figure 3. Flow of AES CUDA Programming

B. AES CUDA Algorithm Implementation
CUDA, the popular general-purpose programming model for GPUs, hides the internal details of the GPU and lets programmers focus on the application itself. When CUDA is used to implement an algorithm, the key points are the host side design, thread design, memory space allocation, and kernel program design.

B.1 Host side design
According to the CUDA programming model, CUDA code can be divided into two parts. One part runs on the Host (CPU) and is ordinary C code; the other part runs on the Device (GPU), is parallel code, and is called the Kernel. The Host code is mainly executed serially. The Host performs key expansion, preprocessing of the data to be encrypted or decrypted, allocation of device memory, transfer of data from the Host to the device, invocation of the kernel program, retrieval of the final encrypted results, etc.

B.2 Thread design
When the CPU invokes a CUDA kernel function, the kernel runs as a grid of parallel threads. One kernel creates one grid, each grid includes one or more blocks, and each block contains several threads. All threads in a block execute the same kernel function in parallel. A GPU consists of a set of multiprocessors, and each multiprocessor owns several sets of stream processors, which correspond to groups of blocks in the CUDA model. In this paper, suppose the total number of bytes to be encrypted is S and each block contains 512 threads. Then the total number of threads on the GPU is T = S/16, and the number of blocks G is G = ⌈T/512⌉ = ⌈(S/16)/512⌉. Note that the data to be processed has been padded if needed, so S is a multiple of 16.
The thread identification variables are blockIdx.x, blockDim.x and threadIdx.x, which are used to locate the data block that the current thread encrypts or decrypts. The pointer variable express points to the head of the data to be encrypted. Each data segment is encrypted or decrypted with the AES fast implementation algorithm.

B.3 Memory space allocation
Memory space allocation is an important stage in CUDA programming. This paper uses the AES fast implementation algorithm for encryption and decryption, so the four look-up tables (Te0 − Te3) need to be built before encryption, which requires about 4KB of space in total. The four tables are shared by all threads and are looked up frequently during encryption and decryption, so their access time directly affects performance. In this paper, the tables are stored in constant memory, since it has a high access rate and its capacity is more than 4KB.
Data stored on the host must be copied to GPU memory before the GPU can access it. However, the time spent allocating GPU memory and copying data between the Host and the GPU has a great influence on performance. Figure 4 shows the time consumed in the different stages on the host, where t1 is the time for allocating device memory, t2 for copying data from the host to the device, t3 for invoking the kernel, and t4 for copying data from the device to the host.

Figure 4. Time consuming comparison in different procedure of the host
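The thread layout of Section B.2 comes down to a little arithmetic, sketched below in Python rather than CUDA so that it can be run anywhere. The function names are hypothetical. The index formula `blockIdx.x * blockDim.x + threadIdx.x` is the usual CUDA idiom and is assumed here, since the paper only says that these three variables locate the data block of each thread.

```python
BLOCK_BYTES = 16          # one 128-bit AES block per GPU thread
THREADS_PER_BLOCK = 512   # threads per CUDA block, as chosen in the paper


def launch_geometry(total_bytes):
    """Return (T, G): the thread count T = S/16 and block count G = ceil(T/512)."""
    if total_bytes % BLOCK_BYTES:
        raise ValueError("input must be padded to a multiple of 16 bytes")
    threads = total_bytes // BLOCK_BYTES
    blocks = -(-threads // THREADS_PER_BLOCK)   # ceiling division
    return threads, blocks


def thread_byte_offset(block_idx, block_dim, thread_idx):
    """Byte offset of the AES block handled by one thread.

    Mirrors the assumed idiom blockIdx.x * blockDim.x + threadIdx.x.
    """
    return (block_idx * block_dim + thread_idx) * BLOCK_BYTES


def active_threads(total_bytes):
    """Yield (block_idx, thread_idx, offset) for every thread that has data."""
    threads, blocks = launch_geometry(total_bytes)
    for b in range(blocks):
        for t in range(THREADS_PER_BLOCK):
            gid = b * THREADS_PER_BLOCK + t
            if gid < threads:   # the last block may be only partly used
                yield b, t, gid * BLOCK_BYTES
```

For S = 16000 bytes this gives T = 1000 threads in G = 2 blocks, and the second block uses only 488 of its 512 threads. A real kernel would need the same bounds check as the `gid < threads` guard, so that idle threads do not read past the end of the input.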

From Fig. 4, it is obvious that allocating device memory takes the most time compared with the other stages. Since allocated device memory can be reused, AES encryption and decryption can be executed by repeatedly using the same memory on the device.

B.4 Kernel program design
Following the discussion in Section III, the GPU side AES encryption kernel program is shown in Figure 5.

Figure 5. The GPU side kernel function for AES encryption

VI. EXPERIMENTS AND ANALYSIS
In this part, the performance of the different AES implementations is discussed in detail. The experimental environment is an Intel i7-4612 CPU supporting the AES-NI extended instruction set, 4GB of memory, and an NVIDIA GT200 GPU with 240 processing cores.
In this paper, the unit of measurement is cycles/byte (the number of clock cycles consumed per byte of data processed), where a cycle is one clock period and a byte is 8 bits of data. We count the CPU cycles spent on data encryption or decryption and divide by the number of bytes processed. The resulting value represents the performance of the encryption or decryption.
First of all, we compare the performance of key expansion with the different algorithms. The test covers key expansion using the AES-NI extended instruction set, the AES fast implementation algorithm and the AES standard algorithm. Since key expansion in the CUDA implementation is the same as in the AES fast implementation algorithm, it is not tested again here.

Figure 6. Performance comparison among different Key Expansions

Fig. 6 shows that the performance of key expansion differs considerably among the algorithms. The AES fast implementation algorithm is about 10 times faster than the AES standard algorithm. With the AES-NI extended instruction set, key expansion is completed entirely by the CPU hardware, so its performance is a further 5~6 times better than that of the AES fast implementation algorithm.
The main process of the AES algorithm consists of encryption and decryption. We now compare the encryption and decryption performance of the different implementations. The detailed results are shown in Figure 7.

Figure 7. Comparison of performances among AES Fast, AES-NI and GPU optimization under AES-128

From Figure 7, we can see that, with the help of hardware, AES executed with AES-NI achieves the best performance: encryption is about 44 times faster than with the AES fast implementation algorithm, and decryption is about 54 times faster. Meanwhile, the performance of AES realized with CUDA is also very good. It is about 18 times faster than the fast implementation of AES, which is of the same order of magnitude as the improvement obtained with the AES-NI extended instruction set. Hence, the parallel implementation of AES has very good prospects.
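The cycles/byte metric and the speed-up factors quoted in this section follow from two divisions, sketched below. The helper names are hypothetical, and the numbers in the comments are placeholders chosen for easy arithmetic, not measurements from this paper.

```python
def cycles_per_byte(cycles, nbytes):
    """Clock cycles consumed per byte of data processed."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    return cycles / nbytes


def cycles_from_time(seconds, clock_hz):
    """Estimate a cycle count from wall-clock time at a fixed clock rate."""
    return seconds * clock_hz


def speedup(baseline_cpb, optimized_cpb):
    """How many times faster the optimized implementation is."""
    return baseline_cpb / optimized_cpb


# Placeholder figures only: 2048 cycles spent on 1024 bytes is 2 cycles/byte.
example_cpb = cycles_per_byte(2048, 1024)
# A baseline at 100 cycles/byte would then be 50 times slower.
example_speedup = speedup(100.0, example_cpb)
```

A lower cycles/byte value means better performance, so the speed-up is the baseline value divided by the optimized one.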

VII. CONCLUSIONS
This paper focuses on the performance of different AES implementations. From the experimental results, the following conclusions can be drawn.
Firstly, with the help of hardware, AES-NI gets the best performance, about 44 times better than the AES fast implementation algorithm. Besides, since the AES-NI based algorithm is fully supported by hardware, it can effectively prevent side-channel attacks [14]. However, the AES-NI instruction set is not available on all CPUs, which limits its widespread application.
Secondly, AES realized on a GPU with CUDA also performs very well, and its performance depends heavily on the computation capacity of the GPU. For our current GPU based AES, the performance is about 18 times better than that of AES with the fast implementation algorithm. However, parallel AES is currently applicable only to the electronic code book (ECB) mode and the counter (CTR) mode.
Finally, the AES fast implementation algorithm is an optimized version of AES realized in software. Compared to the AES standard algorithm, it improves performance by about 50 times, which is also a very good result.

Acknowledgements: This work is partially supported by Shanghai Municipal Natural Science Foundation (13ZR1416100).

REFERENCES
[1] NIST. Announcing the Advanced Encryption Standard (AES), Federal Information Processing Standards Publication 197, 2001.
[2] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. New York: Springer, 2002.
[3] S. Gueron. Intel Advanced Encryption Standard (AES) Instructions Set. White Paper, Intel, 2010. http://software.intel.com/sites/default/files/m/6/1/9/d/6/17973-aes-instructions-set_wp.pdf, 2008.
[4] S. T. Lu, S. Wang, J. Han and X. Y. Zeng. "Method and Implementation of SIMD Instruction Set Extension for AES Algorithm", Computer Engineering (In Chinese), Vol.37, No.6, pp.121-123, 2011.
[5] A. Dehbaoui, J. M. Dutertre, B. Robisson and A. Tria. "Electromagnetic transient faults injection on a hardware and a software implementations of AES", 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC). IEEE Press, Leuven, Sep. 2012, pp.7-15, doi:10.1109/FDTC.2012.15.
[6] B. Feng and D. Y. Qi. "Implementation of Extended Instruction Set for AES Fast Algorithm", Journal of South China University of Technology (Natural Science Edition), Vol.40, No.6, pp.97-102, 2012.
[7] R. X. Bai, H. Y. Liu and X. H. Zhang. "AES and its software implementation based on ARM920T", Journal of Computer Applications (In Chinese), Vol.31, No.5, pp.1295-1299, 2011.
[8] C. L. Duta, G. Michiu, S. Stoica and L. Gheorghe. "Accelerating Encryption Algorithms Using Parallelism", 2013 19th International Conference on Control Systems and Computer Science (CSCS). IEEE Press, Bucharest, May 2013, pp.549-554, doi:10.1109/CSCS.2013.92.
[9] C. Paar, J. Pelzl and B. Preneel. Understanding Cryptography: A Textbook for Students and Practitioners, Springer Science & Business Media, 2009, pp.83-112.
[10] Z. D. Chen and J. L. Zhang. "Inner Fusion Optimization for AES Algorithm", Journal of Air Force Radar Academy (In Chinese), Vol.48, No.3, pp.215-217, 2012.
[11] J. Fang. "Mix Column round transformation Optimization and Improvement in the AES Algorithm", Control & Automation, Vol.25, No.21, pp.49-50.
[12] A. Slobodová. "Formal verification of hardware support for advanced encryption standard", Proceedings of the 2008 International Conference on Formal Methods in Computer-Aided Design. IEEE Press, Portland, Nov. 2008.
[13] S. Kai and H. Yan. "Implementation of AES Algorithm Based on GPU", Electronic Technology (In Chinese), 2011, pp.9-11.
[14] X. Hui, Z. P. Jia, F. Zhang, X. Li, R. H. Chen and E. M. Sha. "The Research and Application of a Specific Instruction Processor for AES", Journal of Computer Research and Development (In Chinese), Vol.48, No.8, 2011, pp.1554-1562.
