Different Implementations of AES Cryptographic Algorithm
Abstract—Currently, AES is regarded as the most popular symmetric cryptographic algorithm, so it is significant to develop high-performance AES implementations to further broaden its widespread applications. This paper focuses on different optimized designs and implementations of the AES algorithm. Firstly, it tests the AES fast implementation algorithm, whose performance is improved by about 50 times compared with the standard AES algorithm. Secondly, it uses the Intel AES-NI extended instruction set, which improves the performance by about 50 times compared with the AES fast implementation algorithm. Finally, it uses CUDA and the GPU to execute AES in parallel, which improves the performance by about 18 times compared with the AES fast implementation algorithm.

Keywords-High Performance AES; AES-NI; CUDA based AES

I. INTRODUCTION

In 2000, NIST announced that the Rijndael algorithm from Belgium had been selected as the Advanced Encryption Standard (AES). Since then, the AES algorithm has attracted attention from various fields because it provides a high level of security and can be implemented easily [1]. Moreover, with the spread of big-data-related applications, both the security and the performance of the AES algorithm need to be improved as soon as possible.

So far, many solutions have been proposed for the optimization of the AES algorithm. The AES fast implementation algorithm proposed by Daemen et al., which takes 32-bit data as the basic operation unit of the AES algorithm, can greatly improve the execution performance [2]. There are also many other hardware implementations of the AES algorithm [4-7]. Feng et al. realize the AES algorithm with methods such as an extended instruction set, which can be used on some embedded devices, but the improvement of this algorithm is insignificant [6]. Intel has put forward a solution that realizes the AES algorithm with the AES-NI extended instruction set [3], which greatly improves the performance of the AES algorithm. Moreover, with the constant progress of computer parallelization, many scholars have tried to realize the AES algorithm with parallelization technology [8,13], and the performance has been improved through the parallel processing of multi-processors. [8] has parallelized AES with CUDA, OpenMP and OpenCL, and the results show that the performance obtained with CUDA is the best.

The key point of this paper is to implement and test each optimization of the AES algorithm, using a unified standard to test and compare the performance of the different optimizations. The algorithms compared and implemented include the AES standard algorithm [1], the AES fast implementation algorithm, the AES algorithm realized with the AES-NI extended instruction set, and the AES algorithm realized with CUDA.

The organization of the paper is as follows. Section II introduces the structural features of the AES algorithm, and the AES fast implementation algorithm is discussed in Section III. AES-NI based AES and GPU parallelized AES are presented in Sections IV and V respectively. Section VI gives the experimental results, and Section VII summarizes the whole paper.

II. STRUCTURAL FEATURES OF AES ALGORITHM

The AES algorithm has three different kinds of layers. The operation of each layer acts on the whole 128-bit intermediate result (the state), which is treated as a 4*4 matrix of 8-bit elements.

A. Layer of Add Round Key

In this layer, the operation is to XOR the round key (obtained from the expansion of the secret key) with the state. This layer makes the relationship between the key and the ciphertext more complicated and satisfies the confusion principle.

B. Layer of Bytes Substitution (using S-box)

In this layer, each byte in the state is substituted by a value obtained from the substitution box. This is done to achieve more security according to Shannon's confusion-diffusion principles for the design of cryptographic algorithms.

C. Diffusion Layer

This layer provides diffusion over the whole state. It contains two sub-layers to ensure a high degree of diffusion after many rounds of transformation.

• Sub-layer of Rows Shifting: This is a linear transformation. The bytes in each row of the state matrix are rotated left. The number of left rotations is not the same for each row and is determined by the row number.

• Sub-layer of Columns Mixing: This is also a linear transformation. It mixes each column of the state matrix, and each transformation makes a byte affect the three other bytes in the same column.

The algorithm flows of AES encryption and decryption are shown in Figure 1. The plaintext and the ciphertext are both 128-bit data, and the decryption algorithm is just the inverse process of the encryption.
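To make the layer ordering concrete, the following is a minimal sketch (not the paper's code) of how one AES-128 encryption pass chains the layers described above; SubBytes, ShiftRows, MixColumns and AddRoundKey are assumed helper functions operating in place on the 4*4 byte state.

    /* Sketch of the AES-128 encryption structure built from the three layers. */
    typedef unsigned char byte_t;

    void SubBytes(byte_t s[4][4]);                            /* bytes substitution (S-box) */
    void ShiftRows(byte_t s[4][4]);                           /* rows shifting              */
    void MixColumns(byte_t s[4][4]);                          /* columns mixing             */
    void AddRoundKey(byte_t s[4][4], const byte_t rk[4][4]);  /* XOR with the round key     */

    void aes_encrypt_block(byte_t state[4][4], const byte_t roundKey[11][4][4])
    {
        AddRoundKey(state, roundKey[0]);                      /* initial key addition       */
        for (int r = 1; r < 10; r++) {                        /* 9 full rounds              */
            SubBytes(state);
            ShiftRows(state);
            MixColumns(state);
            AddRoundKey(state, roundKey[r]);
        }
        SubBytes(state);                                      /* final round omits MixColumns */
        ShiftRows(state);
        AddRoundKey(state, roundKey[10]);
    }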
III. AES FAST IMPLEMENTATION ALGORITHM

After rows shifting, the state b becomes c with c_{i,j} = b_{i,\,j+\mathrm{step}_i} (0 \le i, j < 4), where \mathrm{step}_i denotes the left-rotation offset of row i. Through column mixing of the state matrix c, the state d can be presented as Eq.(3):

\[
\begin{pmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\times
\begin{pmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{pmatrix}
\quad (0 \le j < 4)
\tag{3}
\]

By pre-computing tables Te_0, Te_1, Te_2 and Te_3 that combine the byte substitution with the column mixing, and absorbing the row shifting into the table indices, each output column can be presented as Eq.(8):

\[
\begin{pmatrix} d_{0,j} \\ d_{1,j} \\ d_{2,j} \\ d_{3,j} \end{pmatrix}
= Te_0[a_{0,\,j+\mathrm{step}_0}] \oplus Te_1[a_{1,\,j+\mathrm{step}_1}] \oplus Te_2[a_{2,\,j+\mathrm{step}_2}] \oplus Te_3[a_{3,\,j+\mathrm{step}_3}] \oplus \mathrm{roundKey}[r]
\quad (0 \le j < 4,\ r: \text{current round})
\tag{8}
\]

Through the transformations above, it can be concluded that the four transformation operations on the data can be merged into 16 table look-ups in the Te tables (Te_0, Te_1, Te_2 and Te_3) and 16 XOR operations within one round. This reduces the complexity of the algorithm. Each Te table contains 256 entries of 32 bits, so the 4 Te tables occupy about 4KB; the space consumption is very small.
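As an illustration (a minimal sketch under the conventions above, not the paper's implementation), one round of the fast algorithm can be written with one 32-bit word per state column and the four pre-computed Te tables as follows; the extern tables and the packing of row 0 into the most significant byte are assumptions of the sketch.

    /* One round of the T-table ("fast implementation") AES encryption:
       16 table look-ups and 16 XOR operations per round.
       a[0..3]  : input state, one 32-bit word per column (row 0 = most significant byte)
       d[0..3]  : output state
       rk[0..3] : the four 32-bit round-key words of the current round
       Te0..Te3 : 256-entry tables of 32-bit words, built in advance              */
    #include <stdint.h>

    extern const uint32_t Te0[256], Te1[256], Te2[256], Te3[256];

    static void aes_fast_round(uint32_t d[4], const uint32_t a[4], const uint32_t rk[4])
    {
        for (int j = 0; j < 4; j++) {
            d[j] = Te0[(a[ j          ] >> 24) & 0xff]   /* row 0, step 0 */
                 ^ Te1[(a[(j + 1) & 3] >> 16) & 0xff]    /* row 1, step 1 */
                 ^ Te2[(a[(j + 2) & 3] >>  8) & 0xff]    /* row 2, step 2 */
                 ^ Te3[ a[(j + 3) & 3]        & 0xff]    /* row 3, step 3 */
                 ^ rk[j];                                /* add round key */
        }
    }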
B. Optimization of Decryption Algorithm

The decryption algorithm is the inverse process of the encryption algorithm. Although the key expansion methods of the AES encryption and decryption algorithms are the same, the order of the transformations in decryption is different from that in encryption. Hence, if the order of operations within the decryption algorithm is not changed, the optimization of the AES encryption algorithm cannot be used for decryption. Let us now discuss the operations in the decryption algorithm.

Firstly, we need to exchange the order of the Reverse Shift Rows layer and the Reverse Bytes Substitution layer. The operation of the Reverse Shift Rows layer only affects the order of the input bytes in the state matrix and does not change the content of the bytes. The operation of the Reverse Bytes Substitution layer only affects the content of the input bytes in the state matrix and does not change the order of the bytes. Therefore, the operations of these two layers have no ordering dependence, so their order can be exchanged.

Secondly, we need to exchange the order of the Add Round Key layer and the Reverse Columns Mixing layer in the decryption algorithm. The calculation details are as follows. The process of obtaining matrix b from matrix a through round-key addition can be expressed as Eq.(9):

\[
\begin{pmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{pmatrix}
=
\begin{pmatrix} a_{0,j} + \mathrm{roundKey}_{round,0,j} \\ a_{1,j} + \mathrm{roundKey}_{round,1,j} \\ a_{2,j} + \mathrm{roundKey}_{round,2,j} \\ a_{3,j} + \mathrm{roundKey}_{round,3,j} \end{pmatrix}
\quad (0 \le j < 4,\ 0 \le round < 11)
\tag{9}
\]

The process of changing matrix b into matrix c through reverse column mixing can be presented as Eq.(10):

\[
\begin{pmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{pmatrix}
=
\begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix}
\times
\begin{pmatrix} b_{0,j} \\ b_{1,j} \\ b_{2,j} \\ b_{3,j} \end{pmatrix}
\quad (0 \le j < 4)
\tag{10}
\]

Combining (9) and (10) gives Eq.(11):

\[
\begin{pmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{pmatrix}
=
\begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix}
\times
\begin{pmatrix} a_{0,j} + \mathrm{roundKey}_{round,0,j} \\ a_{1,j} + \mathrm{roundKey}_{round,1,j} \\ a_{2,j} + \mathrm{roundKey}_{round,2,j} \\ a_{3,j} + \mathrm{roundKey}_{round,3,j} \end{pmatrix}
\quad (0 \le j < 4)
\tag{11}
\]

Expanding Eq.(11), matrix c can be presented as Eq.(12):

\[
\begin{pmatrix} c_{0,j} \\ c_{1,j} \\ c_{2,j} \\ c_{3,j} \end{pmatrix}
= X \times
\begin{pmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{pmatrix}
+ X \times
\begin{pmatrix} \mathrm{roundKey}_{round,0,j} \\ \mathrm{roundKey}_{round,1,j} \\ \mathrm{roundKey}_{round,2,j} \\ \mathrm{roundKey}_{round,3,j} \end{pmatrix}
\tag{12}
\]

In Eq.(12),
\[
X =
\begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix}
\]

Eq.(12) shows that if we use the Reverse Mix Column operation to process the round key in advance, the order of the column mixing layer and the add round key layer can be exchanged, which makes the order of operations in decryption the same as in encryption. After this transformation, the optimization process of encryption can also be used for decryption.

IV. USING AES-NI EXTENDED INSTRUCTION SET

A. Extended Instruction and AES-NI Instruction Structure

Generally, the instruction set of a CPU is mainly designed to enhance the processing capacity of the CPU. AES-NI is an instruction set proposed by Intel that realizes some steps of AES in CPU hardware and thereby enhances the encryption capability of the CPU.

An AES-NI instruction has one or two input operands (typically 128 bits each), and the form of the new instructions is:

"Instruction xmm1, xmm2/m128"

Here, xmm1 and xmm2 are the names of two arbitrary xmm registers (the data in an xmm register is 128 bits wide). The result of the instruction is stored in the xmm1 register, and m128 denotes a 128-bit operand in memory.

The AES-NI instruction set contains many instructions, such as AESENC and AESDEC, which support the encryption and decryption of AES. This paper mainly introduces the AESENC instruction. For AESENC, the main work is the encryption round operation [12]. The pseudo-code for AESENC execution is shown in Fig.2.
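From C code, the AESENC instruction is normally reached through the compiler intrinsics declared in <wmmintrin.h>. The following minimal sketch (an illustration that assumes an already expanded key schedule rk[0..10], not the paper's code) shows how one 128-bit block can be encrypted with AES-128 using these intrinsics.

    /* AES-128 single-block encryption with AES-NI intrinsics.
       rk[0..10] are the 11 expanded 128-bit round keys.
       Compile with -maes (GCC/Clang) on an AES-NI capable CPU.                */
    #include <wmmintrin.h>   /* AESENC / AESENCLAST intrinsics */

    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
    {
        block = _mm_xor_si128(block, rk[0]);          /* initial AddRoundKey           */
        for (int r = 1; r < 10; r++)
            block = _mm_aesenc_si128(block, rk[r]);   /* one full AES encryption round */
        return _mm_aesenclast_si128(block, rk[10]);   /* final round (no MixColumns)   */
    }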
V. GPU PARALLELIZED AES ALGORITHM

A. AES Parallelized Algorithm

When operating in ECB (Electronic Code Book) mode, the encryption and decryption of the AES algorithm are conducted on different data blocks that do not affect each other, so parallel operation is suitable. Since the processing capacity and the bandwidth of GPU internal storage can reach more than 10 times those of the CPU, the highly efficient processing capacity of the GPU can be used for the parallel operation of the AES algorithm.

In this paper, the parallel AES algorithm divides the input data into several data blocks of the same length and executes the encryption and decryption of each data block in parallel with the optimized AES fast implementation algorithm in the GPU kernel. The flow diagram of the AES CUDA programming is shown in Fig.3.

Figure 3. Flow of AES CUDA Programming

B. AES CUDA Algorithm Implementation

CUDA, a popular general-purpose programming model for GPUs, shields the internal details of the GPU and lets programmers focus on the application itself. When using CUDA for the algorithm implementation, the key points include the host side design, the thread design, the memory space allocation, and the kernel program design.

B.1 Host side design

According to the programming model of CUDA, the CUDA code can be divided into two parts. One part runs on the Host (CPU) and is common C code; the other part runs on the Device (GPU), is parallel code, and is called the Kernel. The Host code is mainly executed serially. The Host completes the key expansion, the preprocessing of the data to be encrypted or decrypted, the allocation of device space, the transmission of data from the Host to the Device, the invocation of the kernel program, the retrieval of the final encrypted results, etc.

B.2 Thread design

When the CPU invokes a CUDA kernel function, it operates as a grid of parallel threads. One kernel creates one grid, each grid includes one or more blocks, and each block contains several threads. All threads in a block execute the same kernel function in parallel. The GPU consists of a set of multiprocessors, and each multiprocessor owns several stream processors, which correspond to groups of blocks in the CUDA model. In this paper, suppose the total number of bytes to be encrypted is S and each block contains 512 threads. Then the total number of threads on the GPU is T = S/16, and the number of blocks is G = ⌈T/512⌉ = ⌈(S/16)/512⌉. It should be noted that the data to be processed has been padded if needed, so S is a multiple of 16.

The thread index variables blockIdx.x, blockDim.x and threadIdx.x are used to locate the data block to be encrypted or decrypted by the current thread. The pointer variable express points to the head of the data to be encrypted. The AES fast implementation algorithm is then used to encrypt or decrypt each data segment.
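As an illustration of this thread layout (a minimal sketch under the assumptions above, 512 threads per block and one 16-byte AES block per thread, with hypothetical names rather than the paper's host code), the grid size computation and kernel launch could look as follows.

    /* Hypothetical host-side launch: one thread per 16-byte AES block,
       512 threads per block, G = ceil((S/16)/512) thread blocks in the grid. */
    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <stddef.h>

    __global__ void aes_encrypt_kernel(uint32_t *data, size_t nBlocks);

    void launch_aes_encrypt(uint32_t *d_data, size_t S /* bytes, multiple of 16 */)
    {
        size_t T = S / 16;                  /* total threads (= AES blocks) needed */
        size_t G = (T + 511) / 512;         /* number of CUDA thread blocks        */
        aes_encrypt_kernel<<<(unsigned int)G, 512>>>(d_data, T);
    }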
B.3 Memory space allocation

Memory space allocation is an important stage in CUDA programming. This paper has chosen the AES fast implementation algorithm for encryption and decryption, so the four look-up tables (Te0 - Te3) need to be established before encryption, which requires about 4KB of space in total. The four tables are shared by all threads and are looked up frequently during encryption and decryption, so the access time of this operation directly affects the performance of encryption and decryption. In this paper, the tables are stored in constant memory, since it has a high access rate and its capacity is larger than 4KB.
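A minimal sketch of this placement (with illustrative names, not the paper's code): the four tables are declared in constant memory and filled from the host before the kernel launch.

    /* Hypothetical placement of the four Te tables in GPU constant memory. */
    #include <cuda_runtime.h>
    #include <stdint.h>

    __constant__ uint32_t d_Te0[256], d_Te1[256], d_Te2[256], d_Te3[256];

    /* Host-side copy of the precomputed tables into constant memory. */
    void upload_te_tables(const uint32_t Te0[256], const uint32_t Te1[256],
                          const uint32_t Te2[256], const uint32_t Te3[256])
    {
        cudaMemcpyToSymbol(d_Te0, Te0, 256 * sizeof(uint32_t));
        cudaMemcpyToSymbol(d_Te1, Te1, 256 * sizeof(uint32_t));
        cudaMemcpyToSymbol(d_Te2, Te2, 256 * sizeof(uint32_t));
        cudaMemcpyToSymbol(d_Te3, Te3, 256 * sizeof(uint32_t));
    }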
Data stored on the host must be copied into GPU memory before it can be accessed by the GPU. However, the time for allocating memory on the GPU and for the data copies between the Host and the GPU has a great influence on performance. Figure 4 shows the time consumed by the different stages on the host, where allocating device space is t1, the data copy from the host to the device is t2, the kernel invocation is t3, and the data copy from the device to the host is t4.

Figure 4. Time consumption comparison of the different procedures on the host
From Fig.4, it is obvious that the time consumed for allocating device space is the largest compared with the other stages. Since the allocated space on the device can be used repeatedly, the encryption and decryption of AES can be executed by reusing that space on the device.

B.4 Kernel program design

Following the discussion in Section III, the GPU-side AES encryption kernel program is shown in Figure 5.

… as the AES fast optimization algorithm, we will not test repeatedly here.
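A minimal sketch of such a GPU-side AES-128 ECB encryption kernel (an illustration under the assumptions of the earlier sketches, not the paper's actual kernel): each thread derives its block index from blockIdx.x, blockDim.x and threadIdx.x and encrypts one 16-byte block with the T-table routine of Section III; encrypt_block_ttable and the constant-memory round keys are assumed helpers.

    /* Hypothetical GPU-side kernel: one thread encrypts one 16-byte block
       in ECB mode, using the T-table routine and the round keys prepared
       on the host (d_roundKey in constant memory, 11 x 4 32-bit words).   */
    #include <stdint.h>
    #include <stddef.h>

    __constant__ uint32_t d_roundKey[44];      /* expanded AES-128 key schedule */

    /* Full-block T-table encryption as in Section III, assumed defined elsewhere. */
    __device__ void encrypt_block_ttable(uint32_t s[4], const uint32_t *rk);

    __global__ void aes_encrypt_kernel(uint32_t *data, size_t nBlocks)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nBlocks) return;              /* guard threads beyond the last block */

        uint32_t s[4];
        for (int w = 0; w < 4; w++)            /* load one 16-byte block (4 words)    */
            s[w] = data[4 * i + w];

        encrypt_block_ttable(s, d_roundKey);   /* all rounds of AES-128               */

        for (int w = 0; w < 4; w++)            /* write the ciphertext back in place  */
            data[4 * i + w] = s[w];
    }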
VII. CONCLUSIONS

This paper mainly focuses on the performance of different AES implementations. From the experimental results, the following conclusions can be made.

Firstly, with the help of hardware, AES-NI obtains the best performance, improving it by 44 times compared with the AES fast implementation algorithm. Besides, since the AES-NI based algorithm is fully supported by hardware, it can prevent side-channel attacks effectively [14]. However, the AES-NI instruction set is not supported by all CPUs, which limits its widespread application.

Secondly, the AES algorithm realized on the GPU with CUDA also has very good performance, which highly depends on the computation capacity of the GPU. For our current GPU based AES, the performance has been improved by about 18 times compared with that of the AES fast implementation algorithm. However, the parallel block cipher AES currently supports only the electronic code book (ECB) mode and the counter (CTR) mode.

Finally, the AES fast implementation algorithm is an optimized version of AES realized in software. Compared with the AES standard algorithm, it improves the performance by about 50 times, which is also a very good result.

REFERENCES

[9] C. Paar, J. Pelzl and B. Preneel, Understanding Cryptography: A Textbook for Students and Practitioners, Springer Science & Business Media, 2009, pp. 83-112.
[10] Z. D. Chen and J. L. Zhang, "Inner Fusion Optimization for AES Algorithm", Journal of Air Force Radar Academy (in Chinese), vol. 48, no. 3, pp. 215-217, 2012.
[11] J. Fang, "Mix Column Round Transformation Optimization and Improvement in the AES Algorithm", Control & Automation, vol. 25, no. 21, pp. 49-50.
[12] A. Slobodová, "Formal Verification of Hardware Support for Advanced Encryption Standard", Proceedings of the 2008 International Conference on Formal Methods in Computer-Aided Design, IEEE Press, Portland, Nov. 2008.
[13] S. Kai and H. Yan, "Implementation of AES Algorithm Based on GPU", Electronic Technology (in Chinese), 2011, pp. 9-11.
[14] X. Hui, Z. P. Jia, F. Zhang, X. Li, R. H. Chen and E. M. Sha, "The Research and Application of a Specific Instruction Processor for AES", Journal of Computer Research and Development (in Chinese), vol. 48, no. 8, 2011, pp. 1554-1562.