Master Thesis in Electrical Engineering
September 2014
Contact Information
Author: Akash Kiran Neelap
Email: [email protected]
ACKNOWLEDGEMENT
I owe particular thanks to Stefan Peterson (BTH) for his invaluable guidance and
assistance, and for providing me with the equipment necessary to accomplish this thesis
work.
Special thanks to my friends Md. Altaf Ahmed and C. Gautam Krishna for their
consistent support.
Finally, I would like to thank my parents for their kind love and support, which words
cannot convey.
CONTENTS
ABSTRACT .......................................................................................................................................................... I
ACKNOWLEDGEMENT .................................................................................................................................. II
CHAPTER 1 INTRODUCTION ..................................................................................................................... 1
1.1 OVERVIEW ................................................................................................................................................. 1
1.2 MOTIVATION ............................................................................................................................................ 2
1.3 AIM AND OBJECTIVES............................................................................................................................. 3
1.4 RESEARCH QUESTIONS .......................................................................................................................... 4
1.5 MAIN CONTRIBUTIONS .......................................................................................................................... 4
1.6 THESIS ORGANISATION.......................................................................................................................... 5
CHAPTER 2 BACKGROUND ....................................................................................................................... 6
2.1 ADVANCED ENCRYPTION STANDARD ............................................................................................... 6
2.2 PARALLEL COMPUTING ....................................................................................................................... 16
2.2.1 Why parallel computing? .................................................................................................................... 16
2.2.2 Single core Vs Multi core processor ................................................................................................... 17
2.3 GPU COMPUTING ................................................................................................................................... 19
2.3.1 Evolution of GPGPU .......................................................................................................................... 19
2.3.2 Architectural Difference in GPU and CPU ......................................................................................... 20
2.3.3 CUDA programming model................................................................................................................ 20
2.3.4 CUDA STREAMS .............................................................................................................................. 24
2.4 RELATED WORK ..................................................................................................................................... 25
CHAPTER 3 METHODOLOGY ................................................................................................................. 28
3.1 PRE-REQUISITE FOR EXPERIMENT .................................................................................. 28
3.2 EXPERIMENTATION SETUP ................................................................................................................. 29
3.3 EXPERIMENT ........................................................................................................................................... 30
3.3.1 Implementation on CPU...................................................................................................................... 30
3.3.2 Implementation on GPU. .................................................................................................................... 32
CHAPTER 4 RESULTS................................................................................................................................ 37
4.1 IMPLEMENTATION ON SINGLE CORE CPU ....................................................................................... 37
4.1.1 AES 128 on Single threaded C ........................................................................................................... 37
4.1.2 AES 192 on Single threaded C ........................................................................................................... 38
4.1.3 AES 256 on Single threaded C ........................................................................................................... 39
4.2 IMPLEMENTATION ON GPU USING CUDA ........................................................................................ 39
4.2.1 AES 128 on GPU ................................................................................................................................ 40
4.2.2 AES 192 on GPU ................................................................................................................................ 45
4.2.3 AES 256 on GPU ................................................................................................................................ 50
4.3 COMPARISON OF DIFFERENT GRANULARITY LEVELS ON GPU ................................................. 54
4.4 COMPARISON OF DIFFERENT GRID DIMENSIONS ON GPU ........................................................... 57
4.5 IMPLEMENTATION ON MULTI-CORE CPU ........................................................................................ 63
4.6 PERFORMANCE COMPARISON OF CPU AND GPU ............................................................ 65
4.7 AES ALGORITHM USING CUDA STREAMS ........................................................................................ 66
4.7.1 AES 128 using CUDA STREAMS ..................................................................................................... 67
4.7.2 AES 192 using CUDA STREAMS ..................................................................................................... 68
4.7.3 AES 256 using CUDA STREAMS ..................................................................................................... 70
CHAPTER 5 DISCUSSIONS ........................................................................................................................ 72
5.1 VALIDITY THREATS .............................................................................................................................. 72
5.1.1 Internal Validity .................................................................................................................................. 72
5.1.2 External Validity ................................................................................................................................. 73
5.2 DISCUSSIONS .......................................................................................................................................... 73
CHAPTER 6 CONCLUSION AND FUTURE WORK ............................................................................... 76
6.1 ANSWER TO RESEARCH QUESTIONS ................................................................................................. 76
6.2 FUTURE WORK ....................................................................................................................................... 77
BIBLIOGRAPHY ............................................................................................................................................... 78
APPENDIX A...................................................................................................................................................... 81
APPENDIX B .................................................................................................................................................... 128
LIST OF FIGURES
LIST OF TABLES
Table 1: Execution time for single threaded C for AES-128 ................................................................................ 38
Table 2: Execution time for Single threaded C for AES-192 ............................................................................... 38
Table 3: Execution time for Single threaded C for AES-256 ............................................................................... 39
Table 4: Execution time for CUDA AES-128 with granularity 1 for 32000 bytes data ....................................... 40
Table 5: Execution time CUDA AES-128 with granularity 2 for 32000 bytes input ........................................... 41
Table 6: Execution time CUDA AES-128 with granularity 10 for 32000 bytes input ......................................... 41
Table 7: Execution time CUDA AES-128 with granularity 100 for 32000 bytes input ....................................... 42
Table 8: Execution time CUDA AES-128 with granularity 1 for 32000*5 bytes input ....................................... 42
Table 9: Execution time CUDA AES-128 with granularity 2 for 32000*5 bytes input ....................................... 43
Table 10: Execution time CUDA AES-128 with granularity 10 for 32000*5 bytes input ................................... 43
Table 11: Execution time CUDA AES-128 with granularity 100 for 32000*5 bytes input ................................. 44
Table 12: Execution time CUDA AES-192 with granularity 1 for 32000 bytes input ......................................... 45
Table 13: Execution time CUDA AES-192 with granularity 2 for 32000 bytes input ......................................... 46
Table 14: Execution time CUDA AES-192 with granularity 10 for 32000 bytes input ....................................... 46
Table 15: Execution time CUDA AES-192 with granularity 100 for 32000 bytes input ..................................... 47
Table 16: Execution time CUDA AES-192 with granularity 1 for 32000*5 bytes input ..................................... 47
Table 17: Execution time CUDA AES-192 with granularity 2 for 32000*5 bytes input ..................................... 48
Table 18: Execution time CUDA AES-192 with granularity 10 for 32000*5 bytes input ................................... 48
Table 19: Execution time CUDA AES-192 with granularity 100 for 32000*5 bytes input ................................. 49
Table 20: Execution time CUDA AES-256 with granularity 1 for 32000 bytes input ......................................... 50
Table 21: Execution time CUDA AES-256 with granularity 2 for 32000 bytes input ......................................... 50
Table 22: Execution time CUDA AES-256 with granularity 10 for 32000 bytes input ....................................... 51
Table 23: Execution time CUDA AES-256 with granularity 100 for 32000 bytes input ..................................... 51
Table 24: Execution time CUDA AES-256 with granularity 1 for 32000*5 bytes input ..................................... 52
Table 25: Execution time CUDA AES-256 with granularity 2 for 32000*5 bytes input ..................................... 53
Table 26: Execution time CUDA AES-256 with granularity 10 for 32000*5 bytes input ................................... 53
Table 27: Execution time CUDA AES-256 with granularity 100 for 32000*5 bytes input ................................. 54
Table 28: Execution time in multi-threaded C using Pthreads for AES-128 ........................................................ 64
Table 29: Execution time in multi-threaded C using Pthreads for AES-192 ........................................................ 64
Table 30: Execution time in multi-threaded C using Pthreads for AES-256 ........................................................ 64
Table 31: AES-128 Execution using CUDA STREAMS for 2 streams with 16000 bytes each ........................... 67
Table 32: AES-128 Execution using CUDA STREAMS for 2 streams with 32000 bytes each ........................... 67
Table 33: AES-128 Execution using CUDA STREAMS for 5 streams with 32000 bytes each ........................... 68
Table 34: AES-192 Execution using CUDA STREAMS for 2 streams with 16000 bytes each ........................... 68
Table 35: AES 192 Execution using CUDA STREAMS for 2 streams with 32000 bytes each ........................... 69
Table 36: AES-192 Execution using CUDA STREAMS for 5 streams with 32000 bytes each ........................... 69
Table 37: AES-256 Execution using CUDA STREAMS for 2 streams with 16000 bytes each ........................... 70
Table 38: AES-256 Execution using CUDA STREAMS for 2 streams with 32000 bytes each ........................... 70
Table 39: AES-256 Execution using CUDA STREAMS for 5 streams with 32000 bytes each ........................... 71
LIST OF GRAPHS
Graph 1: Performance comparison of varied grid dimensions with granularity 1 for 32000 bytes data ............... 58
Graph 2: Performance comparison of varied grid dimensions with granularity 2 for 32000 bytes input ............. 59
Graph 3: Performance comparison of varied grid dimensions with granularity 10 for 32000 bytes data ............. 60
Graph 4: Performance comparison of varied grid dimensions with granularity 100 for 32000 bytes data ........... 60
Graph 5: Performance comparison of varied grid dimensions with granularity 1 for 32000*5 bytes data ........... 61
Graph 6: Performance comparison of varied grid dimensions with granularity 2 for 32000*5 bytes data ........... 62
Graph 7: Performance comparison of varied grid dimensions with granularity 10 for 32000*5 bytes data ......... 62
Graph 8: Performance comparison of varied grid dimensions with granularity 100 for 32000*5 bytes data ....... 63
Graph 9: Performance comparison of C, Pthreads and CUDA for 32000 bytes data ........................................... 65
Graph 10: Performance comparison of C, Pthreads and CUDA for 32000*5 bytes data ..................................... 66
LIST OF ABBREVIATIONS
Acronyms Description
AES Advanced Encryption Standard
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPU Graphics Processing Unit
GPGPU General Purpose Graphics Processing Unit
MIMD Multiple Instruction Multiple Data
NIST National Institute of Standards and Technology
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
CHAPTER 1 INTRODUCTION
This chapter briefly describes the motivation behind the research work, its aim and
objectives, the main contributions of the research, and the organization of the thesis.
1.1 OVERVIEW
Cryptography has been a hot topic of research in the field of information security
since ancient times. Beginning with the ancient Caesar cipher, which relied on manual
encryption of data, and followed, after advances in mechanics, by the Enigma machine used
in World War II, researchers have always sought solutions offering better performance in
terms of speed and security. The invention of the transistor, which gave rise to
microprocessors, began the era of digital automation. This new advancement, together with
contemporary processors, gave rise to a new generation of encryption technology. Since then,
various encryption algorithms have been developed and utilized. Encryption finds use in
almost every field where data protection is a concern, including scientific institutions,
corporate offices and social networking, and more specifically in military and government affairs,
in order to facilitate secret communication. The military and armed forces hold huge amounts
of secret information relating to the protection of the country and its people, making
them one of the most important users of encryption technology.
The Advanced Encryption Standard (AES) is a specification for the encryption of
electronic data established by the U.S. National Institute of Standards and Technology
(NIST) in 2001 [25]. Today this standard is used worldwide in almost every field, including
government organizations, the military, educational institutions, ATM cards, computer password
security and electronic commerce. Present state-of-the-art technology generates exponentially
growing amounts of confidential data, which need to be encrypted before being stored or
transferred to an authorized destination. Encryption, being heavily based on intense
mathematical computations, is a time-consuming process [4]. Fast encryption has been an
important subject of research and discussion for many years. Various implementation
techniques and optimization methods have been incorporated to achieve better speedups, yet
the answer to "Which method gives the best solution?" remains only partially settled [19].
Advancement in the computer industry directly affects the performance of these algorithms.
This is a positive hint for researchers to focus on improving processor utilization in order to
achieve better processing speeds.
Microprocessors based on a single central processing unit (CPU) delivered high
performance increases for decades. Various techniques have been used to increase the
speed of the processor, including increases in the clock speed. The practice of increasing
the clock speed has flattened out due to issues such as excessive power consumption, heat
dissipation and current leakage. Excessive heat also raises the need for expensive cooling
equipment, increasing the cost of the overall system. Single-core microprocessors served for
decades, providing considerable functionality and user interfaces; as always, users in turn
long for more improvements and faster computation. The computer industry therefore moved
from single-core processors to multi-core processors, in which more than one processing
unit, or core, is placed on a single chip. The model of the multi-core processor has exerted a
tremendous impact on the software community. According to Moore's law, "the number of
transistors in a dense integrated circuit doubles approximately every two years" [37]. In
other words, the number of processing cores doubles with every new generation of
processors. However, the increase in the number of cores in a processor does not by itself
have much impact on such systems, as a sequential program can run on only one of the
processor cores, leaving the other cores idle. The introduction of multi-threaded programming
interfaces such as POSIX threads, however, promises to help utilize the multiple cores of the
system effectively in such applications.
The addition of more cores to a single chip demands a larger chip area, which
increases cost, in addition to the power consumption and heat dissipation problems that
grow with the number of processing units. Hence, hardware development alone cannot
continue to be an effective way to increase the processing speed of a system.
Application software hereafter can continue to gain high performance only through the use
of parallel programming, i.e., improvement in the software instead of additions to the
hardware. However, this holds only for specific algorithms, depending on their level of
parallelizability. Algorithms are programmed in such a way that their parallel parts can be
executed simultaneously or concurrently on different processor cores. The practice of parallel
programming is by no means new; although it has been in use for a long time, the
application areas for this technique were limited by the need for expensive systems.
Nowadays all microprocessors are parallel computers, and it is both possible and necessary to
parallelise many applications to accelerate performance. This again demands that
programmers learn more about parallel computing: the technique of utilizing cores
efficiently to gain maximum performance.
After the introduction of General Purpose Graphics Processing Units (GPGPU),
many parallel applications have been shifted to graphics cards, or implemented in
coordination with the CPU in a heterogeneous environment. The architecture of the Graphics
Processing Unit (GPU), which has hundreds of independent processing cores, is best suited
to applications that contain large parts that are independent of each other and can be
executed simultaneously. This, however, requires sufficient knowledge of many-core
programming, GPU architecture and memory design; a programmer also needs to understand
which parts of a program are best implemented on a GPU and which on a CPU. With the
introduction of parallel computing platforms and programming models such as
DirectCompute, OpenCL and NVIDIA's Compute Unified Device Architecture (CUDA),
a programmer no longer needs to master the depths of the hardware constraints of the system.
1.2 MOTIVATION
Much research has previously been performed to compare the performance of
GPUs and CPUs in the implementation of various encryption algorithms, including AES.
Where a few studies claim that GPUs outperform CPUs, the implementation technique used
is either improper or incomplete, and the performance improvement is often too small, or
almost negligible, to promise that GPUs can outperform CPUs in every scenario. The devices
used are older GPUs with the Tesla and Fermi architectures; not much research has been done
on the latest Kepler architecture GPUs, which are used in this research work. The GPUs are
not efficiently exploited, so the factual standards to be followed in order to obtain efficient
performance cannot be deduced, leaving programmers diffident when implementing new
research on a GPU. GPUs have traditionally worked on the principle of Single Instruction
Multiple Data (SIMD), and further parallelising the processing was not possible until CUDA
STREAMS were recently introduced to GPU computing. The use of CUDA STREAMS for
general purpose computing is new, and not much research has been done on its utility.
While GPUs try to spread their mark across every general purpose application, the
developments along the multi-core trajectory continue. As the number of cores in
multi-core processors increases, the scope for faster execution on CPUs increases
considerably, opening the doors to new results and inferences. These improvements in multi-core
processors make a case for general purpose applications to keep relying on CPUs for
better performance, instead of switching completely to GPU computing. Parallel computing
on multi-core processors is still an ongoing research area and demands more intricate
exploration.
This research experiments with the different versions of the AES algorithm (AES-128,
AES-192 and AES-256) on state-of-the-art CPU and GPU hardware, at different levels of
optimization, to identify the implementation technique that achieves the fastest execution. It
focuses on understanding how effectively the multiple threads of a GPU can be utilized to
achieve the best performance of an algorithm. It evaluates a parallel implementation on the
CPU using Pthreads and compares its performance with that of a single-threaded program. It
also aims to characterize the performance of an optimised GPU implementation using CUDA
STREAMS, in order to understand its effect on the performance of the algorithm. Finally,
this research proposes guidelines that future researchers can consider while implementing an
algorithm on a GPU. Being an active addition to ongoing research on parallel computing,
this work introduces valuable ideas and techniques which have not been clearly stated
earlier, and constitutes a definite contribution to the world of computing.
1.3 AIM AND OBJECTIVES
The aim of this research is to evaluate and compare the performance of the CPU and the
GPU on the three different versions of the AES encryption algorithm, at different levels of
parallelism.
The objectives to achieve the aim are:
a) Develop a single-threaded C program on the CPU for the AES-128, AES-192 and AES-256
algorithms and record the execution time of each.
b) Develop a CUDA program on the GPU for the AES-128, AES-192 and AES-256
algorithms and record the execution time of each. Examine the performance of each
implementation at different levels of granularity to identify the fastest execution.
c) Compare the performance of the CPU and GPU based on objectives a) and b).
d) Develop a multi-threaded C program on the CPU using POSIX threads for the AES-128,
AES-192 and AES-256 algorithms and record the execution time of each.
e) Compare the performance of the single-threaded and the multi-threaded C
programs on the CPU and analyse the result.
f) Compare the performance of the multi-threaded C program on the CPU and the
CUDA program on the GPU and analyse the result.
g) Optimize the CUDA program using CUDA STREAMS on the GPU and analyse its
performance compared to the plain CUDA program.
1.4 RESEARCH QUESTIONS
RQ1:
a) Does the GPU outperform the single-core CPU in the implementation of the AES algorithm?
b) Does the GPU outperform the multi-core CPU in the implementation of the AES algorithm?
RQ2:
Does the use of CUDA STREAMS have a positive impact on the performance of the AES
algorithm?
RQ3:
Which implementation gives the fastest AES execution?
1.5 MAIN CONTRIBUTIONS
GPUs find use in various areas of defence and the armed forces, including
image and video processing applications such as image stabilization, cockpit and commander
displays, video tracking, digital mapping and radar processing, as well as data protection
applications including encryption and compression. Identifying a better way to exploit
parallelism in the GPU directly supports the fast execution of these applications.
The AES encryption algorithm is widely used by the military and armed forces due to its
dependable security and popularity. Improvement in the execution speed of AES can help
reduce the time needed to encrypt the continuously increasing amounts of confidential data.
1.6 THESIS ORGANISATION
Chapter 2 begins with a description of the AES algorithm. It then reviews the
history of parallel computing, beginning with single-core processors, followed by multi-core
and many-core/GPU computing, and introduces the CUDA programming model. It finally
reviews previous related work in this area of research. A good understanding of these topics
will help the reader better comprehend this research work.
Chapter 3 describes the research methodology and the experimental setup.
Chapter 4 presents the results obtained from the experiments conducted during the
research. It contains the tables and graphs that are later used to analyse and compare the
outcomes of the implementations.
Chapter 5 discusses the observations and inferences drawn from the research and the
obtained results. It also presents the verification and validation of the implementation and
results. It is an important section of the document, as it helps answer the research questions
and serves as a reference for future research work.
Chapter 6 presents the conclusion which includes the answers to the research
questions and future work.
CHAPTER 2 BACKGROUND
This chapter first briefly describes the design of the AES algorithm. It then
introduces the concept of parallel computing, followed by a description of the GPU and the
CUDA programming model. Finally, it presents the use of CUDA STREAMS.
2.1 ADVANCED ENCRYPTION STANDARD
AES operates on an input data block of 128 bits, i.e., 16 bytes. Each byte can be represented
as a finite field element in polynomial representation [25]:

b(x) = b7x^7 + b6x^6 + b5x^5 + b4x^4 + b3x^3 + b2x^2 + b1x + b0

Input/output:
The basic entities that the AES algorithm operates on are the input data (plain-text) and
the cipher-key. The 16-byte input data is arranged in a 4 x 4 column-major order matrix
known as the input matrix, denoted as "in_matrix", as shown below.
The matrix is organized such that the columns are stored one after the other. The first
four bytes of the 128-bit input block occupy the first column in the 4 x 4 matrix of bytes. The
next four bytes occupy the second column, and so on. The same method is used to arrange
every matrix from here on. This matrix is then copied into another two dimensional array
of bytes, known as the input state, which is directly given as input to the AES algorithm. This two
dimensional array of bytes is known as the input state, state array or simply state, and is
denoted as "in_state". The in_state undergoes various transformation rounds and
generates subsequent outputs after each operation, each of which is also referred to as a state. After
the completion of all the rounds, the final output of the algorithm, known as the output state and
denoted as "out_state", is copied to a 16-byte 4 x 4 column-major matrix known as the
output matrix, denoted as "out_matrix". The data in the output matrix is the required
cipher-text.
Rounds:
The AES algorithm applies several transformation rounds to transform information
into encoded format and vice versa. AES-128 has 10 rounds, whereas AES-192 and AES-256
have 12 and 14 rounds, respectively. Each round consists of several processing steps, and except
for the last round all rounds are identical. A set of reverse rounds is applied to
transform the cipher-text back into the original plain-text using the same encryption key. The
decryption process is out of the scope of this dissertation and hence will not be considered.
To understand the working of AES encryption in depth, let us consider the case of AES-128.
This section provides a detailed description of the working of the AES algorithm [39].
Consider the example input matrix and cipher-key matrix shown below. The input matrix is
first copied to the state array.
The state matrix is given directly as input to the encryption process, whereas the
cipher-key is processed through the key schedule before being used in encryption.
Key Expansion
The key expansion mechanism takes the cipher-key as input and generates a series of
round-keys to be used during the encryption process. The round-keys are derived
from the cipher-key using Rijndael's key schedule. If the total number of words in the
cipher-key is Nk and the number of rounds in the AES algorithm is Nr, then the key expansion
generates a total of 4 x (Nr + 1) words (44 words for AES-128). The 16 bytes of the
cipher-key, k0 to k15, are arranged in a 4 x 4 column-major matrix:

k0  k4  k8   k12
k1  k5  k9   k13
k2  k6  k10  k14
k3  k7  k11  k15
The figure below depicts the arrangement of the cipher-key and the expansion of the
key into a key schedule consisting of 44 four-byte words. Let the expanded key be denoted as
"w", such that w = (w0, w1, w2, ..., w43). Each round of encryption consumes four words
from the generated key schedule.
I. Firstly, the four columns of the cipher-key are copied to the first four columns of the
key schedule as shown below.
II. The right-most column of the cipher-key, circled in the table above and denoted as "w(i-1)",
acts as the first "Rot word". Note that henceforth the right-most column of every round-
key acts as a rot word and contributes to generating the next round-key.
III. The Rot word is rotated upwards such that each byte of the column takes the place of the
byte above it, as shown below.
IV. Each byte of the obtained column is replaced with its substitute byte in the S-Box. This
method is called "substitute-byte" or "sub-byte". For example, the first byte of the
column, "ad", is substituted by the element at row "a" and column "d" of
the S-Box. Similarly, the second byte of the column, "4f", is substituted by the
element at row "4" and column "f" of the S-Box, and so on. After the sub-byte
operation, the obtained column is shown below.
V. The obtained column is then added to the column "w(i-4)", i.e., the column four positions
to the left of the word being generated. The output of this addition is then added to the
first column of the R-con matrix, known as R-con(4).
Note that each column of the R-con matrix contributes to the generation of a new round-
key. The addition is obtained by the XOR operation and is shown below.
The output obtained from the above operation is the next column of the key schedule
matrix, and also the first column of the first round-key matrix.
VI. The second, third and fourth columns of the first round-key matrix are obtained by simply
performing an XOR operation between the column immediately to the left, w(i-1), and the
column four positions to the left, w(i-4), as shown below.
The above six steps need to be repeated in order to generate all 10 round-keys,
thereby producing the 44-word key schedule. The key schedule thus obtained is
shown below.
Each round-key of the above key schedule is used in each transformation round of the
encryption process described in the next section.
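To make the mechanism concrete, a minimal C sketch of this expansion loop for AES-128 is given below. It assumes that each word packs four bytes with the first byte in the most significant position, and that sbox and rcon are the Rijndael substitution and round-constant tables; the function names are illustrative and do not come from the implementation used in this thesis.

#include <stdint.h>

extern const uint8_t sbox[256];  /* assumed Rijndael S-Box table */
extern const uint8_t rcon[11];   /* assumed round constants; rcon[1..10] are used */

/* Rotate a 4-byte word upwards by one byte (the "Rot word" of step III). */
static uint32_t rot_word(uint32_t w) { return (w << 8) | (w >> 24); }

/* Substitute each byte of a word through the S-Box (the sub-byte of step IV). */
static uint32_t sub_word(uint32_t w) {
    return ((uint32_t)sbox[(w >> 24) & 0xFF] << 24) |
           ((uint32_t)sbox[(w >> 16) & 0xFF] << 16) |
           ((uint32_t)sbox[(w >>  8) & 0xFF] <<  8) |
            (uint32_t)sbox[w & 0xFF];
}

/* AES-128 key expansion: 4 cipher-key words in, 44 schedule words out. */
void key_expansion(const uint32_t key[4], uint32_t w[44]) {
    int i;
    for (i = 0; i < 4; i++)
        w[i] = key[i];                       /* step I: copy the cipher-key */
    for (i = 4; i < 44; i++) {
        uint32_t t = w[i - 1];               /* right-most column so far */
        if (i % 4 == 0)                      /* steps II-V, once per round-key */
            t = sub_word(rot_word(t)) ^ ((uint32_t)rcon[i / 4] << 24);
        w[i] = w[i - 4] ^ t;                 /* step VI: XOR with w(i-4) */
    }
}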
Encryption process
1. Initial round
2. Rounds
i. Substitute bytes or Sub Bytes
ii. Shift Rows
iii. Mix Columns
iv. Add Round Key
3. Final round
Figure 10 shown above clearly depicts the encryption process. As can be observed in
the figure, the encryption process consists of 10 rounds, with the first 9 rounds consisting of
four different transformations. The last round omits one of the four transformations, namely
the mixing of columns. (As mentioned earlier, the number of rounds is 10, 12 and 14 for
128-bit, 192-bit and 256-bit key sizes, respectively.) The description of each step in the
encryption process is presented in detail below, using the example matrix shown in figure 1.
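Before each step is described in detail, the round structure of AES-128 can be summarised in the following C-style sketch; the four transformation functions and the round_key array are assumed to be implemented as described in the rest of this section, so the sketch is illustrative rather than the exact code used in this thesis.

/* Sketch of the AES-128 round structure. */
void aes128_encrypt_block(uint8_t state[16], const uint8_t round_key[11][16]) {
    int round;
    add_round_key(state, round_key[0]);          /* initial round */
    for (round = 1; round <= 9; round++) {       /* nine identical rounds */
        sub_bytes(state);
        shift_rows(state);
        mix_columns(state);
        add_round_key(state, round_key[round]);
    }
    sub_bytes(state);                            /* final round omits the */
    shift_rows(state);                           /* mix-columns step */
    add_round_key(state, round_key[10]);
}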
Initial round:
In the initial round, the state matrix is processed through the Add Round Key
transformation function. In this transformation, the cipher-key is added to the state; note
that here the cipher-key acts as the round-key. The cipher-key is added to the state by
combining each byte of the state with the corresponding byte of the cipher-key using
bitwise XOR, as shown below.
Rounds:
It consists of 9 rounds in the case of AES-128. Each round consists of four different
transformation steps explained below.
In this step, each byte in the state matrix (obtained from the previous transformation step)
is replaced with its substitute byte using an 8-bit substitution box, the Rijndael S-Box. [See
section] The figure below depicts the substitute function. This step provides non-linearity to the
cipher.
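A minimal sketch of this step, assuming the state is held as a 16-byte array and sbox is the 256-entry Rijndael S-Box table:

/* Replace every byte of the state by its S-Box substitute. */
static void sub_bytes(uint8_t state[16]) {
    int i;
    for (i = 0; i < 16; i++)
        state[i] = sbox[state[i]];
}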
The shift-rows function operates on the rows of the state (obtained from the previous
transformation step). It cyclically shifts the bytes in each row by a certain offset. The first row
is left unchanged. Each byte of the second row is shifted one position to the left. The third and fourth
rows are shifted by offsets of two and three, respectively, as shown in the figure below.
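Under the column-major layout described earlier, where byte state[r + 4*c] holds row r of column c, the shift can be sketched as follows; the layout is an assumption carried over from the input-matrix description.

/* Rotate row r of the 4 x 4 state left by r byte positions. */
static void shift_rows(uint8_t state[16]) {
    int r, k, c;
    for (r = 1; r < 4; r++) {           /* row 0 is left unchanged */
        for (k = 0; k < r; k++) {       /* one single-byte rotation per pass */
            uint8_t tmp = state[r];     /* row r, column 0 */
            for (c = 0; c < 3; c++)
                state[r + 4 * c] = state[r + 4 * (c + 1)];
            state[r + 12] = tmp;        /* wraps around to the last column */
        }
    }
}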
In this function, the four bytes of each column of the state (obtained from the previous
transformation step) are transformed using an invertible linear transformation. It takes four
bytes as input and outputs four bytes. During the operation, each column of the state is
multiplied by a fixed matrix, as shown in figure 14. The multiplication operation is
defined as follows: multiplication by 1 indicates no change; multiplication by 2 indicates a shift to
the left; and multiplication by 3 indicates a shift to the left followed by an XOR with the initial
un-shifted value. If the shifted value is larger than 0xFF, a conditional XOR with 0x1B
should be performed after shifting.
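The multiply-by-2 rule with its conditional XOR is commonly written as an xtime helper; a sketch of the transformation of one column, under the same layout assumption, is given below. The names are illustrative.

/* Multiply by 2 in GF(2^8): shift left, XOR with 0x1B on overflow. */
static uint8_t xtime(uint8_t b) {
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1B : 0x00));
}

/* Mix one 4-byte column by the fixed matrix {02,03,01,01}. */
static void mix_one_column(uint8_t a[4]) {
    uint8_t t = a[0] ^ a[1] ^ a[2] ^ a[3];
    uint8_t a0 = a[0];
    a[0] ^= t ^ xtime(a[0] ^ a[1]);   /* 02*a0 ^ 03*a1 ^ 01*a2 ^ 01*a3 */
    a[1] ^= t ^ xtime(a[1] ^ a[2]);
    a[2] ^= t ^ xtime(a[2] ^ a[3]);
    a[3] ^= t ^ xtime(a[3] ^ a0);
}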
As mentioned previously for the initial round, in the Add Round Key step the round-key is
combined with the state (obtained from the previous transformation step) using the XOR
operation. For each round, a sub-key/round-key is derived from the main cipher-key using
Rijndael's key schedule, which has already been explained in the previous sections. The
round-keys generated through the key scheduling process are used in the encryption process
for the Add Round Key transformation. For the first round, i.e. the round after the initial
round, Round-Key 1 is used. Similarly, for the second round, Round-Key 2 is used, and so
on. Add Round Key is defined as adding each byte of the state to the corresponding
byte of the sub-key/round-key using bitwise XOR.
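As a sketch, with the state and the current round-key both held as 16-byte arrays:

/* XOR each state byte with the corresponding round-key byte. */
static void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
    int i;
    for (i = 0; i < 16; i++)
        state[i] ^= round_key[i];
}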
The above four transformation functions are repeated 9 times to cover 9 rounds. The
output of the 9th round is processed through the final transformation round.
Final round:
The final round is similar to the previous rounds, except that it skips the third
transformation step, i.e. the Mix Columns function. Therefore the transformation steps
involved in the final round are the Sub-Bytes, Shift-Rows and Add Round Key functions.
The output of the final round is the required cipher-text. The cipher-text generated for the
considered plaintext is shown in figure 16.
2.2 PARALLEL COMPUTING
2.2.1 Why parallel computing?
With a faster computer one can solve problems involving heavy computations
faster; in the case of interactive applications, this gives better responsiveness. The second
important thing one can do is obtain better solutions in the same amount of time: it helps
increase the resolution of models and allows extra sophistication to be added to the process.
Parallel computing is an attempt to speed up the solution of a particular task through certain
techniques, one of which is dividing the task into sub-tasks and then executing them
simultaneously.
Microprocessors based on a single central processing unit (CPU) have driven rapid
performance improvements for years and still continue the trend. Many hardware and
software improvements have been made in order to get faster performance. These
improvements include increases in clock speed and hardware optimizations such as
instruction prefetching, reordering, pipelined functional units, branch prediction, hyper-
threading, etc. A famous quote from Herb Sutter says "the free lunch is over".
What this means is that clock speeds are no longer going to increase exponentially
as before, when they used to double every 18 months to 2 years in line with Moore's Law
[37]. The reason for clock speeds flattening out is excessive power consumption, which
leads to heat dissipation and current leakage. This increases the need for additional
cooling hardware, which in turn increases the cost of the system; additionally, it complicates
design and verification. Improvement in the hardware also demands more silicon to
be devoted to control logic. Due to many such complications, the computer industry could no
longer rely on improvements in hardware alone; instead it needed a change in the
approach to improving speed. One such approach was the shift to multi-core CPUs, and this
gave rise to parallel computing.
2.2.2 Single core Vs Multi core processor
A single-core CPU implies a computing component with a single computing unit or
core. Figure 17 below shows the architecture of a single-core CPU.
A multi-core processor is a computer system with more than one core on a single chip.
The cores can be identical, which is referred to as a homogeneous multi-core system, or of
different types, which is referred to as a heterogeneous multi-core system. A homogeneous
multi-core processor contains replicas of the CPU core on the same silicon die. These cores
run in parallel. Each core can run several threads, and within each core the threads are
time-sliced. The operating system sees each core as a separate processor and each thread in a
core as a separate virtual processor. Figure 18 below shows a dual-core processor architecture
with private L1 caches and a shared L2 cache.
There are various ways of exploiting parallelism, namely domain decomposition, task
decomposition and pipelining. In domain decomposition, a large amount of data is divided
into small chunks, each of which is processed by a thread independently. Each thread
executes the same instructions on its own set of data; after all the data has been processed, the
outputs from all the threads together form the final result. Another way of parallelising is task
decomposition, in which a large sophisticated task is divided into sub-tasks and each sub-task is
executed independently by a thread. Two threads of a single core cannot work
simultaneously on the same functional unit. Pipelining is yet another way of parallelising a task,
where two threads execute two tasks concurrently, overlapping the execution and
making it faster. This research work is a case of domain decomposition, also known as data
decomposition. In order to exploit parallelism efficiently, a programmer needs to make a few
considerations:
• Check whether the application is suitable to be parallelised, i.e., it consists of parts that are
independent of each other and can be executed simultaneously.
• Identify which part of the algorithm or task can be parallelised.
• Distribute tasks or data such that all the threads have their required part of the data.
• Avoid cache coherence problems.
• Synchronise the execution.
With the introduction of POSIX Threads, usually referred to as Pthreads, a
POSIX standard for threads, it has become easy and flexible to efficiently parallelise an
algorithm on a multi-core CPU. Pthreads is a standardized C language threads
programming interface specified by the IEEE POSIX 1003.1c standard [34].
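A minimal, self-contained Pthreads sketch of the domain-decomposition idea described above is shown below; the data size, thread count and worker body are illustrative placeholders, not taken from the thesis implementation.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1024

static int data[N];

/* Each worker processes its own contiguous chunk of the data array. */
static void *worker(void *arg) {
    long id = (long)arg;
    int i, chunk = N / NUM_THREADS;
    for (i = id * chunk; i < (id + 1) * chunk; i++)
        data[i] += 1;                  /* stand-in for the real work */
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    long i;
    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);      /* synchronise before using results */
    printf("done: data[0] = %d\n", data[0]);
    return 0;
}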
2.3 GPU COMPUTING
2.3.1 Evolution of GPGPU
Before one begins to talk about GPUs in high performance parallel computing, it is
worthwhile to look back at the evolution history of the GPU. Even today the GPU's primary
market remains the entertainment industry.
Early GPUs were designed specifically for graphics applications. They settled, around
the early 90s, into a fixed pipeline/function design: devices that had
specific silicon for specific operations, like shading a triangle, and so on. Then at one point it
was realized that there would be much more flexibility if programmability could be added to these
devices. Eventually there were the vertex shaders and the pixel shaders, which were
basically pieces of programs that operated on 2D or 3D data structures.
It was not until 2008 that it was realized that one can
use the same programming interface on the same silicon to do both 2D and 3D operations,
and that is basically where the Compute Unified Device Architecture (CUDA) was born.
Since then, all newer generation GPUs are CUDA capable and will remain CUDA capable
in the future. The size of the device is proportional to the number of transistors located on
it.
With the improvement in programmability, programmers started exploring GPUs for
scientific computing. At the beginning the architecture still had the split between the pixel and
vertex shaders, but with the advent of CUDA there was much more activity in academia
around GPU computing, and today it has become mainstream in the world of scientific
computing. GPUs used for general purpose computing applications are usually referred to as
General Purpose Graphics Processing Units (GPGPU).
2.3.2 Architectural Difference in GPU and CPU
CPU and GPU have fundamentally different design philosophies, as illustrated in figure 20.
The CPU is optimized for minimal latency, where one works towards being able to
quickly switch between different operations. GPUs are optimized for throughput, so that one
can push as many operations through the device as possible. In order to get low latency on the
CPU there is a lot of infrastructure on the chip, such as large caches, to make sure that
massive amounts of data are readily available, and a lot of control-flow silicon; comparatively
few parts are dedicated to computing. On the other hand, in the GPU the balance
has shifted: one needs tons of arithmetic and logic units. It is specialized for compute-
intensive, highly parallel computations and is therefore designed such that more transistors
are devoted to computing rather than to memory and control [9]. The L2 cache can shrink
because the time a GPU takes to get data from DRAM is not a major
concern as long as there is a sufficient amount of work in the application. This sufficient
amount of work is required to hide the latency that is introduced while fetching data from the
global memory or while processing a complex operation. That means a GPU needs to use a
massive number of threads in order to tolerate latencies.
The traditional OpenGL-based GPU platform is not suitable for AES computation, as it
is confined to floating-point data and lacks bitwise logical operations, whereas AES
involves intensive mathematical computations and uses bitwise logical operations in every
transformation round [9]. OpenCL can be a good platform on which to implement AES encryption.
However, CUDA is optimized for NVIDIA GPUs. This research uses the latest NVIDIA
GPU supporting the CUDA platform, as it is flexible and efficient for AES
encryption.
2.3.3 CUDA programming model
Under CUDA, the GPU is a single instruction multiple thread (SIMT) computing device.
A thread is the fundamental unit of the CUDA programming model. Every thread executes the
same instructions on specific data; this set of instructions is known as a CUDA kernel, or simply a
kernel. CUDA threads are hierarchically organized. Figure 21 presents the organization
of threads in a grid. A grid consists of a 3-dimensional array of thread-blocks, each in turn
consisting of a 3-dimensional array of threads. Every block within a grid holds a unique
block ID to differentiate it from other blocks. Similarly, every thread within a block holds a
unique thread ID to differentiate it from other threads in the block. The grid has a
total of N*M threads, where N represents the number of blocks and M represents the number
of threads in each block. Together, the block ID and thread ID uniquely identify a thread
within the entire device, i.e., from the ID of a thread it can easily be identified which thread of
which block it is.
A grid represents a single device (GPU) and the Host (CPU) can handle multiple such
grids. Figure 22 depicts the distribution of thread IDs and block IDs. All the threads in a grid
execute the same kernel function. They depend on their unique IDs to compute memory
addresses and make control decisions.
[Figure: every thread block executes the same kernel code, e.g. c[threadID] = a[threadID] + b[threadID], each thread operating on its own element of the data.]
The thread organization is determined through the configuration provided during the
kernel launch. The first dimension specifies the dimension of the grid in terms of the
number of blocks (the index of a block within the grid is available in the kernel as
'blockIdx.x'). The parameter 'blockDim.x' represents the dimension of each block in
terms of the number of threads. For example, thread '2' of block '4' has a 'threadID' value of
(4 x M + 2), where M is the number of threads per block. Each thread executes the same code,
but on a different part of the data, and may take a different path.
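This ID computation corresponds to the usual CUDA idiom sketched below; the vector-addition body and the launch parameters are illustrative only.

__global__ void vec_add(const int *a, const int *b, int *c, int n) {
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
    if (threadID < n)                   /* guard against surplus threads */
        c[threadID] = a[threadID] + b[threadID];
}

/* Launch configuration: <<<number of blocks, threads per block>>>, e.g. */
/* vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);                  */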
Different combinations of the threads and blocks of a device can affect the execution
of a kernel. This research work examines the effect of different thread block combinations on
the algorithm.
In CUDA, the host and the device have separate memory spaces. GPUs are typically
hardware cards that come with their own memory. In order to execute a kernel on the device, the
programmer needs to allocate memory within the device to store the data necessary for
execution. Figure 24 below shows the CUDA device memory model. It includes registers,
shared memory, constant and texture memory, and global memory. Communication
between the device and the host takes place through the global memory: when data is
transferred by the host over the PCI bus, it is stored in the global memory, and the GPU
accesses this memory for the data to be used during execution. In the same way, after
execution the GPU delivers the data back to the CPU via the PCI bus. The global memory and the
constant memory are shared by all the blocks in the device. Every block has its own shared
memory, which is shared by all the threads within that block. Each thread within a block has
its own private (local) memory and registers.
int main ( ) {
    int *h_a, *h_b, *h_out;  // Host copies of inputs and output
    int *d_a, *d_b, *d_out;  // Device copies of inputs and output
    h_a = (int*)malloc(N*sizeof(int));  // Memory allocation for inputs on host
    h_b = (int*)malloc(N*sizeof(int));
    h_out = (int*)malloc(N*sizeof(int));
    cudaMalloc ((void**)&d_a, N*sizeof(int));  // Memory allocation for inputs in device
    cudaMalloc ((void**)&d_b, N*sizeof(int));
    cudaMalloc ((void**)&d_out, N*sizeof(int));
    cudaMemcpy(d_a, h_a, N*sizeof(int), cudaMemcpyHostToDevice);  // Copy inputs to the device
    cudaMemcpy(d_b, h_b, N*sizeof(int), cudaMemcpyHostToDevice);
    kernel<<<blocks,threads>>>(d_a, d_b, d_out);  // Execute kernel on the device
    cudaMemcpy(h_out, d_out, N*sizeof(int), cudaMemcpyDeviceToHost);  // Copy the result to the host
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);  // Free memory allocated on the device
    return 0;
}
2.3.4 CUDA STREAMS
The instructions shown in the figure above are executed in a sequential manner: the
kernel execution has to wait until all the data is stored in GPU memory, and after
execution all the output data must be sent back to the CPU before it can be displayed to the
user. The introduction of CUDA STREAMS has given CUDA programmers a new
possibility to execute such functions in a smarter way, by overlapping data transfers with
kernel execution.
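A sketch of this overlapped pattern is given below. It assumes two streams, pinned host buffers h_in and h_out (allocated with cudaMallocHost), device buffers d_in and d_out, and an illustrative kernel; copies and kernels issued to different streams may then overlap.

cudaStream_t stream[2];
int s;
for (s = 0; s < 2; s++)
    cudaStreamCreate(&stream[s]);
for (s = 0; s < 2; s++) {
    int off = s * CHUNK;              /* each stream owns one chunk of data */
    cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(int),
                    cudaMemcpyHostToDevice, stream[s]);
    kernel<<<blocks, threads, 0, stream[s]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(int),
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();              /* wait for all streams to finish */
for (s = 0; s < 2; s++)
    cudaStreamDestroy(stream[s]);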
[Figure: timeline of overlapped kernel executions (K1-K4) and device-to-host copies (D2H) across CUDA streams, illustrating the performance improvement in total execution time.]
2.4 RELATED WORK
The exponential increase in data generation and the growth of security hazards
have forced researchers to search for better and faster implementations of encryption
algorithms to avoid data vulnerability. To tackle the problems of security threats and
complexity, various algorithms have been introduced and implemented. The next important
issue to deal with is reducing the encryption time of the algorithms. Optimizing the algorithm
itself can do this, but only to a limited extent. The introduction of multi-threading, multi-core
CPUs and GPUs has been a major breakthrough in decreasing the processing time of a
program, by exploiting parallelism in the algorithms and processing them using multiple
processing units.
Early comparisons of CPU and GPU performance in encrypting data have been made
for various encryption algorithms, including AES, DES, 3DES, RSA and many more. This
thesis is mostly concerned with AES implementations. The AES implementations
performed by previous researchers are quite motivating and intellectual. However, where some
of them fail to exploit the parallelism of the graphics device efficiently, others ignore parts of
the effective computation time, such as data transfer time, due to which their results and
inferences are incomplete. There is still insufficient information to be able to accurately
predict the best utilization techniques of a graphics card for the implementation of an algorithm.
This section presents a few previous research works in this area, which this research takes as
motivation. Considering the limitations of previous research works as a research gap, this
thesis aims to exploit efficient parallelism on the GPU, and on the multi-core CPU, to make a
fair and reliable comparison. It also aims to deduce implementation techniques for the
multi-core CPU and GPU, so that they can be utilized in future implementations.
Parallel implementation of the AES algorithm is not a new topic of interest; work on
implementing AES on multi-core CPUs and GPUs has been done before. Shao et
al. in research paper [4] report that the AES algorithm implemented on a 16-core GPU at
550 MHz using OpenSSL reduced the computation time by a factor of 1.6 compared to a
Pentium(R) Dual-Core E5200 CPU at 2.5 GHz. That paper, however, ignores the latency
caused by transferring data between the host and the device and vice versa, which is an
important factor of concern in GPU computing. Also, the performance improvement is too
small to claim that GPUs can efficiently outperform CPUs for AES encryption.
As discussed in Le et al., the AES algorithm has been implemented using data
parallelism [9]. The paper reports a 7x speedup of the parallel AES algorithm when
compared to its sequential version for data sizes beyond 200 Mbytes, whereas the
performance boost drops to 4x and 2x for data sizes below 10 Kbytes. For a significant
speedup of the algorithm, defined as the ratio of CPU to GPU execution time, the input data
should be large: the larger the input data, the higher the speedup value. However, the key
expansion block of the encryption function is executed on the CPU rather than on the GPU, so
this could be regarded as partial parallelism of the algorithm. This raises the question of how
complete parallelism would perform, which this thesis tries to answer.
Li et al. in research paper [5] concentrate on improving the throughput of the AES
algorithm. The paper compares similar implementations on various NVIDIA GPU models. The
authors suggest that the efficiency of the AES algorithm can be improved by an OpenCL
implementation of Electronic Codebook mode encryption and Cipher Feedback mode
decryption; however, it entails a performance penalty when compared to a CUDA
implementation. Also, the key expansion is performed on the CPU before transferring the
round keys to the device, similar to another research paper [1] by Luken et al.
In a report by Maksim [7], the performance of AES-128, AES-192 and AES-256 is tested
using various numbers of threads per block: 64, 128, 256 and 512. This resulted in the highest
throughput, of over 400 MBps, when 64 threads per block were used. The throughputs are
reported to be less than the CPU's for data sizes below 1 MB and to grow rapidly for data
sizes exceeding 1 MB; higher throughputs and lower execution times are achieved when the
input file is larger. This research work takes that paper as a motivation and aims to examine
the thread-block combination effect more intensely.
CHAPTER 3 METHODOLOGY
This section is a crucial part of the research work, as it helps to understand the cause
and effect of the problem at hand. This research work uses "the experimental method", also
known as a quasi-experiment, in which a researcher observes the consequences while
actively influencing something. In other words, it is a systematic scientific approach to
research in which a researcher measures changes in some variables while controlling and
manipulating one or more other variables.
The research work is carried out in two parts:
1. Literature Review
A literature review allows one to gain and demonstrate skills in both information
seeking and critical appraisal. It also helps to generate a hypothetical analysis of the
outcome of the research. The purpose of the literature review in this research is to fill the
knowledge gap on parallel programming techniques, GPU programming, efficient
algorithm design, and AES encryption. A programmer needs to have a clear understanding of
CPU and GPU architectural differences, the CUDA programming model, and multi-core
processor architecture.
To carry out the literature review:
• The latest research articles organized around and related directly to the thesis are
studied and analysed.
• Books and articles are studied in order to fill the knowledge gap in the
area.
2. Experiment
An experiment is a systematic procedure carried out to verify, refute, or establish
the validity of a hypothesis. The experiment performed in this research work is a
controlled experiment, which helps provide insight into the cause and effect behind
the demonstrated outcome. An experiment completes a research work. The
experiment details are described in this section.
3.1 PRE-REQUISITE FOR EXPERIMENT
• A CUDA-capable GPU
- NVIDIA Quadro K4000
- Architecture: Kepler
- 768 processing cores (192 cores in each streaming multiprocessor, SMX)
• A supported version of Microsoft Windows
- Windows 7
• A CUDA-supported Microsoft Visual Studio
- Microsoft Visual Studio 2012
• NVIDIA CUDA toolkit
• CPU
- Intel® Xeon® CPU E5-1650
- No. of cores: 6
- No. of threads: 12 (2 logical cores per physical core)
3.2 EXPERIMENTATION SETUP
[Figure: experimentation flow. Single-threaded C programs, multi-threaded Pthreads programs and CUDA programs are developed for AES-128, AES-192 and AES-256 on the CPU and GPU, and a comparative analysis of their execution speeds is performed to answer RQ1, RQ2 and RQ3.]
3.3 EXPERIMENT
The AES algorithm operates on an input block of 16 bytes using key sizes of 128 bits, 192 bits and 256 bits for the AES-128, AES-192 and AES-256 versions respectively. The algorithm is developed on the Intel® Xeon® CPU E5-1650 in such a way that it takes a large set of data as input and encrypts it sequentially in chunks of 16 bytes each. The encryption key remains constant for each data chunk. The key is first expanded to generate the round keys, which are then used in each round of the encryption process.
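For reference, the number of rounds fixes the number of round keys the key expansion must produce: Nr+1 round keys of four 32-bit words each. The short program below is a sketch based on the AES specification and simply tabulates these values.

#include <stdio.h>

int main(void)
{
    int Nk[3] = {4, 6, 8};     //key length in 32-bit words for AES-128/192/256
    int Nr[3] = {10, 12, 14};  //corresponding number of rounds
    int i;
    for (i = 0; i < 3; i++)
        printf("AES-%d: %d rounds, %d round-key words\n",
               32 * Nk[i], Nr[i], 4 * (Nr[i] + 1));  //44, 52 and 60 words
    return 0;
}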
The experiment is performed with data sizes of 32000 bytes and 32000*5 bytes. The data used is a long array of random integers. In each case, the execution time is recorded. A single AES encryption process (k=1) involves the execution time to:
1. Divide the state array (input array) into chunks of data.
2. Perform encryption on each chunk of data.
3. Store the results back into the output array.
The process is repeated for k=10 in order to increase the readability of the results. 20 readings of execution time are taken in each case, and their average, standard deviation and confidence interval are calculated.
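As an illustration of this post-processing step, the sketch below computes the three statistics from the 20 readings. The 95% level and the normal approximation (z = 1.96) are assumptions here, since the text does not state which confidence level was used.

#include <stdio.h>
#include <math.h>
#define READINGS 20

int main(void)
{
    double t[READINGS] = {0};  //fill with the 20 measured execution times (seconds)
    double sum = 0.0, sq = 0.0, mean, sd, half;
    int i;

    for (i = 0; i < READINGS; i++) sum += t[i];
    mean = sum / READINGS;
    for (i = 0; i < READINGS; i++) sq += (t[i] - mean) * (t[i] - mean);
    sd = sqrt(sq / (READINGS - 1));             //sample standard deviation
    half = 1.96 * sd / sqrt((double)READINGS);  //95% CI half-width (assumed level)
    printf("Average %.6f, SD %.6f, CI %.6f-%.6f\n",
           mean, sd, mean - half, mean + half);
    return 0;
}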
Implementation on single threaded CPU.
A single-threaded CPU program implies that the program is executed by a single thread on one processor core of the CPU. The structure of the program is shown below in figure 29. The instructions within the red box represent a process.
#include <stdio.h>
#include <time.h>
//...remaining header files...
#define N MAX                              //MAX = input data size, e.g. 32000

int cipherkey[size];                       //cipher key (size = 16, 24 or 32 bytes)

int main()
{
    int state[N], chunk[16];
    int i, j, k;
    double total_time;
    clock_t start, end;

    start = clock();                       //time record starts
    for (k = 0; k < process; k++) {        //encryption performed 'k' no of times
        keyexpansion(cipherkey, roundkeys);    //round keys generated once per run
        for (i = 0; i < N; i += 16) {          //data encrypted in chunks
            for (j = 0; j < 16; j++)
                chunk[j] = state[i + j];
            encryption(chunk, cipherkey);      //function call
            for (j = 0; j < 16; j++)
                state[i + j] = chunk[j];       //results copied back to state array
        }
    }
    end = clock();                         //time record stops
    total_time = (double)(end - start) / CLOCKS_PER_SEC; //total time calculated
    return 0;
}
A multi-threaded program for the AES algorithm is developed using POSIX threads. The CPU contains 6 physical cores, each providing 2 logical cores. Effective utilization of the CPU, in order to achieve the highest performance for a fair comparison with the other implementations, demands that all available cores be used efficiently. Hence 12 threads are utilized to process the entire data set. The program is developed such that the data is divided approximately equally among the threads. Each thread executes on its part of the data independently and concurrently. The structure of the program is shown below in figure 30. The instructions within the red box represent a process.
#include <pthread.h>
#include <time.h>
//...remaining header files...
#define state_ele MAX
#define num_threads 12

void *threadfunc(void *arg);               //each thread encrypts its share of the data

int main()
{
    pthread_t thread[num_threads];
    pthread_attr_t attr;
    int k, rc;
    long t;
    clock_t start, end;

    pthread_attr_init(&attr);
    start = clock();                       //time record starts
    for (k = 0; k < process; k++) {        //encryption performed 'k' no of times
        keyexpansion(arg...);              //key expansion
        for (t = 0; t < num_threads; t++) {
            rc = pthread_create(&thread[t], &attr, threadfunc, (void *)t); //thread create
        }
        for (t = 0; t < num_threads; t++)
            pthread_join(thread[t], NULL); //wait for all threads to complete
    }
    end = clock();                         //time record stops
    total_time = (end - start);            //total time calculated
}
Key points:
• This is a typical case of data decomposition (data-parallel programming), which follows the SIMD model. A sketch of a possible thread function is shown below.
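The sketch below shows what such a thread function might look like; the variable and helper names (chunks_total, state, cipherkey, encryption) follow the listings above and are assumptions, not the author's exact code. Each of the 12 threads encrypts its own contiguous share of the 16-byte chunks.

void *threadfunc(void *arg)
{
    long t = (long)arg;                        //thread index 0..11
    int per = chunks_total / num_threads;      //chunks per thread (uneven remainder ignored here)
    int lo = t * per;
    int hi = (t == num_threads - 1) ? chunks_total : lo + per;
    int c;

    for (c = lo; c < hi; c++)                  //same operation on different data (SIMD style)
        encryption(&state[c * 16], cipherkey); //encrypt one 16-byte chunk in place
    return NULL;
}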
AES-128, 192 and 256 are implemented on the NVIDIA Quadro K4000 GPU using CUDA programming. A single AES encryption process (k=1) involves the execution time to:
• Initialize the input data on the host (CPU).
• Copy the input data to the device.
• Execute the program on the copied input data in the device.
• Copy the output data back to the host.
The program structure of the CUDA program is shown in figure 31 and figure 32. The program is structured in such a way that each thread on the GPU takes a chunk of the data and processes it in parallel with the other threads. The implementation is carried out with two different data sizes, 32000 bytes and 32000*5 bytes. In the case of 32000 bytes with granularity 1, 2000 threads are utilized so that each thread operates on 16 bytes of data. In the case of 32000*5 bytes, the host (CPU) passes the data in chunks of 32000 bytes to the device (GPU). The GPU processes the 32000 bytes and returns the results before it receives the next chunk from the host. The process is repeated for k=10 in order to increase the readability of the results. 20 readings of execution time are taken in each case, and their average, standard deviation and confidence interval are calculated.
The execution time includes:
• Time taken to copy data from host to device.
• Time taken to execute the kernel (key expansion and encryption) on the device.
• Time taken to copy the results from device to host.
The structure of the CUDA program is shown below.
//HOST CODE
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>
#define N MAX

int cipherkey[size];

int main()
{
    int i, k;
    int state[N];
    for (i = 0; i < N; i++) {          //Initialize input state in host
        state[i] = i;
    }
    double total_time;
    clock_t start, end;
    start = clock();                   //time record starts
    for (k = 0; k < process; k++) {    //process repeated
        cudaMemcpy(d_state, state, N * sizeof(int), cudaMemcpyHostToDevice);  //copy input to device
        keyexpansion<<<1, 16>>>(d_key, d_roundkeys);                          //key expansion in the device
        encryption<<<blocks, threads>>>(d_state, d_roundkeys, d_cipher);      //parallel encryption
        cudaMemcpy(state, d_state, N * sizeof(int), cudaMemcpyDeviceToHost);  //copy results back
    }
    end = clock();                     //time record stops
    total_time = (double)(end - start) / CLOCKS_PER_SEC;  //total time calculated
    return 0;
}
//DEVICE CODE
__global__ void keyexpansion(int *key, int *roundkeys, ...);

__global__ void encryption(int *state, int *key, int *cipher)
{
    //one AES round: SubBytes, ShiftRows, MixColumns, AddRoundKey
    Subbyte(arg...);
    Shiftrow(arg...);
    Mixcolumn(arg...);
    Addroundkey(arg...);
}
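The mapping from threads to data can be sketched as follows; GRANULARITY, d_state, d_roundkeys and aes_encrypt_block are illustrative names, not the exact kernel of this thesis. The granularity is the number of 16-byte AES blocks processed by one thread.

__global__ void encryption(int *d_state, int *d_roundkeys)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   //global thread index
    int base = tid * GRANULARITY * 16;                 //first element of this thread's data
    int b;

    for (b = 0; b < GRANULARITY; b++)                  //each thread encrypts GRANULARITY blocks
        aes_encrypt_block(&d_state[base + b * 16], d_roundkeys);
}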
The AES-128, 192 and 256 algorithms are then implemented on the GPU in an optimized CUDA version using CUDA STREAMS. The algorithm is implemented for data sizes of 32000 bytes, 32000*2 bytes and 32000*5 bytes, using the best thread/block distribution obtained from the previous results. For the data sizes of 32000 bytes and 32000*2 bytes, two streams are used, with each stream processing 16000 bytes and 32000 bytes of data respectively. For the data size of 32000*5 bytes, five streams are used, with each stream processing 32000 bytes of data. A single AES encryption process (k=1) involves the execution time to:
• Copy the input data from the host to the device asynchronously in each stream.
• Execute the kernels (key expansion and encryption) on each stream's data.
• Copy the results from the device back to the host asynchronously in each stream.
The process is repeated for k=10 in order to increase the readability of the results. 20 readings of execution time are taken in each case, and their average, standard deviation and confidence interval are calculated. The structure of the CUDA STREAM program using 2 streams is shown in figure 33.
//HOST CODE
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>
#define N MAX

int main()
{
    cudaStream_t stream0, stream1;
    double total_time;
    clock_t start, end;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);
    start = clock();               //time record starts
    for (k = 0; k < process; k++)  //process repeated
    {
        //copy data from host to device, asynchronously in each stream
        cudaMemcpyAsync(d_a0, h_a0, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
        cudaMemcpyAsync(d_a1, h_a1, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
        keyexpansion<<<1, 16>>>(d_key, ...);            //key expansion in the device
        encryption<<<50, 20, 0, stream0>>>(d_a0, d_b);  //kernel execution in the device
        encryption<<<50, 20, 0, stream1>>>(d_a1, d_b);
        //copy results from device to host, asynchronously in each stream
        cudaMemcpyAsync(h_a0, d_a0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0);
        cudaMemcpyAsync(h_a1, d_a1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1);
    }
    cudaDeviceSynchronize();       //wait for both streams to finish before stopping the clock
    end = clock();                 //time record stops
    total_time = ((double)(end - start)) / CLOCKS_PER_SEC; //calculate total time
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
}
CHAPTER 4 RESULTS
With the advancement of technology and the growing role of communication, huge amounts of data need to be transferred while maintaining a high level of privacy. Large amounts of data are encrypted every second, and the speed of encryption has become a major concern in the modern era of data protection. The diminishing room for advancement in system hardware demands re-programming of software to achieve higher performance. Parallel computing is the solution preferred by most developers for designing algorithms that run faster on parallel computers. Understanding the different ways of parallelising a solution helps in choosing the best one for any application. This research work examines the performance of the encryption algorithm at different levels of parallelism on a CPU and a GPU. This section presents the results acquired in this research.
4.1 IMPLEMENTATION ON SINGLE CORE CPU
The AES-128 algorithm operates on an input state of 16 bytes (128 bits) and a cipher key of 16 bytes. It contains 10 transformation rounds, as described in chapter 2, each consisting of a set of transformation functions.
Case 1: Input data size = 32000 bytes; Key size = 16 bytes
Case 2: Input data size = 32000*5 bytes; Key size = 16 bytes
Table 1 below shows the average execution time, standard deviation and confidence interval of the values acquired during a single process. The unit of measurement is seconds.
It can be seen that AES-128 takes ~0.032 seconds to execute a single process run for a data size of 32000 bytes. It takes ~0.15 seconds for the data of size 32000*5 bytes, which is ~5 times longer than the former case.
The AES-192 algorithm operates on an input state of 16 bytes (128 bits) and a cipher key of 24 bytes. It contains 12 transformation rounds, as described in chapter 2, each consisting of a set of transformation functions.
Case 1: Input data size = 32000 bytes; Key size = 24 bytes
Case 2: Input data size = 32000*5 bytes; Key size = 24 bytes
Table 2 shows the execution time of the AES-192 encryption algorithm using the single-threaded C program on the CPU. It shows the average execution time, standard deviation and confidence interval of the values acquired during a single process. The unit of measurement is seconds.
Single threaded C for AES 192
It can be seen that AES-192 takes ~0.04 seconds to execute a single process run for a data size of 32000 bytes. It takes ~0.2 seconds for the data of size 32000*5 bytes, which is ~5 times longer than the former case.
The AES-256 algorithm operates on an input state of 16 bytes (128 bits) and a cipher key of 32 bytes. It contains 14 transformation rounds, as described in chapter 2, each consisting of a set of transformation functions.
Case 1: Input data size = 32000 bytes; Key size = 32 bytes
Case 2: Input data size = 32000*5 bytes; Key size = 32 bytes
Table 3 shows the execution time of the AES-256 encryption algorithm using the single-threaded C program on the CPU. It shows the average execution time, standard deviation and confidence interval of the values acquired during the process. The unit of measurement is seconds.
Single threaded C for AES 256
It can be seen that AES-256 takes ~0.05 seconds to execute a single process run for a data size of 32000 bytes. It takes ~0.2 seconds for the data of size 32000*5 bytes, which is ~4 times longer than the former case.
4.2 IMPLEMENTATION ON GPU USING CUDA
Case 1:
Input data size = 32000 bytes; Key size = 16 bytes
• Granularity 1: Total number of threads utilized = 2000.
Standard Deviation: 0.002394  0.002481  0.002415  0.002519  0.00215  0.001519  0.070208
Table 4: Execution time for CUDA AES-128 with granularity 1 for 32000 bytes data
Table 4 shows the execution time of AES 128 on GPU with granularity 1. For a data size
of 32000 bytes, 2000 threads are utilized where each thread handles 16 bytes of data. In other
words, every thread independently runs the AES algorithm on different parts of the data. It
can be observed that the fastest execution time obtained is ~0.0017 seconds which is about 18
times faster than the single threaded C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
Table 5 shows the execution time of AES 128 on GPU with granularity 2. For a data
size of 32000 bytes 1000 threads are utilized where each thread handles 32 bytes of data. The
32 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.0015
seconds which is about 22 times faster than the single threaded C program on the CPU.
40
Chapter 4 Results
Standard Deviation:  0.002513  0.002552  0.002351  0.002314  0.002552  0.001832  0.002565
Confidence Interval: 0.000899-0.003101  0.001132-0.003368  0.00047-0.00253  0.000736-0.002764  0.001632-0.003868  0.003447-0.005053  0.026376-0.028624
Table 5: Execution time CUDA AES-128 with granularity 2 for 32000 bytes input
• Granularity 10:
Total number of threads utilized = 200.
Standard Deviation: 0.002351  0.002438  0.002552  0.002351  0.002351  0.001118  0.002221
Table 6: Execution time CUDA AES-128 with granularity 10 for 32000 bytes input
Table 6 shows the execution time of AES 128 on GPU with granularity 10. For a data size
of 32000 bytes 200 threads are utilized where each thread handles 160 bytes of data. The 160
bytes of data is internally divided into chunks and executed by the single thread sequentially.
In other words, every thread independently runs the AES algorithm on different parts of the
data. It can be observed that the fastest execution time obtained is ~0.0065 seconds which is
about 5 times faster than the single threaded C program on the CPU.
41
Chapter 4 Results
• Granularity 100:
Total number of threads utilized = 20.
Table 7: Execution time CUDA AES-128 with granularity 100 for 32000 bytes input
Table 7 shows the execution time of AES 128 on GPU with granularity 100. For a
data size of 32000 bytes 20 threads are utilized where each thread handles 1600 bytes of data.
The 1600 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.056 seconds, which is about twice the execution time of the single-threaded C program on the CPU.
Case 2:
Input data size = 32000*5 bytes
Key size= 16 bytes
In this case, the total data size is 32000*5 bytes. The data is divided into chunks of 32000 bytes each, which are processed on the graphics card sequentially. Each process in turn divides the 32000-byte chunk into smaller chunks and processes them in parallel, as sketched below.
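The host-side loop for this chunked processing can be sketched as follows; the names (CHUNK, d_state, d_roundkeys, blocks, threads) are illustrative assumptions rather than the exact code used. Each 32000-byte chunk is fully processed on the GPU before the next chunk is sent.

#define CHUNK 32000
for (c = 0; c < 5; c++) {                                   //32000*5 bytes in 5 chunks
    cudaMemcpy(d_state, &state[c * CHUNK], CHUNK * sizeof(int),
               cudaMemcpyHostToDevice);                     //send one chunk
    encryption<<<blocks, threads>>>(d_state, d_roundkeys);  //encrypt it in parallel
    cudaMemcpy(&state[c * CHUNK], d_state, CHUNK * sizeof(int),
               cudaMemcpyDeviceToHost);                     //retrieve the results
}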
• Granularity 1:
Total number of threads utilized = 2000.
AES 128 on GPU using CUDA (Data Size = 32000*5, Granularity 1)
Grid Dimension:     <2,1000>   <20,100>   <40,50>    <50,40>    <100,20>   <1000,2>   <2000,1>
Average:            0.01675    0.014      0.014      0.0155     0.0195     0.07325    0.1565
Standard Deviation: 0.002447   0.002052   0.002615   0.002236   0.002236   0.023411   0.002351
Table 8: Execution time CUDA AES-128 with granularity 1 for 32000*5 bytes input
42
Chapter 4 Results
Table 8 shows the execution time of AES 128 on GPU with granularity 1. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 2000 threads are utilized where each
thread handles 16 bytes of data. In other words, every thread independently runs the AES
algorithm on different parts of the data chunk. It can be observed that the fastest execution
time obtained here is ~0.014 seconds which is about 11 times faster than the single threaded
C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
AES 128 on GPU using CUDA (Data Size = 32000*5, Granularity 2)
Grid Dimension:      <10,100>   <20,50>    <25,40>    <40,25>    <50,20>    <100,10>   <1000,1>
Average:             0.0125     0.013      0.01345    0.013      0.01175    0.021      0.14075
Standard Deviation:  0.002565   0.002513   0.002235   0.002991   0.002447   0.002052   0.001832
Confidence Interval: 0.011376-0.013624  0.011898-0.014101  0.012470-0.014430  0.011689-0.014311  0.010678-0.012822  0.020101-0.021899  0.139947-0.141553
Table 9: Execution time CUDA AES-128 with granularity 2 for 32000*5 bytes input
Table 9 shows the execution time of AES 128 on GPU with granularity 2. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 1000 threads are utilized where each
thread handles 32 bytes of data. This data is internally divided into yet smaller chunks and
executed by single thread sequentially. It can be observed that the fastest execution time
obtained is ~0.0117 seconds which is about 13 times faster than the single threaded C
program on the CPU.
• Granularity 10:
Total number of threads utilized = 200.
AES 128 on GPU using CUDA (Data Size = 32000*5, Granularity 10)
Grid Dimension:     <2,100>    <8,25>     <10,20>    <20,10>    <25,8>       <100,2>    <200,1>
Average:            0.038      0.0375     0.036      0.03935    0.0415       0.078      0.14725
Standard Deviation: 0.002513   0.002565   0.002052   0.001565   0.002856203  0.002513   0.002552
Table 10: Execution time CUDA AES-128 with granularity 10 for 32000*5 bytes input
43
Chapter 4 Results
The total data size is 32000*5 bytes which is processed on the graphic card in chunks of
size 32000 bytes each. For a chunk of data of size 32000 bytes a total of 200 threads are
utilized where each thread handles 160 bytes of data. This data is internally divided into yet
smaller chunks and executed by single thread sequentially. It can be observed in table 10 that
the fastest execution time obtained is ~0.036 seconds which is about 4 times faster than the
single threaded C program on the CPU.
• Granularity 100:
Total number of threads utilized = 20.
Table 11 shows the execution time of AES 128 on GPU with granularity 100 for a
data size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000
bytes each. For a chunk of data of size 32000 bytes a total of 20 threads are utilized where
each thread handles 1600 bytes of data. This data is internally divided into yet smaller chunks
and executed by a single thread sequentially. It can be observed that the fastest execution time obtained is ~0.285 seconds, which is about twice the execution time of the single-threaded C program on the CPU.
Standard Deviation: 0  0.002991  5.69532E-17  0.002513  0.001118  0.00484
Table 11: Execution time CUDA AES-128 with granularity 100 for 32000*5 bytes input
44
Chapter 4 Results
Case 1:
Input data size = 32000 bytes; Key size = 24 bytes
• Granularity 1: Total number of threads utilized = 2000.
Standard Deviation: 0.001832  0.002052  0.002350  0.001118  0.001118  0.002052  0.002052
Table 12: Execution time CUDA AES-192 with granularity 1 for 32000 bytes input
Table 12 shows the execution time of AES 192 on GPU with granularity 1. For a data size
of 32000 bytes 2000 threads are utilized where each thread handles 16 bytes of data. In other
words, every thread independently runs the AES algorithm on different parts of the data. It
can be observed that the fastest execution time obtained is ~0.003 seconds which is about 10
times faster than the single threaded C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
Table 13 shows the execution time of AES 192 on GPU with granularity 2. For a data
size of 32000 bytes 1000 threads are utilized where each thread handles 32 bytes of data. The
32 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.0017
seconds which is about 21 times faster than the single threaded C program on the CPU.
45
Chapter 4 Results
Standard Deviation: 0.002565  0.002513  0.002447  0.002052  0.002350  8.898E-19  0.001118
Table 13: Execution time CUDA AES-192 with granularity 2 for 32000 bytes input
• Granularity 10:
Total number of threads utilized = 200.
AES 192 on GPU using CUDA (Data Size = 32000, Granularity 10)
Grid Dimension:     <2,100>    <8,25>     <10,20>    <20,10>    <25,8>     <100,2>    <200,1>
Average:            0.009750   0.009000   0.009500   0.010000   0.010250   0.020750   0.036500
Standard Deviation: 0.001118   0.002051   0.001538   1.779E-18  0.001118   0.002447   0.002351
Table 14: Execution time CUDA AES-192 with granularity 10 for 32000 bytes input
Table 14 shows the execution time of AES 192 on GPU with granularity 10. For a data
size of 32000 bytes 200 threads are utilized where each thread handles 160 bytes of data. The
160 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.009 seconds
which is about 4 times faster than the single threaded C program on the CPU.
46
Chapter 4 Results
• Granularity 100:
Total number of threads utilized = 20.
AES 192 on GPU using CUDA (Data Size = 32000, Granularity 100)
Grid Dimension:     <1,20>     <2,10>     <4,5>      <5,4>       <10,2>     <20,1>
Average:            0.071500   0.070250   0.069500   0.070000    0.070500   0.076250
Standard Deviation: 0.002351   0.001118   0.001539   2.8476E-17  0.001538   0.002221
Table 15: Execution time CUDA AES-192 with granularity 100 for 32000 bytes input
Table 15 shows the execution time of AES 192 on GPU with granularity 100. For a
data size of 32000 bytes 20 threads are utilized where each thread handles 1600 bytes of data.
The 1600 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.069 seconds, which is about 2 times the execution time of the single-threaded C program on the CPU.
Case 2:
Input data size = 32000*5 bytes
Key size= 24 bytes
In this case, the total data size is 32000*5 bytes. The data is divided into chunks of 32000 bytes each, which are processed on the graphics card sequentially. Each process in turn divides the 32000-byte chunk into smaller chunks and processes them in parallel.
• Granularity 1:
Total number of threads utilized = 2000.
AES 192 on GPU using CUDA (Data Size = 32000*5, Granularity 1)
Grid Dimension:     <2,1000>   <20,100>   <40,50>    <50,40>    <100,20>   <1000,2>   <2000,1>
Average:            0.02075    0.0165     0.01725    0.01925    0.02475    0.10625    0.2035
Standard Deviation: 0.001832   0.002351   0.002552   0.001832   0.001118   0.002221   0.002351
Table 16: Execution time CUDA AES-192 with granularity 1 for 32000*5 bytes input
47
Chapter 4 Results
Table 16 shows the execution time of AES 192 on GPU with granularity 1. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 2000 threads are utilized where each
thread handles 16 bytes of data. In other words, every thread independently runs the AES
algorithm on different parts of the data chunk. It can be observed that the fastest execution
time obtained here is ~0.0165 seconds which is about 11 times faster than the single threaded
C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
AES 192 on GPU using CUDA (Data Size = 32000*5, Granularity 2)
Grid Dimension:     <10,100>   <20,50>    <25,40>    <40,25>    <50,20>    <100,10>   <1000,1>
Average:            0.014500   0.014250   0.014550   0.014570   0.015500   0.026500   0.175250
Standard Deviation: 0.001539   0.001832   0.001118   0.001539   0.001539   0.002351   0.001118
Table 17: Execution time CUDA AES-192 with granularity 2 for 32000*5 bytes input
Table 17 shows the execution time of AES 192 on GPU with granularity 2. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 1000 threads are utilized where each
thread handles 32 bytes of data. This data is internally divided into yet smaller chunks and
executed by single thread sequentially. It can be observed that the fastest execution time
obtained is ~0.014 seconds which is about 12 times faster than the single threaded C program
on the CPU.
• Granularity 10:
Total number of threads utilized = 200.
AES 192 on GPU using CUDA (Data Size = 32000*5, Granularity 10)
Grid Dimension:      <2,100>    <8,25>     <10,20>    <20,10>    <25,8>     <100,2>    <200,1>
Average:             0.044250   0.044250   0.044750   0.047500   0.049250   0.095250   0.181250
Standard Deviation:  0.002447   0.001832   0.002552   0.002565   0.001832   0.001970   0.002221
Confidence Interval: 0.043178-0.045322  0.043447-0.045053  0.043632-0.045868  0.046376-0.048624  0.048447-0.050053  0.094387-0.096113  0.180276-0.182224
Table 18: Execution time CUDA AES-192 with granularity 10 for 32000*5 bytes input
48
Chapter 4 Results
Table 18 shows the execution time of AES 192 on GPU with granularity 10. The total
data size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000
bytes each. For a chunk of data of size 32000 bytes a total of 200 threads are utilized where
each thread handles 160 bytes of data. This data is internally divided into yet smaller chunks
and executed by single thread sequentially. It can be observed that the fastest execution time
obtained is ~0.044 seconds which is about 4 times faster than the single threaded C program
on the CPU.
• Granularity 100:
Total number of threads utilized = 20.
AES 192 on GPU using CUDA (Data Size = 32000*5, Granularity 100)
Grid Dimension:     <1,20>     <2,10>     <4,5>      <5,4>      <10,2>     <20,1>
Average:            0.35575    0.3495     0.348      0.348      0.351      0.37525
Standard Deviation: 0.001832   0.001539   0.002513   0.002513   0.002052   0.00197
Table 19: Execution time CUDA AES-192 with granularity 100 for 32000*5 bytes input
Table 19 shows the execution time of AES 192 on GPU with granularity 100. The
total data size is 32000*5 bytes which is processed on the graphic card in chunks of size
32000 bytes each. For a chunk of data of size 32000 bytes a total of 20 threads are utilized
where each thread handles 1600 bytes of data. This data is internally divided into yet smaller
chunks and executed by a single thread sequentially. It can be observed that the fastest execution time obtained is ~0.348 seconds, which is about twice the execution time of the single-threaded C program on the CPU.
49
Chapter 4 Results
Case 1:
Input data size = 32000 bytes; Key size = 32 bytes
• Granularity 1: Total number of threads utilized = 2000.
Standard Deviation: 0.001118  0.001539  0.002052  0.001832  0.001832  0.002052  0.002552
Table 20: Execution time CUDA AES-256 with granularity 1 for 32000 bytes input
Table 20 shows the execution time of AES 256 on GPU with granularity 1. For a data size
of 32000 bytes 2000 threads are utilized where each thread handles 16 bytes of data. In other
words, every thread independently runs the AES algorithm on different parts of the data. It
can be observed that the fastest execution time obtained is ~0.004 seconds which is about 13
times faster than the single threaded C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
AES 256 on GPU using CUDA (Data Size = 32000, Granularity 2)
Grid Dimension:     <10,100>   <20,50>    <25,40>    <40,25>    <50,20>    <100,10>   <1000,1>
Average:            0.00375    0.00375    0.0035     0.00475    0.004      0.0065     0.04025
Standard Deviation: 0.002221   0.002221   0.002351   0.001118   0.002052   0.002351   0.001118
Table 21: Execution time CUDA AES-256 with granularity 2 for 32000 bytes input
Table 21 shows the execution time of AES 256 on GPU with granularity 2. For a data size
of 32000 bytes 1000 threads are utilized where each thread handles 32 bytes of data. The 32
bytes of data is internally divided into chunks and executed by the single thread sequentially.
In other words, every thread independently runs the AES algorithm on different parts of the
50
Chapter 4 Results
data. It can be observed that the fastest execution time obtained is ~0.0035 seconds which is
about 12 times faster than the single threaded C program on the CPU.
• Granularity 10:
Total number of threads utilized = 200.
AES 256 on GPU using CUDA (Data Size = 32000, Granularity 10)
Grid Dimension:      <2,100>    <8,25>     <10,20>    <20,10>    <25,8>     <100,2>    <200,1>
Average:             0.01175    0.0105     0.0113     0.01125    0.01175    0.02125    0.04175
Standard Deviation:  0.002447   0.001539   0.002203   0.002221   0.002447   0.002221   0.002447
Confidence Interval: 0.010678-0.012822  0.009826-0.011174  0.010335-0.012265  0.010276-0.012224  0.010678-0.012822  0.020276-0.022224  0.040678-0.042822
Table 22: Execution time CUDA AES-256 with granularity 10 for 32000 bytes input
Table 22 shows the execution time of AES 256 on GPU with granularity 10. For a data
size of 32000 bytes 200 threads are utilized where each thread handles 160 bytes of data. The
160 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.0105
seconds which is about 4 times faster than the single threaded C program on the CPU.
• Granularity 100:
Total number of threads utilized = 20.
AES 256 on GPU using CUDA (Data Size = 32000, Granularity 100)
Grid Dimension:     <1,20>     <2,10>     <4,5>      <5,4>      <10,2>     <20,1>
Average:            0.08175    0.081250   0.0808     0.08175    0.083000   0.0875
Standard Deviation: 0.002447   0.002221   0.001824   0.002447   0.002513   0.002565
Table 23: Execution time CUDA AES-256 with granularity 100 for 32000 bytes input
Table 23 shows the execution time of AES 256 on GPU with granularity 100. For a
data size of 32000 bytes 20 threads are utilized where each thread handles 1600 bytes of data.
The 1600 bytes of data is internally divided into chunks and executed by the single thread
sequentially. In other words, every thread independently runs the AES algorithm on different
parts of the data. It can be observed that the fastest execution time obtained is ~0.08 seconds
which is approximately 2 times the execution time of the single-threaded C program on the CPU.
Case 2:
Input data size = 32000*5 bytes
Key size= 32 bytes
In this case, the total data size is 32000*5 bytes. The data is divided into chunks of 32000 bytes each, which are processed on the graphics card sequentially. Each process in turn divides the 32000-byte chunk into smaller chunks and processes them in parallel.
• Granularity 1:
Total number of threads utilized = 2000.
AES 256 on GPU using CUDA (Data Size = 32000*5, Granularity 1)
Grid Dimension:     <2,1000>   <20,100>   <40,50>    <50,40>    <100,20>   <1000,2>   <2000,1>
Average:            0.02325    0.019      0.01875    0.021      0.02775    0.12125    0.23275
Standard Deviation: 0.002936   0.002052   0.002221   0.002616   0.002552   0.007232   0.002552
Table 24: Execution time CUDA AES-256 with granularity 1 for 32000*5 bytes input
Table 24 shows the execution time of AES 256 on GPU with granularity 1. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 2000 threads are utilized where each
thread handles 16 bytes of data. In other words, every thread independently runs the AES
algorithm on different parts of the data chunk. It can be observed that the fastest execution
time obtained here is ~0.018 seconds which is about 11 times faster than the single threaded
C program on the CPU.
• Granularity 2:
Total number of threads utilized = 1000.
AES 256 on GPU using CUDA (Data Size = 32000*5, Granularity 2)
Grid Dimension:     <10,100>   <20,50>    <25,40>    <40,25>    <50,20>    <100,10>   <1000,1>
Average:            0.01775    0.01675    0.017      0.01775    0.017      0.02975    0.202
Standard Deviation: 0.002552   0.002447   0.002513   0.002552   0.002513   0.001118   0.002513
Table 25: Execution time CUDA AES-256 with granularity 2 for 32000*5 bytes input
Table 25 shows the execution time of AES 256 on GPU with granularity 2. The total data
size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000 bytes
each. For a chunk of data of size 32000 bytes a total of 1000 threads are utilized where each
thread handles 32 bytes of data. This data is internally divided into yet smaller chunks and
executed by single thread sequentially. It can be observed that the fastest execution time
obtained is ~0.0167 seconds which is about 12 times faster than the single threaded C
program on the CPU.
• Granularity 10:
Total number of threads utilized = 200.
AES 256 on GPU using CUDA (Data Size = 32000*5, Granularity 10)
Grid Dimension:     <2,100>    <8,25>     <10,20>    <20,10>    <25,8>     <100,2>    <200,1>
Average:            0.05025    0.05       0.05       0.05375    0.05725    0.10875    0.20725
Standard Deviation: 0.00197    7.12E-18   0.001622   0.002221   0.002552   0.002751   0.002552
Table 26: Execution time CUDA AES-256 with granularity 10 for 32000*5 bytes input
Table 26 shows the execution time of AES 256 on GPU with granularity 10. The total
data size is 32000*5 bytes which is processed on the graphic card in chunks of size 32000
bytes each. For a chunk of data of size 32000 bytes a total of 200 threads are utilized where
each thread handles 160 bytes of data. This data is internally divided into yet smaller chunks
and executed by single thread sequentially. It can be observed that the fastest execution time
obtained is ~0.05 seconds which is about 4 times faster than the single threaded C program
on the CPU.
• Granularity 100:
Total number of threads utilized = 20.
Standard Deviation:  0.001832  0.002221  0.002565  0.002565  0.002351  0.002447
Confidence Interval: 0.408447-0.410053  0.402776-0.404724  0.401376-0.403624  0.406376-0.408624  0.40547-0.40753  0.432178-0.434322
Table 27: Execution time CUDA AES-256 with granularity 100 for 32000*5 bytes input
Table 27 shows the execution time of AES 256 on GPU with granularity 100. The
total data size is 32000*5 bytes which is processed on the graphic card in chunks of size
32000 bytes each. For a chunk of data of size 32000 bytes a total of 20 threads are utilized
where each thread handles 1600 bytes of data. This data is internally divided into yet smaller
chunks and executed by a single thread sequentially. It can be observed that the fastest execution time obtained is ~0.402 seconds, which is about 2 times the execution time of the single-threaded C program on the CPU.
This section presents the variation in execution time for different granularity levels for
AES-128, AES-192 and AES-256.
[Figure 34: Execution time in ascending order by granularity for AES-128 with 32000 bytes of data: Granularity 2 (~0.0015 sec), Granularity 1 (~0.0017 sec), Granularity 10 (~0.006 sec), Granularity 100 (~0.056 sec)]
Figure 34 shows the execution time in ascending order for the different granularities in AES-128 with a data size of 32000 bytes. It can be deduced from the figure that granularity 2, which utilizes 1000 threads for the encryption process, gives the minimum execution time. Granularity 1 gives almost the same result as granularity 2, with a small variation of factor 1.16. AES-128 with granularity 10 is ~4 times slower than granularities 2 and 1. Granularity 100 gives the worst performance in this case, with an execution time ~38 times longer than for granularity 2.
[Figure 35: Execution time in ascending order by granularity for AES-192 with 32000 bytes of data: Granularity 2 (~0.0017 sec), Granularity 1 (~0.0035 sec), Granularity 10 (~0.009 sec), Granularity 100 (~0.0695 sec)]
Figure 35 shows the execution time in ascending order for the different granularities in AES-192 with a data size of 32000 bytes. It can be deduced from the figure that granularity 2 takes the minimum execution time of ~0.0017 sec. It is twice as fast as execution with granularity 1, ~5 times faster than granularity 10 and ~40 times faster than granularity 100.
Figure 36 shows the execution time in ascending order for the different granularities in AES-256 with a data size of 32000 bytes. It can be deduced from the figure that granularity 2 takes the minimum execution time. The execution time for granularity 1 is almost the same as that of granularity 2, with a negligible variation. Granularity 2 is 3 times faster than granularity 10 and ~23 times faster than granularity 100.
55
Chapter 4 Results
Figure 37 shows the execution time in ascending order for the different granularities in AES-128 with a data size of 32000*5 bytes. It can be deduced from the figure that granularity 2 takes the minimum execution time of ~0.0117 sec, which is close to the execution time for granularity 1 with a negligible variation. Granularity 2 gives an execution time ~3 times faster than granularity 10 and ~24 times faster than granularity 100.
[Figure 38: Execution time in ascending order by granularity for AES-192 with 32000*5 bytes of data: Granularity 2 (~0.014 sec), Granularity 1 (~0.0165 sec), Granularity 10 (~0.044 sec), Granularity 100 (~0.348 sec)]
Figure 38 shows the execution time in ascending order for the different granularities in AES-192 with a data size of 32000*5 bytes. It can be deduced from the figure that granularity 2 takes the minimum execution time, which is close to the execution time for granularity 1 with a negligible variation. Granularity 2 gives an execution time ~3 times faster than granularity 10 and ~25 times faster than granularity 100.
56
Chapter 4 Results
Figure 39 shows the execution time in ascending order for the different granularities in AES-256 with a data size of 32000*5 bytes. It can be deduced from the figure that granularity 2 takes the minimum execution time of ~0.0167 sec, which is close to the execution time for granularity 1 with a negligible variation. Granularity 2 gives an execution time ~3 times faster than granularity 10 and ~24 times faster than granularity 100.
[Graph 1: Performance comparison of varied grid dimensions with granularity 1 for 32000 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <2,1000>, <20,100>, <40,50>, <50,40>, <100,20>, <1000,2>, <2000,1>]
Graph 2 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000 bytes with granularity 2, where a total of 1000 threads are utilized. It can be seen from the graph that AES-128 gives the fastest execution for grid dimension <25, 40>, followed by grid dimensions <20, 50> and <10, 100>. Similar behaviour can be observed for AES-192 and AES-256. The execution time is highest for grid dimension <1000, 1>, followed by <100, 10>.
[Graph 2: Performance comparison of varied grid dimensions with granularity 2 for 32000 bytes input. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <10,100>, <20,50>, <25,40>, <40,25>, <50,20>, <100,10>, <1000,1>]
Graph 3 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000 bytes with granularity 10, where a total of 200 threads are utilized. It can be seen from the graph that AES-128, AES-192 and AES-256 give the fastest execution for grid dimension <8, 25>, followed by grid dimensions <2, 100> and <10, 20>. The slowest performance is obtained for grid dimension <200, 1>, followed by <100, 2>.
[Graph 3: Performance comparison of varied grid dimensions with granularity 10 for 32000 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <2,100>, <8,25>, <10,20>, <20,10>, <25,8>, <100,2>, <200,1>]
[Graph 4: Performance comparison of varied grid dimensions with granularity 100 for 32000 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <1,20>, <2,10>, <4,5>, <5,4>, <10,2>, <20,1>]
Graph 4 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000 bytes with granularity 100, where a total of 20 threads are utilized. It can be seen from the graph that the fastest execution is obtained for grid dimension <4, 5>, followed by <2, 10>. On the other hand, the highest execution time is taken by grid dimension <20, 1>.
Graph 5 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000*5 bytes with granularity 1, where a total of 2000 threads are utilized. It can be seen from the graph that AES-128, AES-192 and AES-256 give the fastest execution for grid dimension <40, 50>, followed by <20, 100>. The highest execution time is taken by grid dimension <2000, 1>, followed by <1000, 2>.
[Graph 5: Performance comparison of varied grid dimensions with granularity 1 for 32000*5 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <2,1000>, <20,100>, <40,50>, <50,40>, <100,20>, <1000,2>, <2000,1>]
Graph 6 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000*5 bytes with granularity 2, where a total of 1000 threads are utilized. It can be seen from the graph that the fastest execution is given by grid dimensions <25, 40>, <20, 50> and <10, 100>. The highest execution time is taken by grid dimension <1000, 1>, followed by <100, 10>.
[Graph 6: Performance comparison of varied grid dimensions with granularity 2 for 32000*5 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <10,100>, <20,50>, <25,40>, <40,25>, <50,20>, <100,10>, <1000,1>]
[Graph 7: Performance comparison of varied grid dimensions with granularity 10 for 32000*5 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <2,100>, <8,25>, <10,20>, <20,10>, <25,8>, <100,2>, <200,1>]
Graph 7 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000*5 bytes with granularity 10, where a total of 200 threads are utilized. It can be seen from the graph that grid dimensions <8, 25> and <10, 20> give the fastest execution, followed by <2, 100>. The highest execution time is given by grid dimension <200, 1>, followed by <100, 2>.
[Graph 8: Performance comparison of varied grid dimensions with granularity 100 for 32000*5 bytes data. Y-axis: execution time (sec); series: AES 128, AES 192, AES 256; X-axis: <1,20>, <2,10>, <4,5>, <5,4>, <10,2>, <20,1>]
Graph 8 shows the variation in execution time of AES-128, AES-192 and AES-256 for a data size of 32000*5 bytes with granularity 100, where a total of 20 threads are utilized. It can be seen from the graph that the fastest execution is obtained for grid dimension <4, 5>, followed by grid dimension <2, 10>. The highest execution time is given by grid dimension <20, 1>, followed by <10, 2>.
A multi-core CPU has multiple processing cores on a single chip. The Intel® Xeon® CPU E5-1650 used in this research has 6 independent physical cores, each providing 2 logical cores. POSIX thread programming provides the benefit of efficiently utilizing the available cores of a system towards solving a time-consuming, complex problem. In this research, a total of 12 threads are utilized for implementing the AES-128, AES-192 and AES-256 encryption algorithms for data sizes of 32000 bytes and 32000*5 bytes independently. Readings are taken for a single process run (k=1) and for 10 runs (k=10). The purpose of executing the process 10 times is to acquire better readability of the data, making comparison easy. This section presents the results obtained from this implementation, followed by comparisons with the previous implementations. Tables 28, 29 and 30 show the average execution time, standard
deviation and confidence interval for multi threaded Pthreads programming for AES 128,
AES 192 and AES 256 respectively.
Multi threaded C using POSIX THREAD for AES 128 (No of threads = 12)
                     Data size = 32000 bytes   Data size = 32000*5 bytes
Average              0.0095                    0.0375
Standard Deviation   0.00394                   0.004136
Confidence Interval  0.007773-0.011227         0.035687-0.039313

Multi threaded C using POSIX THREAD for AES 192 (No of threads = 12)
                     Data size = 32000 bytes   Data size = 32000*5 bytes
Average              0.0103                    0.04175
Standard Deviation   0.003278                  0.006544
Confidence Interval  0.008863-0.011737         0.038882-0.044618
[Graph 9: Performance comparison of C, Pthreads and CUDA for 32000 bytes data. Y-axis: execution time (sec); series: Single threaded C, Multi threaded C, CUDA; X-axis: AES 128, AES 192, AES 256]
Graph 10 shows the variation in execution time for the data size of 32000*5 bytes. It is observed that for AES-128, the Pthreads program is ~4 times faster than the single-threaded C program, while CUDA on the GPU is ~3 times faster than Pthreads on the CPU. For AES-192, the Pthreads program is ~5 times faster than C and ~3 times slower than CUDA. For AES-256, the Pthreads program gives an execution time ~5 times lower than single-threaded C, while CUDA is ~2.5 times faster than the Pthreads program.
[Graph 10: Performance comparison of C, Pthreads and CUDA for 32000*5 bytes data. Y-axis: execution time (sec); series: Single threaded C, Multi threaded C, CUDA; X-axis: AES 128, AES 192, AES 256]
For a summarized view, the execution times of the AES algorithm are listed in ascending order below.
*G1, G2, G10 and G100 denote granularity 1, 2, 10 and 100 respectively.
The AES-128, AES-192 and AES-256 algorithms are then executed using CUDA STREAMS. Readings are taken for a single process run (k=1) and for 10 runs (k=10). The purpose of executing the process 10 times is to acquire better readability of the data, making comparison easy. The tables below show the average execution time, standard deviation and confidence interval of the values acquired during the process. The readings for k=10 are shown in APPENDIX A. The unit of measurement is seconds.
Case 1:
Data size: 32000 bytes
Number of streams used: 2
Data in each stream: 16000 bytes
k=1 k=10
Table 31: AES-128 Execution using CUDA STREAMS for 2 streams with 16000 bytes each
The CUDA STREAMS program for AES-128 with a data size of 32000 bytes, where each stream takes 16000 bytes as input, gives an execution time of ~0.002 sec, which is approximately equal to the time taken by CUDA without streams. There is no acceleration in the execution of the algorithm using CUDA STREAMS.
Case 2:
Data size: 32000*2 bytes
Number of streams used: 2
Data in each stream: 32000 bytes
k=1 k=10
Table 32: AES-128 Execution using CUDA STREAMS for 2 streams with 32000 bytes each
For a data size of 32000*2 bytes using two streams, each stream taking 32000 bytes as input, the execution time is ~0.0027 seconds, which is approximately equal to the time taken to execute the 32000-byte input in the first case, with a small variation of factor 1.3.
Case 3:
Data size: 32000*5 bytes
Number of streams used: 5
Data in each stream: 32000 bytes
k=1 k=10
Table 33: AES-128 Execution using CUDA STREAMS for 5 streams with 32000 bytes each
CUDA STREAMS with 5 streams, each stream taking 32000 bytes as input (total input size = 32000*5 bytes), takes ~0.010 seconds, which is ~1.11 times faster than execution using CUDA with a single stream.
Case 1:
Data size: 32000 bytes
Number of streams used: 2
Data in each stream: 16000 bytes
                     k=1                 k=10
Confidence Interval  0.002469-0.004530   0.032248-0.034752
Table 34: AES-192 Execution using CUDA STREAMS for 2 streams with 16000 bytes each
The CUDA STREAMS program for AES-192 with a data size of 32000 bytes, where each stream takes 16000 bytes as input, gives an execution time approximately equal to the time taken by the CUDA program without streams. There is no acceleration in execution speed using CUDA STREAMS.
Case 2:
Data size: 32000*2 bytes
Number of streams used: 2
Data in each stream: 32000 bytes
k=1 k=10
Table 35: AES 192 Execution using CUDA STREAMS for 2 streams with 32000 bytes each
For a data size of 32000*2 bytes using two streams, each stream taking 32000 bytes as input, the execution time is ~0.007 seconds, which is twice the time taken in case 1.
Case 3:
Data size: 32000*5 bytes
Number of streams used: 5
Data in each stream: 32000 bytes
k=1 k=10
Table 36: AES-192 Execution using CUDA STREAMS for 5 streams with 32000 bytes each
CUDA STREAMS with 5 streams, each stream taking 32000 bytes as input (total input size = 32000*5 bytes), takes ~0.013 seconds, which is ~1.05 times faster than CUDA without streams.
Case 1:
Data size: 32000 bytes
Number of streams used: 2
Data in each stream: 16000 bytes
k=1                  k=10
Table 37: AES-256 Execution using CUDA STREAMS for 2 streams with 16000 bytes each
The CUDA STREAMS program for AES-256 with a data size of 32000 bytes, where each stream takes 16000 bytes as input, gives an execution time approximately equal to the time taken by CUDA without streams.
Case 2:
Data size: 32000*2 bytes
Number of streams used: 2
Data in each stream: 32000 bytes
k=1 k=10
Table 38: AES-256 Execution using CUDA STREAMS for 2 streams with 32000 bytes each
For a data size of 32000*2 bytes using two streams, each stream taking 32000 bytes as input, the execution time is ~0.008 seconds, which is ~2 times higher than the execution time in case 1.
Case 3:
Data size: 32000*5 bytes
Number of streams used: 5
Data in each stream: 32000 bytes
                     k=1                 k=10
Confidence Interval  0.015276-0.017224   0.148769-0.151231
Table 39: AES-256 Execution using CUDA STREAMS for 5 streams with 32000 bytes each
CUDA STREAMS with 5 streams, each stream taking 32000 bytes as input (total input size = 32000*5 bytes), takes ~0.015 seconds, which is ~1.03 times the time taken by the CUDA program without streams.
The results show that CUDA STREAMS are ineffective at accelerating the execution of the AES algorithm compared to CUDA without streams. However, there is a small increase in speed for a higher data size per stream.
CHAPTER 5 DISCUSSIONS
5.1 VALIDITY THREATS
There are two types of validity threats that need to be considered while conducting research, namely internal validity and external validity.
Internal validity refers to the ability of the research to establish the relation between cause and effect [38]. Deviation from this validity can affect the correctness of the results and the inferences drawn from them. In order to avoid this validity threat, the following steps have been taken:
• The algorithm design has been thoroughly understood before formulating the code.
• 20 readings have been taken in every case in order to maintain the accuracy of the results for a fair comparison. The standard deviation and confidence interval have been calculated to ensure consistency in the results. The readings are taken up to 6 digits after the decimal point to ensure accuracy.
• The tests have been performed for k=1 and k=10, where k denotes the number of process runs. The purpose is to increase readability for accurate and easy comparison. The tests have given consistent results.
• Possible optimizations have been applied while formulating the code in order to present a valid comparison. Global memory is used wherever applicable so as to reduce the latency of accessing memory; each work item can access the data directly from global memory. This data includes the SBOX, the multiplication tables and the cipher key, which are common for the entire data chunk.
• The compiler has been set to full optimization to get the fastest execution possible (an illustrative example is given after this list).
• The correctness of the code has been tested step by step at every stage to ensure valid output.
• The consistency of the generated output has been tested for every implementation in order to maintain a correct cipher-text.
• The code optimisation level is kept constant for all the implementations to make sure that the amount of work done in a single process by a core remains constant across all the devices for a fair comparison.
• The performance variation pattern has been critically observed for the different combinations across the three algorithms to ensure consistency and accuracy.
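As an illustration only (the exact build settings are not recorded here), full optimization can be requested as follows; the flags shown are standard MSVC and nvcc options, and the file names are hypothetical, not a record of the actual commands used:

cl /O2 aes_cpu.c                 (host C compiler, maximize speed)
nvcc -O3 -arch=sm_30 aes_gpu.cu  (CUDA compiler, Kepler compute capability 3.0)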
External validity deals with the generalizability of the results of the research work outside the study [38]. The functions in the algorithm are formed with reference to the defined standard; they are therefore universal and acceptable, and would hold good for execution in a real-time environment. Although more optimizations could be performed in order to get better performance, this research work applied considerable optimizations. The obtained performance is independent of the type of input, assuring the correctness of the results obtained.
This research work uses the data parallelism technique on the AES algorithm. Hence, it is the data size that effects the change in the performance of the devices, and not the algorithm complexity. The inference acquired through this research is expected to hold good for any application to which data parallelism is applied.
The execution time on the device depends upon the kind of GPU used. This research work uses an NVIDIA Quadro K4000, which has the Kepler architecture. The algorithm performance may show variation in the comparison factor between two devices, yet the general observation holds good for all NVIDIA GPUs using CUDA.
5.2 DISCUSSIONS
The performance of the algorithm on the devices is seen to be higher in two cases of thread/block distribution. One is when there is a balanced distribution of blocks and threads in each block. For example, in the case of granularity 1, 2, 10 and 100, the grid dimensions <40, 50>, <25, 40>, <8, 25> and <4, 5> respectively give, on average, relatively higher
performance compared to the other combinations. The other case, giving approximately equal or slightly lower performance, is when the threads of each block are effectively utilized while the balance of the distribution is still considered. For example, in the case of granularity 1, 2, 10 and 100, the grid dimensions <20, 100>, <10, 100>, <20, 10> and <2, 10> respectively give, on average, the next highest performance. It can also be deduced that for extreme grid dimensions the execution time is higher than for the other combinations; this is the situation in which the threads in each block are not efficiently utilized. For example, in the case of granularity 1, 2, 10 and 100, the grid dimensions <2000, 1>, <1000, 1>, <200, 1> and <20, 1> respectively give the lowest performance. From the above observations it is understood that, in order to obtain maximum performance, there needs to be a trade-off between the number of blocks and the number of threads used. The grid needs to be organized such that the maximum number of blocks with the maximum number of threads in each block gives a considerably higher performance. The ratio shown below should be greater than 1, and as small as possible:

number of threads / number of blocks

A higher ratio tends towards a higher execution time, as can be seen in graphs 1 to 8.
Effect of granularity:
From the deduced results it is observed that for all three versions of the AES algorithm, granularity 2 provides the fastest execution, giving the least execution time, followed by granularity 1 and granularity 10, whereas granularity 100 gives the worst performance. From this observation it is inferred that, to achieve high GPU performance, there needs to be a proper trade-off between the amount of work done per thread and the utilization of the threads in the device. In the case of granularity 10 and 100, the amount of work done by each thread is high, but the threads in the device are not efficiently utilized. The execution time for granularity 100 is higher than that of the single-threaded C program, and for the larger data size the execution times of granularity 10 and granularity 100 grow further, giving an unacceptable reduction in performance. The reason is that, in the case of granularity 100, the program as a whole consists of heavy serial computation and hence performs better on the CPU. The conclusion is that "higher granularities serve better, provided the program contains sufficient parallelism".
The CPU used in this experiment is the Intel® Xeon® CPU E5-1650, which has 6 physical cores and 12 logical cores. Since each physical core provides 2 logical cores, utilizing 12 threads gives the best performance from the CPU. For a data size of 32000 bytes, 12 threads are utilized such that 10 threads handle 167 sets of input each and 2 threads handle 165 sets each, making a total of 2000 sets. For the data size of 32000*5 bytes, 10 threads handle 833 data chunks each and 2 threads handle 835 data chunks each. The variation in the data division is negligible and has no impact on the reliability of the results.
CUDA STREAMS are effective when each stream has considerable data to process; the effect of latency can be compensated when the amount of work to be processed is sufficiently high. In this research work it can be observed that, for a data size of 32000 bytes, two streams with 16000 bytes each give a performance which is lower than the performance of the CUDA program with a single stream. The reason is the lack of effective utilization of the threads in each
stream. With each stream instead taking 32000 bytes of data, execution is performed in the same amount of time. The CUDA STREAM program accelerated the performance, giving a slightly lower execution time, in the case of the 32000*5 bytes data size, in which 5 streams are utilized such that each stream handles 32000 bytes of data. In order to analyse the pattern of performance variation, an intermediate data size of 64000 bytes was also chosen, with two streams utilized such that each stream handles 32000 bytes of data. This implementation shows a small acceleration in performance. It can be inferred that CUDA STREAMS are effective only for larger data sizes with effective utilization of the threads in each stream.
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
RQ1:
a) Does the GPU outperform the single core CPU in the implementation of the AES algorithm?
Answer:
The GPU outperforms the single core CPU in the implementation of the AES algorithm. For a data size of 32000 bytes, it accelerates the execution of AES-128, AES-192 and AES-256 by factors of 22, 21 and 12 respectively. For a data size of 32000*5 bytes, in which the data is divided into chunks of 32000 bytes each to be processed by the GPU in sequence, it accelerates the execution of AES-128, AES-192 and AES-256 by factors of 13, 12 and 12 respectively.
b) Does the GPU outperform the multi core CPU in the implementation of the AES algorithm?
Answer:
The GPU outperforms the multi core CPU in the implementation of the AES algorithm. For a data size of 32000 bytes, the GPU performs 6, 5 and 3 times faster than the multi core CPU for AES-128, AES-192 and AES-256 respectively. For the data size of 32000*5 bytes, where the data is sequentially processed in chunks of 32000 bytes each, the GPU is 3, 3 and 2.5 times faster than the CPU for AES-128, AES-192 and AES-256 respectively.
RQ2. Does the use of CUDA STREAMS have a positive impact on the performance of the AES algorithm?
Answer:
The use of CUDA STREAMS does not have a positive impact on the performance of the AES algorithm for a data size of 32000 bytes. However, it shows an increase in performance for a data size of 32000*5 bytes. Hence, CUDA STREAMS can have a positive impact on the AES algorithm for larger data sizes with considerable data in each stream.
This thesis work is limited by many constraints, of which time is a crucial one. To expand
the findings related to the exploitation of parallelism using GPUs, the following are some
avenues for future work.
• This research work can be repeated on other cryptographic ciphers, such as Blowfish,
Serpent and Salsa20, to find out whether the results hold for those implementations.
• It would be of value to exploit task parallelism to accelerate the AES algorithm.
• A similar algorithm model can be implemented on different GPUs in order to compare
the results and draw more general conclusions.
• It would be interesting to see how a heterogeneous system, i.e., a combination of CPU
and GPU, could accelerate the execution of the AES algorithm.
BIBLIOGRAPHY
[1] B. P. Luken, M. Ouyang, and A. H. Desoky, “AES and DES Encryption with GPU,” in ISCA
PDCCS, 2009, pp. 67–70.
[2] L. Swierczewski, "3DES ECB Optimized for Massively Parallel CUDA GPU Architecture,"
arXiv:1305.4376 [cs], May 2013.
[3] H.-P. Yeh, Y.-S. Chang, C.-F. Lin, and S.-M. Yuan, “Accelerating 3-DES Performance Using
GPU,” in International Conference on Cyber-Enabled Distributed Computing and Knowledge
Discovery (CyberC), 2011, pp. 250–256.
[4] F. Shao, Z. Chang, and Y. Zhang, “AES Encryption Algorithm Based on the High Performance
Computing of GPU,” in Second International Conference on Communication Software and
Networks, 2010. ICCSN ’10, 2010, pp. 588–590.
[5] X. Wang, X. Li, M. Zou, and J. Zhou, “AES finalists implementation for GPU and multi-core
CPU based on OpenCL,” in IEEE International Conference on Anti-Counterfeiting, Security and
Identification (ASID), 2011, pp. 38–42.
[6] H. Zhang, D. Zhang, and X. Bi, “Comparison and Analysis of GPGPU and Parallel Computing
on Multi-Core CPU.”
[7] M. Bobrov, “Cryptographic algorithm acceleration using CUDA enabled GPUs in typical system
configurations,” Theses, Aug. 2010.
[8] N.-P. Tran, M. Lee, and D. H. Choi, “Heterogeneous parallel computing for data encryption
application,” in 6th International Conference on Computer Sciences and Convergence
Information Technology (ICCIT), 2011, pp. 562–566.
[9] D. Le, J. Chang, X. Gou, A. Zhang, and C. Lu, “Parallel AES algorithm for fast Data Encryption
on GPU,” in 2nd International Conference on Computer Engineering and Technology (ICCET),
2010, vol. 6, pp. V6–1–6.
[10] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and Analysis of AES
Encryption on GPU,” in 2012 IEEE 14th International Conference on High Performance
Computing and Communication 2012 IEEE 9th International Conference on Embedded Software
and Systems (HPCC-ICESS), 2012, pp. 843–848.
[11] A. M. Chiuta, "AES Encryption and Decryption Using Direct3D 10 API," arXiv:1201.0398 [cs],
Jan. 2012.
[12] T. R. Daniel and S. Mircea, “AES on GPU using CUDA,” in European Conference for the
Applied Mathematics & Informatics. World Scientific and Engineering Academy and Society
Press, 2010.
[13] R. Inam, “An Introduction to GPGPU Programming-CUDA Architecture,” Mälardalen Univ.
Mälardalen Real-Time Res. Cent., 2011.
[14] C. Cullinan, C. Wyant, T. Frattesi, and X. Huang, "Computing Performance Benchmarks among
CPU, GPU, and FPGA," Worcester Polytechnic Institute, project report, 2013.
[15] H. Jo, S.-T. Hong, J.-W. Chang, and D. H. Choi, “Data Encryption on GPU for High-
Performance Database Systems,” Procedia Comput. Sci., vol. 19, pp. 147–154, 2013.
[16] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy,
S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, “Debunking the 100X GPU vs. CPU
Myth: An Evaluation of Throughput Computing on CPU and GPU,” in Proceedings of the 37th
Annual International Symposium on Computer Architecture, New York, NY, USA, 2010, pp.
451–460.
[17] O. Gervasi, D. Russo, and F. Vella, “The AES Implantation Based on OpenCL for Multi/many
Core Architecture,” in International Conference on Computational Science and Its Applications
(ICCSA), 2010, pp. 129–134.
[18] N. Nishikawa, K. Iwai, H. Tanaka, and T. Kurokawa, “Throughput and Power Efficiency
Evaluations of Block Ciphers on Kepler and GCN GPUs,” in First International Symposium on
Computing and Networking (CANDAR), 2013, pp. 366–372.
[19] C. Gregg and K. Hazelwood, “Where is the data? Why you cannot debate CPU vs. GPU
performance without the answer,” in IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS), 2011, pp. 134–144.
[20] M. Taher, “Accelerating scientific applications using GPU’s,” in Design and Test Workshop
(IDT), 2009 4th International, 2009, pp. 1–6.
[21] V. Venugopal and D. M. Shila, "High throughput implementations of cryptography algorithms on
GPU and FPGA,” in Instrumentation and Measurement Technology Conference (I2MTC), 2013
IEEE International, 2013, pp. 723–727.
[22] D. L. Cook, J. Ioannidis, A. D. Keromytis, and J. Luck, “CryptoGraphics: Secret key
cryptography using graphics cards,” in Topics in Cryptology–CT-RSA 2005, Springer, 2005, pp.
334–350.
[23] K. Iwai, T. Kurokawa, and N. Nisikawa, “AES Encryption Implementation on CUDA GPU and
Its Analysis,” in 2010 First International Conference on Networking and Computing (ICNC),
2010, pp. 209–214.
[24] P. Maistri, F. Masson, and R. Leveugle, “Implementation of the Advanced Encryption Standard
on GPUs with the NVIDIA CUDA framework,” in 2011 IEEE Symposium on Industrial
Electronics and Applications (ISIEA), 2011, pp. 213–217.
[25] "FIPS 197, Advanced Encryption Standard (AES)." [Online]. Available:
http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[26] V. Xouris, “Parallel hashing, compression and encryption with opencl under os x,” Master’s
thesis, School of Informatics-University of Edinburgh, 2010.
[27] “GPU Gems 3 - Chapter 36. AES Encryption and Decryption on the GPU.” [Online]. Available:
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html.
[28] C. McClanahan, "History and Evolution of GPU Architecture," survey paper, 2010. [Online].
Available: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf.
[29] A. Barnes, R. Fernando, K. Mettananda, and R. Ragel, “Improving the throughput of the AES
algorithm with multicore processors,” in 2012 7th IEEE International Conference on Industrial
and Information Systems (ICIIS), 2012, pp. 1–6.
[30] “On Fair Comparison between CPU and GPU.” [Online]. Available:
http://www.eecs.berkeley.edu/~sangjin/2013/02/12/CPU-GPU-comparison.html.
[31] J. Ortega, H. Trefftz, and C. Trefftz, “Parallelizing AES on multicores and GPUs,” in
Proceedings of the IEEE International Conference on Electro/Information Technology (EIT),
2011, pp. 15–17.
[32] C. So-In, S. Poolsanguan, C. Poonriboon, K. Rujirakul, and C. Phudphut, “Performance
Evaluation of Parallel AES Implementations over CUDA GPU Framework,” Int J Digit. Content
Technol. Its Appl., vol. 7, no. 5, pp. 501–511, 2013.
[33] "CUDA Streams and Concurrency Webinar." [Online]. Available: http://on-
demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf.
[34] "POSIX Threads Programming." [Online]. Available: https://computing.llnl.gov/tutorials/pthreads/.
[35] J. Gómez-Luna, J. M. González-Linares, J. I. Benavides, and N. Guil, "Performance models for
CUDA streams on NVIDIA GeForce series," Technical Report UMA-DAC-11-02, University of
Málaga, 2011. [Online]. Available: http://www.ac.uma.es/vip/publications/UMA-DAC-11-02.pdf.
[36] M. Domeika, “Development and Optimization Techniques for Multi-Core Processors,” Dr.
Dobb’s. [Online]. Available: http://www.drdobbs.com/development-and-optimization-
techniques/212600040.
[37] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors, Second Edition: A
Hands-on Approach, 2 edition. Amsterdam; Boston; Waltham, Mass.: Morgan Kaufmann, 2012.
[38] S. Taylor and G. J. G. Asmundson, "Internal and External Validity in Clinical Research."
[Online]. Available: http://www.sagepub.com/upm-data/19352_Chapter_3.pdf.
[39] "AES Algorithm Flash: Rijndael," 2004. [Online]. Available: http://www.cs.bc.edu/~straubin/cs381-
05/blockciphers/rijndael_ingles2004.swf.
[40] "Huong Nguyen: Computer Systems." [Online]. Available: http://cnx.org/contents/611fa6c7-a16d-
460a-a221-ae57ff2379be@1.
[41] “iXBT Labs Review - NVIDIA CUDA,” iXBT Labs. [Online]. Available:
http://ixbtlabs.com/articles3/video/cuda-1-p1.html.
[42] "Ron Maltiel: Semiconductor Experts, Witnesses, Consultants and Patent Litigation Support,"
Mar. 2012. [Online]. Available: http://ixbtlabs.com/articles3/video/cuda-1-p1.html.
APPENDIX A
This section presents the execution times for the AES 128, AES 192 and AES 256
encryption algorithms, using single-threaded C, multi-threaded C (Pthreads), and CUDA. 20
runs have been taken for every implementation, for a single process (k=1) and for 10
processes (k=10). As mentioned earlier, the process is run 10 times in order to obtain
readable data for easy comparison; the execution time for k=10 is approximately 10 times
that of k=1. This section also shows the average execution time, standard deviation and
confidence interval for every implementation on the CPU and the GPU for k=10, as a
continuation of Chapter 4.
Tables A.1 to A.6 show the execution times for the AES implementations on the single-threaded
CPU.
Tables A.7 to A.78 show the execution times with different granularities and grid
dimensions on the GPU using CUDA.
Tables A.79 to A.84 show the execution times obtained on the multi-threaded CPU using
Pthreads.
Tables A.85 to A.93 show the execution times obtained on the GPU for different data
sizes using CUDA STREAMS.
Tables A.1 to A.3: only the header rows (k=10; columns for data sizes 32000 bytes and 32000*5 bytes) survived the text extraction.
Table A.4: Calculations for AES 192 using single-threaded C on the single-core CPU (k=10). Surviving row: Average 0.40955 (32000 bytes), 1.90635 (32000*5 bytes).
Of the remaining GPU tables (A.5 to A.78), only the grid-dimension and standard-deviation rows reproduced below survived extraction (all for k=10); the individual run times, averages and most confidence intervals were lost. The surviving table captions are shown in place.
Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.002351   0.002565   0.002221   0.002052   0.001970   0.002447   0.002294

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.002221   0.002751   0.002447   0.002552   0.002513   0.002447   0.043543

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Standard deviation:  0.001118   0.002552   0.002552   0.001118   0.002552   0.002552   0.002221

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.002294   0.002565   0.001622   0.002447   0.002426   0.002552

Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.004128   0.004560   0.004472   0.004472   0.004375   0.003582   0.003244

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.004378   0.020344   0.005849   0.003940   0.004833   0.005501   0.006973

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Standard deviation:  0.004253   0.005495   0.004128   0.004007   0.003582   0.004064   0.002936

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.004692   0.002552   0.003354   0.002552   0.004253   0.003024

Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.002221   0.002441   0.002447   0.002221   0.002221   0.002052   1.4238E-17

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.002936   0.001832   0.002552   0.002221   0.002052   0.002936   0.002350

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Standard deviation:  0.002447   0.002565   0.001970   0.002513   0.002513   0.002052   0.001538

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.009515   0.014980   0.002052   0.228757   0.002763   0.003770
Confidence interval: 0.697830-0.070617   0.700185-0.713315   0.695101-0.696899   0.500495-0.701005   0.748289-0.750711   0.711348-0.714652
Table A.42: Calculations for AES 192 using CUDA on GPU with granularity 100 (Data size 32000, k=10)

Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.006708   0.005257   0.005712   0.004253   0.003770   0.005730   0.002221

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.005596   0.004702   0.005684   0.005955   0.006382   0.006198   0.004128

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Standard deviation:  0.004375   0.011571   0.004560   0.004552   0.003204   0.003663   0.002236

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.001118   0.002936   0.002856   0.011059   0.002936   0.002552

Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.002236   0.002447   0.002763   0.002616   0.002552   0.002052   0.002552

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.002236   0.002351   0.001832   0.002751   0.002616   0.002552   0.003432
Confidence interval: 0.029520-0.031480   0.055470-0.057530   0.029947-0.031553   0.030045-0.032455   0.029854-0.032146   0.031132-0.033368   0.403246-0.406254
Table A.60: Calculations for AES 256 using CUDA on GPU with granularity 2 (Data size 32000, k=10)

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Standard deviation:  0.002513   0.001832   0.002513   0.002447   0.002351   0.002236   0.001832

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.002751   0.001118   0.002552   0.002552   0.003024   0.002513

Grid dimension:      <20,100>   <100,20>   <50,40>   <40,50>   <1000,2>   <2,1000>   <2000,1>
Standard deviation:  0.005104   0.005104   0.005356   0.006156   0.002447   0.005821   0.002552

Grid dimension:      <10,100>   <100,10>   <50,20>   <20,50>   <40,25>   <25,40>   <1000,1>
Standard deviation:  0.004702   0.006340   0.005982   0.005155   0.005525   0.005350   0.009854
Table A.72: Calculations for AES 256 using CUDA on GPU with granularity 2 (Data size 32000*5, k=10)

Grid dimension:      <20,10>   <10,20>   <8,25>   <25,8>   <2,100>   <100,2>   <200,1>
Average:             0.533   0.49725   0.49025   0.5545   0.50175   1.079   2.0695
Standard deviation:  0.005477   0.005495   0.004993   0.003591   0.004375   0.004472   0.005104

Grid dimension:      <2,10>   <10,2>   <5,4>   <4,5>   <20,1>   <1,20>
Standard deviation:  0.022213   0.023508   0.025649   0.025649   0.024468   0.018317
Confidence interval: 4.027765-4.047235   4.054697-4.075303   4.063759-4.086241   4.013759-4.036241   4.321777-4.343223   4.084472-4.100528
Table A.78: Calculations for AES 256 using CUDA on GPU with granularity 100 (Data size 32000*5, k=10)

Table A.80: Calculations for AES 128 using POSIX threads on the multi-core CPU (k=10). Surviving rows:
Standard deviation:  0.005730 (32000 bytes)   0.009216 (32000*5 bytes)
Confidence interval: 0.065239-0.070261   0.267711-0.275789

Table A.82: Calculations for AES 192 using POSIX threads on the multi-core CPU (k=10). Surviving row:
Confidence interval: 0.080475-0.086525 (32000 bytes)   0.320420-0.333080 (32000*5 bytes)

Table A.84: Calculations for AES 256 using POSIX threads on the multi-core CPU (k=10). Surviving rows:
Standard deviation:  0.004375 (32000 bytes)   0.080580 (32000*5 bytes)
Confidence interval: 0.088832-0.092668   0.331210-0.401840

Of the CUDA STREAMS tables (A.85 to A.93), only the following captions survived extraction:
Table A.86: Execution time of AES 128 using CUDA STREAMS on GPU (Data size 32000*2)
Table A.87: Execution time of AES 128 using CUDA STREAMS on GPU (Data size 32000*5)
Table A.89: Execution time of AES 192 using CUDA STREAMS on GPU (Data size 32000*2)
APPENDIX B
This section presents the programs implementing AES-128, AES-192 and AES-256 using C,
Pthreads, CUDA and CUDA STREAMS.
// Multiplication 2 table
// Multiplication 3 table
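The byte values of the two lookup tables did not survive extraction. As a sketch, they can be regenerated from the GF(2^8) doubling (xtime) operation defined in FIPS-197; the names mul2 and mul3 are illustrative, not taken from the thesis code:

static int mul2[256], mul3[256];

/* Build the multiply-by-2 and multiply-by-3 tables over GF(2^8). */
static void build_mult_tables(void) {
    int x;
    for (x = 0; x < 256; x++) {
        int d = (x << 1) & 0xFF;   /* double the byte                  */
        if (x & 0x80)
            d ^= 0x1B;             /* reduce modulo x^8 + x^4 + x^3 + x + 1 */
        mul2[x] = d;
        mul3[x] = d ^ x;           /* 3*x = 2*x XOR x                  */
    }
}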
/* Rotate a 4-byte word left by one position. Only the tail of this
   function survived extraction; the opening lines are reconstructed. */
void shift(int *a) {
    int i, temp1;
    temp1 = a[0];
    for (i = 0; i < 3; i++) {
        a[i] = a[i + 1];
    }
    a[3] = temp1;
}
/* Shift Row (the signature was lost in extraction; presumed as below) */
void shiftrow(int *a) {
int i;
int temp1[4],temp2[4],temp3[4],temp4[4];
i=0;
while(i<4){
temp1[i]=a[i];
temp2[i]=a[i+4];
temp3[i]=a[i+8];
temp4[i]=a[i+12];
i++;
}
shift(temp2);
shift(temp3);
shift(temp3);
shift(temp4);
shift(temp4);
shift(temp4);
i=0;
while(i<4) {
a[i]=temp1[i];
a[i+4]=temp2[i];
a[i+8]=temp3[i];
a[i+12]=temp4[i];
i++;
}
}
//Substitute Byte
void subbyte(int *a) {
int i;
int pt1,pt2;
for (i = 0; i < 16; i++){
pt1 = ((240 & a[i]) / 16);  /* high nibble selects the S-box row    */
pt2 = (15 & a[i]);          /* low nibble selects the S-box column  */
a[i] = sbox[pt1][pt2];
}
}
//Mix Column
void mixcolumn(int *a) {
int m1,n1,o1,p1,m2,n2,o2,p2;
m1=a[0]; n1=a[4];o1=a[8];p1=a[12];
m2=m1; n2=n1;o2=o1;p2=p1;
mult2(m1); mult2(o1);
mult3(m2); mult3(o2);
mult2(n1); mult2(p1);
mult3(n2); mult3(p2);
m1=a[1]; n1=a[5];o1=a[9];p1=a[13];
m2=m1; n2=n1;o2=o1;p2=p1;
mult2(m1); mult2(o1);
mult3(m2); mult3(o2);
mult2(n1); mult2(p1);
mult3(n2); mult3(p2);
m1=a[2]; n1=a[6];o1=a[10];p1=a[14];
m2=m1; n2=n1;o2=o1;p2=p1;
mult2(m1); mult2(o1);
mult3(m2); mult3(o2);
mult2(n1); mult2(p1);
mult3(n2); mult3(p2);
m1=a[3]; n1=a[7];o1=a[11];p1=a[15];
m2=m1; n2=n1;o2=o1;p2=p1;
mult2(m1); mult2(o1);
mult3(m2); mult3(o2);
mult2(n1); mult2(p1);
mult3(n2); mult3(p2);
/* (the statements recombining these products into a[] were lost in
   extraction; see the sketch below) */
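A self-contained sketch of the complete transformation, following the MixColumns definition in FIPS-197 and the state layout used above (column j occupies a[j], a[j+4], a[j+8], a[j+12]; mul2/mul3 as in the earlier sketch), might read:

/* Sketch of the full MixColumns step, per FIPS-197. */
void mixcolumn_sketch(int *a) {
    int j;
    for (j = 0; j < 4; j++) {
        int s0 = a[j], s1 = a[j + 4], s2 = a[j + 8], s3 = a[j + 12];
        a[j]      = mul2[s0] ^ mul3[s1] ^ s2 ^ s3;
        a[j + 4]  = s0 ^ mul2[s1] ^ mul3[s2] ^ s3;
        a[j + 8]  = s0 ^ s1 ^ mul2[s2] ^ mul3[s3];
        a[j + 12] = mul3[s0] ^ s1 ^ s2 ^ mul2[s3];
    }
}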
/* Fragment of the AES-128 round-key routine (presumably nextrk, as in
   the AES-192 version further below); its opening lines were lost in
   extraction. */
tempa[0] = rk1[0];
tempb[0] = rk1[1];
tempc[0] = rk1[2];
tempd[0] = rk1[3];
tempa[1] = rk1[4];
tempb[1] = rk1[5];
tempc[1] = rk1[6];
tempd[1] = rk1[7];
tempa[2] = rk1[8];
tempb[2] = rk1[9];
tempc[2] = rk1[10];
tempd[2] = rk1[11];
tempa[3] = rk1[12];
tempb[3] = rk1[13];
tempc[3] = rk1[14];
tempd[3] = rk1[15];
shift(rot);
subbyte4(rot);
rk2[2] = rot[0];
rk2[6] = rot[1];
rk2[10] = rot[2];
rk2[14] = rot[3];
void keyexp(int *cipherkey, int *rk1, int *rk2, int *rk3, int *rk4, int *rk5, int *rk6, int *rk7, int *rk8,
int *rk9, int *rk10) {
int rcon1[4] = { 1, 0, 0, 0 };
int rcon2[4] = { 2, 0, 0, 0 };
int rcon3[4] = { 4, 0, 0, 0 };
int rcon4[4] = { 8, 0, 0, 0 };
int rcon5[4] = { 16, 0, 0, 0 };
int rcon6[4] = { 32, 0, 0, 0 };
int rcon7[4] = { 64, 0, 0, 0 };
int rcon8[4] = { 128, 0, 0, 0 };
int rcon9[4] = { 27, 0, 0, 0 };
int rcon10[4] = { 54, 0, 0, 0 };
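/* The remainder of keyexp was lost in extraction. By analogy with the
   AES-192 keyexp shown further below, the presumed body is ten calls to
   the round-key routine (a reconstruction, not the original listing): */
nextrk(cipherkey, rk1, rcon1);
nextrk(rk1, rk2, rcon2);
nextrk(rk2, rk3, rcon3);
nextrk(rk3, rk4, rcon4);
nextrk(rk4, rk5, rcon5);
nextrk(rk5, rk6, rcon6);
nextrk(rk6, rk7, rcon7);
nextrk(rk7, rk8, rcon8);
nextrk(rk8, rk9, rcon9);
nextrk(rk9, rk10, rcon10);
}

/* The round sequence below presumably forms the body of the encryption
   routine called from main(); per FIPS-197 it would open with an initial
   AddRoundKey using the cipher key (header and first line reconstructed): */
void encryption(int *state, int *cipherkey) {
addroundkey(state, cipherkey);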
subbyte(state); //Round 1
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk1);
subbyte(state); //Round 2
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk2);
subbyte(state); //Round 3
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk3);
subbyte(state); //Round 4
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk4);
subbyte(state); //Round 5
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk5);
subbyte(state); //Round 6
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk6);
subbyte(state); //Round 7
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk7);
subbyte(state); //Round 8
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk8);
subbyte(state); //Round 9
shiftrow(state);
mixcolumn(state);
addroundkey(state, rk9);
subbyte(state); //Round 10
shiftrow(state);
addroundkey(state, rk10);
}
int main()
{
int state[N],take[16];
int i, j, k;
double total_time;
clock_t start, end;
for(i=0;i<N ;i++){
state[i]=i;
}
start = clock(); //time record starts
keyexp(cipherkey, rk1, rk2, rk3, rk4, rk5, rk6, rk7, rk8, rk9, rk10); //key expansion
for(k=0; k<n ;k++) { // encryption performed n times
for(i=0;i<(N);i+=16) { //value passed in sets of 16 for each run
for(j=0;j<16;j++) {
take[j]=state[i+j];
}
encryption(take, cipherkey);
for(j=0;j<16;j++) {
state[i+j]=take[j];
}
}
} // k ends here
end = clock(); //time record stops
total_time = ((double)(end - start)) / CLOCKS_PER_SEC; // CLK_TCK is the obsolete name for CLOCKS_PER_SEC
printf("\n Time taken for aes128 execution in C is: %f\n", total_time);
return 0;
}
/* AES-192 key expansion: fragment of its nextrk routine (six 4-byte
   words tempa..tempf for the 24-byte key); the opening declarations were
   lost in extraction. */
tempa[0] = rk1[0];
tempb[0] = rk1[1];
tempc[0] = rk1[2];
tempa[1] = rk1[6];
tempb[1] = rk1[7];
tempc[1] = rk1[8];
tempa[2] = rk1[12];
tempb[2] = rk1[13];
tempc[2] = rk1[14];
tempa[3] = rk1[18];
tempb[3] = rk1[19];
tempc[3] = rk1[20];
tempd[0] = rk1[3];
tempe[0] = rk1[4];
tempf[0] = rk1[5];
tempd[1] = rk1[9];
tempe[1] = rk1[10];
tempf[1] = rk1[11];
tempd[2] = rk1[15];
tempe[2] = rk1[16];
tempf[2] = rk1[17];
tempd[3] = rk1[21];
tempe[3] = rk1[22];
tempf[3] = rk1[23];
for (i = 0; i < 4; i++) {
rot[i] = tempf[i];
}
shift(rot);
subbyte4(rot);
for (i = 0; i < 4; i++) {
rot[i] = rot[i] ^ tempa[i] ^ rcon[i];
}
rk2[0] = rot[0];
rk2[6] = rot[1];
rk2[12] = rot[2];
rk2[18] = rot[3];
for (i = 0; i < 4; i++) {
rot[i] = rot[i] ^ tempb[i];
}
rk2[1] = rot[0];
rk2[7] = rot[1];
rk2[13] = rot[2];
rk2[19] = rot[3];
for (i = 0; i < 4; i++){
rot[i] = rot[i] ^ tempc[i];
}
rk2[2] = rot[0];
rk2[8] = rot[1];
rk2[14] = rot[2];
rk2[20] = rot[3];
for (i = 0; i < 4; i++) {
rot[i] = rot[i] ^ tempd[i];
}
rk2[3] = rot[0];
rk2[9] = rot[1];
rk2[15] = rot[2];
rk2[21] = rot[3];
for (i = 0; i < 4; i++) {
rot[i] = rot[i] ^ tempe[i];
}
rk2[4] = rot[0];
rk2[10] = rot[1];
rk2[16] = rot[2];
rk2[22] = rot[3];
for (i = 0; i < 4; i++) {
rot[i] = rot[i] ^ tempf[i];
}
rk2[5] = rot[0];
rk2[11] = rot[1];
rk2[17] = rot[2];
rk2[23] = rot[3];
}
void keyexp(int *key, int *inr, int *r1,int *r2,int *r3,int *r4,int *r5,int *r6,
int *r7,int *r8,int *r9,int *r10,int *r11,int *r12 ) {
int rk1[24],rk2[24],rk3[24],rk4[24],rk5[24],rk6[24],rk7[24],rk8[24];int i;
int rcon1[4] = { 1, 0, 0, 0 };
int rcon2[4] = { 2, 0, 0, 0 };
int rcon3[4] = { 4, 0, 0, 0 };
int rcon4[4] = { 8, 0, 0, 0 };
int rcon5[4] = { 16, 0, 0, 0 };
int rcon6[4] = { 32, 0, 0, 0 };
int rcon7[4] = { 64, 0, 0, 0 };
int rcon8[4] = { 128, 0, 0, 0 };
nextrk(key,rk1,rcon1);
nextrk(rk1,rk2,rcon2);
nextrk(rk2,rk3,rcon3);
nextrk(rk3,rk4,rcon4);
nextrk(rk4,rk5,rcon5);
nextrk(rk5,rk6,rcon6);
nextrk(rk6,rk7,rcon7);
nextrk(rk7,rk8,rcon8);
for(i=0;i<4;i++) {
inr[i]=key[i];
r3[i]=rk2[i];
r6[i]=rk4[i];
r9[i]=rk6[i];
r12[i]=rk8[i];
inr[i+4]=key[i+4+2];
r3[i+4]=rk2[i+4+2];
r6[i+4]=rk4[i+4+2];
r9[i+4]=rk6[i+4+2];
r12[i+4]=rk8[i+4+2];
inr[i+8]=key[i+8+4];
r3[i+8]=rk2[i+8+4];
r6[i+8]=rk4[i+8+4];
r9[i+8]=rk6[i+8+4];
r12[i+8]=rk8[i+8+4];
inr[i+12]=key[i+12+6];
r3[i+12]=rk2[i+12+6];
r6[i+12]=rk4[i+12+6];
r9[i+12]=rk6[i+12+6];
r12[i+12]=rk8[i+12+6];
r2[i]=rk1[i+2];
r5[i]=rk3[i+2];
r8[i]=rk5[i+2];
r11[i]=rk7[i+2];
r2[i+4]=rk1[i+4+4];
r5[i+4]=rk3[i+4+4];
r8[i+4]=rk5[i+4+4];
r11[i+4]=rk7[i+4+4];
r2[i+8]=rk1[i+8+6];
r5[i+8]=rk3[i+8+6];
r8[i+8]=rk5[i+8+6];
r11[i+8]=rk7[i+8+6];
r2[i+12]=rk1[i+12+8];
r5[i+12]=rk3[i+12+8];
r8[i+12]=rk5[i+12+8];
r11[i+12]=rk7[i+12+8];
for(i=0;i<2;i++) {
r1[i]=key[i+4];
r4[i]=rk2[i+4];
r7[i]=rk4[i+4];
r10[i]=rk6[i+4];
r1[i+4]=key[i+10];
r4[i+4]=rk2[i+10];
r7[i+4]=rk4[i+10];
r10[i+4]=rk6[i+10];
r1[i+8]=key[i+16];
r4[i+8]=rk2[i+16];
r7[i+8]=rk4[i+16];
r10[i+8]=rk6[i+16];
r1[i+12]=key[i+22];
r4[i+12]=rk2[i+22];
r7[i+12]=rk4[i+22];
r10[i+12]=rk6[i+22];
r1[i+2]=rk1[i];
r4[i+2]=rk3[i];
r7[i+2]=rk5[i];
r10[i+2]=rk7[i];
r1[i+6]=rk1[i+6];
r4[i+6]=rk3[i+6];
r7[i+6]=rk5[i+6];
r10[i+6]=rk7[i+6];
r1[i+10]=rk1[i+12];
r4[i+10]=rk3[i+12];
r7[i+10]=rk5[i+12];
r10[i+10]=rk7[i+12];
r1[i+14]=rk1[18];
r4[i+14]=rk3[18];
r7[i+14]=rk5[18];
r10[i+14]=rk7[18];
}
}
/* Fragment of the AES-256 round-key routine (stride 8 for the 32-byte
   key); the surrounding lines were lost in extraction. */
rk2[2] = rot[0];
rk2[10] = rot[1];
rk2[18] = rot[2];
rk2[26] = rot[3];
/* Fragment of the AES-256 keyexp; its opening calls were lost in
   extraction. */
nextrk(rk4,rk5,rcon5);
nextrk(rk5,rk6,rcon6);
nextrk(rk6,rk7,rcon7);
for(i=0;i<4;i++) {
inr[i]=b[i];
inr[i+4]=b[i+8];
inr[i+8]=b[i+16];
inr[i+12]=b[i+24];
r2[i]=rk1[i];
r2[i+4]=rk1[i+8];
r2[i+8]=rk1[i+16];
r2[i+12]=rk1[i+24];
r4[i]=rk2[i];
r4[i+4]=rk2[i+8];
r4[i+8]=rk2[i+16];
r4[i+12]=rk2[i+24];
r6[i]=rk3[i];
r6[i+4]=rk3[i+8];
r6[i+8]=rk3[i+16];
r6[i+12]=rk3[i+24];
r8[i]=rk4[i];
r8[i+4]=rk4[i+8];
r8[i+8]=rk4[i+16];
r8[i+12]=rk4[i+24];
r10[i]=rk5[i];
r10[i+4]=rk5[i+8];
r10[i+8]=rk5[i+16];
r10[i+12]=rk5[i+24];
r12[i]=rk6[i];
r12[i+4]=rk6[i+8];
r12[i+8]=rk6[i+16];
r12[i+12]=rk6[i+24];
r14[i]=rk7[i];
r14[i+4]=rk7[i+8];
r14[i+8]=rk7[i+16];
r14[i+12]=rk7[i+24];
r3[i]=rk1[i+4];
r3[i+4]=rk1[i+12];
r3[i+8]=rk1[i+20];
r3[i+12]=rk1[i+28];
r5[i]=rk2[i+4];
r5[i+4]=rk2[i+12];
r5[i+8]=rk2[i+20];
r5[i+12]=rk2[i+28];
r7[i]=rk3[i+4];
r7[i+4]=rk3[i+12];
r7[i+8]=rk3[i+20];
r7[i+12]=rk3[i+28];
r1[i]=b[i+4];
r1[i+4]=b[i+12];
r1[i+8]=b[i+20];
r1[i+12]=b[i+28];
r9[i]=rk4[i+4];
r9[i+4]=rk4[i+12];
r9[i+8]=rk4[i+20];
r9[i+12]=rk4[i+28];
r11[i]=rk5[i+4];
r11[i+4]=rk5[i+12];
r11[i+8]=rk5[i+20];
r11[i+12]=rk5[i+28];
r13[i]=rk6[i+4];
r13[i+4]=rk6[i+12];
r13[i+8]=rk6[i+20];
r13[i+12]=rk6[i+28];
This section shows the CUDA program for AES-128 on the GPU at granularity level 1, such
that each thread takes 16 bytes of input data and the 16-byte cipher key. The function
definitions are the same as implemented on the CPU.
AES-192 and AES-256 are implemented similarly for key sizes of 24 bytes and 32 bytes
respectively; the key expansion, as mentioned in the previous section, differs between the
three versions. The three versions of the AES algorithm are implemented using granularity 1,
granularity 2 (each thread takes 32 bytes of data), granularity 10 (each thread takes 160 bytes
of data) and granularity 100 (each thread takes 1600 bytes of data). For every granularity
level, the experiment is repeated with different grid dimensions as described in Chapter 3.
#define N 32000
int key[16];
int rk1[16]; int rk2[16]; int rk3[16]; int rk4[16]; int rk5[16]; int rk6[16];
int rk7[16]; int rk8[16]; int rk9[16]; int rk10[16];
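/* The kernel declaration and per-thread indexing did not survive
   extraction. A minimal sketch of the presumed granularity-1 wrapper
   around the round sequence below (16 bytes per thread; rk1..rk10 are
   assumed to be device-resident round keys derived from b): */
__global__ void encryption(int *a, int *b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int data[16];
    int i;
    for (i = 0; i < 16; i++)   /* load this thread's 16-byte block */
        data[i] = a[tid * 16 + i];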
subbyte(data); //Round 1
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk1);
subbyte(data); //Round 2
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk2);
subbyte(data); //Round 3
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk3);
subbyte(data); //Round 4
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk4);
subbyte(data); //Round 5
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk5);
subbyte(data); //Round 6
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk6);
subbyte(data); //Round 7
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk7);
subbyte(data); //Round 8
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk8);
subbyte(data); //Round 9
shiftrow(data);
mixcolumn(data);
addroundkey(data,rk9);
subbyte(data); //Round 10
shiftrow(data);
addroundkey(data,rk10);
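    for (i = 0; i < 16; i++)   /* write the encrypted block back */
        a[tid * 16 + i] = data[i];
}  /* end of the sketched kernel wrapper */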
int main() {
int i;
int a[N],k;
for(i=0;i<N;i++){
a[i]=i;
}
int *d_a,*d_b;
cudaMalloc((void**)&d_a,N*sizeof(int));
cudaMalloc((void**)&d_b,16*sizeof(int));
double total_time;
clock_t start, end;
start = clock();
for(k=0;k<n;k++){
cudaMemcpy(d_a,&a,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_b,&b,16*sizeof(int),cudaMemcpyHostToDevice);
encryption<<<B,T>>>(d_a,d_b);
cudaMemcpy(&a,d_a,N*sizeof(int),cudaMemcpyDeviceToHost);
} // k ends here
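/* The tail of main() was lost in extraction; the presumed ending, by
   analogy with the single-threaded main() above: */
end = clock();
total_time = ((double)(end - start)) / CLOCKS_PER_SEC;
printf("\n Time taken for aes128 execution in cuda is: %f\n", total_time);
cudaFree(d_a);
cudaFree(d_b);
return 0;
}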
The function definitions remain the same for this implementation as in the single-threaded C
program, for all AES versions.
This section shows the main program of multi-threaded C using POSIX THREADS for a data
size of 32000 bytes. The 2000 chunks of data are divided such that, among the 12 virtual
threads utilized, 10 threads handle 167 chunks each and 2 threads handle 165 chunks each.
For data of size 32000*5 bytes, the data is divided such that, among the 12 virtual threads
utilized, 10 threads handle 833 chunks each and 2 threads handle 835 chunks each.
#include <pthread.h>
#define state_ele 32000
#define num_threads 12
#define state_ele_s1 26720
#define state_ele_s2 5280
int state_main[state_ele],c[16];
int *encryption(int *); /* returns a pointer to the encrypted block */
void init() {
int i;
for(i=0;i<state_ele;i++)
{
state_main[i]=i;
}
return;
}
void *threadfunc(void *t) {
int tmp,startpt, endpt;
int thid, tmpstate[16];
int i,j,*pmain,k;
int *tp,*cp;
thid=(int)t;
if(thid<10)
{
/* first 10 threads: 26720/10 = 2672 elements = 167 blocks each */
tmp=state_ele_s1/(num_threads-2);
startpt=thid*tmp;
endpt=startpt+tmp;
}
else
{
/* last 2 threads: 5280/2 = 2640 elements = 165 blocks each */
tmp=(state_ele_s2/(num_threads-10));
startpt=state_ele_s1 + (thid % 10)*tmp;
endpt=startpt+tmp;
}
for(i=startpt;i<endpt;i=i+16)
{
for(j=0;j<16;j++)
tmpstate[j]=state_main[i+j];
tp=&tmpstate[0];
pmain=encryption(tp);
for(k=0;k<16;k++)
c[k]=*(pmain+k);
}
pthread_exit((void*) t);
}
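The main function of the Pthreads program did not survive extraction. A minimal sketch, assuming the conventional create/join pattern around the init() and threadfunc() routines defined above:

int main(void) {
    pthread_t threads[num_threads];
    long t;
    init();                               /* fill state_main with test data */
    for (t = 0; t < num_threads; t++)
        pthread_create(&threads[t], NULL, threadfunc, (void *)t);
    for (t = 0; t < num_threads; t++)     /* wait for all workers */
        pthread_join(threads[t], NULL);
    return 0;
}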
The function definitions remain the same as in the CUDA program without streams. This
section shows the main program for CUDA using CUDA STREAMS. The number of streams
utilized is 2, such that each stream processes 32000 bytes of data.
int main() {
int i;
int h_a0[32000],h_a1[32000],k;
cudaStream_t stream0,stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
int *d_a0,*d_b;
int *d_a1;
for(i=0;i<32000;i++){
h_a0[i]=i;
}
for(i=0;i<32000;i++){
h_a1[i]=i+16000;
}
cudaMalloc((void**)&d_a0,32000*sizeof(int));
cudaMalloc((void**)&d_a1,32000*sizeof(int));
cudaMalloc((void**)&d_b,16*sizeof(int));
double total_time;
clock_t start, end;
start = clock();
for(k=0;k<10;k++){
cudaMemcpy(d_b,&b,16*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpyAsync(d_a0,&h_a0,32000*sizeof(int),cudaMemcpyHostToDevice,stream0);
cudaMemcpyAsync(d_a1,&h_a1,32000*sizeof(int),cudaMemcpyHostToDevice,stream1);
encryption<<<B,T,0,stream0>>>(d_a0,d_b);
encryption<<<B,T,0,stream1>>>(d_a1,d_b);
cudaMemcpyAsync(&h_a0,d_a0,32000*sizeof(int),cudaMemcpyDeviceToHost,stream0);
cudaMemcpyAsync(&h_a1,d_a1,32000*sizeof(int),cudaMemcpyDeviceToHost,stream1);
}
end = clock(); // time count stops (note: no explicit stream synchronization precedes this call)
total_time = ((double)(end - start)) / CLOCKS_PER_SEC; // calculate total time
printf("\n\nTime taken for aes128 encryption in cudastreams is: %f\n", total_time);
}
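One caveat worth noting: cudaMemcpyAsync overlaps transfers with kernel execution only when the host buffers are page-locked; with the stack arrays above, the copies degrade to effectively synchronous behaviour. A sketch of the pinned-memory variant (illustrative, not the thesis code):

int *h_a0, *h_a1;
cudaMallocHost((void**)&h_a0, 32000 * sizeof(int));   /* page-locked host buffers */
cudaMallocHost((void**)&h_a1, 32000 * sizeof(int));
/* ... same stream creation, async copies and kernel launches as above ... */
cudaFreeHost(h_a0);
cudaFreeHost(h_a1);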