ToSC2025_1_09
Abstract. Key lengths in symmetric cryptography are determined with respect to brute force attacks with current technology. While nowadays at least 128-bit keys are recommended, there are many standards and real-world applications that use shorter keys. In order to estimate the actual threat posed by using those short keys, precise estimates for attacks are crucial.
In this work we provide optimized implementations of several widely used algorithms on GPUs, leading to interesting insights on the cost of brute force attacks on several real-world applications.
In particular, we optimize KASUMI (used in GPRS/GSM), SPECK (used in RFID communication), and TEA3 (used in TETRA). Our best optimizations allow us to try 2^35.72, 2^36.72, and 2^34.71 keys per second on a single RTX 4090 GPU. These results improve upon previous results significantly; e.g., our KASUMI implementation is more than 15 times faster than the optimizations given in the CRYPTO'24 paper [ACC+24], improving the main results of that paper by the same factor.
With these optimizations, in order to break GPRS/GSM, RFID, and TETRA communications in a year, one needs around 11, 22 billion, and 1.36 million RTX 4090 GPUs, respectively.
For KASUMI, the time-memory trade-off attacks of [ACC+24] can be performed with 142 RTX 4090 GPUs instead of 2400 RTX 3090 GPUs or, when the same number of GPUs is used, their table creation time can be reduced from 348 days to 20.6 days, crucial improvements for real-world cryptanalytic tasks.
Keywords: KASUMI · A5/3 · SPECK · TEA3 · cryptanalysis · GPU
1 Introduction
Cryptographic algorithms with short keys are susceptible to generic attacks like exhaustive key search and time-memory trade-off (TMTO) attacks, in which an initial exhaustive-key-search-like computation is performed to create tables that allow future exhaustive key searches to be performed in a short amount of time. For this reason, NIST recommends [FR24] at least 112-bit keys for symmetric key encryption algorithms, and keys that will be used after 2030 should be at least 128 bits. In this respect, NIST removed SKIPJACK and 3DES, which provided 80-bit and 112-bit security respectively, from its standards.
Although NIST took measures to remove algorithms with short keys from its standards, there are still many ISO/IEC standards with short keys. For instance, the ISO/IEC 29192-3 lightweight stream cipher standard [ISO12] contains two stream ciphers: Enocoro and Trivium. Enocoro supports both 80- and 128-bit keys, whereas Trivium works only with 80-bit keys. Similarly, PRESENT, standardized in the ISO/IEC 29192-2 lightweight block cipher standard [ISO19], supports both 80- and 128-bit keys. Finally, the NSA-designed block cipher SPECK supports short keys of 64, 72, and 96 bits; its 96-bit version became an ISO/IEC 29167-22 [ISO18] RFID air interface standard in 2018. Note that although SPECK also supports longer keys, having shorter keys in a standard makes them preferable for speed and low hardware footprint.
Another example of a short key is the A5/1 stream cipher used in 2G GSM communications. A5/1 uses a 64-bit key, and it is very easy to eavesdrop on GSM communications via exhaustive search or TMTO attacks [BBK03]. Due to these attacks, the GSM Association adopted the KASUMI block cipher as A5/3 for 3G and GPRS. KASUMI supports a 128-bit secret key, but for backward compatibility A5/3 and GPRS use the concatenation of the same 64-bit key twice as a 128-bit key, resulting in 64-bit security. Recently, practical TMTO attacks for these two protocols were provided in [ACC+24], which require hundreds of GPUs to run for hundreds of days to create TMTO tables.
Sometimes algorithms that use short keys do not receive academic cryptanalysis due to security-by-obscurity practices. For instance, some cryptographic algorithms, like the ones used in the European standard TETRA, which is globally used by military, police, emergency services, prisons, and government agencies, were kept secret for decades, contrary to Kerckhoffs's principle. However, it was recently shown in [MBW23] that it is possible to reverse engineer the cryptographic algorithms used in this standard. It was shown that 80-bit secret keys are used in the four keystream generators of TETRA and that some of these algorithms were deliberately weakened to provide only 32-bit security.
Moreover, many academic papers still propose new algorithms that use short keys like
64 or 80 bits. Thus, it is important to be able to measure how easy or hard it is to perform
generic attacks on ciphers with short keys using general purpose computing devices.
Our Contribution
In this work, we optimized the KASUMI, SPECK, and TEA3 ciphers on GPUs to perform exhaustive key search attacks. Our best optimizations allowed us to try 2^35.72, 2^36.72, and 2^34.71 keys per second on a single RTX 4090 GPU for KASUMI, SPECK-96-26, and TEA3, respectively. With these optimizations, in order to break GPRS/GSM, RFID, and TETRA communications in a year, one needs around 11, 22 billion, and 1.36 million RTX 4090 GPUs, respectively. Moreover, our KASUMI optimizations are 16.89x faster than the optimizations of the recent CRYPTO'24 paper [ACC+24]. This result directly improves the main result of that paper by basically the same factor. More precisely, the time-memory trade-off attacks of [ACC+24] can be performed with 142 GPUs instead of 2400 GPUs or, when the same number of GPUs is used, their table creation time can be reduced from 348 days to 20.6 days.
Methodologically, those speed-ups are the result of many different GPU-specific considerations. Although GPUs have thousands of cores, they are not as powerful as CPU cores. Moreover, the GPU architecture imposes many limitations, making it a challenge to fully occupy the GPU in an implementation. In order to obtain the best performance and fully occupy the GPU, in our KASUMI, SPECK, and TEA3 implementations we explored straightforward, table-based, and bitsliced implementation techniques. GPUs have different memory types, namely registers, shared memory, and global memory; registers are the fastest and global memory is the slowest for reading and writing. In our optimizations we tried to minimize global memory usage, remove shared memory bank conflicts, and minimize the number of used registers. GPU kernel launches take two parameters: the number
of blocks and the number of threads in each block. To be able to perform as many encryptions as possible, we aimed to launch as many threads as possible, and to be able to run a large number of threads in a kernel block, we tried to keep the number of used registers to a minimum.
Cihangir Tezcan and Gregor Leander 311
The number of used registers and some other properties actually depend on the CUDA SDK used and the target compute capability, which represents a set of software and hardware features. Thus, we used different CUDA SDKs and compiled our codes for every compute capability that these SDKs allow in order to get the best performance.
Although our KASUMI optimizations are 16.89x faster than the optimizations of the recent CRYPTO'24 paper [ACC+24], we do not know the main bottleneck causing this difference because the source codes of that work are not publicly available. Moreover, we are not aware of any GPU optimizations of the SPECK and TEA3 ciphers. Thus, to the best of our knowledge, we are the first to provide GPU optimizations for these ciphers.
We made our CUDA codes for KASUMI, SPECK, and TEA3 publicly available in order
for the academic community to verify our results, better analyze these ciphers, verify
theoretically obtained results, discover new properties, and compare future optimizations.
2 Preliminaries
Graphics processing units (GPUs) use single instruction, multiple thread (SIMT) parallelization and provide superior speed compared to CPUs when the running algorithm is parallelizable. Modern GPUs have thousands of cores, and since an exhaustive key search is embarrassingly parallel, each GPU core can try a different key without the need to communicate with other cores. However, GPU cores are not as powerful as CPU cores, and due to architectural limitations, many optimizations need to be done to fully occupy the GPU.
In order to obtain the best optimizations, one needs to know the specifics of the used GPU. Some of the specifications of GPUs are provided by the manufacturers, but differences between GPUs for some properties, like the delays caused by shared memory bank conflicts, can only be observed by performing experiments. Moreover, architectural changes between different generations of GPUs can significantly affect the performance of an implementation.
NVIDIA GPUs are categorized with respect to their compute capabilities (CC), which represent a set of software and hardware features. It should be noted that a CUDA device is backwards compatible; for example, an RTX 4090 GPU has a compute capability of 8.9, but it can run any code that is compiled for a lower compute capability. Codes compiled for different compute capabilities generally require different numbers of registers per thread, and sometimes compiling for a lower compute capability provides better results. Currently, the latest CUDA SDK is version 12.6; the 12.x versions support compute capability 5.0 and higher, while the previous 11.x versions support compute capability 3.5 and higher. In our benchmarks, we used many CUDA SDKs and compiled our codes for every compute capability that the SDK and the tested GPU support to obtain the best results.
In this work, we used many different desktop and mobile GPUs with different architectures to show that our optimizations are valid across GPUs and architectures and are not targeted at a specific GPU. The specifications of the GPUs that we used in this work are provided in Table 1.
Straightforward, table-based, and bitsliced implementations are the common strategies for implementing symmetric key encryption algorithms. However, depending on the cipher design and GPU limitations, one technique may be superior to the others. Thus, we tried all three strategies in our implementations.
Modern GPUs run many threads in blocks, and threads are grouped in warps that consist of 32 threads. On GPUs, data can be stored in registers, shared memory, or global memory. Global memory is large but slow. Hence, we get the best optimizations when we can store
everything in the fast registers. However, on modern GPUs we have at most 64K 32-bit registers per block, and a block can have at most 1024 threads.
312 GPU Assisted Brute Force Cryptanalysis of GPRS, GSM, RFID, and TETRA
Table 1: The specifications of the GPUs used in this work. Clock rates are listed according to the maximum boost clock rates of the GPUs and may differ depending on the manufacturer and the model. The table is sorted with respect to GPU compute capability (CC), which represents a set of software and hardware features.
Generally, if there are
no bottlenecks, we can best occupy the GPU when we use 1024 threads, which means we can spend at most 64 registers per thread. Thus, it is generally not possible to use a bitsliced implementation of a symmetric encryption algorithm and still use blocks of 1024 threads, because the implementation would require more than 64 registers to store the internal values of the cipher.
Similarly, it is not possible to store large S-boxes in registers, and since global memory is slow, keeping these kinds of tables in shared memory provides the best speed. However, a warp can use 32 data lanes to reach the shared memory, and if two threads try to use the same data lane, this causes a shared memory bank conflict and one thread needs to wait for the other. One way to avoid this problem is to store 32 copies of the S-box so that every thread in a warp can use its own data lane. This approach was used in [Tez21] for the table-based implementation of AES, and the best performance of AES on GPUs was obtained via this technique.
There are many academic papers on GPU optimizations of symmetric encryption algorithms to obtain fast encryption. For instance, AES was optimized on GPUs many times in the past using the best commercially available GPUs of the time (e.g., [NAI17], [AS20b], [AFDM17], and [AS20a]). However, none of the source codes of those implementations were made publicly available, and it is not possible to make a fair comparison when two implementations are benchmarked on different GPUs. For this reason, we made our codes publicly available, and in Table 2 we compare our results with the best block cipher optimizations on GPUs that have publicly available source codes. It shows that our KASUMI and SPECK implementations provide more key trials per second than AES, DES, KLEIN, and PRESENT.
Table 2: Exhaustive key search attack performance on GPUs for various symmetric key
encryption optimizations.
Cipher Key Block Rounds RTX 2070 Super RTX 4090 Reference
PRESENT-80 80 64 31 2^29.73 keys/s 2^32.90 keys/s [Tez22]
DES/3DES 56/168 64 16 2^30.78 keys/s 2^33.94 keys/s [Tez22]
AES-128 128 128 10 2^32.43 keys/s 2^34.64 keys/s [Tez21]
TEA3 80 - - 2^32.54 keys/s 2^34.71 keys/s This paper
KLEIN-64 64 64 12 2^33.19 keys/s 2^35.40 keys/s [Tez24]
KASUMI 128 64 8 2^32.72 keys/s 2^35.72 keys/s This paper
SPECK-96-26 96 64 26 2^34.49 keys/s 2^36.72 keys/s This paper
Finally, we observed that all of the optimizations we performed in this work provided similar speed-ups for every GPU listed in Table 1. Thus, our optimizations are not targeted at a specific GPU architecture or model.
Table 3: Performance normalization of GPUs with respect to their number of cores and
boost clock speeds. Note that these numbers might be higher for some models depending
on the manufacturer.
According to Table 3, we would expect the RTX 4090 to be 9.22 ≈ 2^3.20 times faster than the RTX 2070 Super. Comparing with Table 2, we see that this estimate fits well in some cases, while it overestimates the difference in others. Since our normalization only focuses on processing speed and overlooks delays that may be introduced by memory read-and-write operations or architectural differences, it may be seen as a theoretical peak performance difference between GPUs.
2.1 KASUMI
KASUMI was designed by ETSI SAGE [3GP] and is a modified version of the block cipher MISTY1 [Mat97]. It is a Feistel block cipher with 8 rounds and a block size of 64 bits. The key schedule of KASUMI simply consists of rotations on 16-bit values and XORs with constants. The round function of KASUMI contains the FO and FL functions, where the FL function consists of AND, OR, and rotation operations, and the FO function is itself a 3-round Feistel structure that contains an FI function in each of its rounds. Moreover, the FI function is a four-round Feistel structure which uses two S-boxes of sizes 7 × 7 and 9 × 9 consecutively in each round. The KASUMI block cipher is pictured in Figure 1, where the || symbol represents the OR operation.
Although KASUMI supports 128-bit keys, A5/3 and GEA-3 both use a 64-bit session key Kc, expanded to the 128-bit KASUMI key Kc||Kc, for GSM and GPRS in order to have backward compatibility. We refer to this version as KASUMI-64 in this paper.
2.2 SPECK
SPECK [BSS+13] is a family of add-rotate-xor (ARX) lightweight block ciphers designed in 2013 by the National Security Agency (NSA) of the United States. SPECK supports seven key sizes: 64, 72, 96, 128, 144, 192, and 256 bits. The block size and the number of rounds depend on
the key size, and the possible variants are provided in Table 4. One round of SPECK is shown in Figure 2, and the key schedule of SPECK also uses this round function.
[Figure 1: The KASUMI block cipher, its FL and FO functions, and the FI function built from the S-boxes S9 (with zero-extension) and S7 (with truncation).]
In Figure 2, for a block word size of 16 bits, α = 7 and β = 2; for every other block word size, α = 8 and β = 3.
We denote k-bit-keyed SPECK with r rounds as SPECK-k-r. In this work we optimized SPECK-64-22, SPECK-72-22, SPECK-96-26, and SPECK-128-32 using the CUDA programming language, and our optimizations can easily be modified for the other variants of SPECK. SPECK became an ISO/IEC RFID air interface standard in 2018 (ISO/IEC 29167-22:2018); this standard was reviewed and confirmed in 2024. Note that the ISO standard [ISO18] contains the SPECK-96-26, SPECK-96-29, SPECK-128-32, and SPECK-256-34 versions, and without a proper warning of the security implications of using a short key, users may prefer short keys for performance and low hardware footprint.
[Figure 2: One round of SPECK.]
Table 4: Variants of SPECK: block sizes in bits, key sizes in bits, and numbers of rounds.
2.3 TEA3
[Figure: The TEA3 keystream generator, showing a 10-byte key register, an 8-byte state register, the S-box SB3, and the functions F31, R3, and F32.]
large tables in the shared memory might reduce the occupancy of the GPU or might not even be possible due to the device limits.
3. Bitsliced: Programming languages and processors generally work on words that are multiples of 8 bits, and bit operations can become more costly in software than in hardware. Thus, hardware-oriented ciphers might use bit-level operations intensively, so that a straightforward implementation becomes taxing in software. Instead of working on the bits of large words, we can store each bit in a different register so that we can avoid operations that access and use individual bits of a word. This bitslicing technique can be implemented in two different ways on a GPU:
1. We stored the two S-boxes S7 and S9 of KASUMI in the shared memory. Keeping the S-boxes in the shared memory causes bank conflicts when different threads in a warp try to access different addresses in the same memory bank. When a shared memory bank conflict occurs, one thread has to wait for the other. The amount of delay this causes depends on the model of the GPU and the size of the requested data. The shared memory bank conflicts for S7 can be avoided by using the technique of [Tez21], where the S-box is duplicated 32 times in a special order so that every thread in a warp uses its own data lane. Although the S-boxes are of sizes 7 and 9 bits, in software implementations we use one and two bytes per entry, respectively. Thus, duplicating the S-box S7 32 times requires 4 KB. Although modern GPUs have 64 KB of shared memory, using 4 KB of shared memory reduced our occupancy of the GPU, and this reduced occupancy causes a slowdown that is larger than the speed gain coming from removing the shared memory bank conflicts. Moreover, unlike the 32-bit values stored in the shared memory in [Tez21], we store 8-bit values in the shared memory for S7. This provides faster shared memory reads, and the delays caused by shared memory bank conflicts are shorter compared to the case of [Tez21].
Since we need to spend two bytes per entry for S9, duplicating the S-box S9 32 times requires 32 KB of shared memory. Although modern GPUs have this much shared memory, such an implementation reduces the occupancy of the GPU, and the speed-up coming from the removed shared memory bank conflicts cannot compensate for this slowdown. Thus, we kept a single copy of S7 and S9 in the shared memory in our best optimizations.
When a GPU kernel is called with n blocks of t threads, each thread computes a global index threadIndex in [0, n × t − 1]. In the following code, the first 128 threads of each block copy the S-box S7 and the first 512 threads of each block copy the S-box S9 from the global memory to the shared memory:
uint32_t threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
if (threadIdx.x < 512) {
    if (threadIdx.x < 128) S7S[threadIdx.x] = S7G[threadIdx.x];
    S9S[threadIdx.x] = S9G[threadIdx.x];
}
__syncthreads();
The __syncthreads(); command in the last line above synchronizes the threads in the block to prevent threads from trying to read before the values are written to the shared memory.
2. The round constants of KASUMI can be kept in the shared memory, in registers, or they can be provided as literals inside the code. We observed that the best performance is obtained when independent registers are used for the round constants. In this case, we do not pass the constants to the GPU kernel. Instead, they are assigned to registers at the beginning of the kernel as follows:
uint16_t c1 = 0x0123, c2 = 0x4567, c3 = 0x89AB, c4 = 0xCDEF, c5 = 0xFEDC, c6 = 0xBA98, c7 = 0x7654, c8 = 0x3210;
3. Our implementations of KASUMI for TMTO table creation and for performing exhaustive key search differ, because in an exhaustive search we rarely perform the encryption of the last round of a Feistel cipher: for KASUMI, the left 32 bits of the seventh-round output are the right 32 bits of the ciphertext. Thus, the last round of the encryption is only performed when the target right 32 bits of the ciphertext are observed after the seventh round, which happens with probability 2^-32. Consequently, the last round is performed only 2^32 times out of 2^64 key trials.
4. In modern GPUs, the best occupancy is obtained when blocks consist of 1024 threads, provided there is no other significant bottleneck. However, this can only be achieved when the kernel uses at most 64 registers per thread, because a block can have at most 64K registers in modern GPUs.
When the left and right halves of the attacked plaintext are denoted by plaintextl and plaintextr, the corresponding ciphertext halves by ciphertextl and ciphertextr, and the S-boxes by the arrays S7d[ ] and S9d[ ], the GPU kernel for our exhaustive key search attack is called with 2048 blocks of 1024 threads as follows:
KASUMI64ExhaustiveConstantsRegister<<<2048, 1024>>>(plaintextl, plaintextr, ciphertextl, ciphertextr, S7d, S9d);
When compiled with many different versions of the CUDA SDK and compute capabilities between 5.0 and 8.9, our TMTO codes always require fewer than 64 registers, while our exhaustive search codes that conditionally perform the last round always require more than 64 registers. Although the codes are very similar, the CUDA compiler finds additional optimizations for our exhaustive key search code when we conditionally perform the last round encryption, but these optimizations increase the register count. This prevents us from calling the kernel blocks with 1024 threads, and when we use 512 threads per block, conditionally performing the last round provides a negligible speed-up compared to the version that performs all eight rounds with 1024 threads.
Since the increase in the number of required registers was due to the compiler's optimizations, we forced the CUDA SDK to use at most 64 registers when compiling our codes by using the option -maxrregcount=64. This way, we achieved a 10% speed-up compared to the 8-round encryption of the TMTO table creation codes. When compiling our KASUMI implementations, not limiting the register count to 64 at compile time would result in a "too many resources requested for launch" error at run time.
Our KASUMI benchmark results are provided in Table 5; it can be seen that they are valid for many different GPUs and are not optimized for a specific GPU architecture or model.
Table 5: Number of KASUMI encryptions per second when performing TMTO table creation
and exhaustive key search on various GPUs.
We are not aware of any publicly available GPU optimizations of KASUMI, so we were not able to benchmark other optimizations on our GPUs and compare them with our results. However, it is reported in [ACC+24] that it is possible to perform 2^43.32 KASUMI encryptions in 61 minutes on a single RTX 3090 GPU. This means that they achieve around 2^31.48 KASUMI encryptions per second. Note that we achieved a similar performance with a GTX 970 GPU, as shown in Table 5, and the RTX 3090 is roughly 8.5 times faster than the GTX 970 on average due to their differences in core numbers and clock speeds, as shown in Table 3.
In [JRW11], three meet-in-the-middle key recovery attacks on full KASUMI-64 were provided. Their initial attack requires a single known plaintext/ciphertext pair and 2^63 encryptions. The time complexity is reduced to 2^62.75 encryptions when the attacker captures 1152 chosen plaintext/ciphertext pairs, and finally the time complexity becomes 2^62.63 encryptions when the data complexity is 2^20 chosen plaintexts. With our GPU optimizations, the attack that requires 2^62.63 encryptions can be performed in 4.47 years on a single RTX 4090.
We made our CUDA optimizations of KASUMI for creating TMTO tables and performing exhaustive key search publicly available^1 so that future optimizations can be easily compared with ours.
1 Our optimized KASUMI CUDA codes are publicly available at GitHub so that they can be used to verify our results.
Table 6: Number of SPECK encryptions per second when performing exhaustive key search
on various key sizes on various GPUs.
In our implementations of the various SPECK variants, we made the following optimizations and observations:
1. We observed that compiling our SPECK optimizations for compute capability 5.2 provides around 2^0.06 times better performance than compiling for compute capability 8.9. For example, trying 2^43 keys for SPECK-96-26 on an RTX 4090 takes 80.63 seconds and 77.80 seconds when compiled for compute capabilities 8.9 and 5.2, respectively.
2. Current CUDA SDKs do not support compute capabilities less than 5.0, but older CUDA SDKs allow us to use deprecated compute capabilities. Thus, we used CUDA SDK 11.1 and compiled our implementations for compute capability 3.5. However, using compute capability 3.5 did not provide any observable speed-up compared to 5.2.
3. Although SPECK-64-22 and SPECK-72-22 have smaller numbers of rounds than SPECK-96-26, we obtained better speeds for SPECK-96-26. We observed that this is because the key and the data are stored in 32-bit registers in SPECK-96-26 due to its block size, and on CUDA devices the rotation operation on 32-bit values is faster than on 16- or 24-bit values. Thus, the 64- and 72-bit versions might be broken faster in the future if new GPU instructions provide speed-ups for rotations on values smaller than 32 bits. We defined the rotation operations as macros in a straightforward and common way as follows:
#define ROTL16(x, r) (((x) << (r)) | ((x) >> (16 - (r))))
#define ROTR16(x, r) (((x) >> (r)) | ((x) << (16 - (r))))
#define ROTL24(x, r) (((x) << (r)) | ((x) >> (24 - (r))))
#define ROTR24(x, r) (((x) >> (r)) | ((x) << (24 - (r))))
#define ROTL32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))
#define ROTR32(x, r) (((x) >> (r)) | ((x) << (32 - (r))))
Note that using 32-bit unsigned integers for storing 16- or 24-bit values requires an additional AND operation after the rotation, with 0xFFFF or 0xFFFFFF, respectively.
4. In software implementations, it is common practice to use a for loop to perform the r rounds of encryption inside the loop. The #pragma unroll directive unrolls these loops; in the case of SPECK, this provided a marginal speed-up in our implementations.
5. Our optimizations use small numbers of registers so that we can call the GPU kernels with 1024 threads per block as follows to try 2^(20+trial) keys:
speck64_exhaustive<<<1024, 1024>>>(ct_d, pt_d, K_d, trial);
speck72_exhaustive<<<1024, 1024>>>(ct_d, pt_d, K_d, trial);
speck96_exhaustive<<<1024, 1024>>>(ct_d, pt_d, K_d, trial);
speck128_exhaustive<<<1024, 1024>>>(ct_d, pt_d, K_d, trial);
Since a year has around 2^24.91 seconds, one needs around 8 RTX 4090 GPUs to break SPECK-64-22 in a year. In order to break SPECK-72-22 in a year, one needs around 1575 RTX 4090 GPUs, and to break SPECK-96-26 in a year, one needs around 22 billion RTX 4090 GPUs. Note that SPECK-96-26 is included in the ISO/IEC 29167-22 [ISO18] RFID air interface standard. Although 22 billion GPUs are a lot, this number is going to shrink when new generations of GPUs like NVIDIA's 5000 series are announced and produced in 2025. According to our estimates in Table 3, we expect one would need around 17.5 billion RTX 5090 GPUs to break SPECK-96-26 in a year. Those numbers by far exceed today's practical capabilities. However, they show that devices built today with SPECK-96-26 may not be secure around 2050. Moreover, GPUs are general-purpose computing devices, and our results also show that, if built, dedicated devices could break SPECK-96-26 faster than GPUs while consuming significantly less energy.
We made our CUDA optimizations of SPECK for performing exhaustive key search publicly available^2 so that future optimizations can be easily compared with ours.
unsigned m = 0x0000FFFF;
#pragma unroll
for (int l = 16; l != 0; l = l >> 1, m = m ^ (m << l)) {
    #pragma unroll
    for (k = 0; k < 32; k = (k + l + 1) & ~l) {
        t = (bSboxOut[k] ^ (bSboxOut[k + l] >> l)) & m;
        bSboxOut[k] = bSboxOut[k] ^ t;
        bSboxOut[k + l] = bSboxOut[k + l] ^ (t << l);
    }
}
Note that any improvement to the above bit-level matrix transpose operation would
increase the performance of our TEA3 implementation.
Note that limiting the maximum used register count to 64 (or to 128) as we did in our KASUMI implementation causes a performance loss in this case, because the kernel really needs to keep more than 64 (or 128) registers. For instance, bitslicing the 80-bit state register already requires 80 registers per thread, exceeding the 64-register limit. If we force the CUDA SDK to use only 64 registers by using the option -maxrregcount=64 as we did when implementing KASUMI, the required extra registers spill into global memory. Reading and writing these values to and from global memory or cache introduces delays that significantly slow down our implementation.
Although decreasing the thread count in a block to 512 or 256 decreases GPU occupancy and therefore performance, the number of parallel encryptions in the bitsliced implementation still provides better results. We obtained the best results when we used 32-bit registers, achieving 2^34.71 key trials per second on an RTX 4090. This is around 160 times faster than our straightforward implementation.
Our TEA3 exhaustive key search results are provided in Table 7; it can be seen that they are not optimized for a specific GPU architecture or model.
Table 7: Number of TEA3 encryptions per second when performing exhaustive key search
on various GPUs.
Our best optimizations show that an 80-bit key search for TEA3 would require 1.36 million RTX 4090 GPUs to break it in a year, and according to our estimates provided in Table 3, we expect 1.08 million RTX 5090 GPUs could break it in a year. Again, while clearly not practical today, this is likely to become practical in the future, and the cipher should therefore not be used in real-life scenarios. The required time and number of GPUs will be significantly reduced with every new generation of GPUs. Moreover, given our results showing that GPUs can theoretically break TEA3 and the fact that this cipher is used by military, police, and government agencies, one might invest in building ASICs to break TEA3 in a short time.
Our optimizations can also be used for performing TMTO attacks on TEA3, but note
that table creation for TMTO can be at most 6 times slower than the speeds reported
in Table 7. This is because we apply an early-abort technique when performing the
exhaustive key search: if the first byte of the keystream does not match the desired
output, there is no need to produce further keystream bytes. Since we try 32 different
keys in parallel in our bitsliced implementation, the first byte matches with probability
32/256 even when all 32 keys are wrong. Thus, we generate the second keystream byte
with probability 1/8 and only rarely produce the remaining keystream bytes.
Cihangir Tezcan and Gregor Leander 323
But in TMTO table creation, we have to produce the whole keystream in every encryption,
and producing 10 bytes of keystream is enough for a TMTO attack. However, producing
the subsequent keystream bytes is easier, since TEA3 is clocked 51 times when producing
the first keystream byte and only 19 times for each of the remaining bytes.
We made our CUDA optimizations of TEA3 for exhaustive key search publicly
available3 so that future optimizations can easily be compared with ours.
Table 8: Precomputation and attack time for performing TMTO attack on KASUMI-64.
3 Our optimized TEA3 CUDA codes are publicly available at GitHub so that they can be used to verify our results.
1. The passive TMTO attacks of [ACC+ 24] on GPRS assume that the network is
misconfigured, so they affect only a subset of GPRS networks.
2. The TMTO attack of [ACC+ 24] on well-configured GPRS networks requires an active
attack in which the attacker has to inject messages.
3. The generic TMTO attack of [ACC+ 24] against GSM communications, which builds a
TMTO table on a specific IV, requires a known plaintext to be encrypted with that IV.
This happens with 50% probability in a GSM communication of 1 hour and 44 minutes;
thus, the success probability of the attack increases when the attacked communication
lasts longer. However, in many countries the communication is terminated after 1 hour,
which limits the success probability of the attack to 21%.
Our implementations of KASUMI for TMTO table creation and for exhaustive key search
differ, because in an exhaustive search we rarely perform the encryption of the last
round. This is because the left 32 bits of the seventh-round output equal the right
32 bits of the ciphertext. Thus, the last round of the encryption is performed only
when the target right 32 bits of the ciphertext are observed after the seventh round,
which happens with probability 2^-32. Hence, the last round is performed only 2^32
times during 2^64 key trials.
Since our optimizations allow 2^35.72 keys per second on an RTX 4090, it takes 10.35
years for a single RTX 4090 to break KASUMI-64; equivalently, 11 RTX 4090 GPUs are
enough to break it in a year. If we use 2400 GPUs as suggested in [ACC+ 24], but this
time for an exhaustive key search instead of generating TMTO tables, it would take less
than 38 hours to find the key in the worst case. Thus, instead of creating TMTO tables
in 348 days as in [ACC+ 24], or in 20.6 days with our optimized code, the same number
of GPUs can be used for 1.5 days to capture the key via brute force. Moreover, switching
from the TMTO attack to exhaustive key search increases the success probability from
21% to 100%, and the communication no longer needs to last an hour; it can be as short
as a few seconds.
Note that the number of GPU cores, and therefore GPU performance, increases with every
generation while the price and sometimes the energy consumption remain the same. Thus,
our exhaustive key search attacks are going to become more practical in the future.
However, such technological improvements do not benefit a one-time TMTO table creation
that is performed now.
Moreover, the three scenarios of [ACC+ 24] perform the precomputation on GPUs, record
the results on SSDs, and carry out the attack on 128-core servers. These scenarios
require two, five, and ten 128-core servers and 100, 125, and 200 TB of SSDs,
respectively. According to [ACC+ 24], these servers and SSDs would cost around 85 000,
206 250, and 410 000 USD, respectively. In our exhaustive search, we do not need any
CPU power or SSDs to record tables. Thus, one can avoid these extra costs by moving
from TMTO to exhaustive key search.
To summarize, using our optimized code for brute force cryptanalysis of a GPRS
communication instead of the TMTO attack of [ACC+ 24] has the following advantages
when scenario three with a one-hour call is considered:
The only disadvantage of our exhaustive key search compared to the TMTO attack of
[ACC+ 24] is that it requires 38 hours to break a communication instead of 14 minutes.
Note that 38 hours is the worst-case scenario; on average, our brute force attack
should take around 19 hours. Moreover, this duration will shorten with every new
generation of GPUs.
Instead of the exhaustive search, one can also perform the meet-in-the-middle attacks
of [JRW11] on GPUs using our optimized codes. With 1 known plaintext, 1152 chosen
plaintexts, or 2^20 chosen plaintexts, the attacks of [JRW11] require 2^63.03, 2^62.75,
and 2^62.63 encryptions, respectively. Thus, if we use our CUDA codes to perform these
attacks with 2400 RTX 4090 GPUs, the 38 hours required for the exhaustive search
reduce to 19, 16, and 14.7 hours, respectively.
6 Conclusion
Symmetric key encryption algorithms that use short keys appear in standards and many
real-world applications, making them susceptible to exhaustive key search attacks. In
this work we provided the best-known GPU optimizations of the KASUMI, SPECK, and TEA3
ciphers to show that they can be broken by brute force attacks.
GPRS and GSM use the 64-bit key version of KASUMI, and we showed that it can be broken
in a year using just 11 RTX 4090 GPUs; the attack time drops to 38 hours when 2400 GPUs
are used. Our KASUMI implementation is more than 15 times faster than the optimizations
given in the CRYPTO'24 paper [ACC+ 24], improving the main results of that paper by the
same factor. Our optimizations can also be used to perform the meet-in-the-middle
attacks of [JRW11], which have better time complexities than exhaustive search. With
2400 GPUs, the best attack of [JRW11] can be performed in 14.7 hours with our codes.
The SPECK block cipher supports short keys of 64, 72, and 96 bits. Although the 64- and
72-bit versions do not appear in the ISO/IEC standard for the RFID air interface, the
96-bit version does. Our best optimizations show that this version could be broken in a
year only if 22 billion RTX 4090 GPUs were used. Thus, 96-bit SPECK currently cannot be
practically broken via brute force attacks.
The European trunked radio standard TETRA uses proprietary keystream generators which
were kept secret for decades. They were recently reverse engineered, and it was shown
that the best of these keystream generators uses an 80-bit secret key. Our best
optimizations show that it can be broken in a year using 1.36 million RTX 4090 GPUs.
This is an important real-world threat because TETRA is used by military, police,
emergency services, prisons, and government agencies.
All of these attacks will become more practical with every new generation of GPUs
or if dedicated hardware is designed for breaking them. Thus, we strongly recommend
avoiding keys shorter than 128 bits.
326 GPU Assisted Brute Force Cryptanalysis of GPRS, GSM, RFID, and TETRA
Acknowledgments
This work was supported by The Scientific and Technological Research Council of Türkiye
(TÜBITAK) and German Academic Exchange Service (DAAD) Bilateral Research Coop-
eration Project (TÜBİTAK 2531 Project) under the grant number 123N546 and titled
"Cryptanalysis of Symmetric Key Encryption Algorithms: Theory vs. Practice". The
authors thank TÜBİTAK and DAAD for their support.
This work was also supported by the ERC project 101097056 (SYMTRUST) and the enCRYPTON
project. The latter has received funding from the European Union's Horizon Europe
Research and Innovation Programme under grant agreement No. 101079319.
Funded by the European Union. Views and opinions expressed are however those of the
author(s) only and do not necessarily reflect those of the European Union, European
Commission or European Research Executive Agency. Neither the European Union nor
the granting authority can be held responsible for them.
References
[3GP] 3rd Generation Partnership Project, Technical Specification Group Services and
System Aspects, 3G Security, Specification of the 3GPP Confidentiality and Integrity
Algorithms; Document 2: KASUMI Specification, v.3.1.1 (2001).
[ACC+ 24] Gildas Avoine, Xavier Carpent, Tristan Claverie, Christophe Devine, and Diane
Leblanc-Albarel. Time-memory trade-offs sound the death knell for GPRS and
GSM. In Leonid Reyzin and Douglas Stebila, editors, Advances in Cryptology
- CRYPTO 2024 - 44th Annual International Cryptology Conference, Santa
Barbara, CA, USA, August 18-22, 2024, Proceedings, Part IV, volume 14923
of Lecture Notes in Computer Science, pages 206–240. Springer, 2024.
[AS20a] Sang Woo An and Seog Chung Seo. Study on optimizing block ciphers (AES,
CHAM) on graphic processing units. In 2020 IEEE International Conference on
Consumer Electronics - Asia (ICCE-Asia), pages 1–4, 2020.
[AS20b] SangWoo An and Seog Chung Seo. Highly efficient implementation of block
ciphers on graphic processing units for massively large data. Applied Sciences,
10(11), 2020.
[BBK03] Elad Barkan, Eli Biham, and Nathan Keller. Instant ciphertext-only
cryptanalysis of GSM encrypted communication. In Dan Boneh, editor, Advances in
Cryptology - CRYPTO 2003, 23rd Annual International Cryptology Conference,
Santa Barbara, California, USA, August 17-21, 2003, Proceedings, volume
2729 of Lecture Notes in Computer Science, pages 600–616. Springer, 2003.
[BSS+ 13] Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan
Weeks, and Louis Wingers. The SIMON and SPECK families of lightweight
block ciphers. Cryptology ePrint Archive, Paper 2013/404, 2013.
[Mat97] Mitsuru Matsui. New block encryption algorithm MISTY. In Eli Biham, editor,
Fast Software Encryption, 4th International Workshop, FSE ’97, Haifa, Israel,
January 20-22, 1997, Proceedings, volume 1267 of Lecture Notes in Computer
Science, pages 54–68. Springer, 1997.
[MBW23] Carlo Meijer, Wouter Bokslag, and Jos Wetzels. All cops are broadcasting:
TETRA under scrutiny. In 32nd USENIX Security Symposium (USENIX Security 23),
pages 7463–7479, Anaheim, CA, August 2023. USENIX Association.
[NAI17] Naoki Nishikawa, Hideharu Amano, and Keisuke Iwai. Implementation of
bitsliced AES encryption on CUDA-enabled GPU. In Zheng Yan, Refik Molva,
Wojciech Mazurczyk, and Raimo Kantola, editors, Network and System Security
- 11th International Conference, NSS 2017, Helsinki, Finland, August 21-23,
2017, Proceedings, volume 10394 of Lecture Notes in Computer Science, pages
273–287. Springer, 2017.
[Tez21] Cihangir Tezcan. Optimization of Advanced Encryption Standard on graphics
processing units. IEEE Access, 9:67315–67326, 2021.
[Tez22] Cihangir Tezcan. Key lengths revisited: GPU-based brute force cryptanalysis
of DES, 3DES, and PRESENT. J. Syst. Archit., 124:102402, 2022.
[Tez24] Cihangir Tezcan. GPU-based brute force cryptanalysis of KLEIN. In Gabriele
Lenzini, Paolo Mori, and Steven Furnell, editors, Proceedings of the 10th
International Conference on Information Systems Security and Privacy, ICISSP
2024, Rome, Italy, February 26-28, 2024, pages 884–889. SCITEPRESS, 2024.