A New Neural Distinguisher Considering Features Derived From Multiple Ciphertext Pairs
Advance Access publication on 11 March 2022 https://doi.org/10.1093/comjnl/bxac019
1. INTRODUCTION
In CRYPTO'19, Gohr improved attacks on round-reduced Speck32/64 using deep learning [1], which created a precedent for neural-aided cryptanalysis. The neural distinguisher (ND) proposed by Gohr plays a core role in [1]. Its target is to distinguish real ciphertext pairs (C0, C1) corresponding to plaintext pairs with a specific difference from random ciphertext pairs. ND takes a ciphertext pair (C0, C1) as input and gives the classification result.

The performance of ND is important for neural-aided cryptanalysis. For Gohr's key recovery attack [1], the most important step is identifying the right plaintext structure that passes the differential placed before ND. To attack 11-round Speck32/64, Gohr adopted a 6-round ND and a 7-round ND for identifying the right plaintext structure. The identification result is given by the 6-round ND instead of the 7-round ND. Compared with the 7-round ND, the 6-round ND achieves higher distinguishing accuracy. This implies that a stronger ND is more helpful for Gohr's attacks. Recently, Chen et al. [2] proposed a generic neural-aided statistical attack (NASA) for cryptanalysis. The data complexity of NASA is strongly related to the distinguishing accuracy of ND.

To improve the performance of ND, researchers have explored ND from different directions. The most popular direction is adopting different neural networks. In [3], Jain et al. proposed a multi-layer perceptron (MLP) network to build NDs against PRESENT reduced to 3 and 4 rounds. In [4], Yadav et al. also built an MLP-based 3-round ND against Speck32/64. In [5], Bellini et al. compared MLP-based and convolutional neural network-based distinguishers with classic distinguishers. In [6], Pareek et al. proposed a fully-connected network-based distinguisher against the key scheduling algorithm of PRESENT. Another popular direction is changing the input of ND. In [7], Baksi et al. used the ciphertext difference C0 ⊕ C1 as the input. In [2], Chen et al. suggested that the ND can be built by flexibly taking some bits of a ciphertext pair as input. In [8], Hou et al. investigated the influence of the input difference pattern on the accuracy of NDs against round-reduced Simon32/64.

These NDs can be viewed as the same type, since only features hidden in a single ciphertext pair are exploited. Thus, another natural way is taking more ciphertexts as the input. In [9], Benamira et al. initially tested this idea as follows. First, a group of B ciphertexts is constructed from the same key. Second, take a group of B ciphertexts as the input of ND. Finally, based on a large B, the accuracy of NDs against 5-round and 6-round Speck32/64 is increased to 100%, which is a huge improvement.

Previous findings, especially the work in [9], inspired us to think about the deeper motivation of taking more ciphertexts as input. We believe that the deeper motivation stands for a generic method for improving ND. The ND processing a group of B ciphertexts has two important characteristics: (1) the input contains more ciphertexts, (2) all the ciphertexts in a group share the same key. Since Ankele and Kölbl [10],

the data complexity of the attack [2] can be reduced by using the new ND.

1.2. Outlines

This paper is organized as follows:

• Section 2 presents preliminaries, including some important notations and five related ciphers.
• In Section 3, the ND proposed by Gohr and two key recovery attacks are briefly reviewed.
• Section 4 presents the new ND, including the motiva-
2.3. Computing resources

In this paper, the available computing resources are: an Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz and a graphics card (NVIDIA GeForce GTX 1060 6GB).

(c) Compute the rank score V_kg of kg as:

    V_kg = Σ_{i=1}^{m} log2( Z_i / (1 − Z_i) )    (3)
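A direct way to evaluate this score from the m distinguisher outputs Z_i is sketched below; the function name and the clipping constant are ours, not the paper's.

```python
import numpy as np

def rank_score(nd_outputs, eps=1e-12):
    """Combine m distinguisher outputs Z_i into the key rank score of Eq. (3):
    V_kg = sum_{i=1}^{m} log2(Z_i / (1 - Z_i))."""
    z = np.clip(np.asarray(nd_outputs, dtype=np.float64), eps, 1.0 - eps)
    return float(np.sum(np.log2(z / (1.0 - z))))

# Outputs close to 1 push the score up, outputs close to 0 pull it down.
print(rank_score([0.9, 0.8, 0.3]))   # ≈ 3.9
```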
    t = μ0 − z_{1−β0} × σ0    (9)

where

    μ0 = N × ( p0·p1 + (1 − p0)·p3 )

    σ0 = sqrt( N·p0·p1·(1 − p1) + N·(1 − p0)·p3·(1 − p3) )

If c2 = 0.5, the distinguishing accuracy of the ND is (p1 + 1 − p3) × 0.5. Thus, the data complexity of NASA is strongly related to the ND. We refer readers to [2] for more details of NASA.

where Y is the label of ciphertext groups, and (Cj,1, Cj,2) is the ciphertext pair corresponding to the plaintext pair (Pj,1, Pj,2), j ∈ [1, k].

According to the introduced motivation, the requirement is that ciphertext pairs in a group are randomly sampled from the same distribution. To minimize influencing factors, we ask that a ciphertext group is constructed from k random keys if the cipher needs a key. This ensures that the k ciphertext pairs do not have any same properties except for the same plaintext difference constraint.

Our new ND can be described as

    Pr(Y = 1 | X1, ···, Xk) = F2( f(X1), ···, f(Xk), ϕ(f(X1), ···, f(Xk)) )    (11)
    Xi = (Ci,1, Ci,2), i ∈ [1, k]
where f(Xi) represents the basic features extracted from the ciphertext pair Xi, ϕ(·) is the derived features, and F2(·) is the new posterior probability estimation function.

The motivation also puts forward some design guidelines for the neural network to be used. Since we hope more features ϕ(f(X1), ···, f(Xk)) are extracted from the distribution of the basic features f(Xi), i ∈ [1, k], the ND should learn basic features from each ciphertext pair first. From the perspective of neural networks, this requirement can be satisfied by placing 1D convolutional layers before 2D convolutional layers.

4.3. Residual network

4.3.1. Network architecture
The network architecture adopted by Gohr [1] is also applied in this article. According to the requirement of the motivation, except for the first 1D convolutional layer, the remaining 1D convolutional layers are replaced by 2D convolutional layers. Figure 3 shows the neural network architecture. The input consisting of k ciphertext pairs is arranged in a k × w × (2L/w) array. L represents the block size of the target cipher and w is the size of a basic unit. For example, L is 32 and w is 16 for Speck32/64.

The network architecture contains two core modules. The first one (Module 1) is a bit-slice layer that contains convolution kernels with a size of 1 × 1. This layer can learn basic features from each input ciphertext pair that is arranged in a 1 × w × (2L/w) array. The second one (Module 2) is a residual block that is built over a 2D convolutional layer. The 2D filters with a size of Ks × Ks can learn derived features from k ciphertext pairs. In this article, we use one residual block for building our new NDs.

4.3.2. Training pipeline
New NDs are obtained by following three processes:

1. Data generation: Consider a plaintext difference ΔP and a cipher E. Randomly generate k plaintext pairs with ΔP. If E needs a key, randomly generate k keys. Collect the k ciphertext pairs with E and the k keys. Regard these k ciphertext pairs as a ciphertext group with a size of k, and the label is Y = 1. We denote a ciphertext group with Y = 1 as a positive sample. If the plaintext differences of the k plaintext pairs are random, the label of the resulting ciphertext group is Y = 0, and we denote it as a negative sample. A training set is composed of N/(2k) positive samples and N/(2k) negative samples. A testing set is composed of M/(2k) positive samples and M/(2k) negative samples. We need to generate a training set and a testing set.
2. Training: Train the neural network (Fig. 3) on the training dataset.
3. Testing: Test the distinguishing accuracy of the trained neural network on the testing dataset. If the test accuracy exceeds 0.5, return the neural network as a valid ND. Otherwise, choose a different α and start from the data generation process again.
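As a minimal sketch of the data-generation process for Speck32/64, the code below builds positive and negative ciphertext groups in the spirit of step 1. The `encrypt` callable, the chosen difference DELTA_P and all other names are our placeholders, not the paper's code.

```python
import random

BLOCK_SIZE = 32          # Speck32/64 block size L
KEY_SIZE = 64            # Speck32/64 master key size
DELTA_P = 0x00400000     # illustrative input difference (0x40/0x0)

def make_group(k, encrypt, positive=True):
    """Build one ciphertext group of size k (step 1 of the training pipeline).
    Positive groups: k plaintext pairs with the fixed difference DELTA_P,
    each pair encrypted under an independent random key.
    Negative groups: the k plaintext differences are random instead."""
    group = []
    for _ in range(k):
        p0 = random.getrandbits(BLOCK_SIZE)
        diff = DELTA_P if positive else (random.getrandbits(BLOCK_SIZE) or 1)
        p1 = p0 ^ diff
        key = random.getrandbits(KEY_SIZE)
        # `encrypt` stands in for round-reduced Speck32/64 encryption.
        group.append((encrypt(p0, key), encrypt(p1, key)))
    return group

def make_dataset(n_groups, k, encrypt):
    """Return a balanced list of (ciphertext_group, label) samples."""
    data = []
    for i in range(n_groups):
        positive = (i % 2 == 0)
        data.append((make_group(k, encrypt, positive), 1 if positive else 0))
    return data
```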
In the training phase, the neural network is trained for Es epochs with a batch size of Bs. The cyclic learning rate scheme in [1] is adopted. Optimization is performed against the following loss function:

    loss = Σ_{i=1}^{N/k} ( Z_{i,p} − Y_i )² + λ × ‖W‖    (12)

where Z_{i,p} is the output of the ND, Y_i is the true label, W denotes the parameters of the neural network and λ is the penalty factor. The Adam algorithm [22] with default parameters in Keras [23] is used for optimization.

TABLE 1. Parameters for constructing our new NDs.

    Nf = 32, d1 = 64, d2 = 64, Ks = 3, Bs = 1000
    λ = 10^-5, Lr = 0.002 → 0.0001, Es = 10, N = 10^7, M = 10^6
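A possible Keras realization of this architecture and training recipe is sketched below. It is our reconstruction from the description and Table 1 (we read Nf as the number of filters, d1/d2 as the dense-layer sizes and Ks as the 2D kernel size); Figure 3 and the authors' code are not reproduced here, so details such as channel ordering and the exact layout of the residual block are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

# Table 1 parameters
NF, D1, D2, KS, BS = 32, 64, 64, 3, 1000
LAMBDA, LR_HIGH, LR_LOW, EPOCHS = 1e-5, 0.002, 0.0001, 10

def build_nd(k, w=16, units=4):
    """Residual network of Section 4.3.1: a 1x1 'bit-slice' Conv2D (Module 1),
    one Ks x Ks residual block (Module 2) and a dense prediction head.
    The input holds k ciphertext pairs arranged as a k x w x (2L/w) array."""
    reg = regularizers.l2(LAMBDA)
    inp = layers.Input(shape=(k, w, units))
    x = layers.Conv2D(NF, 1, padding='same', kernel_regularizer=reg)(inp)   # Module 1
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    shortcut = x
    for _ in range(2):                                                      # Module 2
        x = layers.Conv2D(NF, KS, padding='same', kernel_regularizer=reg)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    x = layers.Add()([x, shortcut])
    x = layers.Flatten()(x)
    x = layers.Dense(D1, activation='relu', kernel_regularizer=reg)(x)
    x = layers.Dense(D2, activation='relu', kernel_regularizer=reg)(x)
    out = layers.Dense(1, activation='sigmoid', kernel_regularizer=reg)(x)
    model = Model(inp, out)
    # MSE loss plus the L2 penalty approximates Eq. (12); Adam with Keras defaults.
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

def cyclic_lr(epoch):
    """Simple cyclic schedule sweeping the learning rate 0.002 -> 0.0001 (cf. [1])."""
    span = EPOCHS - 1
    return LR_LOW + (span - epoch % EPOCHS) / span * (LR_HIGH - LR_LOW)

# model = build_nd(k=2)
# model.fit(X_train, Y_train, epochs=EPOCHS, batch_size=BS,
#           validation_data=(X_test, Y_test),
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(cyclic_lr)])
```

The schedule sweeps the learning rate from 0.002 down to 0.0001 once over the 10 epochs, matching the Lr entry of Table 1.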
TABLE 2. Distinguishing accuracy of NDs against Speck32/64.

    r    NDk=1    NDk=2    NDk=4    NDk=8    NDk=16
    5    0.926    0.9739   0.9914   0.9991   0.9999
    6    0.784    0.8667   0.9358   0.9528   0.9786
    7    0.607    0.6396   0.6847   0.7009   0.6493

TABLE 3. Distinguishing accuracy of NDs against Speck32/64 under the fair setting.

    r / n    NDk=1    NDk=2, m = n/k    NDk=4, m = n/k
    6 / 8    0.9573   0.9767            0.9823
    7 / 8    0.7333   0.7421            0.7506
    7 / 16   0.7859   0.8020            0.8090
    7 / 32   0.8352   0.8682            0.8751
    7 / 64   0.8757   0.9282            0.9387
6.1. Experiments on Speck32/64
TABLE 4. Distinguishing accuracy of NDs over two kinds of testing sets.
TABLE 5. The comparison of neural network parameters as well as the accuracy of NDk, k ∈ {1, 2}.

    NDk=1, 1 residual block
    r    Parameter    Accuracy
    5    44 321       0.926
    6    44 321       0.784

    NDk=1, 10 residual blocks
    r    Parameter    Accuracy
    5    102 497      0.929

TABLE 7. Distinguishing accuracy of NDs against Chaskey.

    r    NDk=1    NDk=2    NDk=4    NDk=8    NDk=16
    3    0.8608   0.8958   0.9583   0.9887   0.9986
    4    0.6161   0.6589   0.6981   0.7603   0.7712

such ciphertext pairs are false negative samples. These k samples are combined into a ciphertext group and fed into NDk. Generate a large number of such ciphertext groups and feed

TABLE 8. Pass ratios of FPT and FNT of NDs against Chaskey.

    False negative test
    r    NDk=2    NDk=4    NDk=8    NDk=16
TABLE 10. Pass ratios of FNT and FPT of NDs against Present64/80.

    False negative test
    r    NDk=2    NDk=4    NDk=8    NDk=16
    6    0.0277   0.0097   0.0258   0.0751
    7    0.1796   0.0587   0.1214   0.1488

    False positive test
    r    NDk=2    NDk=4    NDk=8    NDk=16
    6    0.0147   0.0046   0.0068   0.0183

TABLE 12. Pass ratios of FNT and FPT of NDs against DES.

    False negative test
    R    NDk=2    NDk=4    NDk=8    NDk=16
    5    0.0046   0.0034   0.0132   0.0131
    6    0.0802   0.2348   0.2526   0.3207

    False positive test
    R    NDk=2    NDk=4    NDk=8    NDk=16
    5    0.0594   0.0627   0.0566   0.0518
    6    0.0462   0.0598   0.0921   0.0809

TABLE 13. Distinguishing accuracy of NDs against SHA3-256.

    r    NDk=1    NDk=2    NDk=4    NDk=8    NDk=16
    3    0.7228   0.8149   0.9241   0.971    0.9904

TABLE 14. Pass ratios of FNT and FPT of NDs against SHA3-256.

    False negative test
    r    NDk=2    NDk=4    NDk=8    NDk=16
    3    0.2249   0.2347   0.3336   0.2711

    False positive test
    r    NDk=2    NDk=4    NDk=8    NDk=16
    3    0.1045   0.0961   0.0171   0.0088

6.5. Experiments on SHA3-256

SHA3-256 is a hash function. When one message block is fed into reduced SHA3-256, we collect the first 32 bytes of the output after the r-round permutation is applied to this message block. Given a message difference α = 1, we build NDs against SHA3-256 reduced up to 3 rounds.

The number of ciphertext pairs is N = 2 × 10^6. The batch size is 500, and the penalty factor is 10^-5. The accuracies are presented in Table 13. The pass ratios of the FPT and FNT of NDs are presented in Table 14.

7.1. Data reuse strategy for reducing data complexity

There is a potential problem when we directly apply our new ND to key recovery attacks. Assume that Gohr's distinguisher and our new NDk have the same performance and that a certain attack requires M random inputs. If we directly reshape M × k ciphertext pairs into M ciphertext groups, the data complexity of our NDk is k times as much as the data complexity of Gohr's distinguisher.

Given M ciphertext pairs Xi = (Ci,0, Ci,1), i ∈ [1, M], there are a total of C(M, k) options for composing a ciphertext group, which is much larger than M/k. Thus, we can randomly select M ciphertext groups from the C(M, k) options. Such a strategy can help reduce data complexity. In fact, it is equivalent to attaching more importance to the derived features from k ciphertexts.
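A quick, purely illustrative check of this gap for, say, M = 2^16 and k = 2:

```python
from math import comb

M, k = 2**16, 2
print(comb(M, k))   # 2147450880 ≈ 2^31 possible ciphertext groups
print(M // k)       # 32768 = 2^15 groups from naive reshaping
```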
However, the subsequent key recovery attacks using this naive strategy do not obtain good results. The main reason is that the sampling randomness of the M ciphertext groups is destroyed. Two new concepts are proposed for overcoming this problem.

Maximum reuse frequency: During the generation of M ciphertext groups, a ciphertext pair is likely to be reused several times. We denote the reuse frequency of the ith ciphertext pair as RFi, i ∈ [1, M]. Maximum reuse frequency (MRF) is defined as the maximum value of RFi:

    MRF = max{ RFi }, i ∈ [1, M]    (17)

Sample similarity degree: For any two ciphertext groups Gi, Gj, the similarity of these two ciphertext groups is defined as the number of the same ciphertext pairs. As for M ciphertext groups, sample similarity degree (SSD) is defined as the maximum of any two ciphertext groups' similarity:

    SSD = max | Gi ∩ Gj |, i, j ∈ [1, M]
    Gi = { Xi1, ···, Xik }
    Gj = { Xj1, ···, Xjk }    (18)
    i1, ···, ik, j1, ···, jk ∈ [1, M]

MRF can ensure that the contribution of each ciphertext pair is similar. SSD can increase the distribution uniformity of the M ciphertext groups as much as possible. Based on the above two concepts, we propose the following Data Reuse Strategy (see

Thus, we can generate 2^m ⩾ k ciphertext pairs using m neutral bits. The probability that these k ciphertext pairs satisfy the difference transition ΔP → α simultaneously is still p0. Then, N ciphertext groups with a size of k can be generated as

1. Randomly generate N plaintext pairs with ΔP.
2. Generate N plaintext structures using m neutral bits.
3. Randomly pick k plaintext pairs from a structure and collect the ciphertext pairs.

The total data complexity is N × k.

It is worth noticing that the data reuse strategy is still appli-
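To make the two constraints concrete, the following rejection-sampling sketch draws M ciphertext groups from M ciphertext pairs while bounding MRF and SSD. The procedure and every name in it are our illustration, not the paper's exact Data Reuse Strategy.

```python
import random
from collections import Counter

def reuse_groups(pairs, M, k, mrf_limit=4, ssd_limit=1, max_tries=1000):
    """Draw M ciphertext groups of size k from `pairs`, keeping the reuse
    frequency of every pair <= mrf_limit (bounds MRF, Eq. 17) and the overlap
    between any two groups <= ssd_limit (bounds SSD, Eq. 18)."""
    reuse = Counter()                 # RF_i per pair index
    chosen = []                       # groups as frozensets of pair indices
    indices = range(len(pairs))
    for _ in range(M):
        for _ in range(max_tries):
            cand = frozenset(random.sample(indices, k))
            if any(reuse[i] >= mrf_limit for i in cand):
                continue              # would exceed the reuse limit
            if any(len(cand & g) > ssd_limit for g in chosen):
                continue              # too similar to an existing group
            break
        else:
            raise RuntimeError("could not satisfy the MRF/SSD limits; relax them")
        chosen.append(cand)
        reuse.update(cand)
    mrf = max(reuse.values())
    ssd = max((len(a & b) for i, a in enumerate(chosen) for b in chosen[i + 1:]),
              default=0)
    groups = [[pairs[i] for i in g] for g in chosen]
    return groups, mrf, ssd
```

With mrf_limit and ssd_limit set loosely (e.g. both equal to k), this degenerates to plain random selection; tightening them trades generation time for sampling uniformity.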
TABLE 15. Data complexity comparisons when p0 = 2^-6, d = 2, β0 = 0.005, β1 = 2^-16, c2 = 0.5. The prepended differential is a 3-round differential that is extended from 0x211/0xa04 →(p0) 0x40/0x0 without loss of transition probability.

TABLE 17. The values of p1, p2, p3 related to the 5-round NDs when c2 = 0.5, d = 1, r = 5, p0 = 2^-6.

    Distinguisher    p1    p2    p3    log2 N
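Eq. (9) turns such parameters into NASA's decision threshold t; a short worked sketch is given below (the p1 and p3 values are illustrative placeholders, not values taken from Table 17).

```python
from math import sqrt
from statistics import NormalDist

def nasa_threshold(N, p0, p1, p3, beta0):
    """Decision threshold t = mu0 - z_{1-beta0} * sigma0 from Eq. (9)."""
    mu0 = N * (p0 * p1 + (1 - p0) * p3)
    sigma0 = sqrt(N * p0 * p1 * (1 - p1) + N * (1 - p0) * p3 * (1 - p3))
    z = NormalDist().inv_cdf(1 - beta0)
    return mu0 - z * sigma0

# Illustrative call with the Table 15 setting p0 = 2^-6, beta0 = 0.005;
# p1 and p3 are made-up stand-ins for a distinguisher's response rates.
print(nasa_threshold(N=2**15.821, p0=2**-6, p1=0.78, p3=0.45, beta0=0.005))
```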
To filter sk10[8 ∼ 0], the student distinguisher with k = 2 requires 2^17.785 plaintext pairs. In the second stage, we select N = 2^15.821 plaintext pairs from the 2^17.785 plaintext pairs. When we perform NASA with our 6-round NDk=2 100 times, the results are

1. the true subkey sk10 survives in 90 trials;
2. the average numbers of surviving subkey guesses in the two stages are 11.82 and 25.07, respectively;
3. in all the 100 trials, the number of surviving subkey guesses is lower than 137.31.

Figure 4 shows the runtime comparison of the 200 experiments. The practical experiments further prove that our new NDs can be applied to NASA. Besides, with smaller data complexity, the NASA based on our ND achieves a competitive result.

In Section 3.2, we have reviewed how Gohr's attack recovers the subkey skr+1 with an r-round ND. This method needs a rank score threshold. In steps 3a and 3(b)ii, we need thresholds c3 and c4, respectively. In this paper, let c3 = 18 and c4 = 150.

Experiment results. We run 1000 experiments each time and repeat this 5 times. These experiments based on Gohr's distinguishers NDk=1 were also performed using the same ciphertexts. Table 19 summarizes the success rates.

7.3.2. Posterior probability analysis

We have proved that our ND applies to Gohr's attack. Moreover, the attack based on our NDs shows a minor advantage in terms of the success rate. This minor advantage is interesting since the success rate of Gohr's attack is not directly determined by the distinguishing accuracy. To better understand the influence of accuracy improvement on Gohr's attack, we
perform a deeper analysis from the perspective of the key rank score.

Consider an (r + 1)-round cipher E. We first build an r-round ND based on a difference α. Then we collect numerous ciphertext pairs corresponding to plaintext pairs with a difference α. We decrypt these ciphertext pairs with a subkey guess kg and feed the partially decrypted ciphertext pairs into the ND.

Let tk denote the true subkey of the (r + 1)th round. Besides, the Hamming distance between tk and kg is d. We focus on the expectation of the following conditional posterior probability

    Z = Pr(Y = 1 | X, d) = F(X)    (21)

where X is the input of the ND, and F is the ND. If the ND is Gohr's distinguisher, X is a decrypted ciphertext pair. If the ND is our distinguisher NDk, X is a ciphertext group consisting of k decrypted ciphertext pairs.

Taking NDk=2 against Speck32/64 reduced to 6 and 7 rounds as examples, we estimate the expectations of the above conditional posterior probability. As a comparison, we also estimate the expectations based on NDk=1. The final estimation results are shown in Figures 5 and 6.

There are two important phenomena. First, compared with Gohr's distinguishers NDk=1, our distinguishers NDk=2 bring higher expectations Pr(Y = 1|X, d = 0). Second, the value of Pr(Y = 1|X, d = 0) − Pr(Y = 1|X, d = i), i ∈ [1, 16], increases.

The first phenomenon means that a large key rank score threshold (e.g. c3 = 18, c4 = 150) is applicable. The second phenomenon makes the gap between the rank score of the true key and that of wrong keys increase. By setting a high key rank score threshold, wrong keys are less likely to obtain a key rank score higher than the threshold. Thus, a higher success rate is more likely to be obtained by replacing Gohr's NDs with our NDs.

8. OPEN PROBLEMS

Our work in this paper raises some open problems:

• What features derived from multiple ciphertext pairs are learned by our distinguishers?
• The influence of features derived from multiple ciphertext pairs is rather complex. More exactly, except for its positive influence, we find that these features also have a negative influence. For example, when we compare the distinguishing accuracy of NDs under a fair setting (see Section 6.1.1), if we give the prediction label based on the following metric:

    v = log( Z1 / (1 − Z1) ) + ··· + log( Zm / (1 − Zm) ),    (22)

where Zi, i ∈ [1, m] is the output of the NDs, our distinguishers have a tiny or no advantage in terms of the distinguishing accuracy. Table 20 shows our experiment
results based on the above metric. Thus, an important problem is how to make full use of these features and bring a more significant positive influence.

These problems are out of the scope of this paper. We will explore them in future research.

9. CONCLUSIONS

In this paper, we focus on the ND, which is the core module in neural-aided cryptanalysis. By considering multiple cipher-

REFERENCES

[1] Gohr, A. (2019) Improving attacks on round-reduced Speck32/64 using deep learning. In Advances in Cryptology - CRYPTO 2019 - 39th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2019, Proceedings, Part II, Lecture Notes in Computer Science (Vol. 11693), pp. 150–179. Springer.
[2] Chen, Y. and Yu, H. (2020) Neural aided statistical attack for cryptanalysis. IACR Cryptol. ePrint Arch., 2020, 1620.
[3] Jain, A., Kohli, V. and Mishra, G. (2020) Deep learning based differential distinguisher for lightweight cipher PRESENT. IACR Cryptol. ePrint Arch., 2020, 846.
[4] Yadav, T. and Kumar, M. (2020) Differential-ML distinguisher: Machine learning based generic extension for differential cryptanalysis. IACR Cryptol. ePrint Arch., 2020, 913.
2007, Proceedings, Lecture Notes in Computer Science (Vol. 4727), pp. 450–466. Springer.
[14] Coppersmith, D., Holloway, C.L., Matyas, S.M. and Zunic, N. (1997) The data encryption standard. Inf. Secur. Tech. Rep., 2, 22–24.
[15] Huang, S., Wang, X., Xu, G., Wang, M. and Zhao, J. (2017) Conditional cube attack on reduced-round Keccak sponge function. In Coron, J., Nielsen, J.B. (eds) Advances in Cryptology - EUROCRYPT 2017 - 36th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Paris, France, April 30 - May 4, 2017, Proceedings, Part II, Lecture Notes in Computer Science (Vol. 10211), pp. 259–288.
Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 9799–9809.
[21] Chen, Y., Yu, L., Ota, K. and Dong, M. (2019) Hierarchical posture representation for robust action recognition. IEEE Trans. Comput. Soc. Syst., 6, 1115–1125.
[22] Kingma, D.P. and Ba, J. (2015) Adam: A method for stochastic optimization. In Bengio, Y., LeCun, Y. (eds) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
[23] Chollet, F. et al. (2015) Keras. https://github.com/fchollet/keras.
[24] Abed, F., List, E., Lucks, S. and Wenzel, J. (2014)