SALSA: Attacking Lattice Cryptography with Transformers (NeurIPS 2022)
1 Introduction
The looming threat of quantum computers has upended the field of cryptography. Public-key
cryptographic systems have at their heart a difficult-to-solve math problem that guarantees their
security. The security of most current systems (e.g. [58, 28, 50]) relies on problems such as integer
factorization or the discrete logarithm problem in an abelian group. Unfortunately, these problems
are vulnerable to polynomial-time attacks running on large-scale quantum computers, via Shor's
algorithm [62]. Therefore, the race is on to find new post-quantum cryptosystems (PQC) built upon
alternative hard math problems.
Several leading candidates in the final round of the 5-year NIST PQC competition are lattice-based
cryptosystems, based on the hardness of the Shortest Vector Problem (SVP) [2], which involves
finding short vectors in high dimensional lattices. Many cryptosystems have been proposed based
on hard problems which reduce to some version of the SVP, and known attacks are largely based on
lattice-basis reduction algorithms which aim to find short vectors via algebraic techniques. The LLL
algorithm [43] was the original template for lattice reduction, and although it runs in polynomial
time (in the dimension of the lattice), it returns an exponentially bad approximation to the shortest
vector. It is an active area of research [22, 48, 5] to fully understand the behavior and running time of
a wide range of lattice-basis reduction algorithms, but the best known classical attacks on the PQC
candidates run in time exponential in the dimension of the lattice.
In this paper, we focus on one of the most widely used lattice-based hardness assumptions: Learning
With Errors (LWE) [56]. Given a dimension n, an integer modulus q, and a secret vector s ∈ Z_q^n,
the learning with errors problem is to find the secret given noisy inner products with random vectors.
LWE-based encryption schemes encrypt a message by blinding it with a noisy inner product. Given a
random vector a ∈ Z_q^n, the noisy inner product is b := a · s + e mod q, where e is an “error” vector
sampled from a narrow Gaussian distribution (so its entries are small, thus the reference to noise).
∗ Co-first authors, work done while at Meta AI.
† Co-last authors.
2.2 LWE
The Learning With Errors (LWE) problem, introduced in [56], is parameterized by a dimension n, the
number of samples m, a modulus q and an error distribution χ (e.g., the discrete Gaussian distribution)
over Zq = {0, 1, . . . , q − 1}. Regev showed that LWE is at least as hard as quantumly solving certain
hard lattice problems. Later [53, 46, 14], showed LWE to be classically as hard as standard worst-case
lattice problems, therefore establishing a solid foundation for building cryptographic schemes on it.
LWE and RLWE. The LWE distribution A_{s,χ} consists of pairs (A, b) ∈ Z_q^{m×n} × Z_q^m, where A
is a uniformly random matrix in Z_q^{m×n}, b = As + e mod q, s ∈ Z_q^n is the secret vector
sampled uniformly at random, and e ∈ Z_q^m is the error vector sampled from the error distribution χ.
We call the pair (A, b) an LWE sample, yielding m LWE instances: one row of A together with the
corresponding entry in b is one LWE instance. There is also a ring version of LWE, known as the
Ring Learning with Errors (RLWE) problem (described further in Appendix A.1).
Search-LWE and Decision-LWE. We now state the LWE hard problems. The search-LWE problem
is to find the secret vector s given (A, b) from A_{s,χ}. The decision-LWE problem is to distinguish
A_{s,χ} from the uniform distribution {(A, b) ∈ Z_q^{m×n} × Z_q^m : A and b are chosen uniformly at
random}. [56] provided a reduction from search-LWE to decision-LWE. We give a detailed proof
of this reduction in Appendix A.2 for the case when the secret vector s is binary (i.e. entries are 0
and 1). In Section 4.3, our Distinguisher Secret Recovery method is built on this reduction proof.
(Sparse) Binary secrets. In LWE-based schemes, the secret key vector s can be sampled from
various distributions. For efficiency reasons, binary distributions (sampling in {0, 1}^n) and ternary
distributions (sampling in {−1, 0, 1}^n) are commonly used, especially in homomorphic encryption
[4]. In fact, many implementations use a sparse secret with Hamming weight h (the number of 1's
in the binary secret). For instance, HEAAN uses n = 2^15, q = 2^628, a ternary secret, and Hamming
weight 64 [23]. For more on the use of sparse binary secrets in LWE, see [6, 25].

Figure 2: Learning modular multiplication for various moduli q. Test loss and accuracy for q with
⌈log2(q)⌉ from 15 to 21. 300,000 training examples per epoch. One-layer transformers with 512
dimensions, 8 attention heads, integers encoded in base 81.

Table 1: Size of the training sets required for learning modular inversion. Base-2 logarithm of the
number of examples needed to reach 95% accuracy, for different values of ⌈log2(q)⌉ and bases.
'-' means 95% accuracy was not attained after 90 million examples.

⌈log2(q)⌉ \ Base:    2    3    5    7   24   27   30   81  128
15                  23   21   23   22   20   23   22   20   20
16                  24   22   22   22   22   22   22   22   21
17                   -   23   25   22   23   24   22   22   22
18                   -   23   25   23   23   24   25   22   22
19                   -   23    -   25   25   24    -   25   24
20                   -    -    -    -   24   25   24   24   25
21                   -   24    -   25    -    -    -    -   25
22                   -    -    -    -    -   25    -    -   25
23                   -    -    -    -    -    -    -    -    -
The speed of learning and the training set size needed to reach high accuracy depend on the problem
difficulty, i.e. the value of q. Table 1 presents the log2 of the number of examples needed to reach
95% accuracy for different values of ⌈log2(q)⌉ and base B. Since transformers learn from scratch,
without prior knowledge of numbers and moduli, this procedure is not data-efficient. The number of
examples needed to learn modular multiplication is between 10q and 50q. Yet, these experiments
prove that transformers can solve the modular inversion problem in prime fields.
Table 1 illustrates an interesting point: learning difficulty depends on the base used to represent
integers. For instance, base 2 and 5 allow the model to learn up to ⌈log2 (q)⌉ = 17 and 18, whereas
base 3 and 7 can reach ⌈log2 (q)⌉ = 21. Larger bases, especially powers of small primes, enable
faster learning. The relation between representation base and learning difficulty is difficult to explain
from a number theoretic standpoint. Additional experiments are in Appendix B.
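For concreteness, here is a small sketch of the base-B digit encoding discussed above. It is only an illustration; the exact tokenization used in the paper may differ in details such as digit order or special tokens.

```python
def to_base(x, base, q):
    """Encode an integer 0 <= x < q as a fixed-length list of base-`base` digits."""
    n_digits = 1
    while base ** n_digits < q:       # fixed length so all residues have the same number of digits
        n_digits += 1
    digits = []
    for _ in range(n_digits):
        digits.append(x % base)
        x //= base
    return digits[::-1]               # most significant digit first

# Example: residues mod q = 251 fit in two base-81 digits.
print(to_base(142, 81, 251))          # [1, 61], since 142 = 1*81 + 61
```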
Multidimensional integer secrets. In the n-dimensional case, the model must learn the modular dot
product between vectors a and s in Z_q^n. This proves to be a much harder problem. For n = 2, with
the same settings, small values of q (251, 367 and 967) can be learned with over 90% accuracy, and
q = 1471 with 30%. In larger dimensions, all models fail to learn. Increasing model depth to 2 or 4
layers, or dimension to 1024 or 2048 and attention heads to 12 and 16, improves data efficiency (fewer
training samples are needed), but does not scale to larger values of q or n > 2.
Multidimensional binary secrets. Binary secrets make n-dimensional problems easier to learn. For
n = 4, our models solve problems with ⌈log2 (q)⌉ ≤ 29 with more than 99.5% accuracy. For n = 6
and 8, we solve cases ⌈log2 (q)⌉ ≤ 22 with more than 85% accuracy. But we did not achieve high
accuracy for larger values of n. So in the next section, we introduce techniques for recovering secrets
from a partially trained transformer. We then show that these additional techniques allow recovery of
sparse binary secrets for LWE instances with 30 ≤ n ≤ 128 (so far).
Direct Secret Recovery. The first technique, based on the LWE search problem, is analogous to a
chosen plaintext attack. For each i ∈ N_n, a guess of the i-th coordinate of s is made by feeding model
M the special value a_i = K·e_i (all coordinates 0 except the i-th), with K a large integer. If
s_i = 0, and the model M correctly approximates b_i = a_i · s + e from a_i, then we expect
b̃_i := M(a_i) to be a small integer; likewise, if s_i = 1 we expect a large integer. This technique
is formalized in Algorithm 1. The binarize function in line 7 is explained in Appendix C. In
SALSA, we run direct recovery with 10 different K values in order to yield 10 guesses of s.

Algorithm 1 Direct Secret Recovery
1: Input: M, K, n
2: Output: secret s
3: p = 0^n
4: for i = 1, . . . , n do
5:   a = 0^n; a_i = K
6:   p_i = M(a)
7: s = binarize(p)
8: Return: s
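A minimal Python sketch of Algorithm 1, assuming `model` is a callable standing in for the trained transformer M. The binarization threshold here is a placeholder; the actual binarize criterion is described in Appendix C.

```python
import numpy as np

def direct_secret_recovery(model, K, n, q):
    """Guess each secret bit by probing the model with a = K * e_i."""
    p = np.zeros(n)
    for i in range(n):
        a = np.zeros(n, dtype=np.int64)
        a[i] = K                     # all coordinates 0 except the i-th
        p[i] = model(a)              # model's prediction of b = a . s + e mod q
    return binarize(p, q)

def binarize(p, q):
    # Placeholder criterion: predictions far from 0 (mod q) suggest s_i = 1.
    dist_from_zero = np.minimum(p % q, q - (p % q))
    return (dist_from_zero > q / 4).astype(np.int64)
```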
Distinguisher Secret Recovery. The second algorithm for secret recovery is based on the decision-
LWE problem. It uses the output of M to determine if LWE data (a, b) can be distinguished from
randomly generated pairs (a_r, b_r). The algorithm for distinguisher-based secret recovery is shown in
Algorithm 2. At a high level, the algorithm works as follows. Suppose we have t LWE instances (a, b)
and t random instances (a_r, b_r). For each secret coordinate s_i, we transform each a into a′ by
adding a random integer c ∈ Z_q to its i-th coordinate: a′_i = a_i + c. We then use model M to
compute M(a′) and M(a_r). If the model has learned s and the i-th bit of s is 0, then M(a′) should
be significantly closer to b than M(a_r) is to b_r. Iterating over i allows us to recover the secret bit
by bit. SALSA runs the distinguisher recovery algorithm when model acc_{τ=0.1} is above 30%.
This is the theoretical limit for this approach to work.
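For illustration, here is a simplified sketch of the distinguisher test for a single secret coordinate; the full procedure, with the advantage-based threshold and sample count, is formalized in Algorithm 2 below. It assumes `model` accepts a batch of integer vectors and returns predicted b values, and the acceptance threshold used here is illustrative rather than the exact choice of Algorithm 2.

```python
import numpy as np

def guess_bit(model, A_lwe, b_lwe, i, q, bound, seed=None):
    """Return a guess for secret bit s_i by comparing LWE samples to uniform samples."""
    rng = np.random.default_rng(seed)
    t, n = A_lwe.shape
    # Randomize the i-th coordinate of each LWE sample.
    A_mod = A_lwe.copy()
    A_mod[:, i] = (A_mod[:, i] + rng.integers(0, q, size=t)) % q
    # Uniform control samples.
    A_unif = rng.integers(0, q, size=(t, n))
    b_unif = rng.integers(0, q, size=t)
    # Count how often the model's prediction is close to the observed b (mod q).
    def close(pred, target):
        d = np.abs(pred - target) % q
        return np.sum(np.minimum(d, q - d) < bound)
    c_lwe = close(model(A_mod), b_lwe)
    c_unif = close(model(A_unif), b_unif)
    # If randomizing coordinate i destroys the model's advantage, guess s_i = 1.
    return 1 if c_lwe - c_unif <= 0 else 0
```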
4.4 Secret Verification

At the end of the recovery step, we have 10 or 11 guesses s̃ (depending on whether the distinguisher
recovery algorithm was run). To verify them, we compute the residuals r = a · s̃ − b mod q for a set
of LWE samples (a, b). If s is correctly guessed, we have s̃ = s, so r = a · s − b = e mod q will be
distributed as the error e, with small standard deviation σ. If s̃ ≠ s, r will be (approximately)
uniformly distributed over Z_q (because a · s̃ and b are uniformly distributed over Z_q), and will have
standard deviation σ(r) ≈ q/√12. Therefore, we can verify if s̃ is correct by calculating the standard
deviation of the residuals: if it is close to σ, the standard deviation of the error, the secret was
recovered. In this paper, σ = 3 and q = 251, so the standard deviation of r will be around 3 if
s̃ = s, and 72.5 if not.

Algorithm 2 Distinguisher Secret Recovery
1: Input: M, n, q, acc_τ, τ
2: Output: secret s
3: s = 0^n
4: advantage, bound = acc_τ − 2·τ, τ·q
5: t = min{50, 2/advantage^2}
6: A_LWE, B_LWE = LWESamples(t, n, q)
7: for i = 1, . . . , n do
8:   A_unif ∼ U{0, q − 1}^{t×n}
9:   B_unif ∼ U{0, q − 1}^t
10:  c ∼ U{0, q − 1}^t
11:  A′_LWE = A_LWE
12:  A′_LWE[:, i] = (A_LWE[:, i] + c) mod q
13:  B̂_LWE = M(A′_LWE)
14:  B̂_unif = M(A_unif)
15:  d_l = |B̂_LWE − B_LWE|
16:  d_u = |B̂_unif − B_unif|
17:  c_LWE = #{j | d_l[j] < bound, j ∈ N_t}
18:  c_unif = #{j | d_u[j] < bound, j ∈ N_t}
19:  if (c_LWE − c_unif) ≤ advantage·t/2 then
20:    s_i = 1
21: Return: s

5 SALSA Evaluation

In this section, we present our experiments with SALSA. We generate datasets for LWE problems
of different sizes, defined by the dimension
and the density of ones in the binary secret. We use gated universal transformers, with two layers
in the encoder and decoder. Default dimensions and attention heads in the encoder and decoder are
1024/512 and 16/4, but we vary them as we scale the problems. Models are trained on two NVIDIA
Volta 32GB GPUs on an internal cluster.
5.1 Data generation
LWE data for SALSA training and evaluation is randomly generated given the following parameters:
dimension n, secret density d, modulus q, encoding base B, binary secret s, and error distribution χ.
For all experiments, we use q = 251 and B = 81 (see §3.1), fix the error distribution χ as a discrete
Gaussian with µ = 0, σ = 3 [4], and generate a random s.
Table 2: Full secret recovery. Highest density values at which the secret was recovered for each n,
q = 251. For n < 100, the model has 1024/512 embedding, 16/4 attention heads. For n ≥ 100, the
model has 1536/512 embedding, 32/4 attention heads. For comparison, we include the log2 number
of possible secrets at each n/d level.

Dim.    Density    log2 secrets    log2 samples    Runtime

Figure 3: Test loss and test accuracy during training, for dimension n = 30, 50, 70, 90, 110 and 128.

We vary the problem size n (the LWE dimension) and the density d (the proportion of ones in the
secret) to test attack success and to observe how it scales. For problem size, we experiment with
n = 30 to n = 128. For density, we experiment with 0.002 ≤ d ≤ 0.15. For a given n, we select d
so that the Hamming weight of the binary secret (h = dn) is larger than 2. Appendix D.2 contains
an ablation study of data parameters. We generate data using the RLWE variant of LWE, described in
Appendix A. For RLWE problems, each a is a row of a circulant matrix generated from an initial
vector in Z_q^n. RLWE problems exhibit more structure than LWE due to the use of the circulant matrix,
which may help our models learn.
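A minimal sketch of this circulant sample generation, assuming a plain cyclic rotation of a uniformly random initial vector; the exact ring convention for RLWE (e.g. negacyclic wrap-around) is specified in Appendix A and may differ from this illustration.

```python
import numpy as np

def circulant_samples(n, q, sigma, s, seed=None):
    """Generate n samples from one circulant matrix: each row a is a rotation of a random vector."""
    rng = np.random.default_rng(seed)
    a0 = rng.integers(0, q, size=n, dtype=np.int64)
    # Row i is the initial vector cyclically shifted by i positions.
    A = np.stack([np.roll(a0, i) for i in range(n)])
    e = np.rint(rng.normal(0, sigma, size=n)).astype(np.int64)
    b = (A @ s + e) % q          # s is the (binary) secret vector of length n
    return A, b
```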
5.2 Results
Table 2 presents problem sizes n and densities d for which secrets can be fully recovered, together
with the time and the logarithm of number of training samples needed. SALSA can recover binary
secrets with Hamming weight 3 for dimensions up to 128. Hamming weight 4 secrets can be
recovered for n < 70. For context, we include a column “log2 secrets” listing the number of possible
secrets for the corresponding Hamming weight. Exhaustive search for the secret may run faster than
our current experiments for low Hamming weight. However, an exhaustive search requires knowing
the exact Hamming weight in advance, or iterating through all possible Hamming weights. Although
we have only succeeded unconditionally with low Hamming weight so far, SALSA does not a priori
know the secret Hamming weight h. It remains to be seen how SALSA will scale to larger n and h.
Interestingly, the number of samples required for SALSA to succeed for a fixed Hamming weight
is relatively flat as the dimension n increases. However, the time needed to recover the secret for a
fixed Hamming weight increases with n, partly because the length of the input sequence fed into the
model is proportional to n even though the number of samples needed remains stable as n grows.
This is an important result, because all the data used for training the model must be collected (e.g.
via eavesdropping), making sample size an important metric.
For a given n, scaling to higher densities requires more time and data, and could not be achieved with
the architecture we use for n > 50. As n grows, larger models are needed: our standard architecture,
with 1024/512 dimensions and 16/4 attention heads (encoder/decoder) was sufficient for n ≤ 90.
For n > 90, we needed 1536/512 dimensions and 32/4 attention heads.
Figure 3 illustrates model behavior during training. After an initial burn-in period, the loss curve
(top graph) plateaus until the model begins learning the secret. Once loss starts decreasing, model
accuracy with 0.1q tolerance (bottom graph) increases sharply. Full secret recovery (vertical lines
in the bottom graph) happens shortly after, often within one or two epochs. Direct secret recovery
accounts for 55% of recoveries, while the distinguisher only accounts for 18% of recoveries (see
Appendix C.3). 27% of the time, both methods succeed simultaneously.
One key conclusion from these experiments is that the secret recovery algorithms enable secret
recovery long before the transformer has been trained to high accuracy (even before training loss
settles at a low level). Frequently, the model only needs to begin to learn for the attack to succeed.
Table 3: Architecture experiments. We test the effect of model layers, loops, gating, and encoder
dimension, and report the log2 samples required for secret recovery (n = 50, Hamming weight 3).
Each column group lists its configuration as (encoder/decoder dimension, attention heads, loops),
with X marking the varied setting.

Regular vs. UT (1024/512, 16/4, 8/8):         Regular 26.3    UT 22.5
Ungated vs. Gated UT (1024/512, 16/4, 8/8):   Ungated 26.5    Gated 22.6
UT loops (1024/512, 16/4, X/X):               2/8 23.5    4/4 26.1    8/2 23.2
Encoder dim. (X/512, 16/4, 2/8):              512 23.3    2048 20.1    3040 19.7
Decoder dim. (1024/X, 16/4, 2/8):             256 22.5    768 21.8    1024 23.9    1536 24.3
SALSA’s base model architecture is a Universal Transformer (UT) with a copy-gate mechanism.
Table 3 demonstrates the importance of these choices. For problem dimension n = 50, replacing the
UT by a regular transformer with 8 encoder/decoder layers, or removing the copy-gate mechanism
increases the data requirement by a factor of 14. Reducing the number of iterations in the shared
layers from 8 to 4 has a similar effect. Reducing the number of iterations in either the encoder or
decoder (i.e. from 8/8 to 8/2 or 2/8) may further speed up training. Asymmetric transformers (e.g.
large encoder and small decoder) have proved efficient for other math problems, e.g. [37], [17], and
asymmetry helps SALSA as well. Table 3 demonstrates that increasing the encoder dimension from
1024 to 3040, while keeping the decoder dimension at 512, results in a 7-fold reduction in sample
size. Additional architecture experiments are presented in Appendix D.1.
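For readers unfamiliar with this architecture, the following PyTorch-style sketch shows a shared ("universal") encoder layer applied in a loop with a copy gate. It is only an illustration of the mechanism (in the spirit of [27] and [24]), not SALSA's implementation, and the gating formula is an assumption made for the example.

```python
import torch
import torch.nn as nn

class GatedUniversalEncoder(nn.Module):
    """Minimal sketch: one shared transformer layer applied `loops` times with a copy gate."""
    def __init__(self, d_model=1024, n_heads=16, loops=8):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.loops = loops

    def forward(self, x):                 # x: (batch, sequence, d_model)
        for _ in range(self.loops):
            update = self.shared(x)
            g = self.gate(x)              # copy gate: per position, keep the old state or overwrite it
            x = g * update + (1 - g) * x
        return x
```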
To attack real-world LWE problems, SALSA must handle larger dimension n and density d. Our
experiments with architecture suggest that increasing model size, and especially encoder dimension,
is the key factor to scaling n. Empirical observations indicate that scaling d is a much harder problem.
We hypothesize that this is due to the subset sum modular addition at the core of LWE with binary
secrets. For a secret with Hamming weight h, the base operation a · s + e mod q is a sum of
h integers, followed by a modulus. For small values of h, the modulus operation is not always
necessary, as the sum might not exceed q. As density increases, so does the number of times the sum
“wraps around” the modulus, perhaps making larger Hamming weights more difficult to learn. To
test this hypothesis, we limited the range of the coordinates in a, so that a_i < r, with r = αq and
0.3 < α < 0.7. For n = 50, we recovered secrets with density up to 0.3, compared to 0.08 with the
full range of coordinates (see Table 4). Density larger than 0.3 is no longer considered a sparse secret.
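The wrap-around intuition can be checked numerically; the sketch below (purely illustrative, not an experiment from the paper) estimates the average number of times a · s crosses the modulus for random a, with and without bounding the coordinates.

```python
import numpy as np

def mean_wraps(n, q, h, trials=10000, max_coord=None, seed=0):
    """Average number of times a.s exceeds q, i.e. floor((a.s) / q), for random a and a weight-h binary secret."""
    rng = np.random.default_rng(seed)
    hi = q if max_coord is None else max_coord
    s = np.zeros(n, dtype=np.int64)
    s[:h] = 1
    A = rng.integers(0, hi, size=(trials, n), dtype=np.int64)
    return float(np.mean((A @ s) // q))

# Full-range coordinates vs. coordinates bounded by 0.5*q (n = 50, q = 251).
print(mean_wraps(50, 251, h=4))                   # roughly h/2 wraps on average
print(mean_wraps(50, 251, h=4, max_coord=125))    # roughly half as many wraps
```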
Table 4: Secret recovery when max a value is bounded. Results shown are fraction of the secret recovered
by SALSA for n = 50 with varying d when a values are ≤ p · Q. Green means that s was fully recovered.
Yellow means all of the 1 bits were recovered, but not all 0 bits. Red means SALSA failed. For completeness,
we also note the log2 number of possible secrets at each d level.
Figure 4: Reusing LWE samples yields a significant decrease in the number of samples needed for secret
recovery. Shown here is the ratio of samples required for secret recovery with reuse to the samples required for
secret recovery without reuse, both expected (top curve) and observed (bottom curve, better than expected).
Other LWE-based schemes use secret dimensions that seem achievable given our current results. For
example, in the LWE-based public key encryption scheme CRYSTALS-Kyber [9], the secret dimension is
k × 256 for k ∈ {2, 3, 4}, an approachable range for SALSA based on initial results. The LWE-based
signature scheme CRYSTALS-Dilithium has similar n sizes [29]. However, these schemes don’t use
sparse binary secrets, and adapting SALSA to non-binary secrets is an avenue for future work.
Sample Efficiency. A key requirement of real-world LWE attacks is sample efficiency. In practice,
an attacker will only have access to a small set of LWE instances (a, b) for a given secret s. For
instance, in CRYSTALS-Kyber, there are only (k + 1)n LWE instances available, with k = 2, 3 or 4 and
n = 256. The experiments in [20, 11] use fewer than 500 LWE instances. The TU Darmstadt challenge
provides n^2 LWE instances to attackers.
The log2 samples column of Table 2 lists the number of LWE instances needed for model training.
This number is much larger than what is likely available in practice, so it is important to reduce sample
requirements. Classical algebraic attacks on LWE require LWE instances to be linearly independent,
but SALSA does not have this limitation. Thus, we can reduce SALSA’s sample use in several ways.
First, we can reuse samples during training. Figure 4 confirms that this allows secret recovery with
fewer samples. Second, we can use integer linear combinations of given LWE samples to make new
samples which have the same secret but a larger error σ. Appendix E contains the formula for the
number of new samples we can generate with this method (up to 242 new samples from 100 samples).
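A minimal sketch of this augmentation: combining samples with small integer coefficients preserves the secret while enlarging the error, so in practice the coefficients must stay small (the exact sample-count formula is in Appendix E).

```python
import numpy as np

def combine_samples(A, b, coeffs, q):
    """Build one new LWE sample as an integer linear combination of existing ones.

    If b_j = a_j . s + e_j mod q, then sum_j c_j b_j = (sum_j c_j a_j) . s + sum_j c_j e_j mod q,
    so the new pair has the same secret s but error variance sum_j c_j^2 * sigma^2.
    """
    coeffs = np.asarray(coeffs, dtype=np.int64)
    a_new = (coeffs @ A) % q
    b_new = int(coeffs @ b) % q
    return a_new, b_new

# Example: combine four samples with coefficients +1, +1, -1, +1.
# a_new, b_new = combine_samples(A[:4], b[:4], coeffs=[1, 1, -1, 1], q=251)
```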
Comparison to Baselines. Most existing attacks on LWE such as uSVP and dual attack use an
algebraic approach that involves building a lattice from LWE instances such that this lattice contains
an exceptionally short vector which encodes the secret vector information. Attacking LWE then
involves finding the short vector via lattice reduction algorithms like BKZ [22]. For LWE with sparse
binary secrets, the main focus of this paper, various techniques can be adapted to make algebraic
attacks more efficient. [20, 11] and [23] provide helpful overviews of algebraic attacks on sparse
binary secrets. More information about attacks on LWE is in Appendix A.3.
Compared to existing attacks, SALSA’s most notable feature is its novelty. We do not claim to have
better runtime, nor do we claim the ability to attack real-world LWE problems (yet). Rather,
we introduce a new attack and demonstrate with non-toy successes that transformers can be used
to attack LWE. Given our goal, no serious SALSA speedup attempts have been made so far, but a
few simple improvements could reduce runtime. First, the slowest step in SALSA is model training,
which can be greatly accelerated by distributing it across many GPUs. Second, our transformers are
trained from scratch, so pre-training them on such basic tasks as modular arithmetic could save time
and data. Finally, the amount of training needed before the secret is recovered depends in large part
on the secret guessing algorithms. New algorithms might allow SALSA to recover secrets faster.
Since SALSA does not involve finding the shortest vector in a lattice, it has an advantage over the
algebraic attacks – with all LWE parameters fixed and in the range of SALSA, SALSA can attack the
LWE problem for a smaller modulus q than the algebraic attacks can. This is because when q is smaller,
the target vector is relatively large in the lattice and thus harder to find. For instance, in [20],
their Table 2 shows that when the block size is 45, for n = 90, their attack does not work for q less
than 10 bits, but we can handle q as small as 8 bits (Table 20).
7 Related Work
Use of ML for cryptanalysis. The fields of cryptanalysis and machine learning are closely re-
lated [57]. Both seek to approximate an unknown function F using data, although the context
and techniques for doing so vary significantly between the fields. Because of the similarity
between the domains, numerous proposals have tried to leverage ML for cryptanalysis. ML-
based attacks have been proposed against a number of cryptographic schemes, including block
ciphers [3, 63, 38, 10, 30, 12, 21], hash functions [31], and substitution ciphers [1, 64, 8]. Although
our work is the first to use recurrent neural networks for lattice cryptanalysis, prior work has used
them for other cryptographic tasks. For example, [32] showed that LSTMs can learn the decryption
function for polyalphabetic ciphers like Enigma. Follow-up works used variants of LSTMs, including
transformers, to successfully attack other substitution ciphers [1, 64, 8].
Use of transformers for mathematics. The use of language models to solve problems of mathematics
has received much attention in recent years. A first line of research explores math problems set up in
natural language. [59] investigated their relative difficulty, using LSTM [34] and transformers, while
[33] showed large transformers could achieve high accuracy on elementary/high school problems. A
second line explores various applications of transformers to formalized symbolic problems. [42]
showed that symbolic math problems could be solved to state-of-the-art accuracy with transformers.
[68] discussed their limits when generalizing out of their training distribution. Transformers have
been applied to dynamical systems [18], transport graphs [19], theorem proving [54], SAT solving
[61], and symbolic regression [13, 26]. A third line of research focuses on arithmetic/numerical
computations and has had slower progress. [52] and [51] discussed the difficulty of performing
arithmetic operations with language models. Bespoke network architectures have been proposed
for arithmetic operations [35, 66], and transformers were used for addition and similar operations
[55]. [17] showed that transformers can learn numerical computations, such as linear algebra, and
introduced the shallow models with shared layers used in this paper.
Learnability of LWE. Shalev-Shwartz, Shamir and Shammah [60] showed that some problems of
modular arithmetic cannot be solved by gradient descent. Because it uses a transformer to learn the
secret, by predicting b from a, SALSA’s scalability hinges on the solvability of LWE by gradient
descent. Here, we provide our perspective on this question.
The LWE problem amounts to (discrete) linear regression on an n-dimensional torus with radius q.
When q is infinite, LWE is pure linear regression, and gradient methods will succeed. For finite q,
gradients are informative unless the model’s prediction lies on the opposite side of the torus from
the true solution (e.g. q = 100, true value is 0, prediction is 50). At this point, the projection of the
gradient along the coordinate axis is uninformative, since either direction on the torus points towards
the true solution. For large q, this situation is uncommon, so gradient methods should work. For very
small q, this will happen more often, and gradient methods will be perturbed. In the degenerate case
q = 2, featured in [60], gradient methods will always fail.
For the value q = 251 used in SALSA, gradient methods can recover secrets (i.e. solve the modular
linear regression problem) for n up to 128 so far. Except for the case of very small q, it is unlikely
that LWE belongs to the “failing gradient” class of problems described in [60]. Any learnability
problems should disappear altogether as q increases.
Interestingly, this intuition about the size of q also appears in the classical lattice reduction approach
to solving LWE. Laine and Lauter [41] use LLL to demonstrate concrete polynomial-time attacks
against lattice problems for large q. They also explore the boundary of where these attacks start to
fail for smaller q. The intuition is that when q is large, LLL directly finds a vector which is “small
enough” compared to q to break the system. When q is small, LLL does not find a short enough
vector. This is further explored in [20], which gives concrete running times for generalizations of
LLL and quantifies the size of q where these attacks start to fail.
8 Conclusion
Discussion. In this paper, we demonstrate that transformers can be trained to perform modular
arithmetic. Building on this capability, we design SALSA, a method for attacking the LWE problem
with binary secrets, a hardness assumption at the foundation of many lattice-based cryptosystems.
We show that SALSA can break LWE problems of medium dimension (up to n = 128), which is
in the same range as problems in the Darmstadt challenge [15], although we restrict to the easier
case of sparse binary secrets. This is the first paper to use transformers to solve hard problems in
lattice-based cryptography. Future work will attempt to scale up SALSA to attack higher dimensional
lattices with more general secret distributions.
The key to scaling up to larger lattice dimensions seems to be to increase the model size, especially
the dimensions, the number of attention heads, and possibly the depth. Large architectures should
scale to higher dimensional lattices such as n = 256 which is used in practice. Density, on the other
hand, is constrained by the performance of transformers on modular arithmetic. Better representations
of finite fields could improve transformer performance on these tasks. Finally, our secret guessing
algorithms enable SALSA to recover secrets from low-accuracy transformers, therefore reducing the
data and time needed for the attack. Extending these algorithms to take advantage of partial learning
should result in better performance.
Ethics and Broader Impact. The primary value of this work is in alerting the cryptographic and
ML communities to the risk of ML-based attacks on PQC. Even if current attacks do not succeed,
we believe that providing early warning of potential threats is critical. However, we emphasize that
SALSA represents a proof of concept that cannot be used against real-world implementations (i.e.
the PQC schemes which NIST standardized on July 5, 2022). Additional scaling work would be
necessary before these techniques would be relevant to attacking real-world cryptosystems.
References
[1] Ezat Ahmadzadeh, Hyunil Kim, Ongee Jeong, and Inkyu Moon. A Novel Dynamic Attack on
Classical Ciphers Using an Attention-Based LSTM Encoder-Decoder Model. IEEE Access,
2021.
[2] Miklós Ajtai. Generating hard instances of lattice problems. In Proc. of the 28th annual ACM
symposium on Theory of Computing, 1996.
[3] Mohammed M Alani. Neuro-cryptanalysis of DES and triple-DES. In Proc. of NeurIPS, 2012.
[4] Martin Albrecht, Melissa Chase, Hao Chen, Jintai Ding, Shafi Goldwasser, Sergey Gorbunov,
Shai Halevi, Jeffrey Hoffstein, Kim Laine, Kristin Lauter, et al. Homomorphic encryption
standard. In Protecting Privacy through Homomorphic Encryption, pages 31–62. Springer,
2021.
[5] Martin Albrecht, Carlos Cid, Jean-Charles Faugère, Robert Fitzpatrick, and Ludovic Perret. On
the complexity of the BKW algorithm on LWE. Designs, Codes and Cryptography, 2015.
[6] Martin R. Albrecht. On dual lattice attacks against small-secret lwe and parameter choices in
helib and seal. In EUROCRYPT, 2017.
[7] Martin R. Albrecht, Jean-Charles Faugère, Robert Fitzpatrick, and Ludovic Perret. Lazy
Modulus Switching for the BKW Algorithm on LWE. In Public-Key Cryptography, 2014.
[8] Nada Aldarrab and Jonathan May. Can sequence-to-sequence models crack substitution ciphers?
arXiv preprint arXiv:2012.15229, 2020.
[9] Roberto Avanzi, Joppe Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky,
John M. Schanck, Peter Schwabe, Gregor Seiler, and Damien Stehlé. CRYSTALS-Kyber
(version 3.02) – Submission to round 3 of the NIST post-quantum project. 2021.
[10] Seunggeun Baek and Kwangjo Kim. Recent advances of neural attacks against block ciphers.
In Proc. of SCIS, 2020.
[11] Shi Bai and Steven D. Galbraith. Lattice Decoding Attacks on Binary LWE. In Information
Security and Privacy, 2014.
[12] Adrien Benamira, David Gerault, Thomas Peyrin, and Quan Quan Tan. A deeper look at
machine learning-based cryptanalysis. In Proc. of Annual International Conference on the
Theory and Applications of Cryptographic Techniques, 2021.
[13] Luca Biggio, Tommaso Bendinelli, Alexander Neitz, Aurelien Lucchi, and Giambattista Paras-
candolo. Neural symbolic regression that scales. arXiv preprint arXiv:2106.06427, 2021.
[14] Zvika Brakerski, Adeline Langlois, Chris Peikert, Oded Regev, and Damien Stehlé. Classical
Hardness of Learning with Errors. In Proc. of the Forty-Fifth Annual ACM Symposium on
Theory of Computing, 2013.
[15] Johannes Buchmann, Niklas Büscher, Florian Göpfert, Stefan Katzenbeisser, Juliane Krämer,
Daniele Micciancio, Sander Siim, Christine van Vredendaal, and Michael Walter. Creating
Cryptographic Challenges Using Multi-Party Computation: The LWE Challenge. In Proc. of
the 3rd ACM International Workshop on ASIA Public-Key Cryptography, 2016.
[16] Johannes A. Buchmann, Florian Göpfert, Rachel Player, and Thomas Wunderer. On the
Hardness of LWE with Binary Error: Revisiting the Hybrid Lattice-Reduction and Meet-in-the-
Middle Attack. In Proc. of AFRICACRYPT, 2016.
[17] François Charton. Linear algebra with transformers. arXiv preprint arXiv:2112.01898, 2021.
[18] François Charton, Amaury Hayat, and Guillaume Lample. Learning advanced mathematical
computations from examples. arXiv preprint arXiv:2006.06462, 2020.
[19] François Charton, Amaury Hayat, Sean T. McQuade, Nathaniel J. Merrill, and Benedetto
Piccoli. A deep language model to predict metabolic network equilibria. arXiv preprint
arXiv:2112.03588, 2021.
[20] Hao Chen, Lynn Chua, Kristin E. Lauter, and Yongsoo Song. On the Concrete Security of LWE
with Small Secret. IACR Cryptol. ePrint Arch., 2020:539, 2020.
[21] Yi Chen and Hongbo Yu. Bridging Machine Learning and Cryptanalysis via EDLCT.
Cryptology ePrint Archive, 2021.
[22] Yuanmi Chen and Phong Q. Nguyen. BKZ 2.0: Better Lattice Security Estimates. In Proc. of
ASIACRYPT 2011, 2011.
[23] Jung Hee Cheon, Minki Hhan, Seungwan Hong, and Yongha Son. A Hybrid of Dual and
Meet-in-the-Middle Attack on Sparse and Ternary Secret LWE. IEEE Access, 2019.
[24] Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. The Neural Data Router: Adap-
tive Control Flow in Transformers Improves Systematic Generalization. arXiv preprint
arXiv:2110.07732, 2021.
[25] Benjamin R. Curtis and Rachel Player. On the feasibility and impact of standardising sparse-
secret LWE parameter sets for homomorphic encryption. In Proc. of the 7th ACM Workshop
on Encrypted Computing & Applied Homomorphic Cryptography, 2019.
[26] Stéphane d’Ascoli, Pierre-Alexandre Kamienny, Guillaume Lample, and François Charton.
Deep symbolic regression for recurrent sequences. arXiv preprint arXiv:2201.04600, 2022.
[27] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
versal transformers. arXiv preprint arXiv:1807.03819, 2018.
[28] Whitfield Diffie and Martin Hellman. New directions in cryptography. IEEE transactions on
Information Theory, 1976.
[29] Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, Peter Schwabe, Gregor Seiler,
and Damien Stehlé. CRYSTALS-Dilithium – Algorithm Specifications and Supporting Docu-
mentation (Version 3.1). 2021.
[30] Aron Gohr. Improving attacks on round-reduced speck32/64 using deep learning. In Proc. of
Annual International Cryptology Conference, 2019.
[31] Sergij V Goncharov. Using fuzzy bits and neural networks to partially invert few rounds of
some cryptographic hash functions. arXiv preprint arXiv:1901.02438, 2019.
[32] Sam Greydanus. Learning the enigma with recurrent neural networks. arXiv preprint
arXiv:1708.07576, 2017.
[33] Kaden Griffith and Jugal Kalita. Solving Arithmetic Word Problems with Transformers and
Preprocessing of Problem Text. arXiv preprint arXiv:2106.00893, 2021.
[34] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation,
1997.
[35] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint
arXiv:1511.08228, 2015.
[36] Ravi Kannan. Minkowski’s Convex Body Theorem and Integer Programming. Mathematics of
Operations Research, 12(3):415–440, 1987.
[37] Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. Deep Encoder,
Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation. arXiv
preprint arXiv:2006.10369, 2020.
[38] Hayato Kimura, Keita Emura, Takanori Isobe, Ryoma Ito, Kazuto Ogawa, and Toshihiro
Ohigashi. Output Prediction Attacks on SPN Block Ciphers using Deep Learning. IACR
Cryptol. ePrint Arch., 2021.
[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[40] Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation
models. In Conference of the Association for Machine Translation in the Americas. Springer,
2004.
[41] Kim Laine and Kristin Lauter. Key recovery for lwe in polynomial time. Cryptology ePrint
Archive, 2015.
[42] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv
preprint arXiv:1912.01412, 2019.
[43] H.W. jr. Lenstra, A.K. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients.
Mathematische Annalen, 261:515–534, 1982.
[44] Richard Lindner and Chris Peikert. Better Key Sizes (and Attacks) for LWE-Based Encryption.
In Topics in Cryptology – CT-RSA 2011, 2011.
[45] Mingjie Liu and Phong Q. Nguyen. Solving BDD by Enumeration: An Update. In Ed Dawson,
editor, Topics in Cryptology – CT-RSA 2013, 2013.
[46] Vadim Lyubashevsky and Daniele Micciancio. On Bounded Distance Decoding, Unique Shortest
Vectors, and the Minimum Distance Problem. In Advances in Cryptology, 2009.
[47] Daniele Micciancio. On the Hardness of Learning With Errors with Binary Secrets. Theory of
Computing, 2018.
[48] Daniele Micciancio and Oded Regev. Lattice-based cryptography. Post-Quantum Cryptography,
pages 147–191, 2009.
[49] Daniele Micciancio and Panagiotis Voulgaris. Faster exponential time algorithms for the shortest
vector problem. Proc. of the 2010 Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA), 2010.
[50] Victor S Miller. Use of elliptic curves in cryptography. In Conference on the Theory and
Application of Cryptographic Techniques. Springer, 1985.
[51] Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers
with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021.
[52] Theodoros Palamas. Investigating the ability of neural networks to learn simple modular
arithmetic. 2017.
[53] Chris Peikert. Public-Key Cryptosystems from the Worst-Case Shortest Vector Problem: Ex-
tended Abstract. In Proc. of the Forty-First Annual ACM Symposium on Theory of Computing,
2009.
[54] Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving.
arXiv preprint arXiv:2009.03393, 2020.
[55] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking:
Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint
arXiv:2201.02177, 2022.
[56] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. In Proc.
of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, 2005.
[57] Ronald L Rivest. Cryptography and machine learning. In International Conference on the
Theory and Application of Cryptology, pages 427–439. Springer, 1991.
[58] Ronald L Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures
and public-key cryptosystems. Communications of the ACM, 1978.
[59] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing Mathematical
Reasoning Abilities of Neural Models. arXiv preprint arXiv:1904.01557, 2019.
[60] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep
learning. In Proc. of ICML, 2017.
[61] Feng Shi, Chonghan Lee, Mohammad Khairul Bashar, Nikhil Shukla, Song-Chun Zhu, and
Vijaykrishnan Narayanan. Transformer-based Machine Learning for Fast SAT Solvers and
Logic Synthesis. arXiv preprint arXiv:2107.07116, 2021.
[62] Peter W Shor. Algorithms for quantum computation: discrete logarithms and factoring. In
Proceedings 35th Annual Symposium on Foundations of Computer Science, 1994.
[63] Jaewoo So. Deep learning-based cryptanalysis of lightweight block ciphers. Security and
Communication Networks, 2020.
[64] Shivin Srivastava and Ashutosh Bhatia. On the Learning Capabilities of Recurrent Neural
Networks: A Cryptographic Perspective. In Proc. of ICBK, 2018.
[65] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Proc. of NeurIPS, 2014.
[66] Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural
arithmetic logic units. arXiv preprint arXiv:1808.00508, 2018.
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
[68] Sean Welleck, Peter West, Jize Cao, and Yejin Choi. Symbolic Brittleness in Sequence Models:
on Systematic Generalization in Symbolic Mathematics, 2021.