
Towards Practical Attacks on Argon2i and Balloon Hashing

Joël Alwen, IST Austria
Jeremiah Blocki, Purdue University

Abstract—The algorithm Argon2i-B of Biryukov, Dinu and Khovratovich is currently being considered by the IRTF (Internet Research Task Force) as a new de-facto standard for password hashing. An older version (Argon2i-A) of the same algorithm was chosen as the winner of the recent Password Hashing Competition. An important competitor to Argon2i-B is the recently introduced Balloon Hashing (BH) algorithm of Corrigan-Gibbs, Boneh and Schechter. A key security desideratum for any such algorithm is that evaluating it (even using a custom device) requires a large amount of memory amortized across multiple instances. Alwen and Blocki (CRYPTO 2016) introduced a class of theoretical attacks against Argon2i-A and BH. While these attacks yield large asymptotic reductions in the amount of memory, it was not, a priori, clear if (1) they can be extended to the newer Argon2i-B, (2) the attacks are effective on any algorithm for practical parameter ranges (e.g., 1GB of memory) and (3) they can be effectively instantiated against any algorithm under realistic hardware constraints.

In this work we answer all three of these questions in the affirmative for all three algorithms. It is also the first work to analyze the security of Argon2i-B. In more detail, we extend the theoretical attacks of Alwen and Blocki (CRYPTO 2016) to the recent Argon2i-B proposal, demonstrating severe asymptotic deficiencies in its security. Next we introduce several novel heuristics for improving the attack's concrete memory efficiency even when on-chip memory bandwidth is bounded. We then simulate our attacks on randomly sampled Argon2i-A, Argon2i-B and BH instances and measure the resulting memory consumption for various practical parameter ranges and for a variety of upper bounds on the amount of parallelism available to the attacker. Finally we describe, implement and test a new heuristic for applying the Alwen-Blocki attack to functions employing a technique developed by Corrigan-Gibbs et al. for improving the concrete security of memory-hard functions.

We analyze the collected data and show the effects various parameters have on the memory consumption of the attack. In particular, we can draw several interesting conclusions about the level of security provided by these functions:

• For the Alwen-Blocki attack to fail against practical memory parameters, Argon2i-B must be instantiated with more than 10 passes on memory. Yet the current IRTF proposal designates just 6 passes as the recommended "paranoid" setting.
• More generally, the parameter selection process in the proposal is flawed in that it tends towards producing parameters for which the attack is successful (even under realistic constraints on parallelism).
• The technique of Corrigan-Gibbs for improving security can also be overcome by the Alwen-Blocki attack under realistic hardware constraints.
• On a positive note, both the asymptotic and concrete security of Argon2i-B seem to improve on that of Argon2i-A.

1. Introduction

The goal of key-stretching is to protect low-entropy secrets (e.g., passwords) against brute-force attacks. A good key-stretching algorithm should satisfy the properties that (1) an honest party can compute a single instance of the algorithm on standard hardware for a moderate cost, and (2) the amortized cost of computing the algorithm on multiple instances on customized hardware is not (significantly) reduced. The first property ensures that it is possible for honest parties (who already know the secret) to execute the algorithm, while the latter property ensures that it is infeasible for an adversary to execute a brute-force attack with millions/billions of different guesses for the user's secret. Key-stretching techniques like hash iteration (e.g., bcrypt) fail to achieve the latter property, as the cost of evaluating hash functions like SHA256 can be dramatically reduced by building Application Specific Integrated Circuits (ASICs).

Memory hard functions (MHFs), first explicitly introduced by Percival [Per09], are a promising key-stretching tool for achieving property two. In particular, MHFs are motivated by the observation that the cost of storing/retrieving items from memory is relatively constant across different computer architectures. Data-Independent Memory Hard Functions (iMHFs) are an important variant of MHFs due to their greater resistance to side-channel attacks,1 making them the recommended type of MHF for password hashing. Indeed, most of the entrants to the recent Password Hashing Competition [PHC], which had the stated aim of finding a new password hashing algorithm, claimed some form of memory-hardness.

1. Standard MHFs (e.g., Argon2d [BDK16], scrypt [Per09]) are potentially vulnerable to security and privacy compromises due to cache-timing attacks [Ber], [FLW13].

Finally, to accommodate a variety of devices and applications, an MHF is equipped with a "memory parameter" (and sometimes also a "timing parameter") which fixes the amount of memory (and computation, respectively) used by the honest evaluation algorithm. Ideally, the necessary amortized memory (and computation) of an adversary should also scale linearly in these parameters. Thus they can be thought of as natural security parameter(s) for the MHF.

In this work we focus on three of the most prominent iMHFs from the literature:

1) The winner of the PHC, which we call Argon2i-A [BDK15].
2) A significantly updated version, which we refer to as Argon2i-B [BDKJ16], which is currently being considered by the Crypto Forum Research Group (CFRG) of the IRTF as their first proposal of an MHF for wider use in Internet protocols.
3) A prominent competitor algorithm to Argon2 called Balloon Hashing (BH) [CGBS16a].

We remark that an important part of the IRTF proposal is a recommendation for how to choose the parameters when instantiating Argon2. In particular, it describes how to select a memory parameter σ and a timing parameter τ, with the effect that the honest algorithm for evaluating Argon2i-B builds a table of size σ in memory and iterates over it τ times.

1.1. Attacking an MHF

In the context of an MHF, an "attack" is an evaluation algorithm with lower (possibly amortized) complexity than the honest evaluation algorithm. The quality of such an attack is the ratio between the honest algorithm's complexity on a single instance and the attack's (amortized) complexity. We refer to an evaluation algorithm with quality greater than 1 as an attack.

Historically, the "complexity" of an attack has referred to the product of the total runtime and the largest amount of memory used at any given point during the execution, as this is often believed to be a good estimate of the AT (Area × Time) complexity of implementing the execution in hardware [Per09], [AS15], [BK15], which in turn provides an estimate for the cost of constructing the hardware [Tho79]. In this work, we have instead measured the energy complexity [AB16] of an execution, which approximates the energy consumed by the attack. As discussed in [AB16], this approximates the running cost of hardware implementing the execution.2 Nevertheless, for the class of attacks considered in this work, energy complexity also tightly approximates the (amortized) AT complexity of the attack [AB16].

2. While the construction cost of the hardware is a one-off cost which can be amortized across all evaluations ever performed on the device, the running cost is a recurring charge per instance and so seems at least as important from the point of view of an attacker evaluating the effectiveness of the attack.

1.1.1. Argon2 History. The Argon2 specification [BDK16] has already undergone 4 revisions since its first publication with, at times, very non-trivial changes to "Argon2". Unfortunately, all of these versions are regularly referred to simply as "Argon2" without further specification. In particular, between Argon2i-A (Version 1 in [BDK16]) and Argon2i-B (Version 1.3 in [BDK16]) the edge distribution (which describes how intermediary values relate to each other) has been altered twice. However, the edge distribution is probably the single most important feature of any MHF in determining its memory-hardness, i.e. its security. That is, altering the edge distribution of an MHF can make the difference between optimal security and being completely broken. Thus, we have made an effort to distinguish at least between the original edge structure in the PHC Argon2i-A and the edge structure used in the IRTF proposal Argon2i-B.

To the best of our knowledge there are three known attacks against Argon2-A and none on Argon2-B. While the first attack was against Argon2-A [BK15], the two more recent ones took aim at Argon2i-A [CGBS16a], [AB16].3

3. Due to the frequent naming collision between versions it is not possible to unambiguously determine the precise version considered in these attacks. However, all results seem to have appeared before the newest edge structure, used in Argon2-B, was published [BDK16] and definitely before the IRTF proposal was made [BDKJ16]. Thus the history presented here reflects the best guess of the authors based on the dates of revisions and when the different attacks were published.

1.1.2. Balloon Hashing History. One of the most important alternate proposals for an iMHF besides Argon2i is the Balloon Hashing (BH) algorithm of [CGBS16a]. To date the only known attack on BH is [AB16].4

4. Roughly, BH is a special case of the "random graph" family of iMHFs considered in [AB16] where indegree δ = 3.
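The contrast drawn in Section 1.1 between AT complexity (runtime times peak memory) and energy complexity can be made concrete on a toy execution trace. The sketch below is purely illustrative: the traces and the small R value are invented for the example (the paper's simulations use R = 3000), and the function names are ours.

```python
# Toy illustration of the two cost measures discussed in Section 1.1.
# mem[i]:   number of labels held in memory during step i
# calls[i]: number of compression-function calls made during step i
# R:        core-memory energy ratio (cost of one call, in units of
#           "store one label for one step")

def at_complexity(mem):
    """Classical measure: total runtime times peak memory usage."""
    return len(mem) * max(mem)

def energy_complexity(mem, calls, R):
    """Energy measure [AB16]: memory held per step plus R per call."""
    return sum(mem) + R * sum(calls)

# A trace that briefly spikes to high memory is punished heavily by the
# AT measure but only mildly by the energy measure.
mem = [1, 2, 100, 2, 1]
calls = [1, 1, 1, 1, 1]
print(at_complexity(mem))                  # 5 * 100 = 500
print(energy_complexity(mem, calls, R=3))  # 106 + 3 * 5 = 121
```

This is why the two measures can in principle rank evaluation algorithms differently; as noted above, for the class of attacks considered in this work the two happen to agree closely.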

Another interesting contribution of the same authors appeared in an earlier version of that work [CGBS16b], where they introduced a new technique for constructing iMHFs which we refer to as the XOR-and-high-indegree method. They showed how it could be used to overcome the new attack on Argon2i-A described in the same work. The technique was also instantiated in the BH-Double-Buffer (BH-DB) iMHF of [CGBS16b].

1.2. How Practical is AB16?

While there seems to be little debate about the effectiveness and practical consequences of [BK15], [CGBS16a], the same cannot be said for [AB16]. On the one hand, the authors of [AB16] proved that the asymptotic complexity of their attack is far lower than that of the honest evaluation algorithm as a function of the memory and timing parameters (not just for Argon2i-A but also for the other candidate iMHFs, including a precursor to BH, BH-DB and others). In other words, at least in theory, the quality of their attack grows (very quickly) in the natural security parameters of Argon2i-A (and BH), indicating severe problems with those constructions from a theoretical perspective.

However, as observed by the Argon2 authors [BDK16], due to the constants in the asymptotic expressions shown in [AB16], when plugging in practical parameters, already τ ≥ 4 passes on memory suffice to thwart the attack, even for large practical values of σ.5 Moreover, it has been observed [Kho16] that, taken at face value, implementing the attack in hardware would require several times the amount of memory bandwidth currently possible with modern technology. Thus it may seem, as claimed by the Argon2 authors [BDK16] and others [Aum16], that the [AB16] attack does not present a threat to the real-world security of Argon2-B (or BH). Indeed, in contrast to the attacks of [BK15], [CGBS16a], the [AB16] attack is omitted from the security analysis in the IRTF proposal [BDKJ16].

5. The fewer passes on memory the better, since extra passes slow down the evaluation of an iMHF without increasing the memory required to evaluate it.

However, despite these observations, there may yet be reasons for concern that [AB16] could behave far better in practice:

1) Due to the rigors imposed by proving statements, in [AB16] several pessimistic assumptions (detailed in Section 6) were made, potentially resulting in far worse cost estimates for their algorithm than might be exhibited in any actual instantiation.
2) The algorithm of [AB16] is equipped with a variety of variables and parameters. Due to the focus on asymptotics, no effort was made to optimize them. Instead, asymptotically optimal values were used. However, for any given concrete instance of Argon2i it is likely that the asymptotically optimal settings do not actually result in the lowest complexity of the resulting evaluation algorithm.
3) The authors did not attempt to investigate any heuristics for improving concrete complexity. Indeed, many such heuristics are far easier to implement (and test) than to analyze in theory, making them bad candidates for that work but potentially still a concern in practice.
4) No work was done to examine the behavior of the attack in models of bounded (e.g. practical) parallelism or memory bandwidth.

In fact, it is perhaps somewhat surprising (not to mention disconcerting) that despite all of these omissions and pessimistic assumptions such undesirable asymptotics were still displayed against both Argon2i-A and BH.

Ultimately we are left with a rather incomplete understanding of the practicality of [AB16], especially with respect to Argon2i-B and the CFRG proposal (not to mention the other iMHFs considered in [AB16]).

1.3. Our Contribution

In this work we attempt to make progress on this front. The results in this work can be summarized as follows:

• We analyze the asymptotic complexity of the Alwen-Blocki attack [AB16] when applied to Argon2i-B, demonstrating its strong asymptotic effectiveness.
• We introduce two definitive improvements to the attack of [AB16], which apply to Argon2i-A, Argon2i-B and BH. The first improvement reduces the size of the depth-reducing set S in the attack. Here, the depth-reducing set is a set S of nodes such that by removing these nodes from G, the directed acyclic graph (DAG) representing data-dependencies between the memory blocks produced during the evaluation of the iMHF, the depth of the resulting DAG is small. The second improvement reduces the number of "expensive" steps necessary to execute the attack.
• We give new details about the resources (e.g., bandwidth, #cores) necessary to implement the attack of [AB16] in a custom device. This helps determine for which parameter spaces the attack is feasible using modern fabrication technology.
• We implement a simulation of the [AB16] attack in C for the case of Argon2i-A, Argon2i-B and BH together with several new heuristics for decreasing its concrete complexity.

• We implement two methods for optimizing the parameters and internal variables of [AB16] so as to obtain minimal complexity for given target σ and τ. The first method makes use of arbitrary parallelism while the second is aimed at a model with an upper bound on the available parallelism.
• We measure the resulting complexity of the simulation for a variety of practical parameter ranges and bounds on parallelism, and we describe and analyze these results. In particular we highlight several concerns with the parameter choices in the CFRG proposal. We also highlight several new questions posed by our results.

2. Preliminaries

We begin with a brief review of Argon2i-A, Argon2i-B, BH and the evaluation algorithm of [AB16]. All three iMHFs can be viewed as modes of operation over a compression function. We use the language of "graph labeling" to describe the functions (as in [DKW11], [AS15], [AB16], for example). That is, for given memory and time parameters σ and τ respectively, each iMHF is given by a directed acyclic graph Gσ,τ = (V, E) on n = σ ∗ τ nodes. We number the nodes according to V = {1, 2, . . . , n} = [n]. Each node represents an intermediary value in the computation of the iMHF. For a given input we refer to such a value as the "label" of the node. By selecting a salt value we effectively fix such a graph Gσ,τ from a particular distribution characterizing the iMHF.

2.1. The iMHFs

All three algorithms support an extra integer parallelism parameter p > 0 in order to better support honest users with multi-core machines. We now describe the case with p = 1. For a discussion of the more general case we refer to Section 6.8.

In all three cases, all nodes in V = {1, . . . , n} are initially connected by a path {(i, i + 1) : i ∈ [n − 1]} running through the entire graph. For Argon2i-A, each node v ∈ V \ {1} receives an additional incoming edge from a uniform random and independently chosen predecessor u ← [max{v − σ, 1}, v − 1]. Similarly, for BH we add 2 such uniform and independently chosen random edges.

In the case of Argon2i-B the distribution of the random edges is somewhat more complicated. However, for the purpose of this work, it suffices that the following property holds (which can be easily verified from the specification [BDK16]). For any node j ≤ σ and c ≥ 1 we have that

    Pr[parents(j) ∈ [j − j/c, j − 1]] ∝ 1/√c

while for j > σ and c ≥ 1 we have that

    Pr[parents(j) ∈ [j − σ/c, j − 1]] ∝ 1/√c.

Given a fixed graph and input we can now compute the corresponding iMHF. First the input (password, salt, etc.) is hashed once to produce the label of node 1. Each subsequent label is computed by applying the compression function to the labels of the parents of the node. The final output of the iMHF is then obtained from the label of node n. For further details on each of these algorithms we refer the interested reader to the original specifications.

2.2. The AB16 Algorithm

We describe an evaluation algorithm for an iMHF using the language of graph pebbling [HP70], [DKW11], [AB16]. Placing a pebble on a node v denotes the act of computing the "label" of v, that is, the intermediary value represented by the node v, which is computed by applying the compression function to the intermediary values of all nodes with outgoing edges leading to v. Keeping a pebble on node v at iteration i denotes storing the label of v in memory at step i. Conversely, removing a pebble from a node denotes freeing the corresponding memory location. Clearly a pebble can only be placed at an iteration i if all parent nodes of v already have a pebble on them at the end of iteration i − 1.6 Since we are considering evaluations in a parallel model of computation, we allow for multiple pebbles to be placed simultaneously as long as each node receiving a pebble already has all its parents pebbled at the beginning of that iteration. In a model with an upper bound of U ∈ N on parallelism we only permit up to U pebbles to be placed simultaneously in any given iteration.

6. Indeed, an intermediary value can only be computed if all the necessary inputs to the compression function are already stored in memory.

Using this language, an evaluation algorithm for Argon2i is given by a pebbling of Gσ,τ, that is, a sequence P = (P0, P1, P2, . . . , Pz) of subsets of V (denoting which nodes contain a pebble at the end of each iteration) such that:

1) P0 = ∅
2) ∀v ∈ Pi \ Pi−1 it holds that parents(v) ⊆ Pi−1
3) n ∈ Pz.

To determine the quality of an evaluation algorithm we must establish a complexity measure.
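The graph-labeling view of Section 2.1 can be sketched in a few lines. The following is a toy model only: it uses the Argon2i-A edge distribution for p = 1 as described above, with SHA-256 as a stand-in for the real Blake2b-based compression function; the function names and parameters are ours, not from the specifications.

```python
import hashlib
import random

def sample_argon2iA_graph(sigma, tau, seed=0):
    """Parent lists for a toy Argon2i-A-style DAG with p = 1: node v gets
    the path edge from v - 1 plus one uniform random predecessor drawn
    from [max(v - sigma, 1), v - 1], as described in Section 2.1."""
    rng = random.Random(seed)
    n = sigma * tau
    parents = {1: []}
    for v in range(2, n + 1):
        parents[v] = [v - 1, rng.randint(max(v - sigma, 1), v - 1)]
    return parents

def compress(*labels):
    """Toy stand-in for the real (Blake2b-based) compression function."""
    h = hashlib.sha256()
    for lab in labels:
        h.update(lab)
    return h.digest()

def evaluate(parents, pwd_and_salt):
    """Honest evaluation: hash the input to get the label of node 1, then
    label nodes 2..n in order from their parents' labels; output label n."""
    n = len(parents)
    label = {1: compress(pwd_and_salt)}
    for v in range(2, n + 1):
        label[v] = compress(*(label[u] for u in parents[v]))
    return label[n]

parents = sample_argon2iA_graph(sigma=8, tau=2, seed=42)
tag = evaluate(parents, b"password|salt")
```

Note that the salt fixes the graph once and for all, while an attacker who recomputes labels instead of storing them evaluates the same function under a different pebbling schedule; that freedom is exactly what the [AB16] attack exploits.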

For this we use the energy (pebbling) complexity (EC) [AS15], which is parametrized by the core-memory energy ratio R. This is the ratio between the cost of evaluating one call to the compression function and storing one label for an equivalent amount of time.7 For a given ratio R, the EC is defined to be

    ecR(P) := Σ_{i∈[z]} ( |Pi| + R ∗ |Pi \ Pi−1| ).

Intuitively it captures the total amount of energy consumed (to store memory and evaluate the compression function) during the execution. As shown in [AS15], EC scales linearly in the number of instances being evaluated. Moreover, in the case of the [AB16] attack, it also turns out to be very close to the amortized AT complexity of the algorithm. In particular, when computing several instances in parallel, it is easy to implement [AB16] such that essentially all memory cells and all compression function circuits are almost always in use.8

7. In our implementations we used a ratio of R = 3000, which is given in [BK15] as the Argon2 authors' estimate for the case of the compression function used in their design. Regardless, the results in this work are not particularly sensitive to the precise value of R, as the calls to the compression function represent only a comparatively small proportion of the total cost.

8. For a more formal treatment of these notions and algorithms we recommend looking at [AS15], [AB16].

2.2.1. Quality of an Algorithm. In order to evaluate the quality of an evaluation algorithm (and in particular to determine if it is an attack) we compare it to the EC of the honest (reference) algorithm (e.g. the one proposed in [BDK16]). Fortunately, for all three algorithms we consider, the honest algorithm is quite simple. It computes one label at a time, storing them in a table of σ values. It iterates over the table τ times, updating the values as it passes over them. Thus, in each case the final EC is

    σ²(τ − 1) + Σ_{i∈[σ]} i + R ∗ τσ.

2.2.2. The AB16 Strategy. The evaluation algorithm of [AB16] is parametrized by a node set S ⊂ V and an integer g ≥ d, where d = depth(Gσ,τ − S) is the number of edges in the longest (directed) path in the graph obtained by removing S (and incident edges) from Gσ,τ. The algorithm consists of two main subroutines: the light phase and the balloon phase. Each light phase lasts for g steps, and upon completion a new light phase is immediately started. During the final d steps of each light phase the algorithm also executes a balloon phase in parallel.

Intuitively, the purpose of the jth light phase is to pebble the target nodes Tj = {(j − 1)g + 1, . . . , jg} ⊆ V one iteration at a time. That is, in any given iteration i only a single pebble is placed due to the light phase, namely on node i ∈ Tj. As an invariant, the algorithm guarantees that at the beginning of iteration i all of node i's parents contain a pebble. To reduce memory costs during a light phase, the algorithm discards unnecessary pebbles on nodes that are not in the set S and are not parents of any target node in Tj.

Of course, before we can begin the (j + 1)st light phase we need to first recover the pebbles on the parents Rj+1 = parents(Tj+1) ∩ [(j + 1)g − 1] of any nodes in the target set Tj+1.9 To accomplish this the algorithm makes use of the jth balloon phase. The key intuition is that because we never discard pebbles on the set S we can recover all of Rj+1 in at most d steps, as follows. The jth balloon phase runs concurrently to the final d steps of the jth light phase. At each iteration, the balloon phase greedily pebbles every needed node it (legally) can. A node v is "needed at iteration i" if it is not currently pebbled but it must first be pebbled in order to place a pebble on some node in Rj+1. In other words, a node v is needed if satisfying the invariant for the upcoming light phase requires first pebbling v.10 At the end of the final iteration of the jth balloon phase all pebbles are removed which are not in S ∪ Rj+1. In particular, once a node v ∈ S is pebbled it remains pebbled for the duration of the execution. Moreover, the algorithm is always done in time n.

9. The parents of Tj+1 which are greater than (j + 1)g − 1 do not need to be recovered, since they will anyway be pebbled during the (j + 1)st light phase.

10. More formally, node v is needed if there exists an unpebbled directed path from v to some node in Rj+1.

To see why the invariant holds (and thus why the pebbling strategy is legal and so corresponds to an evaluation algorithm) we refer the interested reader to [AB16]. However, the core idea is that the jth balloon phase, running for d steps with S ∩ [jg − d] initially pebbled, always has enough time to eventually pebble every node in [jg] using this greedy strategy, since no path it must pebble is longer than d edges. Thus all parent nodes needed by the next light phase (besides the ones it pebbles itself) will be pebbled at the end of the jth balloon phase. For the formal definition of the algorithm we refer the reader to Algorithm 1 and Algorithm 2 in Appendix A, while further details (including its correctness and complexity) can be found in [AB16].

2.3. Outline of the Results

We describe the hardware constraints of implementing the attack in Section 3. The analysis of the asymptotic quality of the [AB16] attack applied to Argon2i-B can be found in Section 4.
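The pebbling rules and the energy complexity ec_R from Section 2.2 translate directly into code. A minimal sketch (toy 4-node graph and toy values; in particular R = 3 here rather than the paper's R = 3000):

```python
def is_legal_pebbling(parents, P):
    """Check the three conditions from Section 2.2: P[0] is empty, a new
    pebble needs all its parents pebbled in the previous round, and the
    final node n = len(parents) ends up pebbled."""
    n = len(parents)
    if P[0] != set():
        return False
    for i in range(1, len(P)):
        for v in P[i] - P[i - 1]:
            if not set(parents[v]) <= P[i - 1]:
                return False
    return n in P[-1]

def ec(P, R):
    """Energy complexity ec_R(P) = sum_i |P_i| + R * |P_i \\ P_{i-1}|."""
    return sum(len(P[i]) + R * len(P[i] - P[i - 1]) for i in range(1, len(P)))

# Honest strategy on a tiny 4-node DAG: pebble 1..n in order, keep all.
parents = {1: [], 2: [1], 3: [2, 1], 4: [3, 2]}
P = [set()]
for v in range(1, 5):
    P.append(P[-1] | {v})

print(is_legal_pebbling(parents, P))  # True
print(ec(P, R=3))                     # (1+2+3+4) + 3*4 = 22
```

For σ = 4 and τ = 1 this agrees with the honest-algorithm cost formula of Section 2.2.1, which for τ = 1 reduces to Σ_{i∈[σ]} i + R ∗ σ = 10 + 12 = 22.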

To explore the practicality of this strategy we implemented it in C. That is, for a given σ and τ, the code first samples a fresh Argon2i graph (and builds lists of all parents and children of nodes in the graph). Then it constructs a depth-reducing set S ⊂ [n] for G and selects an appropriate integer g. The precise method for this is described below and depends on whether an additional parallelism bound U is given as input. Next the code simulates an execution of the [AB16] algorithm, keeping track of the energy complexity of the execution.11

11. We remark that the cost of sampling the graph, the set S and g, as well as building the parent and children lists, is not included in the final complexity. This reflects the fact that such a computation need only be performed a single time for a given salt value and the result can be reused for each subsequent input (e.g. password guess), making the amortized contribution of those steps tend quickly towards 0.

We also implemented several heuristics new to [AB16] for improving the complexity of the executions, with the following intuitive goals:

1) Choose the set S in a smarter way (Section 5.1).
2) Reduce the cost of a Balloon Phase by maintaining a tighter upper bound on the number of steps needed to complete each balloon phase (Section 5.2).
3) Reduce the cost of a Light Phase by combining the [CGBS16a] attack with [AB16] (Section 5.3).

We analyze the results from the executions in Section 6. In particular, the multi-lane variants (where p > 1) of Argon2i are discussed in Section 6.8. Finally, we implemented and tested the XOR-and-high-indegree countermeasure of [CGBS16b] to determine its effectiveness at preventing the [AB16] attack (Appendix 3).

3. Parallelism

In this section we give an example showing how to translate concrete parameters for the [AB16] attack into requirements on the hardware used to implement it. We consider various aspects such as chip size and memory bandwidth. As a reference point, we also give some specifications of consumer-grade hardware currently available on the market. We show that the requirements imposed by the [AB16] attack (even for the case of unbounded parallelism) are either already feasible or else will be so in the near future. At the very least we see that the upper bounds on parallelism used in some of our stricter experiments, which nevertheless resulted in attacks on practical parameters of the iMHFs, can quite readily be realized with modern semiconductor technology. As we show, it is possible to modify the [AB16] attack to control for the amount of parallelism when constructing the depth-reducing set S.

3.1. Chip Area and Memory Size

The attacks of [AB16] require parallel computation of the underlying compression function H during the balloon phase. In particular, [AB16] divides the nodes in the DAG into layers, and each layer is then divided into segments of consecutive nodes. Within each layer each segment is re-computed in parallel (the depth-reducing set S eliminates in-layer edges so that it is possible to pebble each segment in parallel). Thus, to implement this attack on chip one would need one core (e.g., a Blake2b core) for each segment in a layer. In the theoretical analysis of [AB16] the graph was divided into n^{1/4} layers, each of size n^{3/4}. Each layer in turn was divided into √n segments of n^{1/4} consecutive nodes. Thus, we would need 2^{11} cores to attack τ = 4-pass Argon2i-A with n = 2^{22} nodes (1KB × n/τ = 1GB of memory).

The underlying compression functions of both versions of Argon2i are based on the Blake2b hash function (though they do differ somewhat from each other). A Blake2b implementation on chip is estimated [BDK16] to take about 0.1mm² of space, and DRAM takes about 550mm² per GB. Thus, 5,500 Blake2b cores would take approximately the same amount of space on chip as 1GB of DRAM. Thus, we would have space to fit the 2^{11} < 5,500 Blake2b cores needed for the [AB16] attack on τ = 4-pass Argon2i with 1GB of memory. However, as parallelism increases so does the required on-chip memory bandwidth (we need to send each core the appropriate values to be hashed during each cycle). In all of the Argon2i-B instances we evaluated, the optimal attack parameters never required more than 1,323 cores (even without explicitly controlling for parallelism). Thus, space for Blake2b cores does not appear to be a bottleneck for our attacks. However, as parallelism increases so does the required memory bandwidth. Thus, parallelism would be bounded by the maximum memory bandwidth of our chip.

3.2. Memory Bandwidth

In general we will use bw to denote the bandwidth required to keep a single Blake2b core active. Thus, if we need p cores to instantiate the balloon phase then the chip must have total bandwidth at least p × bw, or else memory bandwidth will become a limiting bottleneck. Implemented on a 4-core machine with 8 threads, Argon2i-B uses memory bandwidth 5.8GB/s [BDK16, Table 4]. This suggests that we would need memory bandwidth bw ≈ (5.8GB/s)/4 = 1.45GB/s for each Blake2b core to keep pace. While this number may vary across different architectures, we will use 1.5GB/s as a reference point in our discussion.
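The back-of-the-envelope estimates of Sections 3.1 and 3.2 can be reproduced as follows. All constants below are the figures cited in the text (the [BDK16] area and bandwidth estimates), not independent measurements:

```python
# Reproduces the hardware estimates of Sections 3.1-3.2 (figures as
# cited in the text, not independently measured).

n = 2**22                       # nodes: 1GB of 1KB blocks over tau = 4 passes
cores_needed = round(n ** 0.5)  # one Blake2b core per segment: sqrt(n) = 2^11

blake2b_area_mm2 = 0.1          # estimated on-chip area of one Blake2b core
dram_area_mm2_per_gb = 550      # estimated DRAM area per GB
cores_per_gb_area = round(dram_area_mm2_per_gb / blake2b_area_mm2)

bw_per_core = 1.5               # GB/s reference point from Section 3.2
total_bw_needed = cores_needed * bw_per_core

print(cores_needed)             # 2048: well below the 5500-core area budget
print(cores_per_gb_area)        # 5500
print(total_bw_needed)          # 3072.0 GB/s of total memory bandwidth
```

Even roughly 2,000 cores fit comfortably in the silicon budget, but they would demand about 3TB/s of bandwidth, several times the roughly 1TB/s HBM2 figure discussed next, which is why bandwidth rather than chip area is the binding constraint on parallelism.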

was 400GB/s. Currently, the AMD Radeon R9 Fury graphics cards have a memory bandwidth of 512GB/s [Wal15]. However, recently Samsung has begun production of its High Bandwidth Memory 2 (HBM2) chips which will allow for memory bandwidths of well over 1TB/s [Wal16]. Thus, it would be possible to support parallelism up to p = 666 ≈ (1TB/s)/(1.5GB/s) on a current GPU and even p = 1000 in the near future.

In both versions of Argon2i the balloon (and light) phase memory read patterns are pseudo-random but deterministic (i.e. predictable). This potentially allows an attacker to significantly leverage prefetching techniques. Furthermore, the memory write pattern is deterministic and has very good locality.12

12. In particular balloon phases spend most of their time simply walking along segments in the graph. Only comparatively rarely do they traverse an edge leading to a new layer.

4. Theoretical Analysis of Argon2i-B

Alwen and Blocki [AB16] presented an attack on Argon2i-A, but their paper does not analyze the newest version Argon2i-B — the version from the IRTF proposal [BDKJ16]. In this section we show how to extend the attacks of [AB16] to Argon2i-B. More specifically, Theorem 4.1 shows that we can get attack quality Θ(N^{0.2}) by tuning our attack parameters appropriately.

Theorem 4.1. Let Argon2i-B parameters τ = O(1) and p = O(1) be given and let N = τσ. Then there is an attack A on Argon2i-B with E-quality(A) = Θ(N^{0.2}).

PROOF. (sketch) For simplicity we assume τ = 1 and p = 1, though the ideas in our analysis easily extend to any constant values τ and p. By [AB16] it suffices to show how to construct a set S of size |S| = Θ(N^{4/5}) such that depth(G − S) = N^{3/5}. Then we can simply run the algorithm GenPeb(G, S, g, d) with parameters g = N^{4/5} and d = N^{3/5}. GenPeb has complexity |S|N + gN + dN^2/g = O(N^{1.8}) [AB16] so the attack has quality O(N^2 / N^{1.8}) = O(N^{0.2}).

To construct the set S we partition the nodes 1, ..., N into equal sized layers L1, ..., L_{N^{2/5}}, each containing N^{3/5} consecutive nodes. For each node j ≤ N we add j to the set S if either j ≡ 0 mod N^{1/5} or if both of j's parents are in the same layer as j. We have Pr[parents(j) ∈ Li] ∝ 1/√i, thus

  E[|S|] ≤ N^{4/5} + Σ_{i=1}^{N^{2/5}} Σ_{j ∈ Li} Pr[parents(j) ∈ Li]
        = N^{4/5} + N^{3/5} · Σ_{i=1}^{N^{2/5}} O(1/√i) = O(N^{4/5}).

Now any path p in G − S contains at most N^{1/5} nodes from each layer Li. Thus, depth(G − S) ≤ N^{3/5}. □

5. The Implementation

In this section we describe our implementation of [AB16] detailing the various optimization techniques and heuristics we implemented.

5.1. Improved Depth-Reducing Construction for Argon2i Graph

At its core the attacks of [AB16] rely on constructing a small set S of nodes such that removing S from Gσ,τ results in a graph with only short (directed) paths. A bit more technically, the number depth(Gσ,τ − S) of edges traversed by the longest path in the remaining graph is small. Before describing our improvements we review the construction of [AB16]. To construct S, [AB16] divides the N = στ nodes into √d = N^{1/4} layers L1, ..., L_{√d}, each containing N/√d = N^{3/4} consecutive nodes. They further divided each layer Li into N/d = √N segments Li1, ..., Li,N/d of √d consecutive nodes each. S is now constructed in two steps. First add the last node in every segment Lij to S for all i ≤ √d and j ≤ N/d. Then, for each layer Li and for each node v ∈ Li for which parents(v) ∈ Li we add v to the set S. That is, we add v if and only if both of v's parents are in the same layer. By removing nodes in S from the graph we ensure that for each layer Li each of the segments Li1, ..., Li,N/d in that layer are disconnected from each other. Thus, any path in Gσ,τ stays in layer Li for at most √d − 1 steps. We have depth(Gσ,τ − S) ≤ d as there are √d layers. [AB16] analyzed Argon2i-A and showed that when d = √n the set S will have size |S| = O(n^{3/4} ln n).

Our first optimization is based on the observation that in Step 2 we do not always need to add v to the set S even if both of v's parents are in the same layer as v. Instead we only need to add v if these parent edges fail to make progress within a segment. Suppose that v is the a'th node in segment Lij and that v's parent u is the b'th node in segment Lij′ (j′ < j). If b < a then we say that the edge (u, v) makes progress in layer Li. Otherwise, if b ≥ a, we say that the edge (u, v) does not make progress (note that edges (v − 1, v) always make progress within a segment so we only need to worry about the pseudorandomly chosen parents). This simple optimization allows us to reduce the size of the set S by a factor of (almost) 2 because (approximately) half of the edges (u, v) will make progress! Furthermore, this optimization does not increase the depth of Gσ,τ − S.
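The two-step construction just described, including the progress-based optimization, can be sketched in C. This is a minimal illustration under stated assumptions, not the authors' released code: we model a simplified single-pass DAG in which node v's parents are v − 1 and one pseudorandom parent parent[v] < v, and the helper name select_set_S is ours.

```c
#include <stddef.h>

/* Sketch of the depth-reducing set construction with the "progress"
 * optimization. Nodes 1..N are split into n_layers layers of consecutive
 * nodes, and each layer into segments of gap+1 nodes. Step 1 places the
 * last node of every segment into S. Step 2 places v into S only when
 * v's pseudorandom parent u lies in the same layer AND the edge (u, v)
 * fails to make progress (u's position within its segment is >= v's).
 * parent[v] < v is the pseudorandom parent (0 = none); in_S[] is the
 * output bitmap (1-indexed). Returns |S|. */
size_t select_set_S(size_t N, const size_t *parent, unsigned char *in_S,
                    size_t n_layers, size_t gap) {
    size_t seg = gap + 1;
    size_t layer_sz = N / n_layers;   /* assume n_layers divides N */
    size_t size = 0;
    for (size_t v = 1; v <= N; v++) {
        size_t layer = (v - 1) / layer_sz;
        size_t pos = ((v - 1) % layer_sz) % seg;  /* position in segment */
        if (pos == seg - 1) {                     /* Step 1: segment end */
            in_S[v] = 1; size++;
            continue;
        }
        size_t u = parent[v];                     /* Step 2 */
        if (u >= 1 && (u - 1) / layer_sz == layer) {
            size_t u_pos = ((u - 1) % layer_sz) % seg;
            if (u_pos >= pos) { in_S[v] = 1; size++; }
        }
    }
    return size;
}
```

Removing the marked nodes leaves each layer able to contribute at most gap progress steps to any path, giving depth(Gσ,τ − S) ≤ gap × #layers as described above.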
Each edge in p either makes progress within a layer Li or moves to a higher layer. As before the path p can only make progress √d − 1 times within each layer Li.

Our second optimization is a heuristic one. We use two different parameters gap and #layers: we divide our N nodes into layers L1, ..., L#layers of size N/#layers and we divide each layer into segments Li1, ..., Li,N/(#layers(gap+1)) of size gap + 1 each. We can follow the same construction to find S such that depth(Gσ,τ − S) ≤ gap × #layers. [AB16] fixed #layers = (gap + 1) = √d to maximize asymptotic performance, but in practice we achieve better attack quality by allowing the two parameters to differ.

5.1.1. Controlling Parallelism. In the GenPeb attack of [AB16] each of the segments Li1, ..., Li,N/(#layers(gap+1)) is re-pebbled in parallel. Thus, parallelism p = N/(#layers(gap + 1)) is sufficient to execute our attack. If we have an upper bound U on parallelism then we can select the parameters #layers and gap subject to the condition that (gap + 1)#layers ≥ N/U.

5.2. Dynamic Reduction of Balloon Phase Costs

Recall that the balloon phase is used in [AB16] to recover pebbles that were discarded during the last light phase. Balloon phases, unlike light phases, can be memory intensive. Therefore, to minimize cumulative memory usage it is imperative to minimize the running time of the balloon phase. In [AB16] each balloon phase is (pessimistically) assumed to run for exactly d steps, where d ≥ depth(Gσ,τ − S). We use a simple observation to reduce the cost incurred during balloon phases. The observation is that most balloon phases never need to run for the full d steps to recover the necessary pebbles for the next light round. If we begin a balloon phase on round i, where node i is in layer Lj with j = ⌈i × #layers / N⌉, then we only need to recover pebbles on nodes in layers L1, ..., Lj during the balloon phase. If we remove nodes in the set S and layers L>j = ∪_{k=j+1}^{#layers} Lk from the graph then we have depth(Gσ,τ − S − L>j) ≤ j × gap. Thus, the balloon phase will only need to run for j × gap steps. On average a balloon phase only needs to run for about d/2 steps.

5.3. Incorporating Memory-Reducing Attack

Our second observation is that the attacks of [AB16] can be combined with the opportunistic memory-reducing attacks of [CGBS16a]. The attack of [CGBS16a] was based on the simple observation that you could discard the pebble on node v as soon as you finish pebbling the last of v's children. While this attack only reduces memory consumption by a constant factor (in contrast to the asymptotic reductions in [AB16]), the constant factors were large enough that Argon2i was updated in response. In Argon2i-B each new block that is being stored in memory is first XORed with the existing block in memory that is being replaced. In the language of graph theory this means that each node v > σ has an additional parent v − σ, where σ is the size of the memory window. This modification ensures that in the attack of [CGBS16a] we cannot discard a pebble early (e.g., for each node v we will not finish pebbling v's children until the exact moment that v is outside the memory window).

However, if we are running the attack of [AB16] the current memory window will be 'interrupted' by a balloon phase. We can potentially discard the pebble on v well before we have pebbled the last of v's children if we know that we have an opportunity to recover v before it is needed again. In particular, if we know that we can recover a pebble on node v during the next balloon phase we can discard the pebble on node v as soon as it is no longer needed for the current light phase or as soon as we finish pebbling the last of v's children that we need to pebble in the current light phase. More formally, if the light phase starts at time t and ends at time t + g − 1 then we can discard a pebble from node v ∉ S during round t′ < t + g − 1 if ∀u ∈ [t′, t + g − 1] we have v ∉ parents(u).

5.4. Attack Implementation

We developed C code to simulate the [AB16] attack on randomly generated Argon2i-A, Argon2i-B and BH instances. Our code is available at an anonymous GitHub repository https://github.com/ArgonAttack/AttackSimulation.git. Our implementation includes the additional optimizations described in this section. Specifically, our code base includes the following procedures:

1) GenerateRandomDAG. Procedures to sample a random Argon2i-A, Argon2i-B or BH DAG Gσ,τ given memory parameter σ and the parameter τ specifying the number of passes over memory.
2) SelectSetS. A procedure to select the depth-reducing set S for an input DAG Gσ,τ given input parameters gap and #layers.
3) Attack. Procedures to simulate the optimized attack. Specifically, Attack simulates one iteration at a time and measures cumulative energy costs (memory usage + calls to the compression function). Attack takes as input the DAG Gσ,τ, the depth reducing set S along with
the associated parameters gap and #layers and a parameter g which specifies the length of each light phase. The procedure returns an upper bound13 on the cost of the optimized [AB16] attack.
4) SearchForg. A procedure to search for the best g value. The procedure takes as input a DAG G, the depth reducing set S along with the associated parameters gap and #layers. The procedure then uses an iteratively refining grid search heuristic to search for the optimal parameter g to use in the attack. In more detail, we start with a large range [gMin, gMax] of potential g values containing the point g = n^{3/4} (the value of g used in the theoretical attacks of [AB16])14. In the first iteration we repeatedly run Attack to measure attack costs when we instantiate g with each value in the set {gMin, gMin + gStep, ..., gMin + 8·gStep, gMax} where gStep = (gMax − gMin)/9. Suppose that the value g = gMin + i·gStep yielded the lowest attack cost. Then in the next iteration we would set gMin = max{gMin, gMin + (i − 1)gStep} and we would set gMax = min{gMax, gMin + (i + 1)gStep}. We repeat this process 6 times in total and return the best value of g that we found.
5) SearchForSParameters. A procedure to search for the best pair of parameters gap and #layers which control the construction of the set S. The procedure takes as input the DAG G and a parameter g. The procedure SearchForSParameters is similar to SearchForg except that the iteratively refining grid search is carried out in two dimensions. For each pair of parameters (gap, #layers) we must run SelectSetS(gap, #layers) to construct the depth-reducing set S before we can call Attack to simulate the attack.
6) SearchForSParametersWithParallelism. Similar to SearchForSParameters except that the procedure additionally takes as input a parallelism parameter. The parameters gap and #layers are chosen subject to the constraint that p ≈ N/(#layers(gap + 1)) so that the attack uses parallelism p. The iteratively refining grid search is carried out in one dimension15.

13. While the real-world attack would be carried out on a massively parallel machine the simulations were carried out in a single-threaded process. To make the Attack simulation as efficient as possible we allowed the Attack procedure to overestimate the cost of the attack in order to improve efficiency. In particular, the procedure could occasionally double count the number of pebbles on (up to) depth(Gσ,τ − S) parent nodes during a balloon phase. These double counted costs comprise a very small fraction of the total energy cost. Thus, we may overestimate the cost of the attack, but only very slightly. In any case double counting these few pebbles can only cause us to underestimate attack quality.

14. We also require that gMin ≥ gap × #layers so that we ensure that the depth of the graph does not exceed the length of a balloon phase.

15. Once we fix one of the parameters (e.g., #layers) the other parameter (e.g., gap) is fully specified.

5.4.1. Advantages of Simulation Over Theoretical Analysis. By simulating the [AB16] attack we do not have to rely on pessimistic assumptions to upper bound attack costs. For example, in the theoretical analysis of [AB16] they assume that the pebbling algorithm has to pay to keep pebbles on every node in the depth-reducing set S during every pebbling round (total cost: n|S|). While this assumption may be necessary for a theoretical analysis (the set S is only defined once we specify a specific instance G), it overestimates costs paid during many pebbling rounds (especially during early pebbling rounds when we will have very few pebbles on the set S).

6. Analysis & Implications

In the simulation of [AB16], after fixing iMHF parameters τ and n = στ we first generated a random instance of the Argon2i-A (resp. Argon2i-B, BH) DAG Gσ,τ. We then temporarily set g = n^{3/4} (the value of g used in the theoretical analysis from [AB16]) and used the procedure SearchForSParameters (2-dimensional iteratively refined grid search) to find good values for the attack parameters #layers and gap. Once we have #layers and gap we then ran SearchForg (1-dimensional iteratively refined grid search) to find a good value for the attack parameter g. Once we have all of the attack parameters g, gap, #layers we sampled 9 additional random Argon2i-A DAGs (resp. Argon2i-B, BH), ran SelectSetS to generate a depth reducing set S for each DAG and measured attack costs using our simulation algorithm Attack. In addition to recording attack costs for each iMHF instance we also recorded the optimal attack parameters, required parallelism and the size of the depth reducing sets constructed.

6.1. Memory Consumption and Runtime

Recall that memory parameter σ denotes the table size used by the honest algorithm and τ denotes the number of passes over that table. Then n = στ is (roughly) the number of calls the honest algorithm makes to the compression function and so is a reasonable approximation of the runtime of the honest algorithm.
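The one-dimensional iteratively refining grid search used by SearchForg (item 4 above) can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the cost callback stands in for a full run of Attack, and example_cost is a hypothetical oracle used only for demonstration.

```c
/* Sketch of the iteratively refining grid search from SearchForg:
 * probe 10 evenly spaced values of g in [g_min, g_max], then zoom in
 * on the interval surrounding the best grid point, repeating 6 times.
 * cost() stands in for a full run of the Attack simulation. */
typedef double (*cost_fn)(double g, void *ctx);

double search_for_g(double g_min, double g_max, cost_fn cost, void *ctx) {
    double best_g = g_min;
    double best_cost = cost(g_min, ctx);
    for (int round = 0; round < 6; round++) {
        double step = (g_max - g_min) / 9.0;
        int best_i = 0;
        double round_best = cost(g_min, ctx);
        for (int i = 1; i <= 9; i++) {
            double g = g_min + i * step;
            double c = cost(g, ctx);
            if (c < round_best) { round_best = c; best_i = i; }
        }
        if (round_best < best_cost) {
            best_cost = round_best;
            best_g = g_min + best_i * step;
        }
        /* Refine: gMin <- max{gMin, gMin + (i-1)gStep},
         *         gMax <- min{gMax, gMin + (i+1)gStep}. */
        double lo = g_min + (best_i - 1) * step;
        double hi = g_min + (best_i + 1) * step;
        double new_min = (lo > g_min) ? lo : g_min;
        double new_max = (hi < g_max) ? hi : g_max;
        g_min = new_min;
        g_max = new_max;
    }
    return best_g;
}

/* Hypothetical stand-in for the Attack cost oracle: minimized at g = 5. */
double example_cost(double g, void *ctx) {
    (void)ctx;
    return (g - 5.0) * (g - 5.0);
}
```

The two-dimensional search of SearchForSParameters follows the same pattern, refining a (gap, #layers) grid instead of a single interval.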
For the Argon2i-B and BH iMHFs we simulated our attack for each pair of parameters n ∈ {2^17, 2^18, 2^19, 2^20, 2^21, 2^22, 2^23, 2^24} and τ ∈ {1, 3, 4, 6, 10, 20}. We denote the label size (in bytes) by B. For example the Blake2b based compression function of Argon2 has an output of size B = 1024 while for BH we use B = 512 as this is the output size of all considered compression functions in that work (with the exception of SHA3 where B = 1344 is used). Note that the actual memory usage (by the honest algorithm) is then M = Bσ = Bn/τ. Thus for Argon2i-B, when τ = 1, then n = 2^24 nodes corresponds to M = 16GB of memory usage, but when τ = 10, n = 2^24 nodes corresponds to M = 1.6GB of memory usage. We also simulated our attack on τ-pass Argon2i-A for τ ∈ {1, 3, 4} to compare Argon2i-A and Argon2i-B.

6.2. Argon2i-B

Figure 1 shows attack quality against Argon2i-B for different memory parameters σ.16 The plots demonstrate that the attacks are effective even for "pessimistic" parameter settings (e.g., at τ = 6 passes over 1GB of memory the attack already reduces costs by a factor of 2). To prevent any attack at 1GB of memory we would need to select τ > 10. While attack quality decreases with τ, we stress that it is undesirable to pick larger values of τ in practice because it increases running time by a factor of τ. Human users are not known for their patience during authentication. If we select a larger value of τ then we must select σ small enough so that we can make τ passes over memory before the user notices or complains (e.g., after 1–2 seconds).

The IRTF proposal [BDKJ16] suggests the opposite approach for selecting these parameters: First figure out the maximum memory usage M that each instance of Argon2i-B can afford (setting σ = M/B) as well as the maximum allowable running time t that each call can afford. Then select the maximum τ such that we can complete τ passes through M memory in time t. We stress that this procedure could easily result in the selection of the parameter τ = 1 pass through memory (e.g., in settings where lots of memory is available and/or users are less patient). In this case attack quality is approximately 5 at 1GB of memory and approximately 9.3 at 16GB (the latter data point is outside the visible range of Figure 1).

16. Each data point measures the average attack quality over 10 random samples. We do not include error bars because attack cost showed minimal variation across all 10 samples in every iMHF instance that we tried. For example, the energy cost of our attack on all 10 graphs matched on the first 2-3 significant digits.

6.3. Argon2i-A vs. Argon2i-B

Figure 2 compares attack quality against both versions of Argon2i. On a positive note the results show that the updated version of Argon2i is indeed somewhat less susceptible to the attacks of [AB16]. On a negative note the attacks on Argon2i-A, the version from the password hashing competition, are quite strong. For example, even at τ = 4 passes through memory (which increases running time by a factor of 4) we get an attack with quality > 4, meaning that the adversary can reduce his energy costs by a factor of 4.

6.4. Balloon Hashing

Figure 3 shows attack quality against the BH iMHF scheme of [CGBS16a] as memory usage M varies. As mentioned above we use a label size of B = 512B to generate our plots. Thus, when τ = 1, setting n = 2^24 corresponds to M = 8GB of memory usage in comparison to M = 16GB for Argon2i-B with the same parameters. The attacks of [AB16] perform particularly well against BH. In particular, the smaller blocksize of BH is a disadvantage (in comparison with Argon2i-B) as it means that we need to select a higher value of σ to achieve the same memory usage and the attack quality of [AB16] increases rapidly with n. This makes BH less resistant to the attacks than Argon2i-B.

6.5. Attack Parameters

Table 1 shows how the parameters of our attack (e.g., g, #layers, gap, parallelism p) vary with memory usage M and τ. A few interesting trends emerge. First, the maximum level of parallelism needed for any of the Argon2i-B instances that we tried was 1,324 (τ = 1-pass Argon2i-B with M = 16GB) and for Argon2i-A parallelism never exceeded 1,496 (τ = 1-pass Argon2i-A with M = 16GB). In Section 6.6 we explore how attack quality is affected if you explicitly upper bound parallelism. Second, the amount of parallelism p needed tends to increase with σ, but decreases as we increase the number of passes through memory τ. For example, τ = 4-pass Argon2i-B at M = 1GB only uses parallelism 400. Interestingly, the size |S| of the depth reducing set did not seem to vary much as τ increases (holding n = στ constant)17. Interestingly,

17. |S| does seem to increase slightly with τ, but this trend is hidden because Table 1 only reports the two most significant digits of |S|. Furthermore, we observed the opposite trend (|S| decreases slightly with τ) for the BH iMHF. It is also worth noting that for each iMHF instance (τ, σ) that we tested we sampled 10 distinct graphs Gσ,τ and we had to sample different sets S for each graph. We found that the size of |S| never varied greatly across these 10 different random instances. In fact, if we only consider the two most significant digits then the size of |S| was always the same.
the procedure SearchForSParameters tends to select the parameters #layers and gap so that #layers is several times larger than gap (in the theoretical analysis of [AB16] these parameters were equated). As expected the attack parameters gap, #layers and g all seem to increase with n = στ, the number of nodes in the DAG Gσ,τ. The Bandwidth-On-Chip column estimates the amount of memory bandwidth on chip necessary to support p cores (under the assumption that we need 1.5GB/s per core). In the worst case (M = 1GB and τ = 1 memory passes) we need on-chip bandwidth of about 2TB/s, a value that may be plausibly achieved in the near future (there is currently a chip that achieves memory bandwidth of 1TB/s). In most other instances the required bandwidth is significantly reduced (e.g., 0.6TB/s for 4-pass Argon2i-B at 1GB).

iMHF       | N    | τ  | M      | g       | #layers | gap | p     | |S|   | Bandwidth-On-Chip
Argon2i-B  | 2^24 | 1  | 16GB   | 381,376 | 264     | 47  | 1,324 | 1.0e6 | 1.986 TB/s
Argon2i-B  | 2^22 | 4  | 1GB    | 93,237  | 228     | 45  | 400   | 3.1e5 | 0.6 TB/s
Argon2i-B  | 2^24 | 4  | 4GB    | 220,808 | 388     | 47  | 901   | 1.0e6 | 1.352 TB/s
Argon2i-B  | 2^24 | 6  | 2.7GB  | 223,702 | 512     | 47  | 683   | 1.0e6 | 1.025 TB/s
Argon2i-B  | 2^24 | 10 | 1.6GB  | 218,137 | 512     | 78  | 415   | 1.0e6 | 0.6225 TB/s
Argon2i-A  | 2^20 | 4  | 256MB  | 65,555  | 113     | 36  | 251   | 4.7e4 | 0.376 TB/s
Argon2i-A  | 2^22 | 4  | 1GB    | 155,394 | 156     | 47  | 561   | 1.4e5 | 0.8415 TB/s
Argon2i-A  | 2^24 | 4  | 4GB    | 357,096 | 233     | 75  | 948   | 3.8e5 | 1.422 TB/s
BH         | 2^24 | 1  | 8GB    | 357,096 | 170     | 65  | 1,496 | 4.2e5 | 2.244 TB/s
BH         | 2^24 | 4  | 2GB    | 215,223 | 233     | 75  | 948   | 3.8e5 | 1.422 TB/s
BH         | 2^24 | 10 | 0.8GB  | 192,915 | 388     | 78  | 548   | 3.7e5 | 0.822 TB/s

TABLE 1: Best Attack Parameters Found (Selected Argon2i-A, B and BH Instances).

[Figure 1: Argon2i-B Attack Quality. Plot of attack quality vs. memory parameter log2(σ) (M = σKB); curves for τ = 1, 3, 4, 6, 10, 20.]

[Figure 2: Argon2i-A vs. B. Plot of attack quality vs. memory parameter log2(σ) (M = σKB); curves for Argon2i-A and Argon2i-B at τ = 1, 3, 4.]

[Figure 3: Balloon Hash Attack Quality. Plot of attack quality vs. memory parameter log2(σ); curves for τ = 1, 3, 4, 6, 10, 20.]

[Figure 4: Argon2i Attacks with Bounded Parallelism (M = 1GB). Plot of attack quality vs. parallelism; curves for Argon2i-A and Argon2i-B at τ = 1, 4, 7, 10.]

[Figure 5: Argon2i-A (n = 2^24, M = n/τ KB). Plot of attack quality vs. parallelism; curves for τ = 1, 4, 7, 10.]

6.6. Controlling Parallelism

We now explore how attack quality is affected when parallelism is limited (e.g., due to limitations on on-chip bandwidth). We run two experiments. The first involves Argon2i-A and Argon2i-B and the second involves Argon2i-A. In the first experiment we fix memory M = 1GB (σ = 2^20) and select p ∈ {25, 50, 100, 200, 500, 750, 1000} and τ ∈ {1, 4, 7, 10} and generate a random Argon2i-B (resp. Argon2i-A) instance Gσ,τ. Once again fixing g = n^{3/4} we use the procedure SearchForSParametersWithParallelism to find good attack parameters gap and #layers, subject to the condition that (gap + 1)#layers = n/p so that the attack uses parallelism exactly p. We then use the procedure SearchForg to find a good parameter g. Finally, we generate 10 instances of graphs Gσ,τ, run SelectSetS to generate the depth-reducing set S for each instance and run Attack to find the cost of each attack. Figure 4 shows the results of this experiment.

The second experiment is similar except that we use Argon2i-A and we fix runtime n = 2^24 instead of memory. We select p ∈ {25, 50, 100, 200, 500, 1000} and τ ∈ {1, 4, 7, 10} and generate a random Argon2i-A instance Gσ=n/τ,τ. We use the same procedures SearchForSParametersWithParallelism and SearchForg to find our attack parameters g, gap, #layers before simulating our attack. Figure 5 shows the results of this second experiment.

6.7. Discussion

In Figure 5 attack quality (almost) monotonically increases with parallelism (excluding the τ = 10-pass plot, which peaks at p = 500). Notice that attack quality increases rapidly with p when p is small, but this rate of increase slows dramatically as p increases. In Figure 4 attack quality for Argon2i-B peaks at around p = 200 and often starts to decrease as parallelism increases after this point. It may seem surprising that attack quality decreases with parallelism, but remember the procedure SearchForSParametersWithParallelism ensures that we construct the set S in such a way that we need parallelism exactly p. Thus, the plots are suggesting that the optimal attack will not use parallelism p > 200 (whether or not we control for parallelism). In this case the attack could be implemented on a chip with only 300GB/s of memory bandwidth (using the same assumption that we need 1.5GB/s of bandwidth per core).

6.8. Multiple Lanes

Thus far in our analysis of Argon2i-A and Argon2i-B we have focused on the single-threaded version of the iMHFs. In this section we discuss two possible ways that an iMHF could be extended to support parallelism: a trivial extension and the more detailed approach taken by Argon2i. Surprisingly, we find that the trivial extension offers better resistance to the [AB16] attacks.

6.8.1. Trivial Extension. Given a single-threaded iMHF the easiest way to support parallelism p > 1 would be to evaluate p independent instances of the iMHF in parallel and then hash the final block from each iMHF instance. More specifically given parameters τ, σ and p each individual iMHF instance would have memory parameter σ/p and make τ passes over memory. This solution does give the adversary an easy time-memory trade-off. Namely, an adversary with σ/p memory could still evaluate the iMHF, but doing so would also increase his running time by a factor of p, so this attack does not reduce overall energy costs or AT complexity. Because energy complexity scales linearly with the number of instances being computed, attack quality against the multi-threaded iMHF with parameters τ, σ and p will be equal to the attack quality on the single-threaded variant of the iMHF with parameters τ, σ/p. Thus, increasing parallelism p will increase resistance to the [AB16] attacks because their attack quality grows with σ.

6.8.2. Argon2i Approach. In an attempt to avoid this time-memory trade-off the Argon2i designers took a different approach. They divide memory into p lanes each with space for σ/p node labels and each lane is further divided into 4 slices. Each thread will be responsible for filling in one lane, but the value of a node in one lane is allowed to depend on the values of nodes in other lanes. In particular, to pick the parent of a node v we first select a lane uniformly at random, and then we pick a random parent from that lane according to a non-uniform distribution that is specified in the
IRTF proposal [BDKJ16] (the exact specification of the distribution is not important here). To prevent blocking they further require that the i'th node in a lane cannot depend on a parent node from another lane if the parent node is in the same 'slice' of memory, where each memory slice contains σ/(4p) nodes in each lane. Loosely, this means that if i's parent is the j'th node then i − j must be somewhat large (about σ/(4p)).

6.8.3. Analysis and Discussion. We observe that the approach taken in the design of Argon2i can actually decrease resistance to the [AB16] attack when compared with the trivial approach to supporting parallelism. Recall that during the construction of the depth-reducing set S we only need to add a node v to the set S if its parent is in the same layer. However, the parent of node v will come from a different lane with probability (p − 1)/p and whenever we select v's parent from a different lane we are almost guaranteed that v's parent will not be in the same layer because the node must occur in an earlier memory slice.18

18. Typically, we will have n/#layers < σ/(4p) (the size of an individual layer).

As a concrete example we set τ = 4, σ = 2^17 and p = 4 (n = 2^19 = τσ) and generated random Argon2i-B DAGs. We achieve attack quality > 1.18 even with minimal effort to optimize the parameters used in the attack19. By comparison, if we had followed the trivial approach [AB16] gives no attack (even with our optimizations). Specifically, we would have had 4 independent 4-pass Argon2i-B instances with memory parameter σ = 2^17. The best attack quality we found in the previous section on 4-pass Argon2i-B with σ = 2^17 was 0.89 < 1. We conjecture that attack quality on multi-lane versions of Argon2i-A and Argon2i-B can be further improved with additional effort to optimize the parameters used in the attack and the heuristics used to construct the depth-reducing set S. We leave this as an interesting challenge for future research. In this paper we have chosen to focus on the single-lane variations of Argon2i-A and Argon2i-B since, combined with the trivial parallelism approach, they lead to iMHFs that are more resistant to the Alwen-Blocki attack.

19. We found the attack parameters gap = 32, #layers = 256 and g = 13,970 by hand.

7. Conclusions

We proved that the Alwen-Blocki attack [AB16] on Argon2i-A can be extended to Argon2i-B, and we provided several novel techniques to improve this attack. It was previously believed that the Alwen-Blocki attack [AB16], while it does yield large asymptotic reductions in energy cost as σ grows large, was not relevant for practical parameter ranges (e.g., ≤ 16GB of memory). We use simulations to show that, with our optimizations, the Alwen-Blocki attack [AB16] is already relevant for practical parameter ranges. In fact, even for 'pessimistic' parameter settings (τ = 6) the attack can reduce costs by a factor of 2 when using just 1GB of memory. We also showed that when on-chip memory bandwidth limits parallelism we can adjust attack parameters accordingly without significantly decreasing attack quality. However, our results show that Argon2i-B offers better resistance to the attack than Argon2i-A and than the balloon hashing algorithm. Ultimately, our understanding of Argon2i-A and Argon2i-B is incomplete. There are many other potential heuristics that should be explored to (potentially) improve the attack. Could our attack on Argon2i-B be improved in the future? Can we lower bound the energy complexity necessary to evaluate Argon2i-B?

References

[AB16] Joël Alwen and Jeremiah Blocki. Efficiently computing data-independent memory-hard functions. In Advances in Cryptology — CRYPTO'16. Springer, 2016.

[AS15] Joël Alwen and Vladimir Serbinenko. High Parallel Complexity Graphs and Memory-Hard Functions. In Proceedings of the Annual ACM Symposium on Theory of Computing, STOC '15, 2015. http://eprint.iacr.org/2014/238.

[Aum16] JP Aumasson. What's up Argon2? BSidesLV 2016, 2016. Slides available at https://speakerdeck.com/veorq/whats-up-argon2.

[BDK15] Alex Biryukov, Daniel Dinu, and Dmitry Khovratovich. Fast and tradeoff-resilient memory-hard functions for cryptocurrencies and password hashing. Cryptology ePrint Archive, Report 2015/430, 2015. http://eprint.iacr.org/2015/430.

[BDK16] Alex Biryukov, Daniel Dinu, and Dmitry Khovratovich. Argon2 password hash. Version 1.3, 2016. https://www.cryptolux.org/images/0/0d/Argon2.pdf.

[BDKJ16] Alex Biryukov, Daniel Dinu, Dmitry Khovratovich, and Simon Josefsson. The memory-hard Argon2 password hash and proof-of-work function. Internet-Draft draft-irtf-cfrg-argon2-00, Internet Engineering Task Force, March 2016.

[Ber] Daniel J. Bernstein. Cache-Timing Attacks on AES.

[BK15] Alex Biryukov and Dmitry Khovratovich. Tradeoff cryptanalysis of memory-hard functions. Cryptology ePrint Archive, Report 2015/227, 2015. http://eprint.iacr.org/.

[CGBS16a] Henry Corrigan-Gibbs, Dan Boneh, and Stuart Schechter. Balloon hashing: Provably space-hard hash functions with data-independent access patterns. Cryptology ePrint Archive, Report 2016/027, Version: 20160601:225540, 2016. http://eprint.iacr.org/.

[CGBS16b] Henry Corrigan-Gibbs, Dan Boneh, and Stuart Schechter. Balloon hashing: Provably space-hard hash functions with data-independent access patterns. Cryptology ePrint Archive, Report 2016/027, Version: 20160114:175127, 2016. http://eprint.iacr.org/.
[DKW11] Stefan Dziembowski, Tomasz Kazana, and Daniel Wichs. One-time computable self-erasing functions. In Yuval Ishai, editor, TCC, volume 6597 of Lecture Notes in Computer Science, pages 125–143. Springer, 2011.

[FLW13] Christian Forler, Stefan Lucks, and Jakob Wenzel. Catena: A memory-consuming password scrambler. IACR Cryptology ePrint Archive, 2013:525, 2013.

[HP70] Carl E. Hewitt and Michael S. Paterson. Record of the Project MAC conference on concurrent systems and parallel computation, chapter Comparative Schematology, pages 119–127. ACM, New York, NY, USA, 1970.

[Kho16] Dmitry Khovratovich. Re: [Cfrg] Balloon-Hashing or Argon2i. CFRG Mailing list, June 2016. https://www.ietf.org/mail-archive/web/cfrg/current/msg08282.html.

[Per09] C. Percival. Stronger key derivation via sequential memory-hard functions. In BSDCan 2009, 2009.

[PHC] Password hashing competition. https://password-hashing.net/.

[Tho79] Clark D. Thompson. Area-time complexity for VLSI. In Michael J. Fischer, Richard A. DeMillo, Nancy A. Lynch, Walter A. Burkhard, and Alfred V. Aho, editors, Proceedings of the 11th Annual ACM Symposium on Theory of Computing, April 30 – May 2, 1979, Atlanta, Georgia, USA, pages 81–88. ACM, 1979.

[Wal15] Mark Walton. The R9 Fury is AMD's best card in years, but just who is it for? http://arstechnica.com/gadgets/2015/07/the-r9-fury-is-amds-best-card-in-years-but-just-who-is-it-for/ (Retrieved 8/5/2016), 2015.

[Wal16] Mark Walton. Graphics cards with 1024GB/s bandwidth? Samsung begins HBM2 production. http://arstechnica.com/gadgets/2016/01/graphics-cards-with-1024gbs-bandwidth-samsung-begins-hbm2-production/ (Retrieved 8/5/2016), 2016.

Algorithm 1: GenPeb(G, S, g, d)

  Arguments: G = (V, E), S ⊆ V, g ∈ [depth(G − S), |V|], d ≥ depth(G − S)
  Local Variables: n = |V|
   1  for i = 1 to n do
   2      Pebble node i.
   3      l ← ⌊i/g⌋ · g + d + 1
   4      if i mod g ∈ [d] then                      // Balloon Phase
   5          d′ ← d − (i mod g) + 1
   6          N ← need(l, l + g, d′)
   7          Pebble every v ∈ N which has all parents pebbled.
   8          Remove pebble from any v ∉ K where K ← S ∪ keep(i, i + g) ∪ {n}.
   9      else                                       // Light Phase
  10          K ← S ∪ parents(i, i + g) ∪ {n}
  11          Remove pebbles from all v ∉ K.
  12      end
  13  end

Algorithm 2: Function need(x, y, d′)

  Arguments: x, y ≥ x, d′ ≥ 0
  Constants: Pebbling round i, g, gap.
   1  j ← (i mod g)                                  // Current layer is L⌊j/gap⌋
   2  Return L⌊j/gap⌋ ∩ { i · gap + j : i ≤ n/gap }
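To make the light/balloon structure of Algorithm 1 (GenPeb) concrete, the following is a heavily simplified Python sketch. It is illustrative only: the toy DAG generator, the choice of the set S, and the collapsing of each balloon phase into instantaneous parallel re-pebbling rounds are all assumptions for the sketch, not the actual Argon2i-B edge distribution or the exact need/keep schedule analyzed in the paper.

```python
import random

def make_dag(n, seed=1):
    """Toy Argon2i-like DAG on nodes 1..n: node v has parent v-1 plus
    (for v > 2) one uniformly random earlier parent.  This stands in
    for the data-independent edge distribution; it is NOT the real
    Argon2i-B rule."""
    rng = random.Random(seed)
    parents = {1: []}
    for v in range(2, n + 1):
        ps = [v - 1]
        if v > 2:
            ps.append(rng.randrange(1, v - 1))
        parents[v] = ps
    return parents

def repebble_targets(parents, targets, pebbled):
    """All unpebbled ancestors that must be re-pebbled to place pebbles
    on `targets`, walking backwards until already-pebbled nodes."""
    need, stack = set(), [t for t in targets if t not in pebbled]
    while stack:
        v = stack.pop()
        if v not in need:
            need.add(v)
            stack.extend(u for u in parents[v] if u not in pebbled)
    return need

def genpeb_sketch(parents, S, g):
    """Collapsed sketch of GenPeb: pebbles on S are never discarded;
    before each window of g light steps, a balloon phase re-pebbles the
    window's parents in parallel rounds; each light step pebbles node i
    and keeps only K = S ∪ parents(lookahead window) ∪ {n}.
    Returns the maximum number of pebbles simultaneously in use."""
    n = len(parents)
    pebbled, peak = set(), 0
    for start in range(1, n + 1, g):
        window = range(start, min(start + g, n + 1))
        targets = {u for v in window for u in parents[v]}
        todo = repebble_targets(parents, targets, pebbled)
        while todo:  # each iteration is one "parallel round" of the balloon phase
            ready = {v for v in todo if all(u in pebbled for u in parents[v])}
            pebbled |= ready
            todo -= ready
            peak = max(peak, len(pebbled))
        for i in window:  # light phase
            assert all(u in pebbled for u in parents[i]), "illegal pebbling move"
            pebbled.add(i)
            peak = max(peak, len(pebbled))
            keep = set(S) | {n} | {u for v in range(i + 1, min(i + g, n) + 1)
                                   for u in parents[v]}
            pebbled &= keep
    return peak

if __name__ == "__main__":
    n, g = 300, 30
    dag = make_dag(n)
    S = {v for v in range(1, n + 1) if v % 3 == 0}  # toy stand-in for a depth-reducing set
    print("peak pebbles:", genpeb_sketch(dag, S, g), "vs naive:", n)
```

The point of the sketch is only the bookkeeping pattern: random parents fall out of the light-phase lookahead and are discarded, and each balloon phase re-pebbles them from the retained set S, so the peak pebble count stays well below the naive n pebbles.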
Appendix

The [AB16] Algorithm

The following figures are taken almost verbatim from [AB16] (after appropriate notational changes) and give the high-level pseudo-code for the attack described in that work and implemented in this one.

Here need and keep are defined in Algorithm 2 and Algorithm 3 respectively. The notation Li denotes the ith layer of nodes as in Section 5.1, and these layers are further divided into segments of gap + 1 consecutive nodes. Intuitively, the function need specifies that during a balloon phase we fill in each of these gaps in parallel, one node at a time, moving on to layer Li+1 as soon as we completely finish repebbling layer Li. The function keep essentially allows us to discard pebbles during a balloon phase as soon as they pass outside the current memory window (e.g., if every node in Li+1 is at least σ nodes ahead of v then we can discard node v as soon as we finish re-pebbling layer Li).

Algorithm 3: Function keep(x, y)

  Arguments: x, y ≥ x
  Constants: Pebbling round i, g, gap, #layers, n, σ.
   1  j ← (i mod g)
   2  ℓ ← ⌊j/gap⌋                                    // Current layer
   3  Return L≥ℓ−⌈σ#layers/n⌉

On the XOR-and-High-Indegree Trick

At a high level the current Balloon Hashing DAG [CGBS16a] has a similar structure to the Argon2i iMHFs (the underlying compression functions are different). Both DAGs have indeg = 3. However, the original version of the Balloon Hashing algorithm (BHLin) [CGBS16b] had indeg = 21. Besides v − 1, a node v had 20 other parents chosen uniformly at random, so that the label of node v depends on up to 21 different labels. Of course, applying an iterative Merkle-Damgard
construction to hash these 21 labels would result in a dramatic slowdown.²⁰ BH-DoubleBuffer [CGBS16b] avoided Merkle-Damgard by applying a cheap linear operator (e.g., XOR) to the 21 labels before we apply the underlying compression function.

It seems like this trick could potentially make the attacks of Alwen and Blocki [AB16] much less efficient in practice. In particular, the attack keeps pebbles on all of the parents of the next g nodes that we want to pebble before the end of the light phase. However, there are up to 20g parents, so the memory costs could be quite high in practice. The increased indegree will also increase the probability that a node v needs to be included in the depth-reducing set (v needs to be included if any of the edges from its parents fail to make progress in its layers). Does this XOR trick increase resistance to the attacks of Alwen and Blocki [AB16]?

To address this question we introduce a new hypothetical iMHF called iXOR. iXOR is Argon2i-A with the modification that we pick 20 random parents for each node v in addition to v − 1 and XOR the labels together before applying the underlying compression function. iXOR is similar in spirit to the old Balloon Hashing algorithm BH-DoubleBuffer [CGBS16b]. We evaluate our attack on iXOR so that we can isolate the effect of the XOR trick on attack quality; attack quality against the current Balloon Hashing algorithm is evaluated elsewhere in the paper.

Figure 6 plots attack quality vs. memory against the iXOR construction when instantiated with the same compression function used in Argon2. The dotted lines plot attack quality when we implement the attacks of Alwen and Blocki [AB16] along with the other optimizations described earlier in the paper. These plots seem to indicate that the XOR trick dramatically decreases attack quality in practice. However, we introduce an additional optimization called XOR compression which dramatically improves attack quality.

[Figure 6: iXOR Attack Quality (indeg = 21). Attack quality vs. memory parameter log2(σ) (M = σ KB), with iXOR curves for τ = 1 and τ = 3, both with XOR compression and with no compression, and, for comparison, Argon2ib (τ = 6).]

1. XOR Compression

We can dramatically improve attack quality by using a trick we call XOR compression. The basic observation is that instead of storing the labels for up to 20g parents we can compress these labels at the end of the balloon phase. For each node v that we want to pebble in the next light phase we do not need to store the labels of all of v's parents; we can just store the XOR of these labels.²¹ While this optimization doesn't change asymptotic performance, it significantly improves attack quality for practical parameter ranges.

2. Discussion

For the purposes of comparison, the green line in Figure 6 shows attack quality against Argon2i-B. The plot shows that attack quality against 3-pass iXOR and 6-pass Argon2i-B are roughly equivalent. The 3-pass iXOR DAG has 3 · 2^m nodes while the 6-pass Argon2i-B DAG has 6 · 2^m nodes. Thus, the XOR trick from [CGBS16b] potentially has some benefit. If it takes the same amount of time to label nodes in iXOR and Argon2i-B then 3-pass iXOR would be preferable to 6-pass Argon2i since it runs twice as fast and consumes the same memory. In practice, this may not always be the case. To compute each new label in iXOR we need to load 20 new 1KB blocks from memory and XOR them before applying the compression function. For Argon2i-B we only need to load 1 new 1KB block from memory before applying the compression function. If memory bandwidth is not a bottleneck then 3-pass iXOR may be preferable to 6-pass Argon2i. Otherwise, 6-pass Argon2i may actually run faster than 3-pass iXOR.

20. The newest version of the Balloon Hashing algorithm does apply Merkle-Damgard, but because the newest Balloon Hashing algorithm has indeg = 3, slowdown is less of an issue.

21. It is possible that we won't have all of these parent labels available when the balloon phase finishes (i.e., because some of v's parents will be pebbled for the first time in the next light phase). However, this is not a problem as we can simply store the XOR of all known parent labels and XOR this block with the remaining parent label(s) as they become available during the light phase.
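The observation behind XOR compression (including footnote 21's point that straggling parent labels can be folded in later) is just associativity and commutativity of XOR. The following Python sketch illustrates it with toy 32-byte labels and a stand-in compression function; the real iXOR/Argon2 block size (1KB) and compression function differ.

```python
import hashlib

BLOCK = 32  # toy 32-byte labels; real Argon2/iXOR blocks are 1KB

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def compress(x):
    """Stand-in compression function (illustrative, not Argon2's)."""
    return hashlib.blake2b(x, digest_size=BLOCK).digest()

def ixor_label(parent_labels):
    """iXOR-style labeling: XOR all parent labels, then compress."""
    acc = bytes(BLOCK)
    for lab in parent_labels:
        acc = xor_bytes(acc, lab)
    return compress(acc)

class XorCompressedSlot:
    """XOR compression: for a node v to be pebbled in the next light
    phase, store a single running XOR of v's parent labels instead of
    each of the (up to 21) labels individually.  Parent labels that are
    not yet known when the balloon phase ends can be folded in as they
    become available during the light phase."""
    def __init__(self):
        self.acc = bytes(BLOCK)  # one block of memory per pending node

    def fold(self, label):
        self.acc = xor_bytes(self.acc, label)

    def finish(self):
        return compress(self.acc)

if __name__ == "__main__":
    labels = [compress(bytes([i])) for i in range(21)]  # 21 dummy parent labels
    slot = XorCompressedSlot()
    for lab in labels[:15]:
        slot.fold(lab)  # labels known at the end of the balloon phase
    for lab in labels[15:]:
        slot.fold(lab)  # stragglers arriving during the light phase
    print(slot.finish() == ixor_label(labels))
```

Because XOR is order-independent, the compressed slot yields exactly the same label as keeping all 21 parent labels, while using one block of memory per pending node instead of up to 21.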