
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure

Jikai Wang1, Yi Su1∗, Juntao Li1†, Qingrong Xia2, Zi Ye2, Xinyu Duan2, Zhefeng Wang2, Min Zhang1

1 Soochow University, China    2 Huawei Cloud, China

[email protected], [email protected], [email protected],
{xiaqingrong,yezi3,duanxinyu,wangzhefeng}@huawei.com, [email protected]

∗ Equal contribution.  † Corresponding author.

arXiv:2406.17276v3 [cs.CL] 6 Dec 2024

Abstract

Autoregressive language models demonstrate excellent performance in various scenarios. However, their inference efficiency is limited by the one-step-one-word generation mode, which has become a pressing problem as models grow increasingly larger. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which do not adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees, which can be applied to any autoregressive draft model. It searches for the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.

1 Introduction

Large language models (LLMs) (Black et al., 2022; Touvron et al., 2023; Jiang et al., 2024; Zheng et al., 2024) have achieved remarkable performance in various NLP scenarios. As models increase in size and complexity, the computational demands of inference rise significantly. Consequently, accelerating decoding is becoming increasingly important for saving computing resources and reducing response time.

Autoregressive models (Black et al., 2022; Zhang et al., 2022; Touvron et al., 2023) usually generate one token per decoding step, leading to limited decoding efficiency. In recent work, speculative decoding (Stern et al., 2018; He et al., 2023; Yang et al., 2023; Fu et al., 2024; Cai et al., 2024; Li et al., 2024) has shown great potential for lossless accelerated decoding. It applies a "draft and then verify" mechanism that preserves the original output distribution of the target model being accelerated. Drafting is performed by a low-overhead draft model, and the generated draft is verified in parallel by the target model so that multiple tokens are generated in one decoding step, bringing promising acceleration.

Existing work like EAGLE (Li et al., 2024) has proposed methods for training small but effective draft models. Previous work mainly adopts drafts structured as sequences or fixed trees. However, we argue that neither is the optimal draft structure under a limited node budget. Sequence-structured drafts (Stern et al., 2018; Leviathan et al., 2023; Xia et al., 2023; Yang et al., 2023; Zhang et al., 2023; Fu et al., 2024) contain redundant nodes. For example, "A-B-C-D-E" and "A-B-C-F-G" share the prefix "A-B-C", which is calculated twice during verification; therefore, there are only 7 valid tokens among the 10 nodes of these two sequences. Drafts with a tree structure (He et al., 2023; Cai et al., 2024; Li et al., 2024; Jeon et al., 2024; Chen et al., 2024) solve this problem: the same token can appear only once in the same tree layer, and a corresponding tree attention mask is designed for parallel verification. However, the specific structure of the tree is usually heuristic and remains constant, whereas, given a node budget, the structure that maximizes the acceptance length during verification changes with the input at each decoding step.

This paper proposes an adaptive and scalable tree structure called OPT-Tree, which can be applied to any autoregressive draft model. As shown in Figure 1, the tree structure adaptively changes in each decoding step to maximize the mathematical expectation of the acceptance length. We apply a greedy algorithm to construct an OPT-Tree in each step; details are elaborated in Section 3. We conduct comprehensive experiments in Section 4 to evaluate the effectiveness of OPT-Tree. Experimental results demonstrate that OPT-Tree outperforms the baselines and can be up to 3.2 times faster than vanilla autoregressive decoding. The mathematical expectation of the acceptance length is generally positively correlated with the actual acceptance length in practice. Moreover, OPT-Tree performs well when the tree size scales up. Using LLaMA-2-7B as the draft model, LLaMA-2-70B can generate 10 tokens in a single decoding step with OPT-Tree when the number of nodes is over 500, which indicates its great potential for adapting to more powerful computation resources and more effective draft models in the future.

Figure 1: Draft structures used in speculative decoding. The blue node is the last token of the current input. The green nodes are tokens generated by the draft model. Nodes in the same layer share the same position index. OPT-Tree varies in each decoding step to achieve a larger acceptance length.
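To make the redundancy argument from the introduction concrete, the short check below (our own illustration, not code from the paper) counts the distinct prefixes of the two example sequences; a draft tree stores exactly one node per distinct prefix.

# Distinct prefixes of the two sequence drafts "A-B-C-D-E" and "A-B-C-F-G".
# A draft tree stores one node per distinct prefix, so 7 nodes suffice
# where the two flat sequences spend 10.
drafts = ["ABCDE", "ABCFG"]
prefixes = {seq[:i + 1] for seq in drafts for i in range(len(seq))}
print(sorted(prefixes))  # ['A', 'AB', 'ABC', 'ABCD', 'ABCDE', 'ABCF', 'ABCFG']
print(len(prefixes))     # 7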
2 Preliminaries

In this section, we provide the necessary definitions to establish a clear and precise foundation for the concepts discussed in this paper.

Inference. After inputting x = (x_1, x_2, ..., x_l), where l is the current sequence length, the target model M and the draft model M_d return the next-word distributions p(y^{l+1} | x_1, x_2, ..., x_l) and p_d(ŷ^{l+1} | x_1, x_2, ..., x_l) respectively, where y^{l+1} and ŷ^{l+1} are the sampled next words.

Speculative Decoding. In speculative decoding with a tree-structured draft, M_d first infers d steps to generate a draft tree T of depth d, and then M verifies the draft. The verification depends on the sampling method. For greedy sampling, the ground truth is the sequence of tokens with the highest probability at each position output by M. As shown in Figure 2, the model can process the tree-structured input in parallel by constructing the corresponding tree attention mask. Among all branches of the tree that contain the root node, the longest branch whose prefix matches the ground truth is accepted. Therefore, multiple tokens can be generated in one decoding step while ensuring that the generated sequences are consistent with the original ones.

Figure 2: Subfigure (a) shows a draft tree, and subfigure (b) is its corresponding tree attention mask. The value of each blank position is zero.

Due to the parallel computing mechanism, with given computing resources, the time cost of verification can be considered constant when the length of the verification sequence is within a certain range. As shown in Figure 3, on a 4090 GPU with a 7B-parameter model, the time required to verify a sequence of length 10 is similar to that for a sequence of length 200. Consequently, when discussing the construction of a draft tree in Section 3, we treat the node budget of the draft tree as a given condition.

Figure 3: The relationship between input length and the wall time of inference for models of different sizes on various GPUs.
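As a concrete illustration of the tree attention mask described above, the sketch below (a minimal reconstruction under our own flattening convention, not the paper's released implementation) derives the mask from parent pointers so that every draft token attends only to its own branch.

import torch

def tree_attention_mask(parents):
    # parents[i] is the parent index of flattened tree node i; the root
    # (the last committed token) uses -1. Entry (i, j) is True iff node j
    # is node i itself or one of its ancestors, so every draft token
    # attends exactly to its own branch and nothing else.
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:        # walk up the branch to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# A small tree: root -> {A, B}, A -> {C}
print(tree_attention_mask([-1, 0, 0, 1]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 0, 1, 0],
#         [1, 1, 0, 1]])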
3 OPT-Tree

This section introduces OPT-Tree, an algorithm for constructing our defined optimal draft tree structure for any input sequence in speculative decoding with autoregressive draft models.

Considering a certain step in speculative decoding whose input is x, the draft model M_d generates a draft tree based on x and the given tree structure T. Draft tree T is defined as follows:

    T = (V, E),   V = ⋃_{i=l+1}^{l+d} ⋃_{j=1}^{n_i} { (ŷ_j^i, p_d(ŷ_j^i)) },    (1)

where V and E are the sets of all nodes and edges. n_i represents the number of sampled tokens in the i-th layer of T, and p_d(ŷ_j^i) is the output probability of token ŷ_j^i under M_d. For each node in T, if it has k children, they are the k tokens greedily sampled from its subsequent token distribution according to p_d. The target model then takes the draft tree and the corresponding tree attention mask as input and returns the next token for each token in T. The root node is the last token of the current prompt, which is bound to be accepted. We obtain the longest accepted candidate, of length A, by comparing the returned next tokens with the draft tree. In the case where all leaf nodes are rejected, the acceptance length is 1.

Given M, M_d and n, for input x, an optimal tree structure T_opt should maximize the mathematical expectation of the acceptance length. Note that T_opt changes as the input changes. Since the optimization goal of the draft model is to make its output distribution close to the target model's distribution, it is intuitive that a token with a larger output probability in the draft model is more likely to be accepted (see Appendix A for verification of this property). Based on this, we can use p_d as the probability that a token will be accepted and thereby approximate the mathematical expectation of the acceptance length. We use E(A) to denote this approximation of the expected acceptance length, which can be calculated by:

    E(A) = Σ_{(ŷ_j^i, p̂_j^i) ∈ T}  ∏_{ŷ ∈ P(ŷ_j^i)} p_d(ŷ),    (2)

where P(ŷ_j^i) is the set of all parent nodes of ŷ_j^i (including itself). Note that the root node is also considered when calculating E(A).

The process of solving T_opt is to find a subtree T of n nodes with the largest E(A) within a complete n-ary tree. Without further conditions, the time complexity of this search is typically Ω(n²), which is unacceptable. However, by leveraging certain properties of the draft tree, we can design a more efficient algorithm to solve this problem.

For simplicity, we define the probability of the prefix as p̂:

    p̂_j^i = ∏_{ŷ ∈ P(ŷ_j^i)} p_d(ŷ).    (3)

The p̂ of the root node is regarded as 1. We can then simplify the calculation of E(A) to the summation of the prefix probabilities of all nodes under the drafting model:

    E(A) = Σ_{(ŷ_j^i, p̂_j^i) ∈ T} p̂_j^i.    (4)

Figure 4 shows a simple example of calculating p̂ and E(A). E(A) should positively correlate with the acceptance length; we discuss their correlation in Section 4.2.

Figure 4: An example of a draft tree containing p̂ in each node. The p_d and p̂ of the root are regarded as 1. The value of E(A) is 3.07. See Appendix B for the specific calculation process.
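The prefix-probability form of Eq. (4) makes E(A) cheap to evaluate. The snippet below is a small sanity check written by us; the tree shape and probabilities are our reading of Figure 4 and Appendix B, and the recurrence implements Eq. (3).

# E(A) via Eq. (4): the sum of prefix probabilities p_hat over all nodes.
# The (parent, p_d) pairs below are our reconstruction of the tree in
# Figure 4 (cf. Appendix B); node 0 is the root with p_d treated as 1.
nodes = [(-1, 1.0),
         (0, 0.5), (0, 0.4),                      # layer 1
         (1, 0.8), (1, 0.1), (2, 0.6), (2, 0.2),  # layer 2
         (3, 0.5), (3, 0.2), (5, 0.5)]            # layer 3
p_hat = []
for parent, p in nodes:
    # Eq. (3): p_hat of a node is its own p_d times the p_hat of its parent.
    p_hat.append(p if parent == -1 else p * p_hat[parent])
print(round(sum(p_hat), 2))   # 3.07, matching Figure 4 and Appendix B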
We use E_sub(T, n) to denote the maximum value of E(A) over all subtrees of T that contain the root node and have n nodes. Note that the root node is not counted when calculating node budgets and mathematical expectations.

We then propose Algorithm 1 to construct T_opt during the drafting phase of each decoding step. We initialize T with a root node. At each drafting step, we greedily sample the n tokens with the largest p̂ from the next-token distributions of the nodes in the last layer of T to construct the next layer; after d drafting steps, T has d × n nodes. Finally, we select the n nodes in T with the largest p̂. It is easy to prove that these n nodes form a subtree of T that contains the root node.

Algorithm 1: Construct an OPT-Tree T_opt

  Input: input sequence x = (x_1, x_2, ..., x_l), draft model M_d,
         number of nodes n, threshold δ.
  Output: a draft tree T_opt.

  Initialize a tree T with root node x_l
  E ← 0
  Output distribution P_d(T) ← M_d(T)
  T ← topk(P_d(T), n)
  while depth of tree D(T) < n and E_sub(T, n) − E > δ do
      // drafting step
      E ← E_sub(T, n)
      Output distribution P_d(T) ← M_d(T)
      T ← topk(P_d(T), n)
  end while
  T_opt ← select the n nodes with the largest p̂ from T

Proof. (1) If these nodes could not form a tree with the root, there would be at least one node v_i whose parent node v_j is not among them. (2) Since v_j is the parent node of v_i, the p̂ of v_j is larger than the p̂ of v_i, so v_j is also selected. (1) and (2) are contradictory, so these nodes must form a subtree of T containing the root node.

We summarize this as Theorem 3.1.

Theorem 3.1. Given the depth of the tree, the top n nodes of the complete n-ary draft tree with the largest p̂ form T_opt.

Theorem 3.2. As the drafting step increases, E_sub(T, n) is monotonically non-decreasing.
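For intuition, the following Python sketch mirrors the structure of Algorithm 1 on a toy interface. Here draft_next_probs is a hypothetical stand-in for one draft-model step (the released implementation instead operates on batched GPU tensors with a KV cache). It grows the tree layer by layer, tracks E_sub(T, n) as the sum of the n largest p̂ values per Theorem 3.1, and stops when one more layer would raise it by at most δ, in line with Theorem 3.2.

import heapq

def construct_opt_tree(draft_next_probs, n, delta, max_depth=32):
    # Nodes are (p_hat, prefix) pairs; the root has p_hat = 1 and is kept
    # implicitly (the paper does not count it toward the node budget).
    frontier = [(1.0, ())]
    candidates = []                 # all non-root nodes drafted so far
    e_prev = 0.0
    for _ in range(max_depth):
        # One drafting step: expand the current layer and keep the n
        # children with the largest prefix probability p_hat (Eq. (3)).
        children = [(p_hat * p, prefix + (tok,))
                    for p_hat, prefix in frontier
                    for tok, p in draft_next_probs(prefix)]
        frontier = heapq.nlargest(n, children)
        candidates.extend(frontier)
        # E_sub(T, n): the best E(A) of an n-node subtree is the sum of
        # the n largest p_hat values (Theorem 3.1).
        e_now = sum(p for p, _ in heapq.nlargest(n, candidates))
        if e_now - e_prev <= delta:  # extra depth no longer pays for itself
            break
        e_prev = e_now
    return heapq.nlargest(n, candidates)  # these n nodes form T_opt

A toy draft_next_probs can be as simple as lambda prefix: [("a", 0.6), ("b", 0.3)]. The top-n selection yields a valid subtree because a child's p̂ never exceeds its parent's, which is exactly the closure property used in the proof above.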
According to Theorem 3.2, we can in theory obtain the desired T_opt by stopping drafting once E_sub(T, n) no longer increases. In practice, however, the draft model brings additional overhead: for autoregressive draft models, the drafting overhead is proportional to the depth of the draft tree. Taking this into consideration, we introduce a threshold δ in the condition for terminating drafting. The value of δ should be controlled between µ and 1, where µ is the time taken for one drafting step divided by the time taken for one decoding step. As an illustration, if one drafting step took 4 ms and one decoding step took 20 ms, µ would be 0.2, and deepening the tree by one layer would pay off only if it raised the expected acceptance length by more than 0.2.

A complete decoding step of M is detailed in Algorithm 2. In practice, both M and M_d utilize a key-value cache (Pope et al., 2023) to calculate attention. This ensures that the actual input length of each drafting step is n, effectively preventing computational bottlenecks during draft-model inference, even when operating under larger tree size budgets.

Algorithm 2: Speculative Decoding with Adaptive Draft Tree Structure

  Input: input sequence x = (x_1, x_2, ..., x_l), target model M,
         draft model M_d, number of nodes n, threshold δ.
  Output: new input sequence x′ = (x_1, x_2, ..., x_{l+A})

  T_opt ← construct the draft tree with n nodes (Algorithm 1)
  mask ← compute the corresponding tree attention mask
  P ← M(T_opt, mask)
  (y^{l+1}, y^{l+2}, ..., y^{l+A}) ← Verify(T_opt, P)
      // Find the longest accepted candidate. If a sequence of length A − 1
      // successfully hits, its next word is also accepted, so the total
      // acceptance length is A.
  x′ ← Concat(x, (y^{l+1}, y^{l+2}, ..., y^{l+A}))

Applying our proposed algorithm to solve T_opt incurs an acceptable time cost; the detailed time cost of each operation is presented in Section 4.4.
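To illustrate the Verify step of Algorithm 2 under greedy sampling, here is a minimal sketch (our own, with hypothetical names; the actual implementation compares batched tensors on the GPU). A node is accepted when its parent is accepted and its token equals the target model's greedy prediction at the parent; the returned sequence appends the target's next token after the longest accepted branch, so at least one token is always generated.

def verify_step(tree_tokens, parents, target_argmax):
    # tree_tokens[i]: token at node i (node 0 is the root, already accepted).
    # parents[i]: parent index of node i (-1 for the root).
    # target_argmax[i]: the target model's greedy next token at node i.
    # Assumes nodes are listed parent-before-child (level-order flattening).
    accepted = {0: []}            # node index -> accepted path from the root
    best = 0
    for i in range(1, len(tree_tokens)):
        p = parents[i]
        if p in accepted and tree_tokens[i] == target_argmax[p]:
            accepted[i] = accepted[p] + [tree_tokens[i]]
            if len(accepted[i]) > len(accepted[best]):
                best = i
    # The target's prediction after the accepted branch is always kept,
    # so the acceptance length A is len(accepted[best]) + 1.
    return accepted[best] + [target_argmax[best]]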
4 Experiments

4.1 Main Results

Setup. We adopt LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B (Touvron et al., 2023) and Vicuna-33B (Zheng et al., 2024) as target models to verify the effectiveness of OPT-Tree. We use a single GeForce RTX 4090 GPU for LLaMA-2-7B, a single L20 GPU for LLaMA-2-13B, and 4 A100-PCIE-40GB GPUs for LLaMA-2-70B and Vicuna-33B. For each target model, we choose one or two smaller models of the same series as draft models. Moreover, we adopt a corresponding EAGLE draft model for each target model. EAGLE (Li et al., 2024) is an effective speculative decoding method that trains additional autoregressive heads as draft models; it uses a well-designed heuristic draft tree structure with 25 nodes, which we refer to as the EAGLE draft tree in our experiments. EAGLE is certified by Xia et al. (2024) as the fastest speculative method in their experiments. For each target-draft model group, we perform speculative decoding with greedy sampling and compare OPT-Tree with the binary tree and the EAGLE tree. The temperature is set to zero. We compare the mean acceptance length and the number of tokens generated per second under the different tree structures; the speedup ratio is calculated from the generation speed. The number of nodes needs to be controlled within a certain range to avoid excessive time consumption in the verification phase. It is treated as a hyperparameter chosen from {25, 50, 60} to maximize the speedup ratio for each target model and GPU resource, except for the EAGLE tree. We evaluate on MT-Bench (Zheng et al., 2024) and GSM8K (Cobbe et al., 2021).

Results. Experimental results are shown in Table 1 and Table 2. Note that using LLaMA-2-1B as the draft model can hardly speed up decoding when the target model is LLaMA-2-7B because the difference in inference time between the two models is too small. EAGLE draft models achieve strong performance with fewer parameters, thus providing better acceleration than the smaller models from the same series as the target models. OPT-Tree outperforms the other tree structures in terms of mean acceptance length in every group of experiments, especially when the performance of the draft model is close to that of the target model (e.g., LLaMA-2-70B with L-7B and Vicuna-33B with Vicuna-7B), indicating its high upper limit. Since OPT-Trees are usually deeper than binary trees and EAGLE trees, they incur more drafting overhead; therefore, from the perspective of tokens per second, the improvement is not as significant as that in mean acceptance length. Tokens per second is also affected by hardware resources and random error. In addition, some method-independent techniques can be used to reduce computation time; for example, the unchanged part of the attention mask in the drafting phase can be initialized once and reused, saving repeated initializations. To make the comparison fairer, we avoid these tricks, consistent with EAGLE's practice. Overall, OPT-Tree outperforms the baselines and can be up to about 3.2 times faster than vanilla autoregressive decoding. The similar performance on both datasets verifies the robustness of the proposed method.

Table 1: Experimental results on MT-Bench. Md being None denotes vanilla autoregressive decoding. "L" and "V" in the Md column stand for "LLaMA-2" and "Vicuna". "MAL" denotes "Mean Acceptance Length". The best results are shown in bold.

M            Md      Tree       MAL    Tokens/s   Speedup
LLaMA-2-7B   None    -          1.00    51.89     1.00
             L-68M   Binary     2.12    68.58     1.32
             L-68M   EAGLE      2.47    77.06     1.49
             L-68M   OPT-Tree   2.58    87.57     1.69
             L-1B    Binary     3.95    46.10     0.89
             L-1B    EAGLE      4.23    47.74     0.92
             L-1B    OPT-Tree   4.88    52.48     1.01
             EAGLE   Binary     3.40   107.91     2.08
             EAGLE   EAGLE      3.73   130.50     2.51
             EAGLE   OPT-Tree   4.36   132.75     2.56
LLaMA-2-13B  None    -          1.00    26.79     1.00
             L-68M   Binary     2.05    40.24     1.50
             L-68M   EAGLE      2.42    46.82     1.75
             L-68M   OPT-Tree   2.58    48.10     1.80
             L-1B    Binary     3.95    37.37     1.39
             L-1B    EAGLE      4.25    40.12     1.50
             L-1B    OPT-Tree   5.20    43.40     1.62
             EAGLE   Binary     3.54    66.24     2.47
             EAGLE   EAGLE      3.80    73.97     2.76
             EAGLE   OPT-Tree   4.35    76.61     2.86
LLaMA-2-70B  None    -          1.00     6.29     1.00
             L-7B    Binary     4.84    11.05     1.76
             L-7B    EAGLE      4.97    11.35     1.80
             L-7B    OPT-Tree   7.74    11.65     1.85
             EAGLE   Binary     3.39    17.02     2.71
             EAGLE   EAGLE      3.67    18.81     2.99
             EAGLE   OPT-Tree   4.06    19.21     3.05
Vicuna-33B   None    -          1.00    11.25     1.00
             V-7B    Binary     4.41    12.49     1.11
             V-7B    EAGLE      4.64    12.99     1.15
             V-7B    OPT-Tree   6.51    13.74     1.22
             EAGLE   Binary     2.35    21.13     1.88
             EAGLE   EAGLE      2.69    24.92     2.21
             EAGLE   OPT-Tree   3.06    25.17     2.24
Table 2: Experimental results on GSM8K. Md being None denotes vanilla autoregressive decoding. "L" and "V" in the Md column stand for "LLaMA-2" and "Vicuna". "MAL" denotes "Mean Acceptance Length". The best results are shown in bold.

M            Md      Tree       MAL    Tokens/s   Speedup
LLaMA-2-7B   None    -          1.00    52.76     1.00
             L-68M   Binary     2.20    73.49     1.39
             L-68M   EAGLE      2.63    85.62     1.62
             L-68M   OPT-Tree   2.78    96.43     1.83
             L-1B    Binary     3.55    40.69     0.77
             L-1B    EAGLE      3.87    44.42     0.84
             L-1B    OPT-Tree   4.46    50.83     0.96
             EAGLE   Binary     3.52   118.15     2.24
             EAGLE   EAGLE      3.83   137.41     2.60
             EAGLE   OPT-Tree   4.68   140.55     2.66
LLaMA-2-13B  None    -          1.00    27.10     1.00
             L-68M   Binary     2.21    45.18     1.67
             L-68M   EAGLE      2.60    52.83     1.95
             L-68M   OPT-Tree   2.81    53.54     1.98
             L-1B    Binary     3.76    36.54     1.35
             L-1B    EAGLE      4.10    37.29     1.38
             L-1B    OPT-Tree   5.10    42.97     1.59
             EAGLE   Binary     3.80    73.30     2.70
             EAGLE   EAGLE      4.06    80.47     2.97
             EAGLE   OPT-Tree   5.03    80.94     2.99
LLaMA-2-70B  None    -          1.00     6.38     1.00
             L-7B    Binary     4.85    11.20     1.76
             L-7B    EAGLE      4.98    11.51     1.80
             L-7B    OPT-Tree   7.62    12.10     1.90
             EAGLE   Binary     3.62    18.63     2.92
             EAGLE   EAGLE      3.91    20.42     3.20
             EAGLE   OPT-Tree   4.55    20.50     3.21
Vicuna-33B   None    -          1.00    10.74     1.00
             V-7B    Binary     4.95    13.15     1.22
             V-7B    EAGLE      4.81    13.38     1.25
             V-7B    OPT-Tree   6.35    13.98     1.30
             EAGLE   Binary     2.82    25.20     2.35
             EAGLE   EAGLE      3.15    28.37     2.64
             EAGLE   OPT-Tree   3.47    28.76     2.68

value of E(A) is rounded. The darker areas in


the four images are basically distributed along the
main diagonal line. When E(A) of the tree is
larger, it also tends to get a more considerable
acceptance length after verification. A stronger
draft model shifts the distribution to the lower
right corner. These phenomena corroborate our
theoretical analysis. In addition, in the LLaMA-
2-70B+LLaMA-2-7B group, high values of E(A)
and A (e.g., E(A) = 14, A = 15) are generally
found, which demonstrates the potential of OPT-
Tree to adapt to stronger draft models and larger
draft tree sizes.

4.3 Scaling the Draft Tree Size


Figure 5: Correlation between E(A) and A. The hori-
zontal axis represents E(A), and the vertical axis rep- We conduct experiments to explore the changes
resents A. Each square shows the number of times the in mean acceptance length with larger tree sizes.
corresponding situation occurs. The darker the color, We compare OPT-Tree with Sequoia (Chen et al.,
the more times it indicates. 2024) using LLaMA-2-7B and LLaMA-2-70B as
target models. Sequoia is a scalable draft tree that
toregressive decoding. The similar performance uses dynamic programming to solve for the tree
on both datasets verifies the robustness of the pro- structure. It requires the target and draft models to
posed method. be used in advance to infer some samples to de-
termine the best structure. The tree structure is
4.2 Correlation between E(A) and A fixed when doing speculative decoding. We use
The theory of OPT-Tree is based on the premise 200 samples in C4 (Raffel et al., 2020) to con-
that E(A) is positively correlated with actual A. struct the Sequoia trees. Temperature is set to 0
We record the values of E(A) and A of OPT- in the experiments.
Tree in about 8000 decoding steps for 4 groups The results are shown in Figure 6. OPT-
of M and Md . Figure 5 shows the results. The Tree outperforms Sequoia under various tree sizes.
Figure 6: Mean acceptance length under different tree
sizes under two sets of experiments.

Figure 7: The average time cost of different operations


For LLaMA-2-7B+LLaMA-2-68M, the mean ac- in each decoding step. Tree operations include initial-
ceptance length increases when the number of izing the tree, updating the tree, calculating the optimal
tree, and calculating the corresponding tree attention
nodes is smaller than 130 for both OPT-Tree and
mask. The drafting time cost is the entire time to cre-
Sequoia. When the number of nodes exceeds ate a draft tree minus the time for tree operations. Its
140, the mean acceptance length increases slowly. main component is the time of multiple draft model in-
For LLaMA-2-70B+LLaMA-2-7B, the growth of ferences. The verification time cost is mainly in the
mean acceptance length with Sequoia stabilizes inference of the target model.
when the number of nodes exceeds 150. In con-
trast, OPT-Tree can consistently improve the mean
erations is independent of the size of the draft
acceptance length even with more than 500 nodes.
and target models, these costs will constitute a
Since LLaMA-2-7B is a strong draft model for
smaller proportion when the models are larger. In
LLaMA-2-70B, the mean acceptance length can
the model setting with the least number of pa-
achieve 10 with an OPT-Tree of 500 nodes. A tree
rameters, the time overhead of tree-related oper-
with 500 nodes costs a large amount of compu-
ations accounts for 6.3%. However, given the im-
tation time for LLaMA-2-70B with A100-PCIE-
provements in draft quality achieved by OPT-Tree,
40GB GPUs, thus being unable to speed up de-
these costs are justified. When the model becomes
coding in our practice. However, this cost may
sufficiently large (e.g., LLaMA-2-70B+LLaMA-
be acceptable if more powerful computational re-
2-7B), the time overhead associated with tree op-
sources are equipped in the future.
erations is negligible.
4.4 Time Cost Analysis 4.5 Impact of the Threshold
In this section, we discuss the overhead of apply- Considering the overhead of the draft model is
ing OPT-Tree. We conduct experiments with 4 proportional to the depth of the tree, the tree that
groups of models on A100 GPUs. The number maximizes the acceptance length does not nec-
of nodes is set to 50. The temperature is set to 0. essarily have the highest speed-up ratio. There-
Figure 7 displays the time cost of the operations fore, we experiment to study the mean acceptance
in speculative decoding with OPT-Tree. Each pie length and tokens/s under different thresholds.
in the figure represents the average wall time of Figure 8 shows the experimental results on
one speculative decoding step, composed of tree- LLaMA-2-7B. The mean acceptance length drops
related operations time, drafting time, and veri- as the threshold grows when using LLaMA-2-68M
fication time. Since the cost of tree-related op- as the draft model. However, there is a slight
4.5 Impact of the Threshold

Since the overhead of the draft model is proportional to the depth of the tree, the tree that maximizes the acceptance length does not necessarily achieve the highest speedup ratio. We therefore study the mean acceptance length and tokens/s under different thresholds. Figure 8 shows the experimental results on LLaMA-2-7B. The mean acceptance length drops as the threshold grows when using LLaMA-2-68M as the draft model, while there is a slight fluctuation for the EAGLE draft model; this is because E(A) and A are not completely equivalent. We calculate µ for each group of models, i.e., the time of one drafting step divided by the time of one decoding step. A threshold that is too large reduces the tree's depth and thus the value of A; on the other hand, a threshold that is too small may make the tree too deep and increase the cost of drafting. When the depth of the tree increases by one but the increment of E(A) does not exceed µ, it is not worth increasing the depth. Therefore, we set a threshold between µ and 1 in practice. LLaMA-2-68M and EAGLE achieve the highest acceleration at δ = 0.2 and δ = 0.8, respectively.

Figure 8: The two figures on the left and right are the mean acceptance length and tokens/s under different thresholds on MT-Bench. The target model is LLaMA-2-7B. The blue and orange dashed lines in the right figure mark the values of µ with LLaMA-2-68M and EAGLE as the draft model, respectively.

4.6 Performance on Non-greedy Settings

In the non-greedy (random sampling) decoding setting, we only modify the set of acceptable tokens during the verification phase. We evaluate OPT-Tree in these non-greedy settings, where the temperature exceeds 0. We perform speculative decoding with OPT-Tree on MT-Bench for all groups of models in Section 4.1 with the temperature set to 1. Table 3 displays the experimental results. The mean acceptance length and the speedup ratio of speculative decoding with OPT-Tree are slightly lower at temperature 1 than at temperature 0. Since the draft tree greedily samples tokens with higher probability, the positive correlation between E(A) and A is weakened under random sampling; it is therefore typical for the acceleration of speculative decoding to drop when the temperature is greater than 0. The improvement of OPT-Tree for the EAGLE draft model is more significant in this setting. Figure 9 shows the changes in mean acceptance length and tokens/s under different temperature values. Both metrics drop as the temperature rises in general, but even at temperature 1, OPT-Tree can still provide a high speedup compared with vanilla autoregressive decoding.

Table 3: Performance of OPT-Tree on MT-Bench with the temperature set to 1. "L" and "V" in the Md column stand for "LLaMA-2" and "Vicuna". "MAL" denotes "Mean Acceptance Length". "†" means using the EAGLE tree.

M            Md       MAL    Tokens/s   Speedup
LLaMA-2-7B   L-68M    2.72    88.90     1.71
             L-1B     5.25    49.76     0.96
             †EAGLE   3.37   101.63     1.96
             EAGLE    4.07   125.79     2.42
LLaMA-2-13B  L-68M    2.26    43.45     1.62
             L-1B     4.23    37.84     1.41
             †EAGLE   3.45    63.13     2.01
             EAGLE    4.13    69.27     2.21
LLaMA-2-70B  L-7B     7.17    11.87     1.89
             †EAGLE   3.51    15.93     2.53
             EAGLE    4.09    18.92     3.01
Vicuna-33B   V-7B     4.91    13.48     1.20
             †EAGLE   2.70    19.91     1.77
             EAGLE    2.89    25.31     2.25

Figure 9: The two figures on the left and right are the mean acceptance length and tokens/s with OPT-Tree under different temperatures on MT-Bench. The target model is LLaMA-2-7B.

4.7 Case Study

We show an example of speculative decoding with an OPT-Tree of 50 nodes on LLaMA-2-70B, with LLaMA-2-7B as the draft model, in Figure 10. The threshold is 0.7 and the temperature is 0. The mean acceptance length is 9.34, and the generation speed is 12.07 tokens per second. Most words (blue text) are generated by the draft model and then verified by the target model. Each red word, together with the continuous blue text in front of it, is generated in a single decoding step of the target model. Red words appear either because the depth of the draft tree is limited or because none of the candidates for that position hits the target. Prepositions (e.g., in, for and with), conjunctions (e.g., and and or), articles (e.g., a and the), punctuation, and other words without apparent practical meaning in the drafts are prone to be rejected during the verification phase. In addition, the beginnings of new sentences in drafts tend to be rejected because they have no solid sequential association with the previous word.

Figure 10: An example of speculative decoding with OPT-Tree on LLaMA-2-70B. Text on a blue background is the input prompt. Blue text represents drafts generated by LLaMA-2-7B and accepted by LLaMA-2-70B. Red text represents the next token for each accepted draft, generated by LLaMA-2-70B during verification.
5 Related Work

Speculative decoding (Stern et al., 2018; Xia et al., 2023; Leviathan et al., 2023; Chen et al., 2023a) accelerates autoregressive decoding by drafting and then verifying, while ensuring consistent output. Drafting methods are mainly divided into independent drafting and self-drafting. Independent drafting leverages an external low-cost model: SpecDec (Xia et al., 2023) trains a non-autoregressive model for drafting, while others (Leviathan et al., 2023; Chen et al., 2023a; Spector and Re, 2023; Chen et al., 2023b, 2024) directly utilize a smaller version of the target model. In addition, REST (He et al., 2023) proposed a retrieval-based drafting method. Self-drafting uses the original information of the target model to draft. Yang et al. (2023) adopt an early-exiting mechanism for drafting; similarly, Zhang et al. (2023) perform adaptive layer skipping in the drafting phase. Lookahead Decoding (Fu et al., 2024) designed an algorithm for parallel drafting and verification. MEDUSA (Cai et al., 2024) trains multiple decoding heads to obtain candidates for multiple steps from the original features in parallel. Considering that different sampling results at each drafting step affect the distribution of subsequent outputs, EAGLE (Li et al., 2024) designed an autoregressive head that introduces the embedding of each word in the drafting stage.

The verification method has evolved from sequence-structured to tree-structured verification. Early work (Stern et al., 2018; Leviathan et al., 2023; Xia et al., 2023; Yang et al., 2023; Zhang et al., 2023; Fu et al., 2024) verifies drafts in the form of one or several sequences. However, as the number of verification tokens increases, there are many prefix duplications between sequences, resulting in redundant computation. To alleviate this problem, recent work (He et al., 2023; Cai et al., 2024; Li et al., 2024; Jeon et al., 2024) uses tree-structured heuristic drafts and designs the corresponding attention masks for parallel verification. Chen et al. (2024) proposed Sequoia, an algorithm for building draft trees that performs well as the tree size scales up.

6 Conclusion

In this paper, we propose a novel and effective method called OPT-Tree to construct adaptive draft tree structures for speculative decoding, applicable to any autoregressive draft model. OPT-Tree maximizes the mathematical expectation of the acceptance length under any limited draft tree size. Experimental results across ten sets of target and draft models on two datasets demonstrate that OPT-Tree outperforms existing draft structures, achieving lossless acceleration of up to 3.2 times compared with vanilla autoregressive decoding. Furthermore, when paired with a robust draft model, OPT-Tree consistently increases the mean acceptance length even with over 500 nodes, showcasing its potential in scenarios with ample computational resources.

Acknowledgments

We want to thank all the anonymous reviewers and Action Editor Ivan Titov for their valuable comments. This work was supported by the National Science Foundation of China (NSFC No. 62206194 and 62276077), the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20220488), the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001), and Tecorigin.

References

Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 95–136, virtual+Dublin. Association for Computational Linguistics.

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774v3.

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023a. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318v1.

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. 2024. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374v2.

Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, and Kevin Chen-Chuan Chang. 2023b. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462v4.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168v2.

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057v1.

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. 2023. REST: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252v2.

Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. 2024. Recursive speculative decoding: Accelerating LLM inference via sampling without replacement. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077v2.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Benjamin Spector and Chris Re. 2023. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623v1.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288v2.

Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. 2023. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925.

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. 2024. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851v3.

Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, and Kangwook Lee. 2023. Predictive pipelined decoding: A compute-latency trade-off for exact LLM decoding. arXiv preprint arXiv:2307.05908v2.

Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. 2023. Draft & Verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168v2.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068v4.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. 2024. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
A The Relationship between the Output Probability of the Draft Model and the Probability of Acceptance

We count the frequency of token acceptance under different draft-model output probabilities when using LLaMA-2-7B with the EAGLE draft model for speculative decoding. Figure 11 shows the experimental results: the acceptance frequency is positively correlated with the output probability.

Figure 11: Acceptance frequency for different ranges of draft-model output probability.

B Calculation Process in Figure 4

To calculate E(A) by p_d (left tree):

E(A) = 4 × (0.5 × 0.8 × 0.5 + 0.2 × 0.8 × 0.5 + 0.5 × 0.6 × 0.4)      —— Layer 3
     + 3 × [0.8 × 0.5 × (1 − 0.5 − 0.2) + 0.5 × 0.1
            + 0.6 × 0.4 × (1 − 0.5) + 0.4 × 0.2]                       —— Layer 2
     + 2 × [0.5 × (1 − 0.8 − 0.1) + 0.4 × (1 − 0.6 − 0.2)]             —— Layer 1
     + 1 × (1 − 0.5 − 0.4)                                             —— Root
     = 3.07

To calculate E(A) by p̂ (right tree):

E(A) = 1 + 0.5 + 0.4 + 0.4 + 0.05 + 0.24 + 0.08 + 0.2 + 0.08 + 0.12
     = 3.07
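As a quick numerical check of the two derivations above, here is our own transcription of the Appendix B arithmetic:

# Layer-by-layer form using p_d (left tree of Figure 4):
layer3 = 4 * (0.5*0.8*0.5 + 0.2*0.8*0.5 + 0.5*0.6*0.4)
layer2 = 3 * (0.8*0.5*(1-0.5-0.2) + 0.5*0.1 + 0.6*0.4*(1-0.5) + 0.4*0.2)
layer1 = 2 * (0.5*(1-0.8-0.1) + 0.4*(1-0.6-0.2))
root = 1 * (1-0.5-0.4)
print(round(layer3 + layer2 + layer1 + root, 2))   # 3.07

# Prefix-probability form using p_hat (right tree): one term per node.
print(round(sum([1, 0.5, 0.4, 0.4, 0.05, 0.24, 0.08, 0.2, 0.08, 0.12]), 2))  # 3.07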
