OPT-Tree - Speculative Decoding With Adaptive Draft Tree Structure
Table 1: Experimental results on MT-Bench. Md being None represents vanilla autoregressive decoding. "L" and
"V" in Md column represent "LLaMA-2" and "Vicuna". "MAL" indicates "Mean Acceptance Length". The best
results are shown in bold.
A100-PCIE-40GB GPUs for LLaMA-2-70B and Vicuna-33B. We choose one or two smaller models in the same version as the draft model for each target model. Moreover, we adopt a corresponding EAGLE draft model for each target model. EAGLE (Li et al., 2024) is an effective speculative decoding method that trains additional autoregressive heads as draft models. It uses a well-designed heuristic draft tree structure with 25 nodes, which we refer to as the EAGLE draft tree in our experiments. EAGLE is certified by Xia et al. (2024) as the fastest speculative method in their experiments.

For each target and draft model group, we perform speculative decoding with greedy sampling and compare OPT-Tree with the binary tree and the EAGLE tree. The temperature is set to zero. We compare the mean acceptance length and the number of tokens generated per second when decoding with different tree structures. The speedup ratio is calculated from the generation speed. The number of nodes needs to be controlled within a certain range to avoid excessive time consumption in the verification phase. It is treated as a hyperparameter chosen from {25, 50, 60} to maximize the speedup ratio for different target models and GPU resources, except for the EAGLE tree. We conduct evaluations on MT-Bench (Zheng et al., 2024) and GSM8K (Cobbe et al., 2021).

Results. Experimental results are shown in Table 1 and Table 2. Note that using LLaMA-2-1B as the draft model can hardly speed up decoding when the target model is LLaMA-2-7B, because the difference in inference time between the two models is too small. EAGLE draft models achieve strong performance with fewer parameters, thus providing better acceleration than the small models in the same series as the target models. OPT-Tree outperforms the other tree structures in terms of mean acceptance length in each group of experiments, especially when the performance of the draft model is close to that of the target model (e.g., LLaMA-2-70B combined with L-7B and Vicuna-33B combined with Vicuna-7B), indicating its high upper limit. Since OPT-Trees are usually deeper than binary trees and EAGLE trees, they incur more overhead when drafting. Therefore, from the perspective of tokens per second, the improvement is not as significant as that in mean acceptance length. Tokens per second are also affected by different hardware resources and random errors. In addition, some method-independent techniques can be used to reduce computation time. For example, the unchanged part of the attention mask in the drafting phase can be initialized only once and reused many times, saving the cost of repeated initialization. To make a fairer comparison, we avoid these tricks in our experiments, consistent with EAGLE's practice. Overall, OPT-Tree outperforms the baselines. It can be up to about 3.2 times faster than vanilla autoregressive decoding.
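The two reported metrics can be sketched as follows. This is an illustrative computation over hypothetical decoding logs, not the paper's code: `accepted_per_step` is an assumed list of how many draft tokens the target model accepted at each decoding step.

```python
def mean_acceptance_length(accepted_per_step):
    # MAL counts tokens produced per target-model forward pass:
    # the accepted draft tokens plus the one token the target model
    # itself emits during verification.
    total_tokens = sum(a + 1 for a in accepted_per_step)
    return total_tokens / len(accepted_per_step)

def speedup_ratio(tokens_per_second, baseline_tokens_per_second):
    # The speedup ratio is computed from generation speed rather than
    # from MAL alone, since deeper trees add drafting overhead.
    return tokens_per_second / baseline_tokens_per_second

print(mean_acceptance_length([4, 2, 5, 3]))  # (5 + 3 + 6 + 4) / 4 = 4.5
print(speedup_ratio(12.07, 4.0))             # 3.0175
```

The gap between a high MAL and a lower speedup ratio is exactly the drafting/verification overhead discussed above.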
[Table 2 layout: two column groups, each with columns M, Md, Tree, MAL, Tokens/s, Speedup; table data omitted in this extract.]
Table 2: Experimental results on GSM8K. Md being None represents vanilla autoregressive decoding. "L" and
"V" in Md column represent "LLaMA-2" and "Vicuna". "MAL" indicates "Mean Acceptance Length". The best
results are shown in bold.
fluctuation for the EAGLE draft model. This is because E(A) and A are not completely equivalent. We calculate µ for each group of models, which is the time of one drafting step divided by the time of one decoding step. A threshold that is too large will reduce the tree's depth, thus reducing the value of A. On the other hand, a threshold that is too small may make the tree too deep and increase the cost of drafting. When the depth of the tree increases by one but the increment of E(A) does not exceed µ, it is not worth increasing the depth. Therefore, we set the threshold between µ and 1 in practice. LLaMA-2-68M and EAGLE achieve the highest acceleration when δ = 0.2 and δ = 0.8, respectively.

Figure 9: The two figures on the left and right show the mean acceptance length and tokens/s of OPT-Tree with different temperatures on MT-Bench. The target model is LLaMA-2-7B.

4.6 Performance on Non-greedy Settings

In the decoding setting of non-greedy sampling (random sampling), we only modify the acceptable tokens during the verification phase. We evaluate OPT-Tree on these non-greedy settings, where the temperature exceeds 0. We perform speculative decoding with OPT-Tree on the MT-Bench dataset for all groups of models in 4.1 with the temperature set to 1. Table 3 displays the experimental results. The mean acceptance length and the speedup ratio of speculative decoding with OPT-Tree are slightly lower when the temperature is set to 1 than when it is set to 0. Since the draft tree greedily samples tokens with higher probability, the positive correlation between E(A) and A is weakened under random sampling. It is therefore typical for the acceleration of speculative decoding to drop when the temperature is greater than 0. The improvement of OPT-Tree for the EAGLE draft model is more significant in this setting. Figure 9 shows the specific changes in mean acceptance length and tokens/s with different temperature values. Both metrics drop as the temperature rises in general. But even when the temperature is set to 1, OPT-Tree can still provide a high speedup compared to vanilla autoregressive decoding.

4.7 Case Study

Figure 10: An example of speculative decoding with OPT-Tree on LLaMA-2-70B. Text on a blue background is the input prompt. Blue text represents drafts generated by LLaMA-2-7B and accepted by LLaMA-2-70B. Red text represents the next token for each accepted draft, which is generated by LLaMA-2-70B during verification.

We show an example of speculative decoding with an OPT-Tree of 50 nodes on LLaMA-2-70B with LLaMA-2-7B as the draft model in Figure 10. The threshold is 0.7, and the temperature is 0. The mean acceptance length is 9.34, and the generation speed is 12.07 tokens per second. Most words (blue text) are generated by the draft model and then verified by the target model. Each couple of red words and the continuous blue text in front of
it is generated in a single decoding step of the target model. The appearance of red words is either because the depth of the draft tree is limited or because none of the candidates for this position hits the target. Prepositions (e.g., in, for and with), conjunctions (e.g., and and or), articles (e.g., a and the), punctuation and other words with no apparent practical meaning in the drafts are prone to be rejected during the verification phase. In addition, the beginnings of new sentences in drafts tend to be rejected because they have no solid sequential association with the previous word.

5 Related Work

Speculative decoding (Stern et al., 2018; Xia et al., 2023; Leviathan et al., 2023; Chen et al., 2023a) accelerates autoregressive decoding by drafting and then verifying while ensuring consistent output. Drafting methods are mainly divided into independent drafting and self-drafting. Independent drafting leverages an external low-cost model. SpecDec (Xia et al., 2023) trains a non-autoregressive model for drafting, while others (Leviathan et al., 2023; Chen et al., 2023a; Spector and Re, 2023; Chen et al., 2023b, 2024) directly utilize a smaller version of the target model. In addition, REST (He et al., 2023) proposed a retrieval-based drafting method. Self-drafting uses the original information of the target model to draft. Yang et al. (2023) adopt an early-exiting mechanism for drafting. Similarly, Zhang et al. (2023) perform adaptive layer skipping in the drafting phase. Lookahead Decoding (Fu et al., 2024) designed an algorithm for parallel drafting and verification. MEDUSA (Cai et al., 2024) trains multiple decoding heads to obtain candidates for multiple steps from the original features in parallel. Considering that different sampling results at each step of drafting will affect the distribution of subsequent output, EAGLE (Li et al., 2024) designed an autoregressive head, which introduces the embedding of each word in the drafting stage.

The verification method has evolved from sequence-structured verification to tree-structured verification. Early work (Stern et al., 2018; Leviathan et al., 2023; Xia et al., 2023; Yang et al., 2023; Zhang et al., 2023; Fu et al., 2024) verifies drafts in the form of one or several sequences. However, as the number of verification tokens increases, there are a large number of prefix duplications between sequences, resulting in redundant calculations. To alleviate this problem, recent work (He et al., 2023; Cai et al., 2024; Li et al., 2024; Jeon et al., 2024) uses tree-structured heuristic drafts and designs the corresponding attention masks for parallel verification. Chen et al. (2024) proposed Sequoia, an algorithm for building draft trees, which performs well as the tree size scales up.
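The tree-structured verification idea can be made concrete with a small sketch. This is an illustrative construction under assumed conventions (a flat `parents` array encoding the draft tree), not any specific system's implementation: each draft node may attend only to its root-to-node path, so all branches share one forward pass without prefix duplication.

```python
def tree_attention_mask(parents):
    """Build a boolean attention mask for a draft tree.

    parents[i] is the index of node i's parent; -1 marks the root.
    mask[j][i] is True iff node j may attend to node i, i.e. i lies on
    node j's path from the root (including j itself).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for j in range(n):
        i = j
        while i != -1:          # walk up to the root
            mask[j][i] = True   # node j attends to ancestor i (and itself)
            i = parents[i]
    return mask

# A 5-node draft tree: root 0 with children 1 and 2; node 1 has children 3 and 4.
mask = tree_attention_mask([-1, 0, 0, 1, 1])
print(mask[3])  # [True, True, False, True, False]: path {0, 1, 3} only
```

Node 3 sees only its own branch (root, node 1, itself) and nothing from the sibling branch under node 2, which is what lets all candidate sequences be verified in parallel.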
Figure 11: Acceptance frequency for different ranges of draft model output probability.
We counted the frequency of token acceptance under different draft model output probabilities when using LLaMA-2-7B and the EAGLE draft model for speculative decoding. Figure 11 shows the experimental results. The acceptance frequency is positively correlated with the output probability.
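The bookkeeping behind this kind of plot can be sketched as follows. The records here are invented for illustration, not the paper's logs: each record pairs a draft token's output probability with whether the target model accepted it, and tokens are bucketed by probability.

```python
def acceptance_by_bin(records, num_bins=5):
    """Acceptance frequency per draft-probability bin.

    records: iterable of (draft_probability, was_accepted) pairs gathered
    during speculative decoding. Returns one frequency per bin, or None
    for bins with no observations.
    """
    hits = [0] * num_bins
    totals = [0] * num_bins
    for prob, accepted in records:
        b = min(int(prob * num_bins), num_bins - 1)  # clamp prob == 1.0
        totals[b] += 1
        hits[b] += int(accepted)
    return [h / t if t else None for h, t in zip(hits, totals)]

# Toy records showing the positive correlation described above.
freqs = acceptance_by_bin([(0.05, False), (0.15, False), (0.55, True),
                           (0.65, True), (0.90, True), (0.95, True)])
print(freqs)  # [0.0, None, 1.0, 1.0, 1.0]
```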
E(A) = 4 × (0.5 × 0.8 × 0.5 + 0.2 × 0.8 × 0.5 + 0.5 × 0.6 × 0.4) —— Layer 3
+ 3 × [0.8 × 0.5 × (1 − 0.5 − 0.2) + 0.5 × 0.1
+ 0.6 × 0.4 × (1 − 0.5) + 0.4 × 0.2] —— Layer 2
+ 2 × [0.5 × (1 − 0.8 − 0.1) + 0.4 × (1 − 0.6 − 0.2)] —— Layer 1
+ 1 × (1 − 0.5 − 0.4) —— Root
= 3.07
E(A) = 1 + 0.5 + 0.4 + 0.4 + 0.05 + 0.24 + 0.08 + 0.2 + 0.08 + 0.12
= 3.07
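The two expansions above agree because E(A) is simply the sum, over every node in the draft tree, of the probability that the path from the root to that node is accepted. A short numerical check, with the tree probabilities read off from the equations above (node layout inferred from the terms):

```python
def expected_acceptance(node, path_prob=1.0):
    """Sum of root-to-node path probabilities over all nodes in the tree.

    node is (acceptance_probability, children); the root has probability 1
    since it is always accepted.
    """
    prob, children = node
    p = path_prob * prob
    return p + sum(expected_acceptance(child, p) for child in children)

tree = (1.0, [                                   # root
    (0.5, [(0.8, [(0.5, []), (0.2, [])]),        # branch giving 0.4, 0.2, 0.08
           (0.1, [])]),                          # 0.05
    (0.4, [(0.6, [(0.5, [])]),                   # 0.24, 0.12
           (0.2, [])]),                          # 0.08
])
print(round(expected_acceptance(tree), 2))  # 3.07
```

Summing per node (the second form above) and weighting each layer by its depth increment (the first form) are two groupings of the same terms, so both evaluate to 3.07.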