License: arXiv.org perpetual non-exclusive license
arXiv:2402.02057v1 [cs.LG] 03 Feb 2024

Break the Sequential Dependency of LLM Inference Using
Lookahead Decoding

Yichao Fu    Peter Bailis    Ion Stoica    Hao Zhang
Abstract

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead Decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the number of total decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead Decoding can speed up autoregressive decoding by up to 1.8x on MT-Bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is available at https://github.com/hao-ai-lab/LookaheadDecoding

Machine Learning, ICML

1 Introduction

Large language models (LLMs) are transforming the AI industry. As they are increasingly integrated into diverse applications such as search (Team et al., 2023) and chatbots (Ouyang et al., 2022), generating long sequences at low latency using LLMs is becoming a significant requirement. However, current LLMs (Touvron et al., 2023a, b; Jiang et al., 2023; OpenAI, 2023) generate text based on autoregressive decoding, which falls short in efficiency, primarily for two reasons. First, autoregressive decoding generates only one token at a time. Hence, the overall generation time is proportional to the number of decoding steps. Second, each decoding step largely underutilizes the parallel processing capabilities of modern accelerators (e.g., GPUs). Given the pressing need for low latency in various applications, improving autoregressive decoding remains a central challenge.

Several approaches have been proposed. One such approach is speculative decoding (Chen et al., 2023; Leviathan et al., 2023) and its variants (He et al., 2023; Stern et al., 2018; Cai et al., 2024; Li et al., 2023; Liu et al., 2023; Miao et al., 2023). These methods all follow a guess-and-verify approach: they use a draft model to speculate several subsequent tokens and then use the original (base) LLM to verify these tokens in parallel. Since the draft model requires far fewer resources and the cost of verifying multiple tokens in parallel is similar to the cost of generating a single token, these methods can achieve considerable speedups. However, their speedups are bounded by the token acceptance rate (§4.1), i.e., the fraction of tokens generated by the draft model that pass the verification test of the base model. This is because every token that fails verification needs to be regenerated by the base model. In the worst case, if most proposed tokens fail verification, these methods may slow down the decoding process. Therefore, achieving a high acceptance rate is essential for these methods. Unfortunately, training a draft model to achieve a high acceptance rate is non-trivial, and the trained draft model does not generalize across base models and datasets.

To address these problems, this paper develops Lookahead Decoding. We build upon a key observation: autoregressive decoding can be equivalently formulated as solving a non-linear system via the fixed-point Jacobi iteration method (§2), which we term Jacobi decoding (Santilli et al., 2023). Each Jacobi decoding step can generate multiple tokens in parallel at different positions. Although these tokens may appear at incorrect positions, we can leverage this parallel generation approach to have the LLM generate several disjoint $n$-grams in parallel in a single step. These $n$-grams could potentially be integrated into future parts of the generated sequence, pending verification by the base model to maintain the output distribution.

Lookahead Decoding takes advantage of a particular characteristic of autoregressive decoding: it is bounded by memory bandwidth (as each generated token depends on all tokens before it) rather than by compute. Lookahead Decoding uses the otherwise idle cycles to generate and verify $n$-grams (subsequent tokens) at virtually no additional cost. In a nutshell, Lookahead Decoding consists of a lookahead branch that generates $n$-grams and a verification branch that verifies $n$-grams, both executing in a single step. To improve efficiency, we use an $n$-gram pool to cache the historical $n$-grams generated so far. This way, Lookahead Decoding can significantly reduce the latency of LLM inference just by exploiting the compute resources that autoregressive decoding would leave unused. More importantly, Lookahead Decoding scales with compute: we show that it can linearly reduce the number of decoding steps relative to the log(FLOPs) allocated per step.

We have implemented the algorithm in both Python and CUDA. The implementation is compatible with memory-efficient attention algorithms (e.g., FlashAttention (Dao, 2023)) and supports various sampling methods without changing the output distribution. We also scale it to multiple GPUs, resulting in Lookahead Parallelism. We evaluate Lookahead Decoding on the popular LLaMA-2 (Touvron et al., 2023b) models. It achieves a 1.8x speedup on the challenging multi-turn chat dataset MT-Bench (Zheng et al., 2023) and up to a 4x speedup in code completion tasks with Lookahead Parallelism on 8 GPUs. Lookahead Decoding shows significant potential for lowering latency in latency-sensitive tasks. Our contributions are summarized as follows.

  • We design Lookahead Decoding, a new lossless, parallel decoding algorithm to accelerate LLM inference without needing any auxiliary component.

  • We reveal Lookahead Decoding’s scaling behavior: it linearly reduces the number of decoding steps according to per-step $\log(\text{FLOPs})$. This enables trade-offs between the number of decoding steps and per-step FLOPs, making it future-proof.

  • We show it benefits from the latest memory-efficient attentions and is easily parallelizable by developing its distributed CUDA implementations.

  • We evaluate Lookahead Decoding and demonstrate its effectiveness under different settings.

2 Background

In this section, we formulate both autoregressive and Jacobi decoding through the lens of solving nonlinear systems.

Causal Attention in Decoder Models. Most contemporary LLMs are composed of two core components: token-wise modules (including MLP and normalization (Ba et al., 2016; Zhang & Sennrich, 2019)) and attention (Vaswani et al., 2023) modules. Tokens interact with each other in the attention modules, while in other token-wise modules, they are processed without exchanging information with each other.

The attention layer encompasses three input elements: query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$, with the $i$-th token in each denoted as $\mathbf{Q}_i$, $\mathbf{K}_i$, and $\mathbf{V}_i$, respectively. The attention layer executes the following operation: $\mathbf{O}=\textrm{softmax}\left(\mathbf{Q}\mathbf{K}^{T}\right)\mathbf{V}$. A lower triangular mask applied to $\mathbf{Q}\mathbf{K}^{T}$ in causal attention (specific to decoder models) ensures that $\mathbf{O}_i$ is calculated only from $\mathbf{Q}_i$ and $\mathbf{K}_j$, $\mathbf{V}_j$ where $j \leq i$. Because all other layers in the LLM perform token-wise operations, for any given model input $\mathbf{x}$ and output $\mathbf{o}$, $\mathbf{o}_i$ (the $i$-th token in $\mathbf{o}$) is exclusively influenced by $\mathbf{x}_j$ (the $j$-th token in $\mathbf{x}$) where $j \leq i$.
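The causality property above can be checked numerically. The following is a minimal sketch, assuming a single attention head, no scaling factor (matching the formula in the text), and plain Python lists in place of tensors; the function name `causal_attention` and the toy inputs are illustrative and not taken from the paper's implementation.

```python
import math

def causal_attention(Q, K, V):
    """Single-head attention with a lower-triangular (causal) mask.

    Q, K, V are lists of equal length; each element is a list of floats.
    Row i of the output mixes only V[j] with j <= i.
    """
    out = []
    for i, q in enumerate(Q):
        # Scores against keys 0..i only; keys j > i are masked out.
        scores = [sum(a * b for a, b in zip(q, K[j])) for j in range(i + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        row = [
            sum(w / total * V[j][d] for j, w in enumerate(weights))
            for d in range(len(V[0]))
        ]
        out.append(row)
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
base = causal_attention(Q, K, V)

# Perturb only the last token's key and value: earlier outputs must not change.
K2 = K[:2] + [[9.0, -9.0]]
V2 = V[:2] + [[-7.0, 7.0]]
perturbed = causal_attention(Q, K2, V2)
```

Comparing `base` and `perturbed` row by row shows that rows 0 and 1 are identical while row 2 differs, which is the dependency structure that the decoding formulations below rely on.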

Autoregressive Decoding in LLMs. LLMs take discrete integer sequences as inputs, where each integer represents a token. We denote by $\mathbf{x}=(x_1,x_2,\dots,x_s)\in\mathbb{N}^{s}$ the input of the model, of length $s$, and by $\mathbf{x}_{1:m}^{t}=(x_1,x_2,\dots,x_m)$ a slice of $\mathbf{x}$ of length $m$ at step $t$. The LLM's output characterizes the probability distribution of the next token. The probability for the $s$-th token (i.e., the output of the $(s-1)$-th token) is decided by all previous input tokens, represented as $P_M(x_s \mid \mathbf{x}_{1:s-1})$. Then, the next input token $x_s$ is obtained by sampling from $P_M(x_s \mid \mathbf{x}_{1:s-1})$ using different methods (e.g., greedy, top-K, and top-P (Kool et al., 2020; Holtzman et al., 2020)). When using greedy sampling, the next token is selected by applying an $\arg\max$ function on $P_M$.

We define $\mathbf{x}^0$ as the prompt tokens given by the user. The LLM needs to generate an output sequence (of length $m$) from $\mathbf{x}^0$. Denote $y_i$ as the token generated at step $i$. The autoregressive decoding process of $m$ tokens can be seen as solving the following $m$ problems one by one (assume greedy sampling):

$$
\left\{\begin{array}{l}
y_1=\arg\max P_M(y_1 \mid \mathbf{x}^0)\\
y_2=\arg\max P_M(y_2 \mid y_1,\mathbf{x}^0)\\
\dots\\
y_m=\arg\max P_M(y_m \mid \mathbf{y}_{1:m-1},\mathbf{x}^0)
\end{array}\right. \tag{1}
$$
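To make the sequential dependency in Eq. 1 concrete, here is a small sketch of greedy autoregressive decoding. It assumes a toy stand-in for the model: `greedy_next` is a hypothetical deterministic hash of the prefix that plays the role of $\arg\max P_M(\cdot \mid \cdot)$, and `VOCAB_SIZE` is an arbitrary illustrative value. It is not an LLM and not the paper's code.

```python
VOCAB_SIZE = 50  # arbitrary toy vocabulary size

def greedy_next(prefix):
    """Toy stand-in for argmax P_M(. | prefix): a deterministic hash of the prefix."""
    h = 0
    for tok in prefix:
        h = (h * 31 + tok + 1) % 1_000_003
    return h % VOCAB_SIZE

def autoregressive_decode(prompt, m):
    """Solve the m equations of Eq. 1 one at a time: one model call per token."""
    y = []
    for _ in range(m):
        # y_i depends on the prompt and on every previously generated token.
        y.append(greedy_next(prompt + y))
    return y

prompt = [3, 14, 15]
tokens = autoregressive_decode(prompt, 8)
```

Each loop iteration needs the result of the previous one, so generating $m$ tokens takes exactly $m$ sequential model calls in this formulation.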

Guess-And-Verify Paradigm. The Guess-And-Verify decoding paradigm speculates multiple potential future tokens and subsequently confirms the correctness of these speculations within a single decoding step. Take speculative decoding with greedy sampling as an example: at step $t$, with the prompt $\mathbf{x}^0$ and tokens $\mathbf{y}_{1:t-1}$ generated so far, we can use a draft model to autoregressively generate a draft sequence $\mathbf{y}_{t:t+n-1}$ of length $n$. Because $\mathbf{y}_{t:t+n-1}$ is known a priori, we then use the LLM to solve Eq. 2 in parallel, obtaining $\mathbf{y}'_{t:t+n}$. Then, we verify if $y_{t+i}$ is equal to $y'_{t+i}$ for each $i$ from $i=0$ to $i=n-1$. If there is a match, we accept this token and proceed; otherwise, we stop checking and drop subsequent tokens. Finally, we update $\mathbf{y}$ with all accepted tokens.

$$
\left\{\begin{array}{l}
y_t'=\arg\max P_M(y_t \mid \mathbf{y}_{1:t-1},\mathbf{x}^0)\\
y_{t+1}'=\arg\max P_M(y_{t+1} \mid \mathbf{y}_{1:t},\mathbf{x}^0)\\
\dots\\
y_{t+n}'=\arg\max P_M(y_{t+n} \mid \mathbf{y}_{1:t+n-1},\mathbf{x}^0)
\end{array}\right. \tag{2}
$$
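The greedy verification rule can be sketched as follows. This is an illustrative sketch under stated assumptions: `greedy_next` is a hypothetical toy hash standing in for the base LLM, the list comprehension emulates what would be a single parallel forward pass, and appending the model's own token $y'$ at the first unverified position is the usual speculative decoding convention, which the text above does not spell out.

```python
VOCAB_SIZE = 50  # arbitrary toy vocabulary size

def greedy_next(prefix):
    """Toy stand-in for the base model's greedy (argmax) next-token choice."""
    h = 0
    for tok in prefix:
        h = (h * 31 + tok + 1) % 1_000_003
    return h % VOCAB_SIZE

def verify_draft(prompt, accepted, draft):
    """Greedy verification of a draft sequence (Eq. 2); returns tokens to append."""
    context = prompt + accepted
    # y'_{t+i} for i = 0..n, each conditioned on the draft prefix before it.
    targets = [greedy_next(context + draft[:i]) for i in range(len(draft) + 1)]
    new_tokens = []
    for i, tok in enumerate(draft):
        if tok != targets[i]:
            break  # first mismatch: drop this and all later draft tokens
        new_tokens.append(tok)
    # Assumption: also keep the model's own output at the first unverified position.
    new_tokens.append(targets[len(new_tokens)])
    return new_tokens

prompt = [3, 14, 15]
reference = []
for _ in range(6):
    reference.append(greedy_next(prompt + reference))

good_draft = reference[:4]
bad_draft = reference[:2] + [(reference[2] + 1) % VOCAB_SIZE] + reference[3:4]
all_accepted = verify_draft(prompt, [], good_draft)
partly_accepted = verify_draft(prompt, [], bad_draft)
```

With `good_draft`, all four draft tokens pass and the step yields five tokens; with `bad_draft`, only the first two pass and the step yields three, in both cases matching what plain autoregressive decoding would have produced.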

As stated in §1, these approaches depend on a good draft model, which is hard to obtain and cannot generalize.

Jacobi Decoding. By defining $f(y_i,\mathbf{y}_{1:i-1},\mathbf{x}^0)=y_i-\arg\max P_M(y_i \mid \mathbf{y}_{1:i-1},\mathbf{x}^0)$, we can transform Eq. 1 into the following non-linear system of equations (Song et al., 2021; Santilli et al., 2023):

$$
\left\{\begin{array}{l}
f(y_1,\mathbf{x}^0)=0\\
f(y_2,y_1,\mathbf{x}^0)=0\\
\dots\\
f(y_m,\mathbf{y}_{1:m-1},\mathbf{x}^0)=0
\end{array}\right. \tag{3}
$$

We can solve this non-linear system using Jacobi iteration by iteratively updating all $y_i$ from a random initial guess $\mathbf{y}^0$, along the trajectory $\mathbf{y}^1,\dots,\mathbf{y}^t,\dots$, until converging to the fixed-point solution $\mathbf{y}^m$. We detail this algorithm, termed Jacobi decoding, in Appendix Algorithm 1. This process is guaranteed to return the solution of all $m$ variables $y_i$ in at most $m$ iterations, as the very first token of each Jacobi update matches autoregressive decoding. Sometimes, more than one token might be correctly generated in a single iteration, potentially reducing the number of decoding steps. It is worth noting that, as $\mathbf{y}^t$ is generated based on the past value $\mathbf{y}^{t-1}$ on the trajectory, any two adjacent tokens from $\mathbf{y}^{t-1}$ and $\mathbf{y}^t$ can form a meaningful 2-gram.
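The iteration described above can be sketched in a few lines. This is a simplified illustration and not the paper's Algorithm 1: `greedy_next` is a hypothetical toy hash standing in for the LLM, and the list comprehension emulates the single parallel forward pass that updates all $m$ positions at once.

```python
import random

VOCAB_SIZE = 50  # arbitrary toy vocabulary size

def greedy_next(prefix):
    """Toy stand-in for argmax P_M(. | prefix)."""
    h = 0
    for tok in prefix:
        h = (h * 31 + tok + 1) % 1_000_003
    return h % VOCAB_SIZE

def jacobi_decode(prompt, m, init_guess):
    """Fixed-point Jacobi iteration on Eq. 3; returns (solution, iterations run)."""
    y = list(init_guess)
    for iteration in range(1, m + 1):
        # Every position is updated from the PREVIOUS iterate, so the m
        # updates are independent of each other within one iteration.
        new_y = [greedy_next(prompt + y[:i]) for i in range(m)]
        if new_y == y:  # fixed point reached early
            return y, iteration
        y = new_y
    # After k iterations the first k tokens are correct, so m iterations suffice.
    return y, m

prompt = [3, 14, 15]
m = 8
rng = random.Random(0)
init_guess = [rng.randrange(VOCAB_SIZE) for _ in range(m)]
solution, iterations = jacobi_decode(prompt, m, init_guess)

# Reference: plain autoregressive decoding with the same toy model.
reference = []
for _ in range(m):
    reference.append(greedy_next(prompt + reference))
```

The result equals the autoregressive output regardless of the initial guess. With a hash-like toy model there is little chance of guessing later tokens correctly, so the sketch typically needs close to $m$ iterations, which mirrors the limitation discussed next.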

Limitations of Jacobi Decoding. Empirically, we observe Jacobi decoding can hardly reduce decoding steps, even if it can generate multiple tokens per step. This is because the generated tokens are often put in the wrong positions of the sequence, and correctly placed tokens are frequently replaced by subsequent Jacobi iterations. These prevent it from achieving wall-clock speedup.

3 Lookahead Decoding

Figure 1: Workflow of Lookahead Decoding with $W=5$, $N=3$, and $G=2$. For each decoding step, we do the following: (1) generate one token at each position in the lookahead branch; (2) verify and accept 3-grams (searched from the 3-gram pool) with the verification branch; (3) collect and cache newly generated 3-grams in the pool from lookahead branch trajectories; (4) update the lookahead branch to maintain a fixed window size.

Lookahead Decoding leverages Jacobi decoding’s ability to generate many tokens in one step but addresses its limitation. Fig. 1 illustrates its workflow. The key design in Lookahead Decoding is to keep track of the trajectory of Jacobi decoding and generate $n$-grams from this trajectory. This is achieved by maintaining a fixed-sized 2D window, with the two dimensions corresponding to the sequence and the time axis, respectively, to generate multiple disjoint $n$-grams from the Jacobi iteration trajectory in parallel. We call this process the lookahead branch. In addition, Lookahead Decoding introduces an $n$-gram pool to cache the $n$-grams generated along the trajectory. Promising $n$-gram candidates are verified later by a dedicated verification branch to preserve the LLM’s output distribution; if they pass verification, those disjoint $n$-grams are integrated into the sequence. The detailed algorithm is shown in Algorithm 2 in the Appendix.

3.1 Lookahead Branch

Lookahead Decoding uses a fixed-sized 2D window for efficient $n$-gram generation. In contrast to the original Jacobi decoding, which only uses the history tokens from the last step (or equivalently, it generates 2-grams), Lookahead Decoding generates many $n$-grams, with $n\geq 2$, in parallel by using the $n-1$ past steps’ history tokens, effectively leveraging more information from the trajectory. The fixed-sized 2D window in the lookahead branch is characterized by two parameters: (1) $W$ defines the lookahead size into future token positions to conduct parallel decoding; (2) $N$ defines the lookback steps into the past Jacobi trajectory to retrieve $n$-grams. See Algorithm 2 for a detailed process.

An example of the lookahead branch with $W=5$ and $N=4$ is in Fig. 2 (b), in which we look back $N-1=3$ steps and look ahead $5$ tokens for each step. The blue token with the digit 0 is the current step’s ($t$) input, and the orange, green, and red tokens were generated in previous lookahead branches at steps $t-3$, $t-2$, and $t-1$, respectively. The digit on each token shows its position relative to the current input (i.e., the blue one labeled as 0). In the present stage, we perform a modified Jacobi iteration to generate new tokens for all 5 positions, following the trajectory formed by the preceding 3 steps. Once generated, we collect and cache them in the $n$-gram pool ($n=4$); for instance, a 4-gram consists of the orange token at position 1, the green token at position 2, the red token at position 3, and a newly generated token.
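The collection step in this example can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the window is stored as $N-1$ rows of $W$ tokens (oldest step first), the token ids are arbitrary placeholders, and keying the pool by each $n$-gram's first token is an assumption made here because the verification branch later looks up $n$-grams by their first token.

```python
from collections import defaultdict

def collect_ngrams(window, new_tokens):
    """Form one n-gram per window column: N-1 history tokens plus the new token.

    `window` holds N-1 rows (oldest first), each with W tokens; `new_tokens`
    holds the W tokens produced by the current step.
    """
    rows = window + [new_tokens]
    return [tuple(row[i] for row in rows) for i in range(len(new_tokens))]

def cache_ngrams(pool, ngrams):
    """Store n-grams in the pool, keyed by their first token for later lookup."""
    for gram in ngrams:
        pool[gram[0]].add(gram)

# W = 5, N = 4: three history rows (steps t-3, t-2, t-1) of five tokens each.
window = [
    [11, 12, 13, 14, 15],  # "orange" row, step t-3
    [21, 22, 23, 24, 25],  # "green" row, step t-2
    [31, 32, 33, 34, 35],  # "red" row, step t-1
]
new_tokens = [41, 42, 43, 44, 45]  # tokens generated at the current step

pool = defaultdict(set)
ngrams = collect_ngrams(window, new_tokens)
cache_ngrams(pool, ngrams)
```

Here `ngrams[0]` is `(11, 21, 31, 41)`, i.e. the first orange, green, and red tokens followed by the first newly generated token, and one step contributes $W=5$ such 4-grams to the pool.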

The most outdated tokens in both dimensions (time and sequence) will be removed, and newly generated tokens will be appended to the lookahead branch to maintain a fixed window size for each step. For example, we will remove all orange and green tokens with position 1 in Fig. 2. We then form a new lookahead branch with green tokens with indices 2, 3, 4, 5, all red tokens, and all newly generated tokens for the next step.

3.2 Verification Branch

Lookahead Decoding preserves the output distribution via its verification branch. We first discuss how to verify in greedy sampling. Recall that in speculative decoding, verification is performed by sending the draft tokens to the LLM to get an output for each draft token, then progressively checking if the last token’s corresponding output, generated by the target LLM, exactly matches the draft token itself (§2). The verification branch in Lookahead Decoding resembles this process, except that it verifies many draft $n$-gram candidates in parallel. In particular, we first look up “promising” $n$-grams in the $n$-gram pool by checking if an $n$-gram starts with a token that exactly matches the last token of the current ongoing sequence. We then use the LLM to verify all these $n$-grams in parallel, in a similar fashion to speculative decoding. See Algorithm 3 in the Appendix for the detailed procedures.
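The greedy case can be sketched as a pool lookup followed by per-candidate verification. This is an illustrative sketch, not Algorithm 3: `greedy_next` is a hypothetical toy hash standing in for the LLM, the loops emulate one batched forward pass, `max_candidates` plays the role of the cap $G$ on candidates that this section introduces later, and appending the model's own next token after the longest verified prefix is an assumption so that every step yields at least one token.

```python
VOCAB_SIZE = 50  # arbitrary toy vocabulary size

def greedy_next(prefix):
    """Toy stand-in for the base model's greedy (argmax) next-token choice."""
    h = 0
    for tok in prefix:
        h = (h * 31 + tok + 1) % 1_000_003
    return h % VOCAB_SIZE

def lookup_candidates(pool, last_token, max_candidates):
    """Fetch up to `max_candidates` n-grams whose first token is `last_token`."""
    return sorted(pool.get(last_token, ()))[:max_candidates]

def verify_candidates(context, candidates):
    """Return the tokens to append after greedy verification of all candidates."""
    best = []
    for gram in candidates:
        accepted = []
        for tok in gram[1:]:  # gram[0] equals context[-1] and is already decoded
            if tok != greedy_next(context + accepted):
                break  # first mismatch: drop the rest of this n-gram
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # Assumption: the model's own next token is appended, so a step still
    # makes progress when no candidate survives verification.
    return best + [greedy_next(context + best)]

prompt = [3, 14, 15]
reference = []
for _ in range(6):
    reference.append(greedy_next(prompt + reference))

context = prompt + reference[:2]
last = context[-1]
wrong = (reference[4] + 1) % VOCAB_SIZE
pool = {
    last: {
        (last, reference[2], reference[3], wrong),  # two correct tokens, then a miss
        (last, wrong, wrong, wrong),
    }
}
new_tokens = verify_candidates(context, lookup_candidates(pool, last, max_candidates=2))
```

In this toy run the step appends three tokens (two verified from the pool plus the model's own next token), all identical to what autoregressive decoding would have produced.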

We next discuss how to support more advanced sampling. Previous research (Miao et al., 2023) has developed efficient tree-based verification for speculative decoding with sampling support, where multiple draft sequences derived from a token tree can be verified in parallel. However, it does not apply to Lookahead Decoding as our verification works on disjoint $n$-grams instead of trees. We improve it by progressively verifying along the $n$-gram length and removing $n$-grams with mismatched prefixes. In addition, speculative-decoding-style verification requires the probability distribution from which each draft token was sampled in order to update the probability distribution when the draft token is rejected. Because we store all $n$-grams in a pool instead of discarding them each step, we would need huge memory to store the probability distributions (each of vocabulary size) for the entire $n$-gram pool. The key to overcoming this is that verification is indifferent to how draft tokens were sampled: different sampling methods (e.g., greedy sampling) only influence the acceptance rate and leave the output distribution unchanged. We can therefore force greedy sampling during $n$-gram generation (the lookahead branch), in which case the probability distribution degenerates into a one-hot vector. Hence we only need to store which token is selected. We elaborate the approach in Algorithm 4, prove its correctness in Appendix B, and verify its quality and speedups in §5.3.

The $n$-gram cache is expected to grow as decoding progresses, and with it the verification branch. To manage the verification cost, we set a cap $G$ on the maximum number of promising candidates run in parallel in the verification branch. Empirically, we suggest setting $G$ proportional to $W$ to balance generation and verification. In practice, we simply set $G=W$.

3.3 Decode, Predict, and Verify in the Same Step

At execution, the lookahead and verification branches can be integrated into one decoding step to leverage parallel processing. This requires a designated attention mask, as shown in Fig. 2 (b). This attention mask is straightforwardly derived following the principle that each token is only visible to the tokens with a larger position index than itself (§2). For example, only the green token at position 5 and all orange tokens are visible to the red token 6. The tokens in the lookahead branch are not visible to the tokens in the verification branch, and vice versa.

Figure 2: (a) Causal mask for decoder models. (b) Attention mask for Lookahead Decoding with $W=5$, $N=4$, and $G=2$. Digits on tokens indicate relative positions.

Integration with FlashAttention. FlashAttention (Dao et al., 2022; Dao, 2023) can vastly accelerate the training and inference of LLMs by saving memory I/O on the slow memory hierarchy. It forces a causal mask (e.g., Fig. 2 (a)) to avoid all token interactions outside a lower triangular scope, which is not suitable for Lookahead Decoding because we use a more intricate attention mask (e.g., Fig. 2 (b)) that depends on $W$, $N$, and $G$. To solve this, we hardcode Lookahead Decoding’s attention pattern with adjustable $W$, $N$, and $G$ in FlashAttention. Applying FlashAttention to Lookahead Decoding brings about a 20% end-to-end speedup compared to a straightforward implementation on top of native PyTorch in our experiments (§5.2).

3.4 Lookahead Parallelism

Lookahead Decoding is easy to parallelize on multiple GPUs for both lookahead and verification branches. Parallelizing the lookahead branch is achieved by noting that the lookahead computation is composed of several disjoint branches. For example, the branch with green 1 and red 2 tokens does not interact with the branch with the tokens green 3 and red 4 in Fig. 2 (b). We can put these disjoint branches onto different GPUs without introducing communication during the inference computation. Parallelizing the verification branch is done by assigning multiple $n$-gram candidates to different devices. Because the verification of each candidate, by design, is independent of others, this will not cause communication.
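The verification half of this scheme can be illustrated with a small single-process simulation. Everything here is an assumption for illustration: the round-robin assignment policy, the helper names, and the use of a fixed `reference` list in place of each device running its own copy of the model. It only shows that per-candidate scoring needs no cross-device traffic until a single merge at the end.

```python
def shard_round_robin(items, num_devices):
    """Assign items to devices in round-robin order (a hypothetical policy)."""
    shards = [[] for _ in range(num_devices)]
    for index, item in enumerate(items):
        shards[index % num_devices].append(item)
    return shards

def count_verified(candidate, reference):
    """Number of leading tokens of `candidate` that match `reference`."""
    matched = 0
    for tok, ref in zip(candidate, reference):
        if tok != ref:
            break
        matched += 1
    return matched

# Toy setup: the tokens the base model would produce next, and six candidates.
reference = [7, 8, 9]
candidates = [(7, 1, 1), (7, 8, 2), (3, 8, 9), (7, 8, 9), (0, 0, 0), (7, 8, 1)]

shards = shard_round_robin(candidates, num_devices=4)
# Each simulated device scores only its own shard, independently of the others.
local_best = [
    max(shard, key=lambda c: count_verified(c, reference), default=None)
    for shard in shards
]
# Single synchronization point after the forward pass: pick the global winner.
winner = max(
    (c for c in local_best if c is not None),
    key=lambda c: count_verified(c, reference),
)
```

In this toy run the fully matching candidate `(7, 8, 9)` lands on the fourth simulated device and is selected at the merge step.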

Fig. 3 shows an example of parallelizing the lookahead branch and verification branch in Fig. 2 (b) across four GPUs. With this workload allocation, the orange tokens 0, 1, 2, 3 and the input token 0 are redundantly placed and computed. However, it substantially reduces communication volume during the whole forward pass: we only need to synchronize the generated tokens on each device after the forward pass. We can further scale $W$, $N$, and $G$ with the increased FLOPs of multiple GPUs to obtain lower latency, according to Lookahead Decoding’s scalability (§4).

Figure 3: Distributing the workload of the lookahead branch and verification branch in Fig. 2 (b) to 4 GPUs with lookahead parallelism, which avoids communication during the forward pass.

We name this new form of parallelism lookahead parallelism (LP). Unlike previous parallelism methods (including pipeline and tensor parallelism) that shard the model parameters or states across different GPUs, LP maintains an entire copy of the model on each GPU (thus needing more memory) and distributes tokens to different GPUs. Hence, LP is advantageous in inference as it introduces near-zero communication per step, while existing model parallelism methods (Narayanan et al., 2021; Shoeybi et al., 2019) involve a large communication overhead on the critical path of each decoding step.

4 Scaling Law of Lookahead Decoding

Lookahead Decoding introduces flexible parameters $W$ and $N$ associated with the cost of each parallel decoding step. This section investigates the scaling law between compute FLOPs and the theoretical speedup, and compares it to speculative decoding.

4.1 Estimating Speedup for Speculative Decoding

Speculative decoding uses the draft model to speculate one token sequence at each step. We represent the probability of each token in the sequence passing the verification of the LLM by $\beta$ (acceptance rate) and denote its expectation by $E(\beta)=\alpha$. If we use the draft model to guess $\gamma$ tokens per step, the expected number of accepted tokens is (Leviathan et al., 2023):

E(#tokens)=1αγ+11α.𝐸#𝑡𝑜𝑘𝑒𝑛𝑠1superscript𝛼𝛾11𝛼\displaystyle\footnotesize E(\#tokens)=\frac{1-\alpha^{\gamma+1}}{1-\alpha}.italic_E ( # italic_t italic_o italic_k italic_e italic_n italic_s ) = divide start_ARG 1 - italic_α start_POSTSUPERSCRIPT italic_γ + 1 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_α end_ARG . (4)

Instead of speculating one sequence every time, we would speculate b𝑏bitalic_b sequences. We assume that b𝑏bitalic_b sequences, each of γ𝛾\gammaitalic_γ tokens, are sampled as each token will have the same acceptance rate of β𝛽\betaitalic_β. Under this setting, the expectation of the number of accepted tokens is denoted as follows:

E(#tokens)=(γ+1)i=1γ(1αi)b.𝐸#𝑡𝑜𝑘𝑒𝑛𝑠𝛾1superscriptsubscript𝑖1𝛾superscript1superscript𝛼𝑖𝑏\displaystyle\footnotesize E(\#tokens)=(\gamma+1)-\sum\limits_{i=1}^{\gamma}(1% -\alpha^{i})^{b}.italic_E ( # italic_t italic_o italic_k italic_e italic_n italic_s ) = ( italic_γ + 1 ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_γ end_POSTSUPERSCRIPT ( 1 - italic_α start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT . (5)

See derivations in Appendix C for Eq. 4 and Eq. 5. Note that when b=1𝑏1b=1italic_b = 1, Eq. 5 falls back to Eq. 4.
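The two expectations are easy to evaluate numerically. The following is a minimal Python sketch of Eq. 4 and Eq. 5, assuming $0 \le \alpha < 1$; the function names and the sample values of $\alpha$ and $\gamma$ are illustrative choices, not part of the paper’s code.

```python
def expected_tokens_single(alpha: float, gamma: int) -> float:
    """Eq. 4: expected accepted tokens when one sequence of gamma tokens is speculated."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_tokens_multi(alpha: float, gamma: int, b: int) -> float:
    """Eq. 5: expected accepted tokens when b sequences of gamma tokens are speculated."""
    return (gamma + 1) - sum((1.0 - alpha ** i) ** b for i in range(1, gamma + 1))


if __name__ == "__main__":
    alpha, gamma = 0.425, 4  # illustrative values only
    for b in (1, 2, 4, 8, 16):
        # The expectation grows with b but never exceeds gamma + 1.
        print(b, round(expected_tokens_multi(alpha, gamma, b), 3))
```

With `b=1` the second function returns the same value as the first, which matches the remark that Eq. 5 falls back to Eq. 4.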

4.2 Estimating Speedup for Lookahead Decoding

We define the step compression ratio $\mathcal{S}$ as the number of autoregressive steps divided by the number of Lookahead Decoding steps needed to generate a sequence of the same length. As the number of generated tokens equals the number of autoregressive steps, it can be written as:

$$\mathcal{S}=\frac{\#generated\ tokens}{\#\textsc{Lookahead Decoding}\ steps} \tag{6}$$

Lookahead Decoding speculates $b$ sequences every time as in Eq. 5. In each step, we search the pool for $n$-grams starting with the current input token and have at most $G$ speculations of length $N-1$. As we set $G=W$ (§3.2), we have $G=W=b$ and $N-1=\gamma$ using the notation of Eq. 5. In practice, we cannot expect each step to have equally good speculations (i.e., acceptance rate with $E(\beta)=\alpha$). We assume that, on average, one in every $f$ steps has a good speculation with $E(\#tokens)$ tokens accepted, and for the other $f-1$ steps we fall back to autoregressive decoding due to bad speculations. We use this $f$ to bridge $\mathcal{S}$ and the per-step $E(\#tokens)$ as follows:

$$\mathcal{S}=(f-1+E(\#tokens))/f. \tag{7}$$
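As a rough illustration of how Eq. 5 and Eq. 7 combine, the sketch below maps the Lookahead Decoding parameters onto the speculative-decoding notation ($b=W=G$, $\gamma=N-1$) and evaluates the step compression ratio. The function names are hypothetical, and the values of $\alpha$ and $f$ passed in are assumptions of the caller rather than quantities the code can measure.

```python
def expected_tokens(alpha: float, gamma: int, b: int) -> float:
    """Eq. 5: expected accepted tokens for b speculations of gamma tokens each."""
    return (gamma + 1) - sum((1.0 - alpha ** i) ** b for i in range(1, gamma + 1))


def step_compression_ratio(alpha: float, n_gram_size: int, window: int, f: float) -> float:
    """Eq. 7 with G = W = b and N - 1 = gamma.

    One in every f steps is assumed to accept E(#tokens) tokens; the other
    f - 1 steps fall back to one token each (plain autoregressive decoding).
    """
    gamma = n_gram_size - 1
    b = window
    return (f - 1 + expected_tokens(alpha, gamma, b)) / f


if __name__ == "__main__":
    # Sweep the window size at a fixed n-gram size (illustrative inputs).
    for w in (1, 5, 15, 30):
        print(w, round(step_compression_ratio(alpha=0.425, n_gram_size=5, window=w, f=3.106), 3))
```

When `f=1` every step is a good speculation and the ratio equals $E(\#tokens)$; when $\alpha=0$ nothing beyond the guaranteed token is accepted and the ratio is 1.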

We can plot the curve indicated by our formulation with one specific setting as in Fig. 4 (b). We find that the trend of our empirical experiments (LLaMA-2-Chat-7B on MT-Bench with $G=W$, as in Fig. 4 (a)) aligns reasonably well with the formulation. From this formulation, we conclude that we can linearly reduce the number of decoding steps according to per-step $\log(b)$ given a large enough $\gamma$. In contrast to speculative decoding, Lookahead Decoding does not hit the upper bound indicated by Eq. 4 when $\gamma$ and $b$ are increased simultaneously. This reveals the scaling law of Lookahead Decoding: decoding steps decrease linearly with per-step $\log(\text{FLOPs})$ given a large enough $N$, since per-step FLOPs are roughly proportional to the number of input tokens (i.e., $(W+G)*(N-1)$). The scaling law also suggests Lookahead Decoding’s strong scaling to multiple GPUs, in which we can obtain an even greater per-token latency reduction by using more FLOPs, which is advantageous for latency-sensitive tasks.
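The difference between the two bounds can be made explicit by taking limits of Eq. 4 and Eq. 5. This is a short worked derivation under the assumption $0 < \alpha < 1$; the numeric value uses $\alpha = 0.425$ only as an example.

```latex
% Eq. 4 (one speculated sequence): guessing more tokens saturates at 1/(1-\alpha).
\lim_{\gamma \to \infty} \frac{1-\alpha^{\gamma+1}}{1-\alpha} = \frac{1}{1-\alpha},
\qquad \text{e.g. } \alpha = 0.425 \;\Rightarrow\; \frac{1}{0.575} \approx 1.74 .

% Eq. 5 (b speculated sequences): for fixed \gamma, every term (1-\alpha^{i})^{b} \to 0 as b grows.
\lim_{b \to \infty} \Big[ (\gamma+1) - \sum_{i=1}^{\gamma} (1-\alpha^{i})^{b} \Big] = \gamma + 1 .
```

So with a single speculation the expected number of accepted tokens is capped by a constant that depends only on $\alpha$, whereas with many speculations the cap is $\gamma+1$, which itself grows when $\gamma$ is increased together with $b$.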

Figure 4: (a) Relation of $W, N, G$ and $\mathcal{S}$ for LLaMA-2-Chat-7B on MT-Bench. (b) The trend of our formulation when we assume a setting with $\alpha=0.425$ and $f=3.106$.

5 Evaluation Results

Model and testbed. We used various versions of the LLaMA-2 (Touvron et al., 2023b) and CodeLlama (Roziere et al., 2023) models, including the 7B, 13B, 34B, and 70B sizes, on two GPU setups, S1 and S2. S1 is equipped with NVIDIA A100 GPUs with 80GB of memory. On S1, the 7B, 13B, and 34B models are deployed on a single A100, while the 70B model utilizes 2 A100s with pipeline parallelism supported by Accelerate (Gugger et al., 2022). S2 is a DGX machine with 8 NVIDIA A100 GPUs with 40GB memory and NVLink. All models are served with FP16 precision and a batch size of 1 unless otherwise specified (Cai et al., 2024; He et al., 2023).

Datasets. We benchmarked Lookahead Decoding’s performance across a broad spectrum of datasets and tasks. MT-Bench (Zheng et al., 2023) is a diverse set of multi-turn questions with many unique tokens. GSM8K (Cobbe et al., 2021) contains a set of math questions, of which we use the first 1k questions. HumanEval (Chen et al., 2021) covers both code completion and infilling tasks. We also test on the MBPP (Austin et al., 2021) dataset for instruction-based code generation, and on ClassEval (Du et al., 2023) for class-level code completion. To control generation length in code generation tasks, we set the maximum sequence length to 512 and 2,048 on HumanEval and ClassEval, respectively, aligned with prior setups (Ben Allal et al., 2022; Du et al., 2023). Tab. 1 lists detailed settings. In addition, we validate the effectiveness of sampling (§3.2) on the XSum (Narayan et al., 2018) and CNN/Daily Mail (See et al., 2017) datasets.

Baseline Settings. Our primary baseline is HuggingFace’s implementation of greedy search (Wolf et al., 2020). Additionally, we employ FlashAttention (Dao et al., 2022; Dao, 2023) as a stronger baseline to assess the performance of FlashAttention-empowered Lookahead Decoding. In distributed settings, we evaluate LP against TP (supported by DeepSpeed (Aminabadi et al., 2022)) and PP (supported by Accelerate (Gugger et al., 2022)). We measure the throughput of single-batch inference against these baseline settings (Cai et al., 2024; He et al., 2023).

Table 1: Experimental settings for §5.1 and §5.2.

| Server | Parallel. | Model | Model Size | Dataset |
|---|---|---|---|---|
| S1 | w/o LP | LLaMA-2-chat | 7B, 13B, 70B | MT-Bench |
| S1 | w/o LP | CodeLLaMA | 7B, 13B, 34B | HumanEval |
| S1 | w/o LP | CodeLLaMA-Inst | 7B, 13B, 34B | MBPP, GSM8K |
| S2 | w/ LP | LLaMA-2-chat | 7B, 13B | MT-Bench |
| S2 | w/ LP | CodeLLaMA | 7B, 13B | HumanEval |
| S2 | w/ LP | CodeLLaMA-Python | 7B, 13B | ClassEval |

Figure 5: Throughput of Lookahead Decoding on various datasets without FlashAttention and distributed serving.

Figure 6: Throughput of Lookahead Decoding with multiple GPUs and FlashAttention for 7B models.

Figure 7: Throughput of Lookahead Decoding with multiple GPUs and FlashAttention for 13B models.

5.1 End-to-end Performance

Fig. 5 shows the end-to-end performance of Lookahead Decoding compared with HuggingFace’s implementation of greedy search on S1. The tasks and models used are shown in Tab. 1. Across various datasets, Lookahead Decoding demonstrates a 1.5x-2.3x speedup. Generally, our method performs better on code completion tasks (e.g., 2.3x), as the higher occurrence of repetitive tokens during code completion makes predictions easier. Smaller models also exhibit a higher speedup than larger models. This is because Lookahead Decoding trades per-step FLOPs for a higher step compression ratio (§4). A larger model requires more FLOPs and hits the GPU FLOPs cap sooner than a smaller model, so it shows a lower ability to compress decoding steps given the same GPU setting.

5.2 Performance with LP and FlashAttention

We evaluated the performance of Lookahead Decoding with LP and FlashAttention augmentation on S2 with greedy search. The tasks and models used are shown in Tab. 1. The results for the 7B and 13B models are in Fig. 6 and Fig. 7, respectively. FlashAttention speeds up the PyTorch implementation of Lookahead Decoding by 20%. Notably, FlashAttention-integrated Lookahead Decoding shows a 1.8x speedup for the 7B model on MT-Bench compared with autoregressive decoding with FlashAttention (i.e., 1.9x vs 1.07x in Fig. 6). For distributed settings, we performed strong scaling of the workloads to multiple GPUs (i.e., increasing GPUs but not increasing workloads). The multi-GPU settings of both TP (w/ DeepSpeed) and PP (w/ Accelerate) bring slowdowns (i.e., 0.75x-0.82x). These results echo DeepSpeed’s documentation (dee, 2023). However, with Lookahead Decoding, we can further utilize the FLOPs of multiple GPUs to reduce the inference latency (e.g., 4x on ClassEval).

Table 2: Sampling with Lookahead Decoding on CNN/Daily Mail and XSum. A temperature (Temp.) of 0.0 equals greedy search. “AR.” is autoregressive and “LA.” is Lookahead Decoding. Rouge scores, speedups against autoregressive, and compression ratio ($\mathcal{S}$) are reported.

| Dataset | Temp. | Method | Rouge-1 | Rouge-2 | Rouge-L | Speedups | $\mathcal{S}$ |
|---|---|---|---|---|---|---|---|
| CNN. | 1.0 | AR. | 36.55 | 13.20 | 22.68 | 1.00x | 1.00x |
| CNN. | 1.0 | LA. | 36.53 | 13.27 | 22.71 | 1.46x | 1.64x |
| CNN. | 0.0 | AR. | 37.79 | 14.59 | 23.96 | 1.00x | 1.00x |
| CNN. | 0.0 | LA. | 37.79 | 14.59 | 23.96 | 1.57x | 1.72x |
| Xsum | 1.0 | AR. | 19.15 | 4.53 | 12.84 | 1.00x | 1.00x |
| Xsum | 1.0 | LA. | 19.20 | 4.53 | 12.87 | 1.50x | 1.67x |
| Xsum | 0.0 | AR. | 19.38 | 4.78 | 13.05 | 1.00x | 1.00x |
| Xsum | 0.0 | LA. | 19.39 | 4.79 | 13.06 | 1.60x | 1.77x |

Table 3: Comparison of the effectiveness of the lookahead and verification branches on MT-Bench on A100. FlashAttention is activated. We show the speedups against autoregressive decoding and the compression ratio ($\mathcal{S}$).

| Tag | Setting (N, W, G) | Prompt as Ref. | Speedups | $\mathcal{S}$ |
|---|---|---|---|---|
| ① | Autoregressive | - | 1.00x | 1.00 |
| ② | Prompt Lookup | - | 1.44x | 1.55 |
| ③ | (10, 1, 3) | ✓ | 1.36x | 1.45 |
| ④ | (5, 1, 10) | ✓ | 1.36x | 1.51 |
| ⑤ | (5, 1, 30) | ✗ | 1.04x | 1.12 |
| ⑥ | (5, 1, 30) | ✓ | 1.46x | 1.59 |
| ⑦ | (5, 30, 1) | ✗ | 1.61x | 1.79 |
| ⑧ | (5, 15, 15) | ✗ | 1.78x | 1.96 |
| ⑨ | (5, 15, 15) | ✓ | 1.88x | 2.05 |

5.3 Generation Quality of Lookahead Decoding

We assess the generation quality of Lookahead Decoding on the LLaMA-2-7B-Chat model with the prompts in Appendix D on summarization datasets (Chen et al., 2023; Leviathan et al., 2023) in Tab. 2. Whether or not sampling is activated, Lookahead Decoding preserves the output distribution quality, as evaluated by Rouge-1, Rouge-2, and Rouge-L (Lin, 2004), while achieving 1.46x-1.60x speedups compared with autoregressive decoding. Using sampling gives smaller speedups, as the acceptance ratio is lower under the sampling verification algorithm (Algorithm 4), which aligns with the results of previous research (Chen et al., 2023; Leviathan et al., 2023). We further verify in Appendix E that using greedy sampling and advanced integrations does not change the generation quality.

5.4 Ablation Study

In this section, we study the importance of the lookahead and verification branches in achieving a high speedup. We experiment on LLaMA-2-7B-Chat and MT-Bench on S1 with various settings. The results are shown in Tab. 3.

We ablate the importance of the lookahead branch by comparing the performance of using a lookahead branch to the recent methods of using prompts as a reference (Yang et al., 2023; Saxena, 2023). This comparison assumes that Lookahead Decoding does not use the prompt to build the n-gram pool. We use the prompt lookup implementation in transformers v4.37 as a baseline (②, with prompt_lookup_num_tokens=10). We also use the prompt to build the n-gram pool to augment Lookahead Decoding (③④⑥⑨). The results show that although using a minimal lookahead branch ($W=1$) with various $N, G$ settings (③④⑤⑥) can obtain a decent speedup on MT-Bench, it is still not as good as using balanced branches (⑧). We also find that prompt lookup can surpass the prompt-as-reference implementation in Lookahead Decoding. This is because our method checks whether an $n$-gram starts with one token that exactly matches the last generated token, while prompt lookup in transformers v4.37 checks several starting tokens for a better speculation.
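The single-token matching rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: `build_ngram_pool` and `lookup_candidates` are hypothetical helper names, the pool is built from a prompt of integer token ids, and the actual implementation may organize and evict pool entries differently.

```python
from collections import defaultdict


def build_ngram_pool(tokens, n):
    """Index every n-gram in `tokens` by its first token.

    The pool maps a token to the list of (n-1)-token continuations that
    followed it, in first-seen order and without duplicates.
    """
    pool = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        first = tokens[i]
        continuation = tuple(tokens[i + 1 : i + n])
        if continuation not in pool[first]:
            pool[first].append(continuation)
    return pool


def lookup_candidates(pool, last_token, max_candidates):
    """Return at most `max_candidates` (G) speculations keyed by one token."""
    return pool.get(last_token, [])[:max_candidates]


if __name__ == "__main__":
    prompt_ids = [1, 2, 3, 1, 2, 4]  # made-up token ids
    pool = build_ngram_pool(prompt_ids, n=3)
    print(lookup_candidates(pool, last_token=1, max_candidates=2))
```

Because the lookup key is a single token, a frequent token can return many unrelated continuations, which is one plausible reading of why matching on several starting tokens gives better speculations in the prompt lookup baseline.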

We ablate the importance of the verification branch by reporting the speedup of using a tiny verification branch and a large lookahead branch (⑦, $G=1$). It shows lower performance than balanced branches (⑧) due to its lower potential for accepting speculations.

Besides, our evaluation shows that using the prompt as a reference can further boost Lookahead Decoding (⑧ and ⑨). We have integrated it into our implementation.

5.5 Discussion and Limitation

Table 4: Good configurations of Lookahead Decoding on A100 GPUs with $G=W$.

| Model | Window Size ($W$) | N-gram Size ($N$) |
|---|---|---|
| 7B | 15 | 5 |
| 13B | 10 | 5 |
| 34B | 7 | 5 |

The main limitation of Lookahead Decoding is that it requires extra computation. Our experimental results show that on A100, the configuration in Tab. 4 works near optimally in most cases for single-batch serving. The per-step FLOPs are roughly proportional to the number of per-step input tokens, which is $(W+G)*(N-1)$. Hence, if we ignore the attention cost’s increase with sequence length, the 7B, 13B, and 34B models require 120x, 80x, and 56x extra FLOPs per step, respectively. Since LLM decoding is memory bandwidth-bound rather than compute-bound, these extra FLOPs only turn into a limited wall-clock slowdown for each step.
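The quoted multipliers follow directly from the token count $(W+G)*(N-1)$ applied to the Tab. 4 configurations with $G=W$. A small Python check, where the dictionary and function names are illustrative:

```python
# (W, N) per model size, taken from Tab. 4; G is set equal to W.
GOOD_CONFIGS = {"7B": (15, 5), "13B": (10, 5), "34B": (7, 5)}


def per_step_input_tokens(w: int, g: int, n: int) -> int:
    """Number of input tokens per Lookahead Decoding step: (W + G) * (N - 1)."""
    return (w + g) * (n - 1)


def flops_multipliers(configs):
    """Approximate per-step FLOPs relative to one-token autoregressive decoding,
    ignoring the growth of attention cost with sequence length."""
    return {model: per_step_input_tokens(w, w, n) for model, (w, n) in configs.items()}


if __name__ == "__main__":
    print(flops_multipliers(GOOD_CONFIGS))
```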

Given this, Lookahead Decoding needs a large surplus of FLOPs to obtain high speedups. Running in compute-bound environments (e.g., serving with a large batch size) may cause slowdowns. Another example is shown in Fig. 8, where a lower speedup is observed when the GPU’s peak FLOPs are lower (e.g., on RTX 3090 GPUs).

Based on §4, we need to exponentially increase the per-step FLOPs to obtain a linear reduction in decoding steps. Hence, the setting in Tab. 4 faces diminishing returns. However, even when FLOPs are not abundant, a modest speedup (e.g., 30% on RTX 3090 and >50% on A100) on MT-Bench is easily achievable, as in Fig. 8. This is a free lunch that requires no extra model, no training, and no change to the output distribution.

Figure 8: Compression ratio ($\mathcal{S}$) and speedups of Lookahead Decoding on RTX 3090 and A100 with $N=5$, all with FlashAttention. The blue and orange curves of $\mathcal{S}$ overlap as the device does not affect the ratio.

6 Related Work

Speculative decoding (Chen et al., 2023; Leviathan et al., 2023) pioneered speeding up autoregressive decoding with a draft model. Different methods for obtaining speculations have been studied. Specinfer (Miao et al., 2023) uses many draft models obtained from distillation, quantization, and pruning to conduct speculations together. Medusa (Cai et al., 2024), OSD (Liu et al., 2023), and EAGLE (Li et al., 2023) use training to obtain a draft model. REST (He et al., 2023) uses the finetuning dataset itself as a datastore to look up speculations, while other works (Yang et al., 2023; Saxena, 2023) use the prompt as a reference for speculations. Different from these methods, Lookahead Decoding uses the LLM’s parallel generation ability for speculations. Sampling methods have also been studied. Specinfer maintains the output distribution with a tree-based sampling algorithm. Medusa uses a typical acceptance scheme to accelerate decoding when the temperature is large but does not preserve an exact output distribution. Lookahead Decoding follows Specinfer to maintain the output distribution but with multiple disjoint $n$-grams.

7 Conclusion

In this paper, we present Lookahead Decoding to parallelize the autoregressive decoding of LLMs without changing the output distribution. It shows notable speedup without a draft model and can linearly decrease the decoding steps with exponential investment in per-step FLOPs.

References

  • dee (2023) Automatic tensor parallelism for huggingface models, 2023. URL https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism.
  • Aminabadi et al. (2022) Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–15. IEEE, 2022.
  • Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.
  • Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Ben Allal et al. (2022) Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
  • Cai et al. (2024) Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads, 2024.
  • Chen et al. (2023) Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
  • Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.
  • Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Dao (2023) Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
  • Dao et al. (2022) Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  • Du et al. (2023) Du, X., Liu, M., Wang, K., Wang, H., Liu, J., Chen, Y., Feng, J., Sha, C., Peng, X., and Lou, Y. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation, 2023.
  • Gugger et al. (2022) Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M., and Bossan, B. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
  • He et al. (2023) He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023.
  • Holtzman et al. (2020) Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration, 2020.
  • Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Kool et al. (2020) Kool, W., van Hoof, H., and Welling, M. Ancestral gumbel-top-k sampling for sampling without replacement. Journal of Machine Learning Research, 21(47):1–36, 2020. URL http://jmlr.org/papers/v21/19-985.html.
  • Leviathan et al. (2023) Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.  19274–19286. PMLR, 2023.
  • Li et al. (2023) Li, Y., Zhang, C., and Zhang, H. Eagle: Lossless acceleration of llm decoding by feature extrapolation, December 2023. URL https://sites.google.com/view/eagle-llm.
  • Lin (2004) Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013.
  • Liu et al. (2023) Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding, 2023.
  • Miao et al. (2023) Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Zhu, A., Yang, L., Shi, X., Shi, C., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. Specinfer: Accelerating generative large language model serving with speculative inference and token tree verification, 2023.
  • Narayan et al. (2018) Narayan, S., Cohen, S. B., and Lapata, M. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018.
  • Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V. A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021.
  • OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Roziere et al. (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  • Ruan et al. (2023) Ruan, J. T., Sabir, F., and Chopra, P. Best prompting practices for using the llama 2 chat llm through amazon sagemaker jumpstart, November 2023. URL https://aws.amazon.com/cn/blogs/machine-learning/best-prompting-practices-for-using-the-llama-2-chat-llm-through-amazon-sagemaker-jumpstart/.
  • Santilli et al. (2023) Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodola, E. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12336–12355, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.689.
  • Saxena (2023) Saxena, A. Prompt lookup decoding, November 2023. URL https://github.com/apoorvumang/prompt-lookup-decoding/.
  • See et al. (2017) See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1073–1083, 2017.
  • Shoeybi et al. (2019) Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • Song et al. (2021) Song, Y., Meng, C., Liao, R., and Ermon, S. Accelerating feedforward computation via parallel nonlinear equation solving, 2021.
  • Stern et al. (2018) Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models, 2018.
  • Team et al. (2023) Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Vaswani et al. (2023) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023.
  • Wolf et al. (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  • Yang et al. (2023) Yang, N., Ge, T., Wang, L., Jiao, B., Jiang, D., Yang, L., Majumder, R., and Wei, F. Inference with reference: Lossless acceleration of large language models, 2023.
  • Zhang & Sennrich (2019) Zhang, B. and Sennrich, R. Root mean square layer normalization, 2019.
  • Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.

Appendix A Algorithms

Algorithm 1 Jacobi decoding
1:  Input: prompt $\mathbf{x}^{0}$, model $p_{M}$, generation length $m$
2:  Initialize $\mathbf{y}^{0}=(y_{1}^{0},y_{2}^{0},...,y_{m}^{0})$
3:  Initialize $\mathbf{y}^{output}\leftarrow()$
4:  for $i=1$ to $m$ do
5:     $\mathbf{y}^{i}_{1:m}\leftarrow\arg\max(P_{M}(\mathbf{y}^{i}_{1:m}|\mathbf{y}^{i-1}_{1:m},\mathbf{x}^{0}))$
6:     $\mathbf{o}\leftarrow\mathbf{y}^{i}$
7:     $stop\leftarrow$ StopCondition$(\mathbf{y}^{i},\mathbf{y}^{i-1})$
8:     if $stop$ then
9:        break
10:     end if
11:  end for
12:  Output: $\mathbf{o}=(y_{1},y_{2},...,y_{m})$
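To make the fixed-point iteration of Algorithm 1 concrete, here is a minimal runnable Python sketch. It assumes greedy decoding and replaces the LLM with `toy_next_token`, a hypothetical deterministic function standing in for $\arg\max P_{M}(\cdot\,|\,\text{prefix})$; the stop condition used is equality of two consecutive iterates. It is a toy illustration, not the paper’s implementation.

```python
def toy_next_token(prefix, vocab_size=50):
    """Hypothetical stand-in for the greedy next-token choice of a model."""
    return (31 * sum(prefix) + 7 * len(prefix)) % vocab_size


def autoregressive_decode(prompt, m, next_token=toy_next_token):
    """Reference: generate m tokens one at a time."""
    out = []
    for _ in range(m):
        out.append(next_token(list(prompt) + out))
    return out


def jacobi_decode(prompt, m, next_token=toy_next_token):
    """Jacobi decoding: refine all m positions together until a fixed point."""
    y = [0] * m  # arbitrary initial guess y^0
    steps = 0
    for _ in range(m):
        # Each position j is recomputed from the previous iterate only, so a
        # real model could evaluate all m positions in one forward pass.
        new_y = [next_token(list(prompt) + y[:j]) for j in range(m)]
        steps += 1
        if new_y == y:  # StopCondition: the iterate no longer changes
            break
        y = new_y
    return y, steps


if __name__ == "__main__":
    prompt = [3, 1, 4]
    tokens, steps = jacobi_decode(prompt, m=8)
    print(tokens == autoregressive_decode(prompt, m=8), steps)
```

After iteration $k$ at least the first $k$ positions agree with the autoregressive output, so the loop needs at most $m$ iterations and returns the same sequence as greedy autoregressive decoding.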
Algorithm 2 Lookahead decoding
1:  Input: prompt $\mathbf{x}^{0}=(x_{1},x_{2},...x_{n})$, model $P_{M}$, n-gram size $N$, window size $W$, max #speculations $G$, max steps $m$.
2:  Initialize n-gram pool $\mathbf{C}\leftarrow\emptyset$
3:  Initialize $\mathbf{o}\leftarrow\emptyset$
4:  Randomly initialize 2D window $\mathbf{w}^{2-N:0}_{1:W}$
5:  Set $\mathbf{o}_{0}\leftarrow x_{n}$
6:  for $i=1$ to $m$ do
7:     if $size(\mathbf{o})\geq i$ then
8:        Randomly set $\mathbf{w}_{1:W}^{i}$
9:        continue
10:     end if
11:     {Lookahead Branch}
12:     for $j=1$ to $W$ do
13:        if $j=1$ then
14:           $w^{i}_{j}\leftarrow\arg\max P_{M}(w_{j}^{i}|\mathbf{w}^{i+2-N:i-1}_{j},\mathbf{o}_{1:i},\mathbf{x}^{0})$
15:        else
16:           $w^{i}_{j}\leftarrow\arg\max P_{M}(w_{j}^{i}|\mathbf{w}^{i+2-N:i-1}_{j},\mathbf{w}_{2:j}^{i+1-N},\mathbf{o}_{1:i},\mathbf{x}^{0})$
17:        end if
18:     end for
19:     $\mathbf{w}_{1:W}^{i}=(w_{1}^{i},w_{2}^{i},...,w_{W}^{i})$
20:     {Verification Branch}
21:     $\mathbf{g}\leftarrow\emptyset$
22:     for $j=1$ to $G$ do
23:        $\mathbf{g}^{i}\leftarrow$ n-gram from $\mathbf{C}$ starting with $\mathbf{o}_{i-1}$
24:     end for
25:     {Verification Algorithm in Algo. 3 and Algo. 4.}
26:     𝐨𝐨\mathbf{o}bold_o.append(Verification((x0,𝐨1:i),PM,𝐠)superscript𝑥0subscript𝐨normal-:1𝑖subscript𝑃𝑀𝐠((x^{0},\mathbf{o}_{1:i}),P_{M},\mathbf{g})( ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , bold_o start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ) , italic_P start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , bold_g ))
27:     {Update n-gram pool}
28:     for j=1𝑗1j=1italic_j = 1 to W𝑊Witalic_W do
29:        add n-gram 𝐰jiN+1:isubscriptsuperscript𝐰:𝑖𝑁1𝑖𝑗\mathbf{w}^{i-N+1:i}_{j}bold_w start_POSTSUPERSCRIPT italic_i - italic_N + 1 : italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to 𝐂𝐂\mathbf{C}bold_C
30:     end for
31:  end for
32:  Output: 𝐨1:m=(y1,y2,,ym)subscript𝐨:1𝑚subscript𝑦1subscript𝑦2subscript𝑦𝑚\mathbf{o}_{1:m}=(y_{1},y_{2},...,y_{m})bold_o start_POSTSUBSCRIPT 1 : italic_m end_POSTSUBSCRIPT = ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT )
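The pool bookkeeping in the listing above (the lookup in lines 22-23 and the update in lines 28-30) can be sketched in Python. This is a minimal illustration under stated assumptions, not the released implementation: the class `NGramPool`, the helper `update_pool`, indexing the pool by first token, and returning the most recently added n-grams when more than $G$ match are all hypothetical choices made for this example.

```python
from collections import defaultdict


class NGramPool:
    """Toy n-gram pool C, indexed by the first token of each n-gram."""

    def __init__(self, n):
        self.n = n
        self.by_first = defaultdict(list)  # first token -> n-grams, oldest first

    def add(self, ngram):
        ngram = tuple(ngram)
        if len(ngram) != self.n:
            raise ValueError(f"expected an n-gram of length {self.n}")
        bucket = self.by_first[ngram[0]]
        if ngram not in bucket:  # skip exact duplicates
            bucket.append(ngram)

    def candidates(self, last_token, max_speculations):
        """Return up to max_speculations n-grams that start with last_token."""
        bucket = self.by_first.get(last_token, [])
        start = max(0, len(bucket) - max_speculations)
        return list(bucket[start:])  # assumption: prefer the most recent entries


def update_pool(pool, window):
    """Add one n-gram per window column (the 'Update n-gram pool' loop).

    window holds N rows (oldest step first), each with W tokens.
    """
    for j in range(len(window[0])):
        pool.add(tuple(row[j] for row in window))


pool = NGramPool(n=3)
update_pool(pool, [[1, 2], [3, 4], [5, 6]])  # adds (1, 3, 5) and (2, 4, 6)
pool.add((1, 7, 8))
pool.add((1, 9, 9))
print(pool.candidates(last_token=1, max_speculations=2))
```

Indexing by first token keeps the lookup for speculations that start with the last accepted token $\mathbf{o}_{i-1}$ to a single dictionary access.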
Algorithm 3 Greedy Verification with Lookahead Decoding
1:  input prefill $\mathbf{x}^{0}$, model $P_{M}$, n-grams $\mathbf{g}^{i}$ with $i\in[1,G]$
2:  output $\mathbf{o}$ {accepted tokens of length 1 to N}
3:  function GreedyVerification($\mathbf{x}^{0}$, $P_{M}$, $\mathbf{g}$)
4:     $\mathbf{V},\mathbf{D},\mathbf{o}\leftarrow\emptyset,\emptyset,\emptyset$
5:     for $i=1$ to $G$ do
6:        $\mathbf{V}$.append($\mathbf{g}^{i}_{2:}$) {each is an (N-1)-gram}
7:        $\mathbf{D}$.append($P_{M}(\mathbf{g}^{i^{\prime}}_{2:},\mathbf{x}_{next}|\mathbf{g}^{i}_{2:},\mathbf{x}^{0})$)
8:        {obtain the outputs of the last token of $\mathbf{x}^{0}$ and of all tokens in $\mathbf{g}^{i}_{2:}$; N probability distributions in total}
9:     end for
10:     for $i=1$ to $N-1$ do
11:        $j\leftarrow 1$
12:        $is\_accept\leftarrow 0$
13:        $\mathcal{P}\leftarrow\mathbf{D}[j]_{i}$ {$\mathbf{D}[j]$ is a series of N probability distributions; all $\mathbf{D}[j]_{i}$ should be the same, as different distributions are removed; size($\mathbf{D}$) $>0$ is guaranteed}
14:        while $j\leq$ size($\mathbf{V}$) do
15:           $s_{j}\leftarrow\mathbf{V}[j]_{i}$
16:           if $s_{j}=\arg\max\mathcal{P}$ then
17:              {accepted, update all potential speculations and probabilities}
18:              $\mathbf{o}$.append($s_{j}$)
19:              $is\_accept\leftarrow 1$
20:              $\mathbf{V}_{new},\mathbf{D}_{new}\leftarrow\emptyset,\emptyset$
21:              for $k=j$ to size($\mathbf{V}$) do
22:                 if $s_{j}=\mathbf{V}[k]_{i}$ then
23:                    $\mathbf{V}_{new}$.append($\mathbf{V}[k]$)
24:                    $\mathbf{D}_{new}$.append($\mathbf{D}[k]$)
25:                 end if
26:              end for
27:              $\mathbf{V},\mathbf{D}\leftarrow\mathbf{V}_{new},\mathbf{D}_{new}$
28:              break
29:           else
30:              {rejected, go to next speculation}
31:              $j\leftarrow j+1$
32:           end if
33:        end while
34:        if $is\_accept$ then
35:           continue
36:        else
37:           {guarantee one step movement}
38:           $\mathbf{o}$.append($\arg\max\mathcal{P}$)
39:           break
40:        end if
41:     end for
42:     if $is\_accept$ then
43:        $\mathbf{o}$.append($\arg\max\mathbf{D}[1]_{N}$)
44:     end if
45:     return $\mathbf{o}$
46:  end function
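The greedy verification listing above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' code: `argmax_next` is a hypothetical callable standing in for $\arg\max P_{M}(\cdot)$, and it is queried once per position here, whereas the listing precomputes all distributions $\mathbf{D}$ in one pass. Each candidate is an n-gram with its first token already dropped ($\mathbf{g}^{i}_{2:}$).

```python
def greedy_verify(prefix, candidates, argmax_next):
    """Return the accepted tokens (between 1 and N) for one decoding step.

    prefix:      tokens generated so far (prompt plus accepted output)
    candidates:  equal-length token tuples, each an n-gram minus its first token
    argmax_next: hypothetical callable, context -> greedy next token
    """
    context = list(prefix)
    accepted = []
    alive = [tuple(c) for c in candidates]
    length = len(alive[0]) if alive else 0  # N - 1 speculated positions

    for i in range(length):
        target = argmax_next(context)
        # Keep only the speculations that agree with the model at position i.
        alive = [c for c in alive if c[i] == target]
        accepted.append(target)
        context.append(target)
        if not alive:
            # Every speculation was rejected here; the model's own token
            # still guarantees one step of progress.
            return accepted

    # All speculated positions were accepted: take one extra token for free.
    accepted.append(argmax_next(context))
    return accepted


def toy_argmax(context):
    """Toy stand-in for the model: always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


print(greedy_verify([1, 2, 3], [(4, 5, 9), (4, 5, 6), (7, 8, 9)], toy_argmax))
```

Because a token is accepted only when it equals the model's greedy choice, the returned sequence matches what plain greedy decoding would produce, only in fewer steps.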
Algorithm 4 Sample Verification with Lookahead Decoding
1:  input prefill $\mathbf{x}^{0}$, model $P_{M}$, n-grams $\mathbf{g}^{i}$ with $i\in[1,G]$
2:  output $\mathbf{o}$ {accepted tokens of length 1 to N}
3:  function SampleVerification($\mathbf{x}^{0}$, $P_{M}$, $\mathbf{g}$)
4:     $\mathbf{V},\mathbf{D},\mathbf{o}\leftarrow\emptyset,\emptyset,\emptyset$
5:     for $i=1$ to $G$ do
6:        $\mathbf{V}$.append($\mathbf{g}^{i}_{2:}$) {each is an (N-1)-gram}
7:        $\mathbf{D}$.append($P_{M}(\mathbf{g}_{2:}^{i^{\prime}},\mathbf{x}_{next}|\mathbf{g}_{2:}^{i},\mathbf{x}^{0})$)
8:        {obtain the outputs of the last token of $\mathbf{x}^{0}$ and of all tokens in $\mathbf{g}^{i}_{2:}$; N probability distributions in total}
9:     end for
10:     for $i=1$ to $N-1$ do
11:        $j\leftarrow 1$
12:        $is\_accept\leftarrow 0$
13:        $\mathcal{P}_{j}\leftarrow\mathbf{D}[j]_{i}$ {$\mathbf{D}[j]$ is a series of N probability distributions; all $\mathbf{D}[j]_{i}$ should be the same; size($\mathbf{D}$) $>0$ is guaranteed}
14:        while $j\leq$ size($\mathbf{V}$) do
15:           $s_{j}\leftarrow\mathbf{V}[j]_{i}$
16:           sample $r\sim U(0,1)$
17:           if $r\leq\mathcal{P}_{j}(s_{j})$ then
18:              {accepted, update all potential speculations and probabilities}
19:              $\mathbf{o}$.append($s_{j}$)
20:              $is\_accept\leftarrow 1$
21:              $\mathbf{V}_{new},\mathbf{D}_{new}\leftarrow\emptyset,\emptyset$
22:              for $k=j$ to size($\mathbf{V}$) do
23:                 if $s_{j}=\mathbf{V}[k]_{i}$ then
24:                    $\mathbf{V}_{new}$.append($\mathbf{V}[k]$)
25:                    $\mathbf{D}_{new}$.append($\mathbf{D}[k]$)
26:                 end if
27:              end for
28:              $\mathbf{V},\mathbf{D}\leftarrow\mathbf{V}_{new},\mathbf{D}_{new}$
29:              break
30:           else
31:              {rejected, go to next speculation}
32:              $\mathcal{P}_{j}(s_{j})=0$
33:              $\mathcal{P}_{j+1}=\textrm{norm}(\mathcal{P}_{j})$
34:              $j\leftarrow j+1$
35:           end if
36:        end while
37:        if $is\_accept$ then
38:           continue
39:        else
40:           {guarantee one step movement}
41:           sample $\mathbf{x}_{next}\sim\mathcal{P}_{j}$
42:           $\mathbf{o}$.append($\mathbf{x}_{next}$)
43:           break
44:        end if
45:     end for
46:     if $is\_accept$ then
47:        $\mathbf{o}$.append(sample $\mathbf{x}_{next}\sim\mathbf{D}[1]_{N}$)
48:     end if
49:     return $\mathbf{o}$
50:  end function
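The accept/reject loop for a single token position (lines 14 to 44 of the listing above) can be sketched as follows. This is a simplified illustration under stated assumptions: it handles one position only and omits the filtering of $\mathbf{V}$ and $\mathbf{D}$; the function name `sample_verify_one` and its parameters are hypothetical; and the test `r < p` is used in place of $r\leq\mathcal{P}_{j}(s_{j})$ because `random.random()` draws from $[0,1)$, which keeps the acceptance probability equal to $\mathcal{P}_{j}(s_{j})$.

```python
import random


def sample_verify_one(p, specs, rng=random):
    """Sample the next token from p, trying the speculated tokens first.

    p:     list of probabilities over token ids 0..len(p)-1 (the model's P)
    specs: speculated token ids s_1..s_G for this position
    rng:   the random module or a random.Random instance
    """
    cur = list(p)  # P_1
    for s in specs:
        if rng.random() < cur[s]:
            return s  # speculation accepted
        # Rejected: zero out the token and renormalise to obtain P_{j+1}.
        cur[s] = 0.0
        total = sum(cur)
        if total <= 0.0:
            return s  # numerical guard: s carried all the remaining mass
        cur = [x / total for x in cur]
    # Every speculation was rejected: sample from the residual distribution.
    return rng.choices(range(len(cur)), weights=cur, k=1)[0]


rng = random.Random(0)
p = [0.5, 0.3, 0.2]
draws = [sample_verify_one(p, [0, 1], rng) for _ in range(20000)]
print([round(draws.count(v) / len(draws), 2) for v in range(len(p))])
```

The empirical frequencies should stay close to `p` regardless of which tokens are speculated, which is the property proved in Appendix B.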

Appendix B Proof: Output distribution is preserved by disjoint n-gram verification

The sampling verification in Lookahead Decoding is adapted from the algorithm in SpecInfer, with the difference that all speculations are generated by greedy sampling. It does not change the output distribution because, fundamentally, how the draft model generates the speculations is unimportant to the verification.

Theorem A Consider a given LLM, a prompt and previously generated tokens $\mathbf{x}=(x_{1},x_{2},\ldots,x_{i})$, and $G$ speculations $\mathbf{s}=(s_{1},s_{2},\ldots,s_{G})$ of the next token $x_{i+1}$. Each speculation token is sampled by a greedy sample (i.e., probability of 1). We use $P(v|\mathbf{x})$ to represent the probability of $x_{i+1}=v$ sampled by the LLM and use $Q(v|\mathbf{x})$ to represent the probability of $x_{i+1}=v$ sampled by our proposed algorithm 4. We use $P(v)$ and $Q(v)$ for short. We need to prove $P(v)=Q(v)$ for any $G$, and any $v$ and $s_{j}$ from the full vocabulary $V$.

Proof.

The proof of this part corresponds to line 14 to line 44 in algorithm 4. Given speculations $\mathbf{s}$, we use $a_{j}(v)$ to represent the probability that the token $v$ is accepted by the $j$-th speculation (line 18-line 30), and $r_{j}(s_{j})$ to represent the probability that the token $s_{j}$ is rejected by the $j$-th speculation (line 30-line 35), where $s_{j}$ is the $j$-th speculation's token. Moreover, $a_{G+1}^{\prime}(v)$ is the probability of being accepted by the sampling at line 41. For simplicity, we use $a_{j}$ to represent $a_{j}(v)$, $a_{j}^{\prime}$ to represent $a_{j}^{\prime}(v)$, and $r_{j}$ to represent $r_{j}(s_{j})$. We use $\mathcal{P}_{1}$ to represent the probability distribution obtained in line 13 and $\mathcal{P}_{j}$ to represent the updated probability before the $j$-th speculation. We have $\mathcal{P}_{1}(v)=P(v)$ as $\mathcal{P}_{1}(v)$ is never updated. We define $Q_{G}(v)$ as the probability of $x_{i+1}=v$ sampled by algorithm 4 when we have $G$ speculations. Then we should have:

$$Q_{G}(v)=a_{1}+r_{1}a_{2}+r_{1}r_{2}a_{3}+\ldots+a_{G}\prod_{k=1}^{G-1}r_{k}+a_{G+1}^{\prime}\prod_{k=1}^{G}r_{k}$$
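The sum above can be evaluated exactly for a small example. The sketch below is an illustration, not part of the proof: the function `q_exact`, the three-token vocabulary and the probabilities are hypothetical. It accumulates each term $a_{j}\prod_{k<j}r_{k}$ with rational arithmetic and adds the residual term $a_{G+1}^{\prime}\prod_{k\leq G}r_{k}$ at the end.

```python
from fractions import Fraction


def q_exact(p, specs):
    """Return [Q_G(v) for every token v], following the sum term by term.

    p:     model distribution P as a list of Fractions (token ids 0..len(p)-1)
    specs: speculated token ids s_1..s_G
    """
    q = [Fraction(0)] * len(p)
    cur = list(p)        # P_j, starting from P_1 = P
    reach = Fraction(1)  # product of r_k over all earlier speculations
    for s in specs:
        a = cur[s]       # probability that speculation s is accepted now
        q[s] += reach * a
        if a == 1:
            reach = Fraction(0)  # rejection is impossible; later terms vanish
            break
        reach *= 1 - a           # r_j = 1 - P_j(s_j)
        cur[s] = Fraction(0)     # zero out the rejected token ...
        total = sum(cur)
        cur = [x / total for x in cur]  # ... and renormalise to get P_{j+1}
    # Final term: sample from the residual distribution after G rejections.
    for v, x in enumerate(cur):
        q[v] += reach * x
    return q


p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(q_exact(p, [0, 1]) == p)
print(q_exact(p, [2, 2, 0]) == p)
```

For these inputs the computed $Q_{G}$ coincides with $P$ for both speculation lists, including the one that repeats a token.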

We use induction to prove $Q_{G}(v)=P(v)$ for any $G\geq 1$, any $v\in V$, and any $s_{j}\in V$ with $1\leq j\leq G$:

  1) When $G=1$, we have $Q_{G}(v)=a_{1}+r_{1}a_{2}^{\prime}$. The initial guess $s_{1}$ can either be the same as $v$ or be different from $v$.

    (1) When $s_{1}=v$, $a_{1}$ equals $\mathcal{P}_{1}(v)$ at line 17, which is the same as $P(v)$ as it is never updated. Upon this, we have $r_{1}=1-a_{1}=1-\mathcal{P}_{1}(v)=1-P(v)$. And, $a_{2}^{\prime}$ is the updated probability $\mathcal{P}_{2}(v)$ at line 42. Since $\mathcal{P}_{2}(v)$ is set to zero once rejected at line 32, $a_{2}^{\prime}=0$. In this case, $Q_{G}(v)=P(v)+(1-P(v))\cdot 0=P(v)$.

    (2) When $s_{1}\neq v$, $a_{1}$ should be $0$ even if $s_{1}$ is accepted. Moreover, we have $r_{1}=1-\mathcal{P}_{1}(s_{1})$. Then $\mathcal{P}_{2}(v)$ is updated to $\frac{\mathcal{P}_{1}(v)}{1-\mathcal{P}_{1}(s_{1})}$ at lines 32 and 33. Then $a_{2}^{\prime}=\mathcal{P}_{2}(v)$. In this case, $Q_{G}(v)=0+r_{1}\cdot\frac{P(v)}{r_{1}}=P(v)$.

  2) Assume the claim holds for $G=g$, which means $Q_{g}(v)=a_{1}+r_{1}a_{2}+\ldots+a_{g}\prod_{k=1}^{g-1}r_{k}+a_{g+1}^{\prime}\prod_{k=1}^{g}r_{k}=P(v)$ for any $s_{j},v\in V$, $1\leq j\leq g$.

    We prove $Q_{g+1}(v)=Q_{g}(v)-a_{g+1}^{\prime}\prod_{k=1}^{g}r_{k}+a_{g+1}\prod_{k=1}^{g}r_{k}+a_{g+2}^{\prime}\prod_{k=1}^{g+1}r_{k}=P(v)$ for the same $s_{j},v\in V$, $1\leq j\leq g$, and any $s_{g+1}\in V$.

    (1) When $s_{g}\neq v$, we have $a_{g}=0$. If $\mathcal{P}_{g}(v)=0$, we have all $a_{g+1}^{\prime}=0$, $a_{g+1}=0$ and $a_{g+2}^{\prime}=0$. It ensures that $Q_{g+1}(v)=Q_{g}(v)-0+0+0=Q_{g}(v)=P(v)$.

      If $\mathcal{P}_{g}(v)\neq 0$, $\mathcal{P}_{g+1}(v)=\frac{\mathcal{P}_{g}(v)}{r_{g}}=\frac{\mathcal{P}_{1}(v)}{\prod_{k=1}^{g}r_{k}}$ since $s_{g}\neq v$ by observation.

      Then we have
      $$\begin{aligned}
      Q_{g+1}(v)&=Q_{g}(v)-\mathcal{P}_{g+1}(v)\prod_{k=1}^{g}r_{k}+a_{g+1}(v)\prod_{k=1}^{g}r_{k}+a_{g+2}^{\prime}\prod_{k=1}^{g+1}r_{k}\\
      &=Q_{g}(v)-\frac{\mathcal{P}_{1}(v)}{\prod_{k=1}^{g}r_{k}}\prod_{k=1}^{g}r_{k}+a_{g+1}\prod_{k=1}^{g}r_{k}+a_{g+2}^{\prime}\prod_{k=1}^{g+1}r_{k}\\
      &=Q_{g}(v)-P(v)+a_{g+1}(v)\prod_{k=1}^{g}r_{k}+a_{g+2}^{\prime}\prod_{k=1}^{g+1}r_{k}.
      \end{aligned}$$
      Here we have another two cases:

      ➀ If $s_{g+1}=v$, $a_{g+1}\prod\limits_{k=1}^{g}r_{k}=\frac{\mathcal{P}_{1}(v_{i}=v)}{\prod\limits_{k=1}^{g}r_{k}}\prod\limits_{k=1}^{g}r_{k}=P(v)$ and $a_{g+2}^{\prime}=\mathcal{P}_{g+2}(v)=0$. We have $Q_{g+1}(v)=Q_{g}(v)-P(v)+P(v)+0=Q_{g}(v)=P(v)$.

      ➁ If $s_{g+1}\neq v$, $a_{g+1}=0$. $a_{g+2}^{\prime}\prod\limits_{k=1}^{g+1}r_{k}=\mathcal{P}_{g+2}(v)\prod\limits_{k=1}^{g+1}r_{k}=\frac{\mathcal{P}_{1}(v)}{\prod\limits_{k=1}^{g+1}r_{k}}\prod\limits_{k=1}^{g+1}r_{k}=P(v)$. So $Q_{g+1}(v)=Q_{g}(v)-P(v)+0+P(v)=Q_{g}(v)=P(v)$.

    2. (2)

      When $s_{g}=v$, $\mathcal{P}_{g+1}(v)$ is set to zero at line 32 after this step. In this case, we have $a_{g+1}^{\prime}=\mathcal{P}_{g+1}(v)=0$, $a_{g+1}=\mathcal{P}_{g+1}(v)=0$ and $a_{g+2}^{\prime}=0$. This gives $Q_{g+1}(v)=Q_{g}(v)-0+0+0=Q_{g}(v)=P(v)$.

This part of the proof guarantees that, from line 14 to line 44 in Algorithm 4, any new token appended to $\mathbf{o}$ follows the original distribution of the LLM. Lines 21 to 28 guarantee that the sequences in $\mathbf{V}$ share the same prefix of length $i-1$ in every iteration. This further guarantees that $\mathcal{P}$ from $\mathbf{D}[j]_{i}$ is the same for all $j$ and follows the desired distribution. Thus, the correctness of the whole sampling algorithm is proved.

Appendix C Derivation of Expectation of The Number of Accepted Tokens

We first start with single-candidate speculation. We need to obtain the probability of accepting $i$ tokens, $P(\#accepted\ tokens=i)$, for all possible $i$. Since the speculation’s length is $\gamma$, the probability of accepting $i$ tokens with $i\geq\gamma+2$ is 0. $P(\#accepted\ tokens=1)$ is the probability of the first token being rejected, which is $1-\alpha$. For all $2\leq i\leq\gamma$, we have $P(\#accepted\ tokens=i)=P(\#accepted\ tokens=i-1)*\alpha$. $P(\#accepted\ tokens=\gamma+1)$ is the probability of accepting all tokens, which is $\alpha^{\gamma}$. Thus we have the following, which is Eq. 4:

\begin{align*}
E(\#tokens) &= \sum\limits_{i=1}^{\gamma+1}i*P(\#accepted\ tokens=i) \tag{8}\\
&= 1*(1-\alpha)+2*(1-\alpha)*\alpha+\dots+(\gamma+1)*\alpha^{\gamma} \\
&= (1-\alpha)+(2\alpha-2\alpha^{2})+(3\alpha^{2}-3\alpha^{3})+\dots+(\gamma+1)\alpha^{\gamma} \\
&= 1+(-\alpha+2\alpha)+(-2\alpha^{2}+3\alpha^{2})+(-3\alpha^{3}+4\alpha^{3})+\dots+(\gamma+1)\alpha^{\gamma} \\
&= 1+\alpha+\alpha^{2}+\alpha^{3}+\dots+\alpha^{\gamma} \\
&= \frac{1-\alpha^{\gamma+1}}{1-\alpha}
\end{align*}
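As a numerical sanity check on the derivation above, the following minimal sketch evaluates the expectation both ways: as the direct sum over the acceptance probabilities and via the closed form. The function names and the values of `alpha` and `gamma` are illustrative assumptions, not part of the paper's code.

```python
def expected_tokens_single(alpha: float, gamma: int) -> float:
    """Direct sum of i * P(#accepted tokens = i) for i = 1..gamma+1."""
    total = 0.0
    # For i <= gamma: P(i) = (1 - alpha) * alpha^(i-1)
    for i in range(1, gamma + 1):
        total += i * (1 - alpha) * alpha ** (i - 1)
    # For i = gamma + 1: every speculated token is accepted, P = alpha^gamma
    total += (gamma + 1) * alpha ** gamma
    return total


def closed_form_single(alpha: float, gamma: int) -> float:
    """Closed form (1 - alpha^(gamma+1)) / (1 - alpha); requires alpha != 1."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


# Illustrative values only (hypothetical acceptance rate and speculation length).
alpha, gamma = 0.8, 5
direct = expected_tokens_single(alpha, gamma)
closed = closed_form_single(alpha, gamma)
print(f"direct sum = {direct:.6f}, closed form = {closed:.6f}")
```

The two values should agree up to floating-point rounding for any $0\leq\alpha<1$.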

We then investigate the case of speculations with a batch size of $b$. We need to obtain the probability of accepting $i$ tokens, $P(\#accepted\ tokens=i)$. Since every speculation has length $\gamma$, the probability of accepting $a$ tokens with $a\geq\gamma+2$ is 0. We use $p_{i}$ to denote $(1-\alpha^{i})^{b}$, which is the probability that at most $i$ tokens are accepted in all $b$ speculations; in particular, $p_{0}=0$. For all $i\leq\gamma$, we have $P(\#accepted\ tokens=i)=p_{i}-p_{i-1}$. The probability $P(\#accepted\ tokens=\gamma+1)$ is $(1-p_{\gamma})$. Thus we have the following, which is Eq. 5:

\begin{align*}
E(\#tokens) &= \sum\limits_{i=1}^{\gamma+1}i*P(\#accepted\ tokens=i) \tag{9}\\
&= \sum\limits_{i=1}^{\gamma}i(p_{i}-p_{i-1})+(\gamma+1)*(1-p_{\gamma}) \\
&= (p_{1}-p_{0})+(2p_{2}-2p_{1})+(3p_{3}-3p_{2})+\dots+(\gamma+1)(1-p_{\gamma}) \\
&= -p_{0}+(p_{1}-2p_{1})+(2p_{2}-3p_{2})+(3p_{3}-4p_{3})+\dots+(\gamma+1) \\
&= -p_{0}-p_{1}-p_{2}-\dots-p_{\gamma}+(\gamma+1) \\
&= (\gamma+1)-\sum\limits_{i=1}^{\gamma}(1-\alpha^{i})^{b}
\end{align*}
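The batched result can be checked numerically in the same way. The sketch below, with hypothetical function names and illustrative parameter values, compares the direct sum using $p_{i}=(1-\alpha^{i})^{b}$ against the closed form, and confirms that $b=1$ recovers the single-candidate expectation.

```python
def p(i: int, alpha: float, b: int) -> float:
    """p_i = (1 - alpha^i)^b: probability that at most i tokens are accepted
    across all b speculations. Note p_0 = 0 for b >= 1."""
    return (1 - alpha ** i) ** b


def expected_tokens_batch_direct(alpha: float, gamma: int, b: int) -> float:
    """Direct sum of i * P(#accepted tokens = i) for i = 1..gamma+1."""
    total = 0.0
    for i in range(1, gamma + 1):
        total += i * (p(i, alpha, b) - p(i - 1, alpha, b))
    total += (gamma + 1) * (1 - p(gamma, alpha, b))
    return total


def expected_tokens_batch_closed(alpha: float, gamma: int, b: int) -> float:
    """Closed form (gamma + 1) - sum_{i=1}^{gamma} (1 - alpha^i)^b."""
    return (gamma + 1) - sum(p(i, alpha, b) for i in range(1, gamma + 1))


# Illustrative values only (hypothetical acceptance rate, length, batch sizes).
alpha, gamma = 0.8, 5
for b in (1, 2, 4, 8):
    d = expected_tokens_batch_direct(alpha, gamma, b)
    c = expected_tokens_batch_closed(alpha, gamma, b)
    print(f"b={b}: direct={d:.6f}, closed={c:.6f}")
```

Under this model the expectation grows with $b$, since each term $(1-\alpha^{i})^{b}$ shrinks as $b$ increases.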

Appendix D Prompt for LLaMA-2-Chat on Summarization Tasks

We use the following prompt for the summarization task, modified from (Ruan et al., 2023).

➤ Prompt:
[INST] <<SYS>>
You are an intelligent chatbot. Answer the questions only using the following context:
{Original Text}
Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don’t include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
<</SYS>>
Briefly summarize the given context. [/INST]
Summary:
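For readers who want to reproduce this prompt programmatically, the following is a minimal sketch of a template-filling helper. The function name `build_summarization_prompt`, the placeholder name `original_text`, and the exact placement of line breaks are assumptions made for illustration; the delimiters are the LLaMA-2-Chat `[INST]` and `<<SYS>>` markers shown in the prompt.

```python
# Template mirroring the prompt above; line breaks are an assumption
# based on how the prompt is laid out in this appendix.
PROMPT_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "You are an intelligent chatbot. Answer the questions only using the following context:\n"
    "{original_text}\n"
    "Here are some rules you always follow:\n"
    "- Generate human readable output, avoid creating output with gibberish text.\n"
    "- Generate only the requested output, don’t include any other language "
    "before or after the requested output.\n"
    "- Never say thank you, that you are happy to help, that you are an AI agent, etc. "
    "Just answer directly.\n"
    "- Generate professional language typically used in business documents in North America.\n"
    "- Never generate offensive or foul language.\n"
    "<</SYS>>\n"
    "Briefly summarize the given context. [/INST]\n"
    "Summary:"
)


def build_summarization_prompt(original_text: str) -> str:
    """Fill the {Original Text} slot of the summarization prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(original_text=original_text)


# Usage with a placeholder document.
prompt = build_summarization_prompt("Lookahead Decoding accelerates LLM inference.")
print(prompt)
```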

Appendix E Verification of Generation Quality for Greedy Sampling and Advanced Supports

Generation Quality with Greedy Search is not changed. Theoretically, Lookahead Decoding does not change the output of greedy search, owing to its verification mechanism. In practice, however, Lookahead Decoding’s output does not perfectly align with huggingface’s implementation of greedy search. We attribute this discrepancy to numerical accuracy issues. To substantiate this claim, we compared the outputs as follows. As a baseline, we use single-precision (FP32) inference of the LLaMA-2-7b-Chat model with huggingface’s greedy search on 160 turns of the MT-Bench dataset. With single-precision inference, the outputs of Lookahead Decoding (on 1 GPU, 4 GPUs, and 8 GPUs) are the same as the baseline output. With half-precision (FP16) inference, huggingface’s greedy search produces 35 out of 160 (w/o FlashAttention) and 42 out of 160 (w/ FlashAttention) answers that do not perfectly align with the baseline output. By comparison, Lookahead Decoding and its integration with FlashAttention and multi-GPU inference produce 35 to 44 results that differ from the baseline output under different settings. This result indicates that Lookahead Decoding retains the output distribution under greedy search within the numerical error range (not worse than huggingface’s half-precision inference). Tab. 2 further supports this statement for greedy search.
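The comparison described above amounts to counting, per setting, how many answers differ from the FP32 baseline. A minimal sketch of such a count follows; the function name and the toy strings are hypothetical stand-ins, not the paper's evaluation script or data.

```python
from typing import Sequence


def count_mismatches(baseline: Sequence[str], candidate: Sequence[str]) -> int:
    """Number of positions where the candidate output is not identical to the baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same turns")
    return sum(1 for ref, out in zip(baseline, candidate) if ref != out)


# Toy stand-ins for decoded answers (not real model outputs).
fp32_baseline = ["answer A", "answer B", "answer C", "answer D"]
fp16_outputs = ["answer A", "answer B (diverged)", "answer C", "answer D"]

mismatches = count_mismatches(fp32_baseline, fp16_outputs)
print(f"{mismatches} out of {len(fp32_baseline)} answers differ from the baseline")
```

Exact string equality is the strictest possible criterion; a single token flipped by rounding error makes the remainder of that answer diverge, which is consistent with FP16 runs showing mismatches even without Lookahead Decoding.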

Generation Quality with LP and FlashAttention Augmentation is not changed. We verify that FlashAttention and LP support do not change the compression ratio ($\mathcal{S}$) of vanilla Lookahead Decoding. We compared 18 generations each of Lookahead Decoding w/ FlashAttention and w/o FlashAttention (7B and 13B models on MT-Bench, HumanEval, and ClassEval); the average $\mathcal{S}$ w/ FlashAttention is 3.267, while w/o FlashAttention it is 3.259, a difference of less than 0.3%. We also compared 6 generations of Lookahead Decoding on a single GPU and 12 generations with LP (7B model on MT-Bench, HumanEval, and ClassEval, both with $N=5$, $W=15$, and $G=15$). The average $\mathcal{S}$ on a single GPU is 2.558, while on multiple GPUs it is 2.557, a difference of less than 0.1%. We conclude that our advanced support does not change $\mathcal{S}$.
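The relative differences quoted above can be reproduced from the reported averages with a few lines of arithmetic. In the sketch below, the helper name is hypothetical, and taking the second value of each pair as the reference is an assumption; the four averages are those reported in the paragraph.

```python
def relative_difference_percent(value: float, reference: float) -> float:
    """Absolute difference between two averages, as a percentage of the reference."""
    return abs(value - reference) / reference * 100


# Average compression ratios S reported in the text.
flash_diff = relative_difference_percent(3.267, 3.259)  # w/ vs. w/o FlashAttention
lp_diff = relative_difference_percent(2.558, 2.557)     # single GPU vs. multiple GPUs (LP)

print(f"FlashAttention: {flash_diff:.3f}% difference")
print(f"LP:             {lp_diff:.3f}% difference")
```

Both values fall under the stated bounds of 0.3% and 0.1%, respectively.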