Break the Sequential Dependency of LLM Inference Using
Lookahead Decoding
Abstract
Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead Decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the number of total decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead Decoding can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is avialable at https://github.com/hao-ai-lab/LookaheadDecoding
1 Introduction
Large language models (LLMs) are transforming the AI industry. As they are increasingly integrated into diverse applications such as search (Team et al., 2023) and chatbots (Ouyang et al., 2022), generating long sequences at low-latency using LLMs is becoming one significant requirement. However, current LLMs generate text based on (Touvron et al., 2023a, b; Jiang et al., 2023; OpenAI, 2023) autoregressive decoding, which falls short in efficiency, primarily for two reasons. First, autoregressive decoding generates only one token at a time. Hence, the overall generation time is proportional to the number of decoding steps. Second, each decoding step largely underutilizes the parallel processing capabilities of modern accelerators (e.g., GPUs). Given the pressing need for low latency in various applications, improving autoregressive decoding remains a central challenge.
Several approaches have been proposed – one such approach is speculative decoding (Chen et al., 2023; Leviathan et al., 2023) and its variants (He et al., 2023; Stern et al., 2018; Cai et al., 2024; Li et al., 2023; Liu et al., 2023; Miao et al., 2023). These methods all follow a guess-and-verify approach: they use a draft model to speculate several subsequent tokens and then use the original (base) LLM to verify these tokens in parallel. Since the draft model requires much fewer resources and the cost of verifying multiple tokens in parallel is similar to the cost of generating a single token, these methods can achieve considerable speedups. However, their speedups are bounded by the token acceptance rate (§4.1), i.e., the fraction of tokens generated by the draft model that passes the verification test of the base model. This is because every token that fails verification needs to be regenerated by the base model. In the worst case, if most proposed tokens fail verification, these methods may slow down the decoding process. Therefore, achieving a high acceptance rate is essential for these methods. Unfortunately, training a draft model to achieve a high acceptance rate is non-trivial, and the trained draft model does not generalize across base models and datasets.
To address these problems, this paper develops Lookahead Decoding. We build upon a key observation: autoregressive decoding can be equivalently formulated as solving a non-linear system via the fixed point Jacobi iteration method (§2), which we term as Jacobi decoding (Santilli et al., 2023). Each Jacobi decoding step can generate multiple tokens in parallel at different positions. Although these tokens may appear at incorrect positions, we can leverage this parallel generation approach to have the LLM generate several disjoint n-grams in parallel in a single step. These n-grams could potentially be integrated into future parts of the generated sequence, pending verification by the base model to maintain the output distribution.
Lookahead Decoding takes advantage of the particular characteristics of autoregressive decoding, which is bounded by the memory bandwidth–as each generated token depends on all tokens before it–rather than compute, by using the available cycles to generate and verify -grams (subsequent tokens) at virtually no additional cost. In a nutshell, Lookahead Decoding consists of a lookahead branch that generates -grams and a verification branch that verifies -grams, both executing in a single step. To improve efficiency, we use an -gram pool to cache the historical -grams generated so far. This way, Lookahead Decoding can significantly reduce the latency of LLM inference just by exploiting the compute resources that autoregressive decoding would leave unused. More importantly, Lookahead Decoding scales with the compute – we show that it can linearly reduce the number of decoding steps relative to the log(FLOPs) allocated per step.
We have implemented the algorithm in both Python and CUDA, compatible with memory-efficient attention algorithms (e.g., FlashAttention (Dao, 2023)), and supports various sampling methods without changing the output distribution. We also scale it to multiple GPUs, resulting in Lookahead Parallelism. We evaluate Lookahead Decoding on the popular LLaMA-2 (Touvron et al., 2023b) models. It achieves 1.8x speedup on the challenging multi-turn chat dataset MT-Bench (Zheng et al., 2023) and up to 4x speedup in code completion tasks with Lookahead Parallelism on 8 GPUs. Lookahead Decoding showed significant potential in lowering the latency for latency-sensitive tasks. Our contributions are summarized as follows.
-
•
We design Lookahead Decoding, a new lossless, parallel decoding algorithm to accelerate LLM inference without needing any auxiliary component.
-
•
We reveal Lookahead Decoding’s scaling behavior: it linearly reduces the number of decoding steps according to per-step FLOPs. This enables trade-offs between the number of decoding steps and per-step FLOPs, making it future-proof.
-
•
We show it benefits from the latest memory-efficient attentions and is easily parallelizable by developing its distributed CUDA implementations.
-
•
We evaluated Lookahead Decoding and demonstrate its effectiveness under different settings.
2 Background
In this section, we formulate both autoregressive and Jacobi decoding from the lens of solving nonlinear systems.
Causal Attention in Decoder Models. Most contemporary LLMs are composed of two core components: token-wise modules (including MLP and normalization (Ba et al., 2016; Zhang & Sennrich, 2019)) and attention (Vaswani et al., 2023) modules. Tokens interact with each other in the attention modules, while in other token-wise modules, they are processed without exchanging information with each other.
The attention layer encompasses three input elements: query , key , and value , with the -th token in each denoted as , , and , respectively. The attention layer executes the following operation: . A lower triangular mask applied to in causal attentions (specific to decoder models) ensures that is calculated only from and , where . Because all other layers in the LLM perform token-wise operations, for any given model input and output , (-th token in ) is exclusively influenced by (-th token in ) where .
Autoregressive Decoding in LLMs. LLMs take discrete integer sequences as inputs, where each integer represents a token. We notate of length as the input of the model, and to denote a slice of of length at step . LLMs’ output characterizes the probability distribution of the next token. The probability for the -th token (i.e., the output of the -th token) is decided by all previous input tokens, represented as . Then, the next token input is obtained by sampling from using different methods (e.g., greedy, top-K, and top-P (Kool et al., 2020; Holtzman et al., 2020)). When using greedy sampling, the next token is selected by applying an function on .
We define as the prompt tokens given by the user. The LLM needs to generate an output sequence (of length ) from . Denote as the token generated at step . The autoregressive decoding process of tokens can be seen as solving the following problems one by one (assume greedy sampling):
(1) |
Guess-And-Verify Paradigm. The Guess-And-Verify decoding paradigm speculates multiple potential future tokens and subsequently confirms the correctness of these speculations within a single decoding step. Take speculative decoding with greedy sampling as an example: at step , with the prompt and tokens generated so far, we can use a draft model to autoregressively generate a draft sequence of length . Because is known a priori, we then use the LLM to solve Eqs 2 in parallel, obtaining . Then, we verify if is equal to for each from to . If there is a match, we accept this token and proceed; otherwise, we stop checking and drop subsequent tokens. Finally, we update with all accepted tokens.
(2) |
As stated in §1, these approaches depend on a good draft model, which is hard to obtain and cannot generalize.
Jacobi Decoding. By notating , we can transform Eqs 1 into the following non-linear system of equations (Song et al., 2021; Santilli et al., 2023):
(3) |
We can solve this non-linear system using Jacobi iteration by iteratively updating all from a random initial guess , along the trajectory , until converging to the fixed point solution . We detail this algorithm, termed as Jacobi decoding, in Appendix Algorithm 1. This process guarantees to return the solution of all variables in at most iterations, as the very first token of each Jacobi update matches autoregressive decoding. Sometimes, more than one token might be correctly generated in a single iteration, potentially reducing the number of decoding steps. It is worth noting that, as is generated based on the past value on the trajectory, any two adjacent tokens from and can form a meaningful 2-gram.
Limitations of Jacobi Decoding. Empirically, we observe Jacobi decoding can hardly reduce decoding steps, even if it can generate multiple tokens per step. This is because the generated tokens are often put in the wrong positions of the sequence, and correctly placed tokens are frequently replaced by subsequent Jacobi iterations. These prevent it from achieving wall-clock speedup.
3 Lookahead Decoding
Lookahead Decoding leverages Jacobi decoding’s ability to generate many tokens in one step but addresses its limitation. Fig. 1 illustrates its workflow. The key design in Lookahead Decoding is to keep track of the trajectory of Jacobi decoding and generate -gram from this trajectory. This is achieved by maintaining a fixed-sized 2D window, with the two dimensions corresponding to the sequence and the time axis, respectively, to generate multiple disjoint -grams from the Jacobi iteration trajectory in parallel. We call this process the lookahead branch. In addition, Lookahead Decoding introduces an -gram pool to cache these -grams generated along the trajectory. Promising -gram candidates are verified later by a designed verification branch to preserve the LLM’s output distribution; if passing verification, those disjoint n-grams are integrated into the sequence. The detailed algorithm is shown in Algorithm 2 in Appendix.
3.1 Lookahead Branch
Lookahead Decoding uses a fixed-sized 2D window for efficient -gram generation. In contrast to the original Jacobi decoding, which only uses the history tokens from the last step (or equivalently, it generates 2-grams), Lookahead Decoding generates many -grams, with , in parallel by using the past steps’ history tokens, effectively leveraging more information from the trajectory. The fixed-sized 2D window in the lookahead branch is characterized by two parameters: (1) defines the lookahead size into future token positions to conduct parallel decoding; (2) defines the lookback steps into the past Jacobi trajectory to retrieve -grams. See Algorithm 2 for a detailed process.
An example of the lookahead branch with and is in Fig. 2 (b), in which we look back steps and look ahead tokens for each step. The blue token with the digit 0 is the current step’s () input, and the orange, green, and red tokens were generated in previous lookahead branches at steps , , and , respectively. The digit on each token shows its relative position to the current input (i.e., the blue one labeled as 0). In the present stage, we perform a modified Jacobi iteration to generate new tokens for all 5 positions, following the trajectory formed by the preceding 3 steps. Once generated, we collect and cache them in the -gram pool () – for instance, a 4-gram consists of the orange token at position 1, the green token at position 2, the red token at position 3, and a newly generated token.
The most outdated tokens in both dimensions (time and sequence) will be removed, and newly generated tokens will be appended to the lookahead branch to maintain a fixed window size for each step. For example, we will remove all orange and green tokens with position 1 in Fig. 2. We then form a new lookahead branch with green tokens with indices 2, 3, 4, 5, all red tokens, and all newly generated tokens for the next step.
3.2 Verification Branch
Lookahead Decoding preserves the output distribution via its verification branch. We first discuss how to verify in greedy sampling. Recall in speculative decoding: the verification is performed by sending the draft tokens to the LLM to get an output for each draft token, then progressively checking if the last token’s corresponding output, generated by the target LLM, exactly matches the draft token itself (§2). The verification branch in Lookahead Decoding resembles this process, despite verifying many draft -gram candidates in parallel. In particular, We first look up from the -gram pool for “promising” -grams – by checking if a -gram starts with a token that exactly matches the last token of the current ongoing sequence. We then use the LLM to verify all these -grams in parallel, following a similar fashion as in speculative decoding. See Algorithm 3 in the Appendix for the detailed procedures.
We next discuss how to support more advanced sampling. Previous research (Miao et al., 2023) has developed efficient tree-based verification for speculative decoding with sampling support, where multiple draft sequences derived from a token tree can be verified in parallel. However, it does not apply to Lookahead Decoding as our verification works on disjoint -grams instead of trees. We improve it by progressively verifying along the -gram length and removing -grams with mismatched prefixes. Besides, speculative decoding style verification requires the probability distribution where the draft token is sampled to update the probability distribution when the draft token is rejected. Because we store all -grams in a pool instead of discarding them each step, we would need huge memory to store the probability distributions (each of vocabulary size) for the entire -gram pool. The key to overcome this is to leverage the mechanism that the verification is indifferent to how draft tokens were sampled – different sampling methods (e.g., greedy sampling) only influence the acceptance rate but keep the output distribution. We can force greedy sampling at the -gram generation (lookahead branch), in which the probability distribution degenerates into a one-hot vector. Hence we only need to store which token is selected. We elaborate the approach in Algorithm 4, prove its correctness in Appendix B, and verify its quality and speedups in §5.3.
It is expected to have an increasingly large -gram cache hence a growing verification branch as decoding progresses. We set a cap of to limit the maximum number of promising candidates run in parallel in the verification branch to manage the verification cost. Empirically we suggest to set proportional to to balance generation and verification. In practice, we simply set .
3.3 Decode, Predict, and Verify in The Same Step
At execution, the lookahead and verification branches can be integrated into one decoding step to leverage parallel processing. This requires a designated attention mask, as shown in Fig. 2 (b). This attention mask is straightforwardly derived following the principle that each token is only visible to the tokens with a larger position index than itself (§2). For example, only the green token at position 5 and all orange tokens are visible to the red token 6. The tokens in the lookahead branch are not visible to the tokens in the verification branch, and vice versa.
Integration with FlashAttention. FlashAttention (Dao et al., 2022; Dao, 2023) can vastly accelerate the training and inference of LLMs by saving memory I/O on the slow memory hierarchy. It forces a causal mask (e.g., Fig. 2 (a)) to avoid all token interactions outside a lower triangular scope, which is not suitable for Lookahead Decoding as we take a more subtle attention mask (e.g., Fig. 2 (b)) for different , , and . To solve this, we hardcode Lookahead Decoding’s attention pattern with adjustable , , and in FlashAttention. Applying FlashAttention to Lookahead Decoding brings about 20% end-to-end speedup compared to a straightforward implementation on top of native PyTorch in our experiments (§5.2).
3.4 Lookahead Parallelism
Lookahead Decoding is easy to parallelize on multiple GPUs for both lookahead and verification branches. Parallelizing the lookahead branch is achieved by noting that the lookahead computation is composed of several disjoint branches. For example, the branch with green 1 and red 2 tokens does not have interaction with the branch with the tokens green 3 and red 4 in Fig. 2 (b). We can put these disjoint branches onto different GPUs without introducing communication during the inference computation. Parallelizing the verification branch is done by assigning multiple -gram candidates to different devices. Because the verification of each candidate, by design, is independent of others, this will not cause communication.
Fig. 3 shows an example of parallelizing the lookahead branch and verification branch in Fig. 2 (b) to four GPUs. This workload allocation will have the orange token 0,1,2,3 and the input token 0 be redundantly placed and computed. However, it can essentially save communication volume during the whole forward pass. We only need to synchronize the generated tokens on each device after the forward pass. We can further scale the , , and with multiple GPUs’ increased FLOPs to obtain a lower latency according to Lookahead Decoding’s scalability (§4).

We name this new parallelism as lookahead parallelism (LP). Unlike previous parallelism methods (including pipeline and tensor parallelisms) that shard the model parameters or states across different GPUs, LP maintains an entire copy of the model for each GPU (thus needing more memory) and allows distributing tokens to different GPUs. Hence, LP is advantageous in inference as it introduces near-zero communication per step while existing model parallelism methods (Narayanan et al., 2021; Shoeybi et al., 2019) involve a large communication overhead on the critical path of each decoding step.
4 Scaling Law of Lookahead Decoding
Since Lookahead Decoding introduces flexible parameters and associated with the cost of each parallel decoding step. This section investigates the scaling law between compute FLOPs and the theoretical speedup, and compares it to speculative decoding.
4.1 Estimating Speedup for Speculative Decoding
Speculative decoding uses the draft model to speculate one token sequence at each step. We represent the probability of each token in the sequence passing the verification of the LLM by (acceptance rate) and notate its expectation . If we use the draft model to guess tokens per step, the expectation of the number of accepted tokens is denoted as (Leviathan et al., 2023):
(4) |
Instead of speculating one sequence every time, we would speculate sequences. We assume that sequences, each of tokens, are sampled as each token will have the same acceptance rate of . Under this setting, the expectation of the number of accepted tokens is denoted as follows:
(5) |
See derivations in Appendix C for Eq. 4 and Eq. 5. Note that when , Eq. 5 falls back to Eq. 4.
4.2 Estimating Speedup for Lookahead Decoding
We define the as the number of autoregressive steps divided by the number of Lookahead Decoding steps to generate the same length of the sequence. As the number of generated tokens equals the autoregressive steps, it can be denoted as:
(6) |
Lookahead Decoding speculates sequences every time as in Eq. 5. In each step, we will search -grams in the pool starting with the current input token and have at most speculations of length . As we set (§3.2), we have and using the notations in Eq. 5. In practice, we cannot expect each step to have equally good speculations (i.e., acceptance rate with ). We assume that, on average, for every step, we have one good speculation with tokens accepted, and for the other steps, we fall back to autoregressive decoding due to bad speculations. We use this to bridge and per step as follows:
(7) |
We can plot the curve indicated by our formulation with one specific setting as in Fig. 4 (b). We find that the trend of our empirical experiments (LLaMA-2-Chat-7B on MT-Bench with as in Fig. 4 (a)) align well with the formulation to some extent. From this formulation, we conclude that we can linearly reduce the number of decoding steps according to per-step given a large enough . In contrast to speculative decoding, Lookahead Decoding will not meet an upper bound indicated in Eq. 4 by simultaneously increasing and . This reveals the scaling law of Lookahead Decoding to linearly reduce decoding steps according to per-step FLOPs given a large enough , since per-step FLOPs is roughly proportional to the number of input tokens (i.e., ). The scaling law also suggests Lookahead Decoding’s strong scaling to multiple GPUs, in which we can obtain an even greater per-token latency reduction by using more FLOPs, which is advantageous for latency-sensitive tasks.
5 Evaluation Results
Model and testbed. We used various versions of the LLaMA-2 (Touvron et al., 2023b) and CodeLlama (Roziere et al., 2023) models, including the 7B, 13B, 34B, and 70B sizes, on two GPU setups S1 and S2. S1 is equipped with NVIDIA A100 GPUs with 80GB of memory. On S1, the 7B, 13B, and 34B models are deployed on a single A100, while the 70B model utilizes 2 A100s with pipeline parallelism supported by Accelerate (Gugger et al., 2022). S2 is a DGX machine with 8 NVIDIA A100 GPUs with 40GB memory and NVLink. All models serve with FP16 precision and batch of 1 if not specified (Cai et al., 2024; He et al., 2023).
Datasets. We benchmarked Lookahead Decoding’s performance across a broad spectrum of datasets and tasks. MT-Bench (Zheng et al., 2023) is a diverse set of multi-turn questions with many unique tokens. GSM8K (Cobbe et al., 2021) contains a set of math questions, in which we use the first 1k questions. HumanEval (Chen et al., 2021) covers both code completion and infilling tasks. We also test on MBPP (Austin et al., 2021) dataset for instruction-based code generation, and on ClassEval (Du et al., 2023) for class-level code completion. To control generation length in code generation tasks, we set the maximum sequence length to 512 and 2,048 on HumanEval and ClassEval, respectively, aligned with prior setups (Ben Allal et al., 2022; Du et al., 2023). Tab. 1 lists detailed settings. In addition, we validate the effectiveness of sampling (§3.2) on XSum (Narayan et al., 2018) and CNN/Daily Mail (See et al., 2017) datasets.
Baseline Settings. Our primary baseline is HuggingFace’s implementation of greedy search (Wolf et al., 2020). Additionally, we employ FlashAttention (Dao et al., 2022; Dao, 2023) as a stronger baseline to assess the performance of FlashAttention empowered Lookahead Decoding. In distributed settings, we evaluate LP against TP (supported by deepspeed (Aminabadi et al., 2022)) and PP (supported by accelerate (Gugger et al., 2022)). We measure the throughput of single batch inference against these baseline settings (Cai et al., 2024; He et al., 2023).
Server | Parallel. | Model | Model Size | Dataset |
S1 | w/o LP | LLaMA-2-chat | 7B, 13B, 70B | MT-Bench |
CodeLLaMA | 7B, 13B, 34B | HumanEval | ||
CodeLLaMA-Inst | 7B, 13B, 34B | MBPP, GSM8K | ||
S2 | w/ LP | LLaMA-2-chat | 7B, 13B | MT-Bench |
CodeLLaMA | 7B, 13B | HumanEval | ||
CodeLLaMA-Python | 7B, 13B | ClassEval |



5.1 End-to-end Performance
Fig. 5 shows the end-to-end performance of Lookahead Decoding when compared with HuggingFace’s implementation of greedy search on S1. The used tasks and models are shown in Tab. 1. Across various datasets, Lookahead Decoding demonstrates a 1.5x-2.3x speedup. Generally, our method exhibits better performance in code completion tasks (e.g., 2.3x), given the higher occurrence of repetitive tokens during code completions, making predictions easier. Besides, smaller models also exhibit a higher speedup when compared to larger models. This is because Lookahead Decoding trades per-step FLOPs with a step compression ratio (§4). A larger model requires more FLOPs and quickly hits the GPU FLOPs cap compared to a smaller model. So, it shows a lower ability to compress decoding steps given the same GPU setting.
5.2 Performance with LP and FlashAttention
We evaluated the performance of Lookahead Decoding with LP and FlashAttention augmentation on S2 with greedy search. The used tasks and models are shown in Tab. 1. The results for the 7B and 13B models are in Fig. 6 and Fig. 7, respectively. FlashAttention speeds up the PyTorch implementation of Lookahead Decoding by 20%. Notably, FlashAttention-integrated Lookahead Decoding shows 1.8x speedups for the 7B model on MT-Bench compared with autoregressive decoding with FlashAttention (i.e., 1.9x vs 1.07x in Fig. 6). We did a strong scaling of the workloads to multiple GPUs for distributed settings (i.e., increasing GPUs but not increasing workloads). The multiple GPU settings of both TP (w/ DeepSpeed) and PP (w/ Accelerate) bring slowdowns (i.e., 0.75x-0.82x). The results echos DeepSpeed’s documentation (dee, 2023). However, with Lookahead Decoding, we can further utilize the FLOPs of multiple GPUs to reduce the inference latency (e.g., 4x on ClassEval).
Dataset | Temp. | Method | Rouge-1 | Rouge-2 | Rouge-L | Speedups | |
CNN. | 1.0 | AR. | 36.55 | 13.20 | 22.68 | 1.00x | 1.00x |
LA. | 36.53 | 13.27 | 22.71 | 1.46x | 1.64x | ||
0.0 | AR. | 37.79 | 14.59 | 23.96 | 1.00x | 1.00x | |
LA. | 37.79 | 14.59 | 23.96 | 1.57x | 1.72x | ||
Xsum | 1.0 | AR. | 19.15 | 4.53 | 12.84 | 1.00x | 1.00x |
LA. | 19.20 | 4.53 | 12.87 | 1.50x | 1.67x | ||
0.0 | AR. | 19.38 | 4.78 | 13.05 | 1.00x | 1.00x | |
LA. | 19.39 | 4.79 | 13.06 | 1.60x | 1.77x |
Tag | Setting (N, W, G) | Prompt as Ref. | Speedups | |
① | Autoregressive | ✗ | 1.00x | 1.00 |
② | Prompt Lookup | ✓ | 1.44x | 1.55 |
③ | (10, 1, 3) | ✓ | 1.36x | 1.45 |
④ | (5, 1, 10) | ✓ | 1.36x | 1.51 |
⑤ | (5, 1, 30) | ✗ | 1.04x | 1.12 |
⑥ | (5, 1, 30) | ✓ | 1.46x | 1.59 |
⑦ | (5, 30, 1) | ✗ | 1.61x | 1.79 |
⑧ | (5, 15, 15) | ✗ | 1.78x | 1.96 |
⑨ | (5, 15, 15) | ✓ | 1.88x | 2.05 |
5.3 Generation Quality of Lookahead Decoding
We assess the generation quality of Lookahead Decoding on LLaMA-2-7B-Chat model with the prompts in Appendix D on summarization datasets (Chen et al., 2023; Leviathan et al., 2023) in Tab. 2. Whether the sampling is activated, Lookahead Decoding can reserve the output distribution quality, which is evaluated in rouge-1, rouge-2, and rouge-L (Lin, 2004), while achieving 1.46x-1.60x speedups compared with autoregressive decoding. Using sampling gives smaller speedups as the acceptance ratio is lower according to the sampling verification algorithm 4, which aligns with the results in the previous research (Chen et al., 2023; Leviathan et al., 2023). We further verify that using greedy sampling and advanced integrations will not change the generation quality in Appendix E.
5.4 Ablation Study
In this section, we study the importance of the lookahead and verification branch in achieving a high speedup. We experiment on LLaMA-2-7B-Chat and MT-Bench on S1 with various settings. The results are shown in Tab. 3.
We ablate the importance of lookahead branch by comparing the performance of using a lookahead branch to the recent methods of using prompts as reference (Yang et al., 2023; Saxena, 2023). This comparison assumes that Lookahead Decoding does not use the prompt to build the n-gram pool. We use the implementation in transformers v4.37 of prompt lookup as a baseline (②, with prompt_lookup_num_tokens=10). We also use prompt to build n-gram pool to augment Lookahead Decoding (③④⑥⑨). The results show that although using a minimal lookahead branch () with various settings (③④⑤⑥) can obtain a decent speedup on MT-Bench, it is still not as good as using balanced branches (⑧). We can find that prompt lookup can surpass prompt as reference implementation in Lookahead Decoding. This is because our method checks if -gram starts with one token that exactly matches the last generated token while prompt lookup in transformers v4.37 checks several starting tokens for a better speculation.
We ablate the importance of verification branch by reporting the speedup of using a tiny verification branch and a large lookahead branch (⑦, ) . It shows lower performance due to lower potential in accepting speculations compared with a balanced branches (⑧).
Besides, our evaluation shows that using prompt as reference can further boost Lookahead Decoding (⑧ and ⑨). We have integrated them in our implementation.
5.5 Discussion and Limitation
Model | Window Size () | N-gram Size () |
7B | 15 | 5 |
13B | 10 | 5 |
34B | 7 | 5 |
The main limitation of Lookahead Decoding is that it requires extra computations. Our experimental results show that on A100, the configuration in Tab. 4 works near optimally in most cases for single batch serving. Because the per-step FLOPs are roughly proportional to the number of per-step input tokens, which is . If we ignore the attention cost’s increase with sequence length, the 7B, 13B, and 34B models require 120x, 80x, and 56x extra FLOPs per step, respectively. Since the LLM decoding is memory bandwidth-bound rather than compute-bound, these extra FLOPs only turn into a limited wall-clock slowdown for each step.
Given this, Lookahead Decoding needs large surplus FLOPs to obtain high speedups. Running in compute-bound environments (e.g., serving with a large batch size) may cause slowdowns. Another example is shown in Fig. 8, where lower speedup is observed when the GPU’s cap FLOPs is smaller (e.g., on RTX 3090 GPUs).
Based on §4, we need to exponentially increase the per-step FLOPs to obtain a linear reduction in decoding steps. Hence, the setting in Tab. 4 faces a diminishing return. However, when FLOPs are not rich, we see that a gentle speedup (e.g., on RTX 3090 and on A100) on MT-Bench easily achievable, as in Fig. 8, which is a free lunch that requires no extra model, training, or changing the output distribution.
6 Related Work
Speculative decoding (Chen et al., 2023; Leviathan et al., 2023) pioneer in speedup autoregressive decoding with a draft model. Different methods for obtaining speculations are researched. Specinfer (Miao et al., 2023) uses many draft models obtained from distillation, quantization, and pruning to conduct speculations together. Medusa (Cai et al., 2024), OSD (Liu et al., 2023), and EAGLE (Li et al., 2023) use training to obtain a draft model. REST (He et al., 2023) uses the finetuning dataset itself as a datastore to lookup speculations, while other works (Yang et al., 2023; Saxena, 2023) uses prompt as a reference for speculations. Different from these methods, Lookahead Decoding uses LLM’s parallel generation ability for speculations. Sampling methods are also researched. Specinfer maintains output distribution by a tree-based sampling algorithm. Medusa uses a typical acceptance scheme to accelerate when the temperature is large but does not persist on an exact output distribution. Lookahead Decoding follows Specinfer to maintain output distribution but with multiple disjoint -grams.
7 Conclusion
In this paper, we present Lookahead Decoding to parallelize the autoregressive decoding of LLMs without changing the output distribution. It shows notable speedup without a draft model and can linearly decrease the decoding steps with exponential investment in per-step FLOPs.
References
- dee (2023) Automatic tensor parallelism for huggingface models, 2023. URL https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism.
- Aminabadi et al. (2022) Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE, 2022.
- Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Ben Allal et al. (2022) Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
- Cai et al. (2024) Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads, 2024.
- Chen et al. (2023) Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Dao (2023) Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
- Dao et al. (2022) Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- Du et al. (2023) Du, X., Liu, M., Wang, K., Wang, H., Liu, J., Chen, Y., Feng, J., Sha, C., Peng, X., and Lou, Y. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation, 2023.
- Gugger et al. (2022) Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M., and Bossan, B. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
- He et al. (2023) He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023.
- Holtzman et al. (2020) Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration, 2020.
- Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Kool et al. (2020) Kool, W., van Hoof, H., and Welling, M. Ancestral gumbel-top-k sampling for sampling without replacement. Journal of Machine Learning Research, 21(47):1–36, 2020. URL http://jmlr.org/papers/v21/19-985.html.
- Leviathan et al. (2023) Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
- Li et al. (2023) Li, Y., Zhang, C., and Zhang, H. Eagle: Lossless acceleration of llm decoding by feature extrapolation, December 2023. URL https://sites.google.com/view/eagle-llm.
- Lin (2004) Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013.
- Liu et al. (2023) Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding, 2023.
- Miao et al. (2023) Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Zhu, A., Yang, L., Shi, X., Shi, C., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. Specinfer: Accelerating generative large language model serving with speculative inference and token tree verification, 2023.
- Narayan et al. (2018) Narayan, S., Cohen, S. B., and Lapata, M. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018.
- Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V. A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Roziere et al. (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Ruan et al. (2023) Ruan, J. T., Sabir, F., and Chopra, P. Best prompting practices for using the llama 2 chat llm through amazon sagemaker jumpstart, November 2023. URL https://aws.amazon.com/cn/blogs/machine-learning/best-prompting-practices-for-using-the-llama-2-chat-llm-through-amazon-sagemaker-jumpstart/.
- Santilli et al. (2023) Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodola, E. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12336–12355, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.689.
- Saxena (2023) Saxena, A. Prompt lookup decoding, November 2023. URL https://github.com/apoorvumang/prompt-lookup-decoding/.
- See et al. (2017) See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083, 2017.
- Shoeybi et al. (2019) Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Song et al. (2021) Song, Y., Meng, C., Liao, R., and Ermon, S. Accelerating feedforward computation via parallel nonlinear equation solving, 2021.
- Stern et al. (2018) Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models, 2018.
- Team et al. (2023) Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani et al. (2023) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023.
- Wolf et al. (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
- Yang et al. (2023) Yang, N., Ge, T., Wang, L., Jiao, B., Jiang, D., Yang, L., Majumder, R., and Wei, F. Inference with reference: Lossless acceleration of large language models, 2023.
- Zhang & Sennrich (2019) Zhang, B. and Sennrich, R. Root mean square layer normalization, 2019.
- Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
Appendix A Algorithms
Appendix B Proof: Output distribution preserved disjoint n-gram verification
The sampling verification in Lookahead Decoding is adapted from the algorithm in Specinfer but with all speculations generated by the greedy sample. It does not change the output distribution from a fundamental point that how the draft model generates speculations is unimportant.
Theorem A For a given LLM, prompt and previously generated tokens , and speculations of next token . Each speculation token is sampled by a greedy sample (i.e., probability of 1). We use to represent the probability of sampled by the LLM and use to represent the probability of sampled by our proposed algorithm 4. We use and for short. We need to prove for any , and any and from the full vocabulary .
Proof.
The proof of this part corresponds to line 14 to line 44 in algorithm 4. Given speculations , we use to represent the probability that the token is accepted by the -th speculation (line 18-line 30), and is the probability that the token is rejected by the -th speculation (line 30-line 35), where is the -th speculation’s token. Moreover, is the probability of being accepted by the sampling at line 41. For simplicity, we use to represent , to represent , and use to represent . We use to represent the probability distribution obtained in line 13 and to present the updated probability before the -th speculation. We have as is never updated. We define is the probability of sampled by algorithm 4 when we have speculations. Then we should have:
We use induction to prove for any , any , and any with :
-
1)
When , we have . The initial guess can be either the same at or be different from .
-
(1)
When , equals at line 17, which is the same as as it is never updated. Upon this, we have . And, is the updated probability at line 42. Since is set to zero once rejected at line 32, . In this case, .
-
(2)
When , should be even if is accepted. Moreover, we have . Then is updated to at lines 32 and 33. Then . In this case, .
-
(1)
-
2)
When holds, which means for any , .
We prove for the same , , and any .
-
(1)
When , we have . If , we have all , and . It ensures that .
If , since by observation.
Then we have . Here we have another two cases:
➀ If , and . We have .
➁ If , . . So
-
(2)
When , is set to zero at line 32 after this step. In this case, we have , and . It makes that
-
(1)
∎
This part of the proof guarantees that from line 14 to line 44 in Algorithm 4, any new token appended to can follow the original distribution of the LLM. Line 21 to line 28 guarantees that sequences in share the same prefix of length in every iteration. This further guarantees that from is the same for all , follows the wanted distribution. Thus, the correctness of the whole sampling algorithm is proved.
Appendix C Derivation of Expectation of The Number of Accepted Tokens
We first start with single-candidate speculation. We need to obtain the probability of accepting tokens as for all possible . Since the speculation’s length is , the probability of accepting tokens with is 0. is the probability of the first token being rejected, which is . The probability , for all . The probability is accepting all tokens, which is . Thus we have the following, which is Eq. 4:
(8) | |||||
We then investigate the case of speculations with a batch size of . We need to obtain the probability of accepting tokens as . Since all speculations’ length is , the probability of accepting tokens with is 0. We use to denote , which is the probability that at most tokens are accepted in all speculations. For all , we should have . And, the probability should be . Thus we have the following, which is Eq. 5:
(9) | |||||
Appendix D Prompt for LLaMA-2-Chat on Summarization Tasks
We use the following as the prompt for summarization task, modified from (Ruan et al., 2023).
➤ Prompt: |
INST SYS |
You are an intelligent chatbot. Answer the questions only using the following context: |
{Original Text} |
Here are some rules you always follow: |
- Generate human readable output, avoid creating output with gibberish text. |
- Generate only the requested output, don’t include any other language before or after the requested output. |
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly. |
- Generate professional language typically used in business documents in North America. |
- Never generate offensive or foul language. |
SYS |
Briefly summarize the given context. INST |
Summary: |
Appendix E Verification of Generation Quality for Greedy Sampling and Advanced Supports
Generation Quality with Greedy Search is not changed. Theoretically, Lookahead Decoding does not change the output generation of greedy search due to the verification mechanism. However, Lookahead Decoding’s output does not perfectly align with the huggingface’s implementation of greedy search in practice. We attribute this discrepancy to numerical accuracy issues. To substantiate this claim, we compared the output results as follows. We use the LLaMA-2-7b-Chat model’s single precision (FP32) inference with huggingface’s greedy search on 160 turns on the MT-Bench dataset as a baseline. With single precision inference, the outputs of Lookahead Decoding (on 1GPU, 4GPUs, and 8GPUs) are the same as the output of the baseline. With half-precision (FP16) inference, huggingface’s greedy search has 35 out of 160 (w/o FlashAttention) and 42 out of 160 (w/ FlashAttention) answers not perfectly aligned with the baseline output. In contrast, Lookahead Decoding and its integration with FlashAttention and multi-GPU inference has 35-44 results different from the baseline output under different settings. We claim this result can show that Lookahead Decoding can retain the output distribution using a greedy search within the numerical error range (not worse than huggingface’s half-precision inference). Besides, Tab. 2 also strengthens the statements for greedy search.
Generation Quality with LP and FlashAttention Augmentation is not changed. We verify that FlashAttention and LP Support will not change the compression ratio () of vanilla Lookahead Decoding. We compared each 18 generations of Lookahead Decoding w/ FlashAttention and w/o FlashAttention (7B and 13B model on MT-Bench, HumanEval, and ClassEval); the average w/ FlashAttention is 3.267 while w/o FlashAttention is 3.259, with less than 0.3% differences. We also compared 6 generations of Lookahead Decoding on a single GPU and 12 generations with LP (7B model on MT-Bench, HumanEval, and ClassEval, both with , , and ). The average on a single GPU is 2.558, while on multiple GPUs, it is 2.557, with less than 0.1% differences. We claim that our advanced support does not change .