PaSS: Parallel Speculative Sampling

Monea, Giovanni; Joulin, Armand; Grave, Edouard

arXiv:2311.13581 (cs)

[Submitted on 22 Nov 2023]

Abstract: Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation time, these models are used auto-regressively, requiring a forward pass for each generated token and thus reading the full set of parameters from memory. This memory access is the primary bottleneck for generation, and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for a single token. These two observations led to the development of speculative sampling, where a second, smaller model is used to draft a few tokens, which are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer, which limits its adoption. As an alternative, we propose to use parallel decoding as a way to draft multiple tokens from a single model at no additional computational cost and without the need for a second model. Our approach only requires an additional input token that marks the words that will be generated simultaneously. We show promising performance (up to $30\%$ speed-up) while requiring as few as $O(d_{emb})$ additional parameters.
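
The draft-then-validate step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the acceptance rule, so the code below assumes the rejection-sampling rule commonly used in the speculative sampling literature: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The function name `verify_draft` and the toy list-based distributions are hypothetical and chosen for illustration only; how the draft distributions are produced (a second model, or parallel decoding from the same model) does not matter for this step.

```python
import random
from typing import List, Sequence

Dist = Sequence[float]  # a probability distribution over a toy vocabulary


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Dist],
    target_probs: Sequence[Dist],
    rng: random.Random,
) -> List[int]:
    """Validate k drafted tokens against the target model's distributions.

    Assumed inputs (illustrative, not taken from the paper):
      draft_tokens: the k drafted token ids
      draft_probs:  the k distributions q_i the drafts were sampled from
      target_probs: k + 1 distributions p_i, as would be obtained from a
                    single parallel forward pass of the large model
    Returns the accepted prefix plus exactly one extra token.
    """
    output: List[int] = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p / q).
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)
            continue
        # Rejected: resample from the positive part of (p - q) and stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        output.append(rng.choices(range(len(p)), weights=residual)[0])
        return output
    # Every draft was accepted: sample one more token from the last p.
    last = target_probs[-1]
    output.append(rng.choices(range(len(last)), weights=last)[0])
    return output
```

Each call emits between 1 and k + 1 tokens for one pass of the large model, which is where a speed-up can come from when a parallel pass costs about as much as a single-token pass.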

Comments:	Accepted at the 3rd workshop on Efficient Natural Language and Speech Processing (ENLSP, NeurIPS 2023)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.13581 [cs.CL]
	(or arXiv:2311.13581v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.13581

Submission history

From: Giovanni Monea
[v1] Wed, 22 Nov 2023 18:37:27 UTC (23 KB)
