Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Bachkaniwala, Rajveer; Luo, Chengqi; So, Richard; Mahajan, Divya; Rong, Kexin

arXiv:2604.16395 (cs)

[Submitted on 29 Mar 2026 (v1), last revised 22 Apr 2026 (this version, v2)]

Abstract: Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally, so that retrieval overlaps with inference, can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals.
We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses longest common prefix matching to minimize redundant computation when the input changes dynamically. To evaluate Stream2LLM, we collect two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that the streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, all while maintaining throughput parity with non-streaming baselines.
Code: this https URL
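
As a rough illustration of the longest common prefix matching mentioned in the abstract, the sketch below compares the token sequence whose prefill results are already cached against a newly arrived version of the input, and reports how many leading tokens could be reused. This is not the authors' implementation: the function names, the use of plain integer token IDs, and the example sequences are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def longest_common_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Return the number of leading tokens shared by both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached: Sequence[int], incoming: Sequence[int]) -> Tuple[int, List[int]]:
    """Hypothetical helper: split the incoming input into a reusable prefix
    length and the suffix of tokens that would still need prefill."""
    keep = longest_common_prefix_len(cached, incoming)
    return keep, list(incoming[keep:])


# Append-mode (assumed example): new context is added at the end,
# so the whole cached sequence is a prefix of the new input.
append_plan = plan_prefill([1, 2, 3], [1, 2, 3, 4, 5])

# Update-mode (assumed example): an earlier token changed, so everything
# from the first mismatch onward has to be recomputed.
update_plan = plan_prefill([1, 2, 3, 4], [1, 2, 9, 4])
```

In this toy setting, append-mode reuses every cached token and only the appended suffix remains, while update-mode keeps only the tokens before the first mismatch. A real serving system would typically match at the granularity of KV-cache blocks rather than individual tokens.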

Comments:	Minor revision: expanded evaluation, unified baseline naming, added code link and acknowledgments
Subjects:	Databases (cs.DB); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2604.16395 [cs.DB]
	(or arXiv:2604.16395v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2604.16395

Submission history

From: Rajveer Bachkaniwala
[v1] Sun, 29 Mar 2026 06:49:12 UTC (3,385 KB)
[v2] Wed, 22 Apr 2026 19:11:26 UTC (3,105 KB)
