0% found this document useful (0 votes)
61 views17 pages

Fast25 Qin

Uploaded by

zqli0924
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views17 pages

Fast25 Qin

Uploaded by

zqli0924
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Mooncake: Trading More Storage for Less

Computation — A KVCache-centric Architecture for


Serving LLM Chatbot
Ruoyu Qin, Moonshot AI and Tsinghua University; Zheming Li, Weiran He, and
Jialei Cui, Moonshot AI; Feng Ren, Mingxing Zhang, Yongwei Wu, and
Weimin Zheng, Tsinghua University; Xinran Xu, Moonshot AI
https://www.usenix.org/conference/fast25/presentation/qin

This paper is included in the Proceedings of the


23rd USENIX Conference on File and Storage Technologies.
February 25–27, 2025 • Santa Clara, CA, USA
ISBN 978-1-939133-45-8

Open access to the Proceedings


of the 23rd USENIX Conference on
File and Storage Technologies
is sponsored by
M OONCAKE: Trading More Storage for Less Computation –
A KVCache-centric Architecture for Serving LLM Chatbot
Ruoyu Qin♠♡1 Zheming Li♠1 Weiran He♠ Jialei Cui♠ Feng Ren♡
Mingxing Zhang♡2 Yongwei Wu♡ Weimin Zheng♡ Xinran Xu♠2
♠ Moonshot AI ♡ Tsinghua University

Abstract Mooncake vLLM Prefix Caching


M OONCAKE is the serving platform for Kimi, an LLM chat- vLLM vLLM Chunked Prefill

Request Capacity Ratio (%)


bot service developed by Moonshot AI. This platform features 100

+59%
a KVCache-centric disaggregated architecture that not only 80

+157%
separates prefill and decoding clusters but also efficiently

+498%
utilizes the underexploited CPU, DRAM, SSD and NIC re- 60
sources of the GPU cluster to establish a disaggregated KV-
Cache. At the core of M OONCAKE is its KVCache-centric 40

Threshold III
Threshold II
Threshold I
global cache and a scheduler designed to maximize through-
20
put while adhering to stringent latency-related Service Level
Objectives (SLOs). 0
Our experiments demonstrate that M OONCAKE excels in 100 200300 500 1000
Time between Tokens (ms)
scenarios involving long-context inputs. In tests using real
traces, M OONCAKE increases the effective request capacity Figure 1: The experiment of the effective request capacity of
by 59%∼498% when compared to baseline methods, all while M OONCAKE under the real-world conversation workload and
complying with SLOs. Currently, M OONCAKE is operational different TBT SLOs. In this experiment, M OONCAKE and
across thousands of nodes, processing over 100 billion tokens three baseline systems utilize 16 8×A800 nodes each. More
daily. In practical deployments, M OONCAKE’s innovative on §5.2.
architecture enables Kimi to handle 115% and 107% more
requests on NVIDIA A800 and H800 clusters, respectively, it is necessary to decouple and restructure them into several
compared to previous systems. disaggregated resource pools, each optimized for different but
collaborative goals. For example, many other researchers [7–
9] have suggested separating prefill servers from decoding
1 Introduction servers, because these two stages of LLM serving have very
different computational characteristics.
With the rapid adoption of large language models (LLMs) in
various scenarios [1–4], the workloads for LLM serving have Further advancing this disaggregation strategy, we have en-
become significantly diversified. These workloads differ in gineered a disaggregated KVCache by pooling CPU, DRAM,
input/output length, distribution of arrival, and, most impor- SSD and RDMA resources of the GPU cluster, referred to as
tantly, demand different kinds of Service Level Objectives M OONCAKE Store. This novel architecture harnesses under-
(SLOs). As a Model as a Service (MaaS) provider, one of the utilized resources to enable efficient near-GPU prefix caching,
primary goals of Kimi [5] is to solve an optimization problem significantly enhancing the global cache capacity and inter-
with multiple complex constraints. The optimization goal is node transfer bandwidth. The resultant distributed KVCache
to maximize overall effective throughput, which directly im- system embodies the principle of trading more storage for
pacts revenue, while the constraints reflect varying levels of less computation. Thus, as demonstrated in Figure 1, it sub-
SLOs. These SLOs typically involve meeting latency-related stantially boosts Kimi’s maximum throughput capacity in
requirements, mainly the time to first token (TTFT) and the meeting the required SLOs for many important real-world
time between tokens (TBT). scenarios. Later in this paper, we will first delve into a mathe-
To achieve this goal, a prerequisite is to make the best use matical analysis of this strategy’s benefits for LLM serving
of the various kinds of resources available in the GPU cluster. and empirically assess its efficacy using real-world data (§2.2).
Specifically, although GPU servers are currently provided as Then, we will detail the design choices made in implementing
highly integrated nodes (e.g., DGX/HGX supercomputers [6]), this petabyte-level disaggregated cache, which is intercon-
nected via an RDMA network up to 8×400 Gbps (§3.2).
1 Ruoyu Qin’s part of work done as an intern at Moonshot AI, contributed

equally with Zheming Li.


Building on this idea, we also found that the scheduling
2 Corresponding to [email protected], xuxin- of KVCache is central to LLM serving, and hence propose a
[email protected]. corresponding disaggregated architecture. Figure 2 presents

USENIX Association 23rd USENIX Conference on File and Storage Technologies 155
KVCache-
Prefill Instance Prefill Instance
centric
Conductor GPU/VRAM GPU/VRAM Prefill Stage

Prefill Pool
PP/SP Optimization Goal
Local Local
Cache-aware Chunked Chunked
Prefill Prefill Prefill max Cache Reuse
Scheduler Scheduler Paged KVCache Scheduler Paged KVCache s.t.
TTFT SLO,
MFU Lower Bound,
CPU/DRAM/SSD CPU/DRAM/SSD
Mooncake Store KVCache < DRAM
Distributed KVCache Pool Distributed KVCache Pool

A
R

M
D

D
M
KVCache

R
A
Balance KVCache Transfer Engine
Scheduler

R
A

D
M

M
D

A
R
CPU/DRAM/SSD CPU/DRAM/SSD

Distributed KVCache Pool Distributed KVCache Pool


Decoding Stage
Optimization Goal
GPU/VRAM GPU/VRAM
Decoding Pool

Load-balance Paged KVCache Paged KVCache max Throughput


Decoding
Scheduler s.t.
Local Local
TBT SLO,
Scheduler Scheduler KVCache < VRAM

Decoding Instance Decoding Instance

Figure 2: M OONCAKE Architecture.

our current KVCache-centric disaggregated architecture replicates hot KVCache blocks without requiring precise pre-
for LLM serving, named M OONCAKE. For each request, the dictions of future KVCache usage. Experimental results show
global scheduler (Conductor) will select a pair of prefill and that our KVCache-centric scheduling can significantly lower
decoding instances and schedule the request in the following TTFT in real-world scenarios.
steps: 1) transfer as much reusable KVCache as possible to We will also describe the main design choices made during
the selected prefill instance; 2) complete the prefill stage in its implementation, especially those not covered in current
chunks/layers and continuously stream the output KVCache research. For example, regarding P/D disaggregation, there
to the corresponding decoding instance; 3) load the KVCache are currently debates on its feasibility in large-scale practice
and add the request to the continuous batching process at the due to bandwidth requirements and trade-offs associated with
decoding instance for generating request outputs. chunked prefill (e.g., Sarathi-Serve [10]). We demonstrate,
Although this process seems straightforward, the selection through comparison with vLLM, that with a highly optimized
policy is complex due to many restrictions. In the prefill stage, transfer engine, the communication challenges can be man-
the main objective is to reuse the KVCache as much as possi- aged, and P/D disaggregation is preferable for scenarios with
ble to avoid redundant computation. However, the distributed stringent SLO limits (§5.2). Additionally, we discuss how to
KVCache pool faces challenges in terms of both capacity and implement a separate prefill node pool that seamlessly han-
access latency. Thus Conductor is responsible for scheduling dles the dynamic distribution of context length. We employ a
requests with KVCache-awareness and executing scheduling chunked pipeline parallelism (CPP) mechanism to scale the
operations such as swapping and replication accordingly. The processing of a single request across multiple nodes, which
hottest blocks should be replicated to multiple nodes to avoid is necessary for reducing the TTFT of long-context inputs.
fetching congestion, while the coldest ones should be swapped Compared to traditional sequence parallelism (SP) based so-
out to reduce reserving costs. In contrast, the decoding stage lutions, CPP reduces network consumption and simplifies the
has different optimization goals and constraints. The aim is to reliance on frequent elastic scaling (§3.3).
aggregate as many tokens as possible in a decoding batch to M OONCAKE is currently the serving platform of Kimi and
improve the Model FLOPs Utilization (MFU). However, this has successfully handled exponential workload growth (more
objective is restricted not only by the TBT SLO but also by than 100 billion tokens a day). According to our historical
the total size of aggregated KVCache that can be contained statistics, the innovative architecture of M OONCAKE enables
in the VRAM. Kimi to handle 115% and 107% more requests on the A800
In §4, we will detail our KVCache-centric request schedul- and H800 clusters, respectively, compared to previous sys-
ing algorithm, which balances instance loads and user expe- tems.
rience as measured by TTFT and TBT SLOs. This includes To ensure the reproducibility of our results while safe-
a heuristic-based automated hotspot migration scheme that guarding proprietary information, we also provide detailed

156 23rd USENIX Conference on File and Storage Technologies USENIX Association
experimental outcomes using a dummy model mirroring the Notation Description Value
architecture of LLaMA3-70B, based on replayed traces of l Num layers 80
actual workloads. These traces, along with the KVCache d Model dimension 8192
transfer infrastructure of M OONCAKE, are open-sourced at a, b Constant coefficients in Equation 1 4, 22
https://github.com/kvcache-ai/Mooncake. gqa Num q heads / Num kv heads 8
In end-to-end experiments using public datasets and real s Tensor element size 2 B (BFloat16)
G GPU computation throughput 8×312 TFLOPS
workloads, M OONCAKE excels in long-context scenarios.
Bh2d Host-to-device bandwidth 128 GB/s
Compared to the baseline method, M OONCAKE can achieve
Bnic NIC bandwidth 800 Gbps
up to a 498% increase in the effective request capacity while n, p Prompt and matched prefix length, respectively
meeting SLOs. In §5.3, we compare M OONCAKE Store with
the local cache design and find that the global cache design of Table 1: Notations and parameters. Model and machine pa-
M OONCAKE Store significantly improves the cache hit rate. rameters are set according to LLaMA3-70B and 8×A800.
In our experiments, the cache hit rate is up to 2.36× higher cessive token generations for the same request, referred to as
than that of the local cache, resulting in up to 48% savings the time between tokens (TBT).
in prefill computation time. M OONCAKE, to the best of our In real deployments, if the monitor detects unmet SLOs, we
knowledge, is the first system to demonstrate the significant need to either add inference resources or reject some incoming
benefits of using a distributed KVCache pool to share KV- requests. However, due to the current contingent supply of
Cache across different chat sessions and queries in large-scale GPUs, elastically scaling out the inference cluster is typically
deployment scenarios. We also evaluate the performance of unfeasible. Therefore, we proactively reject requests that are
the transfer engine that supports high-speed RDMA transfers predicted not to meet the SLOs to alleviate the cluster’s load.
in M OONCAKE, which shows it is approximately 2.4× and Our main objective is to maximize overall throughput while
4.6× faster than existing solutions (§5.4). adhering to SLOs, a concept referred to as goodput in other
research [8, 15].
2 Preliminary and Problem Definition
2.2 More Storage for Less Computation
2.1 Service Level Objectives of LLM Serving
To meet the stringent SLOs described above, a commonly
Modern large language models (LLMs) are based on the adopted solution is to cache previously generated KVCache
Transformer architecture, which utilizes attention mecha- and reuse it upon finding a prefix match. However, existing
nisms and multilayer perceptrons (MLP) to process input. approaches [16–18] typically restrict caching to local HBM
Popular Transformer-based models, such as GPT [11] and and DRAM, assuming that the transfer bandwidth required for
LLaMA [12], employ a decoder-only structure. Each infer- global scheduling would be prohibitively high. But, as we will
ence request is logically divided into two stages: the prefill described later in §5.3, the capacity of local DRAM supports
stage and the decoding stage. only up to 50% of the theoretical cache hit rate, making the
During the prefill stage, all input tokens are processed in design of a global cache essential. In this section, we present
parallel, and hence it is typically computationally intensive. a mathematical analysis of the actual bandwidth necessary to
This stage generates the first output token while storing inter- benefit from this strategy, explaining why distributed caching
mediate results of computed keys and values, referred to as is advantageous, especially for larger models like LLaMA3-
the KVCache. The decoding stage then uses this KVCache 70B. More experimental results will be given later in §5.4.2.
to autoregressively generate new tokens. It processes only We base our analysis on the model using notations de-
one token at a time per batch due to the limitation of autore- scribed in Table 1 and incorporate specific parameters of
gressive generation, which makes it memory-constrained and LLaMA3-70B. Essentially, current popular LLMs are au-
causes computation time to increase sublinearly with batch toregressive language models where each token’s KVCache
size. Thus, a widely used optimization in the decoding stage depends only on itself and preceding tokens. Therefore,
is continuous batching [13, 14]. Before each iteration, the KVCache corresponding to the same input prefix can be
scheduler checks the status and adds newly arrived requests reused without affecting output accuracy. If a current request’s
to the batch while removing completed requests. prompt of length n shares a common prefix of length p with
Due to the distinct characteristics of the prefill and decod- previously cached KVCache, its prefill process can be opti-
ing stages, MaaS providers set different metrics to measure mized as follows:
their corresponding Service Level Objectives (SLOs). Specif- q[p : n], k[p : n], v[p : n] = MLP(hidden[p : n])
ically, the prefill stage is mainly concerned with the latency
between the request arrival and the generation of the first to- k[1 : n], v[1 : n] ← KVCache + (k[p : n], v[p : n])
ken, known as the time to first token (TTFT). On the other o[p : n] = Attention(q[p : n], k[1 : n], v[1 : n])
hand, the decoding stage focuses on the latency between suc- KVCache ← (k[1 : n], v[1 : n])

USENIX Association 23rd USENIX Conference on File and Storage Technologies 157
Given input length n, the FLOPS of the prefill stage can be Prefill Instance Decoding Instance
calculated as: GPU
s2: Incremental Prefill
GPU s4: Decoding

2 2 (s1): KVCache Reuse


f lops(n) = l × (an d + bnd ) (1) Incremental
Prefix KVCache Full KVCache
KVCache

Thus, reusing KVCache approximately reduces the compu-


tation cost of prefill by l × (ap2 d + bpd 2 ). However, this re- CPU
Layer-wise
CPU
Async Load†
quires transferring the cached KVCache into the prefill GPU’s Load and Store*

HBM, with a size of p×l ×(2×d/gqa)×s. Assuming the av- Prefix KVCache
Incremental
Full KVCache
KVCache
erage computation throughput is G and the average KVCache
loading speed is B (where B is determined by the minimum s3: KVCache Transfer

of Bh2d and Bnic ), the reuse of KVCache is beneficial in terms


of TTFT if: Figure 3: Workflow of inference instances.

B 2ds skipped if no prefix cache exists. This selection balances three


> (2) objectives: reusing as much KVCache as possible, balancing
G gqa × (apd + bd 2 )
the workloads of different prefill nodes, and guaranteeing the
In such scenarios, reusing the KVCache not only reduces
TTFT SLO. It leads to a KVCache-centric scheduling that
GPU time and costs but also enhances the user experience
will be further discussed in §4.
by improving TTFT. The criteria for bandwidth B relative to
2) Incremental Prefill: The prefill node completes the prefill
computation throughput G are more readily met with larger
stage using prefix cache and stores the newly generated in-
values of d, which is proportional to the model size. For exam-
cremental KVCache back into CPU memory. If the number
ple, when running LLaMA3-70B on a machine with 8×A800
of uncached input tokens exceeds a certain threshold, the
GPUs and assuming a prefix length of 8192, Equation 2 yields
prefill stage is split into multiple chunks and executed in a
a minimum required B of 6 GB/s. The requirement for B will
pipeline manner. This threshold is selected to fully utilize the
be enlarged to 19 GB/s for an 8×H800 machine. Moreover,
corresponding GPU’s computational power and is typically
in practical scenarios, because the transfer stages cannot be
larger than 1000 tokens. The reason for using chunked but
perfectly overlapped with each other, the actual bandwidth
still disaggregated prefill nodes is explained in §3.3.
requirement is even higher. However, as we will demonstrate
3) KVCache Transfer: M OONCAKE Store is deployed in each
in §5.4.2, a fully utilized 100 Gbps NIC per NVIDIA A800
node to manage and transfer these caches. This step is asyn-
HGX network is sufficient to meet these criteria.
chronously executed and overlapped with the above incre-
mental prefill step, streaming the KVCache generated by each
3 Design of M OONCAKE model layer to the destination decoding node’s CPU memory
to reduce waiting time.
3.1 Overview 4) Decoding: After all the KVCache is received in the CPU
As depicted in Figure 2, M OONCAKE employs a disaggre- memory of the decoding node, the request joins the next batch
gated architecture that not only separates prefill from decod- in a continuous batching manner. The decoding node is pre-
ing nodes but also groups the CPU, DRAM, SSD, and RDMA selected by Conductor based on its current load to ensure it
resources of the GPU cluster to implement a disaggregated does not violate the TBT SLO.
KVCache. To schedule all these disaggregated components, at
its center, M OONCAKE implements a global scheduler named 3.2 M OONCAKE Store: Cache of KVCache
Conductor. Conductor is responsible for dispatching requests
based on the current distribution of the KVCache and work- Central to M OONCAKE is its efficient implementation of a
load characteristics. M OONCAKE Store, detailed in §3.2, distributed global cache of KVCache, referred to as M OON -
manages the storage and transfer of these KVCache blocks. CAKE Store. As described in §2.2, reusing cached KVCache
Specifically, Figure 3 demonstrates the typical workflow not only cuts computation costs but also improves user experi-
of a request. Once tokenizing is finished, the conductor se- ence by reducing the TTFT, particularly when the aggregated
lects a pair of prefill nodes and a decoding node, and starts a bandwidth is fully utilized. However, achieving full utiliza-
workflow comprising four steps: tion is challenging because the bandwidth can reach up to
1) KVCache Reuse: The selected prefill node (group) receives 8×400 Gbps, comparable to DRAM bandwidth.
a request that includes the raw input, the block keys of the We first introduce how M OONCAKE Store manages KV-
prefix cache that can be reused, and the block keys of the Cache in §3.2.1, including its storage scheme and eviction
full cache allocated to the request. It loads the prefix cache policy. In §3.2.2, we describe the object-based APIs and mem-
from remote CPU memory into GPU memory based on the ory transfer APIs of M OONCAKE Store. In §3.2.3, we will
prefix cache block keys to bootstrap the request. This step is detail the design of M OONCAKE Store’s transfer engine, a

158 23rd USENIX Conference on File and Storage Technologies USENIX Association
high-performance, zero-copy KVCache transfer system de- 3.2.3 Transfer Engine
signed to maximize the benefits of using multiple RDMA
NICs per machine. It enhances execution efficiency and relia- To efficiently implement the above APIs, a transfer engine
bility through techniques such as topology-aware path selec- is designed to achieve several key objectives: 1) Effectively
tion and endpoint pooling. distribute transfer tasks across multiple RDMA NIC devices;
2) Abstract the complexities of RDMA connection manage-
ment from the APIs; and 3) Appropriately handle temporary
3.2.1 KVCache Management network failures. This transfer engine has been meticulously
In M OONCAKE Store, all KVCache is stored as paged blocks engineered to fulfill each of these goals.
within a distributed cache pool. The block size, i.e., the num- Network Setup The benefits of M OONCAKE rely on a
ber of tokens contained in each block, is determined by the high-bandwidth network interconnect. Currently, we use stan-
model size and the optimal network transmission size, typi- dard HGX machines where each A800 GPU is paired with
cally ranging from 16 to 512 tokens. Each block is attached a 100/200 Gbps NIC, and each H800 GPU is paired with a
with a hash key determined by both its own hash and its prefix 200/400 Gbps NIC, which is comparable to memory band-
for deduplication. The same hash key may have multiple repli- width and existing libraries (other than NCCL) fail to fully
cas across different nodes to mitigate hot-cache access latency, utilize this capacity. As for NCCL, it cannot gracefully handle
controlled by our cache-load-balancing policy described in dynamic topology changes due to the addition or removal of
§4.2. nodes/NICs and does not support DRAM-to-DRAM pathes.
M OONCAKE Store allocates space for each cache block In contrast, the transfer engine endeavors to find alternative
in the cache pool and logs metadata such as the block key paths upon failure.
and its address. When the cache pool is full, M OONCAKE To address congestion, the network utilizes RoCEv2 tuned
Store employs a LRU (Least Recently Used) strategy to evict by cloud providers. In the scheduler, we mitigate congestion
an existing cache block—unless the block is currently being by increasing the number of replicas for hot KVCaches (§4.2).
accessed by an ongoing request—and overwrites the evicted Topology-aware path selection. Modern inference servers
block’s space with the new block. often consist of multiple CPU sockets, DRAM, GPUs, and
RDMA NIC devices. Although it’s technically possible to
3.2.2 Interface transfer data from local DRAM or VRAM to a remote location
using any RDMA NIC, these transfers can be limited by the
At the higher layer, M OONCAKE Store offers object-based
bandwidth constraints of the Ultra Path Interconnect (UPI)
APIs such as put, get, and change_replica. These facil-
or PCIe Switch. To overcome these limitations, M OONCAKE
itate the caching of KVCache in a disaggregated manner,
Store implements a topology-aware path selection algorithm.
organizing mini blocks of KVCache as memory objects and
Before processing requests, each server generates a topol-
enabling Conductor to adjust the number of replicas for each
ogy matrix and broadcasts it across the cluster. This matrix
KVCache block to achieve higher bandwidth aggregation.
categorizes network interface cards (NICs) into "preferred"
These functions are supported by a set of synchronous batch
and "secondary" lists for various types of memory, which
transfer APIs, detailed in Listings 1.
types are specified during memory registration. Under normal
Transfer operations are available for both DRAM and
conditions, a NIC from the preferred list is selected for trans-
GPU VRAM and will utilize GPU Direct RDMA when op-
fers, facilitating RDMA operations within the local NUMA
timal, provided that the specified memory region has been
or GPU Direct RDMA through the local PCIe switch only. In
pre-registered. The completion of these operations can be
case of failures, NICs from both lists may be utilized. The pro-
monitored asynchronously via the getTransferStatus API,
cess involves identifying the appropriate local and target NICs
which reports whether transfers are ongoing or have encoun-
based on the memory addresses, establishing a connection,
tered errors.
and executing the data transfer.
For instance, as illustrated in Figure 4, to transfer data from
Listing 1: Memory transfer APIs in M OONCAKE Store. buffer 0 (assigned to cpu:0) in the local node to buffer 1 (as-
int registerLocalMemory(void *vaddr, size_t len, signed to cpu:1) in the target node, the engine first identifies
const string &type); the preferred NICs for cpu:0 using the local server’s topol-
BatchID allocateBatchID(size_t batch_size); ogy matrix and selects one, such as mlx5_1, as the local NIC.
int submitTransfer(BatchID batch_id, Similarly, the target NIC, such as mlx5_3, is selected based on
const vector<Request> &entries); the target memory address. This setup enables establishing an
int getTransferStatus(BatchID batch_id,
RDMA connection from mlx5_1@local to mlx5_3@target
int request_index,
Status &status);
to carry out RDMA read and write operations.
int freeBatchID(BatchID batch_id); To further maximize bandwidth utilization, a single re-
quest’s transfer is internally divided into multiple slices at a

USENIX Association 23rd USENIX Conference on File and Storage Technologies 159
mlx5_0 mlx5_1 cuda:0 cuda:1 mlx5_2 mlx5_3 cuda:2 cuda:3

Virtual Address Space


Registered PCIe Switch PCIe Switch

Memory Controller
Memory Controller
Buffer
Buffer 0 at Socket 0

cpu:1
DRAM (cpu:0) Interconnect

cpu:0
BatchTransfer CPU 0 CPU 1
Buffer 1 at Socket 1 – Read/Write 16 GT/s
DRAM (cpu:1)
Buffer 1 at Socket 1
… DRAM (cpu:1)
Topology matrix Preferred NICs
Buffer 0 at GPU 0
VRAM (cuda:0) {
"cpu:0":[["mlx5_0","mlx5_1"],["mlx5_2","mlx5_3"]],
"cpu:1":[["mlx5_2","mlx5_3"],["mlx5_0","mlx5_1"]],
"cuda:0":[["mlx5_0"],["mlx5_1","mlx5_2","mlx5_3"]],
...
} Secondary NICs
a) Batch transfer interface b) Topology-aware path selection

Figure 4: Transfer engine of M OONCAKE Store.

granularity of 16 KB. Each slice might use a different path, is primarily driven by the fact that online services typically
enabling collaborative work among all RDMA NICs. have more stringent SLOs. While chunked prefill reduces de-
coding interference, it remains challenging to simultaneously
Endpoint management. M OONCAKE Store employs a pair
maximize MFU during the prefill stage and meet the TBT
of endpoints to represent the connection between a local
SLO during the decoding stage. We will demonstrate this in
RDMA NIC and a remote RDMA NIC. In practice, each
the end-to-end experiments in §5.2. Another important rea-
endpoint includes one or more RDMA queue pair objects.
son is that we think prefill nodes require different cross-node
Connections in M OONCAKE Store are established in an on
parallelism settings to handle long contexts as the available
demand manner; endpoints remain unpaired until the first
context length of recent LLMs is increasing rapidly, from 8k
request is made.
to 128k and even up to 1 million tokens [20]. Typically, for
To prevent a large number of endpoints from slowing down
such long context requests, the input tokens can be 10 to 100
request processing, M OONCAKE Store employs endpoint
times larger than the output tokens, making optimizing the
pooling, which caps the maximum number of active connec-
TTFT crucial. Due to the abundant parallelism in long con-
tions. We use the SIEVE [19] algorithm to manage endpoint
text prefill, using more than a single 8×GPU node to process
eviction. If a connection fails due to link errors, it is removed
them in parallel is desirable. However, extending tensor paral-
from the endpoint pools on both sides and re-established dur-
lelism (TP) across more than one node requires two expensive
ing the next data transfer attempt.
RDMA-based all-reduce operations per layer, significantly
Failure handing. In a multi-NIC environment, one com- reducing the MFU of prefill nodes.
mon failure scenario is the temporary unavailability of a spe- Recently, many works have proposed sequence parallelism
cific NIC, while other routes may still connect two nodes. (SP) [21–27]. SP partitions the input sequences of requests
M OONCAKE Store is designed to adeptly manage such tem- across different nodes to achieve acceleration, allowing even
porary failures effectively. If a connection is identified as long requests to meet the TTFT SLO. However, when ap-
unavailable, M OONCAKE Store automatically identifies an plied to shorter input requests, SP results in a lower MFU
alternative, reachable path and resubmits the request to a dif- compared to using single-node TP only. Recent research [15]
ferent RDMA NIC device. Furthermore, M OONCAKE Store proposes elastic sequence parallelism to dynamically scale up
is capable of detecting problems with other RDMA resources, or down the SP group. Although possible, this adds complex-
including RDMA contexts and completion queues. It tem- ity to our architecture. Additionally, SP still requires frequent
porarily avoids using these resources until the issue, such as a cross-node communication, which lowers the MFU and com-
downed link, is resolved. petes with network resources for transferring KVCache across
nodes.
3.3 M OONCAKE’s Prefill Pool To address this, M OONCAKE leverages the autoregressive
property of decoder-only transformers and implements chun-
Unlike the inviolable decoding nodes, the necessity and ked pipeline parallelism (CPP) for long context prefill. We
best practices for designing a separate and elastic prefill group every X nodes in the prefill cluster into a pipelined
pool remain under debate. For example, although many re- prefill node group. For each request, its input tokens are par-
searchers [7–9] share our intuition to use a disaggregated titioned into chunks, each no longer than the pre f ill_chunk.
architecture, it is worth discussing whether this separation is Different chunks of the same request can be processed simul-
still necessary with the introduction of chunked prefill [10]. taneously by different nodes, thus parallelizing the processing
However, after careful consideration, we decided to main- and reducing TTFT.
tain M OONCAKE’s disaggregated architecture. This decision CPP offers two main benefits: 1) Similar to pipeline par-

160 23rd USENIX Conference on File and Storage Technologies USENIX Association
allelism in training, it requires cross-node communication
20
only at the boundaries of each pipeline stage, which can be 19.65
Average
easily overlapped with computation. This leads to better MFU 15

TTFT (s)
and less network resource contention with KVCache trans-
fer. 2) It naturally fits both short and long contexts, bringing 10
no significant overhead for short context prefill and avoid- 5.27
ing frequent dynamic adjustment of node partitioning. This 5 3.07 3.58

pipeline-based acceleration method has been explored in train-


0
Globa L L Rando
ing systems [28], but to our knowledge, this is the first appli- l Cach ocal Cache oad balanc m
e Awa Aware ing
cation in the inference stage, as long context inference has re
only recently emerged.
Figure 5: The prefill scheduling experiment.

TTFTs are computed in parallel, rendering the processing


4 Scheduling time negligible compared to the inference time.
4.1 Prefill Global Scheduling More difficulty lies in predicting the transfer time because
it is determined not only by the size of the transferred data
Previous research on LLM serving typically uses a load- but also by the current network status, especially whether the
balancing strategy that evaluates the load on each instance sending node is under congestion. This also necessitates the
based on the number of assigned requests. In M OONCAKE, replication of hot KVCache blocks, which will be discussed
however, the selection of prefill instances considers additional in §4.2.
factors—not just load but also the prefix cache hit length and
the distribution of reusable KVCache blocks. While there is a
preference to route requests to prefill instances with longer
4.2 Cache Load Balancing
prefix cache lengths to reduce computation costs, it may be In M OONCAKE, each prefill instance has its own set of local
beneficial to schedule them to other nodes to ensure over- prefix caches. The usage frequency of these caches varies
all system balance and meet TTFT SLOs. To address these significantly. For example, system prompts are accessed by
complexities, we propose a cache-aware global scheduling almost every request, whereas caches storing content from
algorithm that accounts for both the prefill time due to the a local long document may be used by only one user. As
prefix cache and the local queuing time. discussed in §4.1, Conductor’s role is crucial in achieving an
Algorithm 1 details the mechanism for our KVCache- optimal balance between cache matching and instance load.
centric prefill scheduling. For every new request, block keys Thus, from the perspective of the distributed cache system,
are then compared one by one against each prefill instance’s load balancing also plays an important role. Specifically, it
cache keys to identify the prefix match length (pre f ix_len) involves strategizing on how to back up caches to ensure that
With this matching information, Conductor estimates the cor- global prefill scheduling can achieve both high cache hits and
responding execution time based on the request length and low load.
pre f ix_len (which varies by instance), using a polynomial A straw-man solution to this KVCache scheduling problem
regression model fitted with offline data. It then adds the could be collecting the global usages of each block, using a
estimated waiting time for that request to get the TTFT on prediction model to forecast their future usages, and making
that instance. Finally, Conductor assigns the request to the scheduling decisions accordingly. However, unlike the estima-
instance with the shortest TTFT and updates the cache and tion of prefill time, workloads are highly dynamic and change
queue times for that instance accordingly. If the SLO is not significantly over time. Especially for a MaaS provider ex-
achievable, Conductor directly returns the HTTP 429 Too periencing rapid growth in its user base, it is impossible to
Many Requests response status code to the upper layers. accurately predict future usages. Thus, we propose a heuristic-
The backbone of this scheduling framework is straightfor- based automated hotspot migration scheme to enhance cache
ward, but complexities are hidden in the engineering imple- load balancing.
mentation of various components. For example, to predict the As previously noted, requests may not always be directed
computation time of the prefill stage for a request, we employ to the prefill instance with the longest prefix cache length due
a predictive model derived from offline test data. This model to high instance load. In such cases, Conductor forwards the
estimates the prefill duration based on the request’s length cache’s location and the request to an alternative instance if
and prefix cache hit length. Thanks to the regular computation the estimated additional prefill time is shorter than the trans-
pattern of Transformers, the error bound of this prediction is fer time. This instance proactively retrieves the KVCache
small as long as enough offline data is available. The queu- from the holder and stores it locally. More importantly, we
ing time for a request is calculated by aggregating the prefill prefer to compute the input tokens if the best remote prefix
times of all queued requests. In practical implementations, match length is no larger than the current local reusable prefix

USENIX Association 23rd USENIX Conference on File and Storage Technologies 161
Algorithm 1 KVCache-centric Scheduling Algorithm 5 Evaluation
Input: prefill instance pool P, decoding instance pool D, request R,
cache block size B. As described before, according to historical statistics of Kimi,
Output: the prefill and decoding instances (p, d) to process R. M OONCAKE enables Kimi to handle 115% and 107% more
1: block_keys ← PrefixHash(R.prompt_tokens, B) requests on the A800 and H800 clusters, respectively, com-
2: TTFT, p ← inf, 0/ pared to our previous systems based on vLLM. To further
3: best_len, best_instance ← FindBestPrefixMatch(P, block_keys)
validate this results and ensure reproducibility, in this section,
4: for instance ∈ P do
best_len we conduct a series of end-to-end and ablation experiments on
5: if instance.prefix_len > kvcache_balancing_threshold then
6: prefix_len ← best_len
M OONCAKE with a dummy LLaMA3-70B model to address
7: transfer_len ← best_len − instance.prefix_len the following questions: 1) Does M OONCAKE outperform
8: Ttransfer ← EstimateKVCacheTransferTime(transfer_len) existing LLM inference systems in real-world scenarios? 2)
9: else Compared to conventional prefix caching methods, does the
10: prefix_len ← instance.prefix_len design of M OONCAKE Store significantly improve M OON -
11: Ttransfer ← 0 CAKE ’s performance?
12: Tqueue ← EstimatePrefillQueueTime(instance)
13: Tprefill ← EstimatePrefillExecutionTime(
len(R.prompt_tokens), prefix_len) 5.1 Setup
14: if TTFT > Ttransfer + Tqueue + Tprefill then
15: TTFT ← Ttransfer + Tqueue + Tprefill Testbed. During the reproducing experiments, the system
16: p ← instance was deployed on a high-performance computing node cluster
17: d, TBT ← SelectDecodingInstance(D) to evaluate its performance. Each node in the cluster is con-
18: if TTFT > TTFT_SLO or TBT > TBT_SLO then figured with eight NVIDIA-A800-SXM4-80GB GPUs and
19: reject R; return four 200 Gbps RDMA NICs. The KVCache block size in
best_len
20: if p.prefix_len > kvcache_balancing_threshold then M OONCAKE Store is set to 256. For deploying M OONCAKE,
21: TransferKVCache(best_instance, p) each node operates as either a prefill instance or a decoding
22: return (p, d) instance based on the startup parameters. For deploying other
systems, each node hosts a single instance.
Metric. Specifically, we measure the TTFT and TBT of
multiplied by a threshold1 Both strategies not only reduce each request, where the TBT is calculated as the average of
the prefill time for requests but also facilitate the automatic the longest 10% of the token arrival intervals. As mentioned
replication of hotspot caches, allowing for their broader dis- in §2, the threshold for TTFT is set to 30 s, and TBT thresh-
tribution across multiple instances. olds are set to 100 ms, 200 ms, and 300 ms, depending on
the scenario. We consider requests with both TTFT and TBT
To validate the effectiveness of our strategy, we conduct a below their respective thresholds as effective requests, and
scheduling experiment that compares random scheduling and the proportion of effective requests among all requests as the
load-balancing scheduling with our strategy. We further com- effective request capacity. For brevity, the subsequent experi-
pare the local cache-aware scheduling described in §4.1 and ments not mentioning TTFT are assumed to meet the TTFT
the global cache-aware scheduling described in this section threshold. To more intricately compare the caching perfor-
that considers cache load balancing. In random scheduling, mance, we also measure the GPU time during the prefill stage
a prefill instance is selected arbitrarily for each request. In and the cache hit rate for each request.
load-balancing scheduling, the instance with the lightest load
Baseline. We employ vLLM [14], one of the state-of-the-art
is chosen. Specifically, we build a M OONCAKE cluster con-
open-source LLM serving systems, as our experimental base-
sisting of 16 8×A800 nodes, and replay the conversation trace
line. vLLM features continuous batching and PagedAttention
detailed in §5.2.1 for the experiment. We assess the perfor-
technologies, significantly enhancing inference throughput.
mance of each scheduling algorithm based on the TTFTs. The
Despite its strengths, vLLM’s architecture, which couples the
experimental results, depicted in Figure 5, demonstrate that
prefill and decoding stages, can disrupt decoding especially in
our KVCache-centric scheduling algorithms outperform ran-
scenarios involving long contexts. Recent updates to vLLM
dom and load-balancing scheduling. By incorporating cache
have integrated features like prefix caching and chunked pre-
load balancing, the global cache-aware algorithm reduces the
fill to improve performance metrics in long-context scenarios,
average TTFT by an additional 14% compared to the local
such as TTFT and TBT. In our experiments, we also compare
cache-aware algorithm.
these features of vLLM. In our experiments, we utilize the
latest release (v0.5.1) of vLLM. Due to limitations in the
1 This
threshold is currently adjusted manually but can be adaptively current implementation, we test the prefix cache and chunked
adjusted by an algorithm in the future. prefill features of this version separately.

162 23rd USENIX Conference on File and Storage Technologies USENIX Association
Conversation Tool&Agent Synthetic Mooncake vLLM Prefix Caching
Avg Input Len 12035 8596 15325 vLLM vLLM Chunked Prefill

Request Capacity Ratio (%)


Avg Output Len 343 182 149 100

+22%
+42%
Cache Ratio 40% 59% 66%

+64%
Arrival Pattern Timestamp Timestamp Poisson 80
Num Requests 12031 23608 3993
60
Table 2: Workload Statistics.
40

Threshold III
Threshold II
Threshold I
5.2 End-to-end Performance
20
In our end-to-end experiments, we evaluate the request han-
dling capabilities of M OONCAKE and baseline systems under 0
various workloads. Specifically, we measure the maximum 100 200300 500 1000
Time between Tokens (ms)
throughput that remains within the defined SLO thresholds.
We employ three types of workloads in our tests: two real- Figure 6: The experiment of the effective request capacity of
world traces sampled from Kimi that represent online con- M OONCAKE under the tool&agent workload.
versations and tool&agent interactions, respectively, and a
synthetic workload to cover different inference scenarios. We ing datasets: ShareGPT [32], Leval [29], and LooGLE [30].
will first describe the unique characteristics of these work- ShareGPT comprises multi-turn conversations with short in-
loads and then discuss the results. Lastly, we analyze the GPU put lengths. Leval serves as a benchmark for evaluating model
computation time during the prefill stage, further demonstrat- performance over long contexts, simulating scenarios where
ing the advantages of M OONCAKE Store in enhancing cache requests involve lengthy system prompts typical of tool and
utilization and reducing computation costs. agent interactions. LooGLE is tailored for long-context QA
and summarization tasks, featuring input lengths of up to 100k
tokens and including both multi-turn QA and single-turn sum-
5.2.1 Workload
marizations, making it well-suited for long text summarization
Conversation workload. Chatbots [1, 5] represent one of and QA scenarios. Overall, the synthetic workload has the
the most prevalent applications of LLMs, making conversa- longest average input length. Despite having the highest pro-
tional requests a highly representative workload for LLM in- portion of prefix caching, its cache hits are quite dispersed,
ference. As shown in Table 2, the conversation workload con- thus requiring a substantial cache capacity.
tains a significant portion of long-context requests—reaching During preprocessing, each conversation turn was mapped
up to 128k tokens and averaging around 12k tokens—which is into a separate request, incorporating both the input and out-
comparable to the data lengths found in current long-context puts from previous interactions. For datasets featuring mul-
datasets [29, 30]. Moreover, the workload has an average of tiple questions with the same lengthy prompt, each question
approximately 40% prefix caching ratio brought about by and its preceding prompt were treated as a single request. We
multi-turn conversations. We sampled 1 hour of conversation combined the processed datasets in a 1:1:1 ratio, preserving
traces from an online inference cluster, where each record the sequential relationships within the multi-turn dialogue
includes the input and output lengths along with timestamps requests while randomly shuffling them. Since the datasets do
of arrival. Requests are dispatched according to these times- not specify arrival times, we simulated realistic conditions by
tamps and are preemptively terminated once the model output dispatching requests at a defined rate using a Poisson process.
reaches the predetermined length.
Tool&Agent workload. Recent studies [31] involving 5.2.2 Effective Request Capacity
LLMs deployed as tools or agents to perform tasks have been
increasing. These tasks are typically characterized by the in- To assess the maximum number of requests that can adhere
corporation of pre-designed, often lengthy, system prompts to the SLOs under different workloads, we test four system
that are fully repetitive. We collected traces of the tool&agent configurations: M OONCAKE, vLLM, vLLM with the prefix
workload, also sampled over a 1-hour period. As indicated caching feature, and vLLM with the chunked prefill feature,
in Table 2, this workload exhibits a high proportion of prefix each utilizing 16 nodes.
caching, with shorter input and output lengths. Conversation workload. The results for this workload are
Synthetic workload. The synthetic workload was con- presented in Figure 1. This workload, characterized by vary-
structed from a combination of publicly available datasets. ing input lengths and longer output lengths, causes significant
We categorized the requests in the real trace into three types: fluctuations in TBT for the vLLM system due to the lengthy
short conversations, tool and agent calls, and long text sum- contexts in the prefill stage. While chunked prefill reduces
marization and QA. For each category, we selected the follow- decoding interference, balancing the enhancement of MFU

USENIX Association 23rd USENIX Conference on File and Storage Technologies 163
Mooncake vLLM Prefix Caching
Mooncake vLLM Prefix Caching
vLLM vLLM Chunked Prefill
vLLM vLLM Chunked Prefill
Request Capacity Ratio (%)

3x
100

Prefill GPU Time (s)

.3
+28%

x 3
+40%

x
x

76
90
+62%

59
2.
.
2

x 1
x
80

2.
56
43

x
1.

68
1.

2.
x
12
60

2.

x
1

40
1.
40

Threshold III
Threshold II
Threshold I

0
Conversation Tool&Agent Synthetic
20
Figure 8: Average GPU time of each request during the prefill
0 stage under different workloads.
100 200300 500 1000
Time between Tokens (ms)
Figure 7: The experiment of the effective request capacity of longest prefill GPU time due to its longer input lengths and
M OONCAKE under the synthetic workload. lower prefix cache ratio. The synthetic workload, featuring
the highest prefix cache ratio and dispersed cache hotspots,
achieves optimal cache hit rates within M OONCAKE’s global
in the prefill stage with the TBT constraints in the decoding
cache pool. Consequently, despite having the longest aver-
stage remains challenging. Despite meeting the TTFT SLO,
age input lengths, it requires less prefill GPU time than the
its effective request capacity is still suboptimal. Compared to
conversation workload. Finally, the tool&agent workload ex-
vLLM, M OONCAKE achieves a very significant increase in
hibits the shortest prefill GPU time because it has the shortest
effective request capacity.
average input length and a relatively high prefix cache ratio.
Tool&Agent workload. In contrast, the tool&agent work-
Across different systems, M OONCAKE significantly re-
load has a high proportion of prefix caching and shorter output
duces GPU time by fully utilizing global cache for prefix
lengths, favoring the vLLM system as the short prefill time
caching, achieving reductions of 36%, 53%, and 64% for con-
minimally impacts output. However, as illustrated in Figure 6,
versation, tool&agent, and synthetic workloads, respectively,
vLLM and vLLM with chunked prefill experience more severe
compared to vLLM. vLLM featuring prefix caching uses lo-
disruptions in decoding due to longer prefill processing times,
cal cache stored on HBM, where the cache capacity is far
resulting in a lower effective caching capacity than vLLM
lower than that of M OONCAKE. Its prefill GPU time is 1.43×
with prefix caching. M OONCAKE uses a global cache pool
and 1.40× higher than M OONCAKE for the conversation and
to significantly increase caching capacity and optimize cache
tool&agent workloads, respectively. However, in the synthetic
utilization through internode transfers, excelling in scenarios
workload, where cache hotspots are more dispersed, the prefill
with high prefix caching. As a result, it enhances effective
GPU time of vLLM with prefix caching is nearly equivalent
caching capacity by 42% compared to vLLM with prefix
to vLLM, and is 2.59× that of M OONCAKE. vLLM with
caching under the 200 ms threshold.
chunked prefill sacrifices some prefill efficiency to maintain
Synthetic workload. The synthetic workload features the lower TBT during the decoding stage, resulting in the longest
longest average input lengths and dispersed cache hotspots prefill GPU times, which are 1.90×, 2.68×, and 3.33× that
which leads to poor cache utilization under smaller cache ca- of M OONCAKE for the three workloads.
pacities. As depicted in Figure 7, most requests processed by
M OONCAKE maintain a TBT within 100 ms, whereas about
20% of requests handled by vLLM exceed 300 ms. The per- 5.3 M OONCAKE Store
formance of systems with prefix caching and chunked prefill To address Question 2, we examine the effects of M OON -
is similar to vLLM, as they fail to mitigate the impact of long CAKE Store’s global cache pool on system performance. Our
contexts on the decoding stage. Compared to vLLM, M OON - analysis reveals that although using local DRAM to construct
CAKE increases effective request capacity by 40% under the
KVCache memory increases cache capacity than HBM only,
200 ms threshold. restricting the cache to a single node still leads to suboptimal
cache utilization. We will first conduct a quantitative anal-
5.2.3 Prefill GPU Time ysis of cache capacity requirements and then showcase the
benefits through practical workload experiments.
Prefill GPU time is positively correlated with requests’ TTFT
and serving cost and is determined by requests’ input lengths
5.3.1 Quantitative Analysis of Cache Capacity
and cache hit rates. We analyze the average GPU time during
the prefill stage under different workloads, as shown in Fig- Considering the LLaMA3-70B model, the KVCache size
ure 8. For M OONCAKE, the conversation workload incurs the required for a single token is 320 KB. Despite the possibility

164 23rd USENIX Conference on File and Storage Technologies USENIX Association
10th 100th 1000th 10000th
Conversation Synthetic
Conversation Tool&Agent Synthetic
Tool&Agent All 8

Replica Count
6

0.6 4
Cache Hit Rate

75% max 2

46% max 0
0.4 0 1000 2000 3000
Time (s)
0 1000 2000 3000
Time (s)
0 500
Time (s)
1000

48% max
41% max Figure 11: Replication count of cache keys across various
0.2
workloads. We continuously monitor and record the keys and
counts of all cache blocks every 30 seconds, subsequently
0.0 ranking the cache keys by the cumulative counts from all sam-
104 105 106 107 108 109 ples. This figure depicts the temporal variation in replication
Cache Capacity (tokens) numbers for cache keys ranked at the 10th, 100th, 1000th, and
Figure 9: Quantitative analysis of prefix cache hit rates with 10,000th positions.
varying cache capacities. We consider only the sequence of
requests and do not account for factors such as prefill com- to 1 to isolate the impact of the decoding stage. Each node in
putation time or the replication of hotspots in the cache. The the local cache setup has a 3M token capacity but can only
dashed line for 3M tokens capacity represents the local cache access its own cache. The global scheduler is programmed
capacity, with the intersection points indicating the ratio of to direct requests to nodes with higher prefix match ratios to
the cache hit rate to the theoretical maximum hit rate. maximize cache utilization. Conversely, in the global cache
setup, each node also has a 3M token capacity but can share
Local Cache Global Cache caches across all nodes, supported by proactive inter-node
1.5 cache migration. The experimental data, shown in Figure 10,
x
36
x

Prefill GPU Time (s)


Cache Hit Rate (%)

38

indicates that the global cache achieves higher cache hit rates
2.

x
76
1.

60
0.

1.0
and shorter average prefill GPU computation times across all
x
x

52
22

tested workloads. Compared to the local cache, the global


0.
2.

40
x
74

cache exhibits a maximum increase of 136% in cache hit rate


0.

0.5
20 and a reduction of up to 48% in prefill computation time.
0 0.0
Con Tool& Synt Con Tool& Synt
ve rsat Age hetic ve rsat Age hetic 5.3.3 Cache Replica
ion nt ion nt

Building upon the cache load balancing scheduling strategy


Figure 10: Cache hit rates and average GPU computation time discussed in §4.2, the cache keys in M OONCAKE Store may
for prefill in global and local caches. have replicas distributed across different machines, thereby
reducing access latency for hot caches. To further investi-
of reserving approximately 1 TB of DRAM for local caching, gate the system’s dynamic behavior, we count the number of
this setup only supports storage for about 3 million tokens, cache replicas for keys across three workloads, as shown in
which proves insufficient. Figure 9 displays theoretical cache Figure 11.
hit rates under various workloads and their combinations. It can be observed that in the conversation and tool&agent
The findings indicate that a local cache with a 3M token workloads, there are highly concentrated hot caches (e.g., the
capacity does not achieve 50% of the theoretical maximum top 100 keys), which, after the system stabilizes, have replicas
hit rate in most scenarios. We also determine that, in these on almost every instance in the prefill pool. In contrast, the
workloads, a cache capacity of 50M tokens nearly reaches the synthetic workload has fewer shared prefix caches, resulting
theoretical maximum hit rate of 100%, which require to pool in fewer replicas and potential fluctuations, even for the top
at least 20 nodes’ DRAM. The results highlight that a global 10 blocks. This demonstrates that our scheduling strategy in
cache significantly enhances capacity over local caches, thus §4.2 effectively provides replicas for hot caches, particularly
improving cache hit rates and reducing GPU times. in scenarios with highly concentrated prefix caches.

5.3.2 Practical Workload Experiment 5.4 KVCache Transfer Performance


To evaluate the effectiveness of global versus local caching 5.4.1 Transfer Engine
mechanisms, we focus on two metrics: cache hit rate and
average GPU computation time for prefill. We configure a M OONCAKE’s transfer engine is designed to facilitate effi-
cluster with 10 prefill nodes and restrict all request outputs cient cache transfers between nodes. We compare its latency

USENIX Association 23rd USENIX Conference on File and Storage Technologies 165
5.4 KVCache Transfer Performance

5.4.1 Transfer Engine

MOONCAKE's transfer engine is designed to facilitate efficient cache transfers between nodes. We compare its latency with other popular schemes, considering two alternative baselines: torch.distributed with a Gloo backend and TCP-based transfers. All schemes are tested with a concurrency level of 64 and a minimum transfer granularity of 128 KB. As depicted in Figure 12, the transfer engine consistently exhibits significantly lower latency than the alternative methods. In the scenario of transferring 40 GB of data, corresponding to the cache size for LLaMA3-70B with 128k tokens, the transfer engine achieves bandwidths of 87 GB/s and 190 GB/s under network configurations of 4×200 Gbps and 8×400 Gbps, respectively. These rates are approximately 2.4× and 4.6× faster than those achieved using the TCP protocol. The code of this transfer engine will also be open-sourced later, as it is a decoupled and general-purpose tool that can be used in many scenarios (e.g., it is also used in the checkpoint transfer service of Moonshot AI).

[Figure 12: Latency of inter-node cache transfer versus cache size (GB) under 4×200 Gbps and 8×400 Gbps NIC configurations, comparing the Transfer Engine, TCP, and Gloo.]
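For context, the torch.distributed/Gloo baseline can be reproduced in spirit with a few lines. This is only a sketch of the measurement, not the transfer engine itself; it assumes two ranks launched with torchrun (which provides the rendezvous environment variables) and times a single CPU-tensor send/recv.

    # Sketch of a Gloo point-to-point transfer measurement (baseline only).
    # Assumed launch: torchrun --nnodes=2 --nproc_per_node=1 gloo_baseline.py
    import time
    import torch
    import torch.distributed as dist

    def main(num_bytes: int = 1 << 30):          # 1 GiB payload for illustration
        dist.init_process_group(backend="gloo")  # reads RANK/WORLD_SIZE/MASTER_ADDR from env
        rank = dist.get_rank()
        buf = torch.empty(num_bytes, dtype=torch.uint8)   # CPU tensor; Gloo transfers via CPU
        dist.barrier()
        start = time.perf_counter()
        if rank == 0:
            dist.send(buf, dst=1)
        else:
            dist.recv(buf, src=0)
        dist.barrier()
        elapsed = time.perf_counter() - start
        if rank == 0:
            print(f"{num_bytes / elapsed / 1e9:.2f} GB/s effective bandwidth")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()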
5.4.2 Bandwidth Demand by MOONCAKE

MOONCAKE's global cache pool relies on efficient inter-node cache transfers to hide cache transfer times within GPU computation times. We evaluate the impact of network bandwidth on the system's performance by simulating a range of bandwidths from 24 Gbps to 400 Gbps and measuring the transfer time and TTFT under the synthetic workload described in §5.2.1. Figure 13a shows that the average TTFT of requests decreases as bandwidth increases. When the total communication bandwidth exceeds 100 Gbps, the average TTFT remains below 2 s, significantly less than the TTFT of the recomputation baseline. However, when bandwidth falls below 100 Gbps, system performance is significantly compromised. This is marked by a sharp increase in TTFT and evident network congestion, as demonstrated by the substantial divergence between actual and theoretical transfer times illustrated in Figure 13b. Consequently, we recommend a minimum network bandwidth of 100 Gbps to ensure optimal system performance.

[Figure 13: The synthetic workload experiment with varying network bandwidths. (a) Average TTFT versus bandwidth (Gbps), compared against the TTFT of recomputation. (b) Real versus theoretical transfer time.]
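The 100 Gbps recommendation can also be checked with back-of-the-envelope arithmetic. The sketch below only relates link speed to the time needed to move a KVCache of a given size; it ignores protocol overhead, the fraction of the cache actually transferred after prefix hits, and overlap with computation. The 40 GB figure is the LLaMA3-70B/128k-token example from §5.4.1.

    # Time to move a KVCache of a given size over links of different speeds.
    # Illustrative only; see the caveats in the paragraph above.
    CACHE_GB = 40   # e.g., the LLaMA3-70B 128k-token KVCache from Sec. 5.4.1

    for gbps in (24, 50, 100, 200, 400):
        seconds = CACHE_GB * 8 / gbps        # GB -> Gbit, divided by link speed
        print(f"{gbps:>3} Gbps: {seconds:5.1f} s to move {CACHE_GB} GB")
    # Below roughly 100 Gbps the raw transfer alone takes many seconds and can no
    # longer be hidden behind prefill computation, matching the TTFT blow-up in
    # Figure 13a.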
5.4.3 E2E Latency Breakdown

The latency of a single inference request in MOONCAKE can be decomposed into five components: 1) scheduling and queuing time; 2) layer-wise prefill time; 3) cache transfer time; 4) time required for the decoding node to load cache from DRAM to HBM; and 5) decoding time. We experimentally analyze the proportion of these five components under settings with prefix cache ratios of 0% and 95%, as shown in Figure 14.

[Figure 14: End-to-end latency breakdown of MOONCAKE (Schedule, Prefill, Transfer, Load Cache, Decode) for prompt lengths of 8k to 128k tokens, with prefix cache ratios of (a) 0% and (b) 95%. In the figure, Prefill represents the time for layer-wise prefill that integrates cache loading and storing, and Decode represents the time to decode 128 tokens. All processes with diagonal stripes can proceed asynchronously with model inference and do not affect MOONCAKE's throughput.]

First, it is evident from the figure that the introduction of prefix caching significantly reduces the prefill time. Specifically, with an input length of 128k tokens, prefix caching reduces the prefill time by 92%. Furthermore, the overhead introduced by MOONCAKE has minimal impact on the system's performance. The Schedule, Transfer, and Load Cache components can proceed asynchronously with model inference and therefore do not affect MOONCAKE's throughput. Moreover, the increase in TTFT due to these overheads is smaller than the reduction achieved by prefix caching. Even when accounting for the overhead, prefix caching in MOONCAKE can reduce TTFT by 86% with an input length of 128k tokens.
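The breakdown behind an analysis like Figure 14 can be recorded with a simple per-request accounting structure. The field names below are illustrative, not MOONCAKE's internal metrics.

    # Sketch of the per-request latency breakdown into the five components above.
    from dataclasses import dataclass, asdict

    @dataclass
    class RequestLatency:
        schedule_s: float     # scheduling and queuing
        prefill_s: float      # layer-wise prefill (cache load/store overlapped)
        transfer_s: float     # KVCache transfer from prefill to decoding node
        load_cache_s: float   # DRAM -> HBM load on the decoding node
        decode_s: float       # decoding time (128 tokens in this experiment)

        # Components that overlap with model inference (diagonal stripes in Figure 14).
        OVERLAPPED = ("schedule_s", "transfer_s", "load_cache_s")

        def proportions(self):
            parts = asdict(self)
            total = sum(parts.values()) or 1.0
            return {name: value / total for name, value in parts.items()}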
5.5 P/D Ratio

MOONCAKE is a deployed P/D disaggregation system, so in this section we explore the impact of different P/D ratios on system performance. We define the P/D ratio as the ratio of the number of prefill nodes to the number of decoding nodes. Using clusters comprising 16 nodes
but with varying P/D ratios, we measure the average TTFT and TBT under the synthetic workload described in §5.2.1. We then calculate the effective request capacity as introduced in §5.2.2, setting the thresholds for TTFT and TBT to 10 seconds and 100 milliseconds, respectively. Increasing the number of prefill nodes reduces TTFT but increases TBT, and vice versa (Figure 15b). Therefore, we need to find a balance between TTFT and TBT. Figure 15a demonstrates that when the P/D ratio is approximately 1:1, MOONCAKE achieves its highest effective request capacity, indicating that the loads on the prefill and decoding clusters are relatively balanced.

[Figure 15: The impact of the P/D ratio on the system performance, for configurations from 11P5D to 5P11D. P is short for prefill nodes and D is short for decoding nodes. (a) Effective request capacity with varying P/D ratio. (b) Latency (average TTFT and TBT) with varying P/D ratio.]

We also note that some prior work [7, 9] has proposed dynamically switching the roles of nodes between prefill and decoding. However, in practical deployments, we find that the statistical characteristics of online traffic are generally stable. Therefore, we choose to fix the P/D ratio while continuously monitoring the loads of the prefill and decoding clusters, only switching node roles when significant load fluctuations occur.
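The capacity metric used in Figure 15a can be approximated with a short sketch. As a simplification of the definition in §5.2.2, the code below simply counts a request as effective when both of its latency measurements meet the thresholds used here (TTFT ≤ 10 s, TBT ≤ 100 ms); the actual metric in §5.2.2 may be defined differently.

    # Simplified sketch of the effective-request ratio used to compare P/D ratios.
    TTFT_SLO_S = 10.0    # threshold on time-to-first-token
    TBT_SLO_S = 0.100    # threshold on time-between-tokens

    def effective_ratio(requests):
        """requests: list of (ttft_seconds, avg_tbt_seconds) pairs, one per request."""
        reqs = list(requests)
        if not reqs:
            return 0.0
        ok = sum(1 for ttft, tbt in reqs if ttft <= TTFT_SLO_S and tbt <= TBT_SLO_S)
        return ok / len(reqs)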
6 Related Work

Significant efforts have been dedicated to enhancing the efficiency of LLM serving systems through scheduling, memory management, and resource disaggregation. Production-grade systems like FasterTransformer [33], TensorRT-LLM [34], and DeepSpeed Inference [35] are designed to significantly boost throughput. Orca [13] employs iteration-level scheduling to facilitate concurrent processing at various stages, while vLLM [14] leverages dynamic KVCache management to optimize memory. FlexGen [36], Sarathi-Serve [10], and FastServe [37] incorporate innovative scheduling and swapping strategies to distribute workloads effectively across limited hardware, often complementing each other's optimizations. Further optimizations [7–9] separate the prefill and decoding stages, leading to the disaggregated architecture of MOONCAKE. Our design of MOONCAKE builds on these developments, particularly drawing from the open-source community of vLLM, for which we are deeply appreciative.

Prefix caching is also widely adopted to enable the reuse of KVCache across multiple requests, reducing computational overhead in LLM inference systems [14, 34]. Prompt Cache [16] precomputes and stores frequently used text KVCache on inference servers, facilitating their reuse and significantly reducing inference latency. SGLang [17] leverages RadixAttention, which uses a least recently used (LRU) cache within a radix tree structure to efficiently enable automatic sharing across various reuse patterns.

Among these approaches, CachedAttention [38], a work concurrent with ours, proposes a hierarchical KV caching system that utilizes cost-effective memory and storage media to accommodate KVCache for all requests. The architecture of MOONCAKE shares many design choices with CachedAttention. However, in long-context inference, the KVCache becomes extremely large, requiring high capacity and efficient data transfer along with KVCache-centric global scheduling. Additionally, MOONCAKE is not a standalone cache service; it incorporates both a memory-efficient cache storage mechanism and a cache-aware scheduling strategy, further improving prefix caching efficiency.

The benefits of a distributed KVCache pool depend on the cache hit rate, which increases as the per-token cache size decreases under a fixed capacity. Consequently, orthogonal techniques such as KVCache compression [39–41] and KVCache-friendly attention architectures [42, 43] can further enhance our approach.

7 Conclusion

This paper presents MOONCAKE, a KVCache-centric disaggregated architecture designed for efficiently serving LLMs, particularly in long-context scenarios. We discuss the necessity, challenges, and design choices involved in balancing the goal of maximizing overall effective throughput while meeting latency-related SLO requirements.

Acknowledgments

We thank the anonymous reviewers and our shepherd, Mr. Kan Wu, for their valuable feedback. The authors affiliated with Tsinghua University are all in the Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, China. This work is supported by the National Key Research & Development Program of China (2022YFB4502004), the Natural Science Foundation of China (62141216), the Tsinghua University Initiative Scientific Research Program, the Young Elite Scientists Sponsorship Program by CAST (2022QNRC001), and Beijing HaiZhi XingTu Technology Co., Ltd.
References

[1] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.

[2] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[4] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[5] Moonshot AI. Kimi. https://kimi.moonshot.cn, 2023.

[6] NVIDIA. Nvidia h100 tensor core gpu architecture. https://resources.nvidia.com/en-us-tensor-core, 2022.

[7] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024.

[8] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024.

[9] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.

[10] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024.

[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[12] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[13] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.

[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[15] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024.

[16] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024.

[17] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.

[18] Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for llm serving. 2024.

[19] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, and KV Rashmi. Sieve is simpler than lru: an efficient turn-key eviction algorithm for web caches. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1229–1246, 2024.

[20] Google. Our next-generation model: Gemini 1.5. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024, 2024.
[21] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.

[22] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.

[23] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.

[24] Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. Lightseq: Sequence level parallelism for distributed training of long context transformers. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023.

[25] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023.

[26] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, 2023.

[27] Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024.

[28] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.

[29] Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.

[30] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? arXiv preprint arXiv:2311.04939, 2023.

[31] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

[32] ShareGPT teams. https://sharegpt.com/.

[33] NVIDIA Corporation. Fastertransformer. https://github.com/NVIDIA/FasterTransformer, 2019.

[34] NVIDIA Corporation. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM, 2023.

[35] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: Enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE, 2022.

[36] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.

[37] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023.

[38] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-efficient large language model serving for multi-turn conversations with cachedattention. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024.

[39] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024.

[40] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024.
[41] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024.

[42] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.

[43] William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan-Kelley. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.

