a KVCache-centric disaggregated architecture that not only separates prefill and decoding clusters but also efficiently utilizes the underexploited CPU, DRAM, SSD and NIC resources of the GPU cluster to establish a disaggregated KVCache. At the core of MOONCAKE is its KVCache-centric global cache and a scheduler designed to maximize throughput while adhering to stringent latency-related Service Level Objectives (SLOs).

Our experiments demonstrate that MOONCAKE excels in scenarios involving long-context inputs. In tests using real traces, MOONCAKE increases the effective request capacity by 59%∼498% compared to baseline methods while complying with SLOs. Currently, MOONCAKE is operational across thousands of nodes, processing over 100 billion tokens daily. In practical deployments, MOONCAKE's innovative architecture enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, compared to previous systems.

[Figure 1: Effective request capacity of MOONCAKE under the real-world conversation workload and different TBT SLOs (MOONCAKE vs. three baseline systems, each using 16 8×A800 nodes; x-axis: time between tokens in ms; annotated capacity gains of +59%, +157%, and +498%). More on §5.2.]

1 Introduction

With the rapid adoption of large language models (LLMs) in various scenarios [1–4], the workloads for LLM serving have become significantly diversified. These workloads differ in input/output length, distribution of arrival, and, most importantly, demand different kinds of Service Level Objectives (SLOs). As a Model as a Service (MaaS) provider, one of the primary goals of Kimi [5] is to solve an optimization problem with multiple complex constraints. The optimization goal is to maximize overall effective throughput, which directly impacts revenue, while the constraints reflect varying levels of SLOs. These SLOs typically involve meeting latency-related requirements, mainly the time to first token (TTFT) and the time between tokens (TBT).

To achieve this goal, a prerequisite is to make the best use of the various kinds of resources available in the GPU cluster. Specifically, although GPU servers are currently provided as highly integrated nodes (e.g., DGX/HGX supercomputers [6]), it is necessary to decouple and restructure them into several disaggregated resource pools, each optimized for different but collaborative goals. For example, many other researchers [7–9] have suggested separating prefill servers from decoding servers, because these two stages of LLM serving have very different computational characteristics.

Further advancing this disaggregation strategy, we have engineered a disaggregated KVCache by pooling the CPU, DRAM, SSD and RDMA resources of the GPU cluster, referred to as MOONCAKE Store. This novel architecture harnesses underutilized resources to enable efficient near-GPU prefix caching, significantly enhancing the global cache capacity and inter-node transfer bandwidth. The resulting distributed KVCache system embodies the principle of trading more storage for less computation. Thus, as demonstrated in Figure 1, it substantially boosts Kimi's maximum throughput capacity while meeting the required SLOs in many important real-world scenarios. Later in this paper, we will first delve into a mathematical analysis of this strategy's benefits for LLM serving and empirically assess its efficacy using real-world data (§2.2). Then, we will detail the design choices made in implementing this petabyte-level disaggregated cache, which is interconnected via an RDMA network of up to 8×400 Gbps (§3.2).
¹ Ruoyu Qin's part of work done as an intern at Moonshot AI, contributed [...]
[Figure 2: MOONCAKE's KVCache-centric disaggregated architecture. A KVCache-centric Conductor (with a cache-aware prefill scheduler and a KVCache balance scheduler) dispatches requests to a prefill pool (PP/SP) and decoding instances; each instance runs a local chunked-prefill scheduler over paged KVCache in GPU/VRAM, backed by a CPU/DRAM/SSD distributed KVCache pool (Mooncake Store) connected through an RDMA-based KVCache transfer engine. The prefill stage's optimization goal is to maximize cache reuse subject to the TTFT SLO, an MFU lower bound, and the constraint that the KVCache fits in DRAM.]
Figure 2 shows our current KVCache-centric disaggregated architecture for LLM serving, named MOONCAKE. For each request, the global scheduler (Conductor) selects a pair of prefill and decoding instances and schedules the request in the following steps: 1) transfer as much reusable KVCache as possible to the selected prefill instance; 2) complete the prefill stage in chunks/layers and continuously stream the output KVCache to the corresponding decoding instance; 3) load the KVCache and add the request to the continuous batching process at the decoding instance to generate the request's outputs.

Although this process seems straightforward, the selection policy is complex due to many restrictions. In the prefill stage, the main objective is to reuse the KVCache as much as possible to avoid redundant computation. However, the distributed KVCache pool faces challenges in terms of both capacity and access latency. Thus, Conductor is responsible for scheduling requests with KVCache-awareness and executing scheduling operations such as swapping and replication accordingly. The hottest blocks should be replicated to multiple nodes to avoid fetching congestion, while the coldest ones should be swapped out to reduce reservation costs. In contrast, the decoding stage has different optimization goals and constraints. The aim is to aggregate as many tokens as possible in a decoding batch to improve the Model FLOPs Utilization (MFU). However, this objective is restricted not only by the TBT SLO but also by the total size of aggregated KVCache that can be contained in VRAM.

In §4, we will detail our KVCache-centric request scheduling algorithm, which balances instance loads and user experience as measured by the TTFT and TBT SLOs. This includes a heuristic-based automated hotspot migration scheme that replicates hot KVCache blocks without requiring precise predictions of future KVCache usage. Experimental results show that our KVCache-centric scheduling can significantly lower TTFT in real-world scenarios.

We will also describe the main design choices made during its implementation, especially those not covered in current research. For example, regarding P/D disaggregation, there are currently debates on its feasibility in large-scale practice due to bandwidth requirements and the trade-offs associated with chunked prefill (e.g., Sarathi-Serve [10]). We demonstrate, through comparison with vLLM, that with a highly optimized transfer engine the communication challenges can be managed, and P/D disaggregation is preferable for scenarios with stringent SLO limits (§5.2). Additionally, we discuss how to implement a separate prefill node pool that seamlessly handles the dynamic distribution of context lengths. We employ a chunked pipeline parallelism (CPP) mechanism to scale the processing of a single request across multiple nodes, which is necessary for reducing the TTFT of long-context inputs. Compared to traditional sequence parallelism (SP) based solutions, CPP reduces network consumption and lessens the reliance on frequent elastic scaling (§3.3).

MOONCAKE is currently the serving platform of Kimi and has successfully handled exponential workload growth (more than 100 billion tokens a day). According to our historical statistics, the innovative architecture of MOONCAKE enables Kimi to handle 115% and 107% more requests on the A800 and H800 clusters, respectively, compared to previous systems.

To ensure the reproducibility of our results while safeguarding proprietary information, we also provide detailed
experimental outcomes using a dummy model mirroring the architecture of LLaMA3-70B, based on replayed traces of actual workloads. These traces, along with the KVCache transfer infrastructure of MOONCAKE, are open-sourced at https://github.com/kvcache-ai/Mooncake.

In end-to-end experiments using public datasets and real workloads, MOONCAKE excels in long-context scenarios. Compared to the baseline method, MOONCAKE can achieve up to a 498% increase in the effective request capacity while meeting SLOs. In §5.3, we compare MOONCAKE Store with the local cache design and find that the global cache design of MOONCAKE Store significantly improves the cache hit rate. In our experiments, the cache hit rate is up to 2.36× higher than that of the local cache, resulting in up to 48% savings in prefill computation time. MOONCAKE, to the best of our knowledge, is the first system to demonstrate the significant benefits of using a distributed KVCache pool to share KVCache across different chat sessions and queries in large-scale deployment scenarios. We also evaluate the performance of the transfer engine that supports high-speed RDMA transfers in MOONCAKE, which shows it is approximately 2.4× and 4.6× faster than existing solutions (§5.4).

2 Preliminary and Problem Definition

2.1 Service Level Objectives of LLM Serving

Modern large language models (LLMs) are based on the Transformer architecture, which utilizes attention mechanisms and multilayer perceptrons (MLP) to process input. Popular Transformer-based models, such as GPT [11] and LLaMA [12], employ a decoder-only structure. Each inference request is logically divided into two stages: the prefill stage and the decoding stage.

During the prefill stage, all input tokens are processed in parallel, and hence it is typically computationally intensive. This stage generates the first output token while storing intermediate results of computed keys and values, referred to as the KVCache. The decoding stage then uses this KVCache to autoregressively generate new tokens. It processes only one token at a time per batch due to the limitation of autoregressive generation, which makes it memory-constrained and causes computation time to increase sublinearly with batch size. Thus, a widely used optimization in the decoding stage is continuous batching [13, 14]. Before each iteration, the scheduler checks the status and adds newly arrived requests to the batch while removing completed requests.

Due to the distinct characteristics of the prefill and decoding stages, MaaS providers set different metrics to measure their corresponding Service Level Objectives (SLOs). Specifically, the prefill stage is mainly concerned with the latency between the request arrival and the generation of the first token, known as the time to first token (TTFT). On the other hand, the decoding stage focuses on the latency between successive token generations for the same request, referred to as the time between tokens (TBT).

In real deployments, if the monitor detects unmet SLOs, we need to either add inference resources or reject some incoming requests. However, due to the current contingent supply of GPUs, elastically scaling out the inference cluster is typically infeasible. Therefore, we proactively reject requests that are predicted not to meet the SLOs to alleviate the cluster's load. Our main objective is to maximize overall throughput while adhering to SLOs, a concept referred to as goodput in other research [8, 15].

2.2 More Storage for Less Computation

To meet the stringent SLOs described above, a commonly adopted solution is to cache previously generated KVCache and reuse it upon finding a prefix match. However, existing approaches [16–18] typically restrict caching to local HBM and DRAM, assuming that the transfer bandwidth required for global scheduling would be prohibitively high. But, as we will describe later in §5.3, the capacity of local DRAM supports only up to 50% of the theoretical cache hit rate, making the design of a global cache essential. In this section, we present a mathematical analysis of the actual bandwidth necessary to benefit from this strategy, explaining why distributed caching is advantageous, especially for larger models like LLaMA3-70B. More experimental results will be given later in §5.4.2.

Notation   Description                                      Value
l          Num layers                                       80
d          Model dimension                                  8192
a, b       Constant coefficients in Equation 1              4, 22
gqa        Num q heads / Num kv heads                       8
s          Tensor element size                              2 B (BFloat16)
G          GPU computation throughput                       8×312 TFLOPS
Bh2d       Host-to-device bandwidth                         128 GB/s
Bnic       NIC bandwidth                                    800 Gbps
n, p       Prompt and matched prefix length, respectively

Table 1: Notations and parameters. Model and machine parameters are set according to LLaMA3-70B and 8×A800.

We base our analysis on the model using the notations described in Table 1 and incorporate specific parameters of LLaMA3-70B. Essentially, current popular LLMs are autoregressive language models where each token's KVCache depends only on itself and preceding tokens. Therefore, KVCache corresponding to the same input prefix can be reused without affecting output accuracy. If a current request's prompt of length n shares a common prefix of length p with previously cached KVCache, its prefill process can be optimized as follows:

    q[p:n], k[p:n], v[p:n] = MLP(hidden[p:n])
    k[1:n], v[1:n] ← KVCache + (k[p:n], v[p:n])
    o[p:n] = Attention(q[p:n], k[1:n], v[1:n])
    KVCache ← (k[1:n], v[1:n])
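The following toy C++ sketch (our own illustration, not MOONCAKE code; a single attention head with stand-in projections and no GQA, batching, or real model weights) mirrors the four update rules above: q/k/v are computed only for the uncached suffix [p, n), the cached prefix K/V are concatenated in front, and causal attention then produces outputs for the new positions only.

// Toy illustration of prefix-cache reuse in the prefill stage (not MOONCAKE code).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // [token][dim]

// Stand-in for the per-token projections (a real model applies learned weights).
static Vec project(const Vec& h, double scale) {
  Vec out(h.size());
  for (size_t j = 0; j < h.size(); ++j) out[j] = scale * h[j];
  return out;
}

int main() {
  const int d = 4, n = 6, p = 3;  // model dim, prompt length, matched prefix length
  Mat hidden(n, Vec(d));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < d; ++j) hidden[i][j] = 0.1 * (i + 1) + 0.01 * j;

  // Pretend the prefix KVCache for tokens [0, p) was fetched from the cache pool.
  Mat k(n, Vec(d)), v(n, Vec(d));
  for (int i = 0; i < p; ++i) { k[i] = project(hidden[i], 0.5); v[i] = project(hidden[i], 1.5); }

  // q[p:n], k[p:n], v[p:n] = MLP(hidden[p:n]) -- computed only for the suffix.
  Mat q(n, Vec(d));
  for (int i = p; i < n; ++i) {
    q[i] = project(hidden[i], 1.0);
    k[i] = project(hidden[i], 0.5);
    v[i] = project(hidden[i], 1.5);
  }

  // o[p:n] = Attention(q[p:n], k[1:n], v[1:n]) with a causal mask.
  Mat o(n, Vec(d, 0.0));
  for (int i = p; i < n; ++i) {
    std::vector<double> score(i + 1);
    double maxs = -1e30;
    for (int t = 0; t <= i; ++t) {
      double s = 0;
      for (int j = 0; j < d; ++j) s += q[i][j] * k[t][j];
      score[t] = s / std::sqrt(static_cast<double>(d));
      maxs = std::max(maxs, score[t]);
    }
    double denom = 0;
    for (int t = 0; t <= i; ++t) { score[t] = std::exp(score[t] - maxs); denom += score[t]; }
    for (int t = 0; t <= i; ++t)
      for (int j = 0; j < d; ++j) o[i][j] += (score[t] / denom) * v[t][j];
  }

  // KVCache <- (k[1:n], v[1:n]): the cache now covers the whole prompt.
  std::printf("computed outputs for %d of %d tokens (prefix of %d reused)\n", n - p, n, p);
  return 0;
}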
Given input length n, the FLOPs of the prefill stage can be calculated as: [...] HBM, with a size of p × l × (2 × d/gqa) × s. Assuming the average computation throughput is G and the average KVCache loading speed is B (where B is determined by the minimum [...]

[Figure 3: Workflow of an inference request across a prefill instance and a decoding instance: the prefix KVCache is reused on the prefill GPU, incremental prefill (s2) produces the incremental KVCache, the full KVCache is transferred to the decoding instance (s3), and decoding (s4) runs there.]
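The comparison between recomputing a prefix and loading its cached KVCache hinges on these quantities. A minimal C++ sketch of the arithmetic (ours, not from the paper's artifact; the 100 Gbps bandwidth value is an illustrative assumption) derives the per-token KVCache footprint from the Table 1 parameters and the time to fetch a long matched prefix.

// KVCache footprint and fetch time for LLaMA3-70B (Table 1 parameters).
// Illustrative arithmetic only; the bandwidth below is an assumption.
#include <cstdio>

int main() {
  const double l = 80, d = 8192, gqa = 8, s = 2;       // layers, model dim, q/kv head ratio, bytes/element
  const double per_token = l * 2 * (d / gqa) * s;      // = 327,680 B = 320 KB per token

  const double p = 128 * 1024;                         // matched prefix length (tokens)
  const double bytes = p * per_token;                  // ~40 GiB for a 128k-token prefix

  const double B = 100e9 / 8;                          // assume 100 Gbps of usable bandwidth, in B/s
  const double t_fetch = bytes / B;                    // seconds to pull the prefix KVCache

  std::printf("per-token KVCache: %.0f KB\n", per_token / 1024);
  std::printf("128k-token prefix: %.1f GiB, fetch at 100 Gbps: %.1f s\n",
              bytes / (1024.0 * 1024.0 * 1024.0), t_fetch);
  return 0;
}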
high-performance, zero-copy KVCache transfer system designed to maximize the benefits of using multiple RDMA NICs per machine. It enhances execution efficiency and reliability through techniques such as topology-aware path selection and endpoint pooling.

3.2.1 KVCache Management

In MOONCAKE Store, all KVCache is stored as paged blocks within a distributed cache pool. The block size, i.e., the number of tokens contained in each block, is determined by the model size and the optimal network transmission size, typically ranging from 16 to 512 tokens. Each block is attached with a hash key determined by both its own hash and its prefix for deduplication. The same hash key may have multiple replicas across different nodes to mitigate hot-cache access latency, controlled by our cache-load-balancing policy described in §4.2.

MOONCAKE Store allocates space for each cache block in the cache pool and logs metadata such as the block key and its address. When the cache pool is full, MOONCAKE Store employs an LRU (Least Recently Used) strategy to evict an existing cache block—unless the block is currently being accessed by an ongoing request—and overwrites the evicted block's space with the new block.

3.2.2 Interface

At the higher layer, MOONCAKE Store offers object-based APIs such as put, get, and change_replica. These facilitate the caching of KVCache in a disaggregated manner, organizing mini blocks of KVCache as memory objects and enabling Conductor to adjust the number of replicas for each KVCache block to achieve higher bandwidth aggregation. These functions are supported by a set of synchronous batch transfer APIs, detailed in Listing 1.

Transfer operations are available for both DRAM and GPU VRAM and will utilize GPU Direct RDMA when optimal, provided that the specified memory region has been pre-registered. The completion of these operations can be monitored asynchronously via the getTransferStatus API, which reports whether transfers are ongoing or have encountered errors.

Listing 1: Memory transfer APIs in MOONCAKE Store.

int registerLocalMemory(void *vaddr, size_t len,
                        const string &type);
BatchID allocateBatchID(size_t batch_size);
int submitTransfer(BatchID batch_id,
                   const vector<Request> &entries);
int getTransferStatus(BatchID batch_id,
                      int request_index,
                      Status &status);
int freeBatchID(BatchID batch_id);

3.2.3 Transfer Engine

To efficiently implement the above APIs, a transfer engine is designed to achieve several key objectives: 1) effectively distribute transfer tasks across multiple RDMA NIC devices; 2) abstract the complexities of RDMA connection management away from the APIs; and 3) appropriately handle temporary network failures. This transfer engine has been carefully engineered to fulfill each of these goals.

Network setup. The benefits of MOONCAKE rely on a high-bandwidth network interconnect. Currently, we use standard HGX machines where each A800 GPU is paired with a 100/200 Gbps NIC and each H800 GPU is paired with a 200/400 Gbps NIC—an aggregate bandwidth comparable to memory bandwidth that existing libraries (other than NCCL) fail to fully utilize. As for NCCL, it cannot gracefully handle dynamic topology changes due to the addition or removal of nodes/NICs and does not support DRAM-to-DRAM paths. In contrast, the transfer engine endeavors to find alternative paths upon failure.

To address congestion, the network utilizes RoCEv2 tuned by cloud providers. In the scheduler, we mitigate congestion by increasing the number of replicas for hot KVCaches (§4.2).

Topology-aware path selection. Modern inference servers often consist of multiple CPU sockets, DRAM, GPUs, and RDMA NIC devices. Although it is technically possible to transfer data from local DRAM or VRAM to a remote location using any RDMA NIC, these transfers can be limited by the bandwidth constraints of the Ultra Path Interconnect (UPI) or PCIe switch. To overcome these limitations, MOONCAKE Store implements a topology-aware path selection algorithm.

Before processing requests, each server generates a topology matrix and broadcasts it across the cluster. This matrix categorizes network interface cards (NICs) into "preferred" and "secondary" lists for various types of memory, which are specified during memory registration. Under normal conditions, a NIC from the preferred list is selected for transfers, facilitating RDMA operations within the local NUMA node or GPU Direct RDMA through the local PCIe switch only. In case of failures, NICs from both lists may be utilized. The process involves identifying the appropriate local and target NICs based on the memory addresses, establishing a connection, and executing the data transfer.

For instance, as illustrated in Figure 4, to transfer data from buffer 0 (assigned to cpu:0) in the local node to buffer 1 (assigned to cpu:1) in the target node, the engine first identifies the preferred NICs for cpu:0 using the local server's topology matrix and selects one, such as mlx5_1, as the local NIC. Similarly, the target NIC, such as mlx5_3, is selected based on the target memory address. This setup enables establishing an RDMA connection from mlx5_1@local to mlx5_3@target to carry out RDMA read and write operations.
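A minimal sketch of this selection step (ours, not the transfer engine's actual implementation): given a topology matrix like the one in Figure 4, the NIC for a buffer is drawn from the preferred list of the memory type that owns it, falling back to the secondary list when the preferred devices are marked unavailable.

// Sketch of topology-aware NIC selection (illustrative; not the actual transfer engine).
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

struct NicLists {
  std::vector<std::string> preferred;   // same NUMA node / local PCIe switch
  std::vector<std::string> secondary;   // reachable, but crosses UPI or another switch
};

// Pick a usable NIC for a buffer registered under `mem_type` (e.g. "cpu:0", "cuda:0").
std::string selectNic(const std::map<std::string, NicLists>& topology,
                      const std::string& mem_type,
                      const std::set<std::string>& unavailable) {
  const NicLists& lists = topology.at(mem_type);
  for (const auto& nic : lists.preferred)
    if (!unavailable.count(nic)) return nic;
  for (const auto& nic : lists.secondary)   // failure fallback: use the secondary list
    if (!unavailable.count(nic)) return nic;
  return "";                                // no reachable path
}

int main() {
  std::map<std::string, NicLists> topology = {
      {"cpu:0", {{"mlx5_0", "mlx5_1"}, {"mlx5_2", "mlx5_3"}}},
      {"cpu:1", {{"mlx5_2", "mlx5_3"}, {"mlx5_0", "mlx5_1"}}},
      {"cuda:0", {{"mlx5_0"}, {"mlx5_1", "mlx5_2", "mlx5_3"}}},
  };
  // The local buffer lives on cpu:0; the remote side chooses its own NIC the same
  // way from its own topology matrix based on the target memory address.
  std::string normal = selectNic(topology, "cpu:0", /*unavailable=*/{});
  std::string degraded = selectNic(topology, "cpu:0", {"mlx5_0", "mlx5_1"});
  std::printf("normal: %s, after preferred NICs fail: %s\n", normal.c_str(), degraded.c_str());
  return 0;
}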
To further maximize bandwidth utilization, a single request's transfer is internally divided into multiple slices at a granularity of 16 KB. Each slice might use a different path, enabling collaborative work among all RDMA NICs.

[Figure 4: (a) Batch transfer interface; (b) topology-aware path selection. The example topology matrix maps each memory type to its preferred and secondary NIC lists:
{
  "cpu:0":  [["mlx5_0","mlx5_1"], ["mlx5_2","mlx5_3"]],
  "cpu:1":  [["mlx5_2","mlx5_3"], ["mlx5_0","mlx5_1"]],
  "cuda:0": [["mlx5_0"], ["mlx5_1","mlx5_2","mlx5_3"]],
  ...
} ]

Endpoint management. MOONCAKE Store employs a pair of endpoints to represent the connection between a local RDMA NIC and a remote RDMA NIC. In practice, each endpoint includes one or more RDMA queue pair objects. Connections in MOONCAKE Store are established on demand; endpoints remain unpaired until the first request is made.

To prevent a large number of endpoints from slowing down request processing, MOONCAKE Store employs endpoint pooling, which caps the maximum number of active connections. We use the SIEVE [19] algorithm to manage endpoint eviction. If a connection fails due to link errors, it is removed from the endpoint pools on both sides and re-established during the next data transfer attempt.

Failure handling. In a multi-NIC environment, one common failure scenario is the temporary unavailability of a specific NIC while other routes still connect two nodes. MOONCAKE Store is designed to manage such temporary failures effectively. If a connection is identified as unavailable, MOONCAKE Store automatically identifies an alternative, reachable path and resubmits the request to a different RDMA NIC device. Furthermore, MOONCAKE Store is capable of detecting problems with other RDMA resources, including RDMA contexts and completion queues, and temporarily avoids using these resources until the issue, such as a downed link, is resolved.

3.3 MOONCAKE's Prefill Pool

Unlike the inviolable decoding nodes, the necessity and best practices for designing a separate and elastic prefill pool remain under debate. For example, although many researchers [7–9] share our intuition to use a disaggregated architecture, it is worth discussing whether this separation is still necessary with the introduction of chunked prefill [10]. However, after careful consideration, we decided to maintain MOONCAKE's disaggregated architecture. This decision is primarily driven by the fact that online services typically have more stringent SLOs. While chunked prefill reduces decoding interference, it remains challenging to simultaneously maximize MFU during the prefill stage and meet the TBT SLO during the decoding stage. We will demonstrate this in the end-to-end experiments in §5.2. Another important reason is that we think prefill nodes require different cross-node parallelism settings to handle long contexts, as the available context length of recent LLMs is increasing rapidly, from 8k to 128k and even up to 1 million tokens [20]. Typically, for such long-context requests, the input tokens can be 10 to 100 times larger than the output tokens, making optimizing the TTFT crucial. Due to the abundant parallelism in long-context prefill, using more than a single 8×GPU node to process them in parallel is desirable. However, extending tensor parallelism (TP) across more than one node requires two expensive RDMA-based all-reduce operations per layer, significantly reducing the MFU of prefill nodes.

Recently, many works have proposed sequence parallelism (SP) [21–27]. SP partitions the input sequences of requests across different nodes to achieve acceleration, allowing even long requests to meet the TTFT SLO. However, when applied to shorter input requests, SP results in a lower MFU compared to using single-node TP only. Recent research [15] proposes elastic sequence parallelism to dynamically scale the SP group up or down. Although possible, this adds complexity to our architecture. Additionally, SP still requires frequent cross-node communication, which lowers the MFU and competes with network resources for transferring KVCache across nodes.

To address this, MOONCAKE leverages the autoregressive property of decoder-only transformers and implements chunked pipeline parallelism (CPP) for long-context prefill. We group every X nodes in the prefill cluster into a pipelined prefill node group. For each request, its input tokens are partitioned into chunks, each no longer than prefill_chunk. Different chunks of the same request can be processed simultaneously by different nodes, thus parallelizing the processing and reducing TTFT.
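A minimal sketch of the chunking step (ours; prefill_chunk and the group size X are configuration values, and the layer-wise pipelined execution itself is omitted): the prompt is cut into chunks of at most prefill_chunk tokens, and chunk i is assigned to node i mod X of the prefill group, so consecutive chunks of one request can run on different nodes concurrently.

// Sketch of chunked pipeline parallelism (CPP) partitioning for long-context prefill.
// Illustrative only: shows how a prompt is split into chunks and mapped to the
// X nodes of a pipelined prefill group; the actual pipeline schedule is omitted.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Chunk {
  int begin, end;   // token range [begin, end)
  int node;         // index of the prefill node in the group that processes it
};

std::vector<Chunk> partition(int prompt_len, int prefill_chunk, int group_size_x) {
  std::vector<Chunk> chunks;
  for (int begin = 0, i = 0; begin < prompt_len; begin += prefill_chunk, ++i) {
    int end = std::min(begin + prefill_chunk, prompt_len);
    chunks.push_back({begin, end, i % group_size_x});
  }
  return chunks;
}

int main() {
  // e.g. a 128k-token prompt, chunks of 8k tokens, a group of 4 prefill nodes.
  for (const Chunk& c : partition(128 * 1024, 8 * 1024, 4))
    std::printf("tokens [%6d, %6d) -> node %d\n", c.begin, c.end, c.node);
  return 0;
}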
CPP offers two main benefits: 1) Similar to pipeline parallelism in training, it requires cross-node communication only at the boundaries of each pipeline stage, which can be easily overlapped with computation. This leads to better MFU and less network resource contention with KVCache transfer. 2) It naturally fits both short and long contexts, bringing no significant overhead for short-context prefill and avoiding frequent dynamic adjustment of node partitioning. This [...]

[Figure 5: Average TTFT (s) under the scheduling policies compared in §4 (random, load-balancing, local cache-aware, and global cache-aware).]
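Algorithm 1 below derives per-block cache keys from the prompt via PrefixHash(R.prompt_tokens, B). A minimal sketch of such chained block hashing, following §3.2.1's rule that a block's key depends on both its own content and its prefix (the specific hash combiner below is ours, for illustration only):

// Sketch of prefix-chained block hashing (illustrative, not MOONCAKE's exact scheme).
// The key of block i covers tokens [0, (i+1)*B), so two prompts share a block key
// exactly when they share that prefix, which is what enables deduplicated matching.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint64_t> prefixHash(const std::vector<int>& tokens, int block_size) {
  std::vector<uint64_t> keys;
  uint64_t h = 1469598103934665603ull;                 // FNV-1a style running hash
  for (size_t i = 0; i < tokens.size(); ++i) {
    h = (h ^ static_cast<uint64_t>(tokens[i])) * 1099511628211ull;
    if ((i + 1) % block_size == 0) keys.push_back(h);  // key chains over the whole prefix
  }
  return keys;                                         // a partial trailing block gets no key
}

int main() {
  std::vector<int> prompt = {5, 8, 13, 21, 34, 55, 89, 144, 233, 377};
  for (uint64_t k : prefixHash(prompt, /*block_size=*/4))
    std::printf("block key: %016llx\n", static_cast<unsigned long long>(k));
  return 0;
}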
Algorithm 1 KVCache-centric Scheduling Algorithm
Input: prefill instance pool P, decoding instance pool D, request R, cache block size B.
Output: the prefill and decoding instances (p, d) to process R.
 1: block_keys ← PrefixHash(R.prompt_tokens, B)
 2: TTFT, p ← inf, ∅
 3: best_len, best_instance ← FindBestPrefixMatch(P, block_keys)
 4: for instance ∈ P do
 5:   if best_len / instance.prefix_len > kvcache_balancing_threshold then
 6:     prefix_len ← best_len
 7:     transfer_len ← best_len − instance.prefix_len
 8:     T_transfer ← EstimateKVCacheTransferTime(transfer_len)
 9:   else
10:     prefix_len ← instance.prefix_len
11:     T_transfer ← 0
12:   T_queue ← EstimatePrefillQueueTime(instance)
13:   T_prefill ← EstimatePrefillExecutionTime(len(R.prompt_tokens), prefix_len)
14:   if TTFT > T_transfer + T_queue + T_prefill then
15:     TTFT ← T_transfer + T_queue + T_prefill
16:     p ← instance
17: d, TBT ← SelectDecodingInstance(D)
18: if TTFT > TTFT_SLO or TBT > TBT_SLO then
19:   reject R; return
20: if best_len / p.prefix_len > kvcache_balancing_threshold then
21:   TransferKVCache(best_instance, p)
22: return (p, d)

[...] multiplied by a threshold.¹ Both strategies not only reduce the prefill time for requests but also facilitate the automatic replication of hotspot caches, allowing for their broader distribution across multiple instances.

To validate the effectiveness of our strategy, we conduct a scheduling experiment that compares random scheduling and load-balancing scheduling with our strategy. We further compare the local cache-aware scheduling described in §4.1 and the global cache-aware scheduling described in this section, which considers cache load balancing. In random scheduling, a prefill instance is selected arbitrarily for each request. In load-balancing scheduling, the instance with the lightest load is chosen. Specifically, we build a MOONCAKE cluster consisting of 16 8×A800 nodes and replay the conversation trace detailed in §5.2.1 for the experiment. We assess the performance of each scheduling algorithm based on the TTFTs. The experimental results, depicted in Figure 5, demonstrate that our KVCache-centric scheduling algorithms outperform random and load-balancing scheduling. By incorporating cache load balancing, the global cache-aware algorithm reduces the average TTFT by an additional 14% compared to the local cache-aware algorithm.

¹ This threshold is currently adjusted manually but can be adaptively adjusted by an algorithm in the future.

5 Evaluation

As described before, according to historical statistics of Kimi, MOONCAKE enables Kimi to handle 115% and 107% more requests on the A800 and H800 clusters, respectively, compared to our previous systems based on vLLM. To further validate these results and ensure reproducibility, in this section we conduct a series of end-to-end and ablation experiments on MOONCAKE with a dummy LLaMA3-70B model to address the following questions: 1) Does MOONCAKE outperform existing LLM inference systems in real-world scenarios? 2) Compared to conventional prefix caching methods, does the design of MOONCAKE Store significantly improve MOONCAKE's performance?

5.1 Setup

Testbed. During the reproducing experiments, the system was deployed on a high-performance computing cluster. Each node in the cluster is configured with eight NVIDIA A800-SXM4-80GB GPUs and four 200 Gbps RDMA NICs. The KVCache block size in MOONCAKE Store is set to 256. For deploying MOONCAKE, each node operates as either a prefill instance or a decoding instance based on the startup parameters. For deploying other systems, each node hosts a single instance.

Metric. Specifically, we measure the TTFT and TBT of each request, where the TBT is calculated as the average of the longest 10% of the token arrival intervals. As mentioned in §2, the threshold for TTFT is set to 30 s, and TBT thresholds are set to 100 ms, 200 ms, and 300 ms, depending on the scenario. We consider requests with both TTFT and TBT below their respective thresholds as effective requests, and the proportion of effective requests among all requests as the effective request capacity. For brevity, subsequent experiments that do not mention TTFT are assumed to meet the TTFT threshold. To compare caching performance in more detail, we also measure the GPU time during the prefill stage and the cache hit rate for each request.

Baseline. We employ vLLM [14], one of the state-of-the-art open-source LLM serving systems, as our experimental baseline. vLLM features continuous batching and PagedAttention technologies, significantly enhancing inference throughput. Despite its strengths, vLLM's architecture, which couples the prefill and decoding stages, can disrupt decoding, especially in scenarios involving long contexts. Recent updates to vLLM have integrated features like prefix caching and chunked prefill to improve performance metrics in long-context scenarios, such as TTFT and TBT. We also compare these features of vLLM in our experiments, using the latest release (v0.5.1). Due to limitations in the current implementation, we test the prefix cache and chunked prefill features of this version separately.
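A minimal sketch of the metric defined above (ours; the thresholds follow §5.1): TBT is the mean of the longest 10% of token arrival intervals, and a request counts as effective only if both its TTFT and its TBT fall below their thresholds.

// Sketch of the effective-request metric from Section 5.1 (illustrative).
#include <algorithm>
#include <cstdio>
#include <vector>

// TBT of one request: average of the longest 10% of its token arrival intervals.
double tbt(std::vector<double> intervals_ms) {
  std::sort(intervals_ms.begin(), intervals_ms.end());
  size_t k = std::max<size_t>(1, intervals_ms.size() / 10);
  double sum = 0;
  for (size_t i = intervals_ms.size() - k; i < intervals_ms.size(); ++i) sum += intervals_ms[i];
  return sum / k;
}

int main() {
  const double ttft_slo_s = 30.0, tbt_slo_ms = 200.0;   // thresholds from Section 5.1
  struct Req { double ttft_s; std::vector<double> intervals_ms; };
  std::vector<Req> reqs = {
      {2.1, std::vector<double>(100, 55.0)},            // well within both SLOs
      {4.0, [] { std::vector<double> v(100, 60.0); v[7] = 2000.0; return v; }()},  // one long stall
  };
  int effective = 0;
  for (const Req& r : reqs)
    if (r.ttft_s < ttft_slo_s && tbt(r.intervals_ms) < tbt_slo_ms) ++effective;
  std::printf("effective request capacity: %.0f%%\n", 100.0 * effective / reqs.size());
  return 0;
}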
                 Conversation   Tool&Agent   Synthetic
Avg Input Len    12035          8596         15325
Cache Ratio      40%            59%          66%
Arrival Pattern  Timestamp      Timestamp    Poisson
Num Requests     12031          23608        3993

Table 2: Workload statistics.

[Figure 6: Effective request capacity of MOONCAKE under the tool&agent workload (MOONCAKE vs. vLLM, vLLM prefix caching, and vLLM chunked prefill across TBT thresholds; annotated gains of +22%, +42%, and +64%).]

5.2 End-to-end Performance

In our end-to-end experiments, we evaluate the request handling capabilities of MOONCAKE and baseline systems under various workloads. Specifically, we measure the maximum throughput that remains within the defined SLO thresholds. We employ three types of workloads in our tests: two real-world traces sampled from Kimi that represent online conversations and tool&agent interactions, respectively, and a synthetic workload to cover different inference scenarios. We will first describe the unique characteristics of these workloads and then discuss the results. Lastly, we analyze the GPU computation time during the prefill stage, further demonstrating the advantages of MOONCAKE Store in enhancing cache utilization and reducing computation costs.

5.2.1 Workload

Conversation workload. Chatbots [1, 5] represent one of the most prevalent applications of LLMs, making conversational requests a highly representative workload for LLM inference. As shown in Table 2, the conversation workload contains a significant portion of long-context requests—reaching up to 128k tokens and averaging around 12k tokens—which is comparable to the data lengths found in current long-context datasets [29, 30]. Moreover, the workload has an average prefix caching ratio of approximately 40%, brought about by multi-turn conversations. We sampled 1 hour of conversation traces from an online inference cluster, where each record includes the input and output lengths along with timestamps of arrival. Requests are dispatched according to these timestamps and are preemptively terminated once the model output reaches the predetermined length.

Tool&Agent workload. Recent studies [31] involving LLMs deployed as tools or agents to perform tasks have been increasing. These tasks are typically characterized by the incorporation of pre-designed, often lengthy, system prompts that are fully repetitive. We collected traces of the tool&agent workload, also sampled over a 1-hour period. As indicated in Table 2, this workload exhibits a high proportion of prefix caching, with shorter input and output lengths.

Synthetic workload. The synthetic workload was constructed from a combination of publicly available datasets. We categorized the requests in the real trace into three types: short conversations, tool and agent calls, and long text summarization and QA. For each category, we selected the following datasets: ShareGPT [32], L-Eval [29], and LooGLE [30]. ShareGPT comprises multi-turn conversations with short input lengths. L-Eval serves as a benchmark for evaluating model performance over long contexts, simulating scenarios where requests involve lengthy system prompts typical of tool and agent interactions. LooGLE is tailored for long-context QA and summarization tasks, featuring input lengths of up to 100k tokens and including both multi-turn QA and single-turn summarizations, making it well-suited for long text summarization and QA scenarios. Overall, the synthetic workload has the longest average input length. Despite having the highest proportion of prefix caching, its cache hits are quite dispersed, thus requiring a substantial cache capacity.

During preprocessing, each conversation turn was mapped into a separate request, incorporating both the inputs and outputs from previous interactions. For datasets featuring multiple questions with the same lengthy prompt, each question and its preceding prompt were treated as a single request. We combined the processed datasets in a 1:1:1 ratio, preserving the sequential relationships within the multi-turn dialogue requests while randomly shuffling them. Since the datasets do not specify arrival times, we simulated realistic conditions by dispatching requests at a defined rate using a Poisson process.
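Poisson arrivals can be reproduced by drawing exponentially distributed inter-arrival gaps; a minimal sketch of such a dispatcher (ours; the request rate is an illustrative assumption):

// Sketch of dispatching synthetic requests with Poisson arrivals (illustrative).
// A Poisson process with rate lambda has exponentially distributed inter-arrival times.
#include <cstdio>
#include <random>

int main() {
  const double lambda = 2.0;                       // assumed rate: 2 requests per second
  std::mt19937_64 rng(42);
  std::exponential_distribution<double> gap(lambda);

  double t = 0.0;
  for (int i = 0; i < 5; ++i) {
    t += gap(rng);                                 // next arrival timestamp
    std::printf("dispatch request %d at t = %.3f s\n", i, t);
    // In a real harness, the request (input/output lengths taken from the trace)
    // would be sent to the serving cluster at this timestamp.
  }
  return 0;
}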
5.2.2 Effective Request Capacity

To assess the maximum number of requests that can adhere to the SLOs under different workloads, we test four system configurations: MOONCAKE, vLLM, vLLM with the prefix caching feature, and vLLM with the chunked prefill feature, each utilizing 16 nodes.

Conversation workload. The results for this workload are presented in Figure 1. This workload, characterized by varying input lengths and longer output lengths, causes significant fluctuations in TBT for the vLLM system due to the lengthy contexts in the prefill stage. While chunked prefill reduces decoding interference, balancing the enhancement of MFU in the prefill stage with the TBT constraints in the decoding stage remains challenging. Despite meeting the TTFT SLO, its effective request capacity is still suboptimal. Compared to vLLM, MOONCAKE achieves a very significant increase in effective request capacity.

Tool&Agent workload. In contrast, the tool&agent workload has a high proportion of prefix caching and shorter output lengths, favoring the vLLM system as the short prefill time minimally impacts output. However, as illustrated in Figure 6, vLLM and vLLM with chunked prefill experience more severe disruptions in decoding due to longer prefill processing times, resulting in a lower effective request capacity than vLLM with prefix caching. MOONCAKE uses a global cache pool to significantly increase caching capacity and optimize cache utilization through inter-node transfers, excelling in scenarios with high prefix caching. As a result, it enhances effective request capacity by 42% compared to vLLM with prefix caching under the 200 ms threshold.

[Figure 7: Effective request capacity of MOONCAKE under the synthetic workload (MOONCAKE vs. vLLM, vLLM prefix caching, and vLLM chunked prefill across TBT thresholds; annotated gains of +28%, +40%, and +62%).]

Synthetic workload. The synthetic workload features the longest average input lengths and dispersed cache hotspots, which leads to poor cache utilization under smaller cache capacities. As depicted in Figure 7, most requests processed by MOONCAKE maintain a TBT within 100 ms, whereas about 20% of requests handled by vLLM exceed 300 ms. The performance of systems with prefix caching and chunked prefill is similar to vLLM, as they fail to mitigate the impact of long contexts on the decoding stage. Compared to vLLM, MOONCAKE increases effective request capacity by 40% under the 200 ms threshold.

5.2.3 Prefill GPU Time

[Figure 8: Average GPU time of each request during the prefill stage under different workloads (per-workload bars for MOONCAKE, vLLM, vLLM prefix caching, and vLLM chunked prefill).]

Prefill GPU time is positively correlated with requests' TTFT and serving cost and is determined by requests' input lengths and cache hit rates. We analyze the average GPU time during the prefill stage under different workloads, as shown in Figure 8. For MOONCAKE, the conversation workload incurs the longest prefill GPU time due to its longer input lengths and lower prefix cache ratio. The synthetic workload, featuring the highest prefix cache ratio and dispersed cache hotspots, achieves optimal cache hit rates within MOONCAKE's global cache pool. Consequently, despite having the longest average input lengths, it requires less prefill GPU time than the conversation workload. Finally, the tool&agent workload exhibits the shortest prefill GPU time because it has the shortest average input length and a relatively high prefix cache ratio.

Across different systems, MOONCAKE significantly reduces GPU time by fully utilizing the global cache for prefix caching, achieving reductions of 36%, 53%, and 64% for the conversation, tool&agent, and synthetic workloads, respectively, compared to vLLM. vLLM with prefix caching uses a local cache stored in HBM, whose capacity is far lower than that of MOONCAKE. Its prefill GPU time is 1.43× and 1.40× higher than MOONCAKE for the conversation and tool&agent workloads, respectively. However, in the synthetic workload, where cache hotspots are more dispersed, the prefill GPU time of vLLM with prefix caching is nearly equivalent to vLLM, and is 2.59× that of MOONCAKE. vLLM with chunked prefill sacrifices some prefill efficiency to maintain lower TBT during the decoding stage, resulting in the longest prefill GPU times, which are 1.90×, 2.68×, and 3.33× that of MOONCAKE for the three workloads.

5.3 MOONCAKE Store

To address Question 2, we examine the effects of MOONCAKE Store's global cache pool on system performance. Our analysis reveals that although using local DRAM to construct KVCache memory increases cache capacity compared to HBM only, restricting the cache to a single node still leads to suboptimal cache utilization. We will first conduct a quantitative analysis of cache capacity requirements and then showcase the benefits through practical workload experiments.

5.3.1 Quantitative Analysis of Cache Capacity

Considering the LLaMA3-70B model, the KVCache size required for a single token is 320 KB.
Despite the possibility [...]

[Figure 9: Quantitative analysis of prefix cache hit rates with varying cache capacities (curves for the conversation, tool&agent, and synthetic workloads and for all requests combined; x-axis: cache capacity in tokens; y-axis: cache hit rate). We consider only the sequence of requests and do not account for factors such as prefill computation time or the replication of hotspots in the cache. The dashed line for 3M tokens capacity represents the local cache capacity, with the intersection points indicating the ratio of the cache hit rate to the theoretical maximum hit rate.]

[Figure 10: Local cache vs. global cache: cache hit rate and average prefill GPU time for the conversation, tool&agent, and synthetic workloads (two panels; see the discussion below).]

[Figure 11: Replication count of cache keys across various workloads (panels for the conversation, tool&agent, and synthetic workloads; x-axis: time in seconds; y-axis: replica count). We continuously monitor and record the keys and counts of all cache blocks every 30 seconds, subsequently ranking the cache keys by the cumulative counts from all samples. This figure depicts the temporal variation in replication numbers for cache keys ranked at the 10th, 100th, 1000th, and 10,000th positions.]

[...] to 1 to isolate the impact of the decoding stage. Each node in the local cache setup has a 3M token capacity but can only access its own cache. The global scheduler is programmed to direct requests to nodes with higher prefix match ratios to maximize cache utilization. Conversely, in the global cache setup, each node also has a 3M token capacity but can share caches across all nodes, supported by proactive inter-node cache migration. The experimental data, shown in Figure 10, indicates that the global cache achieves higher cache hit rates and shorter average prefill GPU computation times across all [...] and a reduction of up to 48% in prefill computation time.

5.3.3 Cache Replica
[Figure 12: Latency of inter-node cache transfer (transfer engine vs. TCP vs. Gloo, under 4×200 Gbps and 8×400 Gbps NIC configurations; x-axis: cache size in GB).]

[Figure 13: The synthetic workload experiment with varying network bandwidths. (a) Average TTFT. (b) Transfer time (real vs. theoretical, with the TTFT of recomputation as a reference).]

[...] with other popular schemes, considering two alternative baselines: torch.distributed with a Gloo backend and TCP-based transfers. All schemes are tested with a concurrency level of 64 and a minimum transfer granularity of 128 KB. As depicted in Figure 12, the transfer engine consistently exhibits significantly lower latency than the alternative methods. In the scenario of transferring 40 GB of data, corresponding to the cache size for LLaMA3-70B with 128k tokens, the transfer engine achieves bandwidths of 87 GB/s and 190 GB/s under network configurations of 4×200 Gbps and 8×400 Gbps, respectively. These rates are approximately 2.4× and 4.6× faster than those achieved using the TCP protocol. The code of this transfer engine will also be open-sourced later, as it is a decoupled and basic tool that can be used in many scenarios (e.g., it is also used in the checkpoint transfer service of Moonshot AI).

5.4.2 Bandwidth Demand by MOONCAKE

MOONCAKE's global cache pool relies on efficient inter-node cache transfers to hide cache transfer times within GPU computation times. We evaluate the impact of network bandwidth on the system's performance by simulating a range of bandwidths from 24 Gbps to 400 Gbps and measuring the transfer time and TTFT under the synthetic workload described in §5.2.1. Figure 13a shows that the average TTFT of requests decreases as bandwidth increases. When the total communication bandwidth exceeds 100 Gbps, the average TTFT remains below 2 s, significantly less than the TTFT of the recomputation baseline. However, when bandwidth falls below 100 Gbps, system performance is significantly compromised. This is marked by a sharp increase in TTFT and evident network congestion, as demonstrated by the substantial divergence between actual and theoretical transfer times illustrated in Figure 13b. Consequently, we recommend a minimum network bandwidth of 100 Gbps to ensure optimal system performance.

5.4.3 E2E Latency Breakdown

[Figure 14: End-to-end latency breakdown of MOONCAKE under prefix cache ratios of (a) 0% and (b) 95%, across prompt lengths from 8k to 128k tokens (components: Schedule, Transfer, Prefill, Load Cache, Decode). In the figure, Prefill represents the time for layer-wise prefill that integrates cache loading and storing, and Decode represents the time to decode 128 tokens. All processes with diagonal stripes can proceed asynchronously with model inference and do not affect MOONCAKE's throughput.]

The latency of a single inference request in MOONCAKE can be decomposed into five components: 1) scheduling and queuing time; 2) layer-wise prefill time; 3) cache transfer time; 4) time required for the decoding node to load the cache from DRAM to HBM; and 5) decoding time. We experimentally analyze the proportion of these five components under settings with prefix cache ratios of 0% and 95%, as shown in Figure 14. First, it is evident from the figure that the introduction of prefix caching significantly reduces the prefill time. Specifically, with an input length of 128k tokens, prefix caching reduces the prefill time by 92%. Furthermore, the overhead introduced by MOONCAKE has minimal impact on the system's performance. The Schedule, Transfer, and Load Cache components can proceed asynchronously with model inference and therefore do not affect MOONCAKE's throughput. Moreover, the increase in TTFT due to these overheads is smaller than the reduction achieved by prefix caching. Even when accounting for the overhead, prefix caching in MOONCAKE can reduce TTFT by 86% with an input length of 128k tokens.

5.5 P/D Ratio

As a deployed P/D disaggregation system, in this section we explore the impact of different P/D ratios on system performance. We define the P/D ratio as the ratio of the number of prefill nodes to decoding nodes. Using the clusters comprising 16 nodes [...]
References

[1] OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.

[2] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[4] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[5] Moonshot AI. Kimi. https://kimi.moonshot.cn, 2023.

[6] NVIDIA. NVIDIA H100 Tensor Core GPU architecture. https://resources.nvidia.com/en-us-tensor-core, 2022.

[7] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024.

[8] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024.

[9] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate LLM inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.

[10] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024.

[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[12] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[13] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.

[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[15] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024.

[16] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024.

[17] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using SGLang. arXiv preprint arXiv:2312.07104, 2023.

[18] Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for LLM serving. 2024.

[19] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, and KV Rashmi. SIEVE is simpler than LRU: An efficient turn-key eviction algorithm for web caches. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1229–1246, 2024.

[20] Google. Our next-generation model: Gemini 1.5. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024, 2024.

[21] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.

[22] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.

[23] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.

[24] Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. LightSeq: Sequence level parallelism for distributed training of long context transformers. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023.

[25] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023.

[26] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, 2023.

[27] Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative AI. arXiv preprint arXiv:2405.07719, 2024.

[28] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.

[29] Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.

[30] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can long-context language models understand long contexts? arXiv preprint arXiv:2311.04939, 2023.

[31] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

[32] ShareGPT teams. https://sharegpt.com/.

[33] NVIDIA Corporation. FasterTransformer. https://github.com/NVIDIA/FasterTransformer, 2019.

[34] NVIDIA Corporation. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM, 2023.

[35] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE, 2022.

[36] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.

[37] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023.

[38] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-efficient large language model serving for multi-turn conversations with CachedAttention. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024.

[39] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024.

[40] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
[41] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong,
Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and
Xia Hu. Kivi: A tuning-free asymmetric 2bit quanti-
zation for kv cache. arXiv preprint arXiv:2402.02750,
2024.
[42] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu,
Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai
Dai, Daya Guo, et al. Deepseek-v2: A strong, econom-
ical, and efficient mixture-of-experts language model.
arXiv preprint arXiv:2405.04434, 2024.