
MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services

Waris Gill (Virginia Tech, Blacksburg, USA), Mohamed Elidrisi (Cisco, Seattle, USA), Pallavi Kalapatapu (Cisco, San Jose, USA), Ammar Ahmed (University of Minnesota, Minneapolis, USA), Ali Anwar (University of Minnesota, Minneapolis, USA), and Muhammad Ali Gulzar (Virginia Tech, Blacksburg, USA)
Abstract.

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, where inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries, which constitute about 31% of the total queries. However, existing caching methods cannot detect semantic similarity among LLM queries, nor do they handle contextual queries, leading to unacceptably high false hit and false miss rates.

This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user’s semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model without violating user privacy. By placing a local cache in each user’s device and using FL, MeanCache reduces latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache also encodes context chains for every cached query, offering a simple yet highly effective mechanism to distinguish contextual query responses from standalone ones. Our experiments, benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions while performing even better on contextual queries. It also reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%.

Federated Learning, Caching, Large Language Models, LLMs, Cache, Hit, Miss, Rate, ChatGPT, Llama, Semantic, Search, GPT, Machine Learning, Privacy, Web, Embeddings, Cosine Similarity, Client, Prompt, Query

1. Introduction

Large Language Models (LLMs) like ChatGPT (cha, [n. d.]), Google Bard (bar, [n. d.]), Claude (cla, [n. d.]), and Llama 2 (Touvron et al., 2023) have demonstrated remarkable capabilities in understanding and generating human language, leading to significant advancements in applications ranging from search engines to conversational agents. LLMs are increasingly integrated into platforms like the Perplexity AI search engine, Rabbit OS (rab, [n. d.]), Arc browser (arc, [n. d.]), and Bing Chat (bin, [n. d.]). For instance, in 2023, 1.9 billion queries were sent to Bing chat, reflecting the massive use of such systems (bin, [n. d.]).

Motivation. Generating responses to user queries with LLMs, such as GPT-3, requires substantial computations and poses environmental challenges (Frantar et al., 2023; Wang et al., 2023; Tseng et al., 2024; Sachdeva et al., 2024; Ma et al., 2024; Strubell et al., 2019). For example, GPT-3’s 175B parameters in float16 format consume 326 GB memory, exceeding single GPU capacities and necessitating multi-GPU deployments (Frantar et al., 2023). These requirements lead to high operational costs.

Consequently, LLM-based services charge users and limit query rates (per, [n. d.]; ope, [n. d.]). Prior studies report that users frequently submit similar queries to web services (Lempel and Moran, 2003; Xie and O’Hallaron, 2002), with approximately 33% of search engine queries being resubmitted (Markatos, 2001). We investigate whether a similar pattern exists with queries to LLMs. Our analysis of a large corpus of real-world user conversations with ChatGPT reveals that approximately 31% of such queries are highly similar to a previously submitted query, as shown in Section 2. Thus, there is potential to reduce the query inference cost of LLMs by up to one-third.

Caching serves as an effective technique in traditional web services to address duplicate search queries, avoiding redundant computations, significantly improving response time, reducing the load on query processors, and enhancing network bandwidth utilization (Markatos, 2001; Saraiva et al., 2001; Xie and O’Hallaron, 2002; Lempel and Moran, 2003; Brin and Page, 1998; Podlipnig and Böszörmenyi, 2003; Baeza-Yates et al., 2007). If applicable to LLM-based web services, such caching can avoid billions of floating point operations, thereby decreasing operational costs and environmental impacts.

Figure 1. Analysis of real-world ChatGPT conversations reveals that, on average, 31% of queries are similar to previously submitted queries.

Problem. Existing caching techniques (Saraiva et al., 2001; Markatos, 2001; Xie and O’Hallaron, 2002; Lempel and Moran, 2003) use keyword matching, which often fails to capture the semantic similarity among queries to LLM-based web services, resulting in a significantly low hit rate. For instance, existing caches do not detect the semantic similarity between "How can I increase the battery life of my smartphone?" and "Tips for extending the duration of my phone’s power source.", leading to a cache miss. Recently, Zhu et al. (2023) and GPTCache (Bang, 2023) presented server-side semantic caching for LLM-based services to address the limitations of keyword-matching caching techniques. If a new query is semantically similar to any query in the cache, the server returns the response from the cache. Otherwise, a model multiplexer selects the most suitable LLM for the query to generate the response.

Existing semantic caches have several limitations. First, they demand a significantly large central cache to store the queries and responses of all users, which is unscalable and violates users’ privacy. Second, they incur the network cost of sending a user query to the server even if there is a cache hit. An end user will still be charged for the query even if the query is served from the server cache. Third, they use a single server-side embedding model to find the semantic similarity among queries, which does not generalize to each user’s querying patterns. For instance, Google Keyboard (gbo, [n. d.]) adapts to each user’s unique writing style and embeds such personalized behaviors to enhance the accuracy of the next word prediction model. Fourth, they employ Llama-2 to enhance the accuracy of semantic matching (Bang, 2023); however, in practice, such models perform billions of operations to generate embeddings, offsetting the benefits of the cache. Lastly, they are only effective on standalone LLM queries, resulting in unbearably high false hit rates for contextually different but semantically similar queries.

Key Insights and Contributions of MeanCache. This work introduces a novel user-centric semantic caching system called MeanCache. MeanCache provides a privacy-preserving caching system that returns the response to similar queries directly from the user’s local cache, bypassing the need to query the LLM-based web service. MeanCache achieves these goals in the following ways.

To address privacy concerns associated with central server-side caching, MeanCache introduces a user-side cache design ensuring that the user’s queries and responses are never stored outside of the user’s device. While a personalized per-user cache may not offer benefits for similar queries across multiple users, our investigation with real-world conversations with ChatGPT reveals that approximately 31% of the queries are still similar to at least one of the previously submitted queries by the same user. Therefore, having a user-centric cache offers the same, if not more, opportunities to eliminate expensive LLM queries.

To find a semantic match between a new query and cached queries, MeanCache uses smaller embedding models such as MPNet (Song et al., 2020) to generate embeddings for semantic matching locally. Previous work has shown that a smaller model can achieve performance comparable to larger models on custom tasks (Ouyang et al., 2022; Penedo et al., 2024; Du and Kaelbling, 2024). Prior work (Bang, 2023) proposes using large language models, such as Llama 2, to generate embeddings. However, these models require billions of operations, making them unsuitable for embeddings generation.

Because the surrounding context can differ, an LLM may return different responses for semantically similar queries. For such queries, MeanCache also records the context chain of each query, i.e., its parent queries already in the cache. To serve a contextual query, MeanCache matches the new query’s context against the cached query’s context chain, so that a cached response is returned only when both the query and its context match. This lets MeanCache correctly handle previously seen queries that are semantically similar but contextually different.

Each user may not have sufficient queries to customize an embedding model that can help find a semantic match between new queries and cached queries. To address this, MeanCache utilizes federated learning, which trains collaboratively on the data silos that remain on user devices, thereby personalizing an embedding model for each user. This privacy-preserving training not only customizes the embedding model to the user’s querying patterns but also enhances the performance (i.e., accuracy) of semantic caching without compromising user privacy (i.e., without storing user data on the web server).

The runtime performance of MeanCache is primarily influenced by the time taken to match a new query embedding vector with existing ones in the cache to find a semantically similar query. The search time is directly proportional to the dimensions of the embedding vector. To optimize runtime performance, MeanCache compresses the embedding vector by leveraging principal component analysis (PCA) (Pearson, 1901; Hotelling, 1933; Jolliffe and Cadima, 2016), effectively reducing the size of the embedding vector (i.e., projecting it to lower dimensional space). MeanCache also offers an adaptive cosine similarity threshold, which is also collaboratively computed using FL, to improve accuracy in finding semantic matches between queries.

Study of Real-World Queries. To measure the prevalence of similar queries in real-world conversations with an LLM-based web service, we recruited 20 participants who regularly use ChatGPT and conducted a thorough analysis of their queries in a privacy-preserving manner. Across 27K queries, we find that 31% of a user’s queries are similar to at least one previous query by the same user, which could have been retrieved from a cache.

Evaluations. We compare MeanCache with GPTCache (Bang, 2023), a widely-used open-source semantic cache for LLM-based web services. GPTCache (Bang, 2023) is the most closely related work to MeanCache and has received over 6,000 stars on GitHub (zil, [n. d.]). We benchmark MeanCache’s performance against GPTCache on the GPTCache’s dataset (dat, [n. d.]) to demonstrate its effectiveness and highlight the improvements MeanCache offers over existing solutions.

MeanCache surpasses GPTCache (Bang, 2023) by achieving a 17% higher F-score and approximately a 20% increase in precision in end-to-end deployment for identifying duplicate queries to LLM-based web services. MeanCache’s performance on contextual queries is even stronger when compared to GPTCache (baseline). For contextual queries, MeanCache achieves a 25% higher F-score and accuracy, and a 32% higher precision over the baseline. MeanCache’s embedding compression utility reduces storage and memory needs by approximately 83% and results in 11% faster semantic matching while still outperforming the state-of-the-art GPTCache.

Artifact Availability: MeanCache is implemented in the Flower FL framework (Beutel et al., 2020). The complete source code and contextual queries dataset will be available upon acceptance.

2. Similar User Queries in Practice

A prior study (Markatos, 2001) measured the prevalence of similar queries across the entire user population. To understand the extent to which a single user submits similar queries, we conducted the first empirical study to measure how often a real-world ChatGPT user submits similar queries. Given the private nature of conversations with the ChatGPT service, study participants are naturally reluctant to share their query corpus. To address this problem, we develop a script that runs locally on each participant’s computer, identifies semantically similar queries, and calculates the percentage of similar queries in the participant’s query corpus. We then ask the participants to review the results and share, if comfortable, only three metrics (% of similar queries, total queries, and days since the first query) anonymously with the authors.

We recruited 20 participants, including male and female participants, university professors, and graduate students who regularly use ChatGPT. Before measuring the query statistics, our analysis script addresses false positives by excluding large queries with lengths beyond 256 tokens. We examined over 27K queries across 20 participants who have been using ChatGPT for an average of 332 days.

Figure 1 presents the results of this study. We find that, on average, 31% of each participant’s queries are highly similar to at least one of the previously submitted queries by the same participant, which necessitates a user-centric cache for LLM-based web services.

3. Background

3.1. Federated Learning

Federated Learning is a distributed, privacy-preserving training of a machine learning (ML) model. An FL round starts when a central server shares a global model with all the participating clients. Each client trains the model on its local data and sends the updated model to the central server. The central server aggregates the received models’ weights to form a new global model to be used in the next round. Different aggregating algorithms have been proposed, such as FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020), and FedMA (Wang et al., 2020b). The FedAvg algorithm is the most popular one, and it aggregates the models with the following equation:

(1) $W_{global}^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{k,t}$

$W_{global}^{t+1}$ represents the new global model, $w_{k,t}$ denotes the model of the $k^{th}$ client at the $t^{th}$ round, $n_k$ is the number of samples available at the $k^{th}$ client, and $n$ is the total number of samples of the clients participating in the given round. This procedure is repeated for multiple rounds until the model converges.
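To make the aggregation in Equation 1 concrete, the following is a minimal NumPy sketch of one FedAvg step. The function name `fedavg`, the representation of a model as a list of per-layer arrays, and the two toy clients are illustrative assumptions, not part of any particular FL framework.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client models, following Equation 1.

    client_weights: list of models, each a list of numpy arrays (one per layer).
    client_sizes:   list of n_k, the number of local samples at each client.
    """
    n = float(sum(client_sizes))  # total samples across participating clients
    num_layers = len(client_weights[0])
    new_global = []
    for layer in range(num_layers):
        # Sum over clients of (n_k / n) * w_k for this layer.
        agg = sum((n_k / n) * w[layer]
                  for w, n_k in zip(client_weights, client_sizes))
        new_global.append(agg)
    return new_global

# Two hypothetical clients, each holding a single-layer "model".
clients = [[np.array([1.0, 2.0])], [np.array([3.0, 6.0])]]
sizes = [1, 3]
global_model = fedavg(clients, sizes)
```

Clients with more local samples contribute proportionally more to the new global model, which is the only design choice Equation 1 encodes.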

3.2. Transformer Architecture

The transformer architecture is a deep learning model based on the attention mechanism, which allows the model to focus on the important parts of the input (Vaswani et al., 2017). The transformer architecture is primarily used for sequence-to-sequence tasks in NLP, such as machine translation, text summarization, and question-answering.

The attention mechanism calculates the importance of each input element by considering its relationship with other elements in the sequence. This enables the model to capture long-range dependencies and improve performance on tasks that require an understanding of context. The feed-forward layer helps the model capture complex patterns and relationships in the data. The transformer architecture also incorporates positional encoding, which provides information about the position of each element in the input sequence. This helps the model understand the order of the elements, which is crucial for tasks like machine translation.

3.2.1. Embeddings and Cosine Similarity

A transformer model can encode a given text into embeddings. Embeddings represent words or phrases as dense vectors in a high-dimensional space and capture their semantic meaning; therefore, words or sentences with similar meanings have similar vector representations (Reimers and Gurevych, 2019; Karpukhin et al., 2020; Ni et al., 2022). For example, the words "king" and "queen" are semantically similar because they both refer to royalty, so their embeddings will be close to each other in the vector space. These embeddings encode the contextual information and semantic meaning of the sentences. Cosine similarity is a measure of similarity between two vectors. In the context of embeddings, cosine similarity is often used to quantify the similarity between sentence embeddings. It is calculated by taking the cosine of the angle between the two vectors. The cosine similarity ranges from -1 to 1, where -1 indicates completely opposite vectors, 0 indicates orthogonal (perpendicular) vectors, and 1 indicates vectors pointing in the same direction. Cosine similarity ($\theta$) between embeddings E1 and E2 is measured as:

(2) $\theta = \frac{\mathbf{E1} \cdot \mathbf{E2}}{\|\mathbf{E1}\| \|\mathbf{E2}\|}$

Here, E1 and E2 represent the two embeddings, $\cdot$ denotes the dot product of the vectors, and $\|\mathbf{E1}\|$ and $\|\mathbf{E2}\|$ denote the Euclidean norms (magnitudes) of the vectors.
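Equation 2 translates directly into a few lines of NumPy. The sketch below assumes both embeddings are non-zero vectors of the same length; the helper name `cosine_similarity` and the toy vectors are chosen only for illustration.

```python
import numpy as np

def cosine_similarity(e1, e2):
    """Cosine of the angle between two non-zero vectors (Equation 2)."""
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # perpendicular
opp = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
```

The three calls cover the boundary cases described above: the value is close to 1 for vectors in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors, regardless of vector magnitude.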

Figure 2. MeanCache’s Workflow

4. MeanCache’s Design

MeanCache is a user-centric semantic caching system optimized for user-side operation. Figure 2 illustrates the workflow of MeanCache for similar queries. Algorithm 1 further explains MeanCache’s querying and population process programmatically. When a user submits a query to an LLM web service with MeanCache enabled, MeanCache computes the query embeddings (Line 3). These embeddings are then matched with the embeddings of the cached queries using cosine similarity (Line 4). For every similar query found within the cache, MeanCache analyzes its context chain (Line 7) and matches its embedding with the conversational history (Line 9) of the submitted query. If MeanCache finds a similar query with a similar context chain, the response is retrieved from the local cache and returned to the user (Line 13). Otherwise, MeanCache forwards the query to the LLM web service to obtain the response. The query and its response and embeddings are then stored in the cache (Line 18).

The key contribution of MeanCache comes from harnessing the collective intelligence of multiple users to train a semantic similarity model and from its user-centric design, which addresses both privacy and scalability issues. To achieve these, MeanCache makes the following design decisions. It employs a small embedding model with lower computational overhead than large language model (LLM) based embedding models, such as Llama 2. It uses federated learning (FL) for collaborative training to fine-tune the embedding model without ever storing user data on a central server. This approach generates high-quality embeddings and improves the accuracy of embedding matching for retrieving similar queries. To handle contextual queries, MeanCache stores contextual chain information alongside every cached query to identify whether the cached response for a query is only applicable under a specific context. This design is capable of handling long contextual chains. To reduce storage and memory overhead and to shorten the search time for finding similar queries in the cache, MeanCache compresses the embeddings using Principal Component Analysis (PCA). MeanCache leverages user feedback (i.e., a user forcing an LLM request instead of accepting the cached response) to adapt the threshold for cosine similarity. It then shares the updated threshold via FL to determine the optimal threshold across multiple users. It does this by varying the threshold and selecting the one that optimizes the F-score of the cache on clients.

Algorithm 1 MeanCache Workflow for LLM Queries
1: Input: User query $Q$, user query context $C_q$
2: Output: Response $R$
3: Compute query embedding $E_Q$
4: $similar\_queries \leftarrow$ FindSimilarQueriesinCache($E_Q$)
5: $context\_match \leftarrow$ False
6: for each query $C_i$ in $similar\_queries$ do
7:     $ctx \leftarrow$ the entire context chain of $C_i$
8:     $E_{ctx} \leftarrow$ Retrieve embedding of $ctx$
9:     $E_{cq} \leftarrow$ Retrieve embedding of $C_q$
10:     Compute cosine similarity $s_i$ between $E_{cq}$ and $E_{ctx}$
11:     if $s_i \geq$ context similarity threshold then
12:         $context\_match \leftarrow$ True
13:         Retrieve response $R$ from cache for $C_i$
14:         return $R$
15:     end if
16: end for
17: if not $context\_match$ then
18:     $R \leftarrow$ LLMResponseAndEnrollInCache($Q$, $E_Q$, $C_q$)
19: end if
20: return $R$
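The workflow of Algorithm 1 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class `LocalSemanticCache`, the way a context chain is embedded (joined into one string), the rule that a standalone query only matches a standalone cached query, and the letter-count `toy_embed` and `toy_llm` stand-ins are all hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class LocalSemanticCache:
    """Toy user-side cache; all names here are hypothetical."""

    def __init__(self, embed, llm, query_threshold=0.8, context_threshold=0.8):
        self.embed = embed      # text -> 1-D numpy vector
        self.llm = llm          # (query, context) -> response string
        self.query_threshold = query_threshold
        self.context_threshold = context_threshold
        self.entries = []       # cached query embedding, context embedding, response

    def _context_embedding(self, context):
        # Assumption: a context chain is embedded as one joined string;
        # an empty chain (standalone query) is represented by None.
        return self.embed(" ".join(context)) if context else None

    def _context_matches(self, e_cq, e_ctx):
        if e_cq is None or e_ctx is None:
            # A standalone query only matches a standalone cached query.
            return e_cq is None and e_ctx is None
        return cosine(e_cq, e_ctx) >= self.context_threshold

    def query(self, q, context=()):
        e_q = self.embed(q)                              # Algorithm 1, line 3
        e_cq = self._context_embedding(context)          # line 9
        similar = [e for e in self.entries               # line 4
                   if cosine(e_q, e["e_q"]) >= self.query_threshold]
        for entry in similar:                            # lines 6-16
            if self._context_matches(e_cq, entry["e_ctx"]):
                return entry["response"], True           # cache hit
        response = self.llm(q, context)                  # line 18
        self.entries.append({"e_q": e_q, "e_ctx": e_cq, "response": response})
        return response, False                           # cache miss

def toy_embed(text):
    # Stand-in for a real sentence-embedding model: letter counts.
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

calls = []
def toy_llm(q, context):
    calls.append(q)             # record every simulated LLM request
    return f"answer to: {q}"

cache = LocalSemanticCache(toy_embed, toy_llm,
                           query_threshold=0.99, context_threshold=0.99)
r1, hit1 = cache.query("tips for battery life")
r2, hit2 = cache.query("battery life tips for")  # same letters, so the toy embedding matches
r3, hit3 = cache.query("tips for battery life", context=["my laptop is old"])
```

In this toy run the second query is served from the cache, while the third query repeats the first one under a different context and is therefore forwarded to the (simulated) LLM, which is the behavior the context chain check is meant to provide.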
Figure 3. Privacy-Preserving Embedding Model FL Training in MeanCache.

4.1. FL Based Embedding Model Training

GPTCache (Bang, 2023) suggests using Llama 2 to generate superior embeddings, thereby enhancing semantic matching accuracy. However, this approach has several limitations. Large language models like Llama 2 are not only sizable, being gigabytes in size, but they also require substantial computational resources to generate embeddings. Deploying such models, especially at the end-user level, is impractical due to their size and significant computational overhead for semantic matching. MeanCache employs a compact embedding model, which has a lower computational overhead compared to large embedding models. The smaller model may not provide the same level of accuracy as an LLM. However, a smaller model trained on customized tasks can match the performance of an LLM (Ouyang et al., 2022; Penedo et al., 2024; Du and Kaelbling, 2024). One challenge in using smaller embedding models is that each user may not have sufficient data to train and customize the embedding model.

MeanCache utilizes federated learning to exploit the vast amount of distributed data available on users’ devices to train and personalize the smaller embedding model. Federated Learning allows the users to train the embedding model locally and learn the optimal threshold for cosine similarity. The updated weights and local threshold are shared with the server. The server aggregates the updated weights and cosine similarity threshold from multiple users to update the global model, which is redistributed back to the users. This approach ensures that the user’s privacy is maintained, and the collective intelligence of multiple users is leveraged to improve the performance of the caching system. Figure 3 shows the overview of privacy-preserving training of the embedding model with FL.

In the first step, the server sends the initial weights of the embedding model ($W_{global}^{t+1}$) and the global threshold ($\tau$) to a subset of users, as shown in step 1 in Figure 3. The subset of users is usually randomly selected or selected based on their battery level, network bandwidth, or performance history. Additionally, the server also sends the hyperparameters (e.g., learning rate, batch size, epochs) necessary for FL training of the embedding model.

4.1.1. Client Training.

Upon receiving the embedding model ($W_{global}^{t+1}$) from the server, each client replaces its local embedding model weights with the newly received weights. Subsequently, each client trains the embedding model locally on its unique local dataset (step 2 in Figure 3).

To generate high-quality embeddings from unique and similar queries, MeanCache’s client training employs a multitask learning approach, integrating two distinct loss functions: contrastive loss (Reimers and Gurevych, 2019) and multiple-negatives ranking loss (Henderson et al., 2017; Reimers and Gurevych, 2019). These loss functions update the weights of the embedding model during the training process. The contrastive loss function operates by distancing unique (non-duplicate) queries from each other within the embedding space, thereby facilitating the differentiation between duplicate and non-duplicate queries. Unlike contrastive loss, the multiple-negatives ranking loss function minimizes the distance between positive pairs (duplicate queries) amidst a large set of potential candidates. In other words, multiple-negatives ranking loss does not concentrate on distancing unique queries; its primary objective is to draw positive pairs (similar queries) closer within the embedding space.
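The two objectives described above can be written compactly in NumPy. The following is a sketch of commonly used formulations (a margin-based contrastive loss on cosine distance, and a softmax-based ranking loss that treats the other items in a batch as negatives); the margin of 0.5 and the scale of 20 are illustrative values, and the exact loss settings and the way the two losses are combined during MeanCache's training are not specified in this section.

```python
import numpy as np

def _normalize(x):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(e1, e2, labels, margin=0.5):
    """labels[i] = 1 for duplicate pairs, 0 for non-duplicate pairs."""
    labels = np.asarray(labels, dtype=float)
    d = 1.0 - np.sum(_normalize(e1) * _normalize(e2), axis=1)  # cosine distance
    pos = labels * d ** 2                                  # pull duplicates together
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2  # push non-duplicates apart
    return float(np.mean(0.5 * (pos + neg)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Row i of `positives` is the duplicate of row i of `anchors`;
    every other row in the batch acts as a negative."""
    sims = scale * _normalize(anchors) @ _normalize(positives).T  # (B, B)
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching pair (the diagonal) as the target.
    return float(-np.mean(np.diag(log_softmax)))

eye = np.eye(2)
aligned = multiple_negatives_ranking_loss(eye, eye)        # positives match anchors
shuffled = multiple_negatives_ranking_loss(eye, eye[::-1])  # positives mismatched
```

Note that the contrastive term only penalizes non-duplicate pairs that are closer than the margin, whereas the ranking term needs no explicit negative labels, which is why it suits clients whose data consists mostly of duplicate pairs.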

This multitask learning approach enables MeanCache to adjust to diverse query patterns exhibited by users. For instance, some users may generate more repetitive queries compared to others, while certain users may not produce any repetitive queries at all. Interestingly, MeanCache’s multitask learning objective can benefit from learning even from a user with no repetitive queries. This is because MeanCache’s global embedding model ($W_{global}^{t+1}$) will learn to widen the distance between unique queries, thereby effectively learning the true misses of the non-duplicate queries and minimizing the false hits during the search process. A true miss occurs when no similar query is present in the cache. A false hit occurs when a cached query that is not actually similar is matched and its response is returned.

4.1.2. Finding the Optimal Threshold for Cosine Similarity.

After generating query embeddings using an embedding model, a similarity metric such as cosine similarity is used to determine if the new query embeddings match the cached embeddings of past queries. This process involves setting a threshold for cosine similarity, which is a delicate balance.

In addition to privacy-preserving training of the embedding model, MeanCache also learns the optimal threshold ($\tau$) for cosine similarity. The range of $\tau$ is between 0 and 1. This threshold ($\tau$) dictates the level of similarity above which a cached query is considered relevant to the current user query. Setting the threshold too low could result in numerous false hits, leading to retrieving irrelevant queries from the cache. Conversely, a threshold that is set too high might cause many false misses, where relevant queries are not retrieved from the cache. Hence, finding the optimal threshold ($\tau$) for cosine similarity is vital.

During the client’s local training, MeanCache determines this optimal threshold ($\tau$) from the client’s reaction to cached responses. If a user requests a response from the LLM even after receiving a cached response, MeanCache treats the cache hit as a false positive and adjusts its threshold. MeanCache varies the threshold $\tau$ to find the value that optimizes the F-score of the cache (Section 5.3). The user then uses this optimal threshold to find similar queries in the cache. By finding the optimal threshold, MeanCache effectively balances true hits and true misses, thereby improving the accuracy of semantic similarity matching when returning cached responses for duplicate queries.
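The threshold search can be illustrated with a simple sweep. The sketch below assumes the client already holds a set of query pairs with their cosine similarities and a duplicate/non-duplicate label for each (for example, derived from the user feedback described above); the function names, the candidate grid of 21 values, and the toy data are hypothetical.

```python
import numpy as np

def f_score(similarities, is_duplicate, tau, beta=1.0):
    """F-score of the hit/miss decision `similarity >= tau`."""
    predicted_hit = similarities >= tau
    tp = np.sum(predicted_hit & is_duplicate)    # true hits
    fp = np.sum(predicted_hit & ~is_duplicate)   # false hits
    fn = np.sum(~predicted_hit & is_duplicate)   # false misses
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return float((1 + b2) * precision * recall / (b2 * precision + recall))

def best_threshold(similarities, is_duplicate, candidates=None):
    """Return the candidate threshold with the highest F-score."""
    similarities = np.asarray(similarities, dtype=float)
    is_duplicate = np.asarray(is_duplicate, dtype=bool)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 21)   # assumed grid: 0.00, 0.05, ..., 1.00
    scores = [f_score(similarities, is_duplicate, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Hypothetical local data: similarity of each (new, cached) pair and its label.
sims = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30]
dups = [True, True, True, False, False, False]
tau = best_threshold(sims, dups)
```

On this toy data any threshold above 0.60 and at most 0.85 separates duplicates from non-duplicates perfectly, so the sweep returns a value in that interval; lower thresholds admit false hits and higher ones cause false misses.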

4.1.3. Aggregation.

After local training and finding the optimal threshold, each client sends the updated weights of the global model ($W_{global}^{t+1}$) and its optimal threshold ($\tau$) to the server (step 3 in Figure 3). The server aggregates the updated weights from multiple users to form a new embedding model ($W_{global}^{t+1}$) using FedAvg (McMahan et al., 2017), as shown in step 4 of Figure 3. The server also computes the mean of the optimal thresholds received from the clients to obtain a global optimal threshold ($\tau_{global}$). The benefit of finding $\tau_{global}$ is that when a new user joins the system, the user will not have queries to find its own optimal threshold. In such cases, the system can use $\tau_{global}$ as a starting point for semantic similarity.

After finding the global optimal threshold and the global embedding model, the server redistributes the updated embedding model to the users for the next round of FL training. This process is repeated over several rounds to improve the semantic matching accuracy of MeanCache (i.e., to lower false hits and false misses). After the completion of FL training, each client has access to a fine-tuned embedding model ($W_{global}^{t+1}$) that generates high-quality embeddings capturing the complex semantics of a user query.
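The aggregation step can be illustrated with a minimal sketch. It assumes each client reports a flat weight vector together with its local sample count (the weighting FedAvg uses) and a single threshold. The function names `fedavg` and `global_threshold` and the toy numbers are hypothetical and not taken from the MeanCache implementation.

```python
from typing import List, Sequence, Tuple


def fedavg(client_updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Sample-weighted average of flat weight vectors (FedAvg-style)."""
    total_samples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    aggregated = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            aggregated[i] += w * n / total_samples
    return aggregated


def global_threshold(client_thresholds: Sequence[float]) -> float:
    """Plain mean of the clients' locally optimal thresholds."""
    return sum(client_thresholds) / len(client_thresholds)


# Toy round: two clients with 100 and 300 local samples (made-up values).
updates = [([0.0, 1.0], 100), ([1.0, 3.0], 300)]
w_global = fedavg(updates)
tau_global = global_threshold([0.80, 0.86])
```

In this sketch the client with more samples pulls the averaged weights toward its update, while the threshold is an unweighted mean, matching the description above. A new client without local data would start from `tau_global`.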

Figure 4. Embeddings Compression using PCA

4.1.4. Embeddings Compression using PCA

The substantial size of the embedding vector (for instance, Llama 2 embeddings have 4096 dimensions) can lead to considerable overhead when matching a new query’s embedding against the cached query embeddings, because the search time is directly proportional to the dimension of the embedding vector. Furthermore, high-dimensional embeddings demand more memory and storage. For example, the embedding generated by Llama 2 for a single query requires approximately 32.05 KB of storage. Consequently, calculating the cosine similarity between high-dimensional embeddings, specifically between the new query embedding and each embedding in the cache, becomes computationally demanding and time-intensive. Therefore, to improve the search time for identifying similar queries in the cache and to reduce the client’s storage needs, it is essential to reduce the dimensionality of the embeddings while ensuring minimal impact on MeanCache’s performance.

Principal Component Analysis (PCA) is a dimensionality reduction technique that is widely used to compress high-dimensional data into a lower-dimensional space (Pearson, 1901; Hotelling, 1933; Jolliffe and Cadima, 2016) while retaining the most important information. First, MeanCache generates embeddings for all of the user’s queries using the embedding model. Next, MeanCache applies PCA to learn the principal components of the query embeddings generated in the previous step, as shown in Figure 4-a. MeanCache integrates the learned principal components as an additional layer in the embedding model. This new layer projects the original embeddings onto the lower-dimensional space, producing compressed embeddings (Figure 4-b).

When a non-duplicate query is received, MeanCache uses the updated embedding model (with PCA layer) to generate the compressed embeddings (Figure 4-b) for the new query and store the query, response, and the compressed embedding in the cache. Storing the compressed embeddings in the cache will significantly reduce the storage and memory overhead of the embeddings. Next, when a duplicate query is received, MeanCache uses the same embedding model with PCA components to generate the compressed embeddings for this duplicate query and find similar queries in the cache. Since the embeddings are compressed, the search time for finding similar queries in the cache will be significantly reduced.
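The compression step can be sketched as follows. The example learns principal components from a matrix of embeddings with an SVD and projects embeddings onto them. It uses NumPy and random vectors as stand-ins for real query embeddings. The helper names `fit_pca` and `compress` are hypothetical, and the sizes (768 to 64) follow the dimensions reported in Section 5.6.

```python
import numpy as np


def fit_pca(embeddings: np.ndarray, n_components: int):
    """Learn (mean, components) from an (n_queries, dim) embedding matrix."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]


def compress(embedding: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project one embedding (or a batch) onto the learned components."""
    return (embedding - mean) @ components.T


rng = np.random.default_rng(0)
cached = rng.normal(size=(500, 768))  # stand-in for real query embeddings
mean, components = fit_pca(cached, n_components=64)

compressed_cache = compress(cached, mean, components)           # shape (500, 64)
compressed_query = compress(rng.normal(size=768), mean, components)  # shape (64,)
```

Because the projection is a fixed affine map (subtract the mean, multiply by a matrix), it can be appended to an embedding model as one extra linear layer, which is consistent with the design described above.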

4.2. Cache Population and MeanCache Implementation

Once the embedding model is trained within MeanCache on each user’s device, it is deployed as depicted in Figure 2. Initially, when a new user starts using MeanCache, the local cache is empty. During these early interactions, if the response to a user query is not found in the cache, the request is forwarded to the LLM web service to retrieve the response, which is then inserted into the cache. For any subsequent query, if MeanCache finds semantically similar queries in the cache, it analyzes the context chain of every similar cached query and matches its embedding against the conversational history of the submitted query. If MeanCache finds a similar query with a similar context chain, the response is retrieved from the local cache and returned to the user. Otherwise, MeanCache forwards the query to the LLM web service to obtain the response. The query, its response, and its embeddings are then stored in the cache.
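The lookup flow described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the paper’s Algorithm 1: the context chain is reduced to the embedding of the immediately preceding query, the first entry above the threshold wins, and the 3-dimensional vectors are hand-made stand-ins for real embeddings. `Entry`, `SemanticCache`, and the `llm` callable are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    embedding: Vector
    response: str
    context: Optional[Vector]  # embedding of the parent query; None if standalone


@dataclass
class SemanticCache:
    tau: float
    entries: List[Entry] = field(default_factory=list)

    def lookup(self, emb: Vector, context: Optional[Vector]) -> Optional[str]:
        for e in self.entries:
            if cosine(emb, e.embedding) <= self.tau:
                continue  # the query itself is not similar enough
            if e.context is None and context is None:
                return e.response  # standalone query matches standalone entry
            if (e.context is not None and context is not None
                    and cosine(context, e.context) > self.tau):
                return e.response  # contextual query with a matching context
        return None

    def query(self, emb: Vector, context: Optional[Vector],
              llm: Callable[[], str]) -> str:
        cached = self.lookup(emb, context)
        if cached is not None:
            return cached
        response = llm()  # cache miss: forward to the LLM service
        self.entries.append(Entry(emb, response, context))
        return response


cache = SemanticCache(tau=0.8)
draw_line, recolor, draw_circle = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]
cache.query(draw_line, None, lambda: "line code")
cache.query(recolor, draw_line, lambda: "red line code")
r3 = cache.query(draw_circle, None, lambda: "circle code")
r4 = cache.query(recolor, draw_circle, lambda: "red circle code")  # same wording, new context
r_repeat = cache.query(recolor, draw_line, lambda: "LLM should not be called")
```

In this toy run, the recolor request that follows the circle query misses the cache even though its wording matches a cached entry, because the cached entry’s context is the line query. Repeating the recolor request in its original context is served from the cache.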

MeanCache is a Python-based application built on the Flower FL framework (Beutel et al., 2020). A user can submit LLM queries via this application to take advantage of the local cache. The central server, which orchestrates the FL training, may reside on the LLM web service. We employ the SBERT (Reimers and Gurevych, 2019) library to train MPNet (Song et al., 2020) and Albert (Lan et al., 2019) on each client and to generate query embeddings. To efficiently execute a cosine similarity search between a query embedding and cached embeddings, we use SBERT’s semantic search, which can handle up to 1 million entries in the cache. MeanCache’s cache storage is persistent and is built using the DiskCache (Dis, [n. d.]) library.

5. Evaluation

Our evaluation answers the following research questions.

  • Is it possible to train an embedding model in a privacy-preserving manner without centralized data (Section 5.2)?

  • What effect does the cosine similarity threshold have on the performance of MeanCache (Section 5.3)?

  • Besides preserving user query privacy, how does MeanCache perform in comparison to the baseline in terms of performance metrics (Section 5.4)?

  • How accurately does MeanCache retrieve contextual queries from the cache (Section 5.5)?

  • What level of overhead does MeanCache impose on the user side? Is it possible to reduce the embedding dimension to save storage space while outperforming the baseline (Section 5.6)?

  • GPTCache (Bang, 2023) suggests using Llama 2 to generate embeddings to improve semantic matching. Is it feasible to use Llama 2 to compute embeddings on the user side for semantic matching (Section 5.7)?

5.1. Evaluation Settings

We conduct evaluations of MeanCache against the baseline (Bang, 2023) to demonstrate that MeanCache achieves optimal performance while preserving user-privacy (i.e., without storing the user queries at the server). For a fair comparison between MeanCache and baseline, we employ the optimal configuration as described in the GPTCache study (Bang, 2023). This configuration utilizes Albert (Lan et al., 2019) and sets the cosine similarity threshold at 0.7 to determine the cache hit or miss.

5.1.1. Transformer Models and Datasets.

For extensive evaluations of MeanCache, we utilize the Llama 2 (Touvron et al., 2023), MPNet (Song et al., 2020), and Albert (Lan et al., 2019) transformer models to generate embeddings.

We evaluate MeanCache using the GPTCache dataset. The dataset is partitioned into training, testing, and validation subsets. The training and validation datasets are randomly distributed among the clients, with each client receiving non-overlapping data points. During local training, each client utilizes its training dataset to update its local embedding model and employs the validation dataset to determine the optimal threshold for cosine similarity (Section 5.3). The testing dataset, located at the central server, facilitates a fair comparison between MeanCache and GPTCache (Bang, 2023). Since there does not exist any dataset of contextual queries, we generate a synthetic dataset using GPT-4 consisting of 450 queries, including duplicates, non-duplicates, and contextual queries, to evaluate MeanCache performance on contextual queries.

5.1.2. Experimental Setup

The experiments are conducted on a high-performance computing cluster equipped with 128 cores, 504 GB of memory, and four Nvidia A100 GPUs, each with 80 GB of memory. We use the Flower FL (Beutel et al., 2020) library to simulate a federated learning setup. Additionally, the SBERT (Reimers and Gurevych, 2019) library is employed to train the embedding model on each client. The number of clients participating in FL training is 20. The number of clients is restricted by the limited size of the GPTCache dataset, which is too small to distribute among hundreds of clients. However, we believe MeanCache’s results are not influenced by the number of clients, and the evaluation setup of MeanCache is consistent with the evaluation standard in FL (Avdiukhin and Kasiviswanathan, 2021; Wang et al., 2020a).

5.1.3. Evaluation Metrics

In caching systems, efficacy has traditionally been gauged by cache hit and miss rates. A cache hit implies the data or query is retrieved from the cache, whereas a cache miss indicates the opposite. Semantic caching introduces a more nuanced classification: true and false hits, alongside true and false misses. A true hit is a correct match between a query and a similar cached query, whereas a false hit is an incorrect match with a non-similar cached query. A true miss occurs when a query has no similar cached query, whereas a false miss occurs when a query has a similar cached query but the response is not retrieved from the cache. Thus, traditional hit/miss metrics are potentially misleading in semantic caches. For example, a query might incorrectly match an irrelevant cached query (counted as a hit traditionally) due to semantic matching. We adopt precision, recall, F score, and accuracy for a comprehensive evaluation of MeanCache against GPTCache (baseline). These metrics are defined as follows:

Precision. The ratio of true positive hits to all positive hits (including both true positives and false positives). In semantic caching, this measures how many of the queries matched to a cached query are correctly matched. $\text{Precision} = \frac{TP}{TP + FP}$, where $TP$ represents true positives (true hits) and $FP$ represents false positives (false hits).

Recall. The ratio of true positive hits to all relevant items (including both true positives and false negatives). In semantic caching, this assesses the proportion of correctly matched queries out of all queries that should have been matched to a cached query. $\text{Recall} = \frac{TP}{TP + FN}$, where $FN$ represents false negatives (false misses).

$F_\beta$ Score. A weighted harmonic mean of precision and recall, balancing the two based on the value of $\beta$. $\beta > 1$ gives more weight to recall, while $\beta < 1$ emphasizes precision.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}$$

Accuracy. The ratio of correctly identified queries (both true hits and true misses) to all queries. $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TN$ represents true negatives (true misses).
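As a quick check of these definitions, the sketch below computes all four metrics from raw hit/miss counts. The function name `cache_metrics` is hypothetical. The counts in the usage line are the MeanCache values read from the confusion matrix in Figure 11(a), and $\beta = 0.5$ is the setting the paper adopts in Section 5.4.

```python
def cache_metrics(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0) -> dict:
    """Precision, recall, F-beta, and accuracy from semantic-cache counts.

    tp: true hits, fp: false hits, fn: false misses, tn: true misses.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_beta": f_beta, "accuracy": accuracy}


# MeanCache (MPNet) on 1000 queries: 234 true hits, 89 false hits,
# 66 false misses, 611 true misses (Figure 11(a)).
m = cache_metrics(tp=234, fp=89, fn=66, tn=611, beta=0.5)
```

Rounded to two decimals, these values agree with the MeanCache (MPNet) column for standalone queries in Table 1.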

[Plot: F1, precision, recall, and accuracy (y-axis: Score) versus FL training rounds (x-axis: 0 to 50).]
Figure 5. In MeanCache, MPNet’s FL training helps generate high-quality embeddings by leveraging users’ private data, thereby improving query matching precision.

[Plot: F1, precision, recall, and accuracy (y-axis: Score) versus FL training rounds (x-axis: 0 to 50).]
Figure 6. FL training boosts MeanCache’s user query matching precision with Albert embedding model.

5.2. Privacy Preserving Embeddings Model Training

Storing user or client queries on the server side presents a potential privacy risk. To address this, each client can retain its local data on its own device. The ensuing challenge is how to train an embedding model that can also utilize the distributed data from all clients. Federated Learning is recognized for training neural networks in a privacy-preserving manner. As such, MeanCache employs FL to train and fine-tune an embedding model, thereby preserving privacy and leveraging the dataset residing on the client’s side. In this section, our objective is to evaluate whether FL training can progressively enhance the embedding model to generate high-quality embeddings for user queries. To simulate this scenario, we distribute the training dataset among 20 clients. In each round, we sample 4 clients, conducting a total of 50 FL training rounds. Each client trains its embedding model for 6 epochs, operating on a dedicated A100 GPU. We conduct two experiments with the Albert and MPNet models. The batch size is set to 256 for the Albert model, and for MPNet, it is set to 128 during the local training by participating clients.

Figures 5 and 6 depict the performance of MeanCache as FL training progresses. The X-axis represents the training round, while the Y-axis shows the performance metrics (F-score, precision, recall, and accuracy) of the global model ($W_{global}^{t+1}$). As illustrated in Figure 5, the F-score for MPNet increases from 0.82 to 0.88, and for Albert, it rises from 0.83 to 0.86, as shown in Figure 6. Similarly, precision for MPNet increases significantly from 0.74 to 0.85, as depicted in Figure 5, and for Albert, it increases from 0.74 to 0.81, as demonstrated in Figure 6. Given that MPNet is a more robust transformer architecture compared to Albert, it is also observed during our training that MPNet outperforms Albert, exhibiting superior learning in FL settings.

Summary. Overall, FL training improves the precision of MeanCache by 11 percentage points for MPNet (0.74 to 0.85) and by 7 percentage points for Albert (0.74 to 0.81). Therefore, the ability of the embedding model to generate high-quality embeddings can be improved in a privacy-preserving manner using FL training, that is, without storing user queries at the central server.

[Plot: F1, precision, recall, and accuracy (y-axis: Score) versus cosine similarity threshold (x-axis: 0.0 to 1.0).]
Figure 7. MeanCache automatically optimizes MPNet’s threshold.

[Plot: F1, precision, recall, and accuracy (y-axis: Score) versus cosine similarity threshold (x-axis: 0.0 to 1.0).]
Figure 8. For Albert embedding model, MeanCache identifies an optimal threshold of 0.78, achieving a peak F-score of 0.88 on validation data.

5.3. Cosine Similarity Threshold Impact on Semantic Matching

The process of semantic matching for a new user query begins by generating the embedding of the user’s query ($E_q$) using the embedding model. Following this, the cosine similarity ($\theta$) is computed with the cached embeddings. If the cosine similarity $\theta$ exceeds the threshold $\tau$, the cache is hit and the response to the user query is returned from the cache. Therefore, the cosine similarity threshold $\tau$ plays a crucial role in determining the similarity between a user query and cached entries. A low value of $\tau$ can lead to false hits (incorrect matches), while a high threshold might overlook appropriate matches (i.e., false cache misses or false negatives). Hence, identifying an optimal threshold is essential.
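For reference, the hit decision described above can be written compactly. The symbol $E_c$ for a cached query embedding and the set $\mathcal{C}$ of cached embeddings are notation introduced here for illustration and do not appear in the paper.

```latex
\theta(E_q, E_c) = \frac{E_q \cdot E_c}{\lVert E_q \rVert \, \lVert E_c \rVert},
\qquad
\text{cache hit} \iff \max_{E_c \in \mathcal{C}} \theta(E_q, E_c) > \tau
```

Since $\theta \le 1$, raising $\tau$ toward 1 only admits near-identical embeddings, which is why a high threshold trades false hits for false misses.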

To illustrate this, MeanCache varies the threshold $\tau$ from 0 to 1 (the range shown in Figures 7 and 8) and evaluates the performance metrics (F score, precision, recall, and accuracy) on validation data containing an equal number of duplicate and non-duplicate queries to avoid bias.

Figures 7 and 8 show how the cosine similarity threshold ($\tau$) affects MeanCache’s performance. The X-axis represents the threshold $\tau$ values, and the Y-axis denotes the performance metrics. For instance, at a 0.3 threshold, MeanCache’s semantic matching accuracy using MPNet is 57%, with a precision of 54%, as shown in Figure 7. Similarly, with Albert at the same threshold, the accuracy is 55% and the precision is 53% (Figure 8). Precision typically improves as the threshold increases. However, beyond a certain point, increasing the threshold $\tau$ leads to a decline in F score, accuracy, and recall, i.e., a higher value of $\tau$ increases false cache misses (false negatives). Therefore, setting an optimal threshold is imperative.

For MPNet, the optimal threshold $\tau$ is identified at 0.83, achieving an F1 score of 0.89, precision of 0.92, and accuracy of 0.90 (Figure 7). For Albert, the optimal threshold is 0.78, with an F1 score of 0.88, precision of 0.86, and accuracy of 0.88. Note that in the rest of the evaluations, we use these thresholds for MPNet and Albert with MeanCache.
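The threshold search can be illustrated with a small sweep. This sketch assumes the validation data is available as (cosine similarity, is-duplicate) pairs and picks the threshold with the highest F score. The function names, the step size of 0.01, and the toy similarity values are assumptions made for illustration.

```python
from typing import List, Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def best_threshold(pairs: List[Tuple[float, bool]], beta: float = 1.0,
                   steps: int = 100) -> Tuple[float, float]:
    """Sweep tau over [0, 1] and return (tau, score) with the highest F-beta."""
    best_tau, best_score = 0.0, -1.0
    for i in range(steps + 1):
        tau = i / steps
        tp = sum(1 for s, dup in pairs if s > tau and dup)       # true hits
        fp = sum(1 for s, dup in pairs if s > tau and not dup)   # false hits
        fn = sum(1 for s, dup in pairs if s <= tau and dup)      # false misses
        score = f_beta(tp, fp, fn, beta)
        if score > best_score:  # keep the lowest tau among ties
            best_tau, best_score = tau, score
    return best_tau, best_score


# Balanced toy validation set: three duplicate and three non-duplicate pairs.
validation = [(0.95, True), (0.90, True), (0.85, True),
              (0.78, False), (0.60, False), (0.40, False)]
tau_opt, score_opt = best_threshold(validation)
```

On this toy data any threshold between the highest non-duplicate similarity (0.78) and the lowest duplicate similarity (0.85) separates the two classes perfectly. Real embeddings overlap, so the sweep finds a compromise such as the 0.83 and 0.78 values reported above.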

Summary. Experimental evaluations indicate that GPTCache’s suggested threshold of 0.7 results in suboptimal performance during semantic matching. Moreover, the optimal value of the threshold $\tau$ varies with the embedding model. MeanCache adjusts the threshold based on user data, outperforming GPTCache’s suggested threshold by 16% in precision and 4% in F score for MPNet, and by 10% in precision and 2% in F score for Albert. This highlights the importance of tuning the cosine similarity threshold to its optimal value.

[Plot: response time in seconds (y-axis: 0.0 to 1.0) per query ID (x-axis: 0 to 100) for Llama 2, Llama 2+GPTCache, and Llama 2+MeanCache.]
Figure 9. Response times of 100 randomly sampled user queries to the Llama 2-based LLM service in three scenarios: without any semantic cache, with GPTCache, and with MeanCache. The integration of a semantic cache does not add significant overhead to non-duplicate queries, meaning it does not impede the performance of the LLM-based service. Moreover, it significantly reduces the average response times for duplicate queries (70-99) by serving them from the local cache.

[Plot: hit/miss label (y-axis) per query ID (x-axis: 0 to 100), showing the real label, the GPTCache predicted label, and the MeanCache predicted label.]
Figure 10. Comparison of MeanCache and GPTCache on a set of 100 queries, including 70 non-duplicate and 30 duplicate queries, sent to the Llama 2-based LLM service. Queries 0 to 69 are non-duplicate (i.e., real label is miss), and GPTCache produces significantly higher false hits on these unique queries compared to MeanCache.

(a) MeanCache
                     Predicted miss (0)   Predicted hit (1)
Real miss (0)        611                  89
Real hit (1)         66                   234

(b) GPTCache
                     Predicted miss (0)   Predicted hit (1)
Real miss (0)        467                  233
Real hit (1)         46                   254

Figure 11. Confusion matrices for MeanCache and GPTCache on 1000 queries comprising 700 unique and 300 duplicate queries. Among the 700 unique queries, MeanCache produces only 89 false hits, while GPTCache generates a significantly larger number of false hits, 233 in total.

            Standalone Queries                                  Contextual Queries
Metrics     GPTCache   MeanCache (MPNet)   MeanCache (Albert)   GPTCache   MeanCache
F score     0.56       0.73                0.68                 0.67       0.93
Precision   0.52       0.72                0.66                 0.66       0.98
Recall      0.85       0.78                0.77                 0.71       0.79
Accuracy    0.72       0.85                0.81                 0.61       0.86

Table 1. MeanCache outperforms GPTCache (baseline) on both standalone and contextual queries.

[Three plots against the number of cached queries (x-axis: 1000 to 3000) for GPTCache, MeanCache (MPNet), MeanCache (Albert), MeanCache-Compressed (MPNet), and MeanCache-Compressed (Albert): (a) storage size in KBs, (b) average search time in seconds, (c) F score.]
Figure 12. (a) MeanCache’s embedding compression utility significantly reduces storage space by 83% compared to GPTCache. (b) MeanCache’s semantic matching speed is 11% faster with compression enabled, while still outperforming GPTCache. (c) MeanCache’s F score is slightly lower with compression enabled, but it still outperforms GPTCache.

5.4. MeanCache Comparison with Baseline

Markatos (2001) and Xie and O’Hallaron (2002) have shown that users often send repeated queries to online services. While these repeated queries make up only a fraction of the total query count, identifying them on the client side can significantly reduce server load, decrease bandwidth requirements, and improve response times. Therefore, detecting repeated queries not only benefits LLM-based services but also enhances the overall user experience and conserves the user’s permissible query quota. We evaluate MeanCache against GPTCache, a baseline semantic cache, to assess improvements in user experience, precision, recall, and F score. We select a sample of 1000 queries, 30% of which are repeated queries (i.e., 300 queries are repeated), and load these 1000 queries as cached queries. Note that repeated queries are usually fewer than non-repeated queries. Thus, we use 30% as repeated queries, a percentage similar to that previously observed for web services (Markatos, 2001).

Initially, we send a new set of one thousand queries to Llama 2 (i.e., without any semantic cache) to establish a baseline for response times. We limit responses to 50 tokens to reflect practical response sizes, although actual sizes can be much larger. Note that MeanCache’s performance does not depend on the response, as it only matches the queries. Next, we send these queries to the Llama 2-based local LLM service, once with MeanCache and once with GPTCache, to measure response times and performance metrics (e.g., precision, recall, F-score). An analysis of a random subset of 100 queries (70 non-duplicate and 30 duplicate) from the 1000 queries shows the impact of caching on response times in Figure 9 and on cache hit and miss rates in Figure 10. In both figures, the X-axis represents the query. The Y-axis in Figure 10 shows each semantic cache’s hit or miss alongside the real label. The Y-axis in Figure 9 shows the response time of each query without any cache (Llama 2), with GPTCache (Llama 2+GPTCache), and with MeanCache (Llama 2+MeanCache). Figure 9 demonstrates that a semantic cache does not impede performance on non-duplicate queries (queries 0 to 69) and improves the user experience because response times decrease on duplicate queries (queries 70 to 99).

However, Figure 10 shows that GPTCache produces significant false hits on non-duplicate queries (queries 0 to 69) compared to MeanCache. Each false hit means the user receives an incorrect query response from the cache and must resend the same query to LLM service, resulting in a poor user experience. To prioritize precision, we adjust the beta value in the F score to 0.5, valuing precision twice as highly as recall to ensure user satisfaction by avoiding false positives. This decision is driven by the need to minimize user inconvenience caused by incorrect cache hits. A false miss (low recall) does not require user intervention as the false miss query will be automatically routed to the LLM. Thus, precision in semantic caching is more important than recall.
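To make the effect of $\beta = 0.5$ concrete, the worked example below substitutes the rounded standalone-query precision and recall from Table 1 into the $F_\beta$ definition from Section 5.1.3. Using the rounded table values is a simplification made here.

```latex
F_{0.5} = (1 + 0.5^2) \cdot \frac{P \times R}{0.5^2 \times P + R}
        = \frac{1.25 \, P R}{0.25 \, P + R}

\text{MeanCache (MPNet): } \frac{1.25 \times 0.72 \times 0.78}{0.25 \times 0.72 + 0.78}
  = \frac{0.702}{0.96} \approx 0.73

\text{GPTCache: } \frac{1.25 \times 0.52 \times 0.85}{0.25 \times 0.52 + 0.85}
  = \frac{0.5525}{0.98} \approx 0.56
```

GPTCache’s higher recall (0.85 versus 0.78) does not compensate for its lower precision under this weighting, which is why its F score in Table 1 trails MeanCache’s.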

Table 1 and the confusion matrices in Figure 11 highlight MeanCache’s superior performance over GPTCache on the 1000 user queries. Notably, MeanCache with MPNet achieves a precision of 0.72, significantly surpassing GPTCache’s 0.52. This superiority is evident in the lower false positive rates (i.e., false hits) shown in Figure 11(a). The number of false hits for MeanCache is 89 (Figure 11(a)), while GPTCache has 233 false hits, as depicted in Figure 11(b). In practical terms, this means that with GPTCache, the end user has to manually resend 233 queries to the LLM service to get the correct responses, compared to only 89 queries with MeanCache. While GPTCache’s recall is higher than MeanCache’s, as we discussed earlier, precision is significantly more important than recall, and MeanCache outperforms GPTCache in this regard. Overall, the F-score of MeanCache with MPNet is 0.73 and 0.68 with the Albert embedding model, both of which outperform GPTCache’s F-score of 0.56.

Summary. In an end-to-end deployment with Llama 2 as LLM service, MeanCache significantly outperforms GPTCache. It demonstrates a 17% higher F score and a 20% increase in precision in optimal configuration. The substantial reduction in false cache hits enhances the end-user experience.

5.5. Contextual Queries

(a) Ideally, all 100 queries should result in misses. However, GPTCache incorrectly produces 54 false hits, while MeanCache yields only 3.
(b) MeanCache yields 8% more true hits than the baseline.
Figure 13. Performance on Contextual Queries: MeanCache vs. Baseline. (a) shows that MeanCache produces fewer false hits (3 vs. 54 for GPTCache). (b) shows that MeanCache produces more true hits.

(a) MeanCache
(b) GPTCache (baseline)
Figure 14. MeanCache only reports three false hits compared to 54 false hits by GPTCache. False hits are undesirable as they require the user to resend the query to the LLM service to obtain the correct response.

The end-user interaction with an LLM service mainly involves two types of user queries: standalone (e.g., Q1 Draw a line in Python?) and follow-up or contextual queries (e.g., Q2 Change the color to red). While standalone queries can be processed independently, contextual queries require prior context for accurate responses. Suppose that both Q1 and Q2 with their responses are cached. The user sends a new query, Q3 Draw a circle?, followed by Q4 Change the color to red. Q4 is semantically similar to Q2 in the cache but refers to a different context. GPTCache incorrectly identifies this as a cache hit, leading to an inaccurate response. MeanCache addresses this issue by verifying the context chain (Algorithm 1) of semantically matched queries. This approach ensures that contextual queries (Q4) result in a true cache miss, prompting a request to the LLM service for an appropriate response.

On the dataset of 450 contextual queries (see Section 5.1.1), first, we populate the cache (MeanCache and baseline) with 200 queries (100 standalone and 100 contextual queries of the standalone queries). Next, we send 150 duplicate queries (75 standalone + 75 contextual) and 100 non-duplicate queries (a total of 250 queries) to the cache-enabled LLM. Figure 13 shows the true label (whether the query should be returned from the cache or not) and the corresponding GPTCache and MeanCache performance (predicted label). Note that in Figure 13(a), all the queries should be answered by the LLM; in other words, there should be no hits. However, GPTCache has 54 false hits, while MeanCache has only three false hits. This is also shown in the confusion matrix in Figure 14. Table 1 (Column-3) summarizes the comparative results. MeanCache outperforms GPTCache by over 25% in both F-Score and accuracy. Additionally, MeanCache achieves 32% higher precision compared to the baseline.

Summary. GPTCache’s low performance stems from its inability to consider contextual information, leading to high false hit rates. In contrast, MeanCache demonstrates significantly superior handling of contextual queries, with a 25% improvement in accuracy over the baseline.

5.6. MeanCache Embedding Compression utility and Impact on Storage Space

Clients or users often face storage limitations compared to web servers. Storing embeddings in the local cache on the user side for semantic search demands memory storage. Various models yield embeddings with differing vector sizes; for example, the MPNet and Albert models produce an output embedding vector of 768 dimensions, whereas the Llama 2 model’s embeddings dimension size is 4096. The embedding vector size also affects semantic search speeds, where smaller vectors could enhance speed and lower resource demands.

Figure 12 illustrates the effects of MeanCache dimension reduction utility on the storage, semantic matching speed (overhead), and MeanCache’s performance (F score). The x-axis indicates the number of queries stored in the cache, while the y-axis shows storage size, average search time, and the F score in respective graphs. MeanCache-Compressed (MPNet) and MeanCache-Compressed (Albert) represent instances where MeanCache decreases the embedding dimensions from 768 to 64 by employing the compression, as detailed in MeanCache design (Section 4).

Figure 12 demonstrates that an increase in stored queries linearly raises storage needs. Yet MeanCache with compression enabled drastically lowers storage needs by 83% compared to GPTCache.

Figure 12 also indicates that compression decreases the average search time: MeanCache with compression enabled is approximately 11% faster. Moreover, despite a slight decrease in F score with compression enabled, MeanCache still surpasses GPTCache. Furthermore, Figures 5 and 6 show that MPNet produces more precise embeddings, and Figure 12 shows that MPNet’s embeddings are particularly resilient to compression and excel in semantic matching.

Summary. The application of embedding compression optimization in MeanCache offers substantial benefits, including an 83% savings in storage and an 11% faster search process, while still outperforming the established baseline (GPTCache).

[Two bar charts comparing Llama-2, mpnet, and albert: embedding computation time in seconds (left) and embedding storage size in KBs (right).]
Figure 15. Llama 2 takes significantly longer to compute embeddings and requires substantially more storage space than Albert and MPNet.

[Plot: F1, precision, recall, and accuracy (y-axis: Score) versus cosine similarity threshold (x-axis: 0.0 to 1.0).]
Figure 16. Llama 2 as an embedding model does not perform well in semantic matching, even at optimal thresholds. The maximum F1 score is 0.75, significantly lower than with the smaller embedding models (MPNet and Albert).

5.7. Infeasibility of Embedding Generation with Llama 2

GPTCache (Bang, 2023) recommends using Llama 2 to generate embeddings to enhance GPTCache’s performance. However, Llama 2’s embedding computation is expensive in terms of inference time, requires substantial storage, and incurs considerable overhead during semantic searches. For example, Llama 2, with its 7 billion parameters, demands 30 GB of memory (Siz, [n. d.]), whereas Albert and MPNet require only 43 MB and 420 MB, respectively.

To highlight the impracticality of generating query embeddings with Llama 2, we compare the embedding computation time and embedding storage requirements of the Llama 2, Albert, and MPNet transformer models. Figure 15 shows the average time to compute the embedding of a single query and the storage required for that embedding. The average embedding computation time of Llama 2 is 0.04 seconds, while for Albert and MPNet the average computation time is 0.005 and 0.009 seconds, respectively. A single query embedding generated by Llama 2 takes approximately 32 KB of space, whereas embeddings generated by MPNet and Albert each take only 6 KB, as shown in Figure 15.
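The reported embedding sizes can be reproduced with simple arithmetic. The sketch below assumes each embedding value is stored as a 64-bit float. The paper does not state the storage type, so this is an assumption that happens to match the reported 32 KB and 6 KB figures. The function name `embedding_kb` is hypothetical.

```python
BYTES_PER_FLOAT64 = 8  # assumption: values stored as 64-bit floats


def embedding_kb(dim: int, bytes_per_value: int = BYTES_PER_FLOAT64) -> float:
    """Raw size of one embedding vector in kilobytes (1 KB = 1024 bytes)."""
    return dim * bytes_per_value / 1024


llama2_kb = embedding_kb(4096)      # Llama 2 embedding dimension
small_model_kb = embedding_kb(768)  # MPNet and Albert embedding dimension
compressed_kb = embedding_kb(64)    # after PCA compression (Section 5.6)
```

Under this assumption, a Llama 2 embedding is more than five times larger than an MPNet or Albert embedding, and a 64-dimensional compressed embedding needs half a kilobyte. These are raw vector sizes only and exclude the stored query text and response.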

Furthermore, we evaluate the performance of Llama 2 on embedding generation and semantic matching. Figure 16 shows the performance metrics of Llama 2 at different cosine similarity thresholds ($\tau$). Llama 2 performs poorly even at its optimal cosine similarity threshold, as also noted by other researchers (Use, [n. d.]). The maximum F1 score achieved by Llama 2 is 0.75, which is well below the scores obtained at the optimal thresholds in Figures 7 and 8.

Smaller models tailored for specific tasks often surpass larger models in efficiency (Ouyang et al., 2022; Penedo et al., 2024; Du and Kaelbling, 2024). Thus, diverging from GPTCache’s approach, we advocate for adopting smaller yet efficient embedding models for semantic caching. These models not only ensure optimal performance but also minimize the semantic cache’s overhead on users, featuring lower inference demands and reduced output embedding sizes, thereby facilitating deployment on edge devices.

Summary. Llama 2 is not a viable embedding model for semantic caching. Future enhancements might enable it to generate high-quality embeddings, yet its computational demands, semantic search duration, and storage requirements are likely to remain high, consequently impacting the system’s overall efficacy.

6. Related Work

Several caching systems have been proposed to optimize the performance of search engines, databases, and operating systems. Saraiva et al. (2001) suggest a two-tier dynamic caching architecture for web search engines to improve response times in hierarchical systems. Using LRU eviction at both levels, they demonstrate how the second-tier cache can significantly lower disk traffic and boost throughput. Baeza-Yates and Saint-Jean (2003) propose a three-level index organization, and Long and Suel (2005) propose three-tier caching. Xie and O’Hallaron (2002) examine two real search engine datasets to explore query locality, aiming to develop a caching strategy based on this concept. Their analysis centers on query frequency and distribution, assessing the feasibility of caching at various levels, such as the server, proxy, and client side. Lempel and Moran (2003) propose a caching technique called Probability Driven Cache (PDC) to optimize the performance of search engines. PDC uses the probability that a query will be repeated to decide whether to cache it. Fagni et al. (2006) propose SDC (Static Dynamic Cache), which exploits the temporal and spatial locality present in the query stream to avoid reprocessing the same query, thereby saving computational resources. Baeza-Yates et al. (2007) explore efficient caching design for web search engines, comparing static and dynamic caching and weighing the caching of query results against the caching of posting lists. Zhang et al. (2008) examine the challenges that large search engines face in processing thousands of queries per second across vast document collections. Their study focuses on index compression and caching optimizations, evaluating various inverted list compression algorithms, including novel variants, and analyzing different caching policies (e.g., LRU, LFU). Marin et al. (2010) introduce a cache hierarchy aimed at improving the efficiency with which web search engines process user queries. It stores different pieces of data useful for solving frequent queries, thereby increasing overall query throughput and reducing single-user query latency and power consumption.

In all of the above studies, the caching systems are designed for traditional search engines that process keyword queries (i.e., keyword matching) and return a list of links to the user as a response. However, such caching systems, when applied to LLM-based web servers or APIs, do not provide a single concise response and may return many false results. Moreover, such caching techniques fail to capture the semantic similarity among repeated queries and therefore yield a significantly lower hit rate. Zhu et al. (2023) and Bang (2023) propose server-side caching for LLM-based services to reduce the massive computational cost of LLMs. For instance, in the approach of Zhu et al. (2023), the server, upon receiving a new query, checks whether it is semantically similar to any query in the cache. If so, the server returns the cached response. Otherwise, a model multiplexer selects the most suitable LLM for the query.
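The hit-or-miss flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of Zhu et al. (2023) or GPTCache: the class and function names are hypothetical, `toy_embed` is a bag-of-words stand-in for a real embedding model, the threshold of 0.8 is arbitrary, and `call_llm` stands in for whatever the server does on a miss (for example, a model multiplexer).

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache mapping query embeddings to stored responses."""

    def __init__(self, embed: Callable[[str], Vector], tau: float = 0.8):
        self.embed = embed  # embedding function (assumed, injected by the caller)
        self.tau = tau      # cosine similarity threshold for a hit
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, or None on a miss."""
        if not self.entries:
            return None
        q = self.embed(query)
        best_vec, best_response = max(
            self.entries, key=lambda entry: cosine_similarity(q, entry[0])
        )
        return best_response if cosine_similarity(q, best_vec) >= self.tau else None

    def insert(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]):
    """Serve from the cache on a hit; otherwise call the LLM and cache the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    response = call_llm(query)
    cache.insert(query, response)
    return response, False


# Toy embedder (assumption): word counts over a tiny fixed vocabulary.
VOCAB = ("capital", "france", "weather", "paris", "today")


def toy_embed(text: str) -> Vector:
    words = [w.strip("?.,!").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]
```

The linear scan over cached entries keeps the sketch short; a practical cache would typically use an approximate nearest-neighbor index once the number of entries grows.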

Although these techniques can handle semantic similarity among queries and provide a single concise response, they raise privacy concerns because they store user queries on an external server. They are also static and unable to adapt to the behavior of each user. Furthermore, an end user will still be charged for a query even if it is served from the cache. Thus, we need a user-centered semantic cache that runs on the user side and benefits the user in terms of privacy, cost, and latency. Such a cache should detect semantic similarity among queries and adapt to the behavior of each user in a privacy-preserving manner. MeanCache provides the aforementioned benefits without violating the user’s privacy.

7. Conclusion

MeanCache is the first user-centric semantic cache for Large Language Model (LLM) based web services, such as ChatGPT. In MeanCache, clients train the global embedding model using federated learning on their local data, preserving user privacy. The global model formed after aggregation generates high-quality embeddings for semantic matching. When a user’s new query duplicates an earlier one, MeanCache semantically matches the new query against the user’s local cache and returns the most relevant result. This process reduces the computational cost of the LLM service, improves bandwidth and latency, and conserves the user’s query quota for the LLM service. Despite compressing embeddings and saving 83% of storage space, MeanCache outperforms the existing baseline semantic cache (GPTCache). With its distributed cache design, MeanCache offers a pathway to cutting the roughly one-third of LLM query inference costs that stems from semantically similar queries.

References

  • arc ([n. d.]) [n. d.]. Arc Max is the popular browser’s new suite of AI tools - The Verge. https://www.theverge.com/2023/10/3/23898907/arc-max-ai-browser-mac-ios. (Accessed on 01/19/2024).
  • Dis ([n. d.]) [n. d.]. DiskCache: Disk Backed Cache — DiskCache 5.6.1 documentation. https://grantjenks.com/docs/diskcache/. (Accessed on 04/01/2024).
  • gbo ([n. d.]) [n. d.]. Federated Learning: Collaborative Machine Learning without Centralized Training Data – Google Research Blog. https://blog.research.google/2017/04/federated-learning-collaborative.html. (Accessed on 04/01/2024).
  • bar ([n. d.]) [n. d.]. Google AI updates: Bard and new AI features in Search. https://blog.google/technology/ai/bard-google-ai-search-updates/. (Accessed on 01/18/2024).
  • dat ([n. d.]) [n. d.]. GPTCache/examples/benchmark at main · zilliztech/GPTCache. https://github.com/zilliztech/GPTCache/tree/main/examples/benchmark. (Accessed on 03/04/2024).
  • cha ([n. d.]) [n. d.]. Introducing ChatGPT. https://openai.com/blog/chatgpt. (Accessed on 01/18/2024).
  • cla ([n. d.]) [n. d.]. Introducing Claude 2.1 \ Anthropic. https://www.anthropic.com/news/claude-2-1. (Accessed on 01/18/2024).
  • rab ([n. d.]) [n. d.]. Learning human actions on computer applications. https://www.rabbit.tech/research. (Accessed on 01/19/2024).
  • bin ([n. d.]) [n. d.]. Microsoft Edge and Bing Users Engage in Over 1.9 Billion Copilot Chats in 2023. https://www.msn.com/en-us/money/other/microsoft-edge-and-bing-users-engage-in-over-19-billion-copilot-chats-in-2023/ar-AA1mmrxZ. (Accessed on 01/22/2024).
  • ope ([n. d.]) [n. d.]. OpenAI Pricing. https://openai.com/pricing. (Accessed on 01/19/2024).
  • per ([n. d.]) [n. d.]. Perplexity Pro. https://www.perplexity.ai/pro. (Accessed on 03/01/2024).
  • Siz ([n. d.]) [n. d.]. Sizing Guide - NVIDIA Docs. https://docs.nvidia.com/ai-enterprise/workflows-generative-ai/0.1.0/sizing-guide.html. (Accessed on 01/25/2024).
  • Use ([n. d.]) [n. d.]. [User] Embedding doesn’t seem to work? · Issue #899 · ggerganov/llama.cpp. https://github.com/ggerganov/llama.cpp/issues/899. (Accessed on 01/18/2024).
  • zil ([n. d.]) [n. d.]. zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index. https://github.com/zilliztech/gptcache. (Accessed on 03/03/2024).
  • Avdiukhin and Kasiviswanathan (2021) Dmitrii Avdiukhin and Shiva Kasiviswanathan. 2021. Federated learning under arbitrary communication patterns. In International Conference on Machine Learning. PMLR, 425–435.
  • Baeza-Yates et al. (2007) Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007. The impact of caching on search engines. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 183–190.
  • Baeza-Yates and Saint-Jean (2003) Ricardo Baeza-Yates and Felipe Saint-Jean. 2003. A three level search engine index based in query log distribution. In String Processing and Information Retrieval: 10th International Symposium, SPIRE 2003, Manaus, Brazil, October 8-10, 2003. Proceedings 10. Springer, 56–65.
  • Bang (2023) Fu Bang. 2023. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). 212–218.
  • Beutel et al. (2020) Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Javier Fernandez-Marques, Yan Gao, Lorenzo Sani, Hei Li Kwing, Titouan Parcollet, Pedro PB de Gusmão, and Nicholas D Lane. 2020. Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390 (2020).
  • Brin and Page (1998) Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1-7 (1998), 107–117.
  • Du and Kaelbling (2024) Yilun Du and Leslie Kaelbling. 2024. Compositional Generative Modeling: A Single Model is Not All You Need. arXiv preprint arXiv:2402.01103 (2024).
  • Fagni et al. (2006) Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, and Salvatore Orlando. 2006. Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems (TOIS) 24, 1 (2006), 51–78.
  • Frantar et al. (2023) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=tcbBPnfwxS
  • Henderson et al. (2017) Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652 (2017).
  • Hotelling (1933) Harold Hotelling. 1933. Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24, 6 (1933), 417.
  • Jolliffe and Cadima (2016) Ian T Jolliffe and Jorge Cadima. 2016. Principal component analysis: a review and recent developments. Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences 374, 2065 (2016), 20150202.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. CoRR abs/1909.11942 (2019). arXiv:1909.11942 http://arxiv.org/abs/1909.11942
  • Lempel and Moran (2003) Ronny Lempel and Shlomo Moran. 2003. Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th international conference on World Wide Web. 19–28.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems 2 (2020), 429–450.
  • Long and Suel (2005) Xiaohui Long and Torsten Suel. 2005. Three-level caching for efficient query processing in large web search engines. In Proceedings of the 14th international conference on World Wide Web. 257–266.
  • Ma et al. (2024) Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:2402.17764 (2024).
  • Marin et al. (2010) Mauricio Marin, Veronica Gil-Costa, and Carlos Gomez-Pantoja. 2010. New caching techniques for web search engines. In Proceedings of the 19th ACM international symposium on high performance distributed computing. 215–226.
  • Markatos (2001) Evangelos P. Markatos. 2001. On caching search engine query results. Computer Communications 24, 2 (2001), 137–143.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.
  • Ni et al. (2022) Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. In Findings of the Association for Computational Linguistics: ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 1864–1874. https://doi.org/10.18653/v1/2022.findings-acl.146
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Pearson (1901) Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572. https://doi.org/10.1080/14786440109462720
  • Penedo et al. (2024) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2024. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems 36 (2024).
  • Podlipnig and Böszörmenyi (2003) Stefan Podlipnig and Laszlo Böszörmenyi. 2003. A survey of web cache replacement strategies. ACM Computing Surveys (CSUR) 35, 4 (2003), 374–398.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 3982–3992. https://doi.org/10.18653/v1/D19-1410
  • Sachdeva et al. (2024) Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. 2024. How to Train Data-Efficient LLMs. arXiv preprint arXiv:2402.09668 (2024).
  • Saraiva et al. (2001) Patricia Correia Saraiva, Edleno Silva de Moura, Nivio Ziviani, Wagner Meira, Rodrigo Fonseca, and Berthier Ribeiro-Neto. 2001. Rank-preserving two-level caching for scalable search engines. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 51–58.
  • Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33 (2020), 16857–16867.
  • Strubell et al. (2019) Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3645–3650.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Tseng et al. (2024) Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. arXiv preprint arXiv:2402.04396 (2024).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2023) Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453 (2023).
  • Wang et al. (2020b) Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. 2020b. Federated Learning with Matched Averaging. In International Conference on Learning Representations. https://openreview.net/forum?id=BkluqlSFDS
  • Wang et al. (2020a) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. 2020a. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems 33 (2020), 7611–7623.
  • Xie and O’Hallaron (2002) Yinglian Xie and David O’Hallaron. 2002. Locality in search engine queries and its implications for caching. In Proceedings. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. IEEE, 1238–1247.
  • Zhang et al. (2008) Jiangong Zhang, Xiaohui Long, and Torsten Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international conference on World Wide Web. 387–396.
  • Zhu et al. (2023) Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael Jordan, and Jiantao Jiao. 2023. Towards Optimal Caching and Model Selection for Large Model Inference. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=gd20oaZqqF