
[Cache] Add cachedPrefixes for caching repeated system prompts #664

Open · wants to merge 2 commits into base: main

Conversation

YiyanZhai

This PR adds a cachedPrefixes field to MLCEngineConfig, allowing users to cache system prompts when creating an MLCEngine. It reduces redundant prefilling of repeated instructions.

Example usage in CreateMLCEngine:

await webllm.CreateMLCEngine(
  selectedModel,
  {
    initProgressCallback: initProgressCallback,
    logLevel: "INFO",
    cachedPrefixes: [
      [
        {
          role: "system",
          content:
            "You are a helpful assistant running in the user's browser. You need to answer questions ... ",
        },
      ],
    ],
  },
  {
    context_window_size: 2048,
  }
);

@CharlieFRuan CharlieFRuan (Contributor) left a comment


Thank you for the hard work! Added some comments. Please add an E2E example under examples/. I will take another pass afterwards. Thanks again!

@@ -114,6 +115,7 @@ export interface MLCEngineConfig {
  initProgressCallback?: InitProgressCallback;
  logitProcessorRegistry?: Map<string, LogitProcessor>;
  logLevel?: LogLevel;
  cachedPrefixes?: ChatCompletionMessageParam[][];

Let's add docs to MLCEngineConfig, specifying the behavior of cachedPrefixes (e.g. will prefill when loading the engine to create the prefixes' KV, will only dispose these KV when reloading the engine). Perhaps we can also mark this as experimental to signify potential future API/behavior changes.
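A possible shape for that documentation, based only on the behavior described in this comment (the field name and types come from the diff; the wording is just a sketch, not the final text):

export interface MLCEngineConfig {
  // ... existing fields ...
  /**
   * EXPERIMENTAL: the API and behavior of this field may change in the future.
   *
   * Conversation prefixes (e.g. system prompts) to cache. Each prefix is
   * prefilled once when the engine loads, creating its KV entries up front;
   * later requests that start with a cached prefix can reuse those entries
   * instead of re-prefilling them. The cached KV entries are only disposed
   * when the engine is reloaded.
   */
  cachedPrefixes?: ChatCompletionMessageParam[][];
}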

@@ -114,6 +115,7 @@ export interface MLCEngineConfig {
  initProgressCallback?: InitProgressCallback;
  logitProcessorRegistry?: Map<string, LogitProcessor>;
  logLevel?: LogLevel;
  cachedPrefixes?: ChatCompletionMessageParam[][];

Could you also add an examples/cached_prefixes, where we can demonstrate the prefill time difference between using cachedPrefixes and not using it? We should also test whether the behavior is expected in multi-turn conversations.
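A minimal sketch of what examples/cached_prefixes could measure, assuming the CreateMLCEngine signature shown in the PR description; the model id, prompt text, and timing helper are placeholders, not part of this PR:

import * as webllm from "@mlc-ai/web-llm";

const systemPrompt: webllm.ChatCompletionMessageParam = {
  role: "system",
  content: "You are a helpful assistant running in the user's browser.",
};

// Time the first user turn; without cachedPrefixes this includes prefilling
// the system prompt, with cachedPrefixes it should reuse the cached KV.
async function timeFirstTurn(
  engine: webllm.MLCEngineInterface,
): Promise<number> {
  const start = performance.now();
  await engine.chat.completions.create({
    messages: [systemPrompt, { role: "user", content: "Hello!" }],
  });
  return performance.now() - start;
}

async function main() {
  const model = "Llama-3.1-8B-Instruct-q4f32_1-MLC"; // placeholder model id
  const cached = await webllm.CreateMLCEngine(model, {
    cachedPrefixes: [[systemPrompt]],
  });
  console.log("with cachedPrefixes:", await timeFirstTurn(cached), "ms");

  const uncached = await webllm.CreateMLCEngine(model);
  console.log("without cachedPrefixes:", await timeFirstTurn(uncached), "ms");
}

main();

Extending main() with a second request that reuses the same system prompt would also cover the multi-turn case mentioned above.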

if (this.seqIdToPrefix.size === 0) {
  this.fclearKVCaches(this.kvCache);
} else {
  this.fKVCacheRemoveSequence!(this.kvCache, new tvmjs.Scalar(0, "int64"));

Now that we have multiple sequence IDs, let's make a constant, say CHAT_SEQUENCE_ID = 0 (or maybe a better name), instead of using a magic number 0 that may be hard to keep track of.
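For illustration, the snippet above could then read roughly as follows (the constant name and where it is declared are only a suggestion):

// Sequence ID reserved for the main chat; cached prefixes use other IDs.
const CHAT_SEQUENCE_ID = 0;

if (this.seqIdToPrefix.size === 0) {
  this.fclearKVCaches(this.kvCache);
} else {
  this.fKVCacheRemoveSequence!(
    this.kvCache,
    new tvmjs.Scalar(CHAT_SEQUENCE_ID, "int64"),
  );
}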


// If a match is found, fork the sequence
if (matchedSeqId !== -1 && maxMatchedLen > 0) {
  console.log(

Use log.info() instead of console.log()
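For example, assuming the loglevel-style logger (log) used elsewhere in the codebase and keeping the variables from the snippet above; the message text itself is a placeholder:

import log from "loglevel";

// Replaces console.log for the prefix-match message.
log.info("Using cached prefix, seqID:", matchedSeqId, "matched length:", maxMatchedLen);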

  this.tvm.endScope();
} else if (seqID !== 0) {
  // If no match is found, add the new sequence to the KV cache
  console.log("Adding prefix to KV cache: ", seqID);

Use log.info() instead of console.log()
