[Cache] Add cachedPrefixes for caching repeated system prompts #664
base: main
Conversation
Thank you for the hard work! Added some comments. Please add an E2E example under examples/. I will take another pass afterwards. Thanks again!
@@ -114,6 +115,7 @@ export interface MLCEngineConfig {
   initProgressCallback?: InitProgressCallback;
   logitProcessorRegistry?: Map<string, LogitProcessor>;
   logLevel?: LogLevel;
+  cachedPrefixes?: ChatCompletionMessageParam[][];
Let's add docs to MLCEngineConfig specifying the behavior of cachedPrefixes (e.g. it will prefill when loading the engine to create the prefixes' KV cache, and will only dispose of these KV entries when the engine is reloaded). Perhaps we can also mark this as experimental to signify potential future API/behavior changes.
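A sketch of what such a doc comment might look like (the wording and the use of an `@experimental` tag are suggestions from this review, not code from the PR):

```typescript
export interface MLCEngineConfig {
  // ...
  /**
   * Lists of message prefixes (e.g. shared system prompts) whose KV cache
   * entries are precomputed when the engine loads.
   *
   * Behavior (as discussed in review): each prefix is prefilled once at
   * engine load time to create its KV cache entries; these entries are only
   * disposed when the engine is reloaded.
   *
   * @experimental This API may change in future releases.
   */
  cachedPrefixes?: ChatCompletionMessageParam[][];
}
```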
Could you also add an examples/cached_prefixes? There we can demonstrate the prefill time difference between using cachedPrefixes and not using it. We should also test whether the behavior is as expected in multi-turn conversations.
if (this.seqIdToPrefix.size === 0) {
  this.fclearKVCaches(this.kvCache);
} else {
  this.fKVCacheRemoveSequence!(this.kvCache, new tvmjs.Scalar(0, "int64"));
Now that we have multiple sequence IDs, let's make a constant, say CHAT_SEQUENCE_ID = 0 (or maybe a better name), instead of using a magic number 0 that may be hard to keep track of.
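A minimal sketch of the suggested change (the constant name comes from this review comment, and the helper below is illustrative, not the PR's actual code):

```typescript
// Reserved sequence ID for the main chat conversation; cached prefixes get
// other sequence IDs. The name CHAT_SEQUENCE_ID is the reviewer's suggestion.
const CHAT_SEQUENCE_ID = 0;

// Illustrative call site: the magic number 0 is replaced by the constant,
// making it clear which sequence is the live chat vs. a cached prefix.
function isChatSequence(seqId: number): boolean {
  return seqId === CHAT_SEQUENCE_ID;
}
```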
// If a match is found, fork the sequence
if (matchedSeqId !== -1 && maxMatchedLen > 0) {
  console.log(
Use log.info() instead of console.log()
  this.tvm.endScope();
} else if (seqID !== 0) {
  // If no match is found, add the new sequence to the KV cache
  console.log("Adding prefix to KV cache: ", seqID);
Use log.info() instead of console.log()
This PR adds the cachedPrefixes field in MLCEngineConfig, allowing users to cache system prompts when creating MLCEngine. It reduces redundant processing of repeated instructions.

Example usage in CreateMLCEngine:
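A sketch of such a call, based on the ChatCompletionMessageParam[][] type added in this PR (the model ID and system prompt below are placeholder values, not taken from the PR):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Sketch only: cachedPrefixes takes a list of message lists, so a shared
// system prompt can be prefilled once when the engine loads and its KV cache
// reused across requests. Model ID and prompt text are placeholders.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  cachedPrefixes: [
    [{ role: "system", content: "You are a helpful assistant." }],
  ],
});
```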