I propose an implementation of speculative fill-in-the-middle (FIM). We now take advantage of the time the server sits idle while a suggestion is displayed to fetch the next completion, assuming the user accepts the current one. This gives the illusion of (near) zero-latency suggestions, providing a better user experience.
It works by calculating the prefix, suffix, and prompt as if the user had already accepted the current suggestion. Apart from the new prefix, suffix, and prompt, the request is nearly identical to `llama#fim()`. After the new FIM completion is fetched, we cache it instead of displaying it to the user, so when they accept the current suggestion, the next completion is a cache hit.

I'm aware that a lot of the code in `llama#fim()` is duplicated in `llama#speculate()`. In most cases I would say it's better to modularize, but here I think having two separate implementations makes the code more readable, more maintainable, and less messy. I'm open to suggestions on both the modularization and the implementation. TAB, TAB, TAB!
Demo: `Screen.Recording.2025-01-24.at.7.22.55.PM.mov`