
llama.vim: speculative fim #31

Open

wants to merge 3 commits into master

Conversation

@VJHack (Collaborator) commented Jan 25, 2025

I propose an implementation of speculative fill-in-the-middle (FIM). While a suggestion is displayed, the server sits idle; we now use that time to fetch the next completion under the assumption that the user accepts the current one. This gives the illusion of (near) zero-latency suggestions and a better user experience.

It works by calculating the prefix, suffix, and prompt as if the user had already accepted the current suggestion. Apart from that new prefix, suffix, and prompt, it is nearly identical to llama#fim(). Once the new FIM completion is fetched, we cache it instead of displaying it, so when the user accepts the current suggestion, the next completion is a cache hit.
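
To make the idea concrete, here is a minimal sketch of the speculative request. It is illustrative only: s:cache, s:speculate(), the plain-curl transport via Neovim's jobstart(), and the simplified prefix/prompt handling are assumptions for the sketch, not the PR's actual code.

let s:cache = {}

" The server replies with JSON; keep only the generated text, keyed by the
" context the request was built for. Nothing is displayed here.
function! s:on_speculative_response(ctx, job_id, data, event) abort
    let l:response = json_decode(join(a:data, "\n"))
    let s:cache[a:ctx] = get(l:response, 'content', '')
endfunction

" Fire a second FIM request whose context pretends the currently displayed
" suggestion has already been accepted, and cache the server's response.
function! s:speculate(prefix, prompt, suffix, suggestion) abort
    " Simplification: fold the current line and the pending suggestion into
    " the prefix; llama.vim splits prefix and prompt differently.
    let l:ctx_prefix = a:prefix . a:prompt . a:suggestion

    let l:request = json_encode({
        \ 'input_prefix': l:ctx_prefix,
        \ 'input_suffix': a:suffix,
        \ 'prompt':       '',
        \ 'cache_prompt': v:true,
        \ })

    " Assumes the default llama-server /infill endpoint.
    call jobstart(['curl', '-s', '--data', l:request, 'http://127.0.0.1:8012/infill'], {
        \ 'stdout_buffered': v:true,
        \ 'on_stdout': function('s:on_speculative_response', [l:ctx_prefix]),
        \ })
endfunction

When the user accepts the suggestion, the next completion pass can first look up s:cache with the context it is about to request; on a hit the suggestion can be shown immediately, with no round trip to the server.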

I'm aware that a lot of the code in llama#fim() is duplicated in llama#speculate(). In most cases I would say it's better to modularize code, but here I think having two separate implementations makes it more readable, maintainable, and less messy.

I'm open to suggestions on modularization and the implementation. TAB, TAB, TAB!

[video attachment: Screen.Recording.2025-01-24.at.7.22.55.PM.mov]

@VJHack requested a review from ggerganov on January 25, 2025, 02:06
@ggerganov (Member)

I tested this for a bit but it does not seem to work correctly. For example:

I just got a suggestion:

[screenshot of the displayed suggestion]

The full completion here from the logs is:

{
    "content":"esult = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       "
...
}

The speculative request again from the logs is:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int r\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n",
    "top_p": 0.99,
    "input_extra": [
        {
            "filename": "test.cpp",
            "time": [
                446730,
                1939646170
            ],
            "text": "#include <cstdio>\n\n// print hello\nint main(int argc, char *argv[]) {\n    printf(\"\");\n    return 0;\n}\n"
        }
    ],
    ...
    "prompt": "       ",
    "cache_prompt": true
}

This is correct. However, accepting the suggestion above does not lead to a cache hit:

[screenshot]

It sent this request instead:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       \n        if (is_prime(i)) {\n            result++;\n        }\n",
    ...
    "prompt": "    ",
    "cache_prompt": true
}

So something is not OK.
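
For illustration only (the keying below is an assumption, not the PR's actual lookup code): if cached entries are keyed by the exact request context, the two requests above produce different keys, since both input_prefix and prompt differ, so the request sent after accepting the suggestion cannot hit the speculative entry.

" Hypothetical key helper: any difference in prefix, suffix or prompt between
" the speculative request and the request actually sent after acceptance
" yields a different key, hence a cache miss.
function! s:cache_key(input_prefix, input_suffix, prompt) abort
    return sha256(a:input_prefix . "\x01" . a:input_suffix . "\x01" . a:prompt)
endfunction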

@ggerganov (Member) commented Jan 25, 2025

"I'm open to suggestions on modularizing and the implementation."

In general, I think the implementation should be improved like this:

  • Server-communication logic
    • The client sends async requests
    • Each received response is processed and stored in the cache
    • Triggers display events when changes in the cache occur
  • Display logic
    • Upon display events, get the local cursor state and search for it in the cache
    • If it finds a match: display it as a suggestion
    • If it does not: do nothing

Such an implementation will allow us to add more complex caching and prediction mechanisms, because the rendering will be decoupled from the server communication. Currently this is not the case, and there are a lot of places where we can make a mistake.
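
For illustration, here is a rough sketch of that split. All names (s:cache, the User autocommand, s:context_key_at_cursor(), s:show_suggestion()) are hypothetical placeholders rather than llama.vim's actual API.

let s:cache = {}

" Server-communication side: every async response only updates the cache and
" then signals that the cache changed.
function! s:on_response(ctx_key, response) abort
    let s:cache[a:ctx_key] = json_decode(a:response)['content']
    doautocmd User LlamaCacheUpdated
endfunction

" Display side: recompute the key from the current cursor state and show a
" suggestion only if the cache already holds a completion for it.
function! s:on_display_event() abort
    let l:key = s:context_key_at_cursor()            " hypothetical helper
    if has_key(s:cache, l:key)
        call s:show_suggestion(s:cache[l:key])       " hypothetical helper
    endif
    " No match: do nothing and wait for the next cache update.
endfunction

" Cache updates and cursor movement both count as display events here.
augroup llama_speculative_sketch
    autocmd!
    autocmd User LlamaCacheUpdated call s:on_display_event()
    autocmd CursorMovedI * call s:on_display_event()
augroup END

With this split, speculative prefetching is just another async request that fills the cache; the display pass never needs to know whether a completion came from a regular or a speculative request, and the two code paths no longer have to be kept in sync by hand.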
