
llama.vim: speculative fim #31

Open

wants to merge 3 commits into master

Conversation

@VJHack (Collaborator) commented Jan 25, 2025

I propose an implementation of speculative fill-in-the-middle (FIM). While a suggestion is displayed, the server sits idle; we now use that time to fetch the next completion under the assumption that the user accepts the current one. This gives the illusion of (near) zero-latency suggestions and a better user experience.

It works by calculating the prefix, suffix, and prompt as if the user had already accepted the current suggestion. Apart from that new prefix, suffix, and prompt, it is nearly identical to llama#fim(). Once the new FIM completion is fetched, we cache it instead of displaying it, so when the user accepts the current suggestion, the next completion is a cache hit.
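
To make the idea concrete, here is a minimal sketch of the speculative request. It is illustrative only: s:cache, s:speculate(), the plain-curl transport via Neovim's jobstart(), and the simplified prefix/prompt handling are assumptions for the sketch, not the PR's actual code.

let s:cache = {}

" The server replies with JSON; keep only the generated text, keyed by the
" context the request was built for. Nothing is displayed here.
function! s:on_speculative_response(ctx, job_id, data, event) abort
    let l:response = json_decode(join(a:data, "\n"))
    let s:cache[a:ctx] = get(l:response, 'content', '')
endfunction

" Fire a second FIM request whose context pretends the currently displayed
" suggestion has already been accepted, and cache the server's response.
function! s:speculate(prefix, prompt, suffix, suggestion) abort
    " Simplification: fold the current line and the pending suggestion into
    " the prefix; llama.vim splits prefix and prompt differently.
    let l:ctx_prefix = a:prefix . a:prompt . a:suggestion

    let l:request = json_encode({
        \ 'input_prefix': l:ctx_prefix,
        \ 'input_suffix': a:suffix,
        \ 'prompt':       '',
        \ 'cache_prompt': v:true,
        \ })

    " Assumes the default llama-server /infill endpoint.
    call jobstart(['curl', '-s', '--data', l:request, 'http://127.0.0.1:8012/infill'], {
        \ 'stdout_buffered': v:true,
        \ 'on_stdout': function('s:on_speculative_response', [l:ctx_prefix]),
        \ })
endfunction

When the user accepts the suggestion, the next completion pass can first look up s:cache with the context it is about to request; on a hit the suggestion can be shown immediately, with no round trip to the server.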

I'm aware that a lot of the code in llama#fim() is duplicated in llama#speculate(). In most cases I would say it's better to modularize code, but here I think having two separate implementations makes it more readable, maintainable, and less messy.

I'm open to suggestions on modularization and the implementation. TAB, TAB, TAB!

[video attachment: Screen.Recording.2025-01-24.at.7.22.55.PM.mov]

@VJHack requested a review from ggerganov on January 25, 2025, 02:06
@ggerganov (Member)

I tested this for a bit but it does not seem to work correctly. For example:

I just got a suggestion:

[screenshot of the displayed suggestion]

The full completion here from the logs is:

{
    "content":"esult = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       "
...
}

The speculative request again from the logs is:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int r\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n",
    "top_p": 0.99,
    "input_extra": [
        {
            "filename": "test.cpp",
            "time": [
                446730,
                1939646170
            ],
            "text": "#include <cstdio>\n\n// print hello\nint main(int argc, char *argv[]) {\n    printf(\"\");\n    return 0;\n}\n"
        }
    ],
    ...
    "prompt": "       ",
    "cache_prompt": true
}

This is correct. However, accepting the suggestion above does not lead to a cache hit:

[screenshot]

It sent this request instead:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       \n        if (is_prime(i)) {\n            result++;\n        }\n",
    ...
    "prompt": "    ",
    "cache_prompt": true
}

So something is not OK.
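
For illustration only (the keying below is an assumption, not the PR's actual lookup code): if cached entries are keyed by the exact request context, the two requests above produce different keys, since both input_prefix and prompt differ, so the request sent after accepting the suggestion cannot hit the speculative entry.

" Hypothetical key helper: any difference in prefix, suffix or prompt between
" the speculative request and the request actually sent after acceptance
" yields a different key, hence a cache miss.
function! s:cache_key(input_prefix, input_suffix, prompt) abort
    return sha256(a:input_prefix . "\x01" . a:input_suffix . "\x01" . a:prompt)
endfunction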

@ggerganov (Member) commented Jan 25, 2025

"I'm open to suggestions on modularizing and the implementation."

In general, I think the implementation should be improved like this:

  • Server-communication logic
    • The client sends async requests
    • Each received response is processed and stored in the cache
    • Triggers display events when changes in the cache occur
  • Display logic
    • Upon display events, get the local cursor state and search for it in the cache
    • If it finds a match: display it as a suggestion
    • If it does not: do nothing

Such an implementation will allow us to add more complex caching and prediction mechanisms, because the rendering will be decoupled from the server communication. Currently this is not the case, and there are a lot of places where we can make a mistake.
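
For illustration, here is a rough sketch of that split. All names (s:cache, the User autocommand, s:context_key_at_cursor(), s:show_suggestion()) are hypothetical placeholders rather than llama.vim's actual API.

let s:cache = {}

" Server-communication side: every async response only updates the cache and
" then signals that the cache changed.
function! s:on_response(ctx_key, response) abort
    let s:cache[a:ctx_key] = json_decode(a:response)['content']
    doautocmd User LlamaCacheUpdated
endfunction

" Display side: recompute the key from the current cursor state and show a
" suggestion only if the cache already holds a completion for it.
function! s:on_display_event() abort
    let l:key = s:context_key_at_cursor()            " hypothetical helper
    if has_key(s:cache, l:key)
        call s:show_suggestion(s:cache[l:key])       " hypothetical helper
    endif
    " No match: do nothing and wait for the next cache update.
endfunction

" Cache updates and cursor movement both count as display events here.
augroup llama_speculative_sketch
    autocmd!
    autocmd User LlamaCacheUpdated call s:on_display_event()
    autocmd CursorMovedI * call s:on_display_event()
augroup END

With this split, speculative prefetching is just another async request that fills the cache; the display pass never needs to know whether a completion came from a regular or a speculative request, and the two code paths no longer have to be kept in sync by hand.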
