Option for passing in longer context fragments, stored in a deduped table #617
Pros of this approach:
Cons:
|
Alternative approaches:
That last option is actually reasonably elegant. It feels a bit nasty to me to have a whole bunch of very short values that are stored in a separate table, but it may be that most prompts and system prompts are long enough that this might be worthwhile anyway. |
Ideally whatever mechanism I use here should work for both If using existing attachments we could invent artificial content types of |
Question: in my current database how many responses have duplicates in prompts or system prompts? |
Simple dupe detection query:
sqlite-utils "$(llm logs path)" "
select substr(prompt, 1, 20), length(prompt), count(*), length(prompt) * count(*) as size
from responses
group by prompt
having count(*) > 1 order by size desc" -t
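A rough way to turn that query into a single number is to add up the characters that would disappear if each duplicated value were stored only once. This is a minimal sketch using the standard library `sqlite3` module against an in-memory stand-in for the `responses` table; the helper name `duplicate_savings` and the sample rows are made up for illustration.

```python
import sqlite3


def duplicate_savings(conn, column="prompt"):
    """Return (number of duplicated values, characters saved by storing each once)."""
    # The column name is interpolated into the SQL, so only allow known names
    if column not in ("prompt", "system"):
        raise ValueError(f"unexpected column: {column}")
    rows = conn.execute(
        f"""
        select length({column}), count(*)
        from responses
        where {column} is not null
        group by {column}
        having count(*) > 1
        """
    ).fetchall()
    # Each duplicated value is kept once, so (n - 1) copies are saved
    saved = sum(length * (n - 1) for length, n in rows)
    return len(rows), saved


# Illustrative data only, not taken from a real logs database
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table responses (id integer primary key, prompt text, system text)"
)
conn.executemany(
    "insert into responses (prompt, system) values (?, ?)",
    [
        ("a" * 100, "be brief"),
        ("a" * 100, "be brief"),
        ("a" * 100, None),
        ("unique", None),
    ],
)
print(duplicate_savings(conn, "prompt"))
print(duplicate_savings(conn, "system"))
```

Note that SQLite's `length()` counts characters for text values, so this is an approximation of bytes on disk rather than an exact figure.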
|
I think I need to prototype the approach where every prompt and system string is de-duped into a separate foreign key table, then run it against a copy of my database and compare the sizes. |
Prototype:
cp "$(llm logs path)" /tmp/logs.db
sqlite-utils extract /tmp/logs.db responses system --table prompt --rename system prompt --fk-column system_id
sqlite-utils extract /tmp/logs.db responses prompt
|
The resulting DB was even bigger for some reason. A lot of the prompts/system prompts are duplicated in the |
Oh here's why it's bigger, it put an index on the new table:
CREATE UNIQUE INDEX [idx_prompt_prompt]
ON [prompt] ([prompt]);
I really need to do the thing where there's a hash column which has an index on it and is used for lookups, but the foreign key remains an integer so we don't bloat the |
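The hash-column idea could look something like the following. This is only a sketch under assumptions: the table name `contexts`, the column `prompt_id` and the helper `get_or_create_context` are hypothetical, and it uses the standard library `sqlite3` module rather than sqlite-utils. The unique index ends up on the short hash instead of the full text, while rows that reference the text use a plain integer foreign key.

```python
import hashlib
import sqlite3

# Hypothetical schema: "contexts" and "prompt_id" are illustrative names only
SCHEMA = """
create table contexts (
    id integer primary key,
    hash text not null unique,  -- the unique index covers the 64-char hash only
    content text not null
);
create table responses (
    id integer primary key,
    prompt_id integer references contexts(id),
    response text
);
"""


def get_or_create_context(conn, text):
    """Return the integer ID for text, inserting it only if it is new."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "select id from contexts where hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "insert into contexts (hash, content) values (?, ?)", (digest, text)
    )
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
long_prompt = "Full documentation text... " * 1000
for _ in range(3):
    # Three responses share one stored copy of the long prompt
    conn.execute(
        "insert into responses (prompt_id, response) values (?, ?)",
        (get_or_create_context(conn, long_prompt), "example response"),
    )
print(conn.execute("select count(*) from contexts").fetchone()[0])
```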
Dropping that index and running vacuum dropped the DB size from 170MB to 144MB, but the original is 136MB, so we still gained 8MB somehow. Probably just because we needed a bunch of extra pages for the new table. |
Given these problems with the |
Or I could have optional |
If I insert the prompts as utf-8 binary data into that Inserting them as text DOES work, but it feels really gnarly to have a BLOB column that sometimes has text in it, plus in the code Lines 20 to 26 in febbc04
|
Cleanest option here would be to have a Google Gemini models actually DO accept plain text attachments already: https://ai.google.dev/gemini-api/docs/document-processing?lang=rest
|
So maybe this ticket is actually about supporting plain text attachments and having them work across ALL models, in addition to whatever is going on here with Gemini. I wonder if any other models out there support attachments like |
I started playing with a I currently run that like this:
curl -s "https://raw.githubusercontent.com/simonw/llm-docs/refs/heads/main/version-docs/$(llm --version | cut -d' ' -f3).txt" | \
  llm -m gpt-4o-mini 'how do I embed a binary file?'
|
Here's a much simpler idea. I introduce two new
And maybe two more that do the same for system prompts:
In addition, I treat In all cases the text is anticipated to be longer than a regular prompt, and hence should be de-duplicated and stored in a separate table. That table might be called This avoids the confusion that would come from reusing the existing attachments mechanism, and the UI for it feels pretty natural to me - especially since |
Current database tables are:
I'm going to call the new table |
How should these affect the existing One option: templates work exactly as they do right now, so when you do But... this actually brings up the point that templates right now are pretty inefficient. Using a template 100 times will drop 100 copies of that template into the Fixing this is probably a separate issue. I could start having templates automatically use this mechanism - being treated effectively as piped in / I could make it so replacement params work against these I can do that as a separate piece of work. |
If I'm doing this - supporting longer prompts using But as I started prototyping that out I realized that an alternative rule could simply be that any prompts over the length of X characters are automatically stored in that table instead. This also isn't a question I can get wrong. If I decide to make all prompts longer than 50 characters stored in |
... in which case the |
Here's as far as I got prototyping the
diff --git a/llm/cli.py b/llm/cli.py
index 6a6fb2c..0f5f112 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -147,6 +147,16 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option("file", "-f", "--file", type=click.File(), help="Read prompt from file")
+@click.option("url", "-u", "--url", help="Read prompt from URL")
+@click.option(
+    "system_file",
+    "--sf",
+    "--system-file",
+    type=click.File(),
+    help="Read system prompt from file",
+)
+@click.option("system_url", "--su", "--system-url", help="Read system prompt from URL")
 @click.option(
     "attachments",
     "-a",
@@ -203,6 +213,10 @@ def prompt(
     prompt,
     system,
     model_id,
+    file,
+    url,
+    system_file,
+    system_url,
     attachments,
     attachment_types,
     options,
@@ -245,6 +259,16 @@ def prompt(
     def read_prompt():
         nonlocal prompt
 
+        prompt_bits = []
+
+        # Is there a file to read from?
+        if file:
+            prompt_bits.append(file.read())
+        if url:
+            response = httpx.get(url)
+            response.raise_for_status()
+            prompt_bits.append(response.text)
+
         # Is there extra prompt available on stdin?
         stdin_prompt = None
         if not sys.stdin.isatty():
@@ -258,10 +282,12 @@ def prompt(
 
         if (
             prompt is None
-            and not save
+            and file is None
+            and url is None
             and sys.stdin.isatty()
             and not attachments
             and not attachment_types
+            and not save
         ):
             # Hang waiting for input to stdin (unless --save)
             prompt = sys.stdin.read()
|
I think I need to change this code that populates the Lines 221 to 238 in 5d1d723
And code that reads from that table in two places: Lines 685 to 706 in 5d1d723
Lines 284 to 319 in 5d1d723
|
I could even define a new |
But what do I do about the I worked up this little system with the help of ChatGPT Code Interpreter: https://chatgpt.com/share/672d84c0-7360-8006-8333-f7e743237942
def apply_replacements(obj, replacements):
    if isinstance(obj, dict):
        return {k: apply_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [apply_replacements(item, replacements) for item in obj]
    elif isinstance(obj, str):
        replaced_parts = []
        last_index = 0
        found = False
        for key, value in replacements.items():
            index = obj.find(key)
            while index != -1:
                found = True
                if index > last_index:
                    replaced_parts.append(obj[last_index:index])
                replaced_parts.append(value)
                last_index = index + len(key)
                index = obj.find(key, last_index)
        if found:
            if last_index < len(obj):
                replaced_parts.append(obj[last_index:])
            return {"$r": replaced_parts}
        else:
            return obj
    else:
        return obj


def reverse_replacements(obj, replacements):
    return _reverse_replacements(obj, {v: k for k, v in replacements.items()})


def _reverse_replacements(obj, replacements):
    if isinstance(obj, dict):
        if "$r" in obj:
            # Reconstruct the original string from the list
            return "".join(
                (replacements[part] if isinstance(part, int) else part)
                for part in obj["$r"]
            )
        else:
            return {k: _reverse_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_reverse_replacements(item, replacements) for item in obj]
    else:
        return obj

So now you can do this:
from llm.utils import apply_replacements, reverse_replacements

replacements = {"This is a pretty long string at this point": 1, "And so is this one": 2}
json_object = {
    "foo": {
        "bar": {
            "baz": "this includes This is a pretty long string at this point",
            "qux": [44, "And so is this one"],
            "quux": "This has This is a pretty long string at this point and And so is this one as well"
        }
    }
}
replaced = apply_replacements(json_object, replacements)
orig = reverse_replacements(replaced, replacements)

And the
{'foo': {'bar': {'baz': {'$r': ['this includes ', 1]},
                 'quux': {'$r': ['This has ', 1, ' and ', 2, ' as well']},
                 'qux': [44, {'$r': [2]}]}}}
|
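One thing to watch in the string branch of `apply_replacements`: it walks the keys one at a time while sharing a single `last_index`, so it appears to rely on the keys occurring in the string in the same order as they appear in the `replacements` dict. A possible order-independent variant of just that branch, scanning for all keys in one pass with a regular expression, is sketched below. The helper name `replace_in_string` is hypothetical and this is not the code from the branch.

```python
import re


def replace_in_string(text, replacements):
    """Split text into literal parts and integer IDs for any matched keys."""
    # Ignore empty keys, which would otherwise match at every position
    keys = [key for key in replacements if key]
    if not keys:
        return text
    # Try longer keys first so a key that contains another key wins
    keys.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    parts = []
    last = 0
    for match in pattern.finditer(text):
        if match.start() > last:
            parts.append(text[last:match.start()])
        parts.append(replacements[match.group(0)])
        last = match.end()
    if not parts:
        # Nothing matched, keep the plain string
        return text
    if last < len(text):
        parts.append(text[last:])
    return {"$r": parts}


replacements = {"LONG ONE": 1, "LONG TWO": 2}
# The second key occurs before the first one in this string
print(replace_in_string("x LONG TWO y LONG ONE", replacements))
print(replace_in_string("nothing to replace", replacements))
```

The output uses the same `{"$r": [...]}` shape, so `reverse_replacements` should be able to read it back unchanged.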
The last remaining challenge is search. The system currently has a Lines 164 to 168 in 5d1d723
That design doesn't work if some of the prompts are in I could probably denormalize this, but that seems like it could be wasteful since the whole point of this exercise is to avoid storing the same long prompts hundreds of times... but if we store hundreds of copies in the FTS index table have we really improved things? So I likely need a separate I think I'll have to accept a reduction in relevance scoring because of this. That's OK - the search feature isn't really a signature big deal, it's more of a convenience. I don't think it matters too much if the relevance scoring is a bit out of whack. |
Note that |
I think the migrations I added broke |
Pushed my WIP to a branch: b2fce50 |
I'm second-guessing the design for this feature again now. As implemented this kicks in by magic if the prompt is longer than a certain threshold: Lines 230 to 236 in b2fce50
This assumes that prompts and system prompts will be entirely duplicated occasionally. But maybe that's not the smartest way to do this. What if instead these reusable contexts could be specified by the user directly and could even be concatenated together? Imagine being able to do something like this: llm -f readme.md -f docs/usage.md 'How do I install this?' Here we are concatenating two files together. Those files could be stored in two records in |
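To make the concatenation idea concrete, here is a minimal sketch of how several files attached to one prompt might be stored once each and linked in order. Everything in it is an assumption for illustration: the `fragments` and `prompt_fragments` table names, the `position` column, and joining the pieces with newlines when rebuilding the full prompt.

```python
import hashlib
import sqlite3

# All table and column names here are hypothetical
SCHEMA = """
create table fragments (id integer primary key, hash text unique, content text);
create table responses (id integer primary key, prompt text);
create table prompt_fragments (
    response_id integer references responses(id),
    fragment_id integer references fragments(id),
    position integer,
    primary key (response_id, position)
);
"""


def ensure_fragment(conn, content):
    """Store content once, keyed by its sha256 hash, and return its ID."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    conn.execute(
        "insert or ignore into fragments (hash, content) values (?, ?)",
        (digest, content),
    )
    return conn.execute(
        "select id from fragments where hash = ?", (digest,)
    ).fetchone()[0]


def log_prompt(conn, prompt, fragments):
    """Record a prompt plus an ordered list of fragment texts."""
    response_id = conn.execute(
        "insert into responses (prompt) values (?)", (prompt,)
    ).lastrowid
    for position, content in enumerate(fragments):
        conn.execute(
            "insert into prompt_fragments values (?, ?, ?)",
            (response_id, ensure_fragment(conn, content), position),
        )
    return response_id


def full_prompt(conn, response_id):
    """Rebuild the text sent to the model: fragments in order, then the prompt."""
    rows = conn.execute(
        """
        select f.content from prompt_fragments pf
        join fragments f on f.id = pf.fragment_id
        where pf.response_id = ? order by pf.position
        """,
        (response_id,),
    ).fetchall()
    prompt = conn.execute(
        "select prompt from responses where id = ?", (response_id,)
    ).fetchone()[0]
    return "\n".join([row[0] for row in rows] + [prompt])


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = log_prompt(conn, "How do I install this?", ["README text", "USAGE text"])
second = log_prompt(conn, "What license is this?", ["README text"])
print(full_prompt(conn, first))
```

With this layout the README text is stored once even though two prompts use it, and a prompt that reuses only one of the files does not need its own copy of the combined text.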
Relevant: https://twitter.com/alexalbert__/status/1857457290917589509
|
... if I DO go the llm -m claude-3.5-sonnet -f http://docs.anthropic.com/llms-full.txt 'how do I request a stream with curl?' Note that http://docs.anthropic.com/llms-full.txt is a redirect to the https version, so following redirects is important - I just tried this: curl 'http://docs.anthropic.com/llms-full.txt' | llm -m claude-3-5-haiku-latest 'a haiku about this document' And got:
|
I am going to call these fragments. The CLI option will be I am going to expose these in the Python API as well - as a Open question: how about letting fragments be specified by ID as well? That way users could store them deliberately in the database. |
There is definitely something interesting about being able to run a prompt that references multiple fragments. Imagine something like this: llm -f datasette -f llm "ideas for Datasette plugins powered by LLM" Where How to tell the difference between |
The only reason to expose fragments in the Python |
This design implies the need for a family of CLI commands for managing fragments and their aliases. Maybe:
# list (truncated) stored fragments
llm fragments list
# show one (takes alias, ID or hash)
llm fragments show ALIAS
# search them
llm fragments list -q LLM
# add one with an alias
llm fragments set llm ./llm.txt
Can you remove fragments? Doing so could break existing DB records that rely on them - but maybe there's private information in them that you didn't mean to save? I think you can remove them, and doing so will cause those affected records to show that the fragments have been removed while still returning the logged response.
# Remove stored fragment
llm fragments remove llm
|
So the fragments table has:
I might have a created date time on there too. And maybe even a source URL or filename. |
One thing this new design does not take into account is system prompts. Maybe Would also need a |
Made a start on this here - I've implemented the schema and the code such that prompts are aware of it, but I'm not yet persisting the fragments to the new database tables. Lines 90 to 133 in 6c355c1
|
A slight oddity with this design for aliases: if you set the alias of a piece of content to one thing, then set a different alias for the same byte-for-byte piece of content from elsewhere, the original alias will be forgotten. That feels confusing, and doesn't match with how aliases work for models where one model can have multiple aliases. Potential solutions:
That 3rd option is the least surprising in my opinion. It also removes the slightly weird unique-but-nullable |
db["fragment_aliases"].create(
    {
        "alias": str,
        "fragment_id": int,
    },
    # sqlite-utils expects a list of (column, other_table, other_column) tuples
    foreign_keys=[("fragment_id", "fragments", "id")],
    pk=("alias",),
)
The primary key IS the alias, since you can't have multiple aliases with the same name. |
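A separate aliases table means several aliases can point at the same stored content without one replacing the other. The sketch below shows that behaviour with the standard library `sqlite3` module rather than sqlite-utils. The `fragments` columns shown and the helpers `ensure_fragment`, `set_alias` and `resolve_alias` are assumptions for illustration, not the implementation in the branch.

```python
import hashlib
import sqlite3

# Simplified, hypothetical versions of the two tables
SCHEMA = """
create table fragments (id integer primary key, hash text unique, content text);
create table fragment_aliases (
    alias text primary key,
    fragment_id integer references fragments(id)
);
"""


def ensure_fragment(conn, content):
    """Store content once, keyed by its sha256 hash, and return its ID."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    conn.execute(
        "insert or ignore into fragments (hash, content) values (?, ?)",
        (digest, content),
    )
    return conn.execute(
        "select id from fragments where hash = ?", (digest,)
    ).fetchone()[0]


def set_alias(conn, alias, content):
    """Point alias at content, re-pointing it if the alias already exists."""
    fragment_id = ensure_fragment(conn, content)
    conn.execute(
        "insert or replace into fragment_aliases (alias, fragment_id) values (?, ?)",
        (alias, fragment_id),
    )
    return fragment_id


def resolve_alias(conn, alias):
    """Return the content for alias, or None if the alias is unknown."""
    row = conn.execute(
        """
        select f.content from fragment_aliases a
        join fragments f on f.id = a.fragment_id
        where a.alias = ?
        """,
        (alias,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Two different aliases for byte-for-byte identical content
set_alias(conn, "llm", "LLM docs")
set_alias(conn, "llm-docs", "LLM docs")
print(resolve_alias(conn, "llm"), resolve_alias(conn, "llm-docs"))
```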
I think removing an alias should use garbage collection rules: if the aliased fragment is referenced by at least one other alias OR is referenced in the |
I'm not going to bother implementing garbage collection for the moment. |
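For later reference, the garbage collection rule could be sketched roughly like this. It assumes the rule reads as "delete the fragment only when no other alias and no logged prompt still references it", and it uses a hypothetical `prompt_fragments` link table, since the real table name is not settled here.

```python
import sqlite3

# Hypothetical, simplified schema for illustration
SCHEMA = """
create table fragments (id integer primary key, content text);
create table fragment_aliases (alias text primary key, fragment_id integer);
create table prompt_fragments (response_id integer, fragment_id integer);
"""


def remove_alias(conn, alias):
    """Delete alias; return True if its fragment was garbage collected too."""
    row = conn.execute(
        "select fragment_id from fragment_aliases where alias = ?", (alias,)
    ).fetchone()
    if row is None:
        return False
    fragment_id = row[0]
    conn.execute("delete from fragment_aliases where alias = ?", (alias,))
    # Keep the fragment if another alias or any logged prompt still uses it
    still_used = conn.execute(
        """
        select exists(select 1 from fragment_aliases where fragment_id = :id)
            or exists(select 1 from prompt_fragments where fragment_id = :id)
        """,
        {"id": fragment_id},
    ).fetchone()[0]
    if still_used:
        return False
    conn.execute("delete from fragments where id = ?", (fragment_id,))
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "insert into fragments (id, content) values (?, ?)",
    [(1, "docs"), (2, "readme")],
)
conn.executemany(
    "insert into fragment_aliases values (?, ?)",
    [("a", 1), ("b", 1), ("c", 2)],
)
# Fragment 2 is also referenced by a logged prompt
conn.execute("insert into prompt_fragments values (100, 2)")

results = [remove_alias(conn, alias) for alias in ("a", "b", "c")]
remaining = [row[0] for row in conn.execute("select id from fragments order by id")]
print(results, remaining)
```

Here fragment 1 survives the removal of its first alias, is deleted with its last one, and fragment 2 is kept because a logged prompt still points at it.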
i see you just touched this but for what its worth a few days ago i tried to add a text file as an attachment assuming it would work and was surprised it didnt. was the mental model i had. so this is good. |
The attachments feature makes it easy to attach video and audio from a URL or file while ensuring that even if that prompt component is used many times over it only ends up stored once in the database - in the attachments table using a sha256 hash as the ID: llm/llm/models.py
Lines 28 to 39 in febbc04
I find myself wanting the same thing for blocks of text: I'd like to construct e.g. a context that contains the full docs for LLM and run multiple prompts against that without storing many copies of the same exact text in the database.
One possible solution would be to re-use the attachments mechanism, with a new default plain text attachment type that all models support:
llm -m gpt-4o-mini \
  -a https://gist.githubusercontent.com/simonw/f7b251a05b834a9b60edff7e06d31572/raw/3ad2fb03d57aad7b616ec40455c355ec51d2e3db/llm-docs.md \
  'how do I embed a file?'