[BUG]: Llama Sharp LLamaEmbedder Chunking #1011
Comments
The embedder does not currently do any chunking of large documents - it simply takes all of the content you feed it and processes it in one go. It's up to you to ensure that's small enough to fit within the configured
Regarding this statement:
I assume that I can set UBatchSize to the context size of the selected model, and BatchSize to a larger value than UBatchSize. So if I have a custom chunking mechanism that splits on a fixed token count smaller than BatchSize, it should work. But based on this GitHub issue (#921), BatchSize cannot be different from UBatchSize.
UBatchSize doesn't really have anything to do with the context size of the model; instead, it simply sets how much work the GPU will do at once. So if you have e.g.
However, this isn't relevant to embedding models because of:
    if (@params.UBatchSize != @params.BatchSize)
        throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", nameof(@params));
The UBatch mechanism doesn't support non-causal models; all the work must be processed at once. That restriction is simply copied from llama.cpp here.
This sounds right to me. I think you should be able to:
So I am curious: what exactly is batching doing?
Batching is a fairly low-level implementation detail which makes processing large amounts of data more efficient. For example, if you want the model to process a large prompt before generating some text, it's more efficient to process that prompt in a single large batch than to process it one. token. at. a. time. Alternatively, if you're generating multiple different sequences all at once (e.g. 100 parallel conversations), rather than processing each conversation one at a time, you can process all 100 simultaneously in a batch.
However, the GPU can't always handle a whole batch. None of this is really relevant to embedding though! Since there's that
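To make the batching idea above concrete, here is a minimal C# sketch that only counts how many passes are needed to cover a prompt at a given batch size. It is a conceptual illustration, not how llama.cpp or LLamaSharp actually schedule work; `BatchingSketch`, `SplitIntoBatches` and `CountPasses` are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of LLamaSharp: illustrates batching arithmetic only.
public static class BatchingSketch
{
    // Yields (start, length) ranges that cover tokenCount tokens
    // in consecutive batches of at most batchSize tokens each.
    // Note: because this is an iterator, the argument check runs on first enumeration.
    public static IEnumerable<(int Start, int Length)> SplitIntoBatches(int tokenCount, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < tokenCount; start += batchSize)
            yield return (start, Math.Min(batchSize, tokenCount - start));
    }

    // Number of passes needed to process tokenCount tokens at the given batch size.
    public static int CountPasses(int tokenCount, int batchSize)
    {
        int passes = 0;
        foreach (var batch in SplitIntoBatches(tokenCount, batchSize))
            passes++;
        return passes;
    }
}
```

With a batch size of 1, a 1,000-token prompt takes 1,000 passes; with a batch size of 512 it takes two (512 tokens, then 488). For the embedding case discussed in this thread, the whole input has to be processed at once, so the input would need to fit in a single pass.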
Description
I am using the following code to generate embeddings from very large documents. Of course, each document's tokens can exceed the maximum context of the selected model, so the tokens should be split into chunks, an embedding computed for each chunk, and the embeddings averaged at the end.
I have tested the above with multiple settings for BatchSize and UBatchSize (like 512 or 2048) but I always get the following error:
Then I even tried creating my own token chunking method and average calculator, which works perfectly with the Azure OpenAI ADA model. I incorporated this with LLamaSharp as well, but the same error still occurred even when a chunk was less than 512 tokens.
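For reference, the chunk-then-average approach described above could look roughly like the sketch below. It is not the code from this issue and does not call LLamaSharp: `ChunkedEmbedding` and its methods are hypothetical names, and the `embedChunk` delegate stands in for whatever embedder you use. Weighting each chunk by its token count is an assumption of this sketch, so that a short final chunk counts for less; pass equal weights to get a plain mean.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LLamaSharp.
public static class ChunkedEmbedding
{
    // Split a token sequence into consecutive chunks of at most maxTokens tokens.
    public static List<int[]> ChunkTokens(IReadOnlyList<int> tokens, int maxTokens)
    {
        if (maxTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxTokens));

        var chunks = new List<int[]>();
        for (int start = 0; start < tokens.Count; start += maxTokens)
        {
            int length = Math.Min(maxTokens, tokens.Count - start);
            var chunk = new int[length];
            for (int i = 0; i < length; i++)
                chunk[i] = tokens[start + i];
            chunks.Add(chunk);
        }
        return chunks;
    }

    // Weighted mean of the per-chunk embeddings (one weight per embedding).
    public static float[] WeightedAverage(IReadOnlyList<float[]> embeddings, IReadOnlyList<int> weights)
    {
        if (embeddings.Count == 0 || embeddings.Count != weights.Count)
            throw new ArgumentException("Need at least one embedding and one weight per embedding.");

        int dim = embeddings[0].Length;
        var sum = new double[dim];
        double totalWeight = 0;
        for (int k = 0; k < embeddings.Count; k++)
        {
            if (embeddings[k].Length != dim)
                throw new ArgumentException("All embeddings must have the same dimension.");
            for (int d = 0; d < dim; d++)
                sum[d] += embeddings[k][d] * (double)weights[k];
            totalWeight += weights[k];
        }
        if (totalWeight <= 0)
            throw new ArgumentException("Total weight must be positive.");
        return sum.Select(v => (float)(v / totalWeight)).ToArray();
    }

    // Chunk the tokens, embed each chunk via the supplied delegate, then average.
    public static float[] EmbedLongDocument(
        IReadOnlyList<int> tokens, int maxTokens, Func<int[], float[]> embedChunk)
    {
        var chunks = ChunkTokens(tokens, maxTokens);
        var embeddings = chunks.Select(embedChunk).ToList();
        var weights = chunks.Select(c => c.Length).ToList();
        return WeightedAverage(embeddings, weights);
    }
}
```

Each chunk handed to `embedChunk` is at most `maxTokens` long, so choosing `maxTokens` no larger than the embedder's batch size keeps every call within a single batch.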
I tested the following language models:
I tested using the following LlamaSharp versions:
- 0.19.0
- 0.18.0
Note that in another GitHub issue (#921), martindevans mentions that "embedder can't split input into multiple batches at the moment".
So are LLamaEmbedder and LLama.Native.LLamaPoolingType.Mean not supposed to do the chunking and averaging?
Reproduction Steps
Try to embed a small text like "This is a test".
Then try to embed a large text, such as a document, using the batch and mean parameters.
Environment & Configuration
Known Workarounds
No response