
Implement chunking #2136

Open
shinjanc opened this issue Dec 17, 2024 · 6 comments
Labels
question Further information is requested

Comments

@shinjanc

Question

I want to ingest 150-200 files of 15-20 pages each, query them, and have answers generated from multiple files. At present it only quotes 2 sources. Is chunking the way out? How do I implement it, and could you share code for the same, please?

@shinjanc shinjanc added the question Further information is requested label Dec 17, 2024
@jaluma
Collaborator

jaluma commented Jan 7, 2025

What do you want chunking to do? Right now, chunking is by sentence, so each document should already generate N chunks that are independently retrievable. If not enough chunks are being retrieved, you should increase `similarity_top_k` in ChatService.
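For reference, a minimal sketch of what that change could look like in PrivateGPT's settings.yaml (key names taken from recent PrivateGPT releases; check the settings file shipped with your version, as they may differ):

```yaml
# settings.yaml -- retrieval settings (sketch, not a drop-in config)
rag:
  similarity_top_k: 10       # number of chunks retrieved per query; default is small (2)
  # similarity_value: 0.45   # optional score cutoff; leave commented out to disable filtering
```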

@shinjanc
Author

shinjanc commented Jan 7, 2025

I have already updated all parameters so that privateGPT quotes at least 10 sources. Currently, I have implemented:

- Chunking: sentence chunking
- LLM: Llama 3.1 8B
- Embedding: nomic-embed-text
- Vector store: Qdrant
- Context window: 32000

I am still getting only 4 sources. I want to maximise my sources. How do I achieve this?
I have tried multiple things and asked in the Discord community, but no one has helped.

@shinjanc
Author

shinjanc commented Jan 7, 2025 via email

@jaluma
Collaborator

jaluma commented Jan 8, 2025

The LLM and context window won't change anything in the search part. Can you change or comment out the similarity value to check? You can change it in settings.yaml. If that doesn't work, you should switch your embedding model to another one with more representational capacity.
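A sketch of both suggested settings.yaml changes, assuming the HuggingFace embedding mode available in recent PrivateGPT releases (the model name below is illustrative, not a recommendation):

```yaml
# settings.yaml -- sketch: disable the similarity cutoff and swap the embedding model
rag:
  # similarity_value: 0.45          # commented out = no score filtering
  similarity_top_k: 10

embedding:
  mode: huggingface
huggingface:
  embedding_hf_model_name: BAAI/bge-small-en-v1.5   # example alternative model
```

Remember to re-ingest your documents after changing the embedding model, since existing vectors in Qdrant were produced by the old model.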

@shinjanc
Author

shinjanc commented Jan 8, 2025

The similarity value is already disabled.

@vinodsuresh95

This could be a possible way:

1. Fine-tune chunking:
   Currently you are using sentence-level chunking, which might not produce enough distinct chunks for retrieval. Switch to paragraph-based chunking or a hybrid approach where sentences are grouped into small paragraphs (e.g., 3-5 sentences per chunk).

2. Adjust similarity retrieval:
   Increase `similarity_top_k` in your ChatService settings to ensure more chunks are retrieved. For example, if it's currently set to 10, try increasing it to 15 or 20.

3. Embed smaller, more granular chunks:
   Shorter chunks may lead to more precise embeddings, improving the diversity of retrieved sources. However, avoid chunks that are too small, as they might lose context.

4. Enhance the embedding model:
   If the embedding model (nomic-embed-text) is not capturing sufficient semantic relationships, switch to a model with stronger representational capacity, such as OpenAI's text-embedding-ada-002 or Cohere embeddings.

5. Combine embeddings across documents:
   Ensure embeddings from all documents are queried together by augmenting the query pipeline to pull from multiple documents explicitly.
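The hybrid chunking from step 1 can be sketched in plain Python. This is a minimal illustration, not PrivateGPT's actual chunker: the sentence splitting here is a naive regex, and a real pipeline would use the sentence splitter bundled with your ingest framework.

```python
import re


def chunk_by_sentence_groups(text: str, sentences_per_chunk: int = 4) -> list[str]:
    """Group sentences into small paragraph-sized chunks.

    Splits naively after ., ! or ? followed by whitespace, then joins
    consecutive sentences into chunks of `sentences_per_chunk`.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = "One. Two. Three. Four. Five. Six."
print(chunk_by_sentence_groups(sample))
# ['One. Two. Three. Four.', 'Five. Six.']
```

Larger groups mean fewer, more context-rich chunks (fewer distinct sources retrieved); smaller groups mean more chunks but less context per chunk, so tune `sentences_per_chunk` against your `similarity_top_k`.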
