
Implement chunking #2136

Open
shinjanc opened this issue Dec 17, 2024 · 6 comments
Labels
question Further information is requested

Comments

@shinjanc

Question

I want to ingest 150-200 files of 15-20 pages each, query them, and have answers generated from multiple files. At present it only quotes 2 sources. Is chunking the way out? How do I implement it, and could you share code for the same, please?

@shinjanc shinjanc added the question Further information is requested label Dec 17, 2024
@jaluma
Collaborator

jaluma commented Jan 7, 2025

What do you want chunking to do? Right now, chunking is by sentence, so each document should already generate N chunks that are independently retrievable. If not enough chunks are being retrieved, you should increase `similarity_top_k` in ChatService.
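For reference, a minimal sketch of what that change could look like in PrivateGPT's settings.yaml (key names taken from recent PrivateGPT releases; check the settings file shipped with your version, as they may differ):

```yaml
# settings.yaml -- retrieval settings (sketch, not a drop-in config)
rag:
  similarity_top_k: 10       # number of chunks retrieved per query; default is small (2)
  # similarity_value: 0.45   # optional score cutoff; leave commented out to disable filtering
```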

@shinjanc
Author

shinjanc commented Jan 7, 2025

I have already updated all parameters so that privateGPT quotes at least 10 sources. Currently, I have implemented:

- Chunking: sentence chunking
- LLM: Llama 3.1 8B
- Embedding: nomic-embed-text
- Vector store: Qdrant
- Context window: 32000

I am still getting only 4 sources. I want to maximise my sources. How do I achieve this?
I have tried multiple things and asked in the Discord community, but no one has helped.

@shinjanc
Author

shinjanc commented Jan 7, 2025 via email

@jaluma
Collaborator

jaluma commented Jan 8, 2025

The LLM and context window won't change anything in the search part. Can you change or comment out the similarity value to check? You can change it in settings.yaml. If that doesn't work, you should switch your embedding model to another one with more representational capacity.
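A sketch of both suggested settings.yaml changes, assuming the HuggingFace embedding mode available in recent PrivateGPT releases (the model name below is illustrative, not a recommendation):

```yaml
# settings.yaml -- sketch: disable the similarity cutoff and swap the embedding model
rag:
  # similarity_value: 0.45          # commented out = no score filtering
  similarity_top_k: 10

embedding:
  mode: huggingface
huggingface:
  embedding_hf_model_name: BAAI/bge-small-en-v1.5   # example alternative model
```

Remember to re-ingest your documents after changing the embedding model, since existing vectors in Qdrant were produced by the old model.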

@shinjanc
Author

shinjanc commented Jan 8, 2025

The similarity value is already disabled.

@vinodsuresh95

This could be a possible way:

1. Fine-tune chunking:
   Currently you are using sentence-level chunking, which might not produce enough distinct chunks for retrieval. Switch to paragraph-based chunking or a hybrid approach where sentences are grouped into small paragraphs (e.g., 3-5 sentences per chunk).

2. Adjust similarity retrieval:
   Increase `similarity_top_k` in your ChatService settings to ensure more chunks are retrieved. For example, if it's currently set to 10, try increasing it to 15 or 20.

3. Embed smaller, more granular chunks:
   Shorter chunks may lead to more precise embeddings, improving the diversity of retrieved sources. However, avoid chunks that are too small, as they might lose context.

4. Enhance the embedding model:
   If the embedding model (nomic-embed-text) is not capturing sufficient semantic relationships, switch to a model with stronger representational capacity, such as OpenAI's text-embedding-ada-002 or Cohere embeddings.

5. Combine embeddings across documents:
   Ensure embeddings from all documents are queried together by augmenting the query pipeline to pull from multiple documents explicitly.
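The hybrid chunking from step 1 can be sketched in plain Python. This is a minimal illustration, not PrivateGPT's actual chunker: the sentence splitting here is a naive regex, and a real pipeline would use the sentence splitter bundled with your ingest framework.

```python
import re


def chunk_by_sentence_groups(text: str, sentences_per_chunk: int = 4) -> list[str]:
    """Group sentences into small paragraph-sized chunks.

    Splits naively after ., ! or ? followed by whitespace, then joins
    consecutive sentences into chunks of `sentences_per_chunk`.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = "One. Two. Three. Four. Five. Six."
print(chunk_by_sentence_groups(sample))
# ['One. Two. Three. Four.', 'Five. Six.']
```

Larger groups mean fewer, more context-rich chunks (fewer distinct sources retrieved); smaller groups mean more chunks but less context per chunk, so tune `sentences_per_chunk` against your `similarity_top_k`.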
