Tutorial: pipeline with promptnode - maximum QA database size? #5971
For the tutorial https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode#defining-the-pipeline
@wolfgangihloff You're right. The length of the input passed to the PromptNode is limited by the model's maximum token length. That could be 32768 tokens with gpt-4-32k, but in the example it's 4097 with text-davinci-003.

The way to make use of a larger knowledge base is a retriever that selects the most relevant documents (or pages of your PDF, if it is very long) and passes only those to the PromptNode. This is called a retrieval-augmented generation (RAG) pipeline, and the retriever is part of the tutorial you linked. What may be unclear after just reading the tutorial is that long PDFs with potentially thousands of pages are preprocessed earlier and split into many smaller documents: the documents we store in Haystack's document stores are automatically sized down to ~512 tokens. Here is a tutorial about preprocessing: https://haystack.deepset.ai/tutorials/08_preprocessing
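To make the trade-off concrete, here is a back-of-the-envelope sketch of the token budget that forces the retriever to pick only a few ~512-token documents per query. The model context sizes come from the numbers above; the answer and prompt-overhead budgets are illustrative assumptions, not Haystack defaults.

```python
# Rough token-budget arithmetic behind choosing the retriever's top_k.
# Context sizes are from the discussion; the other numbers are assumptions.
MODEL_CONTEXT = {"text-davinci-003": 4097, "gpt-4-32k": 32768}

DOC_TOKENS = 512        # documents in the store are sized down to ~512 tokens
ANSWER_BUDGET = 256     # tokens reserved for the generated answer (assumption)
PROMPT_OVERHEAD = 100   # tokens for the template and question (assumption)

def max_top_k(model: str) -> int:
    """How many ~512-token documents fit into the model's context window."""
    available = MODEL_CONTEXT[model] - ANSWER_BUDGET - PROMPT_OVERHEAD
    return available // DOC_TOKENS

print(max_top_k("text-davinci-003"))  # only a handful of documents fit
print(max_top_k("gpt-4-32k"))         # a much larger context leaves more room
```

So even with thousands of preprocessed documents in the store, only the few most relevant ones are ever placed in the prompt, which is exactly what the retriever is for.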