Splitting docs into paragraphs #1596
Hi guys, I've just started playing with haystack and it's not splitting my docx files into paragraphs. Based on the lines referenced below, it looks like this isn't possible. Am I missing something?
haystack/haystack/file_converter/docx.py, lines 49 to 53 in bd823c9
haystack/haystack/preprocessor/utils.py, lines 294 to 300 in bd823c9
Replies: 1 comment 1 reply
Hi @kamilpz, I can recommend two resources as starting points for reading about document splitting.
In the preprocessing tutorial, the split_by="word" parameter is set, but you can easily change it to split_by="passage": https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb
The file converter converts your docx file into a plain-text string; the preprocessor then splits long text into smaller chunks that the neural network models can process. The splitting can respect word, sentence, or paragraph (passage) boundaries. Within haystack, we call these smaller chunks "documents" and store them in document stores, so your original docx file will most likely be stored as multiple documents within haystack.
I hope this answers your question. If not, feel free to describe in more detail what you would like to achieve and what your use case is, and I'd be happy to help.
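To make the word / sentence / passage distinction from the reply concrete, here is a rough plain-Python sketch of the idea. It is not haystack's implementation: the helper name split_text and its split_length argument are made up for illustration, and it assumes that passages are separated by blank lines and that sentences end in ".", "!" or "?".

```python
import re
from typing import List


def split_text(text: str, split_by: str = "passage", split_length: int = 1) -> List[str]:
    """Hypothetical helper: cut `text` into chunks of `split_length` units.

    Assumption: a "passage" is a block of text delimited by a blank line.
    """
    if split_by == "passage":
        units = re.split(r"\n\s*\n", text)       # blank line = paragraph break
        joiner = "\n\n"
    elif split_by == "sentence":
        units = re.split(r"(?<=[.!?])\s+", text)  # naive sentence boundary
        joiner = " "
    elif split_by == "word":
        units = text.split()                      # whitespace-separated tokens
        joiner = " "
    else:
        raise ValueError(f"unsupported split_by value: {split_by!r}")

    # Drop empty units, then group every `split_length` units into one chunk.
    units = [u.strip() for u in units if u.strip()]
    return [
        joiner.join(units[i:i + split_length])
        for i in range(0, len(units), split_length)
    ]


sample = "First paragraph. It has two sentences.\n\nSecond paragraph."

passages = split_text(sample, split_by="passage")
sentences = split_text(sample, split_by="sentence")
words = split_text(sample, split_by="word", split_length=3)
```

Note that paragraph splitting of this kind only works if the converted text still contains the paragraph separators; if the converter joins paragraphs with a different delimiter, the passage pattern would need to match that instead.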