Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

rmrbytes · 2024-10-24T10:07:13Z

Describe the bug
When using DocumentSplitter on a PDF with split_by="passage", the result is always a single document. The PDF is converted with pypdf.

Expected behavior
My understanding is that "passage" splits wherever there are at least two consecutive line breaks (\n\n), so a PDF with several paragraphs should produce several documents.
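
As a rough illustration of that expectation (plain Python, not Haystack's actual implementation): splitting on a blank line gives one chunk per paragraph, while text that only contains single line breaks stays in one piece. The helper name `split_passages` and both sample strings are made up for this sketch.

```python
def split_passages(text: str) -> list[str]:
    # Split on blank lines (two consecutive newlines) and drop empty chunks.
    return [chunk for chunk in text.split("\n\n") if chunk.strip()]


# Hypothetical sample: paragraphs separated by a blank line.
with_blank_lines = "Rule 1: be kind.\n\nRule 2: be brief.\n\nRule 3: be clear."
# Hypothetical sample: same content, but only single line breaks.
single_breaks_only = "Rule 1: be kind.\nRule 2: be brief.\nRule 3: be clear."

print(len(split_passages(with_blank_lines)))    # 3
print(len(split_passages(single_breaks_only)))  # 1
```

If the text extracted from the PDF looks like the second sample, a single resulting document would be consistent with this kind of splitting.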

Additional context
When I tested with a plain text file, it split correctly.

To Reproduce

from pathlib import Path

from haystack.components.preprocessing import DocumentSplitter

# file_type_router, text_file_converter, pdf_converter and markdown_converter
# are Haystack components created earlier (not shown here).

dir = '...'
files = [
    {"filename": "rules.pdf", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
    {"filename": "rules.txt", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
]
for file in files:
    # set the filepath
    file_path = Path(dir) / file["filename"]
    # route the file by MIME type, then convert it with the matching converter
    router_res = file_type_router.run(sources=[file_path])
    txt_docs = []
    if 'text/plain' in router_res:
        txt_docs = text_file_converter.run(sources=router_res['text/plain'])
    elif 'application/pdf' in router_res:
        txt_docs = pdf_converter.run(sources=router_res['application/pdf'])
    elif 'text/markdown' in router_res:
        txt_docs = markdown_converter.run(sources=router_res['text/markdown'])
    # split the first converted document using the per-file settings
    document_splitter = DocumentSplitter(
        split_by=file['meta']['split_by'],
        split_length=file['meta']['split_length'],
        split_overlap=file['meta']['split_overlap'],
        split_threshold=file['meta']['split_threshold'],
    )
    splitter_res = document_splitter.run([txt_docs['documents'][0]])
    # number of documents produced by the splitter (always 1 for rules.pdf)
    print(len(splitter_res['documents']))
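
One way to check whether the converted PDF text contains any passage separators at all is to count them in the document content before splitting. This is a small sketch in plain Python; `describe_breaks` and the sample string are made up here, and the sample only assumes (it is not confirmed in this report) that the PDF text comes back with single line breaks and form feeds between pages. In the script above, the input would be `txt_docs['documents'][0].content`.

```python
def describe_breaks(text: str) -> dict:
    # Count the characters that matter for passage splitting.
    return {
        "chars": len(text),
        "newlines": text.count("\n"),
        "blank_line_breaks": text.count("\n\n"),
        "form_feeds": text.count("\f"),
    }


# Hypothetical sample standing in for text extracted from a two-page PDF.
sample = "Page one line A\nPage one line B\fPage two line A\n"
print(describe_breaks(sample))
print(repr(sample[:40]))  # repr() makes \n and \f visible
```

If `blank_line_breaks` is 0 for the PDF content but greater than 0 for the .txt content, the difference would come from the converted text rather than from the splitter settings.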

System:

OS: Mac OS 14.6.1
GPU/CPU: CPU
Haystack version (commit or version number): 2.6.0
DocumentStore: Chromadb
Splitter: DocumentSplitter
