Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Open
rmrbytes opened this issue Oct 24, 2024 · 0 comments


Describe the bug
When using DocumentSplitter on a PDF with split_by="passage", the result is always a single document. The PDF is converted with pypdf.

Expected behavior
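My understanding is that a passage split breaks the text wherever there are at least two consecutive line breaks (\n\n), so a PDF with several paragraphs should yield several documents.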

Additional context
When I tested with a plain text file, the splitting seemed to work correctly.
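To make the difference between the two inputs concrete, here is a minimal sketch of splitting on \n\n. It is a simplified stand-in written with the standard library, not Haystack's actual implementation, and the two sample strings are hypothetical (they are not taken from rules.pdf or rules.txt). One possible explanation, not confirmed in this report, is that the text extracted from the PDF contains only single line breaks, in which case there is nothing to split on.

```python
def split_passages(text: str) -> list[str]:
    """Simplified stand-in for a passage split: break on blank lines (\\n\\n)."""
    return [part for part in text.split("\n\n") if part.strip()]


# Hypothetical samples: one with blank lines between paragraphs,
# one with single line breaks only (as extracted PDF text might look).
plain_text = "Rule 1: be kind.\n\nRule 2: be brief.\n\nRule 3: be clear."
pdf_like_text = "Rule 1: be kind.\nRule 2: be brief.\nRule 3: be clear."

print(len(split_passages(plain_text)))     # number of passages found
print(len(split_passages(pdf_like_text)))  # number of passages found
print(repr(pdf_like_text))                 # repr() makes the line breaks visible
```

Printing repr() of the converted PDF document's content (for example the first few hundred characters of txt_docs['documents'][0].content in the script below) would show whether any \n\n separators survive the conversion.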

To Reproduce

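from pathlib import Path

from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.routers import FileTypeRouter

# Assumed setup: the report does not show how these components were created.
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
markdown_converter = MarkdownToDocument()

dir = '...'
files = [
    {"filename": "rules.pdf", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
    {"filename": "rules.txt", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
]
for file in files:
    # set the filepath
    file_path = Path(dir) / file["filename"]
    # route the file by MIME type, then convert it with the matching converter
    router_res = file_type_router.run(sources=[file_path])
    txt_docs = []
    if 'text/plain' in router_res:
        txt_docs = text_file_converter.run(sources=router_res['text/plain'])
    elif 'application/pdf' in router_res:
        txt_docs = pdf_converter.run(sources=router_res['application/pdf'])
    elif 'text/markdown' in router_res:
        txt_docs = markdown_converter.run(sources=router_res['text/markdown'])
    # build a splitter from the per-file settings
    document_splitter = DocumentSplitter(
        split_by=file['meta']['split_by'],
        split_length=file['meta']['split_length'],
        split_overlap=file['meta']['split_overlap'],
        split_threshold=file['meta']['split_threshold'],
    )
    # split the first converted document and print how many chunks came back
    splitter_res = document_splitter.run([txt_docs['documents'][0]])
    print(len(splitter_res['documents']))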

System:

  • OS: macOS 14.6.1
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 2.6.0
  • DocumentStore: Chromadb
  • Splitter: DocumentSplitter