Describe the bug
When using DocumentSplitter with a PDF and split_by="passage", the result is always one document. The PDF is converted using pypdf.
Expected behavior
My understanding is that it splits on at least two line breaks (\n\n).
Additional context
When I tested with a plain text file, it split correctly.
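One way to narrow this down is to check whether the text extracted from the PDF contains any \n\n at all. The following is a minimal sketch, assuming that passage splitting behaves like a plain split on \n\n (this is not confirmed against the DocumentSplitter source). The helper split_passages and both sample strings are hypothetical and only illustrate the difference between text with blank lines and text with single line breaks.

```python
def split_passages(text: str) -> list[str]:
    """Split on blank lines (two consecutive newlines), dropping empty pieces."""
    return [piece for piece in text.split("\n\n") if piece.strip()]


# Hypothetical samples: a text file with blank lines between paragraphs,
# and PDF-like extracted text where only single newlines survived.
plain_text = "Rule one.\n\nRule two.\n\nRule three."
pdf_like_text = "Rule one.\nRule two.\nRule three."

print(len(split_passages(plain_text)))     # 3
print(len(split_passages(pdf_like_text)))  # 1

# repr() makes the line breaks visible, which helps when inspecting
# the content of a converted document.
print(repr(pdf_like_text))
```

If the repr of the converted PDF content shows only single \n characters, a single resulting document would be consistent with the behavior described above.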
To Reproduce
from pathlib import Path

# Import path assumes Haystack 2.x.
from haystack.components.preprocessors import DocumentSplitter

# file_type_router, text_file_converter, pdf_converter and markdown_converter
# are assumed to be instantiated earlier (not shown in this report).
dir = '...'
files = [
    {"filename": "rules.pdf", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
    {"filename": "rules.txt", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
]
for file in files:
    # set the filepath
    file_path = Path(dir) / file["filename"]
    # route the file by MIME type, then convert it with the matching converter
    router_res = file_type_router.run(sources=[file_path])
    txt_docs = []
    if 'text/plain' in router_res:
        txt_docs = text_file_converter.run(sources=router_res['text/plain'])
    elif 'application/pdf' in router_res:
        txt_docs = pdf_converter.run(sources=router_res['application/pdf'])
    elif 'text/markdown' in router_res:
        txt_docs = markdown_converter.run(sources=router_res['text/markdown'])
    # build a splitter from the per-file settings
    document_splitter = DocumentSplitter(
        split_by=file['meta']['split_by'],
        split_length=file['meta']['split_length'],
        split_overlap=file['meta']['split_overlap'],
        split_threshold=file['meta']['split_threshold'],
    )
    # split only the first converted document and print the number of resulting documents
    splitter_res = document_splitter.run([txt_docs['documents'][0]])
    print(len(splitter_res['documents']))
System: