feat: DOCXToDocument: add table extraction #8457
base: main
Pull Request Test Coverage Report for Build 11518638027
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
@medsriha any updates on this? Have you tried it out?
Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.
I added a couple of unit tests; otherwise, this is neat 🔥
@vblagoje I think we should make it configurable and let the user choose between md and csv. We have found that LLMs work well with both, with perhaps a bit more consistency on csv, since there are many different md table variants and not all of them appear to work well.
Ok, deal @sjrl. I'll add an option to create tables as csv, add unit tests, and ping you for the final review 🙏
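For readers following the md-versus-csv discussion, here is a minimal sketch of what a configurable table renderer could look like. It is not the code from this PR: `TableFormat`, `render_table`, and `_escape_cell` are hypothetical names, and the sketch assumes the table has already been extracted into a list of rows of cell strings.

```python
import csv
import io
from enum import Enum
from typing import List


class TableFormat(Enum):
    """Hypothetical enum for this sketch; not the PR's actual class."""

    MARKDOWN = "markdown"
    CSV = "csv"


def _escape_cell(cell: str) -> str:
    # Escape pipes and flatten newlines so a cell stays inside its Markdown column.
    return cell.replace("|", "\\|").replace("\n", " ")


def render_table(rows: List[List[str]], table_format: TableFormat) -> str:
    """Render already-extracted rows as CSV or as a pipe-style Markdown table."""
    if not rows:
        return ""
    if table_format is TableFormat.CSV:
        buffer = io.StringIO()
        # The csv module handles quoting of commas and quotes inside cells.
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    # Markdown: treat the first row as the header (an assumption of this sketch).
    header, *body = rows
    lines = [
        "| " + " | ".join(_escape_cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines.extend("| " + " | ".join(_escape_cell(c) for c in row) + " |" for row in body)
    return "\n".join(lines)
```

The CSV branch leans on the standard library for quoting, which is one reason CSV output tends to be more uniform than hand-built Markdown tables.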
🚀
Force-pushed: fb962b9 to 1286bed
@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff between commits since their last review.
The deserialized component.
"""
# Convert the table_format string back to enum before passing to the constructor
if "init_parameters" in data and "table_format" in data["init_parameters"]:
Those two keys are always going to be present - we can remove this check.
Whoops, this table_format key won't be present in existing serialized pipelines - we should still check for that. Sorry about the confusion.
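The backward-compatibility point can be sketched as follows. This is an illustration under assumptions, not the merged implementation: the `DOCXTableFormat` body and the `restore_table_format` helper are written for this example, and only the `from_str` name and the "csv" / MARKDOWN members come from the diff fragments in this thread.

```python
from enum import Enum
from typing import Any, Dict


class DOCXTableFormat(Enum):
    """Stand-in for the enum named in the diff; the body here is an assumption."""

    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(value: str) -> "DOCXTableFormat":
        try:
            return DOCXTableFormat(value)
        except ValueError as exc:
            supported = [fmt.value for fmt in DOCXTableFormat]
            raise ValueError(f"Unknown table format '{value}'. Supported: {supported}") from exc


def restore_table_format(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: convert the serialized string to the enum only if present."""
    init_params = data.get("init_parameters", {})
    # Pipelines serialized before this option existed have no "table_format" key,
    # so the guard leaves them untouched and the constructor default applies.
    if "table_format" in init_params:
        init_params["table_format"] = DOCXTableFormat.from_str(init_params["table_format"])
    return data
```

Without the guard, loading an older serialized pipeline would raise a `KeyError` on the missing key, which is the case the second comment above corrects.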
"init_parameters": {"table_format": "csv"},
}

def test_from_dict(self):
Test that serializes a pipeline to YAML and reloads it.
Co-authored-by: Madeesh Kannan
pipeline = Pipeline()
converter = DOCXToDocument(table_format=DOCXTableFormat.MARKDOWN)
pipeline.add_component("converter", converter)
assert pipeline.to_dict() == {
This test needs to serialize to YAML and reload it.
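A round-trip test of the kind requested might look roughly like the sketch below. To keep it runnable with only the standard library, `json` stands in for YAML here, and `SketchConverter` and `SketchTableFormat` are hypothetical stand-ins for the real component and enum. The point it illustrates carries over: the enum has to be written out as a plain string in `to_dict` and restored in `from_dict`, otherwise the text serialization step fails or the reloaded value has the wrong type.

```python
import json
from enum import Enum
from typing import Any, Dict


class SketchTableFormat(Enum):
    """Hypothetical enum used only in this sketch."""

    MARKDOWN = "markdown"
    CSV = "csv"


class SketchConverter:
    """Hypothetical component with the same to_dict/from_dict shape as in the diff."""

    def __init__(self, table_format: SketchTableFormat = SketchTableFormat.CSV):
        self.table_format = table_format

    def to_dict(self) -> Dict[str, Any]:
        # Store the enum's string value so the dict can be dumped as text.
        return {"type": "SketchConverter", "init_parameters": {"table_format": self.table_format.value}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchConverter":
        params = dict(data.get("init_parameters", {}))
        if "table_format" in params:
            params["table_format"] = SketchTableFormat(params["table_format"])
        return cls(**params)


def round_trip(converter: SketchConverter) -> SketchConverter:
    """Serialize to text and load it back (json stands in for YAML in this sketch)."""
    serialized = json.dumps(converter.to_dict())
    return SketchConverter.from_dict(json.loads(serialized))


def test_round_trip_preserves_table_format() -> None:
    restored = round_trip(SketchConverter(table_format=SketchTableFormat.MARKDOWN))
    assert restored.table_format is SketchTableFormat.MARKDOWN
```

Comparing only `to_dict()` output, as the test above the comment does, would not catch a `from_dict` that forgets to convert the string back to the enum.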
if "init_parameters" in data and "table_format" in data["init_parameters"]:
data["init_parameters"]["table_format"] = TableFormat.from_str(data["init_parameters"]["table_format"])

data["init_parameters"]["table_format"] = DOCXTableFormat.from_str(data["init_parameters"]["table_format"])
We still need the check for the table format key. See above.
Why:
Enhances functionality for converting DOCX documents by improving the extraction of document elements, including tables, while maintaining page breaks. This addresses limitations in accurately capturing the structured content of DOCX files for further processing.
DOCXToDocument #8416
What:
_extract_elements, which consolidates the extraction of paragraphs and tables from a DOCX file.
How can it be used:
How did you test it:
Notes for the reviewer:
Fixes DC-2720