Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: DOCXToDocument: add table extraction #8457

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented Oct 15, 2024

Why:

Enhances functionality for converting DOCX documents by improving the extraction of document elements, including tables, while maintaining page breaks. This addresses limitations in accurately capturing the structured content of DOCX files for further processing.

What:

  • Introduced _extract_elements which consolidates the extraction of paragraphs and tables from a DOCX file.
  • Refactored existing methods to support the new extraction logic, allowing for better handling of page breaks and table markdown representation.
  • Updated test cases to validate the correct functionality of document conversion involving tables and ensure meta information is retained accurately.
  • Existing unit tests not modified to ensure everything is kosher as before

How can it be used:

  • The new implementation provides a way to extract both text and tables from DOCX documents efficiently:
docx_converter.run(sources=paths)
  • The extracted content presents both paragraphs and tables formatted in markdown, preserving the original flow and structure of the document:
| This | Is     | Just a |
| ---- | ------ | ------ |
| 2020 | Random | Table  |
  • Markdown text table format is selected because it is the most suitable for LLM table representation (open to other options)

How did you test it:

  • Conducted unit tests to verify the core functionality of the DOCX-to-document conversion mechanism. This included:
    • Validating document content extraction with tables.
    • Checking that all necessary metadata attributes are preserved.
  • Additional tests ensure that extracted content maintains the original order, especially around tables, confirming that text before and after remains intact.

Notes for the reviewer:

  • Focus on the modifications in the extraction logic
  • Check the updated test cases involving mixed content (text and tables) within DOCX files.
  • Review the markdown conversion accuracy, especially for tables

Fixes DC-2720

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Oct 15, 2024
@coveralls
Copy link
Collaborator

coveralls commented Oct 15, 2024

Pull Request Test Coverage Report for Build 11518638027

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 90.59%

Files with Coverage Reduction New Missed Lines %
components/routers/file_type_router.py 1 98.36%
Totals Coverage Status
Change from base Build 11463116725: 0.1%
Covered Lines: 7615
Relevant Lines: 8406

💛 - Coveralls

@vblagoje vblagoje marked this pull request as ready for review October 15, 2024 15:13
@vblagoje vblagoje requested review from a team as code owners October 15, 2024 15:13
@vblagoje vblagoje requested review from dfokina, Amnah199, medsriha, a team, silvanocerza and sjrl and removed request for a team, Amnah199 and silvanocerza October 15, 2024 15:13
@vblagoje
Copy link
Member Author

Perhaps not 100% there yet but let's start iterating @sjrl and @medsriha

@vblagoje
Copy link
Member Author

@medsriha any updates on this? Have you tried it out?

@medsriha
Copy link
Member

@medsriha any updates on this? Have you tried it out?

Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.

Copy link
Member

@medsriha medsriha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of test units; otherwise, this is neat 🔥

@vblagoje
Copy link
Member Author

Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

@sjrl
Copy link
Contributor

sjrl commented Oct 21, 2024

Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

@vblagoje I think we should make it configurable so let the user choose between md and csv. We have found that LLMs can work well with both with maybe a bit more consistency on csv since there are many different md format versions and not all md versions appear to work well.

@vblagoje
Copy link
Member Author

Ok, deal @sjrl I'll add option to create table as csv, add unit tests and ping you for the final review 🙏

@vblagoje
Copy link
Member Author

@sjrl @medsriha this one should be ready to go now with both csv and markdown table output support configurable via init parameter. LMK your thoughts.

haystack/components/converters/docx.py Outdated Show resolved Hide resolved
haystack/components/converters/docx.py Outdated Show resolved Hide resolved
Copy link

@abrahamy abrahamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@julian-risch julian-risch added this to the 2.7.0 milestone Oct 23, 2024
@vblagoje
Copy link
Member Author

@sjrl @shadeMe I rolled back to previous commit and then added the last two commits.

@shadeMe
Copy link
Collaborator

shadeMe commented Oct 24, 2024

@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff b'ween commits since their last review.

haystack/components/converters/docx.py Outdated Show resolved Hide resolved
haystack/components/converters/docx.py Outdated Show resolved Hide resolved
The deserialized component.
"""
# Convert the table_format string back to enum before passing to the constructor
if "init_parameters" in data and "table_format" in data["init_parameters"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two keys are always going to be present - we can remove this check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, this table_format key won't be present in existing serialized pipelines - we should still check for that. Sorry about the confusion.

haystack/components/converters/docx.py Outdated Show resolved Hide resolved
"init_parameters": {"table_format": "csv"},
}

def test_from_dict(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test that serializes a pipeline to YAML and reloads it.

vblagoje and others added 2 commits October 24, 2024 13:07
Co-authored-by: Madeesh Kannan <[email protected]>
@julian-risch julian-risch removed this from the 2.7.0 milestone Oct 24, 2024
@vblagoje vblagoje requested a review from shadeMe October 25, 2024 07:29
pipeline = Pipeline()
converter = DOCXToDocument(table_format=DOCXTableFormat.MARKDOWN)
pipeline.add_component("converter", converter)
assert pipeline.to_dict() == {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test needs to serialize to YAML and reload it.

if "init_parameters" in data and "table_format" in data["init_parameters"]:
data["init_parameters"]["table_format"] = TableFormat.from_str(data["init_parameters"]["table_format"])

data["init_parameters"]["table_format"] = DOCXTableFormat.from_str(data["init_parameters"]["table_format"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need the check for the table format key. See above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add table extraction inDOCXToDocument
7 participants