feat: DOCXToDocument: add table extraction #8457

vblagoje · 2024-10-15T12:58:34Z

Why:

Extends DOCX conversion to extract tables alongside paragraphs while preserving page breaks. This addresses a limitation in capturing the structured content of DOCX files for further processing.

Fixes feat: Add table extraction in DOCXToDocument #8416

What:

- Introduced _extract_elements, which consolidates the extraction of paragraphs and tables from a DOCX file.
- Refactored existing methods to support the new extraction logic, allowing for better handling of page breaks and Markdown table rendering.
- Updated test cases to validate document conversion involving tables and to ensure meta information is retained accurately.
- Left existing unit tests unmodified to confirm that previous behavior is unchanged.

How can it be used:

The new implementation provides a way to extract both text and tables from DOCX documents efficiently:

docx_converter.run(sources=paths)

The extracted content presents both paragraphs and tables formatted in markdown, preserving the original flow and structure of the document:

| This | Is     | Just a |
| ---- | ------ | ------ |
| 2020 | Random | Table  |

The Markdown table format was chosen because it appears to be well suited to representing tables for LLMs (open to other options).
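As a rough illustration of the table layout shown above, the following minimal sketch renders a list of rows into that shape. It is not the code from this PR: `rows_to_markdown` is a hypothetical helper, and padding short rows, escaping pipes, and treating the first row as the header are assumptions made for the example.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    # Pad short rows so every row has the same number of cells (assumption).
    n_cols = max(len(row) for row in rows)
    padded = [row + [""] * (n_cols - len(row)) for row in rows]
    # Escape pipes so cell text cannot break the column layout.
    cleaned = [[cell.replace("|", "\\|").strip() for cell in row] for row in padded]
    # Column width is the widest cell, with a minimum of 3 for the separator.
    widths = [max(3, *(len(row[i]) for row in cleaned)) for i in range(n_cols)]

    def fmt(row: List[str]) -> str:
        return "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"

    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(cleaned[0]), separator, *(fmt(r) for r in cleaned[1:])])
```

Calling `rows_to_markdown([["This", "Is", "Just a"], ["2020", "Random", "Table"]])` yields the same three lines as the table above.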

How did you test it:

Conducted unit tests to verify the core functionality of the DOCX-to-document conversion mechanism. This included:
- Validating document content extraction with tables.
- Checking that all necessary metadata attributes are preserved.
Additional tests ensure that extracted content maintains the original order, especially around tables, confirming that text before and after remains intact.

Notes for the reviewer:

- Focus on the modifications in the extraction logic.
- Check the updated test cases involving mixed content (text and tables) within DOCX files.
- Review the Markdown conversion accuracy, especially for tables.

Fixes DC-2720

coveralls · 2024-10-15T13:07:32Z

Pull Request Test Coverage Report for Build 11518638027

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.1%) to 90.59%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| components/routers/file_type_router.py | 1 | 98.36% |

Totals
Change from base Build 11463116725:	0.1%
Covered Lines:	7615
Relevant Lines:	8406

💛 - Coveralls

vblagoje · 2024-10-15T15:14:31Z

Perhaps not 100% there yet but let's start iterating @sjrl and @medsriha

vblagoje · 2024-10-17T12:50:51Z

@medsriha any updates on this? Have you tried it out?

medsriha · 2024-10-17T13:59:51Z

> @medsriha any updates on this? Have you tried it out?

Not yet :-( a bit busy with other stuff. Likely to start working on this early next week.

medsriha

I added a couple of unit tests; otherwise, this is neat 🔥

vblagoje · 2024-10-21T07:48:18Z

Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

sjrl · 2024-10-21T07:49:58Z

> Ok, thanks a lot @medsriha - let's hear from @sjrl - I read somewhere md table format is a preferred format for table input so I didn't bother with csv, wdyt?

@vblagoje I think we should make it configurable so that the user can choose between md and csv. We have found that LLMs can work well with both, with maybe a bit more consistency on csv, since there are many different md format versions and not all of them appear to work well.

vblagoje · 2024-10-21T08:42:30Z

Ok, deal @sjrl. I'll add an option to output tables as csv, add unit tests, and ping you for the final review 🙏

vblagoje · 2024-10-21T09:21:25Z

@sjrl @medsriha this one should be ready to go now with both csv and markdown table output support configurable via init parameter. LMK your thoughts.
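To make the csv/markdown choice concrete, here is a minimal sketch of how such a format switch could look. The `TableFormat` enum and `render_table` function are hypothetical stand-ins written for this example, not the converter's actual code; only the `table_format` parameter name and the two format options come from this thread.

```python
import csv
import io
from enum import Enum
from typing import List


class TableFormat(Enum):
    """Hypothetical stand-in for the converter's table format option."""

    MARKDOWN = "markdown"
    CSV = "csv"

    @classmethod
    def from_str(cls, value: str) -> "TableFormat":
        # Accept case-insensitive strings such as "csv" or "Markdown".
        try:
            return cls(value.lower())
        except ValueError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(f"Unknown table format '{value}'. Valid options: {valid}") from None


def render_table(rows: List[List[str]], table_format: TableFormat) -> str:
    """Render rows as CSV or as a simple Markdown table (first row is the header)."""
    if not rows:
        return ""
    if table_format is TableFormat.CSV:
        buffer = io.StringIO()
        # The csv module quotes cells that contain commas or quotes.
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue().rstrip("\n")
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)
```

Keeping the format as an enum rather than a free-form string means an invalid value fails early with a clear message instead of silently falling back to one of the formats.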

haystack/components/converters/docx.py

test/components/converters/test_docx_file_to_document.py

abrahamy

🚀

vblagoje · 2024-10-24T07:55:53Z

@sjrl @shadeMe I rolled back to previous commit and then added the last two commits.

shadeMe · 2024-10-24T09:28:03Z

@vblagoje Please don't force-push once reviews have been published - it breaks the reviewer's ability to diff between commits since their last review.

shadeMe · 2024-10-24T09:32:54Z

haystack/components/converters/docx.py

+            The deserialized component.
+        """
+        # Convert the table_format string back to enum before passing to the constructor
+        if "init_parameters" in data and "table_format" in data["init_parameters"]:


Those two keys are always going to be present - we can remove this check.

Whoops, this table_format key won't be present in existing serialized pipelines - we should still check for that. Sorry about the confusion.
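The backward-compatibility concern can be sketched as follows. This is an illustrative stand-in rather than the component's real `from_dict`: `TableFormat` and `restore_init_parameters` are hypothetical names, and the dictionary layout with an `init_parameters` key is taken from the diff above.

```python
from enum import Enum
from typing import Any, Dict


class TableFormat(Enum):
    """Hypothetical stand-in for the converter's table format enum."""

    MARKDOWN = "markdown"
    CSV = "csv"


def restore_init_parameters(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return init parameters, converting `table_format` to an enum only if present."""
    # Copy so the caller's serialized dictionary is not mutated.
    params = dict(data.get("init_parameters", {}))
    # Pipelines serialized before the parameter existed have no such key,
    # so the conversion has to be conditional.
    if "table_format" in params:
        params["table_format"] = TableFormat(params["table_format"])
    return params
```

With this guard, a dictionary saved by an older version loads without a `KeyError`, and the constructor's default format applies.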

shadeMe · 2024-10-24T09:35:21Z

test/components/converters/test_docx_file_to_document.py

+            "init_parameters": {"table_format": "csv"},
+        }
+
+    def test_from_dict(self):


Add a test that serializes a pipeline to YAML and reloads it.

shadeMe · 2024-10-25T11:34:15Z

test/components/converters/test_docx_file_to_document.py

+        pipeline = Pipeline()
+        converter = DOCXToDocument(table_format=DOCXTableFormat.MARKDOWN)
+        pipeline.add_component("converter", converter)
+        assert pipeline.to_dict() == {


This test needs to serialize to YAML and reload it.

shadeMe · 2024-10-25T11:34:35Z

haystack/components/converters/docx.py

-        if "init_parameters" in data and "table_format" in data["init_parameters"]:
-            data["init_parameters"]["table_format"] = TableFormat.from_str(data["init_parameters"]["table_format"])
-
+        data["init_parameters"]["table_format"] = DOCXTableFormat.from_str(data["init_parameters"]["table_format"])


We still need the check for the table format key. See above.

vblagoje added 2 commits October 15, 2024 14:40

DOCXToDocument: add table extraction

013f47a

Add reno note

4bbaa9f

github-actions bot added the topic:tests and type:documentation labels Oct 15, 2024

vblagoje added 2 commits October 15, 2024 15:16

mypy fixes

2d74b3d

Merge branch 'main' into docx_table_converter

8901192

vblagoje marked this pull request as ready for review October 15, 2024 15:13

vblagoje requested review from a team as code owners October 15, 2024 15:13

vblagoje requested review from dfokina, Amnah199, medsriha, a team, silvanocerza and sjrl and removed request for a team, Amnah199 and silvanocerza October 15, 2024 15:13

medsriha and others added 2 commits October 18, 2024 21:11

add unit tests

83a4898

Merge branch 'main' into docx_table_converter

3552f56

medsriha approved these changes Oct 19, 2024

vblagoje added 3 commits October 21, 2024 11:04

Add csv table support

5b442f4

Merge branch 'main' into docx_table_converter

c16b4b4

Update release note

20d294d

shadeMe requested changes Oct 21, 2024

haystack/components/converters/docx.py (outdated)

haystack/components/converters/docx.py (outdated)

sjrl reviewed Oct 22, 2024

test/components/converters/test_docx_file_to_document.py

abrahamy approved these changes Oct 23, 2024

julian-risch added this to the 2.7.0 milestone Oct 23, 2024

Add TableFormat enum

1286bed

vblagoje force-pushed the docx_table_converter branch from fb962b9 to 1286bed October 24, 2024 07:47

Add table_format as str init param

2dd3777

shadeMe requested changes Oct 24, 2024

vblagoje and others added 2 commits October 24, 2024 13:07

Update docx.py

2f3304c

Co-authored-by: Madeesh Kannan <[email protected]>

PR feedback

c028438

julian-risch removed this from the 2.7.0 milestone Oct 24, 2024

vblagoje requested a review from shadeMe October 25, 2024 07:29

shadeMe requested changes Oct 25, 2024

PR feedback

3e85192
