Feat/refactor layoutelement textregion to vectorized data structure #3881

badGarnet · 2025-01-21T02:27:43Z

This PR refactors the data structures for list[LayoutElement] and list[TextRegion] used when partitioning PDF and image files.

The new data structure replaces a list of objects with a single object that stores the data in numpy arrays.
This only affects internal partition steps; it does not change the input or output signature of the partition function itself, i.e., partition still returns list[Element].
Internally, list[LayoutElement] -> LayoutElements and list[TextRegion] -> TextRegions.
The current refactor stops before the step that cleans up pdfminer elements nested inside inferred layout elements. The cleanup algorithm needs to be refactored before the data structure refactor can move forward, so for now the array data structure is converted back into a list with an element_array.as_list() call. This is the last step before list[LayoutElement] is turned into list[Element] for return.
A future PR will update this last step so that list[Element] is built directly from the LayoutElements data structure instead.
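To make the list-to-array idea concrete, here is a minimal sketch of one object holding numpy arrays in place of a list of per-region objects, with an as_list() escape hatch for steps that are not vectorized yet. The names Region, Regions, coords, texts and from_list are hypothetical and chosen for illustration; they are not the actual LayoutElements or TextRegions API.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    """Hypothetical stand-in for a single TextRegion-like object."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str


@dataclass
class Regions:
    """Hypothetical stand-in for a TextRegions-like container (one object, many rows)."""

    coords: np.ndarray  # shape (n, 4), columns are x1, y1, x2, y2
    texts: np.ndarray  # shape (n,), dtype=object

    @classmethod
    def from_list(cls, regions: list[Region]) -> Regions:
        # reshape(-1, 4) keeps the (n, 4) shape even when the list is empty
        coords = np.array(
            [[r.x1, r.y1, r.x2, r.y2] for r in regions], dtype=float
        ).reshape(-1, 4)
        texts = np.array([r.text for r in regions], dtype=object)
        return cls(coords=coords, texts=texts)

    def as_list(self) -> list[Region]:
        # Convert back to per-object form for code paths that still expect a list
        return [
            Region(*map(float, row), text=text)
            for row, text in zip(self.coords, self.texts)
        ]
```

In this sketch, from_list followed by as_list round-trips the data, which mirrors how the PR description says the array structure is dumped back to a list before the final, not yet vectorized step.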

The goal of this PR is to replace the data structure as much as possible without changing the underlying logic. In a few places the slicing or filtering logic was simple enough to be converted into vector operations, and those are now vector based. As a result, some small improvements were observed in the ingest tests. This is likely because the vector operations cleaned up some previous inconsistencies in data types and operations.
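As an illustration of the kind of simple filtering that converts naturally to vector operations, the sketch below filters regions by area with a boolean mask instead of a per-object loop. The arrays, the sample values, and the 50.0 threshold are all made up for the example and do not come from the PR.

```python
import numpy as np

# Hypothetical data: one row per region, columns are x1, y1, x2, y2
coords = np.array(
    [
        [0, 0, 10, 10],
        [5, 5, 6, 6],
        [20, 20, 60, 40],
    ],
    dtype=float,
)
texts = np.array(["title", "dot", "body"], dtype=object)

# Areas for all regions at once, no Python-level loop
areas = (coords[:, 2] - coords[:, 0]) * (coords[:, 3] - coords[:, 1])

# A single boolean mask keeps every parallel array aligned
mask = areas >= 50.0  # assumed threshold, for illustration only
kept_coords = coords[mask]
kept_texts = texts[mask]
```

Applying the same mask to every parallel array is what keeps the rows consistent with each other, and it forces one dtype per column.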

- initial refactor on tesseract ocr agent return as array instead of as a list

- more refactoring is required after the rest of data structure change is complete - specifically the algorithm in inference lib will need to be refactored to use vector math with vector data structures

- test original form should not have passed but because the function modified input it was passing - refactor the function to use new data structure removes the implicit modification of input

- inference library requires text regions from pdfminer to be of either EmbeddedTextRegion or ImageTextRegion for class identification purposes - best way forward is to refactor the library to use layoutelements for both inferred and extracted (i.e., pdfminer) layouts for consistency then we can remove this temporary patch function in this commit


- for now convert output to list so we can reuse existing assertions - add todo note to refactor the checks to use array directly


…- Ingest test fixtures update (#3882) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: badGarnet <[email protected]>

badGarnet · 2025-01-22T17:41:08Z

...put/local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json

+    "element_id": "a0c3c6b7e1e8c95016b989ef43c5ea2e",
+    "text": "2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For \u201cbase model\u201d and \u201clarge model\u201d, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",


this is an improvement as a result of this PR (unintended but improvements are welcome)


christinestraub

LGTM

pawel-kmiecik

Generally looks good!
I remember we had a discussion about using data frames (pandas or, better, polars), but at the time the usage was not justified just for vectorized IoU.
Now I see that it could be easier to track the indices of the different numpy vectors (series) and manipulate them if we used DFs. But it's just something to think about; it shouldn't block this refactor.

pawel-kmiecik · 2025-01-23T07:29:03Z

test_unstructured/partition/pdf_image/test_pdfminer_processing.py

-    assert result[1].bbox == Rectangle(20, 20, 30, 30)
+def test_process_file_with_pdfminer():
+    layout, links = process_file_with_pdfminer(
+        Path(__file__).parents[3] / "example-docs" / "pdf" / "layout-parser-paper-fast.pdf"


nit: Don't we have a fixture somewhere that provides the example-docs path (it could be scope=session)? If not, I guess it's worth adding. It looks like it is used in many tests.
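A session-scoped fixture along the lines suggested here could look like the sketch below. The helper find_example_docs and the fixture name example_docs_dir are hypothetical, not existing names in the repository, and the sketch assumes the fixture lives in a conftest.py somewhere under the repository root so that walking up the parents eventually reaches a directory containing example-docs.

```python
from pathlib import Path

import pytest


def find_example_docs(start: Path, dirname: str = "example-docs") -> Path:
    """Walk upward from `start` until a directory named `dirname` is found.

    Hypothetical helper: searching avoids hard-coding an index such as parents[3],
    which depends on how deeply the test file is nested.
    """
    for parent in [start, *start.parents]:
        candidate = parent / dirname
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no {dirname!r} directory found above {start}")


@pytest.fixture(scope="session")
def example_docs_dir() -> Path:
    # Resolved once per test session and shared by every test that requests it
    return find_example_docs(Path(__file__).resolve().parent)
```

A test would then take example_docs_dir as an argument and build paths such as example_docs_dir / "pdf" / "layout-parser-paper-fast.pdf" instead of repeating the parents[...] expression.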

pawel-kmiecik · 2025-01-23T07:50:32Z

unstructured/partition/pdf_image/pdfminer_processing.py

@@ -45,18 +46,79 @@ def process_file_with_pdfminer(
        return extracted_layout, layouts_links


+def _validate_bbox(bbox: list[int | float]) -> bool:
+    return all(x is not None for x in bbox) and ((bbox[2] - bbox[0]) * (bbox[3] - bbox[1]) > 0)


Is this check correct? What about:

bbox = (x1 = 3, y1 = 3, x2 = 2, y2 = 2) => (2-3) * (2-3) = -1 * -1 = 1 > 0

(but I haven't drunk my coffee yet so maybe I don't see something :D )

Yeah, maybe:

bbox[0] < bbox[2] and  # x1 < x2
bbox[1] < bbox[3]  # y1 < y2

?
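The concern raised above can be checked with a small sketch. The function validate_bbox_area_only reproduces the check from the diff, while validate_bbox_ordered applies the ordering comparison proposed in the comment. Both names are hypothetical, and the second is only one possible fix; it is not necessarily what the later "fix bbox validation logic" commit does.

```python
from __future__ import annotations

from typing import Optional, Sequence

Number = Optional[float]


def validate_bbox_area_only(bbox: Sequence[Number]) -> bool:
    """The check as written in the diff: it only tests that the signed area is positive."""
    return all(x is not None for x in bbox) and (
        (bbox[2] - bbox[0]) * (bbox[3] - bbox[1]) > 0
    )


def validate_bbox_ordered(bbox: Sequence[Number]) -> bool:
    """Sketch of the proposed fix: require x1 < x2 and y1 < y2 separately."""
    if len(bbox) != 4 or any(x is None for x in bbox):
        return False
    x1, y1, x2, y2 = bbox
    return x1 < x2 and y1 < y2


# A box with both axes flipped has a positive product of two negative extents
flipped = (3, 3, 2, 2)
print(validate_bbox_area_only(flipped))  # accepted by the area-only check
print(validate_bbox_ordered(flipped))  # rejected by the ordering check
```

The ordering check also rejects zero-width or zero-height boxes, which matches the strict inequality used in the original area test.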

MaksOpp

LGTM


badGarnet and others added 25 commits January 9, 2025 11:49

feat: refactor list into array

badbf85

- initial refactor on tesseract ocr agent return as array instead of as a list

refactor paddle ocr return as arrays

6a62dfc

refactor build layout elements to build LayoutElements

6a123b4

return layoutelements actually and update tests

8138b9f

refactor sorting

31b7488

fix process file with pdfminer and add test

8070bdd

fix test reference for links

e81d201

light refactor of merge extracted and inferred layout

f07f960

- more refactoring is required after the rest of data structure change is complete - specifically the algorithm in inference lib will need to be refactored to use vector math with vector data structures

fix: fix a test expectation

de0e8ad

- test original form should not have passed but because the function modified input it was passing - refactor the function to use new data structure removes the implicit modification of input

fix kwarg name

55e0e21

update test with refactored data structure

f53fe20

fix: save new elements array to merged layout

ba1d933

refactor pdfminer process page and bump dep

76116c1

bump deps again

37fa5df

pass in the correct threshold

31edd43

Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…

25e8969

…lement-textregion-to-vectorized-data-structure

bump version and changelog

4ea8b7a

refactor tests in test_ocr

c71a58d

- for now convert output to list so we can reuse existing assertions - add todo note to refactor the checks to use array directly

refactor tests

0b1f17d

fix sorting test (to add sources)

c96d431

fix: dump elements list before non-vectorized step (remove nested pdf…

04ac46f

…miner elements)

fix: fix condition to detect invalid coord values

083c04e

fix: fix logic

a179328

Feat/refactor layoutelement textregion to vectorized data structure <…

5fadd4d

…- Ingest test fixtures update (#3882) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: badGarnet <[email protected]>

badGarnet commented Jan 22, 2025


badGarnet added 4 commits January 22, 2025 12:03

use env python to drive pytest

354895d

Merge branch 'feat/refactor-layoutelement-textregion-to-vectorized-da…

fcb752a

…ta-structure' of github.com:Unstructured-IO/unstructured into feat/refactor-layoutelement-textregion-to-vectorized-data-structure

fix docker test make command

09695fc

unpin protobuf and update dockerfile

934614c

badGarnet added 3 commits January 22, 2025 15:01

fix: fix flakey test

343161a

fix: fix updated weaviate client init

6a91673

pin weaviate so we can still use v3 client

894e7e6

badGarnet marked this pull request as ready for review January 22, 2025 22:21

badGarnet requested review from christinestraub, pawel-kmiecik, ryannikolaidis, MaksOpp and cragwolfe January 22, 2025 22:21

christinestraub approved these changes Jan 23, 2025


pawel-kmiecik reviewed Jan 23, 2025


MaksOpp approved these changes Jan 23, 2025


badGarnet added 2 commits January 23, 2025 10:22

fix: fix bbox validation logic and add test

334ae6a

Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…

5b8a6a5

…lement-textregion-to-vectorized-data-structure

badGarnet enabled auto-merge January 23, 2025 16:35

badGarnet added this pull request to the merge queue Jan 23, 2025

Merged via the queue into main with commit 8f2a719 Jan 23, 2025
41 checks passed

badGarnet deleted the feat/refactor-layoutelement-textregion-to-vectorized-data-structure branch January 23, 2025 17:47
