Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/refactor layoutelement textregion to vectorized data structure #3881

Conversation

badGarnet
Copy link
Collaborator

@badGarnet badGarnet commented Jan 21, 2025

This PR refactors the data structure for list[LayoutElement] and list[TextRegion] used in partition pdf/image files.

  • new data structure replaces a list of objects with one object with numpy array to store data
  • this only affects partition internal steps and it doesn't change input or output signature of partition function itself, i.e., partition still returns list[Element]
  • internally list[LayoutElement] -> LayoutElements; list[TextRegion] -> TextRegions
  • current refactor stops before clean up pdfminer elements inside inferred layout elements -> the algorithm of clean up needs to be refactored before the data structure refactor can move forward. So current refactor converts the array data structure into list data structure with element_array.as_list() call. This is the last step before turning list[LayoutElement] into list[Element] as return
  • a future PR will update this last step so that we build list[Element] from LayoutElements data structure instead.

The goal of this PR is to replace the data structure as much as possible without changing underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations. Those are refactored to be vector based. As a result there is some small improvements observed in ingest test. This is likely because the vector operations cleaned up some previous inconsistency in data types and operations.

badGarnet and others added 25 commits January 9, 2025 11:49
- initial refactor on tesseract ocr agent return as array instead of as
  a list
- more refactoring is required after the rest of data structure change
  is complete
- specifically the algorithm in inference lib will need to be refactored
  to use vector math with vector data structures
- test original form should not have passed but because the function
  modified input it was passing
- refactor the function to use new data structure removes the implicit
  modification of input
- inference library requires text regsions from pdfminer to be of either
  EmbeddedTextRegion or ImageTextRegion for class identification
  purposes
- best way foward is to refactor the library to use layoutelements for
  both inferred and extracted (i.e., pdfminer) layouts for consistency
  then we can remove this temporary patch function in this commit
…lement-textregion-to-vectorized-data-structure
- for now convert output to list so we can reuse existing assertions
- add todo note to refactor the checks to use array directly
…- Ingest test fixtures update (#3882)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: badGarnet <[email protected]>
Comment on lines +71 to +72
"element_id": "a0c3c6b7e1e8c95016b989ef43c5ea2e",
"text": "2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For \u201cbase model\u201d and \u201clarge model\u201d, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an improvement as a result of this PR (unintended but improvements are welcome)

…ta-structure' of github.com:Unstructured-IO/unstructured into feat/refactor-layoutelement-textregion-to-vectorized-data-structure
@badGarnet badGarnet marked this pull request as ready for review January 22, 2025 22:21
Copy link
Collaborator

@christinestraub christinestraub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good!
I remember we had a discussion about using Data Frames (pandas or better - polars) but then the usage was not justified just for vectorized IoU.
Now I see that it could be easier to track the indeces of different numpy vector (series) and manipulate them if we used DFs. But - it's just a thing to think of, shouldn't block this refactor.

assert result[1].bbox == Rectangle(20, 20, 30, 30)
def test_process_file_with_pdfminer():
layout, links = process_file_with_pdfminer(
Path(__file__).parents[3] / "example-docs" / "pdf" / "layout-parser-paper-fast.pdf"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Don't we have some fixture somewhere that delivers the example-docs path (could be scope=session)? If not I guess it's worth adding. Looks like being used in many tests.

@@ -45,18 +46,79 @@ def process_file_with_pdfminer(
return extracted_layout, layouts_links


def _validate_bbox(bbox: list[int | float]) -> bool:
return all(x is not None for x in bbox) and ((bbox[2] - bbox[0]) * (bbox[3] - bbox[1]) > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check correct? What about:

bbox = (x1 = 3, y1 = 3, x2 = 2, y2 = 2) => (2-3) * (2-3) = -1 * -1 = 1 > 0

(but I haven't drunk my coffee yet so maybe I don't see something :D )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe:

bbox[0] < bbox[2] and  # x1 < x2 
bbox[1] < bbox[3]      # y1 < y2

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or abs()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

Copy link
Contributor

@MaksOpp MaksOpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@badGarnet badGarnet enabled auto-merge January 23, 2025 16:35
@badGarnet badGarnet added this pull request to the merge queue Jan 23, 2025
Merged via the queue into main with commit 8f2a719 Jan 23, 2025
41 checks passed
@badGarnet badGarnet deleted the feat/refactor-layoutelement-textregion-to-vectorized-data-structure branch January 23, 2025 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants