Feat/refactor layoutelement textregion to vectorized data structure #3881
Conversation
- initial refactor: the tesseract OCR agent now returns an array instead of a list
- more refactoring is required after the rest of the data structure change is complete; specifically, the algorithm in the inference lib will need to be refactored to use vector math with vector data structures
- the test in its original form should not have passed, but it was passing because the function modified its input; refactoring the function to use the new data structure removes the implicit modification of the input
- the inference library requires text regions from pdfminer to be of either EmbeddedTextRegion or ImageTextRegion for class identification purposes; the best way forward is to refactor the library to use layout elements for both inferred and extracted (i.e., pdfminer) layouts for consistency, then we can remove the temporary patch function added in this commit
…lement-textregion-to-vectorized-data-structure
- for now, convert the output to a list so we can reuse the existing assertions; add a TODO note to refactor the checks to use the array directly
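The "convert to list" shim mentioned in the last note might look roughly like this; the coordinate values and variable names here are illustrative, not the actual test code:

```python
import numpy as np

# Vectorized output: one (N, 4) coordinate array instead of N bbox objects
element_coords = np.array([[10, 10, 20, 20], [20, 20, 30, 30]], dtype=float)

# Temporary shim: convert back to nested lists so the existing
# per-element assertions keep working unchanged
coords_as_list = element_coords.tolist()
assert coords_as_list[1] == [20.0, 20.0, 30.0, 30.0]

# TODO (per the note above): compare against the array directly instead
np.testing.assert_array_equal(element_coords[1], [20, 20, 30, 30])
```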
…- Ingest test fixtures update (#3882) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: badGarnet <[email protected]>
```json
"element_id": "a0c3c6b7e1e8c95016b989ef43c5ea2e",
"text": "2 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For \u201cbase model\u201d and \u201clarge model\u201d, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",
```
this is an improvement as a result of this PR (unintended but improvements are welcome)
…ta-structure' of github.com:Unstructured-IO/unstructured into feat/refactor-layoutelement-textregion-to-vectorized-data-structure
LGTM
Generally looks good!
I remember we had a discussion about using data frames (pandas, or better, polars), but at the time the usage was not justified just for vectorized IoU.
Now I see that it could be easier to track the indices of the different numpy vectors (series) and manipulate them if we used DFs. But it's just a thing to think about; it shouldn't block this refactor.
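For context, a minimal numpy sketch of what "vectorized IoU" looks like without data frames; the function and argument names are illustrative, not the library's actual API:

```python
import numpy as np

def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) arrays of [x1, y1, x2, y2] boxes."""
    # Intersection corners via broadcasting; all results have shape (N, M)
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    # Clip negative widths/heights to zero for non-overlapping pairs
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union
```

With a data frame, the row index would travel with each box through filtering steps; with bare arrays, the caller has to keep masks aligned by hand.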
```python
    assert result[1].bbox == Rectangle(20, 20, 30, 30)


def test_process_file_with_pdfminer():
    layout, links = process_file_with_pdfminer(
        Path(__file__).parents[3] / "example-docs" / "pdf" / "layout-parser-paper-fast.pdf"
```
nit: Don't we have some fixture somewhere that delivers the `example-docs` path (could be `scope="session"`)? If not, I guess it's worth adding. It looks like it's used in many tests.
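For illustration, a helper along the lines the comment suggests; the name `example_docs_path`, the `parents[3]` depth, and the idea of wrapping it in a session-scoped pytest fixture are all assumptions, not the repo's actual code:

```python
from pathlib import Path

def example_docs_path(test_file: str) -> Path:
    # Hypothetical helper: resolve the repo's example-docs directory
    # relative to a test file three levels deep (matching the diff above).
    # A session-scoped pytest fixture could simply wrap this:
    #   @pytest.fixture(scope="session")
    #   def example_docs():
    #       return example_docs_path(__file__)
    return Path(test_file).resolve().parents[3] / "example-docs"
```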
```python
@@ -45,18 +46,79 @@ def process_file_with_pdfminer(
    return extracted_layout, layouts_links


def _validate_bbox(bbox: list[int | float]) -> bool:
    return all(x is not None for x in bbox) and ((bbox[2] - bbox[0]) * (bbox[3] - bbox[1]) > 0)
```
Is this check correct? What about:
`bbox = (x1=3, y1=3, x2=2, y2=2)` => `(2-3) * (2-3) = -1 * -1 = 1 > 0`
(but I haven't drunk my coffee yet, so maybe I'm missing something :D )
Yeah, maybe:
```python
bbox[0] < bbox[2] and  # x1 < x2
bbox[1] < bbox[3]      # y1 < y2
```
?
or abs()
good catch
LGTM
…lement-textregion-to-vectorized-data-structure
This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partitioning pdf/image files:

- use a `numpy` array to store the data
- no change to the `partition` function itself, i.e., `partition` still returns `list[Element]`
- `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions`
- where a list is still needed, it is produced by an `element_array.as_list()` call; this is the last step before turning `list[LayoutElement]` into `list[Element]`, after which we can return `list[Element]` from the `LayoutElements` data structure instead

The goal of this PR is to replace the data structure as much as possible without changing the underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations; those are refactored to be vector based. As a result, there are some small improvements observed in the ingest tests. This is likely because the vector operations cleaned up some previous inconsistencies in data types and operations.
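A hypothetical sketch of the idea behind the `TextRegions`-style containers: coordinates and texts live in parallel numpy arrays, boolean masks slice them together, and `as_list()` is the compatibility escape hatch. The field names and the dict output here are assumptions, not the actual library classes:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TextRegions:
    element_coords: np.ndarray  # (N, 4) float array of [x1, y1, x2, y2]
    texts: np.ndarray           # (N,) object array of strings

    def slice(self, mask: np.ndarray) -> "TextRegions":
        # Mask both arrays in one step so they stay aligned
        return TextRegions(self.element_coords[mask], self.texts[mask])

    def as_list(self) -> list[dict]:
        # Fall back to per-element objects, e.g. for legacy assertions
        return [
            {"bbox": coords.tolist(), "text": text}
            for coords, text in zip(self.element_coords, self.texts)
        ]
```

Filtering N regions then becomes one vectorized mask (e.g. keep only regions with positive area) instead of a Python-level loop over element objects.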