Feat/refactor layoutelement textregion to vectorized data structure #3881

Merged
Changes from 25 commits
34 commits
badbf85
feat: refactor list into array
badGarnet Jan 9, 2025
6a62dfc
refactor paddle ocr return as arrays
badGarnet Jan 9, 2025
6a123b4
refactor build layout elements to build LayoutElements
badGarnet Jan 9, 2025
8138b9f
return layoutelements actually and update tests
badGarnet Jan 9, 2025
31b7488
refactor sorting
badGarnet Jan 10, 2025
8070bdd
fix process file with pdfminer and add test
badGarnet Jan 13, 2025
e81d201
fix test reference for links
badGarnet Jan 13, 2025
f07f960
light refactor of merge extracted and inferred layout
badGarnet Jan 13, 2025
de0e8ad
fix: fix a test expectation
badGarnet Jan 14, 2025
55e0e21
fix kwarg name
badGarnet Jan 14, 2025
f53fe20
update test with refactored data structure
badGarnet Jan 14, 2025
ba1d933
fix: save new elements array to merged layout
badGarnet Jan 15, 2025
efb040d
fix: fix conversion of pdfminer text regions
badGarnet Jan 16, 2025
76116c1
refactor pdfminer process page and bump dep
badGarnet Jan 16, 2025
37fa5df
bump deps again
badGarnet Jan 21, 2025
31edd43
pass in the correct threshold
badGarnet Jan 21, 2025
25e8969
Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…
badGarnet Jan 21, 2025
4ea8b7a
bump version and changelog
badGarnet Jan 21, 2025
c71a58d
refactor tests in test_ocr
badGarnet Jan 21, 2025
0b1f17d
refactor tests
badGarnet Jan 22, 2025
c96d431
fix sorting test (to add sources)
badGarnet Jan 22, 2025
04ac46f
fix: dump elements list before non-vectorized step (remove nested pdf…
badGarnet Jan 22, 2025
083c04e
fix: fix condition to detect invalid coord values
badGarnet Jan 22, 2025
a179328
fix: fix logic
badGarnet Jan 22, 2025
5fadd4d
Feat/refactor layoutelement textregion to vectorized data structure <…
ryannikolaidis Jan 22, 2025
354895d
use env python to drive pytest
badGarnet Jan 22, 2025
fcb752a
Merge branch 'feat/refactor-layoutelement-textregion-to-vectorized-da…
badGarnet Jan 22, 2025
09695fc
fix docker test make command
badGarnet Jan 22, 2025
934614c
unpin protobuf and update dockerfile
badGarnet Jan 22, 2025
343161a
fix: fix flakey test
badGarnet Jan 22, 2025
6a91673
fix: fix updated weaviate client init
badGarnet Jan 22, 2025
894e7e6
pin weaviate so we can still use v3 client
badGarnet Jan 22, 2025
334ae6a
fix: fix bbox validation logic and add test
badGarnet Jan 23, 2025
5b8a6a5
Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…
badGarnet Jan 23, 2025
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
+## 0.16.15-dev0
+
+### Enhancements
+
+### Features
+- **Vectorize layout (inferred, extracted, and OCR) data structure** Using `np.ndarray` to store a group of layout elements or text regions instead of using a list of objects. This improves the memory efficiency and compute speed around layout merging and deduplication.
+
+### Fixes
+
## 0.16.14

### Enhancements
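Reviewer note: the changelog entry describes storing a group of layout elements or text regions in `np.ndarray` form rather than as a list of objects. The following is a minimal sketch of that idea, not the code in this PR. `RegionArray`, `pairwise_intersection`, and `drop_covered` are hypothetical names, and the 0.5 threshold is an arbitrary illustration value. It shows how a coverage check used for deduplication can be computed for all pairs at once with broadcasting instead of a nested loop over objects.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RegionArray:
    """Hypothetical container: one array row per region, not one object per region."""

    coords: np.ndarray  # shape (N, 4), columns are x1, y1, x2, y2
    texts: np.ndarray  # shape (N,), dtype=object

    def areas(self) -> np.ndarray:
        widths = self.coords[:, 2] - self.coords[:, 0]
        heights = self.coords[:, 3] - self.coords[:, 1]
        return widths * heights


def pairwise_intersection(a: RegionArray, b: RegionArray) -> np.ndarray:
    """Return a (len(a), len(b)) matrix of intersection areas using broadcasting."""
    ax1, ay1, ax2, ay2 = a.coords.T[:, :, None]  # four arrays of shape (N, 1)
    bx1, by1, bx2, by2 = b.coords.T[:, None, :]  # four arrays of shape (1, M)
    inter_w = np.clip(np.minimum(ax2, bx2) - np.maximum(ax1, bx1), 0, None)
    inter_h = np.clip(np.minimum(ay2, by2) - np.maximum(ay1, by1), 0, None)
    return inter_w * inter_h


def drop_covered(
    extracted: RegionArray, inferred: RegionArray, threshold: float = 0.5
) -> RegionArray:
    """Drop extracted regions covered at or above `threshold` by a single inferred region."""
    if len(inferred.coords) == 0:
        return extracted
    inter = pairwise_intersection(extracted, inferred)
    # Guard against zero-area boxes before dividing.
    covered = inter.max(axis=1) / np.maximum(extracted.areas(), 1e-9)
    keep = covered < threshold
    return RegionArray(coords=extracted.coords[keep], texts=extracted.texts[keep])


# Illustrative data (assumed, not taken from the PR's tests).
extracted = RegionArray(
    coords=np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float),
    texts=np.array(["a", "b"], dtype=object),
)
inferred = RegionArray(
    coords=np.array([[0, 0, 10, 8]], dtype=float),
    texts=np.array(["A"], dtype=object),
)
kept = drop_covered(extracted, inferred)
print(list(kept.texts))  # prints ['b']
```

The first extracted box is 80% covered by the inferred box and is dropped; the second does not overlap and is kept. Filtering with a boolean mask keeps `coords` and `texts` aligned without rebuilding per-region objects.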
8 changes: 4 additions & 4 deletions requirements/base.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./base.in
#
-anyio==4.7.0
+anyio==4.8.0
# via httpx
backoff==2.2.1
# via -r ./base.in
@@ -36,7 +36,7 @@ dataclasses-json==0.6.7
# unstructured-client
deepdiff==8.1.1
# via unstructured-client
-emoji==2.14.0
+emoji==2.14.1
# via -r ./base.in
exceptiongroup==1.2.2
# via anyio
@@ -64,7 +64,7 @@ langdetect==1.0.9
# via -r ./base.in
lxml==5.3.0
# via -r ./base.in
-marshmallow==3.23.2
+marshmallow==3.25.1
# via
# dataclasses-json
# unstructured-client
@@ -150,5 +150,5 @@ urllib3==1.26.20
# unstructured-client
webencodings==0.5.1
# via html5lib
-wrapt==1.17.0
+wrapt==1.17.2
# via -r ./base.in
8 changes: 4 additions & 4 deletions requirements/dev.txt
@@ -17,9 +17,9 @@ distlib==0.3.9
# via virtualenv
filelock==3.16.1
# via virtualenv
-identify==2.6.4
+identify==2.6.6
# via pre-commit
-importlib-metadata==8.5.0
+importlib-metadata==8.6.1
# via
# -c ././deps/constraints.txt
# build
@@ -36,7 +36,7 @@ platformdirs==4.3.6
# via
# -c ./test.txt
# virtualenv
-pre-commit==4.0.1
+pre-commit==4.1.0
# via -r ./dev.in
pyproject-hooks==1.2.0
# via
@@ -51,7 +51,7 @@ tomli==2.2.1
# -c ./test.txt
# build
# pip-tools
-virtualenv==20.28.1
+virtualenv==20.29.1
# via pre-commit
wheel==0.45.1
# via pip-tools
2 changes: 1 addition & 1 deletion requirements/extra-epub.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-epub.in
#
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-epub.in
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-markdown.in
#
-importlib-metadata==8.5.0
+importlib-metadata==8.6.1
# via
# -c ././deps/constraints.txt
# markdown
2 changes: 1 addition & 1 deletion requirements/extra-odt.txt
@@ -8,7 +8,7 @@ lxml==5.3.0
# via
# -c ./base.txt
# python-docx
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-odt.in
python-docx==1.1.2
# via -r ./extra-odt.in
10 changes: 5 additions & 5 deletions requirements/extra-paddleocr.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-paddleocr.in
#
-anyio==4.7.0
+anyio==4.8.0
# via
# -c ./base.txt
# httpx
@@ -52,13 +52,13 @@ idna==3.10
# anyio
# httpx
# requests
-imageio==2.36.1
+imageio==2.37.0
# via
# imgaug
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
-importlib-resources==6.5.1
+importlib-resources==6.5.2
# via matplotlib
kiwisolver==1.4.7
# via matplotlib
@@ -86,9 +86,9 @@ numpy==1.26.4
# shapely
# tifffile
# unstructured-paddleocr
-opencv-contrib-python==4.10.0.84
+opencv-contrib-python==4.11.0.86
# via unstructured-paddleocr
-opencv-python==4.10.0.84
+opencv-python==4.11.0.86
# via
# imgaug
# unstructured-paddleocr
2 changes: 1 addition & 1 deletion requirements/extra-pandoc.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-pandoc.in
#
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-pandoc.in
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -11,5 +11,5 @@ google-cloud-vision
effdet
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
-unstructured-inference==0.8.1
+unstructured-inference==0.8.4
unstructured.pytesseract>=0.3.12
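Reviewer note: since this file pins `unstructured-inference` to an exact version, a quick environment check along these lines may help confirm that an install picked up the bump. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_matches_pin` is hypothetical and not part of this repository.

```python
from importlib import metadata


def installed_matches_pin(package: str, pinned: str) -> bool:
    """Return True only if `package` is installed at exactly the `pinned` version."""
    try:
        return metadata.version(package) == pinned
    except metadata.PackageNotFoundError:
        # Package is not installed in the current environment.
        return False


# Example: check the version pinned in this diff.
print(installed_matches_pin("unstructured-inference", "0.8.4"))
```

The comparison is a plain string match, which is enough for an `==` pin; range specifiers such as `>=0.3.12` would need a real version parser.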
43 changes: 14 additions & 29 deletions requirements/extra-pdf-image.txt
@@ -60,14 +60,14 @@ googleapis-common-protos==1.66.0
# via
# google-api-core
# grpcio-status
-grpcio==1.68.1
+grpcio==1.69.0
# via
# -c ././deps/constraints.txt
# google-api-core
# grpcio-status
grpcio-status==1.62.3
# via google-api-core
-huggingface-hub==0.27.0
+huggingface-hub==0.27.1
# via
# timm
# tokenizers
@@ -79,16 +79,12 @@ idna==3.10
# via
# -c ./base.txt
# requests
-importlib-resources==6.5.1
+importlib-resources==6.5.2
# via matplotlib
-iopath==0.1.10
-# via layoutparser
jinja2==3.1.5
# via torch
kiwisolver==1.4.7
# via matplotlib
-layoutparser==0.3.4
-# via unstructured-inference
lxml==5.3.0
# via
# -c ./base.txt
@@ -107,7 +103,6 @@ numpy==1.26.4
# via
# -c ./base.txt
# contourpy
-# layoutparser
# matplotlib
# onnx
# onnxruntime
@@ -126,10 +121,8 @@ onnx==1.17.0
# unstructured-inference
onnxruntime==1.19.2
# via unstructured-inference
-opencv-python==4.10.0.84
-# via
-# layoutparser
-# unstructured-inference
+opencv-python==4.11.0.86
+# via unstructured-inference
packaging==24.2
# via
# -c ./base.txt
@@ -140,33 +133,28 @@ packaging==24.2
# transformers
# unstructured-pytesseract
pandas==2.2.3
-# via layoutparser
+# via unstructured-inference
pdf2image==1.17.0
-# via
-# -r ./extra-pdf-image.in
-# layoutparser
+# via -r ./extra-pdf-image.in
pdfminer-six==20231228
# via
# -r ./extra-pdf-image.in
# pdfplumber
pdfplumber==0.11.5
-# via layoutparser
+# via unstructured-inference
pi-heif==0.21.0
# via -r ./extra-pdf-image.in
-pikepdf==9.5.0
+pikepdf==9.5.1
# via -r ./extra-pdf-image.in
pillow==11.1.0
# via
-# layoutparser
# matplotlib
# pdf2image
# pdfplumber
# pi-heif
# pikepdf
# torchvision
# unstructured-pytesseract
-portalocker==3.1.1
-# via iopath
proto-plus==1.25.0
# via
# google-api-core
@@ -213,7 +201,6 @@ pytz==2024.2
pyyaml==6.0.2
# via
# huggingface-hub
-# layoutparser
# omegaconf
# timm
# transformers
@@ -233,12 +220,12 @@ requests==2.32.3
# transformers
rsa==4.9
# via google-auth
-safetensors==0.5.0
+safetensors==0.5.2
# via
# timm
# transformers
scipy==1.13.1
-# via layoutparser
+# via unstructured-inference
six==1.17.0
# via
# -c ./base.txt
@@ -247,7 +234,7 @@ sympy==1.13.1
# via
# onnxruntime
# torch
-timm==1.0.12
+timm==1.0.14
# via
# effdet
# unstructured-inference
@@ -269,20 +256,18 @@ tqdm==4.67.1
# via
# -c ./base.txt
# huggingface-hub
-# iopath
# transformers
transformers==4.44.2
# via unstructured-inference
typing-extensions==4.12.2
# via
# -c ./base.txt
# huggingface-hub
-# iopath
# pypdf
# torch
tzdata==2024.2
# via pandas
-unstructured-inference==0.8.1
+unstructured-inference==0.8.4
# via -r ./extra-pdf-image.in
unstructured-pytesseract==0.3.13
# via -r ./extra-pdf-image.in
@@ -291,7 +276,7 @@ urllib3==1.26.20
# -c ././deps/constraints.txt
# -c ./base.txt
# requests
-wrapt==1.17.0
+wrapt==1.17.2
# via
# -c ./base.txt
# deprecated
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
@@ -25,7 +25,7 @@ fsspec==2024.12.0
# via
# huggingface-hub
# torch
-huggingface-hub==0.27.0
+huggingface-hub==0.27.1
# via
# tokenizers
# transformers
@@ -74,7 +74,7 @@ requests==2.32.3
# transformers
sacremoses==0.1.1
# via -r ./huggingface.in
-safetensors==0.5.0
+safetensors==0.5.2
# via transformers
sentencepiece==0.2.0
# via -r ./huggingface.in
15 changes: 8 additions & 7 deletions requirements/test.txt
@@ -6,7 +6,7 @@
#
annotated-types==0.7.0
# via pydantic
-anyio==4.7.0
+anyio==4.8.0
# via
# -c ./base.txt
# httpx
@@ -54,7 +54,7 @@ exceptiongroup==1.2.2
# -c ./base.txt
# anyio
# pytest
-faker==33.1.0
+faker==33.3.1
# via jsf
flake8==7.1.1
# via
@@ -66,7 +66,7 @@ freezegun==1.5.1
# via -r ./test.in
genson==1.3.0
# via datamodel-code-generator
-grpcio==1.68.1
+grpcio==1.69.0
# via
# -c ././deps/constraints.txt
# -r ./test.in
@@ -164,7 +164,7 @@ pycodestyle==2.12.1
# via
# flake8
# flake8-print
-pydantic[email]==2.10.4
+pydantic[email]==2.10.5
# via
# -r ./test.in
# datamodel-code-generator
@@ -196,7 +196,7 @@ pyyaml==6.0.2
# via
# datamodel-code-generator
# vcrpy
-referencing==0.35.1
+referencing==0.36.1
# via
# jsonschema
# jsonschema-specifications
@@ -218,7 +218,7 @@ rpds-py==0.22.3
# referencing
rstr==3.2.2
# via jsf
-ruff==0.8.5
+ruff==0.9.2
# via -r ./test.in
semantic-version==2.10.0
# via liccheck
@@ -269,6 +269,7 @@ typing-extensions==4.12.2
# mypy
# pydantic
# pydantic-core
+# referencing
tzdata==2024.2
# via pandas
ujson==5.10.0
@@ -281,7 +282,7 @@ urllib3==1.26.20
# vcrpy
vcrpy==7.0.0
# via -r ./test.in
-wrapt==1.17.0
+wrapt==1.17.2
# via
# -c ./base.txt
# smart-open