Feat/refactor layoutelement textregion to vectorized data structure #3881

Merged
Changes from 32 commits
34 commits
badbf85
feat: refactor list into array
badGarnet Jan 9, 2025
6a62dfc
refactor paddle ocr return as arrays
badGarnet Jan 9, 2025
6a123b4
refactor build layout elements to build LayoutElements
badGarnet Jan 9, 2025
8138b9f
return layoutelements actually and update tests
badGarnet Jan 9, 2025
31b7488
refactor sorting
badGarnet Jan 10, 2025
8070bdd
fix process file with pdfminer and add test
badGarnet Jan 13, 2025
e81d201
fix test reference for links
badGarnet Jan 13, 2025
f07f960
light refactor of merge extracted and inferred layout
badGarnet Jan 13, 2025
de0e8ad
fix: fix a test expectation
badGarnet Jan 14, 2025
55e0e21
fix kwarg name
badGarnet Jan 14, 2025
f53fe20
update test with refactored data structure
badGarnet Jan 14, 2025
ba1d933
fix: save new elements array to merged layout
badGarnet Jan 15, 2025
efb040d
fix: fix conversion of pdfminer text regions
badGarnet Jan 16, 2025
76116c1
refactor pdfminer process page and bump dep
badGarnet Jan 16, 2025
37fa5df
bump deps again
badGarnet Jan 21, 2025
31edd43
pass in the correct threshold
badGarnet Jan 21, 2025
25e8969
Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…
badGarnet Jan 21, 2025
4ea8b7a
bump version and changelog
badGarnet Jan 21, 2025
c71a58d
refactor tests in test_ocr
badGarnet Jan 21, 2025
0b1f17d
refactor tests
badGarnet Jan 22, 2025
c96d431
fix sorting test (to add sources)
badGarnet Jan 22, 2025
04ac46f
fix: dump elements list before non-vectorized step (remove nested pdf…
badGarnet Jan 22, 2025
083c04e
fix: fix condition to detect invalid coord values
badGarnet Jan 22, 2025
a179328
fix: fix logic
badGarnet Jan 22, 2025
5fadd4d
Feat/refactor layoutelement textregion to vectorized data structure <…
ryannikolaidis Jan 22, 2025
354895d
use env python to drive pytest
badGarnet Jan 22, 2025
fcb752a
Merge branch 'feat/refactor-layoutelement-textregion-to-vectorized-da…
badGarnet Jan 22, 2025
09695fc
fix docker test make command
badGarnet Jan 22, 2025
934614c
unpin protobuf and update dockerfile
badGarnet Jan 22, 2025
343161a
fix: fix flakey test
badGarnet Jan 22, 2025
6a91673
fix: fix updated weaviate client init
badGarnet Jan 22, 2025
894e7e6
pin weaviate so we can still use v3 client
badGarnet Jan 22, 2025
334ae6a
fix: fix bbox validation logic and add test
badGarnet Jan 23, 2025
5b8a6a5
Merge remote-tracking branch 'origin/main' into feat/refactor-layoute…
badGarnet Jan 23, 2025
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
+## 0.16.15-dev0
+
+### Enhancements
+
+### Features
+- **Vectorize layout (inferred, extracted, and OCR) data structure** Using `np.ndarray` to store a group of layout elements or text regions instead of using a list of objects. This improves the memory efficiency and compute speed around layout merging and deduplication.
+
+### Fixes
+
## 0.16.14

### Enhancements
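
To illustrate the idea behind the changelog entry above, here is a minimal sketch of why an array of coordinates helps with merging and deduplication. It is not the code from this PR: the `(x1, y1, x2, y2)` row layout, the sample boxes, and the helper names `areas` and `pairwise_intersection` are all assumptions made for illustration. With one row per region, overlap between every pair of regions can be computed with broadcasting instead of a nested Python loop over objects.

```python
import numpy as np

# Hypothetical data: each row is one region as (x1, y1, x2, y2).
boxes = np.array(
    [
        [0.0, 0.0, 10.0, 10.0],
        [1.0, 1.0, 9.0, 9.0],
        [20.0, 20.0, 30.0, 30.0],
    ]
)


def areas(coords: np.ndarray) -> np.ndarray:
    """Area of every box, computed in one vectorized expression."""
    return (coords[:, 2] - coords[:, 0]) * (coords[:, 3] - coords[:, 1])


def pairwise_intersection(coords: np.ndarray) -> np.ndarray:
    """Return an (N, N) matrix of intersection areas between all box pairs."""
    x1 = np.maximum(coords[:, None, 0], coords[None, :, 0])
    y1 = np.maximum(coords[:, None, 1], coords[None, :, 1])
    x2 = np.minimum(coords[:, None, 2], coords[None, :, 2])
    y2 = np.minimum(coords[:, None, 3], coords[None, :, 3])
    # Negative widths or heights mean the boxes do not overlap.
    return np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)


box_areas = areas(boxes)
inter = pairwise_intersection(boxes)
# coverage[i, j] is the fraction of box i that lies inside box j.
coverage = inter / box_areas[:, None]
```

In this sketch `coverage[1, 0]` is 1.0 because the second box sits entirely inside the first, which is the kind of signal a deduplication step could threshold on. The actual thresholds and data structures used by the library may differ.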
8 changes: 4 additions & 4 deletions Dockerfile
@@ -1,7 +1,7 @@
FROM quay.io/unstructured-io/base-images:wolfi-base-latest AS base

ARG PYTHON=python3.11
-ARG PIP=pip3.11
+ARG PIP="${PYTHON} -m pip"

USER root

@@ -19,6 +19,9 @@ RUN chown -R notebook-user:notebook-user /app && \

USER notebook-user

+# append PATH before pip install to avoid warning logs; it also avoids issues with packages that needs compilation during installation
+ENV PATH="${PATH}:/home/notebook-user/.local/bin"
+ENV TESSDATA_PREFIX=/usr/local/share/tessdata
ENV NLTK_DATA=/home/notebook-user/nltk_data

# Install Python dependencies and download required NLTK packages
@@ -28,7 +31,4 @@ RUN find requirements/ -type f -name "*.txt" -exec $PIP install --no-cache-dir -
$PYTHON -c "from unstructured.partition.model_init import initialize; initialize()" && \
$PYTHON -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"

-ENV PATH="${PATH}:/home/notebook-user/.local/bin"
-ENV TESSDATA_PREFIX=/usr/local/share/tessdata
-
CMD ["/bin/bash"]
2 changes: 1 addition & 1 deletion Makefile
@@ -308,7 +308,7 @@ docker-test:
$(DOCKER_IMAGE) \
bash -c "CI=$(CI) \
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=$(UNSTRUCTURED_INCLUDE_DEBUG_METADATA) \
-pytest $(if $(TEST_FILE),$(TEST_FILE),test_unstructured)"
+python3 -m pytest $(if $(TEST_FILE),$(TEST_FILE),test_unstructured)"

.PHONY: docker-smoke-test
docker-smoke-test:
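
The Makefile change above and the `PIP="${PYTHON} -m pip"` change in the Dockerfile follow the same pattern: invoke a tool as a module of a specific interpreter rather than through whatever script is first on `PATH`. A minimal sketch of that pattern follows; the helper name `module_command` and the example arguments are hypothetical and are not part of this PR.

```python
import sys


def module_command(module: str, *args: str) -> list[str]:
    """Return an argv list that runs `module` with the current interpreter."""
    return [sys.executable, "-m", module, *args]


# Hypothetical usage mirroring the docker-test target's default test path.
pytest_cmd = module_command("pytest", "test_unstructured")
# Hypothetical usage mirroring the Dockerfile's pip invocation.
pip_cmd = module_command("pip", "install", "--no-cache-dir", "-r", "requirements/base.txt")
```

Either list could be passed to `subprocess.run`; the point is that the tool then runs against the packages installed for that interpreter, which avoids mismatches when several Python versions are present in the image.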
8 changes: 4 additions & 4 deletions requirements/base.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./base.in
#
-anyio==4.7.0
+anyio==4.8.0
# via httpx
backoff==2.2.1
# via -r ./base.in
@@ -36,7 +36,7 @@ dataclasses-json==0.6.7
# unstructured-client
deepdiff==8.1.1
# via unstructured-client
-emoji==2.14.0
+emoji==2.14.1
# via -r ./base.in
exceptiongroup==1.2.2
# via anyio
@@ -64,7 +64,7 @@ langdetect==1.0.9
# via -r ./base.in
lxml==5.3.0
# via -r ./base.in
-marshmallow==3.23.2
+marshmallow==3.25.1
# via
# dataclasses-json
# unstructured-client
@@ -150,5 +150,5 @@ urllib3==1.26.20
# unstructured-client
webencodings==0.5.1
# via html5lib
-wrapt==1.17.0
+wrapt==1.17.2
# via -r ./base.in
4 changes: 2 additions & 2 deletions requirements/deps/constraints.txt
@@ -3,8 +3,8 @@
# extras. Putting a dependency here will only affect dependency sets that contain them -- in other
# words, if something does not require a constraint, it will not be installed.
####################################################################################################
-# (jennings): Versions greater than 5.0 create dependency conflicts with other packages
-protobuf<5.0
+# we are using v3 client https://weaviate.io/developers/weaviate/client-libraries/python/python_v3
+weaviate-client>=3.26.7,<4.0.0
# TODO: Constriant due to multiple versions being installed during pip-compile
grpcio>=1.65.5
# TODO: Pinned in transformers package, remove when that gets updated (https://github.com/huggingface/transformers/blob/main/setup.py)
10 changes: 5 additions & 5 deletions requirements/dev.txt
@@ -15,11 +15,11 @@ click==8.1.8
# pip-tools
distlib==0.3.9
# via virtualenv
-filelock==3.16.1
+filelock==3.17.0
# via virtualenv
-identify==2.6.4
+identify==2.6.6
# via pre-commit
-importlib-metadata==8.5.0
+importlib-metadata==8.6.1
# via
# -c ././deps/constraints.txt
# build
@@ -36,7 +36,7 @@ platformdirs==4.3.6
# via
# -c ./test.txt
# virtualenv
-pre-commit==4.0.1
+pre-commit==4.1.0
# via -r ./dev.in
pyproject-hooks==1.2.0
# via
@@ -51,7 +51,7 @@ tomli==2.2.1
# -c ./test.txt
# build
# pip-tools
-virtualenv==20.28.1
+virtualenv==20.29.1
# via pre-commit
wheel==0.45.1
# via pip-tools
2 changes: 1 addition & 1 deletion requirements/extra-csv.txt
@@ -20,5 +20,5 @@ six==1.17.0
# via
# -c ./base.txt
# python-dateutil
-tzdata==2024.2
+tzdata==2025.1
# via pandas
2 changes: 1 addition & 1 deletion requirements/extra-epub.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-epub.in
#
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-epub.in
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-markdown.in
#
-importlib-metadata==8.5.0
+importlib-metadata==8.6.1
# via
# -c ././deps/constraints.txt
# markdown
2 changes: 1 addition & 1 deletion requirements/extra-odt.txt
@@ -8,7 +8,7 @@ lxml==5.3.0
# via
# -c ./base.txt
# python-docx
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-odt.in
python-docx==1.1.2
# via -r ./extra-odt.in
18 changes: 8 additions & 10 deletions requirements/extra-paddleocr.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-paddleocr.in
#
-anyio==4.7.0
+anyio==4.8.0
# via
# -c ./base.txt
# httpx
@@ -32,7 +32,7 @@ exceptiongroup==1.2.2
# via
# -c ./base.txt
# anyio
-fonttools==4.55.3
+fonttools==4.55.4
# via matplotlib
h11==0.14.0
# via
@@ -52,13 +52,13 @@ idna==3.10
# anyio
# httpx
# requests
-imageio==2.36.1
+imageio==2.37.0
# via
# imgaug
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
-importlib-resources==6.5.1
+importlib-resources==6.5.2
# via matplotlib
kiwisolver==1.4.7
# via matplotlib
@@ -86,9 +86,9 @@ numpy==1.26.4
# shapely
# tifffile
# unstructured-paddleocr
-opencv-contrib-python==4.10.0.84
+opencv-contrib-python==4.11.0.86
# via unstructured-paddleocr
-opencv-python==4.10.0.84
+opencv-python==4.11.0.86
# via
# imgaug
# unstructured-paddleocr
@@ -113,10 +113,8 @@ pillow==11.1.0
# pdf2image
# scikit-image
# unstructured-paddleocr
-protobuf==4.25.5
-# via
-# -c ././deps/constraints.txt
-# paddlepaddle
+protobuf==5.29.3
+# via paddlepaddle
pyclipper==1.3.0.post6
# via unstructured-paddleocr
pyparsing==3.2.1
2 changes: 1 addition & 1 deletion requirements/extra-pandoc.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-pandoc.in
#
-pypandoc==1.14
+pypandoc==1.15
# via -r ./extra-pandoc.in
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -11,5 +11,5 @@ google-cloud-vision
effdet
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
-unstructured-inference==0.8.1
+unstructured-inference==0.8.4
unstructured.pytesseract>=0.3.12