feat: Add readingorder model #44

PeterStaar-IBM · 2024-10-26T05:27:32Z

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the
conventional commits.
Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

Signed-off-by: Peter Staar <[email protected]>

mllife · 2024-11-26T11:42:42Z

Are you rewriting the C++ code from ds_glm_model for ordering, as referenced here: DS4SD/docling#361 (reply in thread)?

mergify · 2024-12-13T13:25:05Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
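As a rough illustration of what this rule checks, the title pattern can be exercised with Python's `re` module. The helper name `is_conventional_title` is hypothetical, and Mergify's own matching semantics may differ in details, so treat this as a sketch only.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper; the pattern is anchored with ^, so re.match and
    re.search behave the same here.
    """
    return TITLE_PATTERN.match(title) is not None


print(is_conventional_title("feat: Add readingorder model"))  # True
print(is_conventional_title("added ReadingOrder model"))      # False
```

This is consistent with the title change recorded in the timeline, where the original title was replaced by one carrying the `feat:` prefix.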

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

cau-git · 2025-02-06T17:57:51Z

tests/test_reading_order.py

+            print("true: ", str(true_elem), ", rand: ", str(rand_elem))
+        """
+
+        pred_elements = romodel.predict_reading_order(page_elements=rand_elements)


As far as I can see, the output of predict_reading_order is a permutation of the page_elements it accepts as input. This argument is a list of objects following a custom PageElement data model, which combines BoundingBox and DocItem attributes.
To make sure there is a consistent interface for DoclingDocument, the unit test should demonstrate how to apply this re-sorted page_elements list to the DoclingDocument for all the supported operations:

Change the order of DocItem instances in the body of a DoclingDocument

Merge multiple separate DocItem instances into a single DocItem instance when they must be merged (i.e. two or more provenances in one item)

Change a DocItem to have a caption, footnote, etc. referencing it.

(All of this obviously also needs API extensions on DoclingDocument.)

Ultimately, the higher-level reading-order model must have an API that accepts a DoclingDocument and returns a new DoclingDocument with correct order. That does not necessarily need to happen in docling-ibm-models, but in that case it should be as easy as possible to do outside of it.
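The permutation property described in this comment could be checked in a test roughly as follows. This is a sketch with a stand-in `Element` dataclass and a hypothetical `is_permutation` helper; it does not use the real PageElement model, the `cid` field is an assumption, and a plain sort stands in for the model call.

```python
from collections import Counter
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Element:
    """Stand-in for a page element (not the real PageElement model)."""
    cid: int    # assumed unique id per element
    label: str


def is_permutation(inputs: list[Element], outputs: list[Element]) -> bool:
    """True if outputs contains exactly the input elements, in any order."""
    return Counter(e.cid for e in inputs) == Counter(e.cid for e in outputs)


true_elements = [Element(0, "title"), Element(1, "text"), Element(2, "caption")]
rand_elements = random.sample(true_elements, k=len(true_elements))  # shuffled copy

# Placeholder "prediction": sorting by cid stands in for the model call.
pred_elements = sorted(rand_elements, key=lambda e: e.cid)

assert is_permutation(rand_elements, pred_elements)
assert not is_permutation(rand_elements, pred_elements[:-1])
```

A check like this only confirms that no element was dropped or duplicated; it says nothing about whether the order itself is correct, which needs ground truth.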

> Ultimately, the higher-level reading-order model must have an API that accepts a DoclingDocument and returns a new DoclingDocument with correct order

At a high level I agree with the statement, but how do we see the merging of items happening, e.g. for paragraphs split across columns or pages? They are very likely identified only after the pieces are in the correct reading order, which means the DoclingDocument items are not yet completely defined.

@cau-git I don't think you fully grasp how we designed the reading-order model. It is separated into different tasks, so that they do not get confused:

It orders a set of PageElements. This ensures that we always return exactly the same number of elements as we get.

It generates links between PageElements (to_caption, to_footnote, merge) in order to collapse PageElements into DocItems

We could have a higher level method that takes a DoclingDocument and returns an ordered one, but we also need the low-level, so we can measure the accuracy of the methods.
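To make the two steps concrete, here is a minimal sketch of the second one: collapsing already-ordered elements into items using `merge` links. The `Element` class, the shape of the `merges` mapping, and `collapse_merges` are all hypothetical; the actual predictor's data structures may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Stand-in for an ordered page element (hypothetical)."""
    cid: int
    text: str


def collapse_merges(
    ordered: list[Element], merges: dict[int, list[int]]
) -> list[list[Element]]:
    """Group ordered elements into items.

    `merges` maps a head element's cid to the cids that continue it
    (assumed shape). Elements without links become single-element items.
    """
    by_cid = {e.cid: e for e in ordered}
    absorbed = {cid for tails in merges.values() for cid in tails}
    items = []
    for elem in ordered:
        if elem.cid in absorbed:
            continue  # already attached to its head element
        group = [elem] + [by_cid[c] for c in merges.get(elem.cid, [])]
        items.append(group)
    return items


ordered = [
    Element(0, "A paragraph that starts in column one"),
    Element(1, "and continues in column two."),
    Element(2, "A separate paragraph."),
]
items = collapse_merges(ordered, merges={0: [1]})

# Ordering preserves the element count; only this step reduces it.
print([len(group) for group in items])  # [2, 1]
```

Keeping the collapse separate from the ordering means the ordering step can be scored on its own, since its output always has the same length as its input.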

I clearly understand why we need the low level methods. What I want to understand is if we want to start from a DoclingDocument representation (like this test assumes) or from an internal representation like what the layout model produces. The current state would allow for both.

dolfim-ibm

lgtm 🚀

PeterStaar-IBM added 6 commits October 26, 2024 07:26

added ReadingOrder model

156a98a

Signed-off-by: Peter Staar <[email protected]>

updated the ReadinOrder

93aaa57

Signed-off-by: Peter Staar <[email protected]>

finished the first porting of the reading-order

20fa950

Signed-off-by: Peter Staar <[email protected]>

added a test and refactored the reading-order-model

ee95a52

Signed-off-by: Peter Staar <[email protected]>

tests scripts are WIP

860ab95

Signed-off-by: Peter Staar <[email protected]>

first running reading order model

7e3a202

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM self-assigned this Nov 18, 2024

PeterStaar-IBM requested review from cau-git and nikos-livathinos November 18, 2024 09:01

PeterStaar-IBM added 3 commits January 25, 2025 05:39

merged with main

f1ddd85

Signed-off-by: Peter Staar <[email protected]>

work in progress

a07f5e2

Signed-off-by: Peter Staar <[email protected]>

got RO, to-captions and to-footnotes working

3856504

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM mentioned this pull request Jan 27, 2025

feat: Add the reading-order model from docling-ibm-models [WIP] DS4SD/docling#811

Closed

3 tasks

nikos-livathinos and others added 6 commits January 27, 2025 14:19

chore: Code styling for ReadingOrderPredictor

b18f2da

Signed-off-by: Nikos Livathinos <[email protected]>

merged with mypy cleaning

b83a748

Signed-off-by: Peter Staar <[email protected]>

working on the reading-order

715b9e8

Signed-off-by: Peter Staar <[email protected]>

fixed the sorting of heads

5c4e4a8

Signed-off-by: Peter Staar <[email protected]>

implemented new to_captions method

804d8c9

Signed-off-by: Peter Staar <[email protected]>

added datasets for reading-order

9d8f5d5

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM changed the title ~~added ReadingOrder model~~ feat: Add readingorder model Feb 5, 2025

PeterStaar-IBM added 3 commits February 5, 2025 09:56

updated the checks to python 3.13

57fc9ef

Signed-off-by: Peter Staar <[email protected]>

updated the pyproject to have the latest torch-vision compatible with…

5edb94c

… python 3.13

Signed-off-by: Peter Staar <[email protected]>

updated tests for layout

26aa6ad

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM requested a review from dolfim-ibm February 5, 2025 15:19

PeterStaar-IBM assigned cau-git Feb 5, 2025

PeterStaar-IBM added 3 commits February 5, 2025 16:29

updated the tests with reading-order on docling-dpbench

e05a4b1

Signed-off-by: Peter Staar <[email protected]>

cleaned up the pyproject

f19c954

Signed-off-by: Peter Staar <[email protected]>

cleaned up the test

39066dd

Signed-off-by: Peter Staar <[email protected]>

dolfim-ibm reviewed Feb 5, 2025

pyproject.toml (outdated, resolved)

dolfim-ibm and others added 4 commits February 5, 2025 16:45

cleanup pyproject and lock for py3.13

ea14c07

Signed-off-by: Michele Dolfi <[email protected]>

Merge pull request #77 from DS4SD/pin-pyproject

086706e

fix pyproject pining

finalised the reading order

8f913a7

Signed-off-by: Peter Staar <[email protected]>

cleaned code

1d2dd93

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM marked this pull request as ready for review February 6, 2025 13:09

cau-git reviewed Feb 6, 2025

PeterStaar-IBM and others added 2 commits February 7, 2025 17:29

fix for multipage reading-order

2f88418

Signed-off-by: Peter Staar <[email protected]>

Fixes for to_caption

6892adf

Signed-off-by: Christoph Auer <[email protected]>

cau-git mentioned this pull request Feb 19, 2025

feat: Add ReadingOrderEvaluator for new reading-order model DS4SD/docling-eval#29

Merged

Fix styling

6f16878

Signed-off-by: Christoph Auer <[email protected]>

cau-git requested a review from dolfim-ibm February 19, 2025 14:40

dolfim-ibm previously approved these changes Feb 19, 2025

Merge from main

a02e75d

Signed-off-by: Christoph Auer <[email protected]>

cau-git dismissed dolfim-ibm’s stale review via a02e75d February 19, 2025 15:50

Fix mypy

27cfa15

Signed-off-by: Christoph Auer <[email protected]>

cau-git previously approved these changes Feb 19, 2025

Saidgurbuz previously approved these changes Feb 19, 2025

dolfim-ibm previously approved these changes Feb 19, 2025

nikos-livathinos previously approved these changes Feb 19, 2025

Update test units

ffa4adf

Signed-off-by: Christoph Auer <[email protected]>

cau-git dismissed stale reviews from nikos-livathinos, dolfim-ibm, Saidgurbuz, and themself via ffa4adf February 19, 2025 16:26

Fix usage of iterate_items

a49b993

Signed-off-by: Christoph Auer <[email protected]>

cau-git approved these changes Feb 20, 2025

dolfim-ibm approved these changes Feb 20, 2025

cau-git merged commit 23c1696 into main Feb 20, 2025
8 checks passed

cau-git deleted the dev/add-reading-order branch February 20, 2025 07:46
