feat: add figures in markdown export (#27)
Signed-off-by: Michele Dolfi <[email protected]>
dolfim-ibm authored Sep 23, 2024
1 parent ded530a commit b843ae6
Showing 2 changed files with 35 additions and 1 deletion.
18 changes: 17 additions & 1 deletion docling_core/types/doc/document.py
@@ -434,7 +434,7 @@ def get_map_to_page_dimensions(self):

return pagedims

def export_to_markdown(
def export_to_markdown( # noqa: C901
self,
delim: str = "\n\n",
main_text_start: int = 0,
@@ -445,8 +445,10 @@ def export_to_markdown(
"paragraph",
"caption",
"table",
"figure",
],
strict_text: bool = False,
image_placeholder: str = "<!-- image -->",
) -> str:
r"""Serialize to Markdown.
@@ -460,6 +462,12 @@ def export_to_markdown(
Defaults to 0.
main_text_end (Optional[int], optional): Main-text slicing stop index
(exclusive). Defaults to None.
main_text_labels (list[str], optional): The labels to include in the
markdown.
strict_text (bool, optional): if true, the output will be only plain text
without any markdown styling. Defaults to False.
image_placeholder (str, optional): the placeholder to include to position
images in the markdown. Defaults to a markdown comment "<!-- image -->".
Returns:
str: The exported Markdown representation.
@@ -539,6 +547,14 @@ def export_to_markdown(

markdown_text = md_table

elif isinstance(item, Figure) and item_type in main_text_labels:

markdown_text = ""
if not strict_text:
markdown_text = f"{image_placeholder}"
if item.text:
markdown_text += "\n" + item.text

if markdown_text:
md_texts.append(markdown_text)

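The new figure branch in the diff above can be exercised in isolation. The sketch below reproduces its logic with a minimal stand-in `Figure` stub and a hypothetical helper `figure_to_markdown` (neither is part of the docling_core API; they only mirror the branch's behavior for `strict_text` and `image_placeholder`):

```python
from dataclasses import dataclass


@dataclass
class Figure:
    """Hypothetical stub standing in for the docling_core Figure type."""

    text: str = ""


def figure_to_markdown(
    item: Figure,
    strict_text: bool = False,
    image_placeholder: str = "<!-- image -->",
) -> str:
    """Render a figure item the way the new export branch does:
    placeholder first (unless strict_text), then the caption text."""
    markdown_text = ""
    if not strict_text:
        markdown_text = f"{image_placeholder}"
    if item.text:
        markdown_text += "\n" + item.text
    return markdown_text


print(figure_to_markdown(Figure(text="Fig. 1. Example caption.")))
# <!-- image -->
# Fig. 1. Example caption.
```

Note that with `strict_text=True` a captioned figure still yields a leading newline before its text, while an uncaptioned one yields an empty string and is skipped by the `if markdown_text:` guard.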
18 changes: 18 additions & 0 deletions test/data/doc/doc-export.md
@@ -16,6 +16,9 @@ In modern document understanding systems [1,15], table extraction is typically a

Fig. 1. Comparison between HTML and OTSL table structure representation: (A) table-example with complex row and column headers, including a 2D empty span, (B) minimal graphical representation of table structure using rectangular layout, (C) HTML representation, (D) OTSL representation. This example demonstrates many of the key-features of OTSL, namely its reduced vocabulary size (12 versus 5 in this case), its reduced sequence length (55 versus 30) and a enhanced internal structure (variable token sequence length per row in HTML versus a fixed length of rows in OTSL).

<!-- image -->
Fig. 1. Comparison between HTML and OTSL table structure representation: (A) table-example with complex row and column headers, including a 2D empty span, (B) minimal graphical representation of table structure using rectangular layout, (C) HTML representation, (D) OTSL representation. This example demonstrates many of the key-features of OTSL, namely its reduced vocabulary size (12 versus 5 in this case), its reduced sequence length (55 versus 30) and a enhanced internal structure (variable token sequence length per row in HTML versus a fixed length of rows in OTSL).

today, table detection in documents is a well understood problem, and the latest state-of-the-art (SOTA) object detection methods provide an accuracy comparable to human observers [7,8,10,14,23]. On the other hand, the problem of table structure recognition (TSR) is a lot more challenging and remains a very active area of research, in which many novel machine learning algorithms are being explored [3,4,5,9,11,12,13,14,17,18,21,22].

Recently emerging SOTA methods for table structure recognition employ transformer-based models, in which an image of the table is provided to the network in order to predict the structure of the table as a sequence of tokens. These image-to-sequence (Im2Seq) models are extremely powerful, since they allow for a purely data-driven solution. The tokens of the sequence typically belong to a markup language such as HTML, Latex or Markdown, which allow to describe table structure as rows, columns and spanning cells in various configurations. In Figure 1, we illustrate how HTML is used to represent the table-structure of a particular example table. Public table-structure data sets such as PubTab-Net [22], and FinTabNet [21], which were created in a semi-automated way from paired PDF and HTML sources (e.g. PubMed Central), popularized primarily the use of HTML as ground-truth representation format for TSR.
@@ -44,6 +47,9 @@ ulary and can be interpreted as a table structure. For example, with the HTML to

Fig. 2. Frequency of tokens in HTML and OTSL as they appear in PubTabNet.

<!-- image -->
Fig. 2. Frequency of tokens in HTML and OTSL as they appear in PubTabNet.

Obviously, HTML and other general-purpose markup languages were not designed for Im2Seq models. As such, they have some serious drawbacks. First, the token vocabulary needs to be artificially large in order to describe all plausible tabular structures. Since most Im2Seq models use an autoregressive approach, they generate the sequence token by token. Therefore, to reduce inference time, a shorter sequence length is critical. Every table-cell is represented by at least two tokens (<td> and </td>). Furthermore, when tokenizing the HTML structure, one needs to explicitly enumerate possible column-spans and row-spans as words. In practice, this ends up requiring 28 different HTML tokens (when including column-and row-spans up to 10 cells) just to describe every table in the PubTabNet dataset. Clearly, not every token is equally represented, as is depicted in Figure 2. This skewed distribution of tokens in combination with variable token row-length makes it challenging for models to learn the HTML structure.

Additionally, it would be desirable if the representation would easily allow an early detection of invalid sequences on-the-go, before the prediction of the entire table structure is completed. HTML is not well-suited for this purpose as the verification of incomplete sequences is non-trivial or even impossible.
@@ -78,6 +84,9 @@ A notable attribute of OTSL is that it has the capability of achieving lossless

Fig. 3. OTSL description of table structure: A-table example; B-graphical representation of table structure; C-mapping structure on a grid; D-OTSL structure encoding; E-explanation on cell encoding

<!-- image -->
Fig. 3. OTSL description of table structure: A-table example; B-graphical representation of table structure; C-mapping structure on a grid; D-OTSL structure encoding; E-explanation on cell encoding

## 4.2 Language Syntax

The OTSL representation follows these syntax rules:
@@ -110,6 +119,9 @@ To evaluate the impact of OTSL on prediction accuracy and inference times, we co

Fig. 4. Architecture sketch of the TableFormer model, which is a representative for the Im2Seq approach.

<!-- image -->
Fig. 4. Architecture sketch of the TableFormer model, which is a representative for the Im2Seq approach.

We rely on standard metrics such as Tree Edit Distance score (TEDs) for table structure prediction, and Mean Average Precision (mAP) with 0.75 Intersection Over Union (IOU) threshold for the bounding-box predictions of table cells. The predicted OTSL structures were converted back to HTML format in

order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.
@@ -152,12 +164,18 @@ To illustrate the qualitative differences between OTSL and HTML, Figure 5 demons

Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). 'PMC2807444_006_00.png ' PubTabNet. μ

<!-- image -->
Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). 'PMC2807444_006_00.png ' PubTabNet. μ

μ


Fig. 6. Visualization of predicted structure and detected bounding boxes on a complex table with many rows. The OTSL model (B) captured repeating pattern of horizontally merged cells from the GT (A), unlike the HTML model (C). The HTML model also didn't complete the HTML sequence correctly and displayed a lot more of drift and overlap of bounding boxes. 'PMC5406406_003_01.png ' PubTabNet.

<!-- image -->
Fig. 6. Visualization of predicted structure and detected bounding boxes on a complex table with many rows. The OTSL model (B) captured repeating pattern of horizontally merged cells from the GT (A), unlike the HTML model (C). The HTML model also didn't complete the HTML sequence correctly and displayed a lot more of drift and overlap of bounding boxes. 'PMC5406406_003_01.png ' PubTabNet.

## 6 Conclusion

We demonstrated that representing tables in HTML for the task of table structure recognition with Im2Seq models is ill-suited and has serious limitations. Furthermore, we presented in this paper an Optimized Table Structure Language (OTSL) which, when compared to commonly used general purpose languages, has several key benefits.
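The fixture diff above shows the pattern this commit introduces: each exported figure contributes one `<!-- image -->` placeholder line, followed by its caption text when present. A quick illustrative check of that pattern (the sample string below is made up for the sketch, not the actual fixture content):

```python
# Sample exported markdown following the new figure-export pattern:
# one placeholder line per figure, caption text on the next line.
exported = """Intro paragraph.

<!-- image -->
Fig. 1. First caption.

Body text.

<!-- image -->
Fig. 2. Second caption.
"""

placeholder = "<!-- image -->"
lines = exported.splitlines()

# Count figures and collect the caption line following each placeholder.
figure_count = exported.count(placeholder)
captions = [
    lines[i + 1]
    for i, line in enumerate(lines)
    if line == placeholder and i + 1 < len(lines)
]

print(figure_count)  # 2
print(captions)      # ['Fig. 1. First caption.', 'Fig. 2. Second caption.']
```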
