Text metadata #30

joey234 · 2025-02-26T01:35:45Z

🚀 The feature, motivation and pitch

Thank you for releasing such amazing work.

I think traditional OCR tools also extract text metadata such as coordinates, font, size, color, and style.
Having that information would make the tool even stronger.

Just out of curiosity: if you simply ask it to extract that information in the prompt, how well does it perform?

Alternatives

No response

Additional context

No response

jakep-allenai · 2025-02-26T17:30:35Z

Yeah, previous work such as PaperMage extracted the metadata and coordinates of each block and layout region, but we stepped away from that in this version. The thinking was that this pipeline is more focused on generating LM training data or LM context (e.g. "ask your PDF"-style applications), and this would increase the number of output tokens (which are quite expensive).

Can you share more about how you plan to use that information in your end application?

felixdittrich92 · 2025-02-27T07:20:00Z

Hi @jakep-allenai 👋

First of all, really great work. From my point of view, it's the first really useful fine-tuned open-source VLLM solution / toolkit for OCR compared to common open-source OCR solutions like paddleOCR or docTR / OnnxTR (PS: I'm the maintainer of the last two 😅)

Coordinates of each word:
- Often used in applications where the extracted information is displayed at a higher level (a frontend mask, for example) so that users can correct it afterwards
- Required for multi-stage solutions like key information extraction (for example, OCR engine + LiLT)
- Additionally, it makes the results "explainable / controllable"
Layout information:
- For example, if you want to exclude specific areas, or the opposite
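
To make the "exclude specific areas" case concrete, here is a minimal sketch of what word coordinates would enable downstream. The `Word` record, its normalized coordinates, and the helper names are all hypothetical and only for illustration; nothing here is an output format of the toolkit discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR word record (assumed for illustration only)."""
    text: str
    x0: float  # left edge, normalized to [0, 1]
    y0: float  # top edge, normalized to [0, 1]
    x1: float  # right edge
    y1: float  # bottom edge


def center_in_region(word, region):
    """Return True if the word's center lies inside region = (x0, y0, x1, y1)."""
    cx = (word.x0 + word.x1) / 2
    cy = (word.y0 + word.y1) / 2
    rx0, ry0, rx1, ry1 = region
    return rx0 <= cx <= rx1 and ry0 <= cy <= ry1


def exclude_region(words, region):
    """Keep only words whose center falls outside the given region."""
    return [w for w in words if not center_in_region(w, region)]


# Made-up example page: one header word plus two body words.
words = [
    Word("CONFIDENTIAL", 0.30, 0.02, 0.70, 0.05),
    Word("Invoice", 0.10, 0.20, 0.25, 0.23),
    Word("Total", 0.10, 0.80, 0.20, 0.83),
]
header = (0.0, 0.0, 1.0, 0.10)  # top 10% of the page
body_text = " ".join(w.text for w in exclude_region(words, header))
print(body_text)
```

The same center test, without the negation, would cover "the opposite" case of keeping only one area. Testing the center rather than full containment is just one possible choice; an overlap threshold might suit boxes that straddle a region boundary better.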
