Text metadata #30

Open
joey234 opened this issue Feb 26, 2025 · 2 comments
Comments

@joey234

joey234 commented Feb 26, 2025

🚀 The feature, motivation and pitch

Thank you for releasing such amazing work.

I think traditional OCR tools also extract text metadata such as coordinates, font, size, color, and style.
Having that information would make the tool considerably stronger.

Just out of curiosity: if you simply ask it to extract that information in the prompt, how well does it perform?

Alternatives

No response

Additional context

No response

@jakep-allenai
Collaborator

Yeah, previous work such as PaperMage extracted the metadata and coordinates of each block and layout region, but we stepped away from that in this version. The thinking was that this pipeline is more focused on generating LM training data or LM context (e.g. "ask your PDF"-style applications), and emitting that metadata would increase the number of output tokens (which are quite expensive).

Can you share more about how you plan to use that information in your end application?

@felixdittrich92

Hi @jakep-allenai 👋

First of all, really great work. From my point of view it's the first really useful fine-tuned open-source VLLM solution / toolkit for OCR compared to common open-source OCR solutions like PaddleOCR or docTR / OnnxTR (PS: I'm the maintainer of the last two 😅)

  • Coordinates of each word:
    • Often used in applications where the extracted information is displayed at a higher level (a frontend mask, for example) so that users can post-correct it
    • Required for multi-stage solutions like key information extraction (for example OCR engine + LiLT)
    • Additionally, it makes the results "explainable / controllable"
  • Layout information
    • For example, if you want to exclude specific areas, or conversely keep only those areas