🚀 The feature, motivation and pitch
Thank you for releasing such amazing work.
Traditional OCR tools, I think, also extract text metadata such as coordinates, font, size, color, and style.
Having that information would make the tool considerably stronger.
Just out of curiosity: if you simply ask it to extract that information in the prompt, how well does it perform?
Alternatives
No response
Additional context
No response
Yeah, previous work such as PaperMage extracted the metadata and coordinates of each block and layout region, but we stepped away from that in this version. The thinking was that this pipeline is more focused on generating LM training data or LM context (e.g. "ask your PDF"-style applications), and emitting that metadata would increase the number of output tokens, which are quite expensive.
Can you share more about how you plan to use that information in your end application?
First of all, really great work. From my point of view it is the first really useful fine-tuned open-source VLLM solution / toolkit for OCR, compared to common open-source OCR solutions like paddleOCR or docTR / OnnxTR (PS: I'm the maintainer of the last two 😅).
Coordinates of each word:
- Often used in applications where the extracted information is displayed at a higher level (a frontend mask, for example) so that users can correct it afterwards
- Required for multi-stage solutions like key information extraction (for example OCR engine + LiLT)
- Additionally, it makes the results "explainable / controllable"
Layout information:
- For example, if you want to exclude specific areas, or the opposite (keep only specific areas)
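To make the "exclude specific areas" use case concrete, here is a minimal sketch of what a downstream consumer could do if word-level coordinates were available. Everything in it is an assumption for illustration: the `Word` and `Box` types, the normalized page coordinates (0 to 1, origin at the top left), and the `exclude_regions` helper are hypothetical and not part of this project's output format or of any existing OCR library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (assumed: 0..1, origin top-left)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains_center_of(self, other: "Box") -> bool:
        # A word counts as "inside" a region if its center point falls in the region.
        cx = (other.x0 + other.x1) / 2
        cy = (other.y0 + other.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR output unit: recognized text plus its bounding box."""
    text: str
    box: Box


def exclude_regions(words: list[Word], regions: list[Box]) -> list[Word]:
    """Drop every word whose center lies inside any of the given regions."""
    return [w for w in words if not any(r.contains_center_of(w.box) for r in regions)]


# Made-up example page: two body words and a page-number footer.
words = [
    Word("Invoice", Box(0.10, 0.05, 0.25, 0.08)),
    Word("Total:", Box(0.10, 0.50, 0.20, 0.53)),
    Word("Page", Box(0.45, 0.96, 0.50, 0.98)),
    Word("1", Box(0.51, 0.96, 0.52, 0.98)),
]
footer = Box(0.0, 0.95, 1.0, 1.0)  # bottom 5% of the page

kept = exclude_regions(words, [footer])
print(" ".join(w.text for w in kept))
```

The opposite case (keeping only specific areas) would be the same filter with the condition inverted. The same per-word boxes are what a frontend correction mask or a layout-aware model such as LiLT would consume as input.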