update to tesseract 5 #21

Open
keighrim opened this issue Feb 6, 2025 · 0 comments
Labels
✨N New feature or request


keighrim commented Feb 6, 2025

Recent small-scale in-house experiments show that Tesseract 5 (T5) outperforms Tesseract 4 in accuracy, as measured by character error rate (CER). Hence we would like to update this app to T5 instead of retiring it in favor of apps based on larger models (such as doctr or llava). This will keep an entry in the app directory for "good-enough" text recognition in restricted hardware environments.

I'd also like to take this update as an opportunity to partially address the clamsproject/app-role-filler-binder#4 issue: the updated T5 wrapper app should use the current docTR app as a reference implementation for organizing the input/output MMIF structure. This means we would like to wrap Tesseract's internal layout hierarchy in our vocabulary terms. Concretely, in the docTR wrapper, we translated

  1. Page: no translation as we only deal with single-page scenario (one image at a time)
  2. Block --> Paragraph from LAPPS vocab
  3. Line --> Sentence from LAPPS vocab
  4. Word --> Token from LAPPS vocab

Similarly, for T5, we will translate

  1. Page: no translation as we only deal with single-page scenario (one image at a time)
  2. Block --> Paragraph from LAPPS vocab
  3. Par: no translation, as LAPPS Paragraph is already used for Block and the LAPPS vocab has no intermediate level between Paragraph and Sentence.
  4. Line --> Sentence from LAPPS vocab
  5. Word --> Token from LAPPS vocab

This will naturally address #18 once implemented.
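
A minimal sketch of the T5 mapping above, for illustration only. It assumes the OCR result arrives as a dict of parallel lists shaped like the output of `pytesseract.image_to_data(..., output_type=Output.DICT)`, where the `level` field uses 1=page, 2=block, 3=par, 4=line, 5=word. The names `LEVEL_TO_LAPPS` and `to_lapps_records` are hypothetical, and the real app would build MMIF annotations rather than plain dicts.

```python
from typing import Dict, List, Optional

# Tesseract "level" values: 1=page, 2=block, 3=par, 4=line, 5=word.
# None means the level is not translated into a LAPPS type.
LEVEL_TO_LAPPS: Dict[int, Optional[str]] = {
    1: None,         # Page: one image at a time, so no annotation
    2: "Paragraph",  # Block
    3: None,         # Par: skipped, Paragraph is already taken by Block
    4: "Sentence",   # Line
    5: "Token",      # Word
}


def to_lapps_records(data: Dict[str, list]) -> List[dict]:
    """Turn Tesseract-style layout rows into (type, box, text) records.

    `data` is assumed to hold parallel lists under the keys
    level, left, top, width, height, and text.
    """
    records = []
    for i, level in enumerate(data["level"]):
        lapps_type = LEVEL_TO_LAPPS.get(int(level))
        if lapps_type is None:
            continue  # untranslated level (Page or Par)
        left, top = data["left"][i], data["top"][i]
        records.append({
            "type": lapps_type,
            # box as (x1, y1, x2, y2) in pixel coordinates
            "box": (left, top, left + data["width"][i], top + data["height"][i]),
            # only word-level rows carry recognized text
            "text": data["text"][i] if lapps_type == "Token" else None,
        })
    return records


# Hand-written sample standing in for real OCR output (not real results).
sample = {
    "level":  [1, 2, 3, 4, 5, 5],
    "left":   [0, 10, 10, 10, 10, 60],
    "top":    [0, 10, 10, 10, 10, 10],
    "width":  [200, 100, 100, 100, 40, 50],
    "height": [100, 20, 20, 20, 20, 20],
    "text":   ["", "", "", "", "Hello", "world"],
}

for record in to_lapps_records(sample):
    print(record)
```

Keeping the mapping in one table means the docTR and T5 wrappers could share the same downstream code for emitting Paragraph, Sentence, and Token annotations, with only the table differing per OCR engine.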

@keighrim keighrim added the ✨N New feature or request label Feb 6, 2025
@clams-bot clams-bot added this to apps Feb 6, 2025
@github-project-automation github-project-automation bot moved this to Todo in apps Feb 6, 2025