Fix DLLs to be Tesseract 5.2 #667

Open
wants to merge 4 commits into
base: master

Conversation

Methuselah96
Contributor

@Methuselah96 Methuselah96 commented Apr 16, 2024

It looks like this commit attempted to update to Tesseract 5.2, but the Tesseract DLL was not actually updated to 5.2 (and a Tesseract 5.2 EXE was added instead).

This PR:

  • Updates the Tesseract DLL to 5.2 and updates the tests as a result.
  • Adds a GitHub Actions CI to automatically build the Leptonica and Tesseract DLLs with an output that can be downloaded and included in a PR (as it is in this one).
  • Updates the existing GitHub Actions CI to run on all PRs and on pushes to master. It has been cleaned up and currently runs only on Windows, but I hope to expand it to Linux and macOS in the future.

@Methuselah96
Contributor Author

The DLLs that were added were downloaded from this build. The MD5 hashes are printed as part of the build and can be verified by hashing the DLLs locally.
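For anyone who wants to do that local check, here is a minimal sketch in Python. It assumes nothing about the build beyond what is said above: you have a DLL on disk and an MD5 hex string copied from the build log. The function names and the command-line usage are illustrative, not part of this PR.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's MD5 with a hash copied from a log (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical usage: python verify_md5.py <path-to-dll> <md5-from-build-log>
    dll_path, expected = sys.argv[1], sys.argv[2]
    print("OK" if matches_expected(dll_path, expected) else "MISMATCH")
```

A match only shows that the committed file is byte-identical to the build artifact; MD5 is fine for that kind of integrity check but is not a defense against a deliberately crafted collision.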


@Methuselah96 Methuselah96 marked this pull request as ready for review April 16, 2024 04:46
@@ -71,7 +71,7 @@ public void CanParseMultipageTifOneByOne()
 [TestCase(PageSegMode.SingleColumn, "This is a lot of 12 point text to test the")]
 [TestCase(PageSegMode.SingleLine, "This is a lot of 12 point text to test the")]
 [TestCase(PageSegMode.SingleWord, "This")]
-[TestCase(PageSegMode.SingleChar, "T")]
+[TestCase(PageSegMode.SingleChar, "hl")]

This file actually contains an image of the letter "T". However, the margin on the right-hand side of "T" is smaller than the left, and I think that's causing the auto-thresholding algorithm to invert the thresholding and recognize the text as "hl" instead. I updated the PNG and got the expected result:
OLD:
https://github.com/charlesw/tesseract/blob/master/src/Tesseract.Tests/Data/Ocr/PSM_SingleChar.png
NEW:
PSM_SingleChar
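To make the "right margin is smaller than the left" observation measurable, here is a rough Python sketch that counts blank columns on each side of the inked region. The grid is a made-up stand-in, not the actual PNG, and `ink_margins` is a hypothetical helper. It only quantifies the asymmetry; it does not confirm that the asymmetry is what trips the thresholding.

```python
def ink_margins(rows, ink="#"):
    """Return (left, right) counts of blank columns around the inked region.

    `rows` is a list of equal-length strings; `ink` marks a dark pixel.
    """
    width = len(rows[0])
    ink_columns = [x for row in rows for x, ch in enumerate(row) if ch == ink]
    if not ink_columns:
        raise ValueError("image contains no ink pixels")
    left = min(ink_columns)
    right = width - 1 - max(ink_columns)
    return left, right


# Illustrative glyph only: a "T" pushed toward the right edge.
glyph = [
    "....###.",
    ".....#..",
    ".....#..",
]
left, right = ink_margins(glyph)
print(f"left margin: {left}, right margin: {right}")
```

With a real image, the same idea would be applied to a binarized pixel array rather than strings.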

Contributor Author

@Methuselah96 Methuselah96 Aug 22, 2024

I was planning to look into that to make sure it was a regression in the Tesseract library itself and not an issue with the C# wrapper. Thanks for looking into it. I'm surprised that the Tesseract library would return more than one character when it's explicitly instructed to return only a single character.

You make an interesting point about Tesseract returning 2 characters despite the PageSegMode; it might be worth digging deeper into that as a potential Tesseract library defect.

2 participants