Fix DLLs to be Tesseract 5.2 #667
base: master
The DLLs that were added were downloaded from this build. The MD5 hashes are printed as part of the build and can be verified by hashing the DLLs locally.
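For anyone who wants to repeat that check, here is a minimal sketch of hashing a file locally and comparing it against an expected MD5 value. The class name `DllHashCheck`, the method `ComputeMd5Hex`, and the command-line shape are hypothetical and not part of this repository; the file paths and expected hashes are supplied by the caller, taken from the build output.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the Tesseract wrapper.
public static class DllHashCheck
{
    // Returns the MD5 digest of the file at `path` as lowercase hex.
    public static string ComputeMd5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = md5.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Usage: DllHashCheck <file> <expected-md5> [<file> <expected-md5> ...]
    public static int Main(string[] args)
    {
        if (args.Length == 0 || args.Length % 2 != 0)
        {
            Console.Error.WriteLine("Usage: DllHashCheck <file> <expected-md5> ...");
            return 2;
        }

        bool allMatch = true;
        for (int i = 0; i < args.Length; i += 2)
        {
            string actual = ComputeMd5Hex(args[i]);
            // Hex digests are compared case-insensitively.
            bool match = string.Equals(actual, args[i + 1], StringComparison.OrdinalIgnoreCase);
            Console.WriteLine("{0}  {1}  {2}", match ? "OK      " : "MISMATCH", actual, args[i]);
            allMatch &= match;
        }

        // Non-zero exit code if any file did not match its expected hash.
        return allMatch ? 0 : 1;
    }
}
```

MD5 is used here only because that is what the build prints; it confirms the files are the same ones the build produced, not that they are safe against deliberate tampering.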
@@ -71,7 +71,7 @@ public void CanParseMultipageTifOneByOne()
[TestCase(PageSegMode.SingleColumn, "This is a lot of 12 point text to test the")]
[TestCase(PageSegMode.SingleLine, "This is a lot of 12 point text to test the")]
[TestCase(PageSegMode.SingleWord, "This")]
[TestCase(PageSegMode.SingleChar, "T")]
[TestCase(PageSegMode.SingleChar, "hl")]
This file actually contains an image of the letter "T". However, the margin on the right-hand side of "T" is smaller than the left, and I think that's causing the auto-thresholding algorithm to invert the thresholding and recognize the text as "hl" instead. I updated the PNG and got the expected result:
OLD:
https://github.com/charlesw/tesseract/blob/master/src/Tesseract.Tests/Data/Ocr/PSM_SingleChar.png
NEW:
I was planning to look into that to make sure it was a regression in the Tesseract library itself and not an issue with the C# wrapper. Thanks for looking into it. I'm surprised that the Tesseract library would return more than one character when it's explicitly instructed to return only a single character.
You make an interesting point about Tesseract returning 2 characters despite the PageSegMode. It might be worth digging deeper into that as a potential Tesseract library defect.
It looks like this commit attempted to update to Tesseract 5.2, but the Tesseract DLL was not actually updated to 5.2 (and a Tesseract 5.2 EXE was added instead).
This PR: