Character confidence threshold #3860

Open
wants to merge 18 commits into
base: main

plutasnyy
Contributor

@plutasnyy plutasnyy commented Jan 6, 2025

This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.

Some notes:

  • I intentionally disabled it by default; I think a low threshold (like 0.9-0.95 for Tesseract) could be a safe choice, though.
  • I wanted to use character bboxes and combine them into a word bbox later. However, in some specific scenarios a bug in Tesseract returns incorrect character bboxes (unit tests caught it 🥳). More details are in a comment in the code.
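
The filtering idea in the description can be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes each recognized character arrives with a confidence normalized to the 0.0-1.0 range, and the names `CharPrediction` and `filter_word` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharPrediction:
    """Hypothetical container for one recognized character."""

    text: str
    confidence: float  # assumed to be normalized to the 0.0-1.0 range


def filter_word(chars: list[CharPrediction], threshold: float = 0.0) -> str:
    """Join the characters whose confidence is at or above the threshold.

    With the default threshold of 0.0 nothing is dropped, which mirrors the
    "disabled by default" behavior described above.
    """
    return "".join(c.text for c in chars if c.confidence >= threshold)


word = [
    CharPrediction("c", 0.99),
    CharPrediction("a", 0.97),
    CharPrediction("~", 0.12),  # a likely garbage character
    CharPrediction("t", 0.98),
]
print(filter_word(word))       # threshold 0.0: every character is kept
print(filter_word(word, 0.5))  # the low-confidence "~" is dropped
```

Keeping the comparison as `>=` means a threshold of 0.0 is a true no-op, so the default changes nothing for existing users.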

@plutasnyy plutasnyy marked this pull request as ready for review January 8, 2025 10:39
@plutasnyy plutasnyy requested review from badGarnet and MaksOpp January 8, 2025 10:39
Contributor

@MaksOpp MaksOpp left a comment

LGTM!

@property
def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
    """Tesseract predictions with confidence below this threshold are ignored"""
    return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)
Contributor

I wonder whether we'd like to have some really low default threshold, e.g. 0.1, just to filter out complete garbage chars?

Collaborator

I am OK with 0; the default behavior is no filtering at all, so this PR should just keep that for now. We can use follow-ups to change this value.
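
To illustrate the trade-off between the two defaults discussed here, the following is a rough sketch of an environment-backed setting. It assumes the config property reads its value from an environment variable of the same name; `get_float` is a hypothetical stand-in for the `_get_float` helper seen in the diff, whose real behavior is not shown in this conversation.

```python
import os


def get_float(var: str, default: float) -> float:
    """Read a float from the environment, falling back to the default.

    Hypothetical stand-in for the `_get_float` helper used in the diff.
    """
    raw = os.environ.get(var)
    return float(raw) if raw not in (None, "") else default


NAME = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"

os.environ.pop(NAME, None)
print(get_float(NAME, 0.0))  # unset: falls back to 0.0, i.e. no filtering

os.environ[NAME] = "0.1"
print(get_float(NAME, 0.0))  # set: only very low-confidence characters are dropped
```

Under this assumption, a default of 0.0 keeps current behavior, while users who want the "garbage filter" can opt in without a code change.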

image: np.ndarray,
lang: str = "eng",
config: str = "",
character_confidence_threshold: float = 0.5,
Contributor

Here we are adding some default, so maybe let's also keep it in config?

Contributor

I see below that we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those.
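
One possible way to unify the defaults, sketched under assumptions: both functions accept `None` and resolve it against a single shared value. The constant, `resolve_threshold`, and the two `*_stub` functions are hypothetical names for illustration; in the PR the shared value would presumably come from the config property.

```python
from typing import Optional

# Single source of truth for the default (hypothetical module-level constant;
# in the PR this role would be played by the config property).
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0


def resolve_threshold(value: Optional[float] = None) -> float:
    """Return the explicit value if given, otherwise the shared default."""
    return DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD if value is None else value


def hocr_to_dataframe_stub(
    hocr: str, character_confidence_threshold: Optional[float] = None
) -> float:
    """Stand-in that only reports the threshold it would apply."""
    return resolve_threshold(character_confidence_threshold)


def image_to_data_stub(
    hocr: str, character_confidence_threshold: Optional[float] = None
) -> float:
    """Stand-in for the outer call; it forwards the argument unchanged."""
    return hocr_to_dataframe_stub(hocr, character_confidence_threshold)


print(image_to_data_stub("<hocr/>"))       # uses the shared default
print(image_to_data_stub("<hocr/>", 0.5))  # an explicit value still wins
```

Using `None` as the sentinel, rather than repeating a literal such as 0.5 in each signature, keeps the default in one place while still allowing an explicit 0.0.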

@plutasnyy plutasnyy added this pull request to the merge queue Jan 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025
@plutasnyy plutasnyy added this pull request to the merge queue Jan 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025
@plutasnyy plutasnyy added this pull request to the merge queue Jan 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 10, 2025
ocr_df = self.hocr_to_dataframe(hocr, character_confidence_threshold)
return ocr_df

def hocr_to_dataframe(
Collaborator

What's the compute performance of this code? We were essentially relying on Tesseract's internal C++ code to parse the results, but here we do it in Python.

Comment on lines +130 to +131
"width": right - left,
"height": bottom - top,
Collaborator

Small nit on performance: we can create the df from the bboxes first, then use vector ops to compute width and height (and overwrite the data for right and bottom).
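
A minimal sketch of the vectorized approach suggested in this comment, assuming pandas is available and that each box is a `(left, top, right, bottom)` tuple. The sample values are made up, and this variant adds new columns and drops the old ones rather than overwriting `right` and `bottom` in place.

```python
import pandas as pd

# Hypothetical per-word bounding boxes as (left, top, right, bottom).
bboxes = [(10, 20, 50, 40), (60, 20, 120, 45)]

# Build the frame from the raw boxes first ...
df = pd.DataFrame(bboxes, columns=["left", "top", "right", "bottom"])

# ... then derive width and height with column-wise (vectorized) arithmetic
# instead of computing them row by row while building Python dicts.
df["width"] = df["right"] - df["left"]
df["height"] = df["bottom"] - df["top"]

# Keep the left/top/width/height layout; right and bottom are no longer needed.
df = df.drop(columns=["right", "bottom"])
print(df)
```

Whether this is measurably faster depends on the number of boxes per page; for small frames the difference is likely negligible.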

3 participants