Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

GerHobbelt
Copy link
Contributor

follow-up to dev mailing list discuss + issue #3998: RFC: Info messages to stdout from training utils (Issue #3998)

Here's the extract/backport from my master branch:

  • use tprintf() for all error/warning/info/debug-diagnostics output --> fprintf(stderr, msg) --> tprintf(msg). This also anticipates migrating tesseract to using the fmt library (C++20) as done by @stweil in a dev branch.

  • all error messages are prefixed with 'ERROR: '. This enables users, f.e., to grep/filter the stderr/log using a regex like
    /^ERROR: /
    to filter out the error messages.

    Application code which overrides the tprintf() API is another example: there error messages can be easily picked up and sent to different monitor channels / GUI windows / etc. in environments where there's no stdout/stderr (MSWindows GUI) or where end users are not accustomed to watching stdout/stderr.

  • ditto for 'WARNING: ' level messages.

CAVEAT: this is a manual transfer from my master branch + code review while I was at it. I hope I got them all... :-)

…stdout from training utils (Issue tesseract-ocr#3998)

Here's the extract/backport from my master branch:

- use tprintf() for all error/warning/info/debug-diagnostics output --> fprintf(stderr, msg) --> tprintf(msg). This also anticipates migrating tesseract to using the fmt library (C++20) as done by @stweil in a dev branch.
- all error messages are prefixed with 'ERROR: '. This enables users, f.e., to grep/filter the stderr/log using a regex like '/^EERROR: /' to filter out the error messages. Application code which overrides the tprintf() API is another example: there error messages can be easily picked up and sent to different monitor channels / GUI windows / etc. in environments where there's no stdout/stderr (MSWindows GUI) or where end users are not accustomed to watching stdout/stderr.
- ditto for 'WARNING: ' level messages.

CAVEAT: this is a manual transfer from my master branch + code review whilee I was at it. I hope I got them all... :-)
@GerHobbelt
Copy link
Contributor Author

Notes:

  • adds an error message when UZN file loading fails -- rather than being utterly silent about that.
  • Edits a few error messages minimally to make them easier to understand -- at least for me. 😉
  • As a rule of thumb: version info goes to stdout, but anything else that's not direct OCR results goes to stderr, via tprintf. Thus one can easily redirect stdout to file and not be bothered with a having to pull apart a mix of data and diagnostics messages.

… course, one only sees it when scrolling through the pullreq after the fact.
@egorpugin
Copy link
Contributor

This is not right.
Proper way is to add fun/macro error() or warning() or tprintf_error() ... that will wrap text with prefix ERROR: or MESSAGE: .

@GerHobbelt
Copy link
Contributor Author

GerHobbelt commented Feb 25, 2023

Ok. Do note however, that a few of those prefixes aren't really prefixes as they are preceded by a "\n" that's part of the same message string, so a function will not cover 100% of cases. 😉

Given that, I suggest I do a macro as then the result remains the same at compile time, e.g. tprintf("\n" TPRNT_ERROR("blah blah"), ....) --> expands to a single constant string. Also nice because fmt depends on the format string to be a const_expr, in case tesseract mainline decides to migrate to using fmt as was done by @stweil in a dev branch of his (and which I blatantly ripped for my own).

BTW: definitely not an ERROR() or error() macro as that has a huge chance of clashing with foreign code. ERROR, for example, is a 100% collision with a system-level ERROR define in Windows (been there, seen that, t-shirt, elsewhere)

Therefore I propose TPRNT_ERROR(msg) and TPRNT_WARNING(msg) macros. Deal? 😉

@GerHobbelt
Copy link
Contributor Author

BTW: there are two important levels for the tprintf() messages:

  • Error
  • Warning

and the rest is not prefixed (ok, a rare 'INFO:' may be observed; I'm only consistent 95% of the time 😄 ) so all info/debug messages come without a prefix, like before.
This keeps the general diagnostics output (which is loud and needed if you want to find out what tesseract is doing at all, I find) clean from useless prefixes that only clutter console/logfile space.

Meanwhile the important stuff (errors and warnings) can be easily filtered/routed in applications, where those are relevant to report/log without being obscured by the diag output noise.


If you are of a different opinion re the above levels & prefixes 'rule', I'm all ears and I'll see how this patch can be tweaked.

@egorpugin
Copy link
Contributor

egorpugin commented Feb 25, 2023

If current tprint means logger, we can name new entities log_error(), log_warning(), log_info() etc.
If you want TPRNT_ERROR I think it's better to name it as TPRINT_*. No need to shorten 1 letter.

@GerHobbelt
Copy link
Contributor Author

@egorpugin : before I tackle this, there's also tlog() and ERROR::error() which invokes fprintf(stderr, ...) directly, so the question is: what would you like?

I'm myself of two minds about this: I could do it as a tprint_error/tprint_warning/tprint_info set1 (do we want the tlog() to be merged/mixed in with that, as we're talking code refactoring anyway), which has my minor preference, or we could do it as tprintf-the-whole-darn-lot-with-ERROR+WARNING-macros, or, syslog-like levels, e.g. tprintf(ERROR, msg...) 🤔

If we can get some kind of a management decision on this refactor action, that would be great. 😄 Then I can do it in my own fork and backport it, like the stuff done so far, but I'd rather do the refactor once as it's... some work. And finding out I did it in a direction that's not liked, well... that'll cost me later for a feature I don't mind otherwise while I keep tracking your mainline repo: increased expected cost of git merge collisions. 😭 Anyway, so's you know why I ask this.

Thanks! 👍


Side note: ERROR::error() also has an 'abort'/fatal mode a la glog (google log library, see github). I tackle that sort of thing in my backend by pumping everything through the supposed channels and in case of a fatal, doing a hard flush. That means we'll have something like log_flush() and log_abort() too: I like to keep the flush and actual abort logic separate -- own backend has debugger detection built-in so the fail-on-null-pointer-dereference is then exchanged for a jump-into-debugger call. By providing log_flush() + log_abort() anyone can override those actions easily, like I did for my own backend.

OK or do you want it done differently?


Footnotes

  1. which meshes nicely with my back-end where I have different error/warning/info/debug calls, so log_error(), log_warning(), log_info() is you suggested earlier makes that a cheaper tactic at run-time. 👍 Hence the light preference for this approach.

src/ccstruct/blread.cpp Fixed Show fixed Hide fixed
@egorpugin
Copy link
Contributor

Just leave it as is with "ERROR: .
Further refactorings will improve it.

In the second part you are talking about more like full featured logger rather than some one-function-printer.
If tesseract really needs this with sources & sinks & MT support etc, it can be handled with switching to boost log or spd log or small but consistent own implementation. It is not this ticket topic I guess.

@GerHobbelt
Copy link
Contributor Author

Ok, if I understood that correctly, you want to keep this mostly as-is.

I'll do the tprintf() et al -> log_error() et al migration as a separate pullreq later. BTW: the key argument there is to NOT have spdlog/glog/whatever in tesseract itself, but to keep tesseract mostly as-is, BUT provide a somewhat easy way for OTHERS to modify or hook tesseract up to such systems if THEIR use of the tesseract library requires such measures.

In other words:

  • tesseract-the-application should simply keep writing to stderr as before -- only through a small set of functions, such as tprintf.
  • tesseract-the-library can do the same, or...
  • tesseract-the-library can have its tprintf(),etc. output hooked into and rerouted to whatever the application programmer needs, e.g. glog or completely different means.

Hope I made myself clear on the goal there -- which is ultimately what I need / use.
Being consistent in the format (prefix, rather) of error and warning messages is a first step towards all that.
Any refactoring in that direction can be done in a separate pullreq as I understand you suggest; it would nicely separate the changes intended for a different goal.

Apologies for any confusion I might've caused.

…s expected. Cause: omission in backporting this from tprintf+fmt code. Caught by CI.
@egorpugin
Copy link
Contributor

How do you imagine hooking?

@GerHobbelt
Copy link
Contributor Author

Something simple. After all, tesseract (the application) doesn't need the fancy footwork, nor do many uses as a library.

Currently what I've got is a blunt, hardcoded override in tprintf.cpp, but that's just for now as I'm still in the testing phase and mostly on the command line myself. In a while tesseract will be incorporated in a server, where you can fire requests at it. Which is where it becomes important to have tprintf/tlog diagnostics available per tesseract session1. Haven't coded it yet, but I reckon I might need access to the tesseract instance as an aide to identify which session we're accessing the logging/diagnostics of, plus some application-specifics (to be determined).

Classically I'd do it as one or more C-style function pointers (callbacks), to be 'registered' via a new Tesseract API. But that's a little convoluted/messy re cleanup, and having had a look at current tprintf.cpp it looks like proper file closing (debug_file != stderr) isn't featured today, so it might be nice to drop that in a class destructor perhaps.

Right now I'm considering doing it with a C++ class instance:

  • create an application-specific derived class' instantiation of the tesseract-defined (virtual) base interface class (say class TesseractLoggingInterface).
  • tesseract always provides its own instantiation, with is hooked up/invoked by default. (Current tprintf/tlog in a class wrapper - keeping those calls but replacing vTessPrintf() so little visible change to most of the codebase)
  • application can register/replace its implementation-specific instance via a Tesseract BaseAPI call.

That way, little changes for tesseract itself; maybe I could clean up the current tprintf/tlog a little, but not too much. (tprintf_error + tprintf_warning + tprintf_info migration would help me, so that's a prerequisite for me to do, I think?)

Thus I get myself a cleaner interface that's easier to use and maintain once I create the server: that one can do it the way I like then. (The idea there is to use SQLite for a lot of stuff (fast & cheap storage cross-platform and also suitable for non-server / desktop machines), so current thought is to dump the log/diagnostics in SQLite as well, so callers can inspect after the fact by firing queries at the server for whatever they want. Get the image. Get the OCR output in format X. Get the thresholded image or other intermediate. Get the diagnostics/log for it. That sort of thing. Anyway, all that's is out of scope for tesseract and no running code yet.)

I could do a 'it might look like this' RFC/pullreq in the next bunch of weeks: I need to field-test this anyway and then you & others can have a look and comment/steer it towards possible integration (for the virtual base class + default class def that equals current behaviour).

Any which way, I agree it shouldn't be made part of this pullreq. Better for me to file a new / follow-up PR for this. (Which was why I did the "ERROR: ..." stuff like I did; some of the interfacing is complicated enough to not do it all in one go. 😉 )

Footnotes

  1. "session", "job", whatever you name it. The grand picture there is: you queue page images (you may want to fiddle with some priorities to move some of those 'pages' up or down, because... 🤷 ), tesseract chuggs away, you get the results and any collateral you want for review. Tesseract tprintf() output is one of the outputs that's related to a page, 's all. Hence "logging available per session")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants