provide predefined element_length_fn for pipeline.bucket_boundaries #364

Open · wants to merge 4 commits into base: master
Conversation

@bertsky bertsky (Collaborator) commented Oct 2, 2024

tfaip provides an option delegating to TensorFlow's tf.data.Dataset.bucket_by_sequence_length (also known as batch width or length bucketing). This is really helpful for making optimal use of GPU memory: as one increases the batch size to improve GPU utilization, one must also be wary of OOM, and this is aggravated by zero padding – the longest samples can spoil an entire batch's memory utilization. Batch bucketing therefore groups samples of similar length into the same batch.
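To illustrate what bucketing does, here is a minimal, self-contained sketch using plain tf.data (not Calamari's actual pipeline; the lengths, boundaries and batch sizes below are made up for demonstration):

```python
import tensorflow as tf

# Toy dataset of variable-length "lines": each element is a 1-D int32 tensor.
lengths = [5, 7, 300, 12, 280, 9, 45, 60]
dataset = tf.data.Dataset.from_generator(
    lambda: ([0] * n for n in lengths),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)

# Group samples of similar length into the same (zero-padded) batch, so a
# single very long sample no longer dictates the padded width of short ones.
bucketed = dataset.bucket_by_sequence_length(
    element_length_func=lambda x: tf.shape(x)[0],
    bucket_boundaries=[20, 50, 100, 200, 400, 800],
    # one entry more than boundaries; batch size shrinks as padded length grows
    bucket_batch_sizes=[256, 128, 64, 32, 16, 8, 4],
)

for batch in bucketed:
    print(batch.shape)  # e.g. (4, 12) for the short lines, (2, 300) for the long ones
```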

Our CLIs even expose that (e.g. --pipeline.bucket_boundaries 20 50 100 200 400 800), but:

  1. the formula for bucket batch sizes was broken prior to tfaip 1.2.7, so one also needed to pass them explicitly (e.g. --pipeline.bucket_batch_sizes 256 128 64 32 16 8) – otherwise one basically observed the opposite effect
  2. another ingredient is the element_length_fn parameter required by tfaip/TF – it depends on the data representation and can only be expressed as code, so it must be provided by Calamari itself

This PR addresses the remaining problem 2. (Note that this was already achievable via the API, as OCR-D/ocrd_calamari#118 showcases, but it is simpler if it comes predefined.)
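For illustration, a hedged sketch of what such a predefined length function could look like – the function name and the input key "img_len" are assumptions for this sketch, not necessarily the code merged by this PR:

```python
import tensorflow as tf

# Hypothetical sketch: for OCR lines the natural element length is the
# unpadded width of the line image, which the pipeline already carries
# alongside the image tensor.
def element_length_fn(inputs: dict) -> tf.Tensor:
    # "img_len" is an assumed key for the unpadded line-image width;
    # bucket_by_sequence_length expects a scalar int32 per element.
    return tf.cast(tf.reshape(inputs["img_len"], []), tf.int32)
```

With such a function predefined, setting --pipeline.bucket_boundaries alone should suffice, since tfaip ≥ 1.2.7 can derive the bucket batch sizes itself.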

Following is a plot of measured GPU utilization for a typical calamari-predict on some 2500 lines without batch bucketing:
[plot: calamari-predict-and-eval-tf2 13 – GPU utilization over time, without batch bucketing]

...and with batch bucketing (enabled by this PR):
[plot: calamari-predict-and-eval-tf2 13-bb-N4 – GPU utilization over time, with batch bucketing]

So obviously GPU utilization gets a little less peaky.

But it's still peaky. (I have tried varying num_processes, prefetch, use_shared_memory, dist_strategy – but the effect is mostly negligible. So if anyone has ideas, please comment.)
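For reference, here is a minimal sketch of one way to sample such a GPU utilization trace while calamari-predict runs, using the pynvml bindings (nvidia-ml-py); this is an illustration, not necessarily the tooling behind the plots above:

```python
import time
import pynvml  # NVIDIA Management Library bindings (nvidia-ml-py)

# Sample GPU compute utilization and memory usage once per second and print
# a CSV trace that can be plotted afterwards. Run alongside calamari-predict.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

print("t_seconds,gpu_util_percent,mem_used_mib")
start = time.time()
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{time.time() - start:.0f},{util.gpu},{mem.used / 2**20:.0f}")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```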

@bertsky bertsky requested a review from andbue October 2, 2024 15:53
@codecov-commenter commented Oct 2, 2024


Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@f0139d6).

Files with missing lines            Patch %   Lines
calamari_ocr/ocr/dataset/data.py      0.00%   6 Missing ⚠️


Additional details and impacted files
@@           Coverage Diff            @@
##             master    #364   +/-   ##
========================================
  Coverage          ?   0.00%           
========================================
  Files             ?     129           
  Lines             ?    6867           
  Branches          ?       0           
========================================
  Hits              ?       0           
  Misses            ?    6867           
  Partials          ?       0           


@andbue andbue (Member) left a comment


As a concept, this seems like a good idea and I don't see any downside. Before setting defaults for the command line we might want to run some more performance tests on larger datasets.

@bertsky bertsky (Collaborator, Author) commented Oct 11, 2024

> Before setting defaults for the command line we might want to run some more performance tests on larger datasets.

Agreed. I'd also like to revive the discussion around performance in #290, now that we have v2.3. It would seem that for a fair comparison we need to look at all potential factors (batch bucketing being one of them, though not necessarily so on random data, which is less likely to have widely diverging sequence lengths than real data). What I would like to see is GPU and memory utilization (like above) in tandem with walltime and CPU-time benchmarks.

But this PR is not about new defaults, just enabling use of an existing CLI (and API) option.

What I did forget, though, was to add a unit test for this.
