provide predefined element_length_fn for pipeline.bucket_boundaries #364
base: master
Conversation
Codecov Report — Attention: patch coverage is […].

Coverage Diff (master → #364):

| | master | #364 |
| --- | --- | --- |
| Coverage | ? | 0.00% |
| Files | ? | 129 |
| Lines | ? | 6867 |
| Branches | ? | 0 |
| Hits | ? | 0 |
| Misses | ? | 6867 |
| Partials | ? | 0 |
As a concept, this seems like a good idea and I don't see any downside. Before setting defaults for the command line we might want to run some more performance tests on larger datasets.
Agreed. I'd also like to revive the discussion around performance in #290, now that we have v2.3. For a fair comparison, it would seem we need to look at all potential factors (batch bucketing being one of them, though not necessarily so on random data, which is less likely than real data to have widely diverging sequence lengths). What I would like to see is GPU and memory utilization (like above) in tandem with walltime and CPU-time benchmarks. But this PR is not about new defaults, just about enabling use of an existing CLI (and API) option. What I forgot, though, was adding a unit test for this.
tfaip provides an option delegating to Tensorflow's `tf.data.Dataset.bucket_by_sequence_length` (also known as batch width or length bucketing). This is really helpful for optimal use of memory resources on the GPU: as one tries to increase the batch size for better GPU utilization, one must also be wary of OOM, and this is aggravated by zero padding, since the samples with the largest length can spoil an entire batch's memory utilization. Batch bucketing therefore groups samples of similar length into the same batch.

Our CLIs even expose that (e.g. `--pipeline.bucket_boundaries 20 50 100 200 400 800`), but:

1. […] (e.g. `--pipeline.bucket_batch_sizes 256 128 64 32 16 8`), or basically observed the opposite effect
2. […] the `element_length_fn` parameter required by tfaip/TF – this is dependent on the data representation and can only be formulated as code, so it must be done in Calamari itself.

This PR addresses the final problem, 2. (Note that on the API it was already possible to achieve this, as OCR-D/ocrd_calamari#118 showcases, but it would still be simpler if it were predefined.)
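For illustration, bucket assignment in `bucket_by_sequence_length` boils down to a bisection over the boundaries, and for OCR line images a natural length function is the image width. The following is a minimal pure-Python sketch of those semantics, not the actual Calamari/tfaip code; `element_length_fn` here is a hypothetical width-based variant:

```python
import bisect

BOUNDARIES = [20, 50, 100, 200, 400, 800]

def bucket_id(length, boundaries=BOUNDARIES):
    # bucket_by_sequence_length semantics: bucket i (i >= 1) holds lengths in
    # [boundaries[i-1], boundaries[i]); bucket 0 holds lengths below the first
    # boundary; lengths >= the last boundary fall into a final overflow bucket.
    return bisect.bisect_right(boundaries, length)

def element_length_fn(image_shape):
    # Hypothetical length function for line images of shape
    # (height, width, channels): the sequence axis is the width.
    return image_shape[1]

print(bucket_id(element_length_fn((48, 19, 1))))    # 0 (shortest bucket)
print(bucket_id(element_length_fn((48, 300, 1))))   # 4, i.e. [200, 400)
print(bucket_id(element_length_fn((48, 1000, 1))))  # 6 (overflow bucket)
```

With a matching `bucket_batch_sizes` list (one entry per bucket, including the overflow bucket), short lines can then be batched in large groups and long lines in small ones.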
Following is a plot of measured GPU utilization for a typical `calamari-predict` run on some 2500 lines, without batch bucketing:

[plot: GPU utilization over time, without batch bucketing]

...and with batch bucketing (enabled by this PR):

[plot: GPU utilization over time, with batch bucketing]
So obviously GPU utilization gets a little less peaky. But it's still peaky. (I have tried varying `num_processes`, `prefetch`, `use_shared_memory` and `dist_strategy`, but the effect is mostly negligible. So if anyone has ideas, please comment.)
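The memory argument can also be made quantitative with a toy model: a zero-padded batch allocates batch_size × max_length timesteps, so grouping similar lengths shrinks the total padded volume. A small sketch with arbitrary made-up line widths (not measured data):

```python
def padded_volume(batches):
    # Total timesteps allocated when each batch is zero-padded
    # to its longest element.
    return sum(len(batch) * max(batch) for batch in batches)

lengths = [30, 780, 45, 400, 90, 25, 760, 110]  # widely diverging widths
BATCH = 2

# Naive batching in arrival order.
naive = [lengths[i:i + BATCH] for i in range(0, len(lengths), BATCH)]

# Bucketed: sort by length, then batch (an idealization of
# bucket_by_sequence_length with fine-grained boundaries).
ordered = sorted(lengths)
bucketed = [ordered[i:i + BATCH] for i in range(0, len(ordered), BATCH)]

print(padded_volume(naive))    # 4060
print(padded_volume(bucketed)) # 2600, roughly 36% less padded memory
```

On real data the gain depends on how widely the line widths actually diverge, which is why benchmarks on random data can understate (or miss) the effect.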