Turns out that the default value for `max_shared_memory`

https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L71

does not allocate 1GB but rather 4GB, because the value type `ctypes.c_float` is used when creating the `RawArray`:

https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L90
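A minimal sketch of the size math, assuming the default value is passed straight through as the element count for `RawArray` (the names and the 1024³ default here mirror the linked lines, but treat this as an illustration rather than the exact gunpowder code):

```python
import ctypes
from multiprocessing import RawArray

max_shared_memory = 1024 * 1024 * 1024  # intended as "1GB", but used as an element count

# RawArray(ctypes.c_float, n) allocates n elements of sizeof(c_float) == 4 bytes each,
# so the actual footprint is 4x the nominal value:
bytes_per_array = ctypes.sizeof(ctypes.c_float) * max_shared_memory
print(bytes_per_array / 1024**3)  # -> 4.0 (GB), per input array and again per output array

# A small allocation just to show the call shape used in predict.py:
small = RawArray(ctypes.c_float, 1024)  # 4KB of shared memory, not 1KB
```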
With 4GB for each of the input and output arrays, each predict worker allocates 8GB of shared memory, so with 4 workers the job should request at least 32GB. Now, here is the performance bug. I did not know about the shared-memory requirement and have always run my inference pipeline with 4 workers and only 8GB of memory (to minimize my resource usage counting :)). You'd think that gunpowder would run out of memory and be killed, but it actually will not! It turns out this is because the Python multiprocessing package creates a temp file and mmaps it whenever a `RawArray` is used: src. So no matter how many workers are run and how much shared memory is allocated, Python will happily chug along, albeit with possible slowdowns from memory being swapped in and out to disk.

The most immediate slowdown is during initialization, when the arrays are set to zero. I have seen my inference jobs take more than 15 minutes to initialize all of the `RawArray`s in their on-disk temp files (versus less than a minute when there is enough memory).

The second-order bug appears at runtime, when data is paged out from main memory to disk. In my inference jobs, once past initialization, 8GB was actually enough for four workers, but I can imagine scenarios where not enough memory is requested for a job and data is paged out on every transfer. I don't know exactly how the OS decides when to page out parts of a memory-mapped file, but we should probably avoid this scenario at all times, because it can be an opaque performance bug.
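For concreteness, a back-of-the-envelope estimate of the footprint (the helper name is made up for illustration; it just multiplies out the numbers above, assuming one input and one output array per worker):

```python
import ctypes

def estimated_shared_memory_gb(num_workers, max_shared_memory=1024**3, arrays_per_worker=2):
    # Hypothetical helper: each array holds max_shared_memory c_float elements.
    bytes_per_array = ctypes.sizeof(ctypes.c_float) * max_shared_memory
    return num_workers * arrays_per_worker * bytes_per_array / 1024**3

print(estimated_shared_memory_gb(4))  # -> 32.0 GB, vs. the 8GB I was requesting per job
```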
My recommendations are:

- At the very least, the `max_shared_memory` argument should be made more transparent to the user. Maybe something like `shared_memory_per_worker_GB`, and then calculate the appropriate `max_shared_memory` from that (a sketch of this follows the list).
- The default for `max_shared_memory` should be decreased substantially. I'm guessing that most production jobs won't be transferring more than a few hundred MBs, so maybe the default should be capped at something like 64MB or 128MB, and users with more experimental setups can increase it accordingly.