Use PID for unique mutex name in /dev/shm #89
base: master
Conversation
Signed-off-by: Elena Sakhnovitch
Change-Id: I265ba32bc3777db5f04f1924547fe432ba78c3d0
(cherry picked from commit 2f84906)
This fixes a race condition when the library is invoked simultaneously from multiple processes.
Ping. Is anyone seeing this? Do you need more context?
Thanks for looking into this. Although multiple clients can access rocm_smi_lib at the same time, some functions only allow one process to access them at a time. The shared memory file is used as a mutex to protect those functions. With this proposed change, we may create multiple mutexes, which could then allow multiple processes to access those functions concurrently. When you got this error, did you have multiple processes using rocm_smi_lib concurrently? One possibility is that process 1 acquires the mutex and then crashes; after that, process 2 can no longer acquire the mutex. In that case, you may need to delete the shared memory file manually.
Hi... RCCL uses rocm_smi under the hood, and PyTorch uses RCCL for distributed training, instantiating multiple processes per node when there are multiple GPUs in a node. This leads to the race condition. It is not possible to manually delete the shared memory files, because the processes are launched simultaneously.
Perhaps the way the mutex is set up is not multi-process safe to begin with?
I see. So it is a l
Thank you. I will try to reproduce it. |
Install LICENSE.txt to share/doc/smi-lib Change-Id: Idcbb70db8808111203e8e4a4c3ab4d1e070ac79d
Add rpm License header for cpack Change-Id: I2f4a89015b6389cfde801f41d4f6e0f59e7087aa
pop_back() was causing a seg fault when the pp_dpm_pcie file is empty and returns whitespace. Signed-off-by: Divya Shikre <[email protected]> Change-Id: I888f1f79751cd456e43751a5b96d08560a039677 (cherry picked from commit ec71380)
force-pushed from 98635ec to 66e101a
Has there been any progress on this issue? The problem is still present in ROCm 5.0.2 when launching PyTorch with 8 GPUs/node on OLCF Crusher.
The error returned by pthread_mutex_timedlock() is different from last time, which was 110 (ETIMEDOUT on Linux), based on the man page.
Did we observe some process crash? Thanks.
force-pushed from 66e101a to bd95425
Has there been any progress on this issue? This is still occurring with ROCm 5.2.0.
Fixes a race condition when the library is simultaneously invoked from multiple processes.
fixes #88