Use PID for unique mutex name in /dev/shm #89
base: master
Conversation
Signed-off-by: Elena Sakhnovitch
Change-Id: I265ba32bc3777db5f04f1924547fe432ba78c3d0
(cherry picked from commit 2f84906)
This fixes a race condition when the library is invoked simultaneously from multiple processes.
Ping. Is anyone seeing this? Do you need more context?
Thanks for looking into this. Although multiple clients can access rocm_smi_lib at the same time, some functions only allow one process to access them at a time. The shared memory file is used as a mutex to protect those functions. With this proposed change, we may create multiple mutexes, which could then allow multiple processes to access those functions concurrently. When you got this error, did you have multiple processes using rocm_smi_lib concurrently? One possibility is that process 1 acquires the mutex and then crashes; after that, process 2 can no longer acquire the mutex. In that case, you may need to delete the shared memory file manually.
Hi... RCCL uses rocm_smi under the hood, and PyTorch uses RCCL for distributed training, instantiating multiple processes per node when there are multiple GPUs in a node. This leads to the race condition. It is not possible to manually delete the shared memory files, because the processes are launched simultaneously.
Perhaps the way the mutex is set up is not multi-process safe to begin with?
I see. So it is a l
Thank you. I will try to reproduce it. |
Install LICENSE.txt to share/doc/smi-lib Change-Id: Idcbb70db8808111203e8e4a4c3ab4d1e070ac79d
Add rpm License header for cpack Change-Id: I2f4a89015b6389cfde801f41d4f6e0f59e7087aa
pop_back() was causing a seg fault when the pp_dpm_pcie file is empty and returns whitespace. Signed-off-by: Divya Shikre <[email protected]> Change-Id: I888f1f79751cd456e43751a5b96d08560a039677 (cherry picked from commit ec71380)
force-pushed from 98635ec to 66e101a
Has there been any progress on this issue? The problem is still present in ROCm 5.0.2 when launching PyTorch with 8 GPUs/node on OLCF Crusher.
The error returned by pthread_mutex_timedlock() is different from last time, which was 110 (ETIMEDOUT on Linux), based on the man page.
Did we observe some process crash? Thanks.
force-pushed from 66e101a to bd95425
Has there been any progress on this issue? This is still occurring with ROCm 5.2.0.
Fixes a race condition when the library is simultaneously invoked from multiple processes.
fixes #88