[4.2.0] Assert in mpl_gpu_ze.c:466 when ZE_AFFINITY_MASK set to second device #6958
Comments
The |
It seems we missed the idea that users may choose to skip using certain devices. In this case, the assertion is incorrect. I'm checking to see if it is sufficient to simply remove the assertion, or if other considerations are needed for this case |
My previous comment is incorrect. The assertion is still valid in case of skipping certain devices. While setting up the global-to-local device id mapping, As a workaround, please try setting
To resolve this issue, we either need to debug and fix the logic of the |
Thanks for the suggestion; however, the assert still occurs on the 2x Flex 170 (single tile per card) system I am using. The immediate use case is a test environment for which I can patch the MPI source. Simply removing the assert line allows the program to complete, though from earlier comments this may not be a valid approach? Configuring with --with-device=ch4:ucx avoids this code path and is a further option to work around this issue. |
It turns out the issue of handling whole devices in |
N.B. I am using 4.2.0, which predates PR6929. |
Thanks for pointing this out. I will try to find access to a Flex series GPU and continue investigating this issue. |
Issue
MPI programs running on systems with Intel(R) Data Center GPUs hit an assertion failure when ZE_AFFINITY_MASK is set to use the second device.
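For context, the following is a minimal sketch of what a global-to-local device id mapping looks like when an affinity mask hides a device. It is not the MPICH code from mpl_gpu_ze.c; the function name `build_global_to_local` is hypothetical, and the sketch assumes whole-device granularity (any `.subdevice` suffix is ignored) and that visible devices keep ascending physical order.

```c
#include <stdlib.h>

/* Hypothetical helper, not MPICH code.
 * Fills global_to_local[g] with the local index that physical device g
 * would have once the mask is applied, or -1 if the device is hidden.
 * A NULL or empty mask means every device is visible.
 * Returns the number of visible devices. */
int build_global_to_local(const char *mask, int num_global, int *global_to_local)
{
    int count = 0;

    /* Start with every device hidden. */
    for (int g = 0; g < num_global; g++)
        global_to_local[g] = -1;

    if (mask == NULL || *mask == '\0') {
        for (int g = 0; g < num_global; g++)
            global_to_local[g] = g;
        return num_global;
    }

    /* Mark each device listed in the comma-separated mask as visible (0). */
    const char *p = mask;
    while (*p != '\0') {
        char *end;
        long g = strtol(p, &end, 10);
        if (end == p)
            break;              /* malformed entry: stop parsing */
        if (g >= 0 && g < num_global)
            global_to_local[g] = 0;
        /* Assumption: skip an optional ".subdevice" suffix. */
        p = end;
        while (*p != '\0' && *p != ',')
            p++;
        if (*p == ',')
            p++;
    }

    /* Number the visible devices in ascending physical order. */
    for (int g = 0; g < num_global; g++) {
        if (global_to_local[g] == 0)
            global_to_local[g] = count++;
    }
    return count;
}
```

With two physical devices and a mask of `1`, this sketch maps global device 1 to local device 0 and leaves global device 0 unmapped (-1), which is the situation the reproducer below exercises.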
Environment
O/S: SLES 15.5
CPU: 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
GPU: 2x Intel(R) Data Center GPU Flex 170
MPI: MPICH 4.2.0 configured with --enable-debuginfo --enable-shared, no libdrm present
Reproducer
Save the following trivial MPI program, e.g. as mpitest.c:
Build it as follows:
mpicc -g -O0 -o mpitest mpitest.c
Run it with the affinity mask set to the second device:
ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 ./mpitest
Observe assertion failure: