[4.2.0] Assert in mpl_gpu_ze.c:466 when ZE_AFFINITY_MASK set to second device #6958

Open
david-edwards-linaro opened this issue Apr 3, 2024 · 7 comments

Comments

@david-edwards-linaro

Issue

MPI programs running on systems with Intel(R) Data Center GPUs hit an assertion failure when ZE_AFFINITY_MASK is set to select the second device.

Environment

O/S: SLES 15.5
CPU: 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
GPU: 2x Intel(R) Data Center GPU Flex 170
MPI: MPICH 4.2.0 configured with --enable-debuginfo --enable-shared, no libdrm present

Reproducer

Save the following trivial MPI program, e.g. as mpitest.c:

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

Build it as follows:
mpicc -g -O0 -o mpitest mpitest.c

Run it with the affinity mask set to the second device:
ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 ./mpitest

Observe assertion failure:

mpitest: mpich-4.2.0/src/mpl/src/gpu/mpl_gpu_ze.c:466: int MPL_gpu_init_device_mappings(int, int): Assertion `local_dev_id == local_ze_device_count' failed.

@hzhou
Contributor

hzhou commented Apr 3, 2024

The local_ze_device_count here should be 1. However, local_dev_id counts both the root device and its subdevices, so I believe it is 3 here (1 root + 2 sub). Thus I don't understand the assertion. @abrooks98 maybe you can take a look and clarify the semantics of those two counts.

@abrooks98
Collaborator

It seems we missed the idea that users may choose to skip using certain devices. In this case, the assertion is incorrect. I'm checking whether it is sufficient to simply remove the assertion, or whether other considerations are needed for this case.

@abrooks98
Collaborator

abrooks98 commented Apr 3, 2024

My previous comment is incorrect. The assertion is still valid even when certain devices are skipped. local_ze_device_count includes root devices and subdevices, so in the case of ZE_AFFINITY_MASK=1 the correct value is 3.

While setting up the global-to-local device id mapping, local_dev_id is incremented once for each root device and each subdevice. So if the scheme is correct, it should also be 3 in this case. Setting ZE_AFFINITY_MASK to a root device should be supported, and has worked in the past, but it seems there is a bug or missing logic.
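To make the invariant behind the assertion concrete, here is a minimal sketch in C. It is not the MPICH code: the `root_device` type and the functions `expected_local_count`, `assign_local_ids`, and `check_mapping` are hypothetical names chosen for illustration, and the `skip_subdevices` flag only models the suspected failure mode in which subdevices are never visited.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one visible root device; not an MPICH or Level Zero type. */
typedef struct {
    int num_subdevices;   /* 0 for a single-tile card */
} root_device;

/* Expected number of local ids: one per root device plus one per subdevice. */
int expected_local_count(const root_device *roots, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += 1 + roots[i].num_subdevices;
    return count;
}

/* Walk the same devices and hand out consecutive local ids.
 * A nonzero skip_subdevices models a walk that never visits subdevices. */
int assign_local_ids(const root_device *roots, size_t n, int skip_subdevices)
{
    int local_dev_id = 0;
    for (size_t i = 0; i < n; i++) {
        local_dev_id++;                              /* the root device itself */
        if (!skip_subdevices)
            local_dev_id += roots[i].num_subdevices; /* one id per subdevice */
    }
    return local_dev_id;
}

/* The invariant the assertion expresses: the walk consumed every expected id. */
void check_mapping(const root_device *roots, size_t n, int skip_subdevices)
{
    assert(assign_local_ids(roots, n, skip_subdevices) ==
           expected_local_count(roots, n));
}
```

Under this model, one root device with two subdevices gives an expected count of 3. A walk that skips subdevices stops at 1, which is the kind of mismatch that trips the assertion.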

As a workaround, please try setting ZE_AFFINITY_MASK=1.0,1.1 to ensure the root device and its sub devices are captured and pass this check. See below:

> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1.0,1.1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 3 | local_ze_device_count: 3

> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 1 | local_ze_device_count: 3
mpitest: src/gpu/mpl_gpu_ze.c:468: MPL_gpu_init_device_mappings: Assertion `local_dev_id == local_ze_device_count' failed.

To resolve this issue, we need to either debug and fix the ZE_AFFINITY_MASK parsing logic (band-aid fix) or remove it in favor of BDF/UUID discovery (portable long-term solution).
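For readers unfamiliar with the mask format, the following is a rough sketch of parsing a comma-separated ZE_AFFINITY_MASK value whose entries are either `D` (whole device) or `D.S` (one subdevice). It is an illustration only, not the parser in mpl_gpu_ze.c. The `mask_entry` type, the `parse_affinity_mask` function, and the `WHOLE_DEVICE` sentinel are assumptions introduced here.

```c
#include <stdlib.h>

#define WHOLE_DEVICE (-1)  /* hypothetical sentinel: entry names a root device only */

/* One parsed mask entry; hypothetical type for illustration. */
typedef struct {
    int device;
    int subdevice;   /* WHOLE_DEVICE when the entry has no ".S" part */
} mask_entry;

/* Parse a mask such as "1" or "1.0,1.1" into at most max entries.
 * Returns the number of entries parsed, or -1 on malformed input. */
int parse_affinity_mask(const char *mask, mask_entry *out, int max)
{
    int n = 0;
    const char *p = mask;

    while (*p != '\0') {
        char *end;
        long dev, sub = WHOLE_DEVICE;

        if (n == max)
            return -1;                    /* more entries than the caller allows */

        dev = strtol(p, &end, 10);
        if (end == p || dev < 0)
            return -1;                    /* missing or negative device index */

        if (*end == '.') {                /* optional subdevice part */
            p = end + 1;
            sub = strtol(p, &end, 10);
            if (end == p || sub < 0)
                return -1;
        }

        out[n].device = (int) dev;
        out[n].subdevice = (int) sub;
        n++;

        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;                /* trailing comma */
        } else if (*end != '\0') {
            return -1;                    /* unexpected character */
        }
        p = end;
    }
    return n;
}
```

With this sketch, "1" yields one entry with subdevice set to WHOLE_DEVICE, and "1.0,1.1" yields two entries with explicit subdevice indices. Any code that later counts devices has to treat the whole-device case separately rather than use the sentinel as an index.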

@david-edwards-linaro
Author

Thanks for the suggestion. However, the assert still occurs on the 2x Flex 170 (single tile per card) system I am using.

The immediate use case is a test environment for which I can patch the MPI source. Simply removing the assert line allows the program to complete, though from earlier comments this may not be a valid approach? Configuring with --with-device=ch4:ucx avoids this code path and is a further option to work around this issue.

@abrooks98
Collaborator

It turns out the issue of handling whole devices in ZE_AFFINITY_MASK is relatively new and stems from #6929. That change causes an unsigned int to be compared against an int with value -1 in this particular case, so the subdevices are not counted properly. I should have a PR to fix this today.
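As a general illustration of this class of bug (not the actual comparison in MPICH, which is not shown in this thread), the sketch below demonstrates what happens in C when an unsigned int is compared against an int holding -1. The function names `naive_less_than` and `safe_less_than` are hypothetical.

```c
/* Naive comparison: by the usual arithmetic conversions the int operand is
 * converted to unsigned int, so -1 becomes UINT_MAX before the comparison. */
int naive_less_than(unsigned int count, int sentinel)
{
    return count < sentinel;   /* -Wsign-compare flags this line */
}

/* Safer comparison: handle the negative case before converting. */
int safe_less_than(unsigned int count, int value)
{
    if (value < 0)
        return 0;              /* no unsigned count is below a negative value */
    return count < (unsigned int) value;
}
```

Here `naive_less_than(2, -1)` evaluates to true even though 2 is not less than -1, while `safe_less_than(2, -1)` evaluates to false. Building with `-Wsign-compare` (part of `-Wextra` in GCC and Clang for C) reports the mixed comparison at compile time.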

@david-edwards-linaro
Author

N.B. I am using 4.2.0, which predates PR #6929.

@abrooks98
Collaborator

Thanks for pointing this out. I will try to find access to a Flex series GPU and continue investigating this issue.


3 participants