OFI: memory registration of cuda memory #7148

Open
thomasgillis opened this issue Sep 24, 2024 · 0 comments · May be fixed by #7156

Comments

thomasgillis commented Sep 24, 2024

Hi all,

While taking a closer look at #7140, I realized that MPICH does not seem to pass the correct handle to fi_mr_regattr.

The documentation specifies that:

device
Reserved 64 bits for device identifier if using non-standard HMEM interface. This field is ignored unless the iface field is valid. Otherwise, the device field is determined by the value specified through iface.
cuda
For FI_HMEM_CUDA, this is equivalent to CUdevice (int).

However, MPICH uses attr->device, which is obtained from cudaPointerGetAttributes, i.e. the device ordinal rather than a CUdevice handle.
I am not familiar with the difference between the device handle and the device ID (ordinal), but the documentation of cuDeviceGet suggests that there is one:

Returns a handle to a compute device.
Parameters
device
- Returned device handle
ordinal
- Device number to get handle for
raffenet added a commit to raffenet/mpich that referenced this issue Oct 2, 2024
Libfabric docs say that the value of the cuda field in the regattr
struct is the device handle gotten from cuDeviceGet, not the
ordinal. Fixes pmodels#7148.
raffenet linked a pull request Oct 2, 2024 that will close this issue
raffenet added a commit to raffenet/mpich that referenced this issue Oct 9, 2024
Libfabric docs say that the value of the cuda field in the regattr
struct is the device handle gotten from cuDeviceGet, not the
ordinal. Fixes pmodels#7148.