Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault with -O0-compiled kernels #13

Open
wants to merge 1 commit into
base: rocm-5.6.x
Choose a base branch
from

Conversation

ldrumm
Copy link

@ldrumm ldrumm commented Oct 3, 2023

kernels_[blitType] yields a null function pointer in KernelBlitManager::initHeap due to KernelBlitManager::createProgram not initializing all the kernels because of broken layout invariants.

KernBlitManager::NumBlitKernels has two possible return values: BlitTotal, and BlitLinearTotal which are sentinels. These sentinels are used in KernelBlitManager::createProgram to initialize the kernels_ array. It correctly initializes [0, NumBlitKernels()), but InitHeap is > NumBlitKernels, so the InitHeap kernel is not loaded.

Thus, when images are disabled, and a kernel has an hidden_heap_v1 entry, the InitHeap blitkernel is not loaded, and kernelBlitManager::initHeap subsequently gets a null pointer. The kernel always has such hidden_heap_v1 descriptor entry at -O0, but I believe it's possible (unconfirmed) for this situation to occur in other circumstances.

The fix is simply to ensure the InitHeap enumeration has a numeric value less than BlitLinearTotal.

n.b. This bug does not exist on the 5.7 release, but it looks like that's by chance.

`kernels_[blitType]` yields a null function pointer in
`KernelBlitManager::initHeap` due to `KernelBlitManager::createProgram`
not initializing all the kernels because of broken layout invariants.

`KernBlitManager::NumBlitKernels` has two possible return values:
`BlitTotal, and BlitLinearTotal` which are sentinels. These sentinels
are used in `KernelBlitManager::createProgram` to initialize the
`kernels_` array. It correctly initializes `[0, NumBlitKernels())`, but
`InitHeap` is > `NumBlitKernels`, so the `InitHeap` kernel is not loaded.

Thus, when images are disabled, and a kernel has an `hidden_heap_v1`
entry, the `InitHeap` blitkernel is not loaded, and
`kernelBlitManager::initHeap` subsequently gets a null pointer. The
kernel always has such `hidden_heap_v1` descriptor entry at -O0, but I
believe it's possible (unconfirmed) for this situation to occur in other
circumstances.

The fix is simply to ensure the `InitHeap` enumeration has a numeric
value *less than* `BlitLinearTotal`.

n.b. This bug does not exist on the 5.7 release, but it looks like
that's by chance.
@cjatin
Copy link
Contributor

cjatin commented Oct 3, 2023

cc'ing: @gandryey @chrispaquot @saleelk

npmiller added a commit to npmiller/llvm that referenced this pull request May 30, 2024
This was tracked down to a bug in ROCm that seems to be fixed with newer
versions, and the CI is now on ROCm 6+ so these should be fine.

ROCm ticket: ROCm/clr#13

The reduce over group test works on W6800 and MI210, but it seems for
gfx1031 it reports not supporting shared USM, note that HIP for gfx1031
isn't officially supported by AMD.
steffenlarsen added a commit to intel/llvm that referenced this pull request Jun 7, 2024
This was tracked down to a bug in ROCm that seems to be fixed with newer
versions, and the CI is now on ROCm 6+ so these should be fine.

ROCm ticket: ROCm/clr#13

The reduce over group test works on W6800 and MI210, but it seems for
gfx1031 it reports not supporting shared USM, note that HIP for gfx1031
isn't officially supported by AMD.

---------

Co-authored-by: Steffen Larsen <[email protected]>
ianayl pushed a commit to ianayl/sycl that referenced this pull request Jun 13, 2024
This was tracked down to a bug in ROCm that seems to be fixed with newer
versions, and the CI is now on ROCm 6+ so these should be fine.

ROCm ticket: ROCm/clr#13

The reduce over group test works on W6800 and MI210, but it seems for
gfx1031 it reports not supporting shared USM, note that HIP for gfx1031
isn't officially supported by AMD.

---------

Co-authored-by: Steffen Larsen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants