Skip to content
This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

[rocm] Fix segfault with -O0-compiled kernels #45

Closed
wants to merge 1 commit into from

Conversation

ldrumm
Copy link

@ldrumm ldrumm commented Oct 2, 2023

kernels_[blitType] yields a null function pointer in KernelBlitManager::initHeap due to KernelBlitManager::createProgram not initializing all the kernels because of broken layout invariants.

KernBlitManager::NumBlitKernels has two possible return values: BlitTotal, and BlitLinearTotal which are sentinels. These sentinels are used in KernelBlitManager::createProgram to initialize the kernels_ array. It correctly initializes [0, NumBlitKernels()), but InitHeap is > NumBlitKernels, so the InitHeap kernel is not loaded.

Thus, when images are disabled, and a kernel has an hidden_heap_v1 entry, the InitHeap blitkernel is not loaded, and kernelBlitManager::initHeap subsequently gets a null pointer. The kernel always has such hidden_heap_v1 descriptor entry at -O0, but I believe it's possible (unconfirmed) for this situation to occur in other circumstances.

The fix is simply to ensure the InitHeap enumeration has a numeric value less than BlitLinearTotal.

n.b. This bug does not exist on the 5.7 release, but it looks like that's by chance.

`kernels_[blitType]` yields a null function pointer in
`KernelBlitManager::initHeap` due to `KernelBlitManager::createProgram`
not initializing all the kernels because of broken layout invariants.

`KernBlitManager::NumBlitKernels` has two possible return values:
`BlitTotal, and BlitLinearTotal` which are sentinels. These sentinels
are used in `KernelBlitManager::createProgram` to initialize the
`kernels_` array. It correctly initializes `[0, NumBlitKernels())`, but
`InitHeap` is > `NumBlitKernels`, so the `InitHeap` kernel is not loaded.

Thus, when images are disabled, and a kernel has an `hidden_heap_v1`
entry, the `InitHeap` blitkernel is not loaded, and
`kernelBlitManager::initHeap` subsequently gets a null pointer. The
kernel always has such `hidden_heap_v1` descriptor entry at -O0, but I
believe it's possible (unconfirmed) for this situation to occur in other
circumstances.

The fix is simply to ensure the `InitHeap` enumeration has a numeric
value *less than* `BlitLinearTotal`.

n.b. This bug does not exist on the 5.7 release, but it looks like
that's by chance.
@ldrumm
Copy link
Author

ldrumm commented Oct 3, 2023

Wrong repo.
New patch: ROCm/clr#13

@ldrumm ldrumm closed this Oct 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant