KFDMemoryTests fail with 5.11rc7 and gfx1030 (hsa intermittently fails) #108
Comments
Can you provide a kernel log (dmesg) and the .config used to build your kernel? Can you also try a supported kernel with the DKMS driver and firmware to help narrow down the problem?
It is the default config file from the Ubuntu nightly builds. The image is from: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11-rc7/amd64/
I need something newer than 5.10 to address this bug (https://bugzilla.kernel.org/show_bug.cgi?id=210593) with Ryzen CPU support. I can build from git with any debug options if required.
The relevant pieces from dmesg are here:
[ 541.246085] [drm] kiq ring mec 2 pipe 1 q 0
Full dmesg is also attached.
Please check this: This should report If that's not the case, I recommend you report a bug against the Ubuntu kernel. This is a required feature for KFD to work. I submitted a patch upstream recently to make KFD select this automatically during the kernel build process.
I don't think that is the issue, since it is enabled.
@powderluv Can you please check whether the issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!
On Linux 5.11.0-051100rc7 (5.11-rc7, host 5950x), I am seeing the following KFDMemoryTest failures with gfx1030:
[----------] 25 tests from KFDMemoryTest
[ RUN ] KFDMemoryTest.MMapLarge
[ ] Successfully registered and mapped 117GB system memory to gpu
[ OK ] KFDMemoryTest.MMapLarge (1243 ms)
[ RUN ] KFDMemoryTest.MapUnmapToNodes
[ ] Skipping test: At least two GPUs are required.
[ OK ] KFDMemoryTest.MapUnmapToNodes (30 ms)
[ RUN ] KFDMemoryTest.MapMemoryToGPU
[ OK ] KFDMemoryTest.MapMemoryToGPU (6 ms)
[ RUN ] KFDMemoryTest.InvalidMemoryPointerAlloc
[ OK ] KFDMemoryTest.InvalidMemoryPointerAlloc (5 ms)
[ RUN ] KFDMemoryTest.ZeroMemorySizeAlloc
[ OK ] KFDMemoryTest.ZeroMemorySizeAlloc (5 ms)
[ RUN ] KFDMemoryTest.MemoryAlloc
[ OK ] KFDMemoryTest.MemoryAlloc (5 ms)
[ RUN ] KFDMemoryTest.AccessPPRMem
[ ] Skipping test: Test requires APU.
[ OK ] KFDMemoryTest.AccessPPRMem (5 ms)
[ RUN ] KFDMemoryTest.MemoryRegister
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/Dispatch.cpp:95: Failure
Value of: (hsaKmtWaitOnEvent(m_pEop, timeout))
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:482: Failure
Value of: WaitOnValue(&stackData[sdmaOffset], 0x12345678)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/Dispatch.cpp:95: Failure
Value of: (hsaKmtWaitOnEvent(m_pEop, timeout))
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:530: Failure
Value of: stackData[dstOffset]
Actual: 3735928559
Expected: 0xD00BED00
Which is: 3490442496
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:531: Failure
Value of: stackData[sdmaOffset]
Actual: 3735928559
Expected: 0xD0BED0BE
Which is: 3502166206
[ FAILED ] KFDMemoryTest.MemoryRegister (10379 ms)
[ RUN ] KFDMemoryTest.MemoryRegisterSamePtr
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:593: Failure
Value of: WaitOnValue((unsigned int *)(&mem[2]), 0xdeadbeef)
Actual: false
Expected: true
[ FAILED ] KFDMemoryTest.MemoryRegisterSamePtr (4256 ms)
[ RUN ] KFDMemoryTest.FlatScratchAccess
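The MemoryRegister failures above print the actual values in decimal and the expected values in hex, which makes them hard to compare by eye. A minimal sketch that converts the numbers from the log to hex follows; the dictionary labels are only illustrative. Both actual values decode to 0xDEADBEEF. That this is a pattern left in place because the GPU writes never landed is an assumption, not something the log confirms.

```python
# Values copied from the KFDMemoryTest.MemoryRegister failures in the log.
# The labels are illustrative; only the numbers come from the log.
observed = {
    "stackData[dstOffset] actual": 3735928559,
    "stackData[dstOffset] expected": 3490442496,
    "stackData[sdmaOffset] actual": 3735928559,
    "stackData[sdmaOffset] expected": 3502166206,
}


def as_hex(value: int) -> str:
    """Format a 32-bit value the way the test source writes its constants."""
    return f"0x{value:08X}"


decoded = {label: as_hex(value) for label, value in observed.items()}

for label, text in decoded.items():
    print(f"{label}: {text}")
```

Read this way, neither the shader destination nor the SDMA destination holds the value the test expected, which is consistent with the preceding wait timeouts on the queue.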
hsakmt is built from the rocm-4.0.x branch and compiled with the latest AOMP 13.x.
I stumbled on this because my test program here:
https://github.com/powderluv/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/asm-kernel/asm-kernel.s
fails roughly two out of three times but sometimes runs successfully.
foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Using agent: gfx1030
Success
foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7f3899753000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7f9dc3f5f000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^[[A^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7fa0543a9000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Using agent: gfx1030
Success
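To put a number on the intermittent failure shown in the transcript above, the binary can be run repeatedly under a timeout, since the failing runs hang until interrupted with Ctrl-C. The following is a rough sketch, not part of the original report: the helper name `count_outcomes`, the 15-second timeout, and the rule "exit code 0 and `Success` on stdout means pass" are all assumptions based on the output shown.

```python
import subprocess
import sys
from collections import Counter


def count_outcomes(cmd, runs=10, timeout_s=15.0):
    """Run `cmd` repeatedly and tally success / failure / timeout.

    Assumption: a good run exits with code 0 and prints "Success" on stdout,
    as in the transcript; a hung run is killed after `timeout_s` seconds.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            outcomes["timeout"] += 1
            continue
        if result.returncode == 0 and "Success" in result.stdout:
            outcomes["success"] += 1
        else:
            outcomes["failure"] += 1
    return outcomes


# Example (hypothetical invocation, run from the asm-kernel build directory):
#   print(dict(count_outcomes(["./asm-kernel"], runs=20)))
```

Reporting the tally alongside the kernel version makes it easier to compare the failure rate across kernels or ROCm releases.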
Here are some related issues:
ROCm/HIP#2238
ROCm/aomp#187. (Rocminfo is attached to this bug)