-
Usually this happens when your TMA descriptor is invalid. I need more details to help you debug, so please give as many as you can.
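In the meantime, a host-side sanity check along the following lines may help rule out the most common descriptor problems before launch. This is only a sketch, assuming the descriptor is built with cuTensorMapEncodeTiled and no interleave. TiledMapArgs and check_tiled_map_args are hypothetical names that are not part of CUTLASS or CUDA, and the limits are the ones described in the CUDA driver API documentation as far as I recall, so please verify them against your toolkit version.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the arguments passed to cuTensorMapEncodeTiled.
// Dimension 0 is the innermost (contiguous) dimension.
struct TiledMapArgs {
    std::uintptr_t global_address;              // base device pointer
    std::uint32_t element_size;                 // bytes per element
    std::vector<std::uint64_t> global_dim;      // extent per dimension
    std::vector<std::uint64_t> global_strides;  // byte strides, dims 1..rank-1
    std::vector<std::uint32_t> box_dim;         // box extent per dimension
    std::vector<std::uint32_t> element_strides; // traversal stride per dimension
};

// Returns a list of human-readable problems; empty means no problem was found
// by these checks (it does not prove the descriptor is valid).
std::vector<std::string> check_tiled_map_args(const TiledMapArgs& a) {
    std::vector<std::string> problems;
    const std::size_t rank = a.global_dim.size();
    if (rank < 1 || rank > 5) {
        problems.push_back("rank must be between 1 and 5");
    }
    if (a.box_dim.size() != rank || a.element_strides.size() != rank ||
        a.global_strides.size() + 1 != rank) {
        problems.push_back("array lengths are inconsistent with the rank");
        return problems;  // the per-dimension checks below would be meaningless
    }
    if (a.global_address % 16 != 0) {
        problems.push_back("global address is not 16-byte aligned");
    }
    for (std::size_t i = 0; i < rank; ++i) {
        const std::string d = "dim " + std::to_string(i) + ": ";
        if (a.global_dim[i] == 0 || a.global_dim[i] > (1ull << 32)) {
            problems.push_back(d + "global extent must be in [1, 2^32]");
        }
        if (a.box_dim[i] == 0 || a.box_dim[i] > 256) {
            problems.push_back(d + "box extent must be in [1, 256]");
        }
        if (a.element_strides[i] == 0 || a.element_strides[i] > 8) {
            problems.push_back(d + "element stride must be in [1, 8]");
        }
    }
    for (std::size_t i = 0; i + 1 < rank; ++i) {
        const std::uint64_t s = a.global_strides[i];
        if (s % 16 != 0 || s >= (1ull << 40)) {
            problems.push_back("global stride " + std::to_string(i) +
                               " must be a multiple of 16 and below 2^40");
        }
    }
    // Assumption: interleave is NONE, so the inner box row must be a
    // multiple of 16 bytes.
    if ((static_cast<std::uint64_t>(a.box_dim[0]) * a.element_size) % 16 != 0) {
        problems.push_back("inner box size in bytes is not a multiple of 16");
    }
    return problems;
}
```

Calling check_tiled_map_args with the values you pass when creating the descriptor and printing the returned strings would show quickly whether alignment or stride constraints are being violated. Swizzle-related limits are not covered here.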
-
I could use some advice on how to debug a TMA-related issue.
I'm debugging a handful of tests that appear to be crashing in SM90_TMA_LOAD_MULTICAST_3D::copy() in cutlass/include/cute/arch/copy_sm90_tma.hpp (lines 631 to 643 in b78588d).
The problem is that I'm utterly failing at figuring out what makes the kernel crash with an "Illegal instruction".
Curiously enough, when the crash happens, the driver also reports a page fault:
However, I'm having trouble connecting the reported fault address 0x7f79_78fff000 to the instruction inputs.
As far as I can tell, the data in the registers passed to the instruction is valid. I can't tell if the tensormap data is sensible, as it's not documented. The pointer reported in the driver log also does not seem to match anything passed to the instruction.
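One way to relate a fault address to the kernel inputs is to compare it against every device allocation the test made, not only the pointers passed to the instruction. The helper below is a minimal sketch of that comparison. DeviceRange and locate_fault are hypothetical names, and the ranges would have to be recorded by the test itself.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical record of one device allocation made by the test.
struct DeviceRange {
    std::string name;    // label chosen by the test, e.g. "A tile buffer"
    std::uint64_t base;  // first byte of the allocation
    std::uint64_t size;  // length in bytes
};

// Describes where addr falls relative to the known allocations.
std::string locate_fault(const std::vector<DeviceRange>& ranges,
                         std::uint64_t addr) {
    const DeviceRange* nearest = nullptr;
    std::uint64_t best_gap = std::numeric_limits<std::uint64_t>::max();
    for (const auto& r : ranges) {
        const std::uint64_t end = r.base + r.size;  // one past the last byte
        if (addr >= r.base && addr < end) {
            return "inside " + r.name + " at offset " +
                   std::to_string(addr - r.base);
        }
        // Track the allocation whose end lies closest below the address.
        if (addr >= end && addr - end < best_gap) {
            best_gap = addr - end;
            nearest = &r;
        }
    }
    if (nearest != nullptr) {
        return std::to_string(best_gap) + " bytes past the end of " +
               nearest->name;
    }
    return "below every known range";
}
```

A result such as a small number of bytes past the end of a tensor would suggest that the descriptor's extents or strides describe more memory than was actually allocated.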
Any suggestions on what may be the root cause for UTMALDG.3D.MULTICAST triggering such a failure?