Failed tests for parallel_nsa_with_compression #3
Due to challenges with online top-k, the
Thank you for your reply. However, it seems that in parallel_nsa only block-wise computation is carried out for q (queries) and k (keys), with no compression of k; I have checked the code carefully and this appears to be the case. Also, the core kernel of parallel_nsa, parallel_nsa_fwd_kernel, does not seem to involve top-k selection or the sliding-window attention mechanism. I'm not sure whether my reading is correct; if I've made any mistakes, please feel free to point them out.
I also sincerely hope that the code for parallel_nsa_with_compression can be completed as soon as possible. Thank you very much for your efforts.
Thank you for your prompt reply. I understand, but in my reading the core sparsity in the NSA paper lies in block-level compression rather than in the sparse selection through block_indices, so perhaps the key remaining work is still the development of the parallel_nsa_with_compression function. Also, I'm wondering whether the "online top-k" you mentioned refers to adaptively selecting the top-k blocks for each token, and what you think the main challenges are. When I tried to process a relatively long sequence with nsa_with_compression, it ran out of GPU memory.
In this context, "online top-k" refers to processing the key states block by block while retaining only the running top-k indices, rather than scoring the query states against the entire set of compressed key states at once. This avoids materializing the full query-key dot product in the kernel, which is particularly beneficial in long-context settings.
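To make the "online" part concrete, here is a minimal PyTorch sketch of the idea (not the repository's kernel; the function name, `k_cmp`, `block_size`, and `topk` are illustrative): each key block is scored and immediately merged into a running per-query top-k, so the full score matrix is never held in memory.

```python
import torch

def online_topk_indices(q, k_cmp, block_size=64, topk=16):
    """Illustrative sketch: scan compressed keys block by block and keep a
    running top-k of attention scores per query, so the full
    (seq_q, num_cmp) score matrix is never materialized at once."""
    # q:     [seq_q, head_dim]   query states for one head
    # k_cmp: [num_cmp, head_dim] compressed key states for one head
    seq_q, num_cmp = q.shape[0], k_cmp.shape[0]
    best_scores = q.new_full((seq_q, topk), float("-inf"))
    best_indices = q.new_zeros((seq_q, topk), dtype=torch.long)

    for start in range(0, num_cmp, block_size):
        end = min(start + block_size, num_cmp)
        # scores of this key block only: [seq_q, end - start]
        scores = q @ k_cmp[start:end].T
        idx = torch.arange(start, end, device=q.device).expand(seq_q, -1)
        # merge the new block with the running top-k and re-select
        merged_scores = torch.cat([best_scores, scores], dim=-1)
        merged_idx = torch.cat([best_indices, idx], dim=-1)
        best_scores, pos = merged_scores.topk(topk, dim=-1)
        best_indices = merged_idx.gather(-1, pos)

    return best_indices  # indices later consumed by selected attention
```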
Additionally, we have finished the major part of the parallel_nsa_with_compression function, including both top-k selection and selected attention; the only missing piece should be the output of compressed attention. You can already try top-k selection followed by selected and sliding attention.
I would like to ask how to debug a Triton JIT kernel function. It seems that regular breakpoints don’t work. |
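(For reference, two approaches that commonly work, assuming a reasonably recent Triton release: run the kernel under the Triton interpreter via the `TRITON_INTERPRET=1` environment variable, which executes the kernel in plain Python so ordinary breakpoints and `pdb` work, or print values from inside the kernel with `tl.device_print`. A minimal sketch using a toy kernel, not one of the NSA kernels:)

```python
import os
# Interpreter mode must be enabled before triton is imported / kernels compile.
os.environ["TRITON_INTERPRET"] = "1"

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    # Prints from inside the kernel; also works without the interpreter.
    tl.device_print("x block", x)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)
```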
When I tested the output of parallel_nsa_with_compression, I found that the results differ significantly from the output of nsa_with_compression, and the pytest test failed. What could be the reason for this?
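(A first step that often helps here is to check whether the mismatch is merely numerical, as expected for fp16/bf16 kernels, or a genuine logic difference such as diverging selected block indices. A small, hypothetical comparison helper, with placeholder names rather than the repository's actual test code:)

```python
import torch

def compare_outputs(out_triton, out_ref, rtol=1e-2, atol=1e-2):
    """Hypothetical helper: report how badly two attention outputs diverge."""
    diff = (out_triton.float() - out_ref.float()).abs()
    print(f"max abs diff:  {diff.max().item():.6f}")
    print(f"mean abs diff: {diff.mean().item():.6f}")
    # Loose tolerances are typical for half-precision kernels; a very large
    # max difference concentrated in a few positions usually points to wrong
    # indices or masking rather than accumulation error.
    torch.testing.assert_close(out_triton, out_ref, rtol=rtol, atol=atol)
```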