
Failed tests for parallel_nsa_with_compression #3

Open
Lau-Jonathan opened this issue Feb 24, 2025 · 10 comments

@Lau-Jonathan

When I tested the output of parallel_nsa_with_compression, I found that the results differ significantly from the output of nsa_with_compression, and the pytest test failed. What could be the reason for this?

@Hanyuezhuohua
Collaborator

Due to challenges with online top‑k, the nsa_with_compression function is still under development. For now, please try using parallel_nsa, which features both selected and sliding attention. We plan to release the one with compressed attention in the coming days.

@Lau-Jonathan
Author

Thank you for your reply. However, it seems that in parallel_nsa only the block-wise computation over q (queries) and k (keys) is carried out, and there is no compression of k; I have checked the code carefully and found this to be the case. Also, the core function of parallel_nsa, parallel_nsa_fwd_kernel, does not involve top-k selection or the sliding window attention mechanism. I'm not sure whether this is correct; if there are any mistakes, please feel free to point them out and correct me.

@Lau-Jonathan
Author

> Due to challenges with online top-k, the nsa_with_compression function is still under development. For now, please try using parallel_nsa, which features both selected and sliding attention. We plan to release the one with compressed attention in the coming days.

I sincerely hope that the code for parallel_nsa_with_compression can be completed as soon as possible. Thank you very much for your efforts.

@Hanyuezhuohua
Collaborator

In parallel_nsa, we assume that the block indices for sparse attention are already available. The compression and top-k selection mechanisms will be handled in parallel_nsa_with_compression (we want to implement an online top-k for better scalability, which has run into some challenges and should be finished today or tomorrow). For sliding window attention, our new version adds window_size as a config option for the parallel_nsa function and now supports it (set window_size > 0; see `window_size: int = 0,` at line 913 of native_sparse_attention/ops/parallel.py, commit e207a41).
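
For reference, a minimal sketch of what such a call might look like. The tensor shapes and the exact argument list are assumptions pieced together from this thread (precomputed block_indices plus the window_size keyword in parallel.py), not the repository's documented API:

```python
# Hypothetical usage sketch -- argument names and shapes are assumptions based
# on this thread (precomputed block indices + a `window_size` keyword), not the
# repository's documented signature.
import torch
from native_sparse_attention.ops.parallel import parallel_nsa

B, T, HQ, HK, D = 1, 1024, 16, 1, 64      # batch, seq len, query/KV heads, head dim
block_size, S = 64, 16                    # KV block size, selected blocks per query

q = torch.randn(B, T, HQ, D, dtype=torch.float16, device='cuda')
k = torch.randn(B, T, HK, D, dtype=torch.float16, device='cuda')
v = torch.randn(B, T, HK, D, dtype=torch.float16, device='cuda')

# block_indices[b, t, h] lists the KV blocks each query position attends to.
# Random here purely for illustration; NSA would derive them from compressed-attention scores.
block_indices = torch.randint(0, T // block_size, (B, T, HK, S), device='cuda')

o = parallel_nsa(q, k, v, block_indices,
                 block_size=block_size,
                 window_size=512)         # window_size > 0 enables the sliding-window branch
```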

@Lau-Jonathan
Author

> In parallel_nsa, we assume that the block indices for sparse attention are already available. The compression and top-k selection mechanisms will be handled in parallel_nsa_with_compression (we want to implement an online top-k for better scalability, which has run into some challenges and should be finished today or tomorrow). For sliding window attention, our new version adds window_size as a config option for the parallel_nsa function and now supports it (set window_size > 0; see `window_size: int = 0,` at line 913 of native_sparse_attention/ops/parallel.py, commit e207a41).

Thank you for your prompt reply. I understand, but as I read the NSA paper, the core sparsity lies in the block-level compression rather than in the sparse selection through block_indices, so the key remaining development probably still lies in the parallel_nsa_with_compression function. Also, does the "online top-k" you mentioned refer to adaptively selecting the corresponding top-k for each token? And what do you think the main challenges are? When I tried to process a relatively long sequence with nsa_with_compression, it ran out of GPU memory.

@Hanyuezhuohua
Collaborator

In this context, "online-topk" refers to processing key states block by block while retaining only the top‑k indices online, rather than computing query states using the entire set of compressed key states. This approach avoids materializing the full dot product between the query and key in the kernel, making it particularly beneficial for long-context settings.
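
To make the idea concrete, here is a rough PyTorch sketch of that streaming selection (illustrative only; all names are made up and the actual implementation is a Triton kernel): the compressed keys are scanned in chunks and only a running top-k of block scores is kept, so the full query-key score matrix is never materialized.

```python
import torch

def online_topk_blocks(q, k_cmp, topk, chunk=256):
    """Illustrative sketch of online top-k block selection: stream over the
    compressed keys in chunks and keep a running top-k per query, instead of
    materializing the full [T_q, num_blocks] score matrix."""
    T_q = q.shape[0]
    best_scores = q.new_full((T_q, topk), float('-inf'))
    best_idx = torch.zeros(T_q, topk, dtype=torch.long, device=q.device)

    for start in range(0, k_cmp.shape[0], chunk):
        blk = k_cmp[start:start + chunk]                      # [c, D] compressed keys
        scores = q @ blk.T                                    # [T_q, c] partial scores only
        idx = torch.arange(start, start + blk.shape[0], device=q.device)
        # merge the running top-k with this chunk and re-select
        merged_scores = torch.cat([best_scores, scores], dim=-1)
        merged_idx = torch.cat([best_idx, idx.expand(T_q, -1)], dim=-1)
        best_scores, pos = merged_scores.topk(topk, dim=-1)
        best_idx = merged_idx.gather(-1, pos)

    return best_idx                                           # selected block indices per query
```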

@Hanyuezhuohua
Collaborator

Additionally, we have finished the major part of the parallel_nsa_with_compression function, with both top-k selection and selected attention; the only missing piece should be the output of compressed attention. You can already try the top-k selection and the subsequent selected and sliding attention.

@Lau-Jonathan
Author

[image: error message]
I haven’t made any changes to the new code, but when running test_nsa_with_compression.py, I encountered the error message shown in the image.

@Lau-Jonathan
Author

> [image: error message] I haven’t made any changes to the new code, but when running test_nsa_with_compression.py, I encountered the error message shown in the image.

It seems that tl.log2 returns a tensor in its source code. I simply replaced n_dims here with 5, which corresponds to math.log2(S), and the tests passed. But that is not elegant, because when I tried to cast the core.constexpr n_dims to an int, it failed with an AssertionError. If you have a better idea, I hope the bug can be fixed that way instead. By the way, why can't I git clone this repository?
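
For what it's worth, a slightly less hard-coded variant of the same workaround (a sketch, untested against this repository's kernel): compute the logarithm on the host with math.log2 and pass it into the kernel as a tl.constexpr argument, so tl.log2 is never needed inside the JIT code. The kernel below is a toy stand-in, not the repository's kernel.

```python
import math
import torch
import triton
import triton.language as tl

@triton.jit
def toy_kernel(x_ptr, o_ptr, N_DIMS: tl.constexpr, BS: tl.constexpr):
    # N_DIMS arrives as a compile-time integer, so it can be used wherever a
    # constexpr is required (loop bounds, shapes, ...).
    offs = tl.arange(0, BS)
    x = tl.load(x_ptr + offs)
    for _ in range(N_DIMS):
        x = x * 2.0
    tl.store(o_ptr + offs, x)

def launch(x, S=32, BS=64):
    o = torch.empty_like(x)
    n_dims = int(math.log2(S))            # computed on the host, not with tl.log2
    toy_kernel[(1,)](x, o, N_DIMS=n_dims, BS=BS)
    return o
```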

yzhangcs changed the title from "Failed tests for When I tested the output of parallel_nsa_with_compression, I found that the results differ significantly from the output of nsa_with_compression, and the pytest test failed. What could be the reason for this?" to "Failed tests for parallel_nsa_with_compression" on Feb 26, 2025
yzhangcs added a commit that referenced this issue Feb 26, 2025
@Lau-Jonathan
Author

I would like to ask how to debug a Triton JIT kernel function. It seems that regular breakpoints don’t work.
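
For reference, two commonly used options (exact behaviour depends on the installed Triton version): running the test with the environment variable TRITON_INTERPRET=1, which executes @triton.jit kernels in interpreter mode where ordinary Python breakpoints and pdb work, or printing values from inside a compiled kernel with tl.device_print.

```python
# Sketch of two common Triton debugging approaches (behaviour depends on the
# installed Triton version):
#
#   TRITON_INTERPRET=1 python test_nsa_with_compression.py
#
# runs @triton.jit kernels in interpreter mode, where ordinary Python
# breakpoints / pdb work inside the kernel body. For compiled runs, values can
# be printed from device code instead:
import triton
import triton.language as tl

@triton.jit
def demo_kernel(x_ptr, BS: tl.constexpr):
    offs = tl.arange(0, BS)
    x = tl.load(x_ptr + offs)
    tl.device_print("x block:", x)    # printed once per program instance
```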
