Failed tests for parallel_nsa_with_compression #3
Due to challenges with online top-k, the
Thank you for your reply. However, it seems that in parallel_nsa only block-wise computation is carried out for q (queries) and k (keys), with no compression of k; I have checked the code carefully and this appears to be the case. Also, the core kernel of parallel_nsa, parallel_nsa_fwd_kernel, does not seem to involve top-k selection or the sliding-window attention mechanism. I'm not sure whether my reading is correct; if I've made any mistakes, please feel free to point them out.
I also sincerely hope that the code for parallel_nsa_with_compression can be completed as soon as possible. Thank you very much for your efforts.
Thank you for your prompt reply. I understand, but in my reading the core sparsity in the NSA paper lies in block-level compression rather than in the sparse selection through block_indices, so perhaps the key remaining work is still the development of the parallel_nsa_with_compression function. Also, I'm wondering whether the "online top-k" you mentioned refers to adaptively selecting the top-k blocks for each token, and what you think the main challenges are. When I tried to process a relatively long sequence with nsa_with_compression, it ran out of GPU memory.
In this context, "online top-k" refers to processing the key states block by block while retaining only the running top-k indices, rather than scoring the query states against the entire set of compressed key states at once. This avoids materializing the full query-key dot product in the kernel, which is particularly beneficial in long-context settings.
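To make the "online" part concrete, here is a minimal PyTorch sketch of the idea (not the repository's kernel; the function name, `k_cmp`, `block_size`, and `topk` are illustrative): each key block is scored and immediately merged into a running per-query top-k, so the full score matrix is never held in memory.

```python
import torch

def online_topk_indices(q, k_cmp, block_size=64, topk=16):
    """Illustrative sketch: scan compressed keys block by block and keep a
    running top-k of attention scores per query, so the full
    (seq_q, num_cmp) score matrix is never materialized at once."""
    # q:     [seq_q, head_dim]   query states for one head
    # k_cmp: [num_cmp, head_dim] compressed key states for one head
    seq_q, num_cmp = q.shape[0], k_cmp.shape[0]
    best_scores = q.new_full((seq_q, topk), float("-inf"))
    best_indices = q.new_zeros((seq_q, topk), dtype=torch.long)

    for start in range(0, num_cmp, block_size):
        end = min(start + block_size, num_cmp)
        # scores of this key block only: [seq_q, end - start]
        scores = q @ k_cmp[start:end].T
        idx = torch.arange(start, end, device=q.device).expand(seq_q, -1)
        # merge the new block with the running top-k and re-select
        merged_scores = torch.cat([best_scores, scores], dim=-1)
        merged_idx = torch.cat([best_indices, idx], dim=-1)
        best_scores, pos = merged_scores.topk(topk, dim=-1)
        best_indices = merged_idx.gather(-1, pos)

    return best_indices  # indices later consumed by selected attention
```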
Additionally, we have finished the major part of the parallel_nsa_with_compression function, including both top-k selection and selected attention; the only missing piece should be the output of compressed attention. You can already try top-k selection followed by selected and sliding attention.
I would like to ask how to debug a Triton JIT kernel function. It seems that regular breakpoints don’t work. |
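(For reference, two approaches that commonly work, assuming a reasonably recent Triton release: run the kernel under the Triton interpreter via the `TRITON_INTERPRET=1` environment variable, which executes the kernel in plain Python so ordinary breakpoints and `pdb` work, or print values from inside the kernel with `tl.device_print`. A minimal sketch using a toy kernel, not one of the NSA kernels:)

```python
import os
# Interpreter mode must be enabled before triton is imported / kernels compile.
os.environ["TRITON_INTERPRET"] = "1"

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    # Prints from inside the kernel; also works without the interpreter.
    tl.device_print("x block", x)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)
```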
When I tested the output of parallel_nsa_with_compression, I found that the results differ significantly from the output of nsa_with_compression, and the pytest test failed. What could be the reason for this?
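(A first step that often helps here is to check whether the mismatch is merely numerical, as expected for fp16/bf16 kernels, or a genuine logic difference such as diverging selected block indices. A small, hypothetical comparison helper, with placeholder names rather than the repository's actual test code:)

```python
import torch

def compare_outputs(out_triton, out_ref, rtol=1e-2, atol=1e-2):
    """Hypothetical helper: report how badly two attention outputs diverge."""
    diff = (out_triton.float() - out_ref.float()).abs()
    print(f"max abs diff:  {diff.max().item():.6f}")
    print(f"mean abs diff: {diff.mean().item():.6f}")
    # Loose tolerances are typical for half-precision kernels; a very large
    # max difference concentrated in a few positions usually points to wrong
    # indices or masking rather than accumulation error.
    torch.testing.assert_close(out_triton, out_ref, rtol=rtol, atol=atol)
```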