Fix bug for asort kernel & faster sampler with GPU sorting #2730
base: main
Conversation
I'm not sure I understand the changes properly, and testing them they seem to produce weird results. If you run the tests with the cuda feature enabled, this will actually fail; did you try it?
Below is another example I tried where the results seem wrong. Fwiw, the original implementation comes from llama.cpp's argsort.cu, so if you get the kernel to work on arbitrary sizes I would suggest also making a PR there, as it would be of interest to them. Finally, could you make only the kernel changes in this PR and defer the sampling bits to a later PR once this has stabilized a bit?

```rust
use anyhow::Result;
use candle_core::{Device, Tensor};

fn main() -> Result<()> {
    let dev = Device::new_cuda(0)?;
    let values: Vec<u32> = (0..10u32).rev().collect();
    let a = Tensor::new(values, &dev)?;
    let a = a.arg_sort_last_dim(true)?;
    let a = a.to_vec1::<u32>()?;
    println!("{a:?}");
    Ok(())
}
```
Thanks for the comments. I'm currently working on a fix.
Hi @LaurentMazare,
These issues have been addressed by creating temporary row and index buffers (sized to a power of two, padded) and copying the valid elements back.
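For illustration, here is a minimal host-side sketch of that padding idea, assuming an ascending sort and an infinity sentinel; the function name is made up and the real work happens inside the CUDA kernel, so treat this purely as a picture of the approach: pad each row to the next power of two with a sentinel that sorts last, sort the padded buffer, then copy back only the indices that point at valid elements.

```rust
/// Minimal sketch (not the kernel code): pad a row to a power-of-two
/// length so a power-of-two sort can run, then keep only valid indices.
fn argsort_padded_ascending(row: &[f32]) -> Vec<u32> {
    let n = row.len();
    let padded_len = n.next_power_of_two();

    // Temporary buffer padded with a sentinel that sorts after every
    // valid element, so the padding never displaces real indices.
    let mut padded: Vec<f32> = Vec::with_capacity(padded_len);
    padded.extend_from_slice(row);
    padded.resize(padded_len, f32::INFINITY);

    // Argsort of the padded buffer; this stands in for the power-of-two
    // (bitonic-style) sort the kernel performs on the device.
    let mut indices: Vec<u32> = (0..padded_len as u32).collect();
    indices.sort_by(|&a, &b| {
        padded[a as usize]
            .partial_cmp(&padded[b as usize])
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    // Copy back only the indices that refer to valid (unpadded) elements.
    indices.into_iter().filter(|&i| (i as usize) < n).collect()
}

fn main() {
    let row: Vec<f32> = (0..10).rev().map(|v| v as f32).collect();
    println!("{:?}", argsort_padded_ascending(&row)); // [9, 8, ..., 1, 0]
}
```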
82120c6 to e4c3a71
This PR addresses an issue with the `asort` kernel, which can lead to a CUDA invalid-argument error when the last dimension to be sorted exceeds the number of threads allowed per CUDA block. With the revised `asort` kernel, we can now partially perform logit sampling on the GPU (specifically, the sorting step). This optimization improves text generation performance by up to 25%, depending on the model and the number of tokens generated.

Testing Cases
I conducted tests using the following models and commands:
GLM4-9B
```bash
cargo run --release --example glm4 --features cuda -- \
    --weight-path /home/glm-4-9b-chat/ \
    --prompt "Please talk about deep learning." \
    --sample-len 2000
```
Note: By default, the GLM4 model uses sampling parameters of `temperature=0.8` and `top_p=0.8`.

LLaMa3.1 8B
```bash
cargo run --release --example llama --features cuda -- \
    --weight-path /home/Meta-Llama-3.1-8B-Instruct/ \
    --prompt "Please talk about deep learning." \
    --temperature 0.7 \
    --top-p 0.8 \
    --sample-len 2000
```
Performance Results
Original Implementation (Sampling Entirely on CPU)
GLM4-9B:
LLaMa3.1 8B:
After `asort` Kernel Fix & Faster Sampler with GPU Sort

GLM4-9B:
LLaMa3.1 8B:
Note: You can adjust the `temperature` parameter to influence the model's behavior and potentially generate more than 2000 tokens for the above comparisons. While `sample-len` sets the maximum number of tokens that can be generated, the actual output length may vary depending on the model (and the sampling process, parameters, etc.).
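For readers curious how the GPU sort plugs into sampling, below is a rough sketch of the idea, assuming a top-p style sampler: the descending argsort of the probabilities runs on the device via `arg_sort_last_dim` (the kernel fixed by this PR), and only the cumulative-probability truncation happens on the host. The helper name and the toy probabilities are illustrative; this is not the sampler code from this PR.

```rust
use anyhow::Result;
use candle_core::{Device, Tensor};

/// Sketch of "sort on the GPU, truncate on the CPU" for top-p sampling.
fn top_p_candidates(probs: &Tensor, top_p: f32) -> Result<Vec<usize>> {
    // Descending argsort; this runs on the GPU when `probs` lives on a CUDA device.
    let sorted_idx = probs.arg_sort_last_dim(false)?.to_vec1::<u32>()?;
    let probs = probs.to_vec1::<f32>()?;

    // On the host, keep the smallest prefix of sorted tokens whose
    // cumulative probability reaches top_p; sampling then draws from this set.
    let mut kept = Vec::new();
    let mut cumsum = 0f32;
    for &i in &sorted_idx {
        kept.push(i as usize);
        cumsum += probs[i as usize];
        if cumsum >= top_p {
            break;
        }
    }
    Ok(kept)
}

fn main() -> Result<()> {
    let dev = Device::cuda_if_available(0)?;
    let probs = Tensor::new(&[0.05f32, 0.4, 0.1, 0.3, 0.15], &dev)?;
    println!("{:?}", top_p_candidates(&probs, 0.8)?); // e.g. [1, 3, 4]
    Ok(())
}
```

Keeping the sort on the device avoids sorting the full vocabulary on the CPU for every generated token, which is presumably where the reported speedup comes from.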