
cuBLAS error 15 #853

Closed
WorksButNotTested opened this issue Nov 21, 2023 · 4 comments

Comments

@WorksButNotTested

Since upgrading from v0.4.0 to v0.5.0, I seem to be getting an error when running tabby with CUDA enabled. Here is the output I am seeing. The error seems to occur when I first send a request to the endpoint using the default Swagger template from the web UI.

Describe the bug
... INFO ... crates tabby/src/serve/mods.rs:146: Starting server, this might takes a few minutes...
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: GGML_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: Tesla M60, compute capability 5.2
Device 1: Tesla M60, compute capability 5.2
ggml_cuda_set_main_device: using device 0 (Tesla M60) as main device

... INFO ... crates tabby/src/serve/mods.rs:165: Listening at 0.0.0.0:8080

cuBLAS error 15 at /root/workspace/crates/llama-cpp-bindings/llama.cpp/ggml-cuda.cu:7282

Information about your version
v0.5.5

Information about your GPU
NVIDIA-SMI 470.223.03 Driver Version: 470.223.02, CUDA Version 11.4
Ubuntu 20.04.6 LTS

@wsxiaoys
Member

Could you try setting the environment variable LLAMA_CPP_PARALLELISM=1? That should reduce VRAM usage.

Related discussion: https://snowshoe.dev/tabbyml/ag5vLJl1ln9

@WorksButNotTested
Author

I added -e LLAMA_CPP_PARALLELISM=1 to my docker command, but I still get the same error?

@wsxiaoys
Member

wsxiaoys commented Nov 23, 2023

https://github.com/TabbyML/llama.cpp/blob/75fb6f2ba0930be1515757196a81d32a1c2ab8ff/ggml-cuda.cu#L7289

Maybe it's related to compute capability 5.2 not supporting fp16 operations.

(In the 0.4 -> 0.5 transition, we switched the default CUDA runtime implementation to llama.cpp, which has a slightly narrower support matrix.)

Related: https://stackoverflow.com/questions/74995164/atomicadd-half-precision-floating-point-fp16-on-cuda-compute-capability-5-2
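
For reference, cuBLAS error 15 is CUBLAS_STATUS_NOT_SUPPORTED. A minimal sketch of checking the device's compute capability before requesting fp16 math could look like the code below; this is illustrative only (not Tabby's or llama.cpp's actual code), and the sm_53 cutoff for native fp16 arithmetic is an assumption based on the linked discussion.

// Illustrative sketch only -- not code from Tabby or llama.cpp.
// Query the GPU's compute capability and fall back to fp32 accumulation when
// native fp16 arithmetic is assumed to be unavailable (pre-sm_53 parts such as
// the Tesla M60 above, which is sm_52).
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, /*device=*/0);
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // Assumption: fp16 compute needs at least sm_53; otherwise request fp32.
    const bool fp16_ok = prop.major > 5 || (prop.major == 5 && prop.minor >= 3);
    const cublasComputeType_t compute_type =
        fp16_ok ? CUBLAS_COMPUTE_16F : CUBLAS_COMPUTE_32F;
    std::printf("requesting %s compute\n",
                compute_type == CUBLAS_COMPUTE_16F ? "fp16" : "fp32");
    return 0;
}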

@WorksButNotTested
Author

Is it possible to configure tabby to revert to the previous CUDA runtime?
What did the previous runtime use in place of fp16 operations? Is it possible to change the parameter passed to cublasGemmBatchedEx? Or is there any chance the workaround mentioned in the article would work?
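
For context on the last question, a hypothetical sketch of the same cublasGemmBatchedEx entry point with fp16 inputs but fp32 accumulation is shown below; the function name, pointer-array arguments, and dimensions are placeholders, and whether simply switching the compute type like this would be enough for Tabby on an sm_52 GPU is an assumption, not something confirmed in this thread.

// Hypothetical sketch only -- not llama.cpp's actual call.
// Keeps fp16 (CUDA_R_16F) inputs but requests CUBLAS_COMPUTE_32F so the
// accumulation runs in fp32; the output matrices are fp32 here as well.
// d_Aptrs/d_Bptrs/d_Cptrs are assumed device-side arrays of per-batch pointers.
#include <cublas_v2.h>

cublasStatus_t gemm_batched_fp32_accum(cublasHandle_t handle,
                                       const void* const* d_Aptrs,  // __half, m x k
                                       const void* const* d_Bptrs,  // __half, k x n
                                       void* const* d_Cptrs,        // float,  m x n
                                       int m, int n, int k, int batch_count) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    return cublasGemmBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                               m, n, k,
                               &alpha,
                               d_Aptrs, CUDA_R_16F, m,
                               d_Bptrs, CUDA_R_16F, k,
                               &beta,
                               d_Cptrs, CUDA_R_32F, m,
                               batch_count,
                               CUBLAS_COMPUTE_32F,
                               CUBLAS_GEMM_DEFAULT);
}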

wsxiaoys closed this as not planned on Jun 11, 2024