Port most vLLM kernels to ROCm #1313
Conversation
```diff
@@ -64,66 +60,6 @@ def get_torch_arch_list() -> Set[str]:
     return set(arch_list)


 # First, check the TORCH_CUDA_ARCH_LIST environment variable.
```
Maybe we can put all of this code into a function? Then we can just skip it if `torch.version.hip` is set :)
If you have a different preference, let me know!
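For what it's worth, a minimal sketch of what that refactoring could look like; the function name and the exact parsing are hypothetical, not the code in this PR:

```python
import os
from typing import Optional, Set

import torch


def resolve_cuda_arch_list() -> Optional[Set[str]]:
    """Hypothetical wrapper: parse TORCH_CUDA_ARCH_LIST, or skip entirely on ROCm."""
    if torch.version.hip is not None:
        # HIP build: CUDA compute capabilities do not apply, so skip the check.
        return None
    env_arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if not env_arch_list:
        return None
    # Accept ";"- or space-separated entries such as "8.0;8.6+PTX".
    return set(env_arch_list.replace(";", " ").split())
```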
It does build, but I don't think I have enough VRAM to run any unquantized model. Also, note that it requires ROCm 5.7; `amd_hip_bf16.h` didn't exist before. Anyway, doesn't it have a hard dependency on xFormers? Just trying to call `vllm.entrypoints.openai.api_server` throws an error.
The first step here will be to get one of the smaller models like OPT-125M working, which doesn't need AWQ. You are right that this needs ROCm 5.7, which the PyTorch nightlies already support now.
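For reference, a minimal smoke test along those lines, assuming the standard vLLM Python API and the OPT-125M checkpoint from the Hugging Face Hub:

```python
from vllm import LLM, SamplingParams

# OPT-125M is small and unquantized, so it exercises only the ported kernels,
# without requiring the AWQ quantization ops.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```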
@pcmoritz Curious if you would mind looking into porting the AWQ quantization kernels to ROCm too? Would be a benefit to everyone running quantized models.
@casper-hansen It depends on how the rest of the port goes and what the performance numbers for the rest look like. I hope we can get some help from AMD to port them, since it is likely more involved :)
@WoosukKwon Can you have a look at the PR? All the vLLM layers are working now on AMD hardware except the ones that depend on xformers (with the following patch that I didn't want to merge). Make sure to have a closer look at, and test, the refactoring in `setup.py`, since I haven't tested it on NVIDIA hardware :)
(The failures are the ones from xformers modules not being available.)
Hi @pcmoritz, huge thanks for the great work! I will definitely take a look. As for FlashAttention and xformers, we will probably be able to collaborate with the AI teams at AMD.
@pcmoritz @casper-hansen Speaking of the AWQ kernels, I believe they are for temporary use. We plan to implement much faster kernels (probably using Triton) in the near future.
@iAmir97 converted the script to AMD assembler instructions in https://github.com/pcmoritz/vllm-public/pull/1/files. Let me see if I can cherry-pick the commits into this PR and whether the tests still pass; if so, maybe that's the better solution, since the performance might be better :)
The tests are passing, so I changed the code. I'll be at SkyCamp tomorrow, @WoosukKwon; if you want and have time, we can chat / hack some more on this :)
When I tried the flash_attn branch, I got `NameError: name 'BlockDiagonalCausalMask' is not defined`.
Does the asm version work as expected?
@fsx950223 It is passing the unit tests for the layers, so it seems to be working for me.
@sabreshao Thanks, I'm planning to work more on the flash attention integration. Currently I'm working with the latest master of https://github.com/ROCmSoftwarePlatform/flash-attention; let me know if you think using a different code base will make it more likely to succeed :) The focus right now is on correctness, not speed yet :)
Could you share your development environment with me (maybe a Dockerfile)? I ran into some issues in my local environment. Also, which type of GPU are you using, MI250 or MI300?
@pcmoritz @WoosukKwon @sabreshao Below is a workable solution to enable Flash Attention for vLLM, per my testing. Please just treat it as a reference; I am sure you can make further optimizations. Specifically, it follows the instructions at https://github.com/ROCmSoftwarePlatform/flash-attention, except that instead of using ROCm 5.4 we can use ROCm 5.7 (already supported by the PyTorch nightly), the same version @pcmoritz used above. Concretely, one can replace the following code in vLLM with the following code (similar to the code here).
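The code snippets referenced in that comment did not survive. Purely as an illustration of the kind of substitution being described, here is a hypothetical sketch; the real entry points in the ROCm flash-attention fork and in vLLM's attention code may differ, and the dispatch function, tensor layout, and causal-mask handling below are assumptions:

```python
import torch


def attention_forward(query, key, value, scale, use_rocm_flash_attn=None):
    """Hypothetical dispatch: xformers on CUDA, flash-attention on ROCm.

    Tensors are assumed to be laid out as (batch, seq_len, num_heads, head_size).
    """
    if use_rocm_flash_attn is None:
        use_rocm_flash_attn = torch.version.hip is not None

    if use_rocm_flash_attn:
        # The ROCmSoftwarePlatform/flash-attention fork exposes the same Python
        # API as upstream flash-attn; causal=True stands in for the causal mask.
        from flash_attn import flash_attn_func
        return flash_attn_func(query, key, value, dropout_p=0.0,
                               softmax_scale=scale, causal=True)

    # Default path at the time: xformers memory-efficient attention.
    from xformers import ops as xops
    return xops.memory_efficient_attention_forward(
        query, key, value,
        attn_bias=xops.LowerTriangularMask(),
        p=0.0, scale=scale,
    )
```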
Hi @pcmoritz, I got the following error when building vLLM:

```
In file included from csrc/attention/attention_kernels.hip:23:
In file included from csrc/attention/attention_dtypes_hip.h:7:
In file included from csrc/attention/dtype_bfloat16_hip.cuh:29:
/opt/rocm-5.6.0/include/hip/hip_bf16.h:30:10: fatal error: 'hip/amd_detail/amd_hip_bf16.h' file not found
#include <hip/amd_detail/amd_hip_bf16.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
error: command '/opt/rocm-5.6.0/bin/hipcc' failed with exit code 1
```

Do you happen to know how to resolve this? I'm using PyTorch 2.1.0 with ROCm 5.6.0. My local ROCm version is also 5.6.0.

This header is only available with ROCm 5.7, unfortunately (somebody else ran into the same problem).
@pcmoritz I see. Then, which version of PyTorch did you use?
The nightly one that comes with ROCm 5.7 support.
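As a side note (not from the thread itself), a quick way to confirm which HIP version a given PyTorch build was compiled against:

```python
import torch

# On a ROCm build, torch.version.hip is a version string such as "5.7...";
# on a CUDA build it is None and torch.version.cuda is set instead.
print("torch:", torch.__version__)
print("hip:  ", torch.version.hip)
print("cuda: ", torch.version.cuda)
```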
@pcmoritz Thank you for the port. Are you aware of a way to avoid using …?
We need to support 5.6 because that's what PyTorch supports now.
@ehartford Yes, it would be nice. So far I have been using the nightly and it works quite nicely!
OK, I figured out how to install both versions of ROCm, so that I can use 5.7 with vLLM and 5.6 with PyTorch.
What's preventing this from being merged?
@pcmoritz Do you know what I did wrong? Of course there's no `CUDA_HOME`, because it's ROCm.
@WoosukKwon Do you have any idea when the switch to the new AWQ kernels will happen? I'm planning to port them to ROCm.
@ehartford Are you on the right branch? In this branch, this should be handled (https://github.com/vllm-project/vllm/pull/1313/files#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7R27) -- your error seems to indicate that the …
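For anyone hitting the same thing, a rough sketch (my own illustration, not the exact code behind the linked `setup.py` line) of the kind of check being described: on a HIP build, `CUDA_HOME` should not be required at all.

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

is_hip = torch.version.hip is not None

if is_hip:
    # On ROCm builds, hipcc lives under ROCM_HOME (e.g. /opt/rocm or /opt/rocm-5.7.0).
    if ROCM_HOME is None:
        raise RuntimeError(
            "Cannot find a ROCm installation; set ROCM_HOME / ROCM_PATH.")
else:
    if CUDA_HOME is None:
        raise RuntimeError(
            "Cannot find CUDA_HOME; CUDA must be available to build vLLM.")
```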
Hi, I have a 7900 XTX and a burning desire to run AWQ models. I see there hasn't been much activity on this issue since October; are you still waiting for more mature AWQ kernels? Is there anything I can do to help?
This ports most vLLM kernels to ROCm (with the exception of `quantization_ops`, which is not critical to run some of the models). If you have a working ROCm installation, you can compile this with `python setup.py develop` or `python setup.py install` (on my local machine, `pip install -e .` is NOT working; not sure if that is generic or specific to my setup).