[Continuation] Merge EmbeddedLLM/vllm-rocm into vLLM main #1836
Conversation
Features
* Auto code path selection
* Support llama2
* Support squeezellm rocm
* Add documentation amd-installation.rst describing how to set up the vLLM ROCm version
* format.sh all the code
* Add base amd.Dockerfile

Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: kuanfu <[email protected]>
Hi @hongxiayang, could you share what your benchmark settings are? I'm asking because the numbers seem quite a bit lower than what we got from A100-80GB GPUs on the ShareGPT benchmark.
The setting is as follows: 1 GPU only, on an MI210 (dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0).
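For reference, here is a minimal sketch of a comparable single-GPU run using vLLM's offline Python API; the model name, prompt, and sampling settings are placeholder assumptions, not taken from this thread, and the numbers quoted above presumably come from the ShareGPT benchmark script rather than a one-off call like this:

from vllm import LLM, SamplingParams

# Single-GPU engine mirroring the quoted settings (float16, TP=1, no quantization).
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # assumed model; the thread only says "llama-7b"
    dtype="float16",
    tensor_parallel_size=1,
    quantization=None,
    seed=0,
)

# Illustrative single-prompt generation; the ShareGPT throughput benchmark
# (benchmarks/benchmark_throughput.py) drives the engine with a request trace instead.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)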
Update the vLLM installation procedures on AMD platform. Update vLLM documentations.
@WoosukKwon It is ready for another review. Thank you very much.
@hongxiayang This is my benchmark result on llama-7b and ShareGPT:
[benchmark results image]
Wow, is this one GPU or 8 GPUs? Your number is quite different from mine, and I am wondering whether we had the same parameters when running the test. Edit: My MI210 was wonky when I ran the test. Your number is valid.
@tjtanaa Thanks again for the great work! I found the code super clean and well organized. I also like the detailed documentation and the provided docker image. I could run vLLM on MI210 very smoothly and the performance was great! Thanks a lot for the contribution.
Left some minor comments on the code style. Please take a look!
-v <path/to/model>:/app/model \
vllm-rocm \
bash
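For context, a hedged sketch of the full command this fragment belongs to, adding the device flags ROCm containers typically need; the image name and volume mount are the ones quoted above, while the other flags are common ROCm-container assumptions rather than text taken from this PR's documentation:

docker run -it --rm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --shm-size 8g \
  -v <path/to/model>:/app/model \
  vllm-rocm \
  bash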
I think we can keep this as is. I tried it out and it worked pretty smoothly!
ROOT_DIR = os.path.dirname(__file__)

MAIN_CUDA_VERSION = "12.1"

# Supported NVIDIA GPU architectures.
-SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
+NVIDIA_SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
+ROCM_SUPPORTED_ARCHS = {"gfx90a", "gfx908", "gfx906", "gfx1030", "gfx1100"}
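To illustrate why the single SUPPORTED_ARCHS set is split in two, here is a hedged sketch of how a build script could pick the right set at install time; vLLM's actual setup.py logic may differ, and _is_hip is an illustrative helper name, not necessarily the one used in the PR:

import torch

def _is_hip() -> bool:
    # torch.version.hip is only set (non-None) on ROCm/HIP builds of PyTorch.
    return torch.version.hip is not None

# Choose the arch list matching the installed PyTorch backend,
# using the two sets defined in the diff above.
supported_archs = ROCM_SUPPORTED_ARCHS if _is_hip() else NVIDIA_SUPPORTED_ARCHS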
Just curious: which part of the code imposes this requirement? That is, why is gfx8 not supported? While I don't think we have to support it, I'd like to know why we don't.
The list of ROCm supported archs was compiled based on what AMD supports for ROCm and HIP. Furthermore, each arch has its own set of assembly instructions, so we also have to make sure the assembly instructions we currently use are supported by those archs.
To the best of our knowledge, the following are the ARCH requirements needed by different libraries:
- PyTorch:
gfx900 gfx906 gfx908 gfx90a gfx1030 gfx1101
- vLLM Custom Ops:
gfx90a gfx908 gfx906 gfx1030 gfx1100
- Flash-Attention-ROCm:
gfx90a gfx940 gfx941 gfx942
Should we use the intersection of all three ARCH requirements instead?
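For concreteness, a quick sketch of what that intersection would be, using the three lists above (the set names here are illustrative, not identifiers from the codebase):

# Arch sets copied from the lists above.
PYTORCH_ARCHS = {"gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030", "gfx1101"}
VLLM_CUSTOM_OP_ARCHS = {"gfx90a", "gfx908", "gfx906", "gfx1030", "gfx1100"}
FLASH_ATTN_ROCM_ARCHS = {"gfx90a", "gfx940", "gfx941", "gfx942"}

# The intersection of all three requirements leaves only gfx90a (MI200 series).
print(PYTORCH_ARCHS & VLLM_CUSTOM_OP_ARCHS & FLASH_ATTN_ROCM_ARCHS)  # {'gfx90a'}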
@tjtanaa Thanks for the detailed explanation. Sorry, I have little background on this stuff. Maybe I should learn more about ROCm and AMD GPUs 😂
As far as I understand, the vLLM custom ops support every "recent" AMD GPU, and currently the supported GPU list is limited by ROCm Flash Attention. Is this correct?
@WoosukKwon We believe that in the near future, the supported GPU archs are going to be restricted by Flash Attention ROCm.
fyi: The supported gfx arch for ROCm is documented here (as "LLVM target" column): https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus.
@hongxiayang It's LLaMA2-7B on a single MI210. Basically, it should be the same setup as yours.
@WoosukKwon We have finished updating the code style and replied to your questions regarding the supported ARCHs.
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Amir Balwel <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: kuanfu <[email protected]>
Co-authored-by: miloice <[email protected]>
Add ROCm Support
As too many changes have been made since #1749, the previous PR #1749 was closed and is continued here.
PR Authors:
@kliuae
@iAmir97
@tjtanaa
@tanpinsiang
Contributor:
@pcmoritz
This pull request also incorporates the work from "Port most vLLM kernels to ROCm" #1313 by @pcmoritz, which was not merged. We appreciate @pcmoritz's contribution.