Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] vLLM ARM Enablement for AARCH64 CPUs #9228

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

sanketkaleoss
Copy link

@sanketkaleoss sanketkaleoss commented Oct 10, 2024

Description
This PR enables support of vLLM for AARCH64 architecture. Motivated by the requirements from (#176, #5741, etc), I implemented this PR which enables the ARM path for CPU inference.

ARM Compatibility: Modified the build scripts and configuration files to ensure compatibility with ARM processors. It currently supports float32 and bfloat16 datatypes.

Motivation
Enabling vLLM on ARM architecture broadens its usability, allowing it to run on a wider range of devices, including those with ARM processors. This enhancement is crucial for expanding the reach and applicability of vLLM in various use cases.

Checklist

  • Code changes have been tested on ARM devices (Graviton3).

Modifications

  1. Added NEON intrinsics for enabling vLLM on ARM
  2. Updated requirements-cpu.txt
  3. Added ARM path in cpu_extension.cmake

Performance Results

Model : facebook/opt-125m
Datatype : float32

image

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@sanketkaleoss sanketkaleoss changed the title vLLM ARM Enablement for AARCH64 CPUs [Feature] vLLM ARM Enablement for AARCH64 CPUs Oct 22, 2024
@sanketkaleoss
Copy link
Author

sanketkaleoss commented Oct 22, 2024

@youkaichao @ChipKerchner @mgoin @pathorn @manojnkumar I have added a new feature that allows vLLM to run on ARM CPU backend. I have tested on AWS Graviton 3E. Please have a look at this PR.

Copy link
Collaborator

@mgoin mgoin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @sanketkaleoss! This seems reasonable to me as a base implementation. It seems only compute in fp32 is supported, is that right?
It would be good to update the cpu installation documentation with how to build and to also add a new Dockerfile.arm

message(STATUS "ARMv8 architecture detected")
list(APPEND CXX_COMPILE_FLAGS
"-mcpu=native"
"-march=armv8.6-a"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this specific arch chosen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

armv8.6-a is the first architecture that supports advance SIMD instructions, bf16 support and SVE support. That's why I chose this specific arch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but for instance I think Apple's M1 CPU uses ARMv8.4-A so we should consider supporting older versions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, noted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to wikichip, Graviton3 is also ARMv8.4A -- is the Graviton3E different? I haven't found any documentation on it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlrmchlsmth You're right, graviton 3 has armv8.4. The code needs 8.6 only for bf16 dependencies. But if I just use "-mcpu=native" or "-march=armv8.4-a+bf16" the code works fine on Graviton3 instances. What do you suggest here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If -march=armv8.4-a+bf16 works, then I suggest using that for this PR.

In general, I think it's best to go with a fairly minimal ISA version, and then explicitly specify what ISA extensions we want to use, especially because ARM has many extensions that are optional for several ISA versions before they become mandatory

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the M1 Mac (and I think M2) does not support BF16 NEON so unfortunately this wouldn't help there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mgoin , @tlrmchlsmth I've added the dockerfile and documentation. Changed the arm arch to native. Please have a look.

@sanketkaleoss
Copy link
Author

Nice work @sanketkaleoss! This seems reasonable to me as a base implementation. It seems only compute in fp32 is supported, is that right? It would be good to update the cpu installation documentation with how to build and to also add a new Dockerfile.arm

Thanks @mgoin for the review. It supports fp32 and bf16 as of now. Sure, I'll work on updating the documentation and adding Dockerfile too.

@youkaichao
Copy link
Member

this is great, I will hand it over to @mgoin for detailed review.

it would be better if we can support macos m chips as well, they are also arm chips.

@sanketkaleoss
Copy link
Author

this is great, I will hand it over to @mgoin for detailed review.

it would be better if we can support macos m chips as well, they are also arm chips.

Thanks @youkaichao . I'll add that to future work.

message(STATUS "ARMv8 architecture detected")
list(APPEND CXX_COMPILE_FLAGS
"-mcpu=native"
"-march=armv8.6-a"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If -march=armv8.4-a+bf16 works, then I suggest using that for this PR.

In general, I think it's best to go with a fairly minimal ISA version, and then explicitly specify what ISA extensions we want to use, especially because ARM has many extensions that are optional for several ISA versions before they become mandatory

message(STATUS "ARMv8 architecture detected")
list(APPEND CXX_COMPILE_FLAGS
"-mcpu=native"
"-march=armv8.6-a"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the M1 Mac (and I think M2) does not support BF16 NEON so unfortunately this wouldn't help there.


namespace vec_op {

// FIXME: FP16 is not fully supported in Torch-CPU
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanketkaleoss do you have plans to support FP16 in a future PR? I see that it's partially implemented.

Do you know what the problem with FP16 in Torch-CPU is?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's the problem there. I can add the support in future PR if torch-CPU supports FP16 in future.

@caijijuhe
Copy link

Hello, I followed your Dockerfile.arm encountered during the build process。
ERROR: ld.so: object '/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Can you share the built image

@mergify mergify bot added ci/build documentation Improvements or additions to documentation labels Nov 4, 2024
elseif (POWER9_FOUND OR POWER10_FOUND)
message(STATUS "PowerPC detected")
# Check for PowerPC VSX support
list(APPEND CXX_COMPILE_FLAGS
"-mvsx"
"-mcpu=native"
"-mtune=native")

elseif (ASIMD_FOUND)
message(STATUS "ARMv8 architecture detected")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it could be ARMv9:

Suggested change
message(STATUS "ARMv8 architecture detected")
message(STATUS "ARMv8 or later architecture detected")

or maybe explicitly call out NEON instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sure

Comment on lines 88 to 89
"-mcpu=native"
"-mtune=native"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Native is OK for local development, however the binaries won't be portable at all.

I actually thought you were on the right track before this change. We really ought to explicitly specify a fairly minimal base ARM architecture (maybe ARMv8.4) and then explicitly the set of extensions that we need to build (BF16, and maybe FP16 and DotProd?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I'll try to set it as armv8.4 as a base implementation. Then, depending on the flags it would run bf16 or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try using "-march=armv8.4-a+bf16+dotprod+fp16"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for your guys reference, on arm64 -mcpu= acts as both specifying the appropriate architecture and tuning and it's generally better to use that vs -march if you're building for a specific CPU, you can find more details from here if you are running any test on Graviton: https://github.com/aws/aws-graviton-getting-started/blob/main/c-c++.md

Copy link
Author

@sanketkaleoss sanketkaleoss Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try using "-march=armv8.4-a+bf16+dotprod+fp16"?

Yes, I tried it and it works. It even works with "-march=armv8.2-a+bf16+dotprod+fp16" making it compatible with Graviton2 as suggested by @ddynwzh1992 .

Copy link

mergify bot commented Nov 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @sanketkaleoss please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 6, 2024
@sanketkaleoss
Copy link
Author

@mgoin @tlrmchlsmth I have created separate paths for FP32 and BF16 datatypes. This code now works on any arm machine from armv8.2-a onwards. Even if BF16 support is not present, it will choose FP32 path. Please have a look.

Sanket Kale added 2 commits November 14, 2024 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants