Support for HF Accelerate on Gaudi / AMD cards #239

Open
1 task
Tracked by #2218
JamesKunstle opened this issue Oct 1, 2024 · 0 comments
JamesKunstle commented Oct 1, 2024

We chose to use Hugging Face Accelerate to implement support for both DeepSpeed and FSDP. On Nvidia cards, this was a drop-in replacement for a fully custom training loop. However, minor training loop changes may be needed to run on Intel Gaudi and AMD Instinct cards.

One example of this difference is that on Gaudi cards, one switches:

```diff
- model.to(device_id)
+ model.to(device_hpu)
```

The Intel Gaudi PyTorch bridge then handles moving the model to the correct device for the assigned process rank.

This is different enough from the vanilla Nvidia + torch training loop that Accelerate might not support it directly.
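For illustration, one possible shape of a non-invasive workaround is a small device-selection helper. This is a minimal sketch assuming the Habana PyTorch bridge (`habana_frameworks.torch`) and a standard CUDA/ROCm build of PyTorch; the names are hypothetical and not taken from the current training loop:

```python
import torch
import torch.nn as nn

def get_device(local_rank: int) -> torch.device:
    """Pick the target device without hard-coding CUDA ordinals."""
    # Intel Gaudi: importing the Habana PyTorch bridge registers the "hpu"
    # device, and the bridge maps it to the right card for this process rank.
    try:
        import habana_frameworks.torch.core  # noqa: F401
        return torch.device("hpu")
    except ImportError:
        pass
    # Nvidia and AMD (ROCm builds of PyTorch) both report as "cuda".
    if torch.cuda.is_available():
        return torch.device("cuda", local_rank)
    return torch.device("cpu")

# Usage: move the model once, to whatever accelerator is present.
model = nn.Linear(8, 8).to(get_device(local_rank=0))
```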

We need to scope:

  1. What is the current state of support for running distributed training on AMD / Intel hardware?
  2. Is it possible to non-invasively patch Accelerate to make it work on Intel / AMD (see the sketch after this list)?
  3. How can we propagate those changes upstream?
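
For reference, the pattern Accelerate already abstracts on Nvidia cards is shown below; the open question in item 2 is whether the same `prepare()` / `accelerator.device` path places the model correctly on Gaudi and Instinct cards or needs patching. This is a generic sketch of the public Accelerate API, not the project's actual loop:

```python
import torch
import torch.nn as nn
from accelerate import Accelerator

# Generic Accelerate setup: the Accelerator picks the device and wraps the
# model/optimizer for DeepSpeed or FSDP, so the loop has no explicit
# model.to(device_id) calls of its own.
accelerator = Accelerator()
model = nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

# Inputs are created on (or moved to) accelerator.device.
batch = torch.randn(4, 8, device=accelerator.device)
loss = model(batch).sum()
accelerator.backward(loss)
optimizer.step()
```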

Contributing tasks:
