We chose to use Hugging Face Accelerate to implement support for both DeepSpeed and FSDP. On Nvidia cards, this was a drop-in replacement for a fully custom training loop. However, minor training loop changes may be needed to run on Intel Gaudi and AMD Instinct cards.
One example of this difference is that on Gaudi cards, one switches:
- model.to(device_id)
+ model.to(device_hpu)
The Intel Gaudi PyTorch bridge then handles moving the model to the correct device according to the assigned process rank.
This is different enough from the vanilla Nvidia + torch training loop that Accelerate might not support it directly.
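For reference, a minimal sketch of the device-placement difference, assuming the Habana PyTorch bridge (`habana_frameworks.torch`) is installed on the Gaudi host; the model and the single-card fallback are illustrative, not our training code:

```python
import torch

try:
    # Importing the Gaudi bridge registers the "hpu" device type with PyTorch.
    import habana_frameworks.torch.core  # noqa: F401
    device = torch.device("hpu")  # the bridge maps the process rank to a physical card
except ImportError:
    # On Nvidia, the process rank selects the card explicitly.
    local_rank = 0  # placeholder; normally taken from the launcher environment
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 16)
model = model.to(device)  # same call, but the device resolution differs per backend
```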
We need to scope:
- What is the current state of support for running distributed training on AMD / Intel?
- Is it possible to non-invasively patch Accelerate to make it work on Intel / AMD? (See the probe sketch below.)
- How can we propagate those changes upstream?
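As a starting point for the second item, a small probe using only public Accelerate APIs can show which device Accelerate selects on a given host before any patching is attempted; the model and optimizer here are placeholders:

```python
import torch
from accelerate import Accelerator

# Accelerator() picks a device based on what it detects on the host
# (CUDA, CPU, or another backend if the installed version supports it).
accelerator = Accelerator()
print(f"rank {accelerator.process_index}: selected device = {accelerator.device}")

# Placeholder model/optimizer to confirm that prepare() places everything
# on the selected device without manual .to(...) calls in the training loop.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)
print(f"rank {accelerator.process_index}: model is on {next(model.parameters()).device}")
```

If the selected device is wrong on Gaudi or Instinct hardware, that tells us whether the fix belongs in device detection, in `prepare()`, or in our own launch configuration.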
Contributing tasks: