[Hardware] Add support for Huawei Ascend NPU #198

Draft · wants to merge 52 commits into main

Conversation

Chendong98

1. Single Controller:

   - Change placement group resources from GPU to NPU (see the Ray sketch after this list).
   - Make modifications to integrate Huawei's HCCL.

2. Megatron:

   - Adapt Megatron to the Huawei Ascend NPU using MindSpeed, and upgrade Megatron to version 0.6.0 to comply with MindSpeed's requirements.
   - Adapt to Megatron-Core 0.6.0's ParamAndGradBuffer when synchronizing weights between Megatron-LM and vLLM.
   - Replace operators in ParallelLlamaModel, including RMSNorm, flash attention, RoPE, and pad/unpad.

3. vLLM:

   - Use this PR for vLLM Ascend support.
   - Add the SPMD version of vLLM 0.6.4.post1.
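
For item 1, a minimal sketch of what requesting NPUs instead of GPUs in a Ray placement group can look like. The `"NPU"` resource key, bundle layout, and worker count below are illustrative, not verl's actual resource-pool code:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Request one NPU per worker bundle instead of the usual {"GPU": 1}.
# Bundle contents and counts are placeholders.
bundles = [{"CPU": 8, "NPU": 1} for _ in range(8)]
pg = placement_group(bundles=bundles, strategy="PACK")
ray.get(pg.ready())
```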

Chendong98 and others added 30 commits January 24, 2025 17:35
- Support training several iters in SFT trainer
- Add CI for SFT trainer to train one iter.
This PR adds support for LoRA (Low-Rank Adaptation) for efficient model
fine-tuning.

### Changes

1. Added LoRA configuration support in trainer config
2. Modified FSDP wrapping policy to handle LoRA modules
3. Integrated with existing FSDP training infrastructure
4. Added peft dependency
5. Removed unused ring_attn_utils.py

### Features

- Configurable LoRA rank and alpha parameters
- Target module specification for selective adaptation
- Compatible with FSDP sharding strategy
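
For reference, a peft-level sketch of the rank/alpha/target-module options listed above (the exact keys and defaults exposed through verl's trainer config may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # selective adaptation
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```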

### Testing

Tested with Qwen2.5-0.5B-Instruct model on GSM8K dataset using the
provided example script.

### Dependencies

- Added `peft` package to requirements.txt

This PR is based on commit 902ddbe and has been merged with the latest
upstream main branch.

---------

Co-authored-by: Jiayi Pan <[email protected]>
Co-authored-by: openhands <[email protected]>
…r_gpu (volcengine#136)

## Summary

This PR changes all the micro_batch_size to micro_batch_size_per_gpu.

**The core logic of setting batch sizes:**
- **All algorithmic parameters** (train batch size, ppo mini batch size) are global (from the perspective of the single controller) and are normalized inside each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters that represent per-GPU data sizes (i.e., per Worker).
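
Roughly, this distinction plays out as in the sketch below; the names are illustrative rather than the exact verl config keys:

```python
# Hypothetical per-worker normalization, for illustration only.
dp_world_size = 8

# Global (algorithmic) values: normalized inside each worker.
ppo_mini_batch_size = 256
ppo_mini_batch_size_per_gpu = ppo_mini_batch_size // dp_world_size            # 32

# Local (performance) values: already per GPU, used as-is.
ppo_micro_batch_size_per_gpu = 4
grad_accum_steps = ppo_mini_batch_size_per_gpu // ppo_micro_batch_size_per_gpu  # 8
```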

## Main Changes

1. Change the scripts and config and delete the normalization for
micro_bsz
2. Fix CI for SFT
We set `max_num_batched_tokens` in the `.rollout` config, but it wasn't
actually being passed to vLLM, potentially leading to under-utilization of
the GPUs.

This PR:

- properly pass `max_num_batched_tokens` from config to vLLM
- set `disable_log_stats` to False, so vLLM performance information can
be properly displayed (to spot issues)
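
A sketch of the kind of change involved, using vLLM's offline `LLM` entry point (the model name and token budget are placeholders, and verl's actual wiring is omitted):

```python
from vllm import LLM

# Pass the scheduler budget through instead of silently dropping it,
# and keep stats logging enabled so throughput problems stay visible.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_num_batched_tokens=8192,
    disable_log_stats=False,
)
```
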
# Add Sequence Parallelism and Padding Removal to SFT Trainer

This PR adds sequence parallelism (SP) and padding removal optimizations
to the SFT trainer, which can help improve training efficiency for large
language models.

## Key Changes

### Core Features
1. **Sequence Parallelism**: Added support for sequence parallelism
through the Ulysses framework
   - Configurable via `ulysses_sequence_parallel_size` parameter
   - Properly handles data distribution across SP ranks
   - Maintains consistent loss computation across distributed setup

2. **Padding Removal**: Added support for efficient handling of variable-length sequences (a conceptual sketch follows this list)
   - Enabled via `use_remove_padding` flag (requires SP to be enabled)
   - Uses flash-attention's padding removal utilities
   - Handles proper re-padding and loss computation

3. **Training Improvements**:
   - Added label smoothing support to loss computation
   - Added progress bar with epoch information
   - Added RoPE scaling configuration support
   - Improved error messages for batch size validation
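
A conceptual sketch of the padding removal in item 2. verl relies on flash-attention's padding utilities; this plain-PyTorch version only shows the idea:

```python
import torch

batch, seq_len, hidden = 4, 16, 64
hidden_states = torch.randn(batch, seq_len, hidden)
attention_mask = torch.rand(batch, seq_len) > 0.3       # True = real token

# Remove padding: keep only the real tokens, packed into one dimension.
indices = attention_mask.flatten().nonzero(as_tuple=True)[0]
unpadded = hidden_states.reshape(-1, hidden)[indices]   # (total_real_tokens, hidden)

# ... run the model on the packed tokens ...

# Re-pad: scatter the packed outputs back to the (batch, seq_len) layout.
repadded = torch.zeros(batch * seq_len, hidden)
repadded[indices] = unpadded
repadded = repadded.reshape(batch, seq_len, hidden)
```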

### Testing
- Added comprehensive test suite (`test_trainer.py`) to verify:
  - Forward pass consistency between original and SP+rmpad implementations
  - Loss computation correctness across distributed setup
  - Proper handling of micro-batches

### Example Usage
Added example script `examples/sft/gsm8k/run_qwen_05_sp2.sh`
demonstrating how to use the new features with the Qwen2.5-0.5B model.

## Implementation Details
- Uses device mesh for proper distributed training setup
- Handles data distribution ensuring same sequences within SP groups but
different across DP groups
- Carefully manages backward pass timing with gradient checkpointing
- Maintains compatibility with existing FSDP features
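
A sketch of the device-mesh layout this implies; the dimension names and sizes are illustrative:

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 ranks arranged as 4 DP groups x 2 SP ranks (sizes are placeholders).
mesh = init_device_mesh("cuda", mesh_shape=(4, 2), mesh_dim_names=("dp", "sp"))

dp_rank = mesh["dp"].get_local_rank()  # selects this rank's data shard
sp_rank = mesh["sp"].get_local_rank()  # position within the sequence-parallel group

# Sharding the dataset by dp_rank only gives every rank in an SP group the
# same sequences, while different DP groups receive different data.
```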

## Testing Instructions
1. Run the example script with sequence parallelism:
```bash
bash examples/sft/gsm8k/run_qwen_05_sp2.sh <nproc_per_node> <save_path>
```

2. Run the test suite:
```bash
bash tests/sft/run_sft_sp_loss_match.sh
```


^^ This PR description was generated by [OpenHands](https://github.com/All-Hands-AI/OpenHands)

---------

Co-authored-by: Jiayi Pan <[email protected]>
Co-authored-by: openhands <[email protected]>
…llm log level (volcengine#141)

- The previous gradient accumulation value was computed from micro_batch_size, which is wrong when using dynamic_bsz (see the sketch below)
- Fix the CI script to avoid overlooking this issue
- Change the vLLM stats log default value to True to disable logging.
- We will check `self.config.actor.ppo_mini_batch_size % self.config.actor.ppo_micro_batch_size_per_gpu == 0` after normalization in fsdp_workers instead of in dp_actor and dp_critic.
- Add a link to the performance tuning docs
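
A self-contained sketch of why the accumulation factor has to come from the actual split when dynamic batching is on (values are made up; verl's real splitting logic differs):

```python
import torch

ppo_mini_batch_size = 8
ppo_micro_batch_size_per_gpu = 2
max_token_len = 512                        # per-GPU token budget for dynamic_bsz

# Static micro-batching: the factor follows from the config.
static_grad_accum = ppo_mini_batch_size // ppo_micro_batch_size_per_gpu   # 4

# Dynamic batching: split by token budget, so the factor must be the number
# of micro-batches actually produced, not a config-derived constant.
seq_lens = torch.tensor([300, 250, 400, 100, 480, 60, 200, 350])
micro_batches, current, budget = [], [], 0
for n in seq_lens.tolist():
    if current and budget + n > max_token_len:
        micro_batches.append(current)
        current, budget = [], 0
    current.append(n)
    budget += n
micro_batches.append(current)
dynamic_grad_accum = len(micro_batches)    # 6 here, not 4
```
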
`token_level_rewards == (token_level_rewards * non_zero_mask)`
vermouth1992 and others added 22 commits February 5, 2025 01:52
Add contribution guide
…o_batch` (volcengine#164)

The logits are of shape `(bsz, response_length, vocab_size)`. This PR
doesn't change any code execution, but it explicitly shows the logits shape,
making the code easier for readers to understand.
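
Roughly the kind of annotation involved (illustrative, not the actual diff):

```python
import torch

bsz, response_length, vocab_size = 4, 16, 32000
logits = torch.randn(bsz, response_length, vocab_size)

# Unpacking the shape documents the expected (bsz, response_length, vocab_size) layout.
bsz, response_length, vocab_size = logits.shape
log_probs = torch.log_softmax(logits, dim=-1)   # normalize over vocab_size
```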

Signed-off-by: Hongpeng Guo <[email protected]>
…)` to load model (volcengine#133)

## Summary

This PR enables using Liger Kernel's `_apply_liger_kernel_to_instance`
to initialize an FSDP worker model.

## Main Changes

1. Add an option to use `liger_kernel.transformers.AutoLigerKernelForCausalLM` to load a pretrained model, instead of the default `transformers.AutoModelForCausalLM`
2. Add a test case using the configuration file `tests/e2e/run_qwen_gsm8k_model_rm_liger_kernel.sh`
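
A minimal sketch of the two loading paths (the model name is a placeholder and verl's config plumbing is omitted):

```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import (
    AutoLigerKernelForCausalLM,
    _apply_liger_kernel_to_instance,
)

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder

# Option 1: load with Liger kernels patched in from the start.
model = AutoLigerKernelForCausalLM.from_pretrained(model_name)

# Option 2: load normally, then patch the existing instance in place.
model = AutoModelForCausalLM.from_pretrained(model_name)
_apply_liger_kernel_to_instance(model=model)
```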

## Related Issue

volcengine#96 

## TODO

volcengine#97 optimize the memory usage when computing entropy & log_probs

https://github.com/volcengine/verl/blob/6d96fda3d47f057caaa8f494ca7804181903e911/verl/workers/actor/dp_actor.py#L94-L106

---------

Signed-off-by: Hongpeng Guo <[email protected]>
This is a follow-up to volcengine#151

## Motivation

Currently, in order to add a custom score function you need to fork verl
and update the `_select_rm_score_fn` to define your logic. This makes it
harder to use verl as part of a larger application while staying up to
date with upstream improvements in verl.

It would be convenient to allow end users to directly pass in a reward
function they wish to use, without requiring them to clone/fork verl to
do so.

## Design

In this PR I slightly modify `main_ppo.py` to allow users to import a
new function, `run_ppo`. `run_ppo` behaves very similarly to the existing
`main`, with the important addition of a new `compute_score` argument.
This argument, if passed in, is used to compute the score of every
generation. This is the change that allows end users to plug in their own
reward logic without forking verl.
The `compute_score` function is similar in shape to the existing
`compute_score` on gsm8k and math. However, I have added a new
`data_source` parameter so that the user can compute the score
differently if desired depending on the task shape.

## Example Usage

This is a sample script showing how you can use the new functionality. I
have tested that this works.

```python
from verl.trainer.main_ppo import run_ppo
from omegaconf import OmegaConf


def custom_compute_score(data_source, solution_str, ground_truth):
    """Dummy compute_score function that rewards the model for generations of exactly 20 characters :)"""
    # Score peaks at 0 for exactly 20 characters and drops as the length deviates.
    return -abs(len(solution_str) - 20)


config = OmegaConf.load("vendor/verl/verl/trainer/config/ppo_trainer.yaml")

# Update config as needed
config.data.train_files = "path/to/train.parquet"
config.data.val_files = "path/to/test.parquet"
# ...

run_ppo(config, custom_compute_score)
```

## Breaking changes

There are no breaking changes in this PR. It is still possible to call
`python -m verl.trainer.main_ppo ...` as before (although if you want to
pass in a custom compute_score you will need to use the new method
described above).

## Possible future work

It would be great to move to [structured
configs](https://omegaconf.readthedocs.io/en/2.1_branch/structured_config.html)
as well since they'd allow us to have typesafe, autocompletable
configurations from Python. I thought about adding those changes here as
well but they would be much more extensive and I'm not sure whether
there's interest from the project.
since 'lighteval/MATH' is no longer available on huggingface.
This PR adds documentation for the LigerKernel option in a new
performance tuning section, addressing the comment from
volcengine#173.

Changes:
- Created new performance tuning section in docs
- Documented LigerKernel option for SFT
- Added performance tuning section to documentation index

Related to volcengine#173

---------

Co-authored-by: openhands <[email protected]>
Co-authored-by: HL <[email protected]>
…olcengine#191)

Runs always show "crashed" on my wandb, despite finishing successfully.
"Crashed" indicates that wandb did not finish sending the "success"
signal to the server, so the server believes the client was terminated
unexpectedly. Furthermore, the wandb log is incomplete (its last lines are missing).

This PR adds a call to `wandb.finish` when the Tracker is destructed
(oftentimes when `trainer.fit` finishes) so that signals are sent to the
server and a data sync is performed.
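
A hedged sketch of the idea; the class and method names are illustrative, not verl's actual Tracking implementation:

```python
import wandb

class Tracker:
    """Illustrative logger wrapper, not verl's actual Tracking class."""

    def __init__(self, project, run_name, config=None):
        self.run = wandb.init(project=project, name=run_name, config=config)

    def log(self, metrics, step):
        wandb.log(metrics, step=step)

    def __del__(self):
        # Flush and close the run so the server records a clean finish
        # instead of marking the run as "crashed".
        if wandb.run is not None:
            wandb.finish()
```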

Without this change:
<img width="526" alt="image"
src="https://github.com/user-attachments/assets/869da24e-c5b8-415c-b15a-bb79c49f96ce"
/>

With this change:
<img width="548" alt="image"
src="https://github.com/user-attachments/assets/16f0a40d-ea3b-48ed-93a4-f40ee01cb7c6"
/>