Feature/add remax support #234
base: main
Conversation
Hi @liziniu, thank you for your contribution! Judging from our implementation, I think ReMax can be implemented by adding a few lines to the original PPO/GRPO/REINFORCE implementation instead of writing a new trainer, which makes maintenance easier. Correct me if this is wrong.
+1. From my understanding, ReMax can be implemented similarly to REINFORCE++, with a different advantage estimator. See the REINFORCE++ implementation: #228
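For context, here is a minimal sketch of what such an advantage estimator could look like, following the shape conventions of verl's existing outcome-level estimators. The function name, signature, and tensor layout are assumptions for illustration, not the PR's actual code:

```python
import torch

def compute_remax_outcome_advantage(token_level_rewards: torch.Tensor,
                                    reward_baselines: torch.Tensor,
                                    response_mask: torch.Tensor):
    """ReMax advantage: reward of the sampled response minus the reward of a
    greedy rollout from the same prompt, broadcast over response tokens.

    token_level_rewards: (bs, resp_len) sparse reward placed at the EOS token
    reward_baselines:    (bs,) scalar reward of the greedy rollout per prompt
    response_mask:       (bs, resp_len) 1 for valid response tokens, else 0
    """
    with torch.no_grad():
        scores = token_level_rewards.sum(dim=-1)   # (bs,) total reward per sample
        advantages = scores - reward_baselines     # greedy-baseline subtraction
        advantages = advantages.unsqueeze(-1) * response_mask
    # With no critic, the returns can simply reuse the advantages.
    return advantages, advantages
```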
@@ -41,7 +41,7 @@ verl is fast with:
 - **vLLM** and **TGI** for rollout generation, **SGLang** support coming soon.
 - huggingface models support
 - Supervised fine-tuning
-- Reinforcement learning from human feedback with [PPO](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer) and [GRPO](https://github.com/volcengine/verl/tree/main/examples/grpo_trainer)
+- Reinforcement learning from human feedback with [PPO](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer), [GRPO](https://github.com/volcengine/verl/tree/main/examples/grpo_trainer), and [ReMax](https://github.com/volcengine/verl/tree/main/examples/remax_trainer)
If you already have the training log and wandb curves, would you mind adding one more record to docs/experiment/ppo.rst to include ReMax? It would help the community track whether the experiment can be reproduced in future versions.
We can do that in the next PR
Yes. A preliminary result on Qwen2.5-3B has been added, and more results will follow.
@vermouth1992 @PeterSH6 I see. Let me reformat the code with minimal changes to the PPO trainer.
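To illustrate what "minimal changes" could mean in practice, the trainer's advantage computation might only need one extra branch, plus one extra greedy rollout per prompt to score the baseline. This is a hypothetical sketch reusing the estimator above; the function and dictionary keys are assumed names, not the PR's actual code:

```python
def compute_advantage(data: dict, adv_estimator: str):
    """Dispatch to an advantage estimator selected by config."""
    if adv_estimator == "remax":
        # reward_baselines comes from scoring one greedy (do_sample=False)
        # rollout per prompt with the same reward function.
        advantages, returns = compute_remax_outcome_advantage(
            token_level_rewards=data["token_level_rewards"],
            reward_baselines=data["reward_baselines"],
            response_mask=data["response_mask"],
        )
    else:
        # existing GAE/GRPO/REINFORCE++ paths, elided in this sketch
        raise NotImplementedError(f"estimator {adv_estimator} not sketched")
    data["advantages"], data["returns"] = advantages, returns
    return data
```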
I have completed the implementation of ReMax support. The changes include:
The code follows the project's style guidelines. Please review when you have a chance. Let me know if any changes or clarifications are needed. Thank you for your time!
Could you add a CI job that runs ReMax with Qwen 0.5B to protect this functionality? You can follow the example here: https://github.com/volcengine/verl/blob/main/.github/workflows/e2e_gsm8k.yml#L69
Description
Added ReMax support to verl. ReMax is a simple, efficient, and stable RL algorithm customized for LLM training, with theoretical guarantees for variance reduction.
The HybridFlow paper experimented with ReMax, but verl did not yet provide an implementation; this PR adds one.
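For reference, the gradient estimator from the ReMax paper uses the reward of a greedy decode as a per-prompt baseline:

$$
\widehat{\nabla}_\theta J(\theta) = \big( r(x, y) - r(x, \bar{y}) \big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}), \qquad y \sim \pi_\theta(\cdot \mid x),
$$

where $\bar{y}$ is the greedy (argmax) response to the same prompt $x$. Since the baseline $r(x, \bar{y})$ costs only one extra generation and requires no value network, this is what motivates the efficiency and variance-reduction claims above.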
Changes
Testing
Validation reward when optimizing Qwen2.5-3B-Instruct on the GSM8K dataset:
The curve demonstrates the effectiveness of ReMax, though its performance could be further improved through hyperparameter tuning.
Documentation
Checklist