
Use reinforce #35

Open
supermancmk opened this issue Feb 15, 2025 · 2 comments

Comments

@supermancmk

Thanks for your work! May I ask whether you used REINFORCE (--advantage_estimator reinforce)?
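For readers unfamiliar with the flag: with --advantage_estimator reinforce, the advantage is just the (discounted) reward-to-go, with no learned critic. A minimal sketch of that estimator, assuming per-token rewards; the function and argument names are illustrative, not taken from this repo:

```python
import torch

def reinforce_advantages(rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    # rewards: (batch, seq_len) per-token rewards. Vanilla REINFORCE uses the
    # discounted reward-to-go directly as the advantage, with no value baseline.
    returns = torch.zeros_like(rewards)
    running = rewards.new_zeros(rewards.size(0))
    for t in reversed(range(rewards.size(1))):  # accumulate from the last token
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns  # used directly as advantages
```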

@iseesaw commented Feb 16, 2025

I have the same question: have the authors compared different RL algorithms, such as REINFORCE++, RLOO, and GRPO?

I successfully replicated the results using PPO, which requires more GPUs. Recently I tried REINFORCE++, but training collapsed: the reward first rose to 0.6 and then quickly dropped to -1. It's unclear whether this is due to the hyperparameter settings or something else.
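For context on why REINFORCE++ can behave differently from PPO here: as I understand it, REINFORCE++ shapes the per-token reward with a KL penalty against the reference policy and then normalizes the resulting returns over the global batch instead of training a critic. A rough sketch of that estimator (the names and the exact normalization scheme are my assumptions, not this repo's code):

```python
import torch

def reinforce_pp_advantages(
    rewards: torch.Tensor,       # (batch, seq_len) task reward, often nonzero only at EOS
    logprobs: torch.Tensor,      # (batch, seq_len) log-probs under the current policy
    ref_logprobs: torch.Tensor,  # (batch, seq_len) log-probs under the frozen reference
    kl_coef: float = 0.01,
    gamma: float = 1.0,
    eps: float = 1e-8,
) -> torch.Tensor:
    # Shape the per-token reward with a KL penalty against the reference policy.
    shaped = rewards - kl_coef * (logprobs - ref_logprobs)
    # Discounted reward-to-go, as in vanilla REINFORCE.
    returns = torch.zeros_like(shaped)
    running = shaped.new_zeros(shaped.size(0))
    for t in reversed(range(shaped.size(1))):
        running = shaped[:, t] + gamma * running
        returns[:, t] = running
    # Whiten across the whole batch in place of a learned value baseline.
    return (returns - returns.mean()) / (returns.std() + eps)
```

Because the baseline comes from batch statistics rather than a trained critic, the advantage scale can shift abruptly when the reward distribution changes, which might be related to the collapse you observed.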

@supermancmk (Author)

Yes, I ran into the same problem with REINFORCE++. After 54 training steps the reward dropped and the response length increased sharply, causing training to collapse. Have the authors encountered this problem? Here is my training curve. I left the hyperparameters unchanged except for setting advantage_estimator to reinforce. @Zeng-WH @SivilTaram @HYZ17 @jxhe

[Image: training curve]
