I have the same issue. Has the author compared different RL algorithms, like REINFORCE++, RLOO, or GRPO?
I successfully replicated the results using PPO, though it requires more GPUs. Recently I tried REINFORCE++, but the training crashed: the reward first increased to 0.6 and then quickly dropped to -1. It's unclear whether this is due to hyperparameter settings or other reasons.
Yes, I encountered the same problem after switching to REINFORCE++. After 54 training steps, the reward dropped and the response length increased sharply, causing the training to crash. Has the author encountered this problem? Here is my training curve. I left the hyperparameters unchanged except for setting advantage_estimator to reinforce. @Zeng-WH @SivilTaram @HYZ17 @jxhe
Thanks for your work. May I ask if you used REINFORCE (--advantage_estimator reinforce)?
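For context on why the choice of advantage_estimator can matter so much for stability: PPO typically uses GAE, which relies on a learned critic to reduce variance, while REINFORCE-style estimators use the (normalized) discounted return directly and can be much higher-variance. Below is a minimal sketch of the two estimators under standard textbook definitions; the function names are illustrative and this is not the repo's actual implementation:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation (the critic-based estimator PPO uses).

    `values` holds critic estimates V(s_t); it must be one element longer
    than `rewards` so values[-1] bootstraps the final step.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def reinforce_advantages(rewards, gamma=1.0):
    """REINFORCE-style estimator: the discounted return, with no critic.

    In practice the variance is reduced by whitening the returns
    (REINFORCE++ normalizes advantages globally across the batch;
    here it is done per-trajectory for illustration only).
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```

Without a critic, a single high-reward long trajectory can dominate the gradient, which is one plausible (hedged) explanation for the sharp length growth and reward collapse reported above.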