I have the same issue. Has the author compared different RL algorithms, like REINFORCE++, RLOO, or GRPO?
I successfully replicated the results using PPO, though it requires more GPUs. Recently I tried REINFORCE++, but the training crashed: the reward first increased to 0.6 and then quickly dropped to -1. It's unclear whether this is due to hyperparameter settings or other reasons.
Yes, I encountered the same problem after switching to REINFORCE++. After 54 training steps, the reward dropped and the response length increased sharply, causing the training to crash. Has the author encountered this problem? Here is my training curve. I left the hyperparameters unchanged except for setting advantage_estimator to reinforce. @Zeng-WH @SivilTaram @HYZ17 @jxhe
Thanks for your work. May I ask if you used REINFORCE (--advantage_estimator reinforce)?
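For context on why the choice of advantage_estimator can matter so much for stability: PPO typically uses GAE, which relies on a learned critic to reduce variance, while REINFORCE-style estimators use the (normalized) discounted return directly and can be much higher-variance. Below is a minimal sketch of the two estimators under standard textbook definitions; the function names are illustrative and this is not the repo's actual implementation:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation (the critic-based estimator PPO uses).

    `values` holds critic estimates V(s_t); it must be one element longer
    than `rewards` so values[-1] bootstraps the final step.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def reinforce_advantages(rewards, gamma=1.0):
    """REINFORCE-style estimator: the discounted return, with no critic.

    In practice the variance is reduced by whitening the returns
    (REINFORCE++ normalizes advantages globally across the batch;
    here it is done per-trajectory for illustration only).
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```

Without a critic, a single high-reward long trajectory can dominate the gradient, which is one plausible (hedged) explanation for the sharp length growth and reward collapse reported above.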