
Resume from checkpoints #76

Closed
fzyzcjy opened this issue Jan 4, 2025 · 3 comments

Comments

@fzyzcjy
Contributor

fzyzcjy commented Jan 4, 2025

Hi, thanks for the lib! It would be great if it could support resuming from checkpoints. I checked the docs, but this doesn't seem to be mentioned...

@PeterSH6
Collaborator

PeterSH6 commented Jan 4, 2025

Nice suggestion. It's a good feature to add! We'll include it in our roadmap.

@edchengg

This feature would be awesome. Thanks!

PeterSH6 added a commit that referenced this issue Feb 8, 2025
**Features:**
- Save actor and critic checkpoints, each including:
  - Model
  - Optimizer
  - lr_scheduler
  - rng_state
  - dataloader
- A complete checkpoint means that the dataloader, actor, and critic (if any) states are all properly saved
- By default, we do not save the dataset itself; only the dataloader (with sampler) state is stored
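To make the checkpoint contents concrete, here is a minimal sketch of what saving and restoring such a bundle could look like. The function names, dict keys, and use of `pickle` are illustrative assumptions for this sketch, not verl's actual checkpointing API (which handles real model/optimizer state dicts).

```python
import pickle
import random

# Hypothetical helpers mirroring the checkpoint contents listed above:
# model, optimizer, lr_scheduler, rng_state, and dataloader (sampler) state.

def save_checkpoint(path, model_state, optim_state, sched_state, dataloader_state):
    ckpt = {
        "model": model_state,
        "optimizer": optim_state,
        "lr_scheduler": sched_state,
        "rng_state": random.getstate(),  # so sampling resumes deterministically
        "dataloader": dataloader_state,  # sampler position only, not the dataset
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    random.setstate(ckpt["rng_state"])  # restore RNG before resuming training
    return ckpt
```

The key point the PR description makes is that only the dataloader/sampler state travels with the checkpoint, so resuming continues from the correct position in the data stream without duplicating the dataset on disk.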

**Usage:**
- Supported resume modes: `auto`, `disable`, and `resume_from_path`
  - `auto`: veRL automatically picks up the latest checkpoint from `trainer.default_local_dir`
  - `disable`: veRL always trains from scratch
  - `resume_from_path`: when setting `resume_from_path=True`, the user only needs to set `resume_mode` to the path of the checkpoint they want to load
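The three resume modes above could be resolved along these lines. This is a hedged sketch: the `global_step_*` directory naming and the function itself are assumptions for illustration, not verl's actual implementation.

```python
import os

def resolve_resume(resume_mode, default_local_dir):
    """Map a resume mode to a checkpoint path (or None for a fresh run)."""
    if resume_mode == "disable":
        return None  # always train from scratch
    if resume_mode == "auto":
        # pick the latest checkpoint under the trainer's default_local_dir
        steps = []
        if os.path.isdir(default_local_dir):
            for name in os.listdir(default_local_dir):
                if name.startswith("global_step_"):
                    steps.append(int(name.split("_")[-1]))
        if not steps:
            return None  # no checkpoint found; start fresh
        return os.path.join(default_local_dir, f"global_step_{max(steps)}")
    # otherwise treat resume_mode as an explicit checkpoint path
    return resume_mode
```

With `auto`, a rerun of the same job transparently continues from the newest `global_step_*` directory; passing an explicit path pins the run to one specific checkpoint.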

**TODO:**
- Support SFT resume in the next PR
- Support uploader

**Relevant issue:**
- #76
- #143
@eric-haibin-lin
Collaborator

Merged.
