
Resume from checkpoints #76

Closed
fzyzcjy opened this issue Jan 4, 2025 · 3 comments

Comments

@fzyzcjy
Contributor

fzyzcjy commented Jan 4, 2025

Hi, thanks for the lib! It would be great if it could support resuming from checkpoints. I checked the docs, but this doesn't seem to be mentioned...

@PeterSH6
Collaborator

PeterSH6 commented Jan 4, 2025

Nice suggestion. It's a good feature to add! We'll include it in our roadmap.

@edchengg

This feature would be awesome. Thanks!

PeterSH6 added a commit that referenced this issue Feb 8, 2025
**Features:**
- Save actor and critic checkpoints, each including:
  - Model
  - Optimizer
  - lr_scheduler
  - rng_state
  - dataloader
- A complete checkpoint means that the dataloader, actor, and critic (if any) states are all properly saved
- By default, we do not save the dataset itself; only the dataloader (with sampler) state is stored
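To make the checkpoint contents concrete, here is a minimal sketch of what saving and restoring such a bundle could look like. The function names, dict keys, and use of `pickle` are illustrative assumptions for this sketch, not verl's actual checkpointing API (which handles real model/optimizer state dicts).

```python
import pickle
import random

# Hypothetical helpers mirroring the checkpoint contents listed above:
# model, optimizer, lr_scheduler, rng_state, and dataloader (sampler) state.

def save_checkpoint(path, model_state, optim_state, sched_state, dataloader_state):
    ckpt = {
        "model": model_state,
        "optimizer": optim_state,
        "lr_scheduler": sched_state,
        "rng_state": random.getstate(),  # so sampling resumes deterministically
        "dataloader": dataloader_state,  # sampler position only, not the dataset
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    random.setstate(ckpt["rng_state"])  # restore RNG before resuming training
    return ckpt
```

The key point the PR description makes is that only the dataloader/sampler state travels with the checkpoint, so resuming continues from the correct position in the data stream without duplicating the dataset on disk.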

**Usage:**
- Supported resume modes: `auto`, `disable`, and `resume_from_path`
  - `auto`: veRL automatically picks up the latest checkpoint from `trainer.default_local_dir`
  - `disable`: veRL always trains from scratch
  - `resume_from_path`: when setting `resume_from_path=True`, the user only needs to set `resume_mode` to the path of the checkpoint they want to load
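The three resume modes above could be resolved along these lines. This is a hedged sketch: the `global_step_*` directory naming and the function itself are assumptions for illustration, not verl's actual implementation.

```python
import os

def resolve_resume(resume_mode, default_local_dir):
    """Map a resume mode to a checkpoint path (or None for a fresh run)."""
    if resume_mode == "disable":
        return None  # always train from scratch
    if resume_mode == "auto":
        # pick the latest checkpoint under the trainer's default_local_dir
        steps = []
        if os.path.isdir(default_local_dir):
            for name in os.listdir(default_local_dir):
                if name.startswith("global_step_"):
                    steps.append(int(name.split("_")[-1]))
        if not steps:
            return None  # no checkpoint found; start fresh
        return os.path.join(default_local_dir, f"global_step_{max(steps)}")
    # otherwise treat resume_mode as an explicit checkpoint path
    return resume_mode
```

With `auto`, a rerun of the same job transparently continues from the newest `global_step_*` directory; passing an explicit path pins the run to one specific checkpoint.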

**TODO:**
- Support SFT resume in the next PR
- Support uploader

**Relevant issue:**
- #76
- #143
@eric-haibin-lin
Collaborator

Merged.
