[ckpt] feat: integrate checkpoint resume in RL ray trainer #222

PeterSH6 · 2025-02-07T15:13:50Z

Features:

Save actor and critic checkpoint:
- Model
- Optimizer
- lr_scheduler
- rng_state
- dataloader
A checkpoint is complete only when the dataloader, actor, and critic (if any) states have all been saved.
By default, we do not save the dataset; only the dataloader (with sampler) state is stored.
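
To illustrate why storing the dataloader (with sampler) state can be enough without saving the dataset, here is a minimal standard-library sketch. `ResumableSampler` and its fields are hypothetical and are not the classes used in this PR; the sketch only assumes that the shuffle order can be regenerated from a seed and an epoch counter.

```python
import random


class ResumableSampler:
    """Hypothetical sampler: yields a seeded permutation of indices and can resume mid-epoch."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0      # number of completed epochs
        self.position = 0   # number of indices already yielded in the current epoch

    def _permutation(self):
        # The order depends only on (seed, epoch), so it can be rebuilt after a restart.
        rng = random.Random(self.seed + self.epoch)
        order = list(range(self.num_samples))
        rng.shuffle(order)
        return order

    def __iter__(self):
        order = self._permutation()
        while self.position < self.num_samples:
            idx = order[self.position]
            self.position += 1
            yield idx
        # Epoch finished: advance the epoch counter and reset the position.
        self.epoch += 1
        self.position = 0

    def state_dict(self):
        # Three integers describe the sampler; no dataset contents are stored.
        return {"seed": self.seed, "epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.position = state["position"]
```

Restoring the state into a fresh sampler continues the same permutation from the saved position, so the resumed run sees the samples the interrupted run had not yet consumed.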

Usage:

Supported resume modes: auto, disable and resume_from_path
- auto: veRL automatically looks for the latest checkpoint in trainer.default_local_dir
- disable: veRL always trains from scratch
- resume_from_path: set resume_from_path=True, then set resume_mode to the path of the checkpoint you want to load.
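
The three modes above can be sketched roughly as follows. This is not the trainer's actual code: `resolve_resume` and `find_latest_checkpoint` are hypothetical helpers, and the `global_step_<N>` folder naming is an assumption made for illustration only.

```python
import os
import re


def find_latest_checkpoint(root):
    """Return the highest-numbered 'global_step_<N>' subdirectory of root, or None.

    The 'global_step_<N>' naming is an assumption of this sketch.
    """
    if not os.path.isdir(root):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(root):
        match = re.fullmatch(r"global_step_(\d+)", name)
        path = os.path.join(root, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def resolve_resume(resume_mode, default_local_dir, resume_from_path=False):
    """Return the checkpoint path to load, or None to train from scratch."""
    if resume_mode == "disable":
        return None  # always train from scratch
    if resume_mode == "auto":
        # None here means no checkpoint was found, so training starts from scratch.
        return find_latest_checkpoint(default_local_dir)
    if resume_from_path:
        # As described above, resume_mode itself carries the checkpoint path.
        return resume_mode
    raise ValueError(f"unsupported resume_mode: {resume_mode!r}")
```

Returning `None` for both "disable" and "auto with no checkpoint found" lets the caller treat every train-from-scratch case the same way.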

TODO:

Relevant issue:

PeterSH6 added 14 commits February 7, 2025 11:34

integrate ckpt manager in RL

d74a279

add load in workers

f4f869c

update config

ff935cb

impl load ckpt in ray trainer

329c1d5

update load

e3f4801

update test

d014a53

fix save path and several config

f0b82fa

resume done

61026a1

support dataloader resume

71fbe12

fix

60f81d1

lint

58b0bc3

lint

189f9f8

update ci to check load

58ff76e

fix script

fc10868

PeterSH6 marked this pull request as ready for review February 8, 2025 04:36

PeterSH6 requested a review from vermouth1992 February 8, 2025 04:37

add ckpt test in e2e

91e3dc6

PeterSH6 force-pushed the gm/ckpt_integrate branch from 48ef8f4 to 91e3dc6 Compare February 8, 2025 12:25

add world size in ckpt name

ad90b90

vermouth1992 approved these changes Feb 8, 2025

PeterSH6 merged commit 5a400bf into main Feb 8, 2025
14 checks passed

PeterSH6 deleted the gm/ckpt_integrate branch February 8, 2025 13:35

PeterSH6 mentioned this pull request Feb 9, 2025

Does verl support Breakpoint retraining ？ #143

Open
