docs: add hf ckpt to faq, and include verl apis in the website #427

Open · wants to merge 8 commits into base: main
5 changes: 4 additions & 1 deletion .readthedocs.yaml
@@ -7,10 +7,13 @@ build:
   os: ubuntu-22.04
   tools:
     python: "3.8"
+    rust: "1.70"

 sphinx:
   configuration: docs/conf.py

 python:
   install:
-    - requirements: docs/requirements-docs.txt
+    - requirements: docs/requirements-docs.txt
+    - method: pip
+      path: .
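Context for this change: installing the package itself (`method: pip`, `path: .`) lets Sphinx autodoc import `verl` when building the API pages this PR adds, and the `rust` toolchain is presumably there to build the Rust-backed `tokenizers` wheel pinned below. As a hedged sketch only — the repo's actual `docs/conf.py` may differ — the kind of setup this enables looks roughly like:

```python
# docs/conf.py (sketch; an assumption, not the repo's actual file)
# With the package pip-installed by Read the Docs, autodoc can import verl
# directly instead of relying on sys.path manipulation.
extensions = [
    "sphinx.ext.autodoc",    # pull API docs from verl's docstrings
    "sphinx.ext.napoleon",   # parse Google-style "Args:" / "Returns:" sections
    "sphinx_markdown_tables",
    "sphinx_rtd_theme",
]
html_theme = "sphinx_rtd_theme"
```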
20 changes: 11 additions & 9 deletions README.md
@@ -118,18 +118,20 @@ If you find the project helpful, please cite:
 verl is inspired by the design of Nemo-Aligner, Deepspeed-chat and OpenRLHF. The project is adopted and supported by Anyscale, Bytedance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, and University of Hong Kong.

 ## Awesome work using verl
-- [Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization](https://arxiv.org/abs/2410.09302)
-- [Flaming-hot Initiation with Regular Execution Sampling for Large Language Models](https://arxiv.org/abs/2410.21236)
-- [Process Reinforcement Through Implicit Rewards](https://github.com/PRIME-RL/PRIME/)
-- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): a reproduction of DeepSeek R1 Zero recipe for reasoning tasks
-- [RAGEN](https://github.com/ZihanWang314/ragen): a general-purpose reasoning agent training framework
-- [Logic R1](https://github.com/Unakar/Logic-RL): a reproduced DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
+- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): a reproduction of the **DeepSeek R1 Zero** recipe for reasoning tasks
+- [PRIME](https://github.com/PRIME-RL/PRIME): process reinforcement through implicit rewards
+- [RAGEN](https://github.com/ZihanWang314/ragen): a general-purpose reasoning **agent** training framework
+- [Logic-RL](https://github.com/Unakar/Logic-RL): a reproduction of DeepSeek R1 Zero on a 2K Tiny Logic Puzzle dataset
 - [deepscaler](https://github.com/agentica-project/deepscaler): iterative context scaling with GRPO
-- [critic-rl](https://github.com/HKUNLP/critic-rl): Teaching Language Models to Critique via Reinforcement Learning
-- [Easy-R1](https://github.com/hiyouga/EasyR1): Multi-Modality RL
+- [critic-rl](https://github.com/HKUNLP/critic-rl): LLM critics for code generation
+- [Easy-R1](https://github.com/hiyouga/EasyR1): **multi-modal** RL training framework
+- [self-rewarding-reasoning-LLM](https://arxiv.org/pdf/2502.19613): self-rewarding and correction with **generative reward models**
+- [Search-R1](https://github.com/PeterGriffinJin/Search-R1): RL with reasoning and **searching (tool-call)** interleaved LLMs
+- [DQO](https://arxiv.org/abs/2410.09302): enhancing multi-step reasoning abilities of language models through direct Q-function optimization
+- [FIRE](https://arxiv.org/abs/2410.21236): flaming-hot initiation with regular execution sampling for large language models

 ## Contribution Guide
-Contributions from the community are welcome!
+Contributions from the community are welcome! Please check out our [roadmap](https://github.com/volcengine/verl/issues/22) and [release plan](https://github.com/volcengine/verl/issues/354).

 ### Code formatting
 We use yapf (Google style) to enforce strict code formatting when reviewing PRs. To reformat your code locally, make sure you have installed the **latest** `yapf`.
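For the formatting step above, a hedged illustration via yapf's Python API (the project's documented workflow may invoke the `yapf` CLI instead):

```python
# Illustration of Google-style formatting with yapf's Python API.
# Assumption: the repo's own workflow may use the command-line tool instead.
from yapf.yapflib.yapf_api import FormatCode

messy = "def add( a,b ):\n    return a+b\n"
formatted, changed = FormatCode(messy, style_config="google")
print(formatted)  # def add(a, b):\n    return a + b
print(changed)    # True, since the source was reformatted
```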
5 changes: 5 additions & 0 deletions docs/faq/faq.rst
@@ -55,3 +55,8 @@ Please set the following environment variable.
     export VLLM_ATTENTION_BACKEND=XFORMERS

 If in doubt, print this env var in each rank to make sure it is properly set.
+
+Checkpoints
+------------------------
+
+If you want to convert the model checkpoint into Hugging Face safetensors format, please refer to ``scripts/model_merger.py``.
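The FAQ entry above points at ``scripts/model_merger.py`` without showing its interface, so here is only a minimal sketch of the underlying idea; the shard layout and file names are hypothetical, and the real script's CLI and FSDP handling will differ:

```python
# Hypothetical sketch of merging sharded checkpoints into one safetensors file.
# The shard layout and naming are assumptions; defer to scripts/model_merger.py.
import torch
from safetensors.torch import save_file

def merge_shards(shard_paths, output_path):
    merged = {}
    for path in shard_paths:
        shard = torch.load(path, map_location="cpu")  # one rank's state dict
        for name, tensor in shard.items():
            # Assumes shards hold disjoint parameter names; real FSDP shards
            # may split single tensors and require gathering/concatenation.
            merged[name] = tensor
    save_file(merged, output_path)  # Hugging Face safetensors format

merge_shards(
    [f"checkpoints/model_rank_{rank}.pt" for rank in range(8)],  # hypothetical names
    "model.safetensors",
)
```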
5 changes: 4 additions & 1 deletion docs/requirements-docs.txt
@@ -6,4 +6,7 @@ sphinx-markdown-tables
 # theme default rtd

 # crate-docs-theme
-sphinx-rtd-theme
+sphinx-rtd-theme
+
+# pin tokenizers version to avoid env_logger version req
+tokenizers==0.19.1
18 changes: 8 additions & 10 deletions verl/protocol.py
@@ -84,7 +84,7 @@ def union_tensor_dict(tensor_dict1: TensorDict, tensor_dict2: TensorDict) -> TensorDict:
     return tensor_dict1


-def union_numpy_dict(tensor_dict1: dict[np.ndarray], tensor_dict2: dict[np.ndarray]) -> dict[np.ndarray]:
Review thread on this line:

Collaborator: The current standard of type hint actually prefers `dict` over `typing.Dict`, `list` over `typing.List`.

Collaborator: This starts from Python 3.10. https://docs.python.org/3/library/typing.html

Collaborator (Author): OK. The doc server is not configured with Python 3.10 and threw errors. I'll update it to 3.10 and see if we can build the doc.
+def union_numpy_dict(tensor_dict1: Dict[str, np.ndarray], tensor_dict2: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
     for key, val in tensor_dict2.items():
         if key in tensor_dict1:
             assert isinstance(tensor_dict2[key], np.ndarray)
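To illustrate the review thread above (this snippet is not part of the PR): built-in generics like `dict[str, np.ndarray]` only evaluate at runtime on newer interpreters (PEP 585), while `typing.Dict` works on Python 3.8, which is what the doc build above uses; `from __future__ import annotations` defers evaluation so the builtin spelling at least parses on 3.8 too.

```python
# Not part of the PR: a side-by-side of the two type-hint styles discussed above.
from __future__ import annotations  # defers annotation evaluation (PEP 563)

from typing import Dict

import numpy as np

def union_old_style(d1: Dict[str, np.ndarray], d2: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """typing.Dict: imported from typing, evaluates fine on Python 3.8."""
    d1.update(d2)  # simplified merge; the real union_numpy_dict checks for conflicts
    return d1

def union_new_style(d1: dict[str, np.ndarray], d2: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Builtin generics (PEP 585): the style the reviewer prefers; needs a newer
    runtime to evaluate, but parses on 3.8 under the __future__ import above."""
    d1.update(d2)
    return d1
```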
@@ -97,7 +97,7 @@ def union_numpy_dict(tensor_dict1: dict[np.ndarray], tensor_dict2: dict[np.ndarray]) -> dict[np.ndarray]:
     return tensor_dict1


-def list_of_dict_to_dict_of_list(list_of_dict: list[dict]):
+def list_of_dict_to_dict_of_list(list_of_dict: List[Dict]):
     if len(list_of_dict) == 0:
         return {}
     keys = list_of_dict[0].keys()
@@ -148,7 +148,7 @@ def unfold_batch_dim(data: 'DataProto', batch_dims=2):
     return DataProto(batch=tensor, non_tensor_batch=non_tensor_new, meta_info=data.meta_info)


-def collate_fn(x: list['DataProtoItem']):
+def collate_fn(x: List['DataProtoItem']):
     batch = []
     non_tensor_batch = []
     for data in x:
@@ -448,19 +448,17 @@ def union(self, other: 'DataProto') -> 'DataProto':
         return self

     def make_iterator(self, mini_batch_size, epochs, seed=None, dataloader_kwargs=None):
-        """Make an iterator from the DataProto. This is built upon that TensorDict can be used as a normal Pytorch
+        r"""Make an iterator from the DataProto. This is built on the fact that a TensorDict can be used as a normal PyTorch
         dataset. See https://pytorch.org/tensordict/tutorials/data_fashion for more details.

         Args:
-            mini_batch_size (int): mini-batch size when iterating the dataset. We require that
-                ``batch.batch_size[0] % mini_batch_size == 0``
+            mini_batch_size (int): mini-batch size when iterating the dataset. We require that ``batch.batch_size[0] % mini_batch_size == 0``.
             epochs (int): number of epochs when iterating the dataset.
-            dataloader_kwargs: internally, it returns a DataLoader over the batch.
-                The dataloader_kwargs is the kwargs passed to the DataLoader
+            dataloader_kwargs (Any): internally, this method returns a DataLoader over the batch; ``dataloader_kwargs`` is the kwargs dict passed to that DataLoader.

         Returns:
-            Iterator: an iterator that yields a mini-batch data at a time. The total number of iteration steps is
-                ``self.batch.batch_size * epochs // mini_batch_size``
+            Iterator: an iterator that yields one mini-batch of data at a time. The total number of iteration steps is ``self.batch.batch_size * epochs // mini_batch_size``.
         """
         assert self.batch.batch_size[0] % mini_batch_size == 0, f"{self.batch.batch_size[0]} % {mini_batch_size} != 0"
         # we can directly create a dataloader from TensorDict
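Reading the revised docstring, a usage sketch of `make_iterator` (assumptions flagged inline: the `DataProto` constructor shape is taken from the `unfold_batch_dim` line above, and the iterator is assumed to yield `DataProto`-like objects):

```python
# Usage sketch for make_iterator, based on the revised docstring above.
# Construction details are assumptions, not verl's documented API surface.
import torch
from tensordict import TensorDict
from verl.protocol import DataProto

batch = TensorDict({"input_ids": torch.randint(0, 100, (8, 16))}, batch_size=[8])
data = DataProto(batch=batch, non_tensor_batch={}, meta_info={})

# batch.batch_size[0] == 8 is divisible by mini_batch_size == 4, as required;
# total steps = 8 * 2 epochs // 4 = 4 mini-batches.
for mini_batch in data.make_iterator(mini_batch_size=4, epochs=2, seed=0):
    print(mini_batch.batch["input_ids"].shape)  # expected: torch.Size([4, 16])
```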