
Commit

example: switch the default model ckpt for Megatron, add wandb logs (#210)

Use the general-purpose LLM for the math task instead of the code LLM.

---------

Co-authored-by: Your Name <[email protected]>
eric-haibin-lin and Your Name authored Feb 6, 2025
1 parent 22d56a8 commit ced8ecb
Showing 15 changed files with 68 additions and 49 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -69,7 +69,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
- [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)

**Reproducible algorithm baselines:**
- [PPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
- [PPO and GRPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)

**For code explanation and advance usage (extension):**
- PPO Trainer and Workers
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,4 +1,4 @@
# veRL documents
# verl documents

## Build the docs

6 changes: 3 additions & 3 deletions docs/advance/dpo_extension.rst
@@ -3,7 +3,7 @@ Extend to other RL(HF) algorithms

We already implemented the complete training pipeline of the PPO
algorithms. To extend to other algorithms, we analyze the high-level
principle to use veRL and provide a tutorial to implement the DPO
principle to use verl and provide a tutorial to implement the DPO
algorithm. Users can follow the similar paradigm to extend to other RL algorithms.

.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
@@ -26,7 +26,7 @@ Step 3: Utilize the encapsulated APIs to implement the control flow
Example: Online DPO
-------------------

We use veRL to implement a simple online DPO algorithm. The algorithm
We use verl to implement a simple online DPO algorithm. The algorithm
flow of Online DPO is as follows:

1. There is a prompt (rollout) generator which has the same weight as
@@ -178,7 +178,7 @@ steps:
and merge them.

Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
code readability. **In verl, we have abstracted and encapsulated these 3
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -31,7 +31,7 @@

# -- Project information -----------------------------------------------------

project = u'veRL'
project = u'verl'
# pylint: disable=W0622
copyright = u'2024 ByteDance Seed Foundation MLSys Team'
author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
2 changes: 1 addition & 1 deletion docs/examples/ppo_code_architecture.rst
@@ -200,7 +200,7 @@ Define, init and run the PPO Trainer
on the allocated GPUs (in the resource pool)
- The actual PPO training will be executed in ``trainer.fit()``

veRL can be easily extended to other RL algorithms by reusing the Ray
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.

44 changes: 27 additions & 17 deletions docs/experiment/ppo.rst
@@ -11,22 +11,32 @@ Assuming GSM8k dataset is preprocessed via ``python3 examples/data_preprocess/gsm8
Refer to the table below to reproduce PPO training from different pre-trained models.

.. _Huggingface: https://huggingface.co/google/gemma-2-2b-it#benchmark-results
.. _SFT Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _SFT Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _wandb: https://api.wandb.ai/links/verl-team/h7ux8602
.. _Qwen Blog: https://qwenlm.github.io/blog/qwen2.5-llm/
.. _PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log

+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+============================+========================+============+=====================+=========================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and logs`_, `wandb`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
.. _PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
.. _Megatron PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log
.. _Qwen7b GRPO Script: https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
.. _Megatron wandb: https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3

+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and Logs`_, `wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | PPO | 69.5 [1]_ | `Megatron PPO Command and Logs`_, `Megatron wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO | 89 | `Qwen7b GRPO Script`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+


.. [1] During the evaluation, we only extracted answers following the "####" format. More flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.
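For readers who want to try the new deepseek-llm-7b-chat row, a minimal sketch (not the recorded run) using the example script updated in this commit; it assumes a local verl checkout, the GSM8k preprocessing above, and enough GPUs for ``tensor_model_parallel_size=4``. The linked command logs remain the authoritative reference.

.. code:: bash

    # Hedged sketch: roughly reproduce the deepseek-llm-7b-chat Megatron PPO row.
    # The script downloads the checkpoint itself via huggingface-cli (see the diff below).
    cd verl
    bash examples/ppo_trainer/run_deepseek_megatron.sh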
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -1,11 +1,11 @@
Welcome to veRL's documentation!
Welcome to verl's documentation!
================================================

.. _hf_arxiv: https://arxiv.org/pdf/2409.19256

veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
verl is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.

veRL is flexible and easy to use with:
verl is flexible and easy to use with:

- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows, allowing users to build RL dataflows in a few lines of code.

@@ -16,9 +16,9 @@ veRL is flexible and easy to use with:
- Ready integration with popular HuggingFace models


veRL is fast with:
verl is fast with:

- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.

- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.

@@ -92,7 +92,7 @@ veRL is fast with:
Contribution
-------------

veRL is free software; you can redistribute it and/or modify it under the terms
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.

6 changes: 3 additions & 3 deletions docs/perf/perf_tuning.rst
@@ -1,7 +1,7 @@
Performance Tuning Guide
=========================

In this section, we will discuss how to tune the performance of all the stages in veRL, including:
In this section, we will discuss how to tune the performance of all the stages in verl, including:

1. Rollout generation throughput.

@@ -16,7 +16,7 @@ In this section, we will discuss how to tune the performance of all the stages i
Rollout Generation Tuning
--------------------------

veRL currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).

Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting ``actor_rollout_ref.rollout.disable_log_stats=False`` so that rollout statistics are logged.
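As a hedged illustration of that recommendation (a sketch, not a complete command): the flag is just another Hydra override on the standard entry point, shown here next to ``gpu_memory_utilization``; the data/model overrides from the example scripts are omitted for brevity and are still required.

.. code:: bash

    # Sketch: surface vLLM rollout statistics while tuning rollout memory headroom.
    # The remaining data/model/batch-size overrides from the example scripts are still needed.
    python3 -m verl.trainer.main_ppo --config-path=config \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
        actor_rollout_ref.rollout.disable_log_stats=False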

@@ -45,7 +45,7 @@ Batch Size Tuning
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd),
users may need to tune the ``*micro_batch_size_per_gpu`` for different computation.

In veRL, the core principle for setting batch sizes is:
In verl, the core principle for setting batch sizes is:

- **Algorithmic metrics** (train batch size, PPO mini-batch size) are *global* (from a single-controller perspective),
normalized in each worker. See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L120-L122>`_.
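To make the global-versus-local distinction above concrete, a hedged sketch using only override names that appear elsewhere in this commit: the PPO mini-batch size is an algorithmic, global quantity that verl normalizes across workers, while the ``*micro_batch_size_per_gpu`` values are per-GPU memory/throughput knobs.

.. code:: bash

    # Sketch: global algorithmic batch size vs. per-GPU micro batch sizes
    # (values mirror examples/ppo_trainer/run_deepseek_megatron.sh; other required overrides omitted).
    python3 -m verl.trainer.main_ppo --config-path=config \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        critic.ppo_micro_batch_size_per_gpu=4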
12 changes: 6 additions & 6 deletions docs/start/install.rst
@@ -7,7 +7,7 @@ Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1

veRL supports various backends. Currently, the following configurations are available:
verl supports various backends. Currently, the following configurations are available:

- **FSDP** and **Megatron-LM** (optional) for training.
- **vLLM** and **TGI** for rollout generation, with **SGLang** support coming soon.
@@ -34,7 +34,7 @@ Image and tag: ``verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
2. Inside the container, install veRL:
2. Inside the container, install verl:

.. code:: bash
@@ -74,7 +74,7 @@ To manage environment, we recommend using conda:
conda create -n verl python==3.9
conda activate verl
For installing the latest version of veRL, the best way is to clone and
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.

@@ -85,7 +85,7 @@ own post-training jobs.
cd verl
pip3 install -e .
You can also install veRL using ``pip3 install``
You can also install verl using ``pip3 install``

.. code:: bash
@@ -95,9 +95,9 @@ You can also install veRL using ``pip3 install``
Dependencies
------------

veRL requires Python >= 3.9 and CUDA >= 12.1.
verl requires Python >= 3.9 and CUDA >= 12.1.

veRL support various backend, we currently release FSDP and Megatron-LM
verl supports various backends; we currently release FSDP and Megatron-LM
for actor training and vLLM for rollout generation.

The following dependencies are required for all backends, PyTorch FSDP and Megatron-LM.
17 changes: 13 additions & 4 deletions examples/ppo_trainer/run_deepseek_megatron.sh
@@ -1,5 +1,11 @@
set -x

# prepare pre-trained model ckpt
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat --local-dir $HOME/models/deepseek-llm-7b-chat

# ``actor_rollout_ref.rollout.tensor_model_parallel_size`` in theory could be different from
# ``**.megatron.tensor_model_parallel_size``

# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml

python3 -m verl.trainer.main_ppo --config-path=config \
@@ -10,19 +10,22 @@ python3 -m verl.trainer.main_ppo --config-path=config \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
actor_rollout_ref.model.path=$HOME/models/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
critic.model.path=$HOME/models/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
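Because the updated script now logs to wandb through ``trainer.logger=['console','wandb']``, a short hedged usage note: authenticate with wandb before launching. The environment-variable route below is one common option, not something this commit prescribes.

.. code:: bash

    # Sketch: make sure the wandb run is recorded before launching the example.
    export WANDB_API_KEY=<your-api-key>   # or run `wandb login` interactively
    bash examples/ppo_trainer/run_deepseek_megatron.sh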
4 changes: 2 additions & 2 deletions examples/ppo_trainer/verl_getting_started.ipynb
@@ -8,13 +8,13 @@
"source": [
"# Run Qwen PPO with [verl](https://github.com/volcengine/verl)\n",
"\n",
"This tutorial provides a step-by-step guide to using veRL for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"This tutorial provides a step-by-step guide to using verl for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"\n",
"This notebook is also published on the [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started) platform, which provides free GPU quota every month. Checkout the published notebook with pre-installed dependencies using a free L4 GPU [here](https://lightning.ai/hlin-verl/studios/verl-getting-started) (no credit card required).\n",
"\n",
"### You will learn:\n",
"\n",
"- How to install veRL from scratch.\n",
"- How to install verl from scratch.\n",
"- How to use existing scripts to run an RLHF pipeline with your own models and data."
]
},
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -18,7 +18,7 @@ name = "verl"
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version"]

description = "veRL: Volcano Engine Reinforcement Learning for LLM"
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.8"
2 changes: 1 addition & 1 deletion setup.py
@@ -43,7 +43,7 @@
license='Apache 2.0',
author='Bytedance - Seed - MLSys',
author_email='[email protected], [email protected]',
description='veRL: Volcano Engine Reinforcement Learning for LLM',
description='verl: Volcano Engine Reinforcement Learning for LLM',
install_requires=install_requires,
extras_require=extras_require,
package_data={'': ['version/*'],
2 changes: 1 addition & 1 deletion verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py
@@ -206,7 +206,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend()

# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup

num_tensor_model_parallel_groups: int = (world_size // tensor_model_parallel_size)
num_pipeline_model_parallel_groups: int = (world_size // pipeline_model_parallel_size)
2 changes: 1 addition & 1 deletion verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py
@@ -224,7 +224,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend(ps.get_world_group().device_group)

# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
# if (world_size !=
# tensor_model_parallel_size * pipeline_model_parallel_size):
# raise RuntimeError(
