
Commit

example: switch the default model ckpt for Megatron, add wandb logs (#210)

Use the general-purpose LLM for the math task instead of the code LLM.

---------

Co-authored-by: Your Name <[email protected]>
eric-haibin-lin and Your Name authored Feb 6, 2025
1 parent 22d56a8 commit ced8ecb
Showing 15 changed files with 68 additions and 49 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -69,7 +69,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
- [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)

**Reproducible algorithm baselines:**
- [PPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
- [PPO and GRPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)

**For code explanation and advance usage (extension):**
- PPO Trainer and Workers
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,4 +1,4 @@
# veRL documents
# verl documents

## Build the docs

6 changes: 3 additions & 3 deletions docs/advance/dpo_extension.rst
@@ -3,7 +3,7 @@ Extend to other RL(HF) algorithms

We already implemented the complete training pipeline of the PPO
algorithms. To extend to other algorithms, we analyze the high-level
principle to use veRL and provide a tutorial to implement the DPO
principle to use verl and provide a tutorial to implement the DPO
algorithm. Users can follow the similar paradigm to extend to other RL algorithms.

.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
@@ -26,7 +26,7 @@ Step 3: Utilize the encapsulated APIs to implement the control flow
Example: Online DPO
-------------------

We use veRL to implement a simple online DPO algorithm. The algorithm
We use verl to implement a simple online DPO algorithm. The algorithm
flow of Online DPO is as follows:

1. There is a prompt (rollout) generator which has the same weight as
@@ -178,7 +178,7 @@ steps:
and merge them.

Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
code readability. **In verl, we have abstracted and encapsulated these 3
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -31,7 +31,7 @@

# -- Project information -----------------------------------------------------

project = u'veRL'
project = u'verl'
# pylint: disable=W0622
copyright = u'2024 ByteDance Seed Foundation MLSys Team'
author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
2 changes: 1 addition & 1 deletion docs/examples/ppo_code_architecture.rst
@@ -200,7 +200,7 @@ Define, init and run the PPO Trainer
on the allocated GPUs (in the resource pool)
- The actual PPO training will be executed in ``trainer.fit()``

veRL can be easily extended to other RL algorithms by reusing the Ray
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.

44 changes: 27 additions & 17 deletions docs/experiment/ppo.rst
@@ -11,22 +11,32 @@ Assuming GSM8k dataset is preprocessed via ``python3 examples/data_preprocess/gsm8
Refer to the table below to reproduce PPO training from different pre-trained models.

.. _Huggingface: https://huggingface.co/google/gemma-2-2b-it#benchmark-results
.. _SFT Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _SFT Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _wandb: https://api.wandb.ai/links/verl-team/h7ux8602
.. _Qwen Blog: https://qwenlm.github.io/blog/qwen2.5-llm/
.. _PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log

+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+============================+========================+============+=====================+=========================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and logs`_, `wandb`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
.. _PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
.. _Megatron PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log
.. _Qwen7b GRPO Script: https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
.. _Megatron wandb: https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3

+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and Logs`_, `wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | PPO | 69.5 [1]_ | `Megatron PPO Command and Logs`_, `Megatron wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO | 89 | `Qwen7b GRPO Script`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+


.. [1] During the evaluation, we only extracted answers following the "####" format. More flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.
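For readers who want to try the new deepseek-llm-7b-chat row, a minimal sketch (not the recorded run) using the example script updated in this commit; it assumes a local verl checkout, the GSM8k preprocessing above, and enough GPUs for ``tensor_model_parallel_size=4``. The linked command logs remain the authoritative reference.

.. code:: bash

    # Hedged sketch: roughly reproduce the deepseek-llm-7b-chat Megatron PPO row.
    # The script downloads the checkpoint itself via huggingface-cli (see the diff below).
    cd verl
    bash examples/ppo_trainer/run_deepseek_megatron.sh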
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -1,11 +1,11 @@
Welcome to veRL's documentation!
Welcome to verl's documentation!
================================================

.. _hf_arxiv: https://arxiv.org/pdf/2409.19256

veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
verl is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.

veRL is flexible and easy to use with:
verl is flexible and easy to use with:

- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows, allowing users to build RL dataflows in a few lines of code.

@@ -16,9 +16,9 @@ veRL is flexible and easy to use with:
- Ready integration with popular HuggingFace models


veRL is fast with:
verl is fast with:

- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.

- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.

@@ -92,7 +92,7 @@ veRL is fast with:
Contribution
-------------

veRL is free software; you can redistribute it and/or modify it under the terms
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.

6 changes: 3 additions & 3 deletions docs/perf/perf_tuning.rst
@@ -1,7 +1,7 @@
Performance Tuning Guide
=========================

In this section, we will discuss how to tune the performance of all the stages in veRL, including:
In this section, we will discuss how to tune the performance of all the stages in verl, including:

1. Rollout generation throughput.

@@ -16,7 +16,7 @@ In this section, we will discuss how to tune the performance of all the stages i
Rollout Generation Tuning
--------------------------

veRL currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).

Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting ``actor_rollout_ref.rollout.disable_log_stats=False`` so that rollout statistics are logged.
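As a hedged illustration of that recommendation (a sketch, not a complete command): the flag is just another Hydra override on the standard entry point, shown here next to ``gpu_memory_utilization``; the data/model overrides from the example scripts are omitted for brevity and are still required.

.. code:: bash

    # Sketch: surface vLLM rollout statistics while tuning rollout memory headroom.
    # The remaining data/model/batch-size overrides from the example scripts are still needed.
    python3 -m verl.trainer.main_ppo --config-path=config \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
        actor_rollout_ref.rollout.disable_log_stats=False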

@@ -45,7 +45,7 @@ Batch Size Tuning
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd),
users may need to tune the ``*micro_batch_size_per_gpu`` for different computation.

In veRL, the core principle for setting batch sizes is:
In verl, the core principle for setting batch sizes is:

- **Algorithmic metrics** (train batch size, PPO mini-batch size) are *global* (from a single-controller perspective),
normalized in each worker. See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L120-L122>`_.
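To make the global-versus-local distinction above concrete, a hedged sketch using only override names that appear elsewhere in this commit: the PPO mini-batch size is an algorithmic, global quantity that verl normalizes across workers, while the ``*micro_batch_size_per_gpu`` values are per-GPU memory/throughput knobs.

.. code:: bash

    # Sketch: global algorithmic batch size vs. per-GPU micro batch sizes
    # (values mirror examples/ppo_trainer/run_deepseek_megatron.sh; other required overrides omitted).
    python3 -m verl.trainer.main_ppo --config-path=config \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        critic.ppo_micro_batch_size_per_gpu=4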
12 changes: 6 additions & 6 deletions docs/start/install.rst
@@ -7,7 +7,7 @@ Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1

veRL supports various backends. Currently, the following configurations are available:
verl supports various backends. Currently, the following configurations are available:

- **FSDP** and **Megatron-LM** (optional) for training.
- **vLLM** and **TGI** for rollout generation, with **SGLang** support coming soon.
@@ -34,7 +34,7 @@ Image and tag: ``verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
2. Inside the container, install veRL:
2. Inside the container, install verl:

.. code:: bash
@@ -74,7 +74,7 @@ To manage environment, we recommend using conda:
conda create -n verl python==3.9
conda activate verl
For installing the latest version of veRL, the best way is to clone and
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.

@@ -85,7 +85,7 @@ own post-training jobs.
cd verl
pip3 install -e .
You can also install veRL using ``pip3 install``
You can also install verl using ``pip3 install``

.. code:: bash
@@ -95,9 +95,9 @@ You can also install veRL using ``pip3 install``
Dependencies
------------

veRL requires Python >= 3.9 and CUDA >= 12.1.
verl requires Python >= 3.9 and CUDA >= 12.1.

veRL support various backend, we currently release FSDP and Megatron-LM
verl supports various backends; we currently release FSDP and Megatron-LM
for actor training and vLLM for rollout generation.

The following dependencies are required for all backends, PyTorch FSDP and Megatron-LM.
17 changes: 13 additions & 4 deletions examples/ppo_trainer/run_deepseek_megatron.sh
@@ -1,5 +1,11 @@
set -x

# prepare pre-trained model ckpt
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat --local-dir $HOME/models/deepseek-llm-7b-chat

# ``actor_rollout_ref.rollout.tensor_model_parallel_size`` in theory could be different from
# ``**.megatron.tensor_model_parallel_size``

# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml

python3 -m verl.trainer.main_ppo --config-path=config \
@@ -10,19 +10,22 @@ python3 -m verl.trainer.main_ppo --config-path=config \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
actor_rollout_ref.model.path=$HOME/models/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
critic.model.path=$HOME/models/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
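Because the updated script now logs to wandb through ``trainer.logger=['console','wandb']``, a short hedged usage note: authenticate with wandb before launching. The environment-variable route below is one common option, not something this commit prescribes.

.. code:: bash

    # Sketch: make sure the wandb run is recorded before launching the example.
    export WANDB_API_KEY=<your-api-key>   # or run `wandb login` interactively
    bash examples/ppo_trainer/run_deepseek_megatron.sh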
4 changes: 2 additions & 2 deletions examples/ppo_trainer/verl_getting_started.ipynb
@@ -8,13 +8,13 @@
"source": [
"# Run Qwen PPO with [verl](https://github.com/volcengine/verl)\n",
"\n",
"This tutorial provides a step-by-step guide to using veRL for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"This tutorial provides a step-by-step guide to using verl for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"\n",
"This notebook is also published on the [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started) platform, which provides free GPU quota every month. Checkout the published notebook with pre-installed dependencies using a free L4 GPU [here](https://lightning.ai/hlin-verl/studios/verl-getting-started) (no credit card required).\n",
"\n",
"### You will learn:\n",
"\n",
"- How to install veRL from scratch.\n",
"- How to install verl from scratch.\n",
"- How to use existing scripts to run an RLHF pipeline with your own models and data."
]
},
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -18,7 +18,7 @@ name = "verl"
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version"]

description = "veRL: Volcano Engine Reinforcement Learning for LLM"
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.8"
2 changes: 1 addition & 1 deletion setup.py
@@ -43,7 +43,7 @@
license='Apache 2.0',
author='Bytedance - Seed - MLSys',
author_email='[email protected], [email protected]',
description='veRL: Volcano Engine Reinforcement Learning for LLM',
description='verl: Volcano Engine Reinforcement Learning for LLM',
install_requires=install_requires,
extras_require=extras_require,
package_data={'': ['version/*'],
2 changes: 1 addition & 1 deletion verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py
@@ -206,7 +206,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend()

# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup

num_tensor_model_parallel_groups: int = (world_size // tensor_model_parallel_size)
num_pipeline_model_parallel_groups: int = (world_size // pipeline_model_parallel_size)
2 changes: 1 addition & 1 deletion verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py
@@ -224,7 +224,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend(ps.get_world_group().device_group)

# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
# if (world_size !=
# tensor_model_parallel_size * pipeline_model_parallel_size):
# raise RuntimeError(
