diff --git a/docs/source/features/parallelisms.rst b/docs/source/features/parallelisms.rst
index 12ed55d338f7..be3ff4d7b8dc 100644
--- a/docs/source/features/parallelisms.rst
+++ b/docs/source/features/parallelisms.rst
@@ -262,7 +262,9 @@ Unlike other model-parallel techniques, EP is applied to only the expert layers
 
 Enable Expert Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~~~
-To enable EP, set ``model.expert_model_parallel_size`` to the expert parallel size you want. For example, if the model has eight experts (``num_moe_experts=8``), then setting ``expert_model_parallel_size=4`` results in each GPU processing two experts. The number of experts should be divisible by the expert parallel size.
+To enable EP, set ``expert_model_parallel_size`` in the ``MegatronStrategy`` arguments to the expert parallel size you want.
+For example, if the model has eight experts (``num_moe_experts=8``), then setting ``expert_model_parallel_size=4`` results in each GPU processing two experts.
+The number of experts should be divisible by the expert parallel size.
 
 .. code-block:: python
 
@@ -284,12 +286,35 @@ Set expert parallelism directly from CLI:
 
 For further information on configuration, refer to the following documentation: `NeMo Megatron GPT Config `__.
 
-Implement Expert Parallelism
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Enable Expert Tensor Parallelism
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To enable ETP, set ``expert_tensor_parallel_size`` in the ``MegatronStrategy`` arguments to the expert tensor parallel size you want. For example:
+
+.. code-block:: python
+
+    from nemo.collections import llm
+    from functools import partial
+
+    # Load train recipe
+    recipe = partial(llm.mixtral_8x7b.pretrain_recipe)()
+
+    # Set expert tensor parallel size
+    recipe.trainer.strategy.expert_tensor_parallel_size = 4
+
+Set expert tensor parallelism directly from CLI:
+
+.. code-block:: bash
+
+    nemo llm pretrain --factory mixtral_8x7b trainer.strategy.expert_tensor_parallel_size=4
+
+
+Expert Parallelism Implementation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The NeMo Framework implementation of EP uses functionality from Megatron Core. Please consult the `Megatron Core MoE layer `_ for more MoE implementation details.
 
+
 Activation Partitioning
 -----------------------
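
The EP and ETP settings introduced by this patch can also be combined on the same ``MegatronStrategy``. The snippet below is a minimal illustrative sketch, not part of the patch: it assumes the same ``mixtral_8x7b`` recipe used in the examples above, and the chosen sizes (EP=4, ETP=2) are example values to adjust for your GPU count.

.. code-block:: python

    from functools import partial

    from nemo.collections import llm

    # Load the Mixtral 8x7B pretraining recipe, as in the examples above
    recipe = partial(llm.mixtral_8x7b.pretrain_recipe)()

    # Shard the experts: with num_moe_experts=8, each of the 4 EP ranks holds 2 experts
    recipe.trainer.strategy.expert_model_parallel_size = 4

    # Additionally split each expert's weights across 2 ranks (expert tensor parallelism)
    recipe.trainer.strategy.expert_tensor_parallel_size = 2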