From 1573a8f2e822405c53acab3ff4521eba9ced1e83 Mon Sep 17 00:00:00 2001
From: Keita Watanabe
Date: Thu, 23 Jan 2025 01:25:59 +0000
Subject: [PATCH 1/3] add tips to force NCCL comm to go through EFA

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
---
 micro-benchmarks/nccl-tests/README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/micro-benchmarks/nccl-tests/README.md b/micro-benchmarks/nccl-tests/README.md
index dd217e2c..1a549627 100644
--- a/micro-benchmarks/nccl-tests/README.md
+++ b/micro-benchmarks/nccl-tests/README.md
@@ -316,3 +316,20 @@ The formula defines the maximum theoretical bandwidth that can be achieved on di
 * `t` : time to complete the operation. (similar to sec for Algbw and Busbw)
 * `S` : number of elements being communicated (similar to count for Algbw and Busbw)
 * `B` : theoretical peak bandwidth.
+
+## 4. Tips and Tricks
+
+This section demonstrates NCCL tests tips and tricks useful to diagnose cluster nodes.
+
+#### Test EFA
+
+You can force inter-GPU communications to go through EFA with the following environment variables:
+
+
+```bash
+# NCCL Environment force disable P2P through NVlink, PCI and SHM.
+export NCCL_P2P_DISABLE=1
+export NCCL_SHM_DISABLE=1
+export NCCL_NVLS_ENABLE=0
+export NCCL_NET='AWS Libfabric'
+```

From ebb0711bc4c5ed8dbf0b20c045e1924524ed17c8 Mon Sep 17 00:00:00 2001
From: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Date: Thu, 23 Jan 2025 09:14:34 -0600
Subject: [PATCH 2/3] Update micro-benchmarks/nccl-tests/README.md

---
 micro-benchmarks/nccl-tests/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/micro-benchmarks/nccl-tests/README.md b/micro-benchmarks/nccl-tests/README.md
index 1a549627..39add422 100644
--- a/micro-benchmarks/nccl-tests/README.md
+++ b/micro-benchmarks/nccl-tests/README.md
@@ -327,6 +327,11 @@ You can force inter-GPU communications to go through EFA with the following envi
 
 
 ```bash
+#Single node EFA
+# libfabric flags
+export FI_PROVIDER=efa
+export FI_EFA_USE_DEVICE_RDMA=1
+
 # NCCL Environment force disable P2P through NVlink, PCI and SHM.
 export NCCL_P2P_DISABLE=1
 export NCCL_SHM_DISABLE=1

From 041677cdfe26971d190901275ea45f7e09a67557 Mon Sep 17 00:00:00 2001
From: Keita Watanabe
Date: Fri, 24 Jan 2025 07:18:15 +0900
Subject: [PATCH 3/3] Update micro-benchmarks/nccl-tests/README.md

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
---
 micro-benchmarks/nccl-tests/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/micro-benchmarks/nccl-tests/README.md b/micro-benchmarks/nccl-tests/README.md
index 39add422..a4cc85e4 100644
--- a/micro-benchmarks/nccl-tests/README.md
+++ b/micro-benchmarks/nccl-tests/README.md
@@ -323,7 +323,7 @@ This section demonstrates NCCL tests tips and tricks useful to diagnose cluster
 
 #### Test EFA
 
-You can force inter-GPU communications to go through EFA with the following environment variables:
+You can force inter-GPU communications to go through EFA network interfaces instead of shared memory of NVLink with the following environment variables:
 
 
 ```bash