Question about algorithms used for all-gathers and reduce-scatters #1594

siddharth9820 · 2025-02-05T17:16:53Z

Lines 76 to 83 in 80f6bda

    
    algos_of_coll = {
        "AllGather":     ["RING", "COLLNET_DIRECT", "NVLS", "PAT"],
        "AllReduce":     ["TREE", "RING", "COLLNET_DIRECT", "COLLNET_CHAIN", "NVLS", "NVLS_TREE"],
        "Broadcast":     ["RING"],
        "Reduce":        ["RING"],
        "ReduceScatter": ["RING", "COLLNET_DIRECT", "NVLS", "PAT"],
        "SendRecv":      [None]
    }

Based on these lines of code, I can see that all-gathers and reduce-scatters can employ RING, COLLNET_DIRECT, NVLS, and PAT. I am working on the Perlmutter supercomputer (https://docs.nersc.gov/systems/perlmutter/architecture/), which has HPE's Slingshot interconnect. Would that mean that, out of these four options, only RING is applicable for all-gathers/reduce-scatters, since the other three depend on having InfiniBand and NVSwitches?

kiskra-nvidia · 2025-02-05T18:10:06Z

CollNet and NVLS are out for sure. I think PAT could be an option though?

You can verify which algorithm NCCL chooses for every collective by running with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL,TUNING -- rank 0 will print the algo/proto combination.
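For a job launched from Python, the two variables above can be set on the child process environment. This is only a minimal sketch: the helper name `nccl_debug_env` is made up for illustration, and the launch command in the closing comment is a placeholder, not something from this thread.

```python
import os


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with the NCCL debug variables set.

    `nccl_debug_env` is a hypothetical helper name; only the two variable
    names and their values come from the comment above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "COLL,TUNING"
    return env


env = nccl_debug_env({"PATH": "/usr/bin"})
print(env["NCCL_DEBUG"], env["NCCL_DEBUG_SUBSYS"])

# The dict can then be passed to a launcher, for example
# subprocess.run([...your launch command...], env=nccl_debug_env()).
```

Copying the environment instead of mutating `os.environ` keeps the debug output limited to the one run being inspected.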

siddharth9820 · 2025-02-05T19:10:30Z

Thank you so much! Follow-up question: what does PAT do, and how should I model its latency and message transmission times? For example, I know that the latency of the ring algorithm is O(G) and the transmission time is O((G-1)/G * N), where G is the number of GPUs and N is the size of the output buffer (for all-gather).
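The ring estimate above can be written out with the usual alpha-beta cost model. This is a rough sketch, not NCCL's own model: `alpha` (per-step latency) and `beta` (time per byte) are assumed parameters that would have to be measured on the target system, and the function name is made up for illustration.

```python
def ring_allgather_time(num_gpus, output_bytes, alpha, beta):
    """Estimate ring all-gather time with an alpha-beta model.

    alpha: assumed per-step latency (seconds)
    beta:  assumed transfer time per byte (seconds/byte)
    """
    steps = num_gpus - 1                      # O(G) latency term
    fraction_moved = steps / num_gpus         # (G-1)/G of the output buffer
    return steps * alpha + fraction_moved * output_bytes * beta


# Example with made-up constants: 4 GPUs, 1024-byte output buffer.
print(ring_allgather_time(4, 1024, alpha=2.0, beta=0.5))
```

The first term grows linearly with G and dominates for small messages, while the second term approaches N * beta as G grows and dominates for large messages.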

bhatele · 2025-02-05T20:35:06Z

Also, if there are two available options, how does NCCL choose which one to actually use?

kiskra-nvidia · 2025-02-05T23:09:13Z

PAT is a binomial tree algorithm. There's a bit of info here: https://developer.nvidia.com/blog/new-scaling-algorithm-and-initialization-with-nvidia-collective-communications-library-2-23/#pat_logarithmic_scaling_for_reducescatter_and_allgather%C2%A0

NCCL has an internal performance model for each algorithm and for each collective operation picks the algorithm/protocol combo that it expects to perform best under the circumstances.
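To make the idea of picking by predicted time concrete, here is a toy comparison. It is not NCCL's internal model, which this thread does not show. It assumes PAT can be approximated by a textbook logarithmic-step estimate (ceil(log2 G) latency steps), and all constants and function names are hypothetical.

```python
def ring_time(g, n, alpha, beta):
    """Ring estimate: G-1 latency steps, (G-1)/G of the buffer moved."""
    return (g - 1) * alpha + (g - 1) / g * n * beta


def log_step_time(g, n, alpha, beta):
    """Assumed stand-in for a binomial-tree style algorithm such as PAT:
    ceil(log2 G) latency steps, same total data volume."""
    steps = (g - 1).bit_length()  # integer ceil(log2(g)) for g >= 1
    return steps * alpha + (g - 1) / g * n * beta


def pick_algorithm(g, n, params):
    """Return the name with the smallest estimated time.

    params maps a name to (model_function, alpha, beta); every value in it
    is an assumption supplied by the caller.
    """
    estimates = {name: fn(g, n, a, b) for name, (fn, a, b) in params.items()}
    return min(estimates, key=estimates.get)


# Made-up constants: the log-step model is given a worse per-byte cost.
params = {
    "RING": (ring_time, 1.0, 0.001),
    "PAT": (log_step_time, 1.0, 0.002),
}
print(pick_algorithm(64, 1024, params))
print(pick_algorithm(64, 10**9, params))
```

With these particular constants the logarithmic-step model is preferred for the small message and the ring for the large one, which illustrates why a per-call choice based on message size and GPU count can matter.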
