I am seeing poor performance for single-node TL/CUDA allgather on GPUs connected through NVLink.
On H100
Setup: DGX with 8×H100, one node
osu benchmarks
osu_iallgather
osu benchmark osu_allgather
nccl-tests
ucc perftest
On V100
osu iallgather
reproducer
osu-benchmarks
nvFuser Overlap benchmark
nvFuser comm/compute overlap experiment and comparison with NCCL. In this experiment, we post a single allgather followed by a single matmul op. After warmup, averaging across multiple iterations shows that NCCL's latency is substantially lower than UCC's.
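For readers who want to see the measurement methodology (warmup, then averaging over timed iterations), here is a minimal sketch. It is not the reproducer from this issue and does not use nvFuser, NCCL, or UCC. The names `time_op` and `fake_allgather_then_matmul` are hypothetical, and the workload is a pure-Python stand-in for "one allgather followed by one matmul". On a GPU, the `sync` argument would need to be a real device synchronization call (for example `torch.cuda.synchronize`, assuming PyTorch) so that asynchronous work is included in the timing.

```python
import time
from statistics import mean


def time_op(op, warmup=5, iters=20, sync=lambda: None):
    """Return the mean wall-clock latency of op() in seconds.

    Warmup iterations are run first and are not included in the average.
    `sync` is called after each timed op so that asynchronous work has
    finished before the clock is stopped (a no-op by default).
    """
    for _ in range(warmup):
        op()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()
        sync()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def fake_allgather_then_matmul():
    """Hypothetical stand-in workload, NOT a real collective or GPU matmul."""
    # "allgather": concatenate equal-sized shards from 8 pretend ranks
    shards = [[float(rank)] * 4 for rank in range(8)]
    gathered = [x for shard in shards for x in shard]
    # "matmul": a single dot product over the gathered buffer
    return sum(x * x for x in gathered)


if __name__ == "__main__":
    latency = time_op(fake_allgather_then_matmul, warmup=5, iters=20)
    print(f"mean latency: {latency * 1e6:.2f} us")
```

Running the same harness once per backend, with identical warmup and iteration counts, keeps the two latency numbers comparable.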
reproducer: