This repository has been archived by the owner on May 17, 2022. It is now read-only.

[DO NOT MERGE] Use different GPU for different tensor #39

Closed
wants to merge 1 commit into from

Conversation


@zasdfgbnm zasdfgbnm commented Jun 10, 2021

For discussion only. After this change:

NCCL

$ bash start_test.sh torch_alltoall_bench.py --use-cuda --backend nccl
World size 2
size       min, us    avg, us    max, us   
32         29.690     31.357     33.023    
64         29.924     31.516     33.109    
128        29.512     31.063     32.615    
256        29.479     31.114     32.749    
512        29.282     30.826     32.370    
1024       29.578     31.410     33.243    
2048       28.834     30.597     32.360    
4096       28.702     30.416     32.130    
8192       29.347     30.869     32.391    
16384      31.537     33.187     34.838    
32768      34.635     35.715     36.795    

UCC

$ bash start_test.sh torch_alltoall_bench.py --use-cuda --backend ucc
World size 2
size       min, us    avg, us    max, us   
[1623346740.960905] [sunnyvale:56372:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_HOME (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1623346740.984496] [sunnyvale:56373:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_HOME (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1623346743.056999] [sunnyvale:56373:0] mc_cuda_wait_kernel.cu:44   cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1623346743.057017] [sunnyvale:56373:0]     tl_ucp_coll.c:129  TL_UCP ERROR error in ee task post
[E torch_ucc.cpp:417] failed to post triggered collective: Unhandled error
Traceback (most recent call last):
  File "/home/gaoxiang/torch-ucc/test/torch_alltoall_bench.py", line 80, in <module>
    req = dist.all_to_all_single(recv_tensor, send_tensor, async_op=True)
  File "/home/gaoxiang/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2469, in all_to_all_single
    work = default_pg.alltoall_base(
RuntimeError: Unhandled error
32         9.424      4.712      9.424     
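
For context, the failing call is the all_to_all_single shown in the traceback above, issued on tensors that live on a non-default CUDA device. Below is a minimal sketch of that pattern; it is not the actual torch_alltoall_bench.py, and the per-rank device placement and tensor size are assumptions:

# Sketch of the failing pattern, NOT the actual torch_alltoall_bench.py.
# Assumptions: 2 ranks, tensors placed on a per-rank GPU, UCC backend,
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set by the launcher (start_test.sh).
import torch
import torch.distributed as dist

dist.init_process_group(backend="ucc")
rank = dist.get_rank()

# Tensors on a device other than the current one (cuda:0 by default);
# this is what exposes the UCC initialization problem discussed below.
device = torch.device("cuda", rank)
send_tensor = torch.ones(32, device=device)
recv_tensor = torch.empty_like(send_tensor)

# Same call as in the traceback above.
req = dist.all_to_all_single(recv_tensor, send_tensor, async_op=True)
req.wait()
torch.cuda.synchronize(device)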


bureddy commented Jun 10, 2021

Can you add your start_test.sh?

@Sergei-Lebedev
Collaborator

I guess the error you see is happening because we assume the correct device is set before the first call to ucc_init(). So when you create a tensor on device 1, UCC will be initialized on device 0. The following should help in this case:

 void ProcessGroupUCC::initComm(c10::Device dev) {
   if (!comm) {
+    c10::cuda::set_device(dev.index());
     comm = CommPG::get_comm(comm_id, dev, &oob);
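
While the library-side change above is the proper fix, the same reasoning suggests a user-side workaround: if UCC binds to whatever CUDA device is current at its first initialization, making the tensor's device current before the first collective avoids the mismatch. A hedged sketch, assuming the explanation above (the per-rank device choice is also an assumption):

# Workaround sketch from the test side (assumption: UCC binds to the CUDA
# device that is current at its first initialization, as described above).
import torch
import torch.distributed as dist

dist.init_process_group(backend="ucc")
rank = dist.get_rank()

device = torch.device("cuda", rank)   # assumption: one GPU per rank
torch.cuda.set_device(device)         # make it current *before* the first UCC collective

send_tensor = torch.ones(32, device=device)
recv_tensor = torch.empty_like(send_tensor)
dist.all_to_all_single(recv_tensor, send_tensor)

This mirrors what the suggested c10::cuda::set_device(dev.index()) call does inside ProcessGroupUCC::initComm, just moved to the caller.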

@zasdfgbnm
Author

@Sergei-Lebedev Thanks for the suggestion! It's working. The final patch I used was #40.

Also, re @bureddy: I didn't change anything in start_test.sh.

@zasdfgbnm zasdfgbnm closed this Jun 10, 2021
@zasdfgbnm zasdfgbnm deleted the more-gpu branch June 10, 2021 22:57
facebook-github-bot pushed a commit to facebookresearch/torch_ucc that referenced this pull request Jun 16, 2021
Summary:
See discussion in openucx/torch-ucc#39

Pull Request resolved: #6

Reviewed By: kingchc

Differential Revision: D29083656

Pulled By: srinivas212

fbshipit-source-id: 7bc69704cd1e7b7a26982e03f243eeb9fa097b7b