This repository has been archived by the owner on May 17, 2022. It is now read-only.

[DO NOT MERGE] Use different GPU for different tensor #39

Closed
wants to merge 1 commit into from

Conversation


@zasdfgbnm zasdfgbnm commented Jun 10, 2021

For discussion only. After this change:

NCCL

$ bash start_test.sh torch_alltoall_bench.py --use-cuda --backend nccl
World size 2
size       min, us    avg, us    max, us   
32         29.690     31.357     33.023    
64         29.924     31.516     33.109    
128        29.512     31.063     32.615    
256        29.479     31.114     32.749    
512        29.282     30.826     32.370    
1024       29.578     31.410     33.243    
2048       28.834     30.597     32.360    
4096       28.702     30.416     32.130    
8192       29.347     30.869     32.391    
16384      31.537     33.187     34.838    
32768      34.635     35.715     36.795    

UCC

$ bash start_test.sh torch_alltoall_bench.py --use-cuda --backend ucc
World size 2
size       min, us    avg, us    max, us   
[1623346740.960905] [sunnyvale:56372:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_HOME (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1623346740.984496] [sunnyvale:56373:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_HOME (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1623346743.056999] [sunnyvale:56373:0] mc_cuda_wait_kernel.cu:44   cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1623346743.057017] [sunnyvale:56373:0]     tl_ucp_coll.c:129  TL_UCP ERROR error in ee task post
[E torch_ucc.cpp:417] failed to post triggered collective: Unhandled error
Traceback (most recent call last):
  File "/home/gaoxiang/torch-ucc/test/torch_alltoall_bench.py", line 80, in <module>
    req = dist.all_to_all_single(recv_tensor, send_tensor, async_op=True)
  File "/home/gaoxiang/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2469, in all_to_all_single
    work = default_pg.alltoall_base(
RuntimeError: Unhandled error
32         9.424      4.712      9.424     
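
For context, the failing call is the all_to_all_single shown in the traceback above, issued on tensors that live on a non-default CUDA device. Below is a minimal sketch of that pattern; it is not the actual torch_alltoall_bench.py, and the per-rank device placement and tensor size are assumptions:

# Sketch of the failing pattern, NOT the actual torch_alltoall_bench.py.
# Assumptions: 2 ranks, tensors placed on a per-rank GPU, UCC backend,
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set by the launcher (start_test.sh).
import torch
import torch.distributed as dist

dist.init_process_group(backend="ucc")
rank = dist.get_rank()

# Tensors on a device other than the current one (cuda:0 by default);
# this is what exposes the UCC initialization problem discussed below.
device = torch.device("cuda", rank)
send_tensor = torch.ones(32, device=device)
recv_tensor = torch.empty_like(send_tensor)

# Same call as in the traceback above.
req = dist.all_to_all_single(recv_tensor, send_tensor, async_op=True)
req.wait()
torch.cuda.synchronize(device)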


bureddy commented Jun 10, 2021

Can you add your start_test.sh?

@Sergei-Lebedev
Collaborator

I guess the error you see is happening because we assume the correct device is set before the first call to ucc_init(). So when you create a tensor on device 1, UCC will be initialized on device 0. The following should help in this case:

 void ProcessGroupUCC::initComm(c10::Device dev) {
   if (!comm) {
+    c10::cuda::set_device(dev.index());
     comm = CommPG::get_comm(comm_id, dev, &oob);
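
While the library-side change above is the proper fix, the same reasoning suggests a user-side workaround: if UCC binds to whatever CUDA device is current at its first initialization, making the tensor's device current before the first collective avoids the mismatch. A hedged sketch, assuming the explanation above (the per-rank device choice is also an assumption):

# Workaround sketch from the test side (assumption: UCC binds to the CUDA
# device that is current at its first initialization, as described above).
import torch
import torch.distributed as dist

dist.init_process_group(backend="ucc")
rank = dist.get_rank()

device = torch.device("cuda", rank)   # assumption: one GPU per rank
torch.cuda.set_device(device)         # make it current *before* the first UCC collective

send_tensor = torch.ones(32, device=device)
recv_tensor = torch.empty_like(send_tensor)
dist.all_to_all_single(recv_tensor, send_tensor)

This mirrors what the suggested c10::cuda::set_device(dev.index()) call does inside ProcessGroupUCC::initComm, just moved to the caller.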

@zasdfgbnm
Author

@Sergei-Lebedev Thanks for the suggestion! It's working. The final patch I used was #40.

Also, re @bureddy: I didn't change anything in start_test.sh.

@zasdfgbnm zasdfgbnm closed this Jun 10, 2021
@zasdfgbnm zasdfgbnm deleted the more-gpu branch June 10, 2021 22:57
facebook-github-bot pushed a commit to facebookresearch/torch_ucc that referenced this pull request Jun 16, 2021
Summary:
See discussion in openucx/torch-ucc#39

Pull Request resolved: #6

Reviewed By: kingchc

Differential Revision: D29083656

Pulled By: srinivas212

fbshipit-source-id: 7bc69704cd1e7b7a26982e03f243eeb9fa097b7b