I am trying out the fastsocket NCCL plugin on GCP (specifically, a GCE SLURM cluster built from 2x(8xA100) nodes with gVNICs). I see warnings in the logs, specifically "NCCL WARN Cannot get incoming CPU." and "NCCL WARN Maximum retry reached for accept 3." Do they mean something specific, or can they be safely ignored?
The code runs despite the warnings, although performance with and without the plugin looks very similar.
full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.
full-debug2-test-0:4300:4325 [0] net_fastsocket.cc:785 NCCL WARN Maximum retry reached for accept 3.
full-debug2-test-1:4024:4055 [0] net_fastsocket.cc:674 NCCL WARN Maximum retry reached for connect 3.
full-debug2-test-0:4300:4325 [0] NCCL INFO accept qid: 3, rqid: 3
full-debug2-test-0:4300:4325 [0] NCCL INFO accept incoming cpu: 0
full-debug2-test-0:4300:4325 [0] NCCL INFO NET/FastSocket : Connected after 1000 retries.
full-debug2-test-0:4300:4325 [0] NCCL INFO NET/FastSocket : Accepted data socket 3
full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU.
full-debug2-test-1:4024:4055 [0] NCCL INFO connect incoming cpu: 0
full-debug2-test-1:4024:4055 [0] NCCL INFO connect qid: 3, rqid: 3
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Connected after 1000 retries.
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Connected data socket 3
full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Async connect done
full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU.
full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.
full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU
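In case it helps with triage, here is a minimal sketch for tallying how often each warning appears per host and source location in output like the log above. The line layout (host:pid:tid, a bracketed index, then file:line before "NCCL WARN") is inferred only from the lines pasted here, and the names `SAMPLE_LOG`, `WARN_RE`, and `tally_warnings` are made up for this illustration; none of this is part of NCCL or the plugin.

```python
import re
from collections import Counter

# A few lines copied verbatim from the log above, used as sample input.
SAMPLE_LOG = """\
full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.
full-debug2-test-0:4300:4325 [0] net_fastsocket.cc:785 NCCL WARN Maximum retry reached for accept 3.
full-debug2-test-1:4024:4055 [0] net_fastsocket.cc:674 NCCL WARN Maximum retry reached for connect 3.
full-debug2-test-0:4300:4325 [0] NCCL INFO accept incoming cpu: 0
full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU.
"""

# Assumed layout (inferred from the pasted lines only):
#   host:pid:tid [index] file:line NCCL WARN message
# A trailing period on the message is optional and is dropped.
WARN_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[\d+\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*?)\.?$"
)


def tally_warnings(log_text):
    """Count WARN lines, keyed by (host, source location, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:  # INFO lines and anything else are skipped
            counts[(match["host"], match["src"], match["msg"])] += 1
    return counts


if __name__ == "__main__":
    for (host, src, msg), n in sorted(tally_warnings(SAMPLE_LOG).items()):
        print(f"{n:4d}  {host}  {src}  {msg}")
```

Running it over the full job log instead of `SAMPLE_LOG` would show whether the warnings come from a handful of connections at setup time or recur throughout the run.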