Problem
A node in a distributed task may need to know the name of its private network interface, because distributed training frameworks sometimes fail to discover the correct interface automatically (currently the case when running PyTorch with NCCL on Vultr).
There is currently no easy way for a node to find out its private network interface name.
Solution
Provide a new system environment variable containing the node's private network interface name(s).
This can probably be implemented by comparing the node's private IP address with the list of its network interfaces or by checking the route to other nodes in the cluster.
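As a rough illustration of the first approach (matching known private IPs against the node's interfaces), here is a minimal Python sketch. It is not dstack's implementation. It assumes a Linux node with an iproute2 version that supports JSON output (`ip -j addr show`), and it assumes `DSTACK_NODES_IPS` holds whitespace- or newline-separated addresses. The function names are hypothetical.

```python
import json
import os
import subprocess


def interfaces_matching(ip_addr_json: str, candidate_ips: set[str]) -> list[str]:
    """Return names of interfaces that carry one of `candidate_ips`.

    `ip_addr_json` is the JSON text printed by `ip -j addr show`.
    """
    names = []
    for iface in json.loads(ip_addr_json):
        for addr in iface.get("addr_info", []):
            if addr.get("local") in candidate_ips:
                names.append(iface["ifname"])
                break  # one match per interface is enough
    return names


def private_interfaces_from_env() -> list[str]:
    """Hypothetical helper: match local interfaces against the cluster's node IPs.

    Assumption: DSTACK_NODES_IPS is whitespace/newline separated.
    """
    candidates = set(os.environ.get("DSTACK_NODES_IPS", "").split())
    master_ip = os.environ.get("DSTACK_MASTER_NODE_IP")
    if master_ip:
        candidates.add(master_ip)
    result = subprocess.run(
        ["ip", "-j", "addr", "show"],
        check=True,
        capture_output=True,
        text=True,
    )
    return interfaces_matching(result.stdout, candidates)
```

Only the interface that actually holds one of the cluster's node IPs matches, so the loopback and public interfaces are skipped. The returned list may contain more than one name, which is the open question raised for multi-interface instances.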
Requires research: it is not yet clear how and whether dstack should provide multiple interface names when they are available, e.g. on AWS instances with multiple EFA attachments (see #1804).
Workaround
The node can try parsing the output of ifconfig or ip and compare it with existing system environment variables, such as DSTACK_MASTER_NODE_IP and DSTACK_NODES_IPS.
Additional information
Do after or along with #2219.
Would you like to help us implement this feature by sending a PR?
Yes