XGBoost

1. NCCL errors

XGBoost supports distributed GPU training, which depends on NCCL2 (the NVIDIA Collective Communications Library). NCCL auto-detects which network interfaces to use for inter-node communication. If an interface is in the up state but cannot actually communicate between nodes, NCCL may still try to use it and then fail during initialization or even hang.

To trace NCCL errors, enable NCCL_DEBUG when submitting the Spark application:

--conf spark.executorEnv.NCCL_DEBUG=INFO
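For context, the flag above might sit in a full submission roughly as in the following sketch. The application class and jar path are placeholders assumed for illustration and do not come from this guide. The script only prints the command so it can be reviewed before running, and the NCCL messages would then be expected in the executor logs.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions, not from this guide): replace with your own values.
APP_CLASS="com.example.XGBoostApp"
APP_JAR="/path/to/app.jar"

# Arguments for spark-submit; the --conf entry is the one described above.
submit_args=(
  --conf spark.executorEnv.NCCL_DEBUG=INFO
  --class "$APP_CLASS"
  "$APP_JAR"
)

# Print the command for review; remove "echo" to actually submit.
echo spark-submit "${submit_args[@]}"
```

Keeping the arguments in an array makes it easy to append further `--conf` entries without quoting problems.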

Sometimes a node tries to connect to another node that has selected an inappropriate interface, which can cause the XGBoost task to hang. To fix this kind of issue, specify an appropriate interface for the node with NCCL_SOCKET_IFNAME:

--conf spark.executorEnv.NCCL_SOCKET_IFNAME=eth0
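The value `eth0` is only an example, and the right name depends on each node's network setup. As a minimal sketch for finding candidates, assuming Linux nodes that expose interfaces under `/sys/class/net`, the hypothetical helper `list_ifaces` below (not part of NCCL, XGBoost, or Spark) prints each interface name with its operational state.

```shell
#!/usr/bin/env bash
# list_ifaces: hypothetical helper, assumes the Linux sysfs layout.
# Prints "<interface> <operstate>" for every entry under the given directory
# (default: /sys/class/net).
list_ifaces() {
  local root="${1:-/sys/class/net}"
  local dir name state
  for dir in "$root"/*/; do
    # Skip the unexpanded pattern when the directory is empty or missing.
    [ -d "$dir" ] || continue
    name="$(basename "$dir")"
    state="unknown"
    [ -r "$dir/operstate" ] && state="$(cat "$dir/operstate")"
    printf '%s %s\n' "$name" "$state"
  done
}

list_ifaces
```

An interface reported as `up` is only a candidate. As noted above, an interface can be up and still unable to reach the other nodes, so it is worth confirming connectivity between nodes over the chosen interface before setting NCCL_SOCKET_IFNAME.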
