Intel Gaudi bootc container is missing InfiniBand #483
Comments
Care to open a PR to fix?
@enriquebelarte solved the issue in ec05b07. Enrique, is there anything left to do here?
Could not test upstream because the build checks failed due to the kernel version in the runners.
@tiran, @enriquebelarte, assuming the fix has been identified and proven to work repeatably (add and load the habanalabs_ib module as part of the bootc image), should we close the issue?
@braultatgithub This Containerfile has changed significantly since the fix was submitted. I haven't tested the new Containerfile, but it seems to be doing the correct thing: it extracts the InfiniBand module from the rpm and loads it.
We need to test this on a real server. RDMA may also need a custom libfabric with SynapseAI, habanalabs-rdma-core, habanalabs-thunk, and Habana's hccl_ofi_wrapper library.
@tiran @enriquebelarte I stumbled upon this ticket by accident while diagnosing another habanalabs_ib issue myself. "Manually loading the module doesn't make a difference." rings very close to home, so I decided to share my two cents. I've seen a scenario where habanalabs, habanalabs_en and habanalabs_cn were correctly loaded on our system while habanalabs_ib failed to load because the ib_uverbs module (which is a dependency) was not loaded automatically when loading habanalabs_ib (for whatever reason). You should be able to observe that in dmesg on the failing system by looking for "habanalabs_ib" lines, specifically the "Unknown symbol" ones.
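The dmesg check described in the comment above could be scripted roughly as follows. This is only a sketch: it assumes the kernel reports unresolved dependencies in its usual "<module>: Unknown symbol <name> (err <n>)" form, and the function name find_unknown_symbols is made up for illustration.

```python
import re

# Assumption: unresolved symbols are logged by the kernel module loader as
# "habanalabs_ib: Unknown symbol <name> (err <n>)".
UNKNOWN_SYMBOL = re.compile(r"habanalabs_ib: Unknown symbol (\S+)")


def find_unknown_symbols(dmesg_text):
    """Return the symbols habanalabs_ib failed to resolve, in log order."""
    return [match.group(1) for match in UNKNOWN_SYMBOL.finditer(dmesg_text)]
```

Feed it the captured output of dmesg from the failing system. If it returns symbols that belong to ib_uverbs, loading ib_uverbs before habanalabs_ib may be worth trying, as the scenario above suggests.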
The container file https://github.com/containers/ai-lab-recipes/blob/main/training/intel-bootc/Containerfile does not contain the necessary bits and pieces to set up InfiniBand for Intel Gaudi devices. Without IB, it is not possible to use Intel Gaudi 2 cards for training. I'm not entirely sure if this only affects servers with multiple Intel Gaudi 2 cards or also servers with a single card. My test systems have 8 Intel Gaudi 2 cards each.
The server with the bootc image did not have the habanalabs_ib module loaded. Manually loading the module doesn't make a difference. rdma dev does not show any hlib (Habana Labs InfiniBand) devices. Without the devices, PyTorch and Habana's PyTorch plugin habana_frameworks fail to initialize the devices:

hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)

On the other bare metal server with regular RHEL 9 and Habana's packages, rdma dev shows a node for each card and rdma link shows over 160 active connections with LINK_UP.

# rdma dev
0: mlx5_0: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:5242 sys_image_guid b83f:d203:004a:5242
1: mlx5_1: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:522e sys_image_guid b83f:d203:004a:522e
18: hlib_4: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
19: hlib_6: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
20: hlib_2: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
21: hlib_7: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
22: hlib_1: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
23: hlib_3: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
24: hlib_5: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
25: hlib_0: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000

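To compare a bootc host against the working server, the rdma dev listing can be checked for the expected hlib devices. The following is a minimal sketch, assuming each device line starts with "<index>: <name>:" as in the output above and that the cards are named hlib_0 through hlib_7 on an 8-card system. The helper names hlib_devices and missing_hlib are hypothetical.

```python
import re

# Matches the leading "<index>: <name>:" of each device line,
# e.g. "19: hlib_6: node_type unspecified fw 49.0.0 ...".
DEV_LINE = re.compile(r"^\s*\d+:\s+(\S+):", re.MULTILINE)


def hlib_devices(rdma_dev_output):
    """Return the sorted names of all hlib_* devices in `rdma dev` output."""
    names = DEV_LINE.findall(rdma_dev_output)
    return sorted(name for name in names if name.startswith("hlib_"))


def missing_hlib(rdma_dev_output, expected_cards=8):
    """Return the hlib_<n> names (n < expected_cards) that are absent."""
    present = set(hlib_devices(rdma_dev_output))
    wanted = [f"hlib_{i}" for i in range(expected_cards)]
    return [name for name in wanted if name not in present]
```

On the affected bootc server, where rdma dev lists no hlib devices, missing_hlib would return all eight names. On the bare metal server it would return an empty list.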
The journal shows that the habanalabs kernel module registers a Habana Labs InfiniBand device for each card:

# journalctl -o cat | grep hlib
habanalabs 0000:db:00.0 hlib_4: IB device registered
habanalabs 0000:bb:00.0 hlib_2: IB device registered
habanalabs 0000:19:00.0 hlib_6: IB device registered
habanalabs 0000:5d:00.0 hlib_7: IB device registered
habanalabs 0000:9b:00.0 hlib_1: IB device registered
habanalabs 0000:cb:00.0 hlib_3: IB device registered
habanalabs 0000:4c:00.0 hlib_5: IB device registered
habanalabs 0000:3b:00.0 hlib_0: IB device registered

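When an error names a specific device such as hlib_6, the journal lines above tell which PCI address it belongs to. A small sketch of that lookup, assuming the "habanalabs <pci address> <hlib name>: IB device registered" format shown in the journal output. The function name pci_to_hlib is invented for this example.

```python
import re

# Assumption: registration lines look like
# "habanalabs 0000:19:00.0 hlib_6: IB device registered".
REGISTERED = re.compile(r"habanalabs (\S+) (hlib_\d+): IB device registered")


def pci_to_hlib(journal_text):
    """Map each PCI address to the hlib device registered for it."""
    return {pci: dev for pci, dev in REGISTERED.findall(journal_text)}
```

An empty result on the bootc host would be consistent with the IB devices never being registered there.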
According to https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html we also need the habanalabs-thunk and habanalabs-rdma-core packages in the bootc image.

reproducer