-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call zesInit()
instead of setting ZES_ENABLE_SYSMAN
from a constructor function
#687
Comments
I note intel/compute-runtime#650, which might be the reason hwloc was not using (I'm not a huge fan of calling |
Hello. Yes this is unpleasant, and it caused many complains, veeeery long discussion with ZE developers, but we had no way to do otherwise :/ The compute-runtime issue is indeed the main reason while we didn't use zesInit() yet. Supporting both zesInit() and no-zesInit() in the same code looked like a big mess. Hence the plan was to just wait, assume everybody switched to a modern compute-runtime with a working zesInit(), and then just drop the constructor entirely. Then came the issue with ZES and ZE using difference device handles, and conversion routines were still not implemented recently (oneapi-src/level-zero#153). Also there's a new "unified runtime" in the works, which may make some/most of this work obsolete at some point. Honestly, I was planning to wait until we know where we're going. Do you need a fix? Or can you live with a workaround (disabling levelzero as build-time, setting ZES_ENABLE_SYSMAN before hwloc does, ...)? Regarding calling zesInit() inside the constructor, I don't remember if it was required in some cases. My PR was just calling it from the normal hwloc code but it wasn't much tested since zesInit() wasn't implemented at that time. |
:-) I think we're both in a similar bind; the Swift runtime does the nasty thing it does to work around a problem with Rosetta and Docker, but that relies on things not changing the environment from constructors (which is a pretty mad thing to be doing, so it seemed a reasonable compromise; we detect the case when the environment has moved and cope with it, so the worst that happens is you end up with no command line arguments). Trouble is, someone linked OpenCV and pulled in hwloc, which then did exactly that :-D I think setting |
Hi, Intel recently updated their documentation about SYSMAN initialization ( https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md ) I'm as happy as you and we (Argonne / Aurora people) try to push Intel to fix some of their hum interesting behavior and implementation choices, |
as we documented about Sysman initialization with https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md , we recommend users to switch using zesInit instead of zeInit + ZES_ENABLE_SYSMAN. please let us know what is the issue for HWLOC switching to zesInit |
@saik-intel The doc isn't very clear. The way to find the sysman device handle corresponding to a core device handle is to get the uuid from the core device, call zesDriverGetDeviceByUuidExp() to get the index, and then use that index in the device array returned by zesGetDevice? Is there an easy way to translate between core and sysman driver handles too? Or should I just iterate over each driver until I find the right UUID? |
A nice complexity also arrisse due to the fact that "zesInit + (Legacy mode) Or (Legacy mode) + zesInit" is not supported. A user code who is using the "Legacy Mode" (aka calling zeInit + setting ZES_ENABLE_SYSMAN=1) and use the C API of hwlock, will need hwlock to not call "zesInit" to avoid a crash... |
please use below sample code snippet to map core device handle to sysman device handle and adapt/optimize for your environment. For simplicity sample assumes driver count is 1.
}` |
I don't get a crash in this case, I get ZE_RESULT_ERROR_UNINITIALIZED from zesInit() (tested with latest compute-runtime on my laptop, and with 24.26.30049.10-950 on the endeavour test cluster). However, I am thinking of just disabling the hwloc level-zero backend whenever I find ZES_ENABLE_SYSMAN=1 in the environment just in case. |
Is driver count always going to be 1 in practice? Intel CPU + GPU seem to use the same driver. As long as other GPU vendors don't implement their own ZE driver, we're good? I don't see any way to match ZE and ZES drivers (I don't see any driver name, ZE reports a driver UUID, ZES doesn't). Does the device UUID somehow contain a driver-specific part that would prevent two drivers from giving the same UUID to two different devices? |
if you use ZES_ENABLE_Sysman in environment variable and call zesInit(), it will return error as we want to support only one |
As two combination is not supported , use either one mode and i recommend to use only one and if we use LEGACY method, i recommend dont call zesInit() |
|
You are right. My bad. ( Sorry for the noise! |
@saik-intel The question is how to expand your sample code with multiple drivers. Let's say I have 2 drivers reporting 2 devices each. Am I sure that UUID of these 2*2 devices will all be different so that I can rely on only the device UUID to match ZES device with ZE device? This really depends how device UUID are built internally. If all drivers build the device UUID as ++, no problem. If some dumb drivers just report UUID=0 for the first device and UUID=1 for the second, we'll get some UUID collisition, and UUID matching won't be unique unless we can also match the ZE driver with the ZES driver. This question only stands if multiple drivers are going to exist in practice. If some academics start playing with the compute runtime and add another driver with a simple UUID build, the problem may arise. If both your driver and a custom driver build the UUID from the same PCI BDF, they may report the same UUID, problem too. |
|
@saik-intel Thanks, I have the clarification I need then, no need for @AshwinKumarKulkarni to update the code, just add a comment saying that UUID is guaranteed unique across multiple devices and drivers. |
@bgoglin yes UUID is guaranteed unique across multiple devices and drivers. Can you please close this issue, if it is clarified. Thanks |
This issue will be closed once implemented in hwloc, which will take some days. The plan is to drop support for ZES_ENABLE_SYSMAN entirely and only support on zesInit(). Will publish that in hwloc 2.12. |
FYI there's a candidate implementation in #695 and https://ci.inria.fr/hwloc/job/basic/view/change-requests/job/PR-695/ but I need to find platforms with PVC and a recent enough L0 runtime to really test things. |
I can have a try! Do you have a test-suite for me to run? |
That'd be great. Take the tarball from the ci.inria.fr link above, run ./configure && make && make check, and also run |
I confirm I worked! Thanks a lot!
But the The It's nice that now the OpenCL are merged in the same PCI ( I think before they where in a separate box). In fact they are the same physical device (they should may / report the same UUID), I guess one possibility will be for the boxes to merged and named |
Thanks! I think everything looks good. OpenCL works better than earlier because there's a new OpenCL extension that works for Intel GPUs too. Regarding InfiniBand, can you look the configure output if verbs.h was found? It's supposed to disable that specific test otherwise. |
Didn't found it indeed:
Strangely enough, I have it on my path
|
I just merged #695 to close this. I'll release hwloc 2.12 soon with all these changes. I tried to reproduce the unrelated IB issue from @TApplencourt above, without any success. Feel free to open another issue about it (I guess we'd want the entire configure output and config.log). |
It looks like hwloc changes the environment from a constructor function; this seems unpleasant — it means that very likely the first thing your program will do is allocate memory and copy the environment.
Setting
ZES_ENABLE_SYSMAN=1
explicitly as a workaround will stop this behaviour, but it would be better if hwloc calledzesInit(0)
rather than changing the environment.The text was updated successfully, but these errors were encountered: