Over in #87 we implemented on-demand loading of GPU SMI libraries. But one issue was left hanging, namely, what to do in the event of .so version changes? Indeed, how even to detect that there has been a version change?
In brief, a .so is versioned as .so.x where the x is the version number. For example, currently we have x=1 for the NVIDIA library and x=7 for the AMD ROCM library. x is supposed to change when there's a breaking API change (https://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html). If the library moves to a new x value then we have two cases:
- The API is incompatible but we don't detect it because we're loading via a symlink without the version number. This is currently the case on NVIDIA systems, where /usr/lib64/libnvidia-ml.so is a symlink to libnvidia-ml.so.1 (which is in turn a link to libnvidia-ml.so.xxx.yyy.zz for the specific software version). If we're lucky the linking fails for a new .so version, but more likely it will succeed and some of the calls will fail in weird ways because signatures have changed.
- The API is incompatible, the old version is gone, and we fail to see the new version because it has a different file name. This is the case on AMD systems, where we load /opt/rocm/lib/librocm_smi64.so.7 and just bail out if that fails. (Here too there is a symlink in the mix: the rocm in that path is usually a link to a version-specific directory.)
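For the first case, one way to find out which .so.x actually sits behind an unversioned name is to walk the symlink chain one hop at a time and look for the first name that ends in .so.x. The following is only a sketch in Python, not the project's code; the helper names `symlink_chain` and `so_major` are made up for illustration, and it assumes the NVIDIA-style naming described above, where the final target carries a multi-part software version rather than the .so.x number.

```python
import os
import re

# Matches a name that ends in ".so.<digits>", e.g. "libnvidia-ml.so.1",
# but not "libnvidia-ml.so.535.104.05" (multi-part software version).
_SO_MAJOR = re.compile(r"\.so\.(\d+)$")


def symlink_chain(path, max_hops=16):
    """Return [path, target, target-of-target, ...], following symlinks one hop at a time."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # A relative link target is relative to the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


def so_major(path):
    """Return the x of the first ".so.x" name in the symlink chain, or None if there is none."""
    for p in symlink_chain(path):
        m = _SO_MAJOR.search(os.path.basename(p))
        if m:
            return int(m.group(1))
    return None
```

Comparing the result of `so_major("/usr/lib64/libnvidia-ml.so")` against the version the code was written for would flag a change before any call is made. It does not see symlinked directories such as the rocm component of the AMD path; those would need a separate check on the directory.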
This needs a principled solution and I don't yet know what it is, but probably something like this:
- always load a versioned library
- if the load fails, glob for all .so versions in the same directory
- report the failure in any case, with as much info as possible
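The three steps above could look roughly like this. This is a Python sketch under stated assumptions, not a proposal for the actual implementation: the function name `load_versioned` and the shape of the report are hypothetical, the path is assumed to be absolute, and the sketch only reports the other versions it finds rather than trying to load one of them.

```python
import ctypes
import glob
import os


def load_versioned(path):
    """Try to load exactly `path`, a versioned name such as ".../libfoo.so.7".

    Returns (handle, None) on success. On failure returns (None, report),
    where report records what was wanted, the loader's error message, and
    every libfoo.so* file found in the same directory.
    """
    try:
        return ctypes.CDLL(path), None
    except OSError as exc:
        directory, name = os.path.split(path)
        stem = name.split(".so")[0]  # "libfoo.so.7" -> "libfoo"
        pattern = os.path.join(glob.escape(directory), glob.escape(stem) + ".so*")
        found = sorted(glob.glob(pattern))
        return None, {"wanted": path, "error": str(exc), "found": found}
```

For example, `load_versioned("/opt/rocm/lib/librocm_smi64.so.7")` on a system that has moved on to a newer version would return no handle, plus a report whose `found` list names whatever librocm_smi64.so* files are present, which is the information the error report needs to carry.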
To report the failure we need some kind of error-reporting back channel, but we want this anyway. It could be a message like the heartbeat, or piggybacked on normal data like load= or on the heartbeat, but there are other possibilities.