Figure out dynamic library versioning #232

Open
lars-t-hansen opened this issue Jan 14, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@lars-t-hansen
Collaborator

Over in #87 we implemented on-demand loading of GPU SMI libraries. But one issue was left hanging, namely, what to do in the event of .so version changes? Indeed, how even to detect that there has been a version change?

In brief, a .so is versioned as .so.x where x is the version number. For example, currently we have x=1 for the NVIDIA library and x=7 for the AMD ROCm library. x is supposed to change when there's a breaking API change (https://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html). If the library moves to a new x value then we have two cases:

  • the API is incompatible but we don't detect it because we're loading via a symlink without the version number. This is currently the case on NVIDIA systems, where /usr/lib64/libnvidia-ml.so is a symlink to libnvidia-ml.so.1 (which is in turn a link to libnvidia-ml.so.xxx.yyy.zz for the specific software version). If we're lucky the link fails for a new .so version, but more likely it will succeed and some of the calls will fail in weird ways because signatures have changed.
  • the API is incompatible, the old version is gone, and we fail to see the new version because it has a different file name. This is the case on AMD systems, where we load /opt/rocm/lib/librocm_smi64.so.7 and just bail out if that fails. (Here too there is a symlink in the mix: the rocm in that path is usually a link to a version-specific directory.)
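
To make the detection question concrete, here is a minimal sketch (in Python for brevity, not the project's own code) of how one might follow a symlink chain like the NVIDIA one above and read off the .so version it resolves to. The function names `so_major_version`, `symlink_chain` and `linked_so_version` are hypothetical, and the sketch assumes the version is simply the first numeric component after `.so.` in a file name.

```python
import os
import re

# Matches names like "libfoo.so.7" or "libfoo.so.550.54.15"; group 1 is the
# first numeric component after ".so.".
_SO_VERSION = re.compile(r"\.so\.(\d+)(?:\.\d+)*$")


def so_major_version(filename):
    """Return the first version component after '.so.' as an int, or None."""
    m = _SO_VERSION.search(filename)
    return int(m.group(1)) if m else None


def symlink_chain(path, limit=16):
    """Return [path, target, target-of-target, ...], following symlinks."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= limit:
        target = os.readlink(path)
        # A relative link target is relative to the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


def linked_so_version(path):
    """Return the version of the first name in the chain that carries one."""
    for p in symlink_chain(path):
        version = so_major_version(os.path.basename(p))
        if version is not None:
            return version
    return None
```

With a chain of the form libnvidia-ml.so → libnvidia-ml.so.1 → libnvidia-ml.so.xxx.yyy.zz, `linked_so_version` would report 1, which could then be compared against the version the code was written for before any calls are made.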

This needs a principled solution and I don't yet know what it is, but probably something like this:

  • always load a versioned library
  • if the load fails, glob for all .so versions in the same directory
  • report the failure in any case, with as much info as possible
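
The three steps above could be sketched roughly as follows. This is an illustration in Python using `ctypes`, not a proposal for the actual implementation; `load_versioned` and the shape of the report dictionary are assumptions made up for the example. Note that the sketch only lists other versions it finds and does not try to load them, since a different version may be incompatible.

```python
import ctypes
import glob
import os


def load_versioned(path):
    """Try to load exactly `path`, a versioned name such as ".../libfoo.so.7".

    Returns (handle, None) on success. On failure returns (None, report),
    where report records what was wanted, the loader error, and any other
    .so versions of the same library found in the same directory.
    """
    try:
        return ctypes.CDLL(path), None
    except OSError as e:
        directory, name = os.path.split(path)
        # "libfoo.so.7" -> "libfoo.so"
        stem = name.split(".so", 1)[0] + ".so"
        pattern = os.path.join(glob.escape(directory), glob.escape(stem) + ".*")
        found = sorted(p for p in glob.glob(pattern) if p != path)
        return None, {"wanted": path, "error": str(e), "found": found}
```

A caller would get back something like `{"wanted": ".../libfoo.so.7", "error": "...", "found": [".../libfoo.so.8"]}` when only a newer version is installed, which is the information the failure report needs to carry.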

To report the failure we need some kind of error-reporting back channel, but we want this anyway. It could be a message like the heartbeat, or piggybacked on normal data like load= or on the heartbeat, but there are other possibilities.
