
Questions regarding multi-GPU #8

Open
LokiWager opened this issue Sep 5, 2023 · 3 comments
@LokiWager

LokiWager commented Sep 5, 2023

  1. I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent. The only instances I've noticed are within the initial client setup in client.c and resource assignment in k8s-plugin. Where else might this information be specified?

  2. Given the existing architecture, what potential challenges might we face if we were to extend support for multi-GPU? I presume there might be a requirement for a multi-queue scheduler, an equitable scheduling algorithm for client assignments, and modifications to the k8s-plugin.

  3. Does it only support glibc 2.2.5 & glibc 2.34?

I look forward to your response. Thank you!

@grgalex
Owner

grgalex commented Sep 7, 2023

I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent

I mistakenly state in the README that nvshare-scheduler uses the GPU with ID 0. The scheduler is actually GPU-agnostic: we could use the same program to schedule access to a phone booth and we wouldn't have to change a single line.

The only place where GPU ID 0 is hardcoded is the following:

However, my (untested) understanding is that for a container that uses a single GPU, that GPU always has ID 0 w.r.t. NVML, so this is not a problem.

@grgalex
Owner

grgalex commented Sep 7, 2023

Does it only support glibc 2.2.5 & glibc 2.34?

It supports many versions of glibc and has worked seamlessly on every one I've tested. The GLIBC_{2.2.5, 2.34} shenanigans exist precisely to make it portable across glibc versions.

See the comment in https://github.com/grgalex/nvshare/blob/9504cdcdcd21c6935f54877da677272e1493f081/src/hook.c:

 * Since we're interposing dlsym() in libnvshare, we use dlvsym() to obtain the
 * address of the real dlsym function.
 *
 * Depending on glibc version, we look for the appropriate symbol.
 *
 * Some context on the implementation:
 *
 * glibc 2.34 removed the internal __libc_dlsym() symbol that NVIDIA uses in
 * their cuHook example:
 * https://github.com/phrb/intro-cuda/blob/d38323b81cd799dc09179e2ef27aa8f81b6dac40/src/cuda-samples/7_CUDALibraries/cuHook/libcuhook.cpp#L43
 *
 * One solution, discussed in apitrace's repo is to use dlvsym(), which also
 * takes a version string as a 3rd argument, in order to obtain the real
 * dlsym().
 * 
 * This is what user 'manisandro' suggested 8 years ago, when warning about
 * using the private __libc_dlsym():
 * https://github.com/apitrace/apitrace/issues/258
 * 
 * The maintainer of the repo didn't heed the warning back then; it came back
 * 8 years later and bit them.
 * 
 * This is also what user "derhass" suggests:
 * https://stackoverflow.com/a/18825060
 * (See section "UPDATE FOR 2021/glibc-2.34").
 * 
 * Given all the above, we obtain the real `dlsym()` as such:
 * real_dlsym=dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
 *
 * Since we have to explicitly use a version argument in dlvsym(), we also have
 * to define and export two versions of dlsym (hence the linker script.), one
 * for each distinct glibc symbol version.
 *
 */

@grgalex
Owner

grgalex commented Sep 7, 2023

@LokiWager

Feel free to open an issue with your suggested plan (it could be similar to what I proposed, it could be radically different) for implementing any of these features.

Then you can prepare a PR and we can take a look together and hopefully merge! :)
