
Questions regarding multi-GPU #8

Open
LokiWager opened this issue Sep 5, 2023 · 3 comments
@LokiWager

LokiWager commented Sep 5, 2023

  1. I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent. The only instances I've noticed are within the initial client setup in client.c and resource assignment in k8s-plugin. Where else might this information be specified?

  2. Given the existing architecture, what potential challenges might we face if we were to extend support for multi-GPU? I presume there might be a requirement for a multi-queue scheduler, an equitable scheduling algorithm for client assignments, and modifications to the k8s-plugin.

  3. Does it only support glibc 2.2.5 & glibc 2.34?

I look forward to your response. Thank you!

@grgalex
Owner

grgalex commented Sep 7, 2023

I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent

I mistakenly state in the README that nvshare-scheduler uses the GPU with ID 0. The scheduler is actually GPU-agnostic: we could use the same program to schedule access to a phone booth and we wouldn't have to change a single line.

The only place where GPU ID 0 is hardcoded is the following:

However, my (untested) understanding is that for a container that uses a single GPU, that GPU always has ID 0 w.r.t. NVML, so this is not a problem.

@grgalex
Owner

grgalex commented Sep 7, 2023

Does it only support glibc 2.2.5 & glibc 2.34?

It supports many versions of glibc and has worked seamlessly on every one I've tested. The GLIBC_{2.2.5, 2.34} shenanigans exist precisely to make it portable across glibc versions.

See the comment in https://github.com/grgalex/nvshare/blob/9504cdcdcd21c6935f54877da677272e1493f081/src/hook.c:

 * Since we're interposing dlsym() in libnvshare, we use dlvsym() to obtain the
 * address of the real dlsym function.
 *
 * Depending on glibc version, we look for the appropriate symbol.
 *
 * Some context on the implementation:
 *
 * glibc 2.34 removed the internal __libc_dlsym() symbol that NVIDIA uses in
 * their cuHook example:
 * https://github.com/phrb/intro-cuda/blob/d38323b81cd799dc09179e2ef27aa8f81b6dac40/src/cuda-samples/7_CUDALibraries/cuHook/libcuhook.cpp#L43
 *
 * One solution, discussed in apitrace's repo is to use dlvsym(), which also
 * takes a version string as a 3rd argument, in order to obtain the real
 * dlsym().
 * 
 * This is what user 'manisandro' suggested 8 years ago, when warning about
 * using the private __libc_dlsym():
 * https://github.com/apitrace/apitrace/issues/258
 * 
 * The maintainer of the repo didn't heed the warning back then; it came back
 * 8 years later and bit them.
 * 
 * This is also what user "derhass" suggests:
 * https://stackoverflow.com/a/18825060
 * (See section "UPDATE FOR 2021/glibc-2.34").
 * 
 * Given all the above, we obtain the real `dlsym()` as such:
 * real_dlsym=dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
 *
 * Since we have to explicitly use a version argument in dlvsym(), we also have
 * to define and export two versions of dlsym (hence the linker script.), one
 * for each distinct glibc symbol version.
 *
 */

@grgalex
Owner

grgalex commented Sep 7, 2023

@LokiWager

Feel free to open an issue with your suggested plan (it could be similar to what I proposed, it could be radically different) for implementing any of these features.

Then you can prepare a PR and we can take a look together and hopefully merge! :)
