Support for GPU checkpointing in nvproxy #11095
Comments
Have you looked at #10478 (which I believe was filed by one of your colleagues :))?
Interesting, I assumed that NVIDIA hadn't fixed the issue since NVIDIA/cuda-checkpoint#4 is still open, but honestly, I haven't tried running it.
I would recommend trying the latest driver (R565, I believe).
Thanks! I'll report back when I've had a chance to try the latest driver. Really appreciate the help on this.
Since the latest driver gVisor currently supports is 560.35.03, I have tried this with cuda-checkpoint and gVisor C/R. Unfortunately, I still came across the same error during checkpointing.
It seems cuda-checkpoint itself has not been updated; does the latest NVIDIA driver fix this bug? Could you please give more information about how to make it (PyTorch + cuda-checkpoint + gVisor C/R) work? Or is there a branch I could try on the latest driver (565)?
PS: for detailed info:
runsc: master branch with commit id:
runtime config
how do I run the vllm container:
how do I use cuda-checkpoint
how do I use runsc checkpoint
@tianyuzhou95 It seems NVIDIA/cuda-checkpoint#4 is still not fixed in any of the releases I have tried. This (PyTorch apps not being able to be checkpointed) is a cuda-checkpoint bug, not a gVisor one. Could you follow up with NVIDIA about the timeline?
@ayushr2 Of course, I will keep following NVIDIA's fix for this. It looks like they plan to support it in early 2025. Thanks!
Description
We're interested in some form of GPU checkpointing - is this something that the gVisor team plans on supporting at any point?
Generally, existing GPU checkpointing implementations described in papers like Singularity or Cricket intercept CUDA calls via LD_PRELOAD. Prior to a checkpoint, they record stateful calls in a log, which is stored at checkpoint time along with the contents of GPU memory. At restore time, GPU memory is reloaded and the log is replayed. Both frameworks have to do some virtualization of device pointers as well.
It seems (perhaps naively) that a similar scheme might be possible within nvproxy, which already intercepts calls to the GPU driver. In theory, nvproxy could record a subset of calls made to the GPU driver and replay them at checkpoint/restore time, virtualizing file descriptors and device pointers as needed; and, separately, support copying the contents of GPU memory off the device to a file and back.
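To make the idea concrete, here is a rough, purely illustrative Go sketch of the record/replay bookkeeping such a scheme would need. The types and names are hypothetical and not part of nvproxy today: stateful driver calls get appended to a log, and at restore time the log is replayed while old driver handles (and, analogously, FDs and device pointers) are remapped to the values returned by the restored driver.

```go
// Hypothetical sketch only: these types do not exist in nvproxy; names are
// illustrative.
package main

import "fmt"

// loggedCall records one stateful driver request that would have to be
// replayed at restore time (e.g. object allocation or mapping ioctls).
type loggedCall struct {
	ioctl   uint32   // request number of the intercepted ioctl
	params  []byte   // serialized parameter struct, as seen at record time
	handles []uint32 // driver-assigned handles returned by this call
}

// replayState tracks how identifiers from the checkpointed driver session
// map to identifiers in the restored session.
type replayState struct {
	handleMap map[uint32]uint32 // old driver handle -> replayed handle
}

// record appends a stateful call to the log; read-only/query calls would be
// passed through without being logged.
func record(log *[]loggedCall, c loggedCall) {
	*log = append(*log, c)
}

// replay re-issues every logged call against the restored driver (via the
// supplied issue function) and records how old handles map to new ones.
func replay(log []loggedCall, issue func(loggedCall) ([]uint32, error)) (*replayState, error) {
	st := &replayState{handleMap: make(map[uint32]uint32)}
	for _, c := range log {
		newHandles, err := issue(c) // in nvproxy this would be a real ioctl
		if err != nil {
			return nil, fmt.Errorf("replaying ioctl %#x: %w", c.ioctl, err)
		}
		for i, old := range c.handles {
			st.handleMap[old] = newHandles[i]
		}
	}
	return st, nil
}

func main() {
	// Toy demonstration: log two "allocations", then replay them against a
	// fake driver that hands out different handle values.
	var log []loggedCall
	record(&log, loggedCall{ioctl: 0x2a, handles: []uint32{100}})
	record(&log, loggedCall{ioctl: 0x2b, handles: []uint32{101}})

	next := uint32(500)
	st, err := replay(log, func(c loggedCall) ([]uint32, error) {
		out := make([]uint32, len(c.handles))
		for i := range out {
			out[i] = next
			next++
		}
		return out, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(st.handleMap) // map[100:500 101:501]
}
```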
This is clearly complex. I'm curious whether you all believe it to be viable, and whether you plan on exploring the scheme described above, or a different one, at any point.
Is this feature related to a specific bug?
No response
Do you have a specific solution in mind?
No response