cudaMemcpy consuming CPU resources? #58

Open · lix19937 opened this issue Nov 8, 2024 · 1 comment

lix19937 commented Nov 8, 2024

NVIDIA/nccl#688

lix19937 commented Nov 9, 2024

With cudaMemcpy the CUDA driver detects that you are copying from a host pointer to a host pointer and the copy is done on the CPU. You can of course use memcpy on the CPU yourself if you prefer.
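A minimal sketch of the two options, assuming pinned host buffers and omitting error checking (buffer names and sizes are made up for illustration); both paths end up as a CPU copy when source and destination are host pointers:

```cpp
#include <cuda_runtime.h>
#include <cstring>

int main() {
    const size_t SIZE = 64 << 20;              // 64 MiB, arbitrary for illustration
    char *src = nullptr, *dst = nullptr;
    cudaMallocHost((void**)&src, SIZE);        // pinned host memory
    cudaMallocHost((void**)&dst, SIZE);

    // Option 1: go through the runtime; the driver sees host -> host
    // and performs the copy on the CPU.
    cudaMemcpy(dst, src, SIZE, cudaMemcpyHostToHost);

    // Option 2: do the copy yourself on the CPU; no CUDA runtime involvement.
    memcpy(dst, src, SIZE);

    cudaFreeHost(src);
    cudaFreeHost(dst);
    return 0;
}
```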

If you use cudaMemcpy, there may be an extra stream synchronize performed before doing the copy (which you may see in the profiler, but I'm guessing there—test and see).
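One hypothetical way to test that guess (not from the issue itself): launch a long-running kernel on the default stream, then call the synchronous cudaMemcpy. If the runtime synchronizes first, the host-to-host copy cannot start until the kernel finishes, which should be visible in a profiler such as Nsight Systems, whereas a plain memcpy in the same spot would not wait.

```cpp
#include <cuda_runtime.h>
#include <cstring>

__global__ void busy_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = p[i];
        for (int k = 0; k < 10000; ++k) v = v * 1.000001f + 0.5f;  // burn some time
        p[i] = v;
    }
}

int main() {
    const int    N     = 1 << 20;
    const size_t BYTES = 16 << 20;
    float *d = nullptr;
    char  *h_src = nullptr, *h_dst = nullptr;
    cudaMalloc(&d, N * sizeof(float));
    cudaMallocHost((void**)&h_src, BYTES);     // pinned host buffers
    cudaMallocHost((void**)&h_dst, BYTES);

    busy_kernel<<<(N + 255) / 256, 256>>>(d, N);   // launched asynchronously, still running

    // Synchronous cudaMemcpy: it may first wait for the default stream to drain,
    // then do the copy on the CPU. Compare against memcpy(h_dst, h_src, BYTES)
    // under a profiler to see whether the extra wait shows up.
    cudaMemcpy(h_dst, h_src, BYTES, cudaMemcpyHostToHost);

    cudaDeviceSynchronize();
    cudaFree(d);
    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    return 0;
}
```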

On a UVA system you can just use cudaMemcpyDefault as talonmies says in his answer. But if you don’t have UVA (sm_20+ and 64-bit OS), then you have to call the right copy (e.g. cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in then cudaMemcpyDeviceToDevice will end up doing the following depending on where the memory is located:

Host   <-> Host  : performed by the CPU (memcpy)
Host   <-> Device: DMA (device copy engine)
Device <-> Device: Memcpy CUDA kernel (runs on the SMs, launched by driver)

https://stackoverflow.com/questions/12453677/better-or-the-same-cpu-memcpy-vs-device-cudamemcpy-on-pinned-mapped-memory
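A sketch of the UVA case (assuming a UVA-capable system; pointer names and sizes are made up): once the ordinary host allocations are page-locked with cudaHostRegister(), the driver can tell from the pointers alone where each buffer lives, so a single cudaMemcpyDefault call takes whichever path in the table above applies.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t SIZE = 1 << 20;
    char *h_a = (char*)malloc(SIZE);
    char *h_b = (char*)malloc(SIZE);
    char *d_a = nullptr;
    cudaMalloc(&d_a, SIZE);

    // Page-lock the ordinary host allocations so the driver knows about them.
    cudaHostRegister(h_a, SIZE, cudaHostRegisterDefault);
    cudaHostRegister(h_b, SIZE, cudaHostRegisterDefault);

    cudaMemcpy(h_b, h_a, SIZE, cudaMemcpyDefault);  // host   -> host  : CPU memcpy
    cudaMemcpy(d_a, h_a, SIZE, cudaMemcpyDefault);  // host   -> device: DMA copy engine
    cudaMemcpy(h_b, d_a, SIZE, cudaMemcpyDefault);  // device -> host  : DMA copy engine

    cudaHostUnregister(h_a);
    cudaHostUnregister(h_b);
    cudaFree(d_a);
    free(h_a);
    free(h_b);
    return 0;
}
```

Without UVA, the same calls would instead need the explicit copy kinds (cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost), as noted above.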
