With cudaMemcpy the CUDA driver detects that you are copying from a host pointer to a host pointer and the copy is done on the CPU. You can of course use memcpy on the CPU yourself if you prefer.
If you use cudaMemcpy, an extra stream synchronization may be performed before the copy. You may see it in the profiler, but I'm guessing there, so test and see.
On a UVA system you can just use cudaMemcpyDefault, as talonmies says in his answer. UVA requires sm_20+ and a 64-bit OS; if you don't have it, then you have to call the right copy (e.g. cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in, then cudaMemcpyDeviceToDevice will end up doing the following, depending on where the memory is located:
Host <-> Host: performed by the CPU (memcpy)
Host <-> Device: DMA (device copy engine)
Device <-> Device: Memcpy CUDA kernel (runs on the SMs, launched by driver)
NVIDIA/nccl#688