[WIP] introduce CUDA managed memory and use it for a matching function #157
Description
Introduce CUDA Managed Memory (CMM) and use it in a new feature matching function.
The main change of this PR is the addition of the malloc_mgd and free_mgd functions and the malloc_mgdT template (a sketch of possible signatures is shown below).
As a secondary change, the PR adds a new feature matching function that uses CMM.
The demo program popsift-match has been changed to use the new function, making it easy to print results on screen.
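For orientation, here is a minimal sketch of what the new allocation helpers might look like. The names are taken from the PR description, but the signatures and error handling shown here are assumptions and may differ from the actual code.

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of the managed-memory helpers added by this PR;
// the real malloc_mgd/free_mgd/malloc_mgdT may have different signatures.
inline void* malloc_mgd( size_t bytes )
{
    void* ptr = nullptr;
    cudaError_t err = cudaMallocManaged( &ptr, bytes, cudaMemAttachGlobal );
    return ( err == cudaSuccess ) ? ptr : nullptr;
}

inline void free_mgd( void* ptr )
{
    cudaFree( ptr );
}

// Typed convenience wrapper allocating count elements of T.
template<typename T>
inline T* malloc_mgdT( size_t count )
{
    return static_cast<T*>( malloc_mgd( count * sizeof(T) ) );
}
```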
Features list
Implementation remarks
CMM allows the programmer to allocate flat 1D memory that is accessible to both the CPU and the GPU. The CUDA device driver guesses on which side the memory is needed next and performs the transfer in the background. On devices where CPU and GPU share physical memory, this is even better because memory copies can be avoided altogether. Using CMM started to make sense with CUDA compute capability 6.0 ("Pascal").
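As an illustration (not code from this PR), a managed buffer can be written by a kernel and then read directly on the host without an explicit cudaMemcpy:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill( float* data, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n ) data[i] = 2.0f * i;
}

int main()
{
    const int n = 1024;
    float* data = nullptr;
    cudaMallocManaged( &data, n * sizeof(float) );  // visible to CPU and GPU

    fill<<<(n + 255) / 256, 256>>>( data, n );
    cudaDeviceSynchronize();                        // hand control back to the CPU

    printf( "data[10] = %f\n", data[10] );          // read directly on the host
    cudaFree( data );
    return 0;
}
```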
Using CMM is purportedly safe, but in practice it is not. If the programmer doesn't keep track of which side controls the memory at any time, there will be race conditions. On the NVIDIA Tegra, a shared memory architecture, we have seen race conditions spanning several allocated memory regions when those regions are so small that they fit into the same memory page, e.g. control structures. The simple way of preventing race conditions is to call cudaDeviceSynchronize. It is more efficient to use cudaMemAdvise to tell the driver that the CPU will use the memory next (don't forget to unset the location hint after the CPU is finished with it!), and cudaMemPrefetchAsync to inform the driver which GPU stream will need the memory, so the other streams don't have to wait for it.
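A sketch of this hint-based approach (illustrative only, not code from the PR; the buffer, device and stream names are placeholders):

```cpp
#include <cuda_runtime.h>

// Illustrative sketch: steer a managed buffer between the host and one GPU stream.
void cpu_then_gpu( float* buf, size_t bytes, int device, cudaStream_t stream )
{
    // Tell the driver the CPU will touch the buffer next, and migrate it there.
    cudaMemAdvise( buf, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId );
    cudaMemPrefetchAsync( buf, bytes, cudaCpuDeviceId, stream );
    cudaStreamSynchronize( stream );

    // ... CPU reads/writes buf here ...

    // Undo the CPU hint once the host is done with the buffer.
    cudaMemAdvise( buf, bytes, cudaMemAdviseUnsetPreferredLocation, cudaCpuDeviceId );

    // Prefetch to the GPU on the stream that will consume the buffer,
    // so other streams do not have to wait for the migration.
    cudaMemPrefetchAsync( buf, bytes, device, stream );
    // Kernels launched into `stream` can now use buf without page faults.
}
```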