-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
making Multinomial sampling slightly faster #786
base: master
Are you sure you want to change the base?
Conversation
The NEW block shows the time for the current code and the old block shows it for the previous code. |
@pavanky Could you have a look at this too ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of this is the function uses more memory. I think there are ways to make it faster without generating the uniform array outside of the kernel.
lib/THC/generic/THCTensorRandom.cu
Outdated
n_sample, | ||
THCudaLongTensor_data(state, self), | ||
numDist, numCategories, | ||
THCTensor_(data)(state, prefixSum)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the lines are not aligned.
@pavanky But I think generating just an array (the size of |
@amartya18x what speedups are you seeing? Can you give the output of the benchmark and specify what GPU you ran them on? |
Yeah, I'm not convinced. Not only does it require allocation and deallocation (which is bad if one is not running with the caching allocator), it's replacing in-kernel generation with extra global memory reads and writes, which together with the memory allocator stuff may be on par with what was there before. Do you have kernel timings? |
@@ -292,7 +291,7 @@ sampleMultinomialWithoutReplacement(curandStateMtgp32* state, | |||
|
|||
// All threads must participate in this | |||
T r = ScalarConvert<float, T>::to(curand_uniform(&state[blockIdx.x])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plus, this wasn't converted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I didn't do it for this. I just tried it out for the other one. If the idea convinced everyone, I could do it for the other one too.
print("") | ||
print("Benchmarking multinomial with "..curr_n_class.." classes and "..curr_n_sample.." samples") | ||
torch.seed() | ||
local probs = torch.CudaDoubleTensor(n_dist, curr_n_class):uniform(0,1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that benchmarking using float64 is useful, since almost all work done with Torch is in float32. Furthermore, this will have very skewed results on different GPUs due to lack of float64 ALUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest using torch.FloatTensor
?
a:reset() | ||
for i = 1,10 do | ||
torch.multinomial(probs, curr_n_sample, true) | ||
cutorch.synchronize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you synchronizing every time through for the benchmark? one should only synchronize at the beginning and the end.
@soumith I had the benchmarking here in a comment showing the speedups. I think the rebasing messed it up. Let me check |
@wickedfoo the problem was the previous kernel was generating more random numbers than needed. This PR fixes that issue by creating the required number of random numbers outside. That said, this isn't the most optimal fix. We should be generating the necessary random numbers within the kernel. |
Is generating more random numbers than needed really that much of an issue? If we're talking microseconds here, this isn't that big of a deal in my opinion. The RNG being used requires the entire block to update the state at once, although you are right that only one value per warp is being used. The alternative would be to see if warp divergence for the binary search part isn't too bad, in which case every thread would use the value that it generated locally. |
@wickedfoo Another alternative would be to perform parallel linear search instead of a binary search on a single thread. Either way the improvements here would be minimal. The more substantial improvements we are seeing come from here: #784 |
AFAIK this is only a limitation of MTGP32 generator. This is also a bit slower than other generators in CURAND. Any reason to stick with MTGP32 ? |
Nvm my previous comment, I didnt realize cutorch only supports this generator. |
This makes multinomial sampling slightly faster by generating a
sample
sized array of uniformly sampled random numbers outside the kernel and passing it to the kernel. The different distributions in the kernel can use the same array of uniformly sampled numbers as they are independant distributions.I have addded a file called
test/multinomial.lua
to do some benchmarking. I can remove it later if the PR is accepted.