multicore mul_assign does not improve performance as expected #1
After Ignacio Hagopian's suggestion, I changed the batch size for the bench. Moreover, nothing changes on Intel. I suspect that I am messing something up. Can anyone verify?
If anyone is curious about what the ideal times should be, run cargo bench.
I increased the moduli count to check what happens. Performance looks close to what we expect when the cores are oversubscribed, that is, when the number of threads is greater than the number of cores. The following are benchmarks with moduli set to 16.
With 16 moduli, the time for 2 threads and 4 threads should be 1/2 and 1/4 of the time for 1 thread. I find this behaviour odd; we are definitely spending time doing something that we shouldn't.
Here's the function that performs the mul_assign operation a bunch of times in main.rs. Could someone with profiling experience help us figure out where we are spending time?
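For readers who don't want to open the repo, here is a hypothetical stdlib-only sketch of the shape of that loop. The real code parallelizes with rayon over an ndarray Axis(0), but the structure is the same: one thread per modulus, each doing a coefficient-wise modular multiply of length n. The function name, signature, and the plain 128-bit reduction are assumptions for illustration, not the project's actual implementation.

```rust
use std::thread;

// Sketch only: one scoped thread per modulus (per row of the coefficient
// matrix). The real code presumably uses a faster reduction
// (Barrett/Montgomery) rather than a plain u128 remainder.
fn mul_assign(a: &mut [Vec<u64>], b: &[Vec<u64>], moduli: &[u64]) {
    thread::scope(|s| {
        for ((row_a, row_b), &q) in a.iter_mut().zip(b).zip(moduli) {
            s.spawn(move || {
                for (x, &y) in row_a.iter_mut().zip(row_b) {
                    *x = ((*x as u128 * y as u128) % q as u128) as u64;
                }
            });
        }
    });
}

fn main() {
    let moduli = [4611686018427387847u64, 4611686018427387817];
    let mut a = vec![vec![3u64; 8], vec![5u64; 8]];
    let b = vec![vec![7u64; 8], vec![11u64; 8]];
    mul_assign(&mut a, &b, &moduli);
    assert_eq!(a[0], vec![21u64; 8]);
    assert_eq!(a[1], vec![55u64; 8]);
}
```

Note that with a shape like this, thread spawn/scheduling overhead is paid per call, which matters when the per-modulus work (a few tens of microseconds) is small.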
Benchmarks for mul_assign
Moduli count is the number of moduli in the polynomial. In mul_assign, each modulus is expected to be processed on a different thread. Num threads is the number of threads that rayon is allowed to use, set either by calling num_threads() or by setting the RAYON_NUM_THREADS environment variable. n is the degree of the polynomial.

Since we parallelize mul_assign across Axis(0), that is, per modulus, and the number of moduli is set to 2, doubling the number of threads from 1 to 2 should halve the time. But, as visible from the benchmarks, this isn't the case. For example, take n=32768. With num threads set to 1 it takes 177.51 µs on Intel, and with num threads set to 2 it takes 104.93 µs, which is only a 1.7x speedup instead of 2x.
Obviously the difference between 1.7x and 2x is not much, so I wondered how the numbers hold for a moduli count of 4.
Below are benchmarks with moduli set to 4. I ran them for num threads 1 and 4, because with 4 threads the vectors belonging to each of the 4 moduli should be processed on 4 different threads.
As visible, the time with num threads set to 4 isn't 1/4 of the time with num threads set to 1. What's worse, on Intel for n=32768 with 4 threads performance only improves by 1.22x.
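As a back-of-the-envelope check, inverting Amdahl's law tells us what serial fraction of the work would explain each measured speedup. This is just arithmetic on the numbers reported above, not a claim about where the time actually goes:

```rust
// Amdahl's law: speedup(p) = 1 / (s + (1 - s) / p), where s is the serial
// fraction of the work. Inverting for a measured speedup x on p threads:
// s = (p / x - 1) / (p - 1).
fn serial_fraction(p: f64, x: f64) -> f64 {
    (p / x - 1.0) / (p - 1.0)
}

fn main() {
    // Intel, n = 32768: 1.7x with 2 threads, 1.22x with 4 threads.
    println!("{:.2}", serial_fraction(2.0, 1.7));  // ≈ 0.18
    println!("{:.2}", serial_fraction(4.0, 1.22)); // ≈ 0.76
}
```

The two runs imply very different serial fractions (≈18% vs ≈76%), which a fixed serial portion can't explain; that points at a per-thread cost that grows with the thread count, such as thread wake-up/scheduling overhead or memory-bandwidth contention, rather than a constant serial section.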
I am unsure why this is the case. Any suggestions? I'd also encourage others to verify the benchmarks.