multicore mul_assign does not improve performance as expected #1

Open
Janmajayamall opened this issue Jun 13, 2023 · 4 comments
Benchmarks for mul_assign

Moduli Count: 2

M1 Pro
Num thread = 1
poly/mul_assign/n=2048  time:   [10.447 µs 10.545 µs 10.652 µs]
poly/mul_assign/n=8192  time:   [55.902 µs 57.877 µs 60.232 µs]
poly/mul_assign/n=32768 time:   [131.06 µs 137.26 µs 143.14 µs]

Num thread = 2
poly/mul_assign/n=2048  time:   [9.3860 µs 9.4671 µs 9.5760 µs]
poly/mul_assign/n=8192  time:   [37.545 µs 38.424 µs 39.509 µs]
poly/mul_assign/n=32768 time:   [102.68 µs 106.68 µs 110.04 µs]


Intel (c3-highcpu-8 on GCP; 8 vCPUs = 4 cores)
Num thread = 1
poly/mul_assign/n=2048  time:   [17.455 µs 17.741 µs 18.101 µs]
poly/mul_assign/n=8192  time:   [48.850 µs 48.970 µs 49.119 µs]
poly/mul_assign/n=32768 time:   [176.64 µs 177.51 µs 178.85 µs]

Num thread = 2
poly/mul_assign/n=2048  time:   [13.085 µs 13.252 µs 13.461 µs]
poly/mul_assign/n=8192  time:   [33.119 µs 34.885 µs 36.532 µs]
poly/mul_assign/n=32768 time:   [102.00 µs 104.93 µs 107.91 µs]

Moduli count is the number of moduli in the polynomial. In mul_assign, each modulus is expected to be processed on a different thread. Num thread is the number of threads rayon is allowed to use, set either by calling num_threads() or via the RAYON_NUM_THREADS environment variable. n is the degree of the polynomial.

Since we parallelize mul_assign across Axis(0), that is per modulus, and the number of moduli is set to 2, doubling the number of threads from 1 to 2 should halve the time. But, as the benchmarks show, this isn't the case. For example, take n=32768 on Intel: with num thread = 1 it takes 177.51 µs, and with num thread = 2 it takes 104.93 µs, which is only a 1.7x speedup instead of 2x.

Admittedly, the difference between 1.7x and 2x is not large, so I wondered how the numbers hold up for a moduli count of 4.

Below are benchmarks with the moduli count set to 4. I ran them for num thread = 1 and 4, since with 4 threads the vectors belonging to the 4 moduli should each be processed on a different thread.

Moduli Count = 4

M1 Pro

Num thread = 1
poly/mul_assign/n=2048  time:   [21.444 µs 21.894 µs 22.494 µs]
poly/mul_assign/n=8192  time:   [67.386 µs 68.144 µs 68.947 µs]
poly/mul_assign/n=32768 time:   [231.78 µs 232.42 µs 233.14 µs]

Num thread = 4
poly/mul_assign/n=2048  time:   [18.290 µs 18.616 µs 19.039 µs]
poly/mul_assign/n=8192  time:   [46.190 µs 46.922 µs 47.759 µs]
poly/mul_assign/n=32768 time:   [97.121 µs 105.42 µs 113.17 µs]

Intel (c3-highcpu-8 on GCP; 8 vCPUs = 4 cores)
Num thread = 1
poly/mul_assign/n=2048  time:   [28.046 µs 28.602 µs 29.302 µs]
poly/mul_assign/n=8192  time:   [91.766 µs 93.288 µs 95.162 µs]
poly/mul_assign/n=32768 time:   [360.53 µs 361.45 µs 362.84 µs]

Num thread = 4
poly/mul_assign/n=2048  time:   [18.743 µs 18.964 µs 19.218 µs]
poly/mul_assign/n=8192  time:   [52.354 µs 53.754 µs 55.209 µs]
poly/mul_assign/n=32768 time:   [291.93 µs 294.22 µs 296.67 µs]

As visible, the time with num thread set to 4 isn't 1/4 of the time with num thread = 1. What's worse, on Intel for n=32768, 4 threads only improve performance by 1.22x.

I am unsure why this is the case. Any suggestions? I'd also encourage others to verify the benchmarks.

Janmajayamall changed the title from "mul_assign" to "multicore mul_assign does not improve performance as expected" on Jun 13, 2023
@Janmajayamall (Owner) commented:

Moduli Count: 2

M1 Pro

Num thread = 1
poly/mul_assign/n=2048  time:   [13.385 µs 13.471 µs 13.573 µs]
poly/mul_assign/n=8192  time:   [39.681 µs 39.728 µs 39.792 µs]
poly/mul_assign/n=32768 time:   [151.54 µs 151.62 µs 151.73 µs]

Num thread = 2
poly/mul_assign/n=2048  time:   [10.333 µs 10.383 µs 10.448 µs]
poly/mul_assign/n=8192  time:   [23.465 µs 23.524 µs 23.596 µs]
poly/mul_assign/n=32768 time:   [78.530 µs 78.614 µs 78.713 µs]

Intel (c3-highcpu-8 on GCP; 8 vCPUs = 4 cores)

Num thread = 1
poly/mul_assign/n=2048  time:   [17.073 µs 17.301 µs 17.573 µs]
poly/mul_assign/n=8192  time:   [54.544 µs 56.398 µs 58.400 µs]
poly/mul_assign/n=32768 time:   [180.22 µs 180.48 µs 180.77 µs]

Num thread = 2
poly/mul_assign/n=2048  time:   [13.983 µs 14.253 µs 14.571 µs]
poly/mul_assign/n=8192  time:   [37.040 µs 38.126 µs 39.957 µs]
poly/mul_assign/n=32768 time:   [101.31 µs 102.59 µs 104.20 µs]

Moduli Count: 4 

M1 Pro 

Num thread = 1
poly/mul_assign/n=2048  time:   [21.943 µs 22.058 µs 22.207 µs]
poly/mul_assign/n=8192  time:   [78.400 µs 78.609 µs 78.869 µs]
poly/mul_assign/n=32768 time:   [299.03 µs 299.64 µs 300.32 µs]

Num thread = 4
poly/mul_assign/n=2048  time:   [13.139 µs 13.242 µs 13.354 µs]
poly/mul_assign/n=8192  time:   [32.000 µs 32.759 µs 33.552 µs]
poly/mul_assign/n=32768 time:   [120.78 µs 121.33 µs 121.93 µs]

Intel (c3-highcpu-8 on GCP; 8 vCPUs = 4 cores)

Num thread = 1 
poly/mul_assign/n=2048  time:   [31.002 µs 31.210 µs 31.460 µs]
poly/mul_assign/n=8192  time:   [93.601 µs 93.730 µs 93.916 µs]
poly/mul_assign/n=32768 time:   [359.79 µs 360.19 µs 360.62 µs]

Num thread = 4
poly/mul_assign/n=2048  time:   [25.843 µs 26.133 µs 26.437 µs]
poly/mul_assign/n=8192  time:   [59.301 µs 60.815 µs 62.316 µs]
poly/mul_assign/n=32768 time:   [289.85 µs 291.35 µs 293.09 µs]

After Ignacio Hagopian's suggestion, I changed the batch size for the bench to BatchSize::PerIteration. Previously I had set the batch size to 1000. Apparently even a batch size of 1000 was affecting the benches. For example, for moduli count = 2 on M1, except for n=2048, the time with num thread = 2 is approximately half of the time with num thread = 1. But this does not follow through for moduli count 4: I find it odd that instead of the time dropping to 1/4 of the single-thread time, all values only drop to approximately half.
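
For reference, the change amounts to something like the following criterion wiring. This is a hedged sketch, not the repo's actual bench code: `random_poly` and the bench name are placeholders.

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

fn bench_mul_assign(c: &mut Criterion) {
    c.bench_function("poly/mul_assign/n=32768", |b| {
        b.iter_batched(
            || (random_poly(), random_poly()), // fresh inputs per iteration
            |(mut p, q)| p.mul_assign(&q),
            // PerIteration: every measured call gets its own freshly set-up
            // input, instead of reusing one pre-built batch (e.g. of 1000).
            BatchSize::PerIteration,
        )
    });
}

criterion_group!(benches, bench_mul_assign);
criterion_main!(benches);
```

Large batch sizes keep many input copies alive at once, which can perturb cache behaviour relative to how mul_assign is called in practice.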

Moreover, nothing changes on Intel. I suspect I am messing something up. Can anyone verify?

@Janmajayamall (Owner) commented Jun 14, 2023:

If anyone is curious about what the ideal times should be:

Run `cargo bench modulus/mul_mod_fast_vec/32768`; it will display the bench for mul_mod_fast_vec at n=32768. When I run it on Intel it shows 81 µs. A mul_assign with moduli count 2 performs 2 calls to mul_mod_fast_vec, so ideally poly/mul_assign/n=32768 with 1 thread should take 2 * 81 µs on Intel (it takes 180 µs, which is close). With 2 threads, the time should be around 81 µs, but it isn't. Following this reasoning for the other benches, you can reasonably approximate the ideal times and verify whether everything works as expected.

@Janmajayamall (Owner) commented:

I increased the moduli count to check what happens. It looks like performance gets close to what we expect when the cores are oversubscribed, that is, when the number of threads is greater than the number of cores.

Following are benchmarks when I set moduli to 16.

Intel (c3-highcpu-8 on GCP; 8 vCPUs = 4 cores)

Num thread = 1
poly/mul_assign/n=2048  time:   [98.786 µs 99.094 µs 99.532 µs]
poly/mul_assign/n=8192  time:   [347.95 µs 348.16 µs 348.39 µs]
poly/mul_assign/n=32768 time:   [1.3415 ms 1.3421 ms 1.3426 ms]

Num thread = 2
poly/mul_assign/n=2048  time:   [60.983 µs 61.947 µs 63.097 µs]
poly/mul_assign/n=8192  time:   [183.04 µs 183.34 µs 183.66 µs]
poly/mul_assign/n=32768 time:   [701.45 µs 702.11 µs 702.84 µs]

Num thread = 4
poly/mul_assign/n=2048  time:   [47.943 µs 48.379 µs 48.867 µs]
poly/mul_assign/n=8192  time:   [147.05 µs 148.59 µs 150.31 µs]
poly/mul_assign/n=32768 time:   [392.58 µs 396.88 µs 402.06 µs]

With 16 moduli, the time for 2 threads and 4 threads should be 1/2 and 1/4 of the time for 1 thread. Notice the pattern: as n increases, the timings get closer to what we expect. For example, when n=32768 the time with 4 threads is 396 µs and with 2 threads is 702 µs, approximately 1/4 and 1/2 of the 1.3421 ms taken with 1 thread (speedups of roughly 3.4x and 1.9x).

I find this behaviour odd. We are definitely spending time doing something that we shouldn't be.

@Janmajayamall (Owner) commented:

Here's the function that performs the mul_assign operation a bunch of times in main.rs. Could someone with profiling experience help us figure out where we are spending time?
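
For anyone willing to take a stab at profiling, a couple of standard starting points; the binary and bench names below are placeholders, not the repo's actual targets.

```shell
# Flamegraph of the mul_assign loop (needs the cargo-flamegraph subcommand)
cargo install flamegraph
cargo flamegraph --release          # or: cargo flamegraph --bench <bench-name>

# On Linux, raw perf counters often reveal synchronization/idle overhead
perf stat -d ./target/release/<binary>
perf record -g ./target/release/<binary> && perf report
```

A flamegraph with wide bars inside rayon's internals (rather than in the modular-multiplication loop) would point at dispatch/join overhead.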
