mitigate regression induced by the parallel implementation #230

imrn99 · 2024-10-31T09:14:13Z

imrn99
Oct 31, 2024
Maintainer

I noticed that a huge regression in performance happened with the changes of #201.

Observed changes

while we can expect a loss of speed due to sync overhead, the observed changes were absurdly important:

bench	after (`fdb2ed9`)	before (`b3e6c20`)
`builder-time`	45ms	1.34ms
`grisubal-time`	144ms	5ms

Cause

I haven't fully identified the cause(s) yet, but my first impression is that the initalization of TVars is very costly, see the snippet below from the stm crate source code:

impl<T> TVar<T>
    where T: Any + Sync + Send + Clone
{
    /// Create a new `TVar`.
    pub fn new(val: T) -> TVar<T> {
        TVar {
            control_block: VarControlBlock::new(val), // this right here
            _marker: PhantomData,
        }
    }

	// ...
}

impl VarControlBlock {
    /// create a new empty `VarControlBlock`
    pub fn new<T>(val: T) -> Arc<VarControlBlock>
        where T: Any + Sync + Send
    {
        let ctrl = VarControlBlock {
            waiting_threads: Mutex::new(Vec::new()), 
            dead_threads: AtomicUsize::new(0),
            value: RwLock::new(Arc::new(val)),
        };
        Arc::new(ctrl)
    }

	// ...
}

imrn99 · 2024-11-06T08:49:01Z

imrn99
Nov 6, 2024
Maintainer Author

I'm running experiment with a fork of the stm crate. I tried replacing some internal data structures with no-alloc variants (using heapless & arrayvec). There seem to be little to no change in performance, although this needs confirmation with benchmarks on more consistent machines.

There is one structure that I haven't been able to replace though: a BTreeMap used to keep track of accessed variable in a Transaction. There are no available equivalent atm (heapless's LinearMap lacks one method).

Essentially, TVar's inits and operations take much more time (several order of magnitudes for inits) than atomics (even Atomic<T> from the atomic crate). This raises multiple questions:

how much could we improve the implementation? how performant can we make a new implementation?
is the tradeoff worth it? guaranteeing operation correctness is very important, is it worth downgrading performance that much?
can we consider having multiple implementations?

0 replies

imrn99 · 2024-11-06T13:59:44Z

imrn99
Nov 6, 2024
Maintainer Author

perf report of the sequential impl, benching the builder binary:

perf report of the parallel impl:

it seems like the stm variables require that many more cycles to run; for example, the number of cycles dedicated to computing orbits goes from 10% down to less than 1%

0 replies

imrn99 · 2024-11-06T14:31:24Z

imrn99
Nov 6, 2024
Maintainer Author

perf stat -d of the builder binary, seq impl:

 Performance counter stats for './target/profiling/builder 1024 1024':

            137,67 msec task-clock                       #    0,997 CPUs utilized
                 2      context-switches                 #   14,527 /sec
                 1      cpu-migrations                   #    7,264 /sec
            36 951      page-faults                      #  268,402 K/sec
       742 525 510      cycles                           #    5,394 GHz                         (70,98%)
       151 260 278      stalled-cycles-frontend          #   20,37% frontend cycles idle        (71,01%)
     2 331 338 369      instructions                     #    3,14  insn per cycle
                                                  #    0,06  stalled cycles per insn     (70,91%)
       418 462 962      branches                         #    3,040 G/sec                       (71,55%)
         5 624 721      branch-misses                    #    1,34% of all branches             (72,30%)
       699 688 406      L1-dcache-loads                  #    5,082 G/sec                       (72,01%)
         9 674 762      L1-dcache-load-misses            #    1,38% of all L1-dcache accesses   (71,24%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0,138110185 seconds time elapsed

       0,085484000 seconds user
       0,051671000 seconds sys

parallel impl:

 Performance counter stats for 'taskset -c 0 ./target/profiling/builder 1024 1024':

          2 723,26 msec task-clock                       #    0,998 CPUs utilized
               137      context-switches                 #   50,307 /sec
                 1      cpu-migrations                   #    0,367 /sec
           712 863      page-faults                      #  261,768 K/sec
    14 732 320 056      cycles                           #    5,410 GHz                         (71,36%)
     2 820 127 554      stalled-cycles-frontend          #   19,14% frontend cycles idle        (71,43%)
    44 049 373 349      instructions                     #    2,99  insn per cycle
                                                  #    0,06  stalled cycles per insn     (71,43%)
     8 484 034 434      branches                         #    3,115 G/sec                       (71,46%)
       110 288 231      branch-misses                    #    1,30% of all branches             (71,49%)
    13 633 379 729      L1-dcache-loads                  #    5,006 G/sec                       (71,45%)
       227 469 230      L1-dcache-load-misses            #    1,67% of all L1-dcache accesses   (71,39%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       2,728805716 seconds time elapsed

       1,909099000 seconds user
       0,798234000 seconds sys

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mitigate regression induced by the parallel implementation #230

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

mitigate regression induced by the parallel implementation #230

imrn99 Oct 31, 2024 Maintainer

Observed changes

Cause

Replies: 3 comments

imrn99 Nov 6, 2024 Maintainer Author

imrn99 Nov 6, 2024 Maintainer Author

imrn99 Nov 6, 2024 Maintainer Author

imrn99
Oct 31, 2024
Maintainer

imrn99
Nov 6, 2024
Maintainer Author

imrn99
Nov 6, 2024
Maintainer Author

imrn99
Nov 6, 2024
Maintainer Author