sapanaKale edited this page Nov 22, 2021 · 2 revisions

ADRs

Best Parallel simulations implementation as of now

Date: 7th Oct 2021

Prologue

Identify the most performant data-parallelism branch to merge back into master.

Context

The following branches in the Epirust source code implement data parallelism in different ways:

  1. parallel
  2. map_reduce
  3. read_only_view

Details can be found in the doc. We benchmarked all of these branches against master with different numbers of threads to find the most performant implementation so far.

Decision

Based on the benchmark data, the parallel branch appears more performant than the other branches, and does so with fewer threads.

Considerations and Assumptions

The benchmark simulations were run on Gondor.

Consequences

A new field, number_of_threads, is added to the config file.
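
As a rough illustration of how such a field might be consumed, the sketch below resolves a worker count from a config value. Only the name number_of_threads comes from this ADR. The `Config` struct, the optional field, and the fallback to the machine's available parallelism are assumptions made for the example, not a description of the Epirust code.

```rust
use std::thread;

/// Hypothetical slice of the simulation config.
/// Only the field name `number_of_threads` is taken from the ADR.
struct Config {
    /// Assumption: the field may be absent, in which case we fall back.
    number_of_threads: Option<usize>,
}

/// Returns the number of worker threads to use.
/// A positive configured value wins; otherwise we ask the OS how much
/// parallelism is available and default to 1 if that query fails.
fn resolve_thread_count(config: &Config) -> usize {
    match config.number_of_threads {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1),
    }
}
```

With rayon, a value resolved this way would typically be handed to `rayon::ThreadPoolBuilder::new().num_threads(n)` when building the pool.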

Related ADR/Spike

Supporting Document

  1. Benchmarks data
  2. Details of different parallel approaches

Parallel chunks and Parallel iter comparison

Date: 7th Oct 2021

Prologue

Try par_chunks with different chunk sizes in place of par_iter and compare the performance.

Context

To run simulations in parallel, we use a parallel iterator from the rayon crate, which distributes work between threads using a work-stealing mechanism. We wanted to see whether finding an ideal chunk size and distributing the work between threads in those chunks would be more efficient, so we ran a few benchmarks for both implementations.
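
The idea of handing work to threads in fixed-size chunks can be sketched with the standard library alone. This is not rayon's par_chunks and does not implement work stealing: it is a simplified stand-in in which a fixed set of workers claims the next unprocessed chunk from a shared counter. The names `step` and `process_in_chunks` are hypothetical, and `step` is a placeholder for the per-agent simulation work.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical per-agent work; stands in for one simulation step.
fn step(agent: u64) -> u64 {
    agent * 2
}

/// Processes `agents` in chunks of `chunk_size`, handed out on demand
/// to `workers` threads, and returns the sum of all `step` results.
fn process_in_chunks(agents: &[u64], chunk_size: usize, workers: usize) -> u64 {
    // Guard against a zero chunk size, which `chunks` would reject.
    let chunk_size = chunk_size.max(1);
    let chunks: Vec<&[u64]> = agents.chunks(chunk_size).collect();
    // Index of the next chunk that no worker has claimed yet.
    let next = AtomicUsize::new(0);

    thread::scope(|s| {
        let handles: Vec<_> = (0..workers.max(1))
            .map(|_| {
                s.spawn(|| {
                    let mut local = 0u64;
                    loop {
                        // Claim one chunk; each index is handed out exactly once.
                        let i = next.fetch_add(1, Ordering::Relaxed);
                        match chunks.get(i) {
                            Some(chunk) => {
                                local += chunk.iter().map(|&a| step(a)).sum::<u64>()
                            }
                            // No chunks left: return this worker's partial sum.
                            None => break local,
                        }
                    }
                })
            })
            .collect();
        // Combine the partial sums from all workers.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}
```

For example, `process_in_chunks(&agents, 8, 4)` processes the agents in chunks of 8 across 4 threads. The result does not depend on the chunk size; only the granularity of scheduling changes, which is the quantity the benchmarks varied.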

Decision

The benchmark data shows no significant performance difference from using parallel chunks. The best performance we get with the ideal chunk size is almost the same as with the parallel iterator. We also profiled both implementations with perf and generated CPU flame graphs; they show no difference in CPU consumption.

Considerations and Assumptions

The benchmark simulations were run locally (on a MacBook) in mini_epirust.

Consequences

Related ADR/Spike

Supporting Document

  1. Benchmarks data
  2. CPU flamegraph for par_chunks implementation
  3. CPU flamegraph for par_iter implementation