Benchmarking Rayon
Benchmarking Rayon is very much a "work in progress". It is hard to estimate the impact of a change to the core scheduler across the wide diversity of machines and workloads that people encounter in practice. Also, we don't have a wide set of benchmarks showing representative use cases of Rayon "in production". So, if you are using Rayon in production, contributing benchmarks would be very helpful!
With that said, this page exists to document the benchmarks and tooling considerations we do have.
It may seem obvious, but the first thing to do when reporting any results is to document your setup. What OS are you running? How many cores do you have? And so forth. It's hard to do a thorough job -- even bash environment variables can affect performance -- but it's important to have at least a basic picture.
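As a starting point, a few of these facts can be collected programmatically and pasted into a report. The sketch below uses only the Rust standard library; the function name `describe_setup` and the output format are my own invention, and the list of fields is far from complete (it says nothing about CPU model, memory, or kernel version, for example).

```rust
use std::env;
use std::thread;

/// Collects a few basic facts about the machine running the benchmark.
/// Hypothetical helper: extend it with whatever matters for your setup.
fn describe_setup() -> String {
    // Number of hardware threads the OS reports as available to this process.
    let cores = thread::available_parallelism()
        .map(|n| n.get().to_string())
        .unwrap_or_else(|_| "unknown".to_string());

    // RAYON_NUM_THREADS, when set, overrides the default size of Rayon's
    // global thread pool, so it is worth recording alongside the core count.
    let rayon_threads =
        env::var("RAYON_NUM_THREADS").unwrap_or_else(|_| "unset".to_string());

    format!(
        "os={} arch={} available_parallelism={} RAYON_NUM_THREADS={}",
        env::consts::OS,
        env::consts::ARCH,
        cores,
        rayon_threads
    )
}
```

Calling `println!("{}", describe_setup())` at the start of a benchmark run gives a one-line summary to include with your numbers.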
The most basic thing to run is rayon-demo. I often use the cargo-chrono tool to test the impact of multiple commits. The tool allows you to do multiple runs and compute the median, which helps to eliminate the impact of noise. The following is a standard command line of mine. You have to change the XXX as appropriate to name the PR and the branch you are testing:
cargo-chrono bench -f rayon-prXXX.csv --commits "nikomatsakis/master XXX" --repeat 15 \
--ignore-dirty '*pr*csv' --ignore-dirty '*svg' \
join_microbench \
nbody_par \
quick_sort_par_bench \
merge_sort_par_bench \
parallel_find_last \
parallel_find_missing \
dj10 \
fibonacci_split_recursive
This will generate a file (rayon-prXXX.csv) containing the measurement results. You can then generate plots like so:
cargo-chrono plot -f rayon-prXXX.csv --output-file rayon-prXXX-points.svg
cargo-chrono plot -f rayon-prXXX.csv --output-file rayon-prXXX-points-medians.svg --medians --normalize
The first command shows all measurements. The second command computes medians, normalized to 100. This is pretty useful for detecting deviations one way or the other. I usually eyeball the first plot to make sure the measurements are clustered in a tight little group and then look at the second one to see if anything really changed.
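To make the second plot concrete, here is a minimal sketch of what "medians, normalized to 100" could mean. It assumes that normalizing means scaling each median so that the baseline commit's median becomes 100; I have not checked this against cargo-chrono's source, so treat it as an illustration of the idea rather than the tool's actual implementation. The names `median` and `normalize_to_100` are hypothetical.

```rust
/// Median of a set of timings (for example, nanoseconds per iteration).
/// Returns None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order on f64, so the sort handles NaN safely.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle values.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Scales `candidate` so that `baseline` maps to 100 (assumed convention).
fn normalize_to_100(baseline: f64, candidate: f64) -> f64 {
    candidate * 100.0 / baseline
}
```

Under this convention, a candidate whose normalized median is 105 ran about 5% slower than the baseline, and one at 95 ran about 5% faster.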
Servo is a big application that is now using rayon in a number of different ways.
XXX this needs to be written
Stylo is an important use case for Rayon. It is difficult to profile because:
(1) Stylo (and Rayon) code is embedded within the mass of Gecko, which makes it hard to identify the parts that are Stylo-specific.
(2) A typical Stylo parallel pass takes only a few milliseconds or tens of milliseconds, so there's a lot of noise in the timing numbers. It also makes it impractical to profile Stylo with profilers that you attach to and detach from a process by hand, because the pass is over before you can do so.
(3) We can only indirectly control the parallel workload that Stylo presents to Rayon, by messing with the DOM that we ask Stylo to style.
To deal with (1) and (2), install the patch at https://bugzilla.mozilla.org/show_bug.cgi?id=1367962. This causes Stylo to iterate the number of times specified by env var STYLO_ITERS. Although less than perfect, this makes it possible to magnify the Stylo/Rayon costs arbitrarily, relative to the rest of Gecko. It also prints an iteration counter, so you can see when to attach/detach/start/stop your favourite profiler, like the Gecko profiler or VTune.
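The real change lives in the Gecko patch linked above. Purely to illustrate the idea, here is a standalone sketch of repeating a short pass a configurable number of times while printing a counter. Everything in it is an assumption on my part: the helper names `parse_iters` and `run_repeated`, the fallback to a single iteration, and the format of the counter are not taken from the patch.

```rust
/// Parses an iteration count such as the value of STYLO_ITERS.
/// Assumed behaviour: fall back to 1 when the value is missing, invalid, or zero.
fn parse_iters(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(1)
}

/// Runs `pass` `iters` times, printing a counter before each iteration so
/// that you can see when to start or stop a profiler.
fn run_repeated<F: FnMut()>(iters: usize, mut pass: F) {
    for i in 0..iters {
        eprintln!("iteration {}/{}", i + 1, iters);
        pass();
    }
}
```

A caller would read the variable with `std::env::var("STYLO_ITERS").ok()`, feed it through `parse_iters`, and pass the result to `run_repeated` together with the work to repeat.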
To deal with (3), you might consider using the test cases from https://bugzilla.mozilla.org/show_bug.cgi?id=1368415. These generate simple DOMs (derivatives of the bloom-basic test) that exhibit specific parallel workloads.
Some other hints:
- I disable e10s to reduce the number of processes and, for profilers that create logfiles, the number of logfiles involved.
- I run 'top' in a shell, with a 1-second update time. With STYLO_ITERS set to a large number, the %CPU column gives you some impression of the level of parallelism actually achieved. Check this -- it may be different from what you expect.
- Beware of CPU frequency scaling. In another shell, I run 'watch -n 1 "grep MHz /proc/cpuinfo"'. Make sure your cores are running consistently at the speed you expect while profiling.
- Expect noisy results, even more so than when profiling sequential code. I take multiple measurements of each test and use the lowest value, rather than the median or the average. This is on the basis that there's some minimum number of cycles needed to run the test workload, and so the minimum time is the most accurate value. Said differently, the underlying noise distribution in timing measurements is one-sided, not two-sided as averaging or taking the median assumes.
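The "take the minimum" approach from the last hint is easy to sketch with the standard library. The helper name `min_time` is hypothetical, and the sketch measures wall-clock time with `std::time::Instant` rather than cycles, so it only approximates the reasoning above.

```rust
use std::time::{Duration, Instant};

/// Runs `workload` `runs` times and returns the shortest wall-clock time
/// observed. Returns None when `runs` is zero.
fn min_time<F: FnMut()>(runs: usize, mut workload: F) -> Option<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed()
        })
        // Noise can only make a run slower, so keep the fastest one.
        .min()
}
```

For example, `min_time(15, || my_workload())` would report the best of 15 runs, where `my_workload` stands in for whatever you are measuring.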