Benchmarks of the Juvix C runtime
Benchmarked version: commit 148ececb4d4259eacbb980f5992073a3ac611d82 from 31.10.2022
We benchmark several programs manually compiled into the primitives of the new Juvix C runtime. The code corresponds closely to the code that will be generated by the new compilation process, with basic low-level optimisations (unboxing, untagging, etc.) but without any high-level optimisations on JuvixCore (inlining, specialisation, constant folding, fusion, etc.). This corresponds to the compilation process planned for the 0.4 milestone.
We compare the running time and memory usage with analogous programs written in Haskell, OCaml, JuvixCore (using the evaluator), old Juvix (with the "direct" transpilation to C) and C.
The results suggest that for most first-order programs the new
compilation process will produce code with running time comparable to
the code produced by the native OCaml compiler. For higher-order
programs heavy on closure manipulation, the results are acceptable but
noticeably worse, especially with third-order functions (i.e. functions
which take functions taking functions). This could, however, be
alleviated by implementing the specialisation optimisation (see the
"specialised" column in the `ackermann` and `mapfun` benchmarks).
Besides, functional programs of order higher than two are rare.
The comparisons with OCaml and Haskell were not entirely fair because
the new Juvix runtime does not perform garbage collection. The overhead
of garbage collection is particularly visible on the mergesort
benchmark which creates many intermediate data structures that are
quickly discarded. With proper memory management, the running time
results on first-order programs for the new Juvix runtime are expected
to become slightly worse than for the native OCaml compiler.
For simple programs operating on integers which don't require any heap
memory allocation (the `fibonacci` and `combinations` benchmarks), the
direct transpilation to C in the old Juvix seems to perform best
(behind only C). The reason is that for very simple programs `clang`
can better optimise the output of such a direct transpiler. The main
problem with the transpilation to C approach is that it cannot scale to
reliably work for more complex programs, as evidenced by the segfaults,
longer running time and higher memory use on other benchmarks.
In addition to the `fibonacci` and `combinations` benchmarks, the
advantage of direct transpilation for very simple programs is also
visible on the `fold` benchmark, where a simple loop over a list
dominates the running time. However, this is partly because the
translation of closures in the old Juvix transpilation method is
incorrect, which allows it to be more efficient.
Compute the Nth Fibonacci number modulo 2^28 (N = 100’000’000)
The Nth Fibonacci number is computed in O(N). Needs only constant stack space and no heap memory. This benchmark tests the efficiency of tail recursion and arithmetic operations.
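As a rough illustration of the computation being timed (a sketch, not the benchmark source from the repository), the accumulator-passing recursion can be written as a loop in Python, which does not eliminate tail calls. The name `fib_mod` is an assumption made for this example.

```python
MODULUS = 2 ** 28  # every sum is reduced modulo 2^28


def fib_mod(n):
    """Return the n-th Fibonacci number modulo 2^28 in O(n) steps.

    The two accumulators play the role of the arguments of a
    tail-recursive helper; no heap data structure is built.
    """
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        a, b = b, (a + b) % MODULUS
    return a
```

For example, `fib_mod(10)` evaluates to 55. The benchmark itself uses N = 100’000’000.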
Count combinations of numbers 1 to N having sum N (N = 100)
This benchmark tests the efficiency of general recursion. No heap memory needs to be allocated. Uses stack space proportional to N. The running time is exponential in N.
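A minimal sketch of this kind of general recursion is given below. It assumes that a number may be used more than once, so the count equals the number of integer partitions of N; the actual benchmark program may define "combinations" differently. The names `count_combinations` and `go` are hypothetical.

```python
def count_combinations(n):
    """Count multisets of numbers from 1..n whose elements sum to n."""

    def go(largest, remaining):
        if remaining == 0:
            return 1  # the chosen numbers add up exactly
        if largest == 0 or remaining < 0:
            return 0  # no numbers left, or the sum was overshot
        # Either use `largest` once more, or never use it again.
        # Neither call is a tail call, so the stack grows with n.
        return go(largest, remaining - largest) + go(largest - 1, remaining)

    return go(n, n)
```

With this definition `count_combinations(5)` is 7 and `count_combinations(10)` is 42. The running time grows quickly with N, so Python is only practical here for small inputs.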
Compute the Nth prime (N = 16384)
The Nth prime number is computed via the Eratosthenes sieve. A list of N primes is created. No intermediate lists are discarded (garbage collection not needed). This benchmark tests the efficiency of tail recursion, arithmetic operations, list cell allocation and access.
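The allocation pattern can be sketched as follows. Note that this example swaps in trial division by the primes collected so far, a common list-based stand-in, rather than a true array-based Eratosthenes sieve, and it is not taken from the benchmark sources. The names `nth_prime` and `has_prime_divisor` are assumptions.

```python
def has_prime_divisor(candidate, primes):
    """Check whether any prime p with p*p <= candidate divides candidate."""
    for p in primes:  # primes are stored in increasing order
        if p * p > candidate:
            return False
        if candidate % p == 0:
            return True
    return False


def nth_prime(n):
    """Return the n-th prime (1-based), building a list of n primes."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if not has_prime_divisor(candidate, primes):
            primes.append(candidate)  # the list only grows, nothing is discarded
        candidate += 1
    return primes[-1]
```

For example, `nth_prime(10)` is 29. As in the benchmark, every allocated list element stays live until the end, so a garbage collector would have nothing to reclaim.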
Merge sort a list of N integers (N = 2’000’000)
At each level of merge sort intermediate lists are created and discarded. The running time for this benchmark largely depends on the efficiency of memory management. Here one may observe the overhead of garbage collection or the memory blow-up if no garbage collection is used.
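To make the allocation behaviour concrete, here is a small sketch of a merge sort that creates fresh lists at every level. It is written for illustration only and uses Python lists instead of the linked lists a functional implementation would use; `merge` and `merge_sort` are hypothetical names.

```python
def merge(xs, ys):
    """Merge two sorted lists into a newly allocated sorted list."""
    merged = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            merged.append(xs[i])
            i += 1
        else:
            merged.append(ys[j])
            j += 1
    merged.extend(xs[i:])  # at most one of these two tails is non-empty
    merged.extend(ys[j:])
    return merged


def merge_sort(xs):
    """Sort by splitting, sorting both halves and merging the results."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    # Both halves and both sorted halves are intermediate lists that
    # become garbage as soon as `merge` returns.
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))
```

Each of the roughly log2(N) levels allocates lists with N elements in total. Without garbage collection all of them stay in memory, which is the blow-up mentioned above.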
Optionally sum N integers from a binary tree K times (N = 2^20, K = 100)
If a fixed number k is encountered in the tree then the result is
`Nothing`, otherwise it is `Just sum`. The computation is repeated
for values of k from 0 to K. This tests the efficiency of handling
optional values and data structure access.
Fold a list of N integers K times (N = 100’000, K = 1000)
The sum of N natural numbers is computed via foldleft (tail-recursive). The computation is repeated K times. The list is created only once, so that allocation time does not dominate. This benchmark tests the efficiency of closure call and list cell access.
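A short sketch of the structure of this benchmark follows. The loop in `foldleft` stands in for the tail-recursive walk over list cells, since Python does not eliminate tail calls. The helper `repeated_sum` is a hypothetical name.

```python
def foldleft(f, acc, xs):
    """Left fold: combine acc with each element in order, one call of f per element."""
    for x in xs:
        acc = f(acc, x)
    return acc


def repeated_sum(n, k):
    """Sum the numbers 1..n via foldleft, repeating the fold k times."""
    xs = list(range(1, n + 1))  # built once, outside the repetition
    plus = lambda a, x: a + x   # the closure passed to foldleft
    total = 0
    for _ in range(k):
        total = foldleft(plus, 0, xs)
    return total
```

For example, `repeated_sum(100, 3)` evaluates to 5050. The cost is dominated by the call of `f` on each element, which is what the benchmark is meant to expose.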
Compute the Nth Fibonacci number modulo 2^28 with CPS (N = 100’000’000)
The function computing the Nth Fibonacci number is written in continuation-passing style, tail-recursively calling a continuation supplied as an argument. This benchmark tests the efficiency of closure call and allocation.
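One possible shape of such a program is sketched below. This is an assumption about the general pattern, not a copy of the benchmark source: `step` never calls itself directly and instead tail-calls the continuation `cont` it receives as an argument.

```python
MODULUS = 2 ** 28


def step(k, a, b, cont):
    """Perform one Fibonacci step, then hand control to the continuation."""
    if k == 0:
        return a
    return cont(k - 1, b, (a + b) % MODULUS)  # tail call through a function value


def fib_cps(k, a=0, b=1):
    """Compute F(k) mod 2^28, passing fib_cps itself as the continuation."""
    return step(k, a, b, fib_cps)
```

For example, `fib_cps(10)` evaluates to 55. Python does not eliminate tail calls, so this sketch only works for small N; a runtime that does can run it in constant stack space.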
Map and fold a list of N integers K times (N = 10000, K = 10000)
This benchmark tests the efficiency of standard higher-order functions on lists, closure call and memory management. The program allocates O(K) intermediate lists of length N which are quickly discarded.
Compute Ack(3, N) with the higher-order Ackermann function definition (N = 11)
The higher-order Ackermann function definition iterates an iteration of function compositions. Hence, it uses a third-order invocation of an iteration function. This benchmark tests the efficiency of creating and calling second-order closures, and of partial application.
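The definition can be sketched as follows, using the identity Ack(m+1, n) = (Ack m) applied n+1 times to 1. The names `iterate`, `compose_step` and `ackermann` are chosen for this example and need not match the benchmark sources.

```python
def iterate(f, n):
    """Return the n-fold composition of f as a new function."""
    def iterated(x):
        for _ in range(n):
            x = f(x)
        return x
    return iterated


def compose_step(f):
    """Map Ack(m, -) to Ack(m+1, -): apply f (n+1) times, starting from 1."""
    return lambda n: iterate(f, n + 1)(1)


def ackermann(m):
    """Return the function n -> Ack(m, n).

    `iterate` is applied to `compose_step`, which itself takes and
    returns functions, hence the third-order use of `iterate`.
    """
    successor = lambda n: n + 1  # Ack(0, n) = n + 1
    return iterate(compose_step, m)(successor)
```

For example, `ackermann(2)(3)` is 9 and `ackermann(3)(3)` is 61, in line with Ack(3, n) = 2^(n+3) - 3.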
Successively map K functions to a list of N integers (K = 100, N = 10000)
The benchmark stores K second-order closures in a list, maps them successively to a list of K closures, and then successively maps the K closures from the result to a list of N integers. This benchmark tests the efficiency of manipulating closures and storing them in data structures.
The benchmark programs can be found in `tests/benchmark` in the Juvix
source directory.
For each program the total running time (elapsed real time) and memory use (maximum resident set size) were measured on an M1 iMac with no significant background activity. Averages of several runs were taken. The variance was negligible, unless indicated otherwise by providing a range.
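The text does not say which tool was used for the measurements. As one possible approach on a Unix-like system, elapsed time and maximum resident set size of a child process can be collected with Python's standard `subprocess` and `resource` modules. The function name `measure` is an assumption.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command once; return (elapsed seconds, peak RSS in MB).

    RUSAGE_CHILDREN reports the largest peak over all children waited
    for so far, so use a fresh measuring process per benchmark program.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return elapsed, usage.ru_maxrss / divisor
```

A call such as `measure([sys.executable, "-c", "pass"])` returns a pair of positive numbers. Averaging several such runs would correspond to the procedure described above.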
Compute the Nth Fibonacci number modulo 2^28 (N = 100’000’000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.26 | 0.35 | 0.35 | 0.23 | 13.15 | 10.03 | 0.39 | 0.35 | 0.94 | 0.16 | 0.22 |
Memory use (MB, max RSS) | 1.5 | 3.8 | 1.3 | 8.8 | 21.3 | 8067.7 | 9.7 | 1.7 | 1.8 | 1.3 | 4.0 |
Count all combinations of numbers 1 to N having sum N (N = 100)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 6.67 | 11.25 | 3.22 | 5.1 | 441.71 | 5.48 | 5.48 | 6.53 | 41.08 | 2.69 | 4.80 |
Memory use (MB, max RSS) | 1.5 | 3.9 | 1.3 | 8.9 | 22.3 | 9.6 | 9.6 | 1.7 | 1.9 | 1.3 | 4.0 |
Compute the Nth prime (N = 16384)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 1.52 | 1.91 | segfault | 3.09 | 167.04 | 3.85 | 3.85 | 1.68 | 14.82 | 0.12 | 0.13 |
Memory use (MB, max RSS) | 1.7 | 4.0 | segfault | 9.3 | 24.4 | 9.8 | 9.6 | 2.2 | 2.2 | 1.4 | 4.0 |
Merge sort a list of N integers (N = 2’000’000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.40 | 0.31 | 3.55 | 1.32 | 22.45 | 2.86 | 2.90 | 1.95 | 3.52 | 0.15 | 0.15 |
Memory use (MB, max RSS) | 1973.7 | 720.4 | 5046.7 | 2729.8 | 1728.9 | 253.6 | 253.6 | 172.6 | 343.1 | 24.4 | 26.8 |
Optionally sum N non-zero integers from a binary tree K times (N = 2^20, K = 100)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.45 | 0.64 | 3.29 | 1.57 | 22.75 | 5.58 | 0.59 | 0.30 | 3.57 | 0.27 | 0.50 |
Memory use (MB, max RSS) | 1.6 | 3.8 | 2646.1 | 1320.9 | 22.4 | 5560.7 | 9.7 | 3.9 | 4.0 | 1.3 | 4.1 |
Fold a list of N integers K times (N = 100’000, K = 1000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.45 | 0.54 | 0.35 | 0.23 | 15.27 | 0.58 | 0.58 | 0.36 | 1.80 | NA | NA |
Memory use (MB, max RSS) | 3.1 | 4.6 | 4.4 | 10.6 | 43.4 | 12.7 | 12.7 | 5.9 | 5.9 | NA | NA |
Compute the Nth Fibonacci number modulo 2^28 with CPS (N = 100’000’000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.43 | 0.52 | 1.56 | stack overflow | 20.22 | 10.04 | 0.39 | 0.35 | 1.60 | 0.16 | 0.25 |
Memory use (MB, max RSS) | 1.5 | 3.9 | 1539.3 | stack overflow | 21.3 | 8067.7 | 9.7 | 1.7 | 1.8 | 1.3 | 4.0 |
Map and fold a list of N integers K times (N = 10000, K = 10000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 1.01 | 1.59 | 2.74 | 1.81 | 38.24 | 1.29 | 2.42 | 1.43 | 4.22 | NA | NA |
Memory use (MB, max RSS) | 2154.5 | 893.0 | 3059.1 | 1542.0 | 26.4 | 10.6 | 10.7 | 7.5 | 10-20 | NA | NA |
Compute Ack(3, N) with the higher-order Ackermann function definition (N = 11)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix C runtime (specialised, native) | Juvix C runtime (specialised, wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 0.92 | 1.21 | 0.30 | 0.65 | segfault | runtime error | 11.71 | 0.87 | 0.47 | 0.54 | 1.35 | 0.00 | 0.14 |
Memory use (MB, max RSS) | 2.6 | 4.1 | 2.3 | 3.9 | segfault | runtime error | 23.3 | 13.6 | 9.6 | 2.0 | 3.6 | 1.3 | 4.0 |
Successively map K functions to a list of N integers (K = 100, N = 10000)
| | Juvix C runtime (native) | Juvix C runtime (wasm32, wasmer) | Juvix C runtime (specialised, native) | Juvix C runtime (specialised, wasm32, wasmer) | Juvix transpilation (native) | Juvix transpilation (wasm32, wasmer) | JuvixCore evaluator | Haskell (native, ghc -O2) | Haskell (native, ghc -XStrict -O2) | OCaml (native, ocamlopt -O2) | OCaml (bytecode) | C (native, clang -O3) | C (wasm32, clang -Os, wasmer) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (seconds, real) | 1.27 | 1.04 | 0.39 | 0.46 | segfault | runtime error | 4.18 | 1.85 | 0.95 | 0.19 | 0.68 | NA | NA |
Memory use (MB, max RSS) | 3209.8 | 1229.7 | 21.8 | 13.2 | segfault | runtime error | 33.0 | 13.6 | 11.6 | 5.3 | 7.9 | NA | NA |
"Juvix C runtime" denotes C programs written using the primitives of the new Juvix runtime. These programs were "manually" compiled from the corresponding Juvix/JuvixCore programs, according to the new Juvix compilation concept. They correspond closely to the code that will be generated by the basic version of the new compilation process, without any high-level optimisations (inlining, specialisation, fusion, constant folding) but with basic low-level memory representation optimisations (unboxing, untagging, etc). This version of the new compilation process should be finished with the 0.4 milestone.
The "specialised" column for "Juvix C runtime" denotes a version of the corresponding "Juvix C runtime" benchmark program for which specialisation of higher-order functions was manually performed (to simulate the effects of the high-level specialisation optimisation).
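The idea of manual specialisation can be illustrated with a small sketch. It is not taken from the benchmark programs, and `map_list` and `map_add_one` are hypothetical names. The generic function calls a closure for every element, while the specialised copy has the function argument fixed and its body inlined.

```python
def map_list(f, xs):
    """Generic higher-order map: one closure call per element."""
    result = []
    for x in xs:
        result.append(f(x))
    return result


def map_add_one(xs):
    """Copy of map_list specialised to f = (lambda x: x + 1).

    The function parameter is gone and the closure call is replaced
    by the body of the function it would have called.
    """
    result = []
    for x in xs:
        result.append(x + 1)
    return result
```

Both `map_list(lambda x: x + 1, [1, 2, 3])` and `map_add_one([1, 2, 3])` return `[2, 3, 4]`. The specialised version avoids creating and calling the closure.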
"Juvix transpilation" denotes Juvix programs compiled with the old
compilation process via a "direct" translation to C. For a fair
comparison, all number operations were implemented using native binary
C integers (exposed via `foreign` and `compile` blocks) without
overflow check, instead of using the unary `Nat` from the standard
library. For Haskell, we use the fixed-precision `Int` instead of the
arbitrary-precision `Integer`.
For the simplest benchmark programs without heap memory allocation
(e.g. `fibonacci`, `combinations`), the performance of "Juvix
transpilation" is comparable to or better than that of "Juvix C
runtime". This is because `clang` managed to eliminate (tail)
recursion and optimise the code into essentially the same or better
machine code. The main problem with the old "direct" transpilation to C
approach is that it cannot scale to reliably work for more complex
programs. By "more complex" I mean larger program size, more
functions, more complex patterns of recursion and/or the use of more
functional programming features (including functional data
structures). I don't mean higher computational complexity or more
resource use.
The segfaults and runtime errors for "Juvix transpilation" are
consequences of incorrectly generated code (compilation of partial
application is not entirely correct) or stack overflows (when clang
didn't figure out how to eliminate tail recursion).
The comparison with "Juvix transpilation" is not entirely fair for
benchmarks that test the manipulation and calling of closures (e.g.
`fold`). Juvix transpilation achieves good performance (when it doesn't
segfault) at the expense of correctness: partial application is not
compiled correctly and fixing this would require a fundamental change in
closure representation.
The comparison with Haskell and OCaml compilers is not entirely fair,
because the new Juvix C runtime does not perform garbage collection.
With the GC overhead, I would expect the Juvix runtime results for
native compilation of first-order programs to become a bit worse than
the native OCaml versions. The GC overhead is particularly noticeable
for the `mergesort` benchmark which creates many large intermediate
lists. The memory usage of the Juvix runtime is much higher on this
benchmark than the memory usage of OCaml or Haskell versions. The
relatively small time difference between the OCaml native and bytecode
versions of `mergesort` also indicates that GC accounts for a
significant part of the running time.
Another small overhead will be introduced by bounds checking for integer operations. Currently, the Juvix C runtime operates on unboxed 31-bit (or 63-bit) integers without checking for integer overflow.
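The extra work such a check involves can be sketched as follows. The example assumes a signed 63-bit representation and two's complement wraparound for the unchecked case. Both are assumptions made for illustration, and `unchecked_add` and `checked_add` are hypothetical names rather than runtime primitives.

```python
INT_BITS = 63  # assumed width of an unboxed integer
INT_MAX = (1 << (INT_BITS - 1)) - 1
INT_MIN = -(1 << (INT_BITS - 1))


def unchecked_add(a, b):
    """Add and silently wrap around to INT_BITS bits (two's complement)."""
    wrapped = (a + b) & ((1 << INT_BITS) - 1)
    return wrapped - (1 << INT_BITS) if wrapped > INT_MAX else wrapped


def checked_add(a, b):
    """Add, but report an error when the result leaves the 63-bit range."""
    result = a + b
    if result < INT_MIN or result > INT_MAX:  # the extra comparison per operation
        raise OverflowError("63-bit integer overflow")
    return result
```

Here `unchecked_add(INT_MAX, 1)` wraps to `INT_MIN`, whereas `checked_add(INT_MAX, 1)` raises `OverflowError`. The cost of the check is one range test per arithmetic operation.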
If we decide to default to transparent arbitrary-precision integers, then another small overhead will be introduced by the need to check the integer representation with each arithmetic operation.
Admittedly, the programs were deliberately written in a way to make
high-level optimisations unnecessary, except specialisation for
higher-order functions (mostly in `ackermann` and `mapfun`). This
also explains the good performance of the OCaml native compiler which
doesn't do much high-level optimisation.
In the "Juvix transpilation" and OCaml versions of `mergesort`, to
avoid stack overflow the `merge` function was written tail-recursively
with accumulator reversal at the end. This is not necessary for the
Juvix C runtime, because the stack is dynamically extended when
needed.
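A tail-recursive merge with accumulator reversal can be sketched like this, using `(head, tail)` tuples as list cells and `None` as the empty list. The helper names are assumptions, and the code is not the benchmark's own `merge`.

```python
def from_list(xs):
    """Build a linked list of (head, tail) cells from a Python list."""
    cell = None
    for x in reversed(xs):
        cell = (x, cell)
    return cell


def to_list(cell):
    """Convert a linked list of cells back to a Python list."""
    out = []
    while cell is not None:
        out.append(cell[0])
        cell = cell[1]
    return out


def reverse_onto(acc, rest):
    """Reverse the accumulator and put it in front of rest."""
    while acc is not None:
        rest = (acc[0], rest)
        acc = acc[1]
    return rest


def merge(xs, ys, acc=None):
    """Merge two sorted linked lists; every recursive call is a tail call.

    The merged prefix is collected in reverse order in acc and
    reversed once at the end.
    """
    if xs is None:
        return reverse_onto(acc, ys)
    if ys is None:
        return reverse_onto(acc, xs)
    if xs[0] <= ys[0]:
        return merge(xs[1], ys, (xs[0], acc))
    return merge(xs, ys[1], (ys[0], acc))
```

For example, `to_list(merge(from_list([1, 3, 5]), from_list([2, 4, 6])))` gives `[1, 2, 3, 4, 5, 6]`. Python itself does not eliminate tail calls, so the sketch is limited to short lists; with tail-call elimination this form runs in constant stack space.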
As evidenced by the `combinations` benchmark, for non-tail-recursive
direct calls our code performs worse than the code which uses the C /
WebAssembly stack and function call mechanisms. However, in general it
is impossible to directly use the C / WebAssembly stack and call
mechanisms for a purely functional language. Since we dynamically
allocate the stack segments when needed, stack overflow is impossible.
This is convenient in an eager functional language. Otherwise, one needs
to rewrite all functions operating on large data to use tail recursion.
We pay for this convenience with a small overhead, which is the main
reason for poorer performance on `combinations` where stack
manipulation cost dominates.
Haskell's laziness seems to introduce more overhead than I
expected. This would explain the comparatively better performance of the
native OCaml compiler. The problem is particularly stark when Haskell's
strictness analysis fails for some reason, as in the `fibonacci`
benchmark. The second "Haskell" column with the `-XStrict` flag for GHC
indicates the version of the benchmark compiled with strictness as the
default.
The C versions of the programs were written to take advantage of C's imperative features, e.g., using arrays instead of lists, loops instead of recursion. No C versions are provided for some benchmarks designed to test specifically functional language features.
With the new Juvix runtime, the 32-bit WebAssembly version of
`mergesort` is faster than the 64-bit native version because it needs
roughly half as much memory (the word size is 4 bytes instead of 8). The
difference is even starker between the WebAssembly and native versions
of `mergesort` for "Juvix transpilation".