NaN-boxing of SOM values #37

Open
wants to merge 24 commits into
base: master
Conversation

Hirevo
Owner

@Hirevo Hirevo commented Nov 10, 2023

This PR reduces the size of SOM values from 16 bytes (as an enum) to just 8 bytes using a technique called NaN-boxing.

NaN-boxing takes advantage of the fact that just 12 bits (the 11 exponent bits plus the quiet bit) are enough to signal a NaN in a 64-bit floating-point number, so we hijack the 52 remaining bits to encode custom payloads.

This technique is notably used within the LuaJIT project and LibJS from the SerenityOS project.
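
To make the idea concrete, here is a minimal sketch of a NaN-boxed value holding either a double or a 32-bit integer. The tag values and the names `Boxed`, `QNAN`, `TAG_INT` and `INT_PREFIX` are assumptions made up for illustration; they are not the layout used in this PR.

```rust
// Hypothetical bit layout, for illustration only (not the one used in this PR).
const QNAN: u64 = 0x7FF8_0000_0000_0000; // all 11 exponent bits set + quiet bit
const TAG_INT: u64 = 0x0001_0000_0000_0000; // made-up tag stored in the payload
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000; // top 16 bits: sign, exponent, quiet bit, tag
const INT_PREFIX: u64 = QNAN | TAG_INT;

/// An 8-byte value that is either a plain double or a tagged NaN payload.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Boxed(u64);

impl Boxed {
    fn from_f64(x: f64) -> Self {
        // Canonicalise genuine NaNs so they can never collide with a tagged payload.
        if x.is_nan() {
            Boxed(QNAN)
        } else {
            Boxed(x.to_bits())
        }
    }

    fn from_i32(x: i32) -> Self {
        // The integer occupies the low 32 bits of the NaN payload.
        Boxed(INT_PREFIX | (x as u32 as u64))
    }

    fn is_i32(self) -> bool {
        (self.0 & TAG_MASK) == INT_PREFIX
    }

    fn as_i32(self) -> Option<i32> {
        if self.is_i32() {
            Some(self.0 as u32 as i32) // truncate to the low 32 bits
        } else {
            None
        }
    }

    fn as_f64(self) -> Option<f64> {
        // Anything that does not carry one of our tags is treated as a double.
        if self.is_i32() {
            None
        } else {
            Some(f64::from_bits(self.0))
        }
    }
}
```

With this sketch, `Boxed::from_i32(-1).as_i32()` round-trips the integer, and `std::mem::size_of::<Boxed>()` is 8 bytes, which is the size reduction described above.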

This PR is built on top of the custom GC (som-gc) branch and the NaN-boxing is only implemented for the bytecode interpreter right now (so the AST one is currently broken).

This change means multiple things:

  • SOM values are now more compact, so memory usage should go down (but I haven't measured this yet)
  • All previously unboxed value types are still unboxed in the new representation
  • But unboxed integers are now 32-bit, instead of 64-bit
  • Big integers are now GC-managed and therefore cheaply copyable

This change also makes the code a bit less portable, because it relies on the fact that pointers on most 64-bit systems only use the lowest 48 bits of the 64 available, which is what makes them embeddable in the NaNs.
While this is valid for x86-64, I am not currently aware of which architectures are broken by this assumption.
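
The 48-bit assumption can at least be checked at the point where a pointer is packed. The following is a sketch under assumed names (`pack_ptr`, `unpack_ptr`, `TAG_PTR`) and an assumed tag value, none of which come from this PR. It refuses any address that does not fit in the low 48 bits, which is what would happen on platforms with wider virtual addresses (for example x86-64 with 5-level paging).

```rust
// Hypothetical constants, for illustration only.
const QNAN: u64 = 0x7FF8_0000_0000_0000; // exponent bits + quiet bit
const TAG_PTR: u64 = 0x0002_0000_0000_0000; // made-up tag marking a pointer payload
const PTR_MASK: u64 = 0x0000_FFFF_FFFF_FFFF; // the low 48 bits

/// Packs an address into a NaN payload, or returns `None` if the address
/// needs more than 48 bits and therefore cannot be embedded safely.
fn pack_ptr<T>(ptr: *const T) -> Option<u64> {
    let addr = ptr as usize as u64;
    if (addr & !PTR_MASK) != 0 {
        return None;
    }
    Some(QNAN | TAG_PTR | addr)
}

/// Recovers the address if `bits` carries the pointer tag.
fn unpack_ptr<T>(bits: u64) -> Option<*const T> {
    if (bits & !PTR_MASK) == (QNAN | TAG_PTR) {
        Some((bits & PTR_MASK) as usize as *const T)
    } else {
        None
    }
}
```

Returning `None` (or panicking at allocation time) turns a silent pointer truncation into a visible failure on architectures where the assumption does not hold.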

Depends on #33.

@Hirevo Hirevo added C-enhancement Category: Enhancements C-refactor Category: Refactors M-interpreter Module: Interpreter P-medium Priority: Medium labels Nov 10, 2023
@Hirevo Hirevo self-assigned this Nov 10, 2023
@Hirevo
Owner Author

Hirevo commented Nov 10, 2023

The bytecode interpreter is fully working and passing tests except for 6 failures in Integer tests I have yet to fix due to unboxed integers now being 32-bit.
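
One plausible source of such failures is arithmetic that used to fit in a 64-bit integer but now overflows 32 bits. The sketch below shows one way to handle that with checked arithmetic and promotion; it is an assumption about the kind of fix involved, not the fix applied in this PR. `Int`, `add` and `mul` are hypothetical names, and `i64` merely stands in for a GC-managed big integer.

```rust
/// Result of an integer operation: either still unboxed, or promoted.
/// `Big(i64)` is a stand-in for a heap-allocated big integer.
#[derive(Debug, PartialEq)]
enum Int {
    Small(i32),
    Big(i64),
}

/// Adds two unboxed integers, promoting when the sum overflows 32 bits.
fn add(a: i32, b: i32) -> Int {
    match a.checked_add(b) {
        Some(sum) => Int::Small(sum),
        None => Int::Big(i64::from(a) + i64::from(b)),
    }
}

/// Multiplies two unboxed integers, promoting when the product overflows 32 bits.
fn mul(a: i32, b: i32) -> Int {
    match a.checked_mul(b) {
        Some(product) => Int::Small(product),
        // The product of two i32 values always fits in an i64.
        None => Int::Big(i64::from(a) * i64::from(b)),
    }
}
```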

The CI is unable to show this due to the AST interpreter not being able to build successfully right now.

Also, currently, both the enum-based Value and the NaN-boxed SOMValue types are co-existing and convertible from one to the other.
I've kept the original Value type for now to get the primitives' implementations to work quickly and avoid reimplementing them all before seeing the interpreter working again.
The conversion between these two value representations is not that cheap, so I'll only do performance measurements after I've fully moved the primitives over to the NaN-boxed representation.

@Hirevo
Owner Author

Hirevo commented Nov 17, 2023

The primitives have now all been revisited to stop using the old Value type.
Primitives now have a much more declarative way of extracting their expected arguments, thanks to some trait-based type system magic.

Previously, all primitives were bare function pointers, had the same prototype (fn(&mut Interpreter, &mut GcHeap, &mut Universe) -> ()), and had to take care of extracting and checking arguments themselves and putting their return values onto the interpreter stack manually.

Now, primitives can declare their expected arguments directly in their argument list, and trait-based dispatch takes care of extracting them from the interpreter stack and converting them automatically (using a new FromArgs trait).
The primitives can also directly return values and they will automatically be pushed back onto the interpreter stack.

This significantly simplified the function bodies of the primitives, freeing them of most conversions and interpreter stack management.
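
As a rough illustration of the pattern described above, here is a minimal sketch of trait-based argument extraction. Only the name `FromArgs` comes from this PR; its shape here, along with `FromValue`, `IntoValue`, `Primitive`, the simplified `Value` and `Interpreter` types, and `int_add`, are assumptions made for the example.

```rust
// Simplified stand-ins for the interpreter types (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Integer(i32),
    Double(f64),
}

struct Interpreter {
    stack: Vec<Value>,
}

/// Converts a single stack value into a Rust type.
trait FromValue: Sized {
    fn from_value(value: &Value) -> Option<Self>;
}

impl FromValue for i32 {
    fn from_value(value: &Value) -> Option<Self> {
        match value {
            Value::Integer(n) => Some(*n),
            _ => None,
        }
    }
}

/// Converts a primitive's return value back into a stack value.
trait IntoValue {
    fn into_value(self) -> Value;
}

impl IntoValue for i32 {
    fn into_value(self) -> Value {
        Value::Integer(self)
    }
}

/// Extracts a whole argument list from the interpreter stack.
trait FromArgs: Sized {
    fn from_args(interp: &mut Interpreter) -> Result<Self, String>;
}

impl<A: FromValue, B: FromValue> FromArgs for (A, B) {
    fn from_args(interp: &mut Interpreter) -> Result<Self, String> {
        // Arguments are pushed left to right, so they are popped in reverse.
        let b = interp.stack.pop().ok_or("stack underflow")?;
        let a = interp.stack.pop().ok_or("stack underflow")?;
        let a = A::from_value(&a).ok_or("wrong type for argument 1")?;
        let b = B::from_value(&b).ok_or("wrong type for argument 2")?;
        Ok((a, b))
    }
}

/// Anything callable as a primitive once its arguments are extracted.
trait Primitive<Args> {
    fn invoke(&self, interp: &mut Interpreter) -> Result<(), String>;
}

impl<F, A, B, R> Primitive<(A, B)> for F
where
    F: Fn(A, B) -> R,
    (A, B): FromArgs,
    R: IntoValue,
{
    fn invoke(&self, interp: &mut Interpreter) -> Result<(), String> {
        let (a, b) = <(A, B)>::from_args(interp)?;
        // The return value is pushed back onto the stack automatically.
        interp.stack.push(self(a, b).into_value());
        Ok(())
    }
}

/// A primitive body reduced to its actual logic.
fn int_add(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}
```

Calling `int_add.invoke(&mut interp)` on a stack holding two integers pops both and pushes their sum; a stack holding a `Value::Double` instead yields an error without the primitive body ever running. Other arities would need their own impls, which is the kind of repetition a macro could generate.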

I think the ergonomics can be improved even further through the use of procedural macros (like a #[primitive] annotation), which would do some automatic rewriting to canonicalise all primitives under a common bare function pointer type, instead of needing to make them heap-allocated trait objects.

@som-rs-benchmarker

som-rs-benchmarker bot commented Nov 22, 2023

Here are the benchmark results for feature/nan-boxing (commit: bcf43ac):

AST interpreter
+-----------------+---------------------------------------+---------------------------+
| Benchmark       | master (base)                         | feature/nan-boxing (head) |
+-----------------+---------------------------------------+---------------------------+
| Bounce          | 184.98 ms ± 5.53 (175.65..191.77)     | 0.94x ± 0.07 (0.78..0.99) |
| BubbleSort      | 277.82 ms ± 12.67 (264.13..299.19)    | 1.03x ± 0.06 (0.94..1.10) |
| DeltaBlue       | 158.31 ms ± 6.92 (152.37..175.96)     | 1.02x ± 0.05 (0.98..1.07) |
| Dispatch        | 188.48 ms ± 10.37 (178.69..216.00)    | 0.97x ± 0.10 (0.84..1.05) |
| Fannkuch        | 124.15 ms ± 2.84 (120.66..128.23)     | 1.00x ± 0.08 (0.83..1.06) |
| Fibonacci       | 358.35 ms ± 11.23 (342.14..373.38)    | 0.98x ± 0.06 (0.86..1.02) |
| FieldLoop       | 321.04 ms ± 10.11 (304.11..337.41)    | 0.99x ± 0.04 (0.95..1.02) |
| GraphSearch     | 86.62 ms ± 6.90 (79.94..101.50)       | 1.09x ± 0.09 (1.05..1.12) |
| IntegerLoop     | 327.27 ms ± 14.87 (313.40..352.81)    | 1.00x ± 0.05 (0.97..1.03) |
| JsonSmall       | 202.98 ms ± 20.83 (190.92..261.24)    | 0.99x ± 0.11 (0.92..1.05) |
| List            | 230.33 ms ± 5.71 (223.30..239.87)     | 0.99x ± 0.06 (0.88..1.05) |
| Loop            | 427.06 ms ± 14.27 (412.32..450.72)    | 1.01x ± 0.04 (0.94..1.03) |
| Mandelbrot      | 249.36 ms ± 7.26 (239.72..259.53)     | 0.98x ± 0.03 (0.95..1.01) |
| NBody           | 215.81 ms ± 13.48 (202.28..239.21)    | 1.03x ± 0.07 (0.96..1.07) |
| PageRank        | 297.74 ms ± 16.89 (283.94..341.75)    | 0.97x ± 0.07 (0.90..1.02) |
| Permute         | 304.93 ms ± 12.34 (285.04..325.98)    | 1.04x ± 0.06 (0.94..1.06) |
| Queens          | 235.13 ms ± 9.76 (223.21..254.53)     | 1.00x ± 0.06 (0.93..1.04) |
| QuickSort       | 75.73 ms ± 4.42 (72.35..87.62)        | 0.99x ± 0.08 (0.87..1.05) |
| Recurse         | 269.71 ms ± 11.48 (256.03..293.63)    | 0.97x ± 0.09 (0.84..1.05) |
| Richards        | 4013.68 ms ± 71.25 (3909.78..4116.37) | 1.02x ± 0.02 (0.99..1.03) |
| Sieve           | 425.09 ms ± 13.67 (401.47..444.89)    | 1.02x ± 0.05 (0.97..1.06) |
| Storage         | 86.23 ms ± 5.67 (80.57..98.61)        | 1.04x ± 0.08 (0.95..1.09) |
| Sum             | 164.04 ms ± 3.14 (158.94..169.52)     | 0.99x ± 0.03 (0.95..1.02) |
| Towers          | 316.76 ms ± 17.83 (301.78..353.98)    | 0.95x ± 0.09 (0.83..1.05) |
| TreeSort        | 156.32 ms ± 4.50 (151.02..164.89)     | 0.97x ± 0.10 (0.76..1.03) |
| WhileLoop       | 363.40 ms ± 19.73 (341.65..415.20)    | 0.99x ± 0.06 (0.95..1.02) |
|                 |                                       |                           |
| Average Speedup |              (baseline)               | 1.00x ± 0.01 (0.94..1.09) |
+-----------------+---------------------------------------+---------------------------+

The raw ReBench data files are available for download here: baseline and head

Bytecode interpreter
+-----------------+---------------------------------------+---------------------------+
| Benchmark       | master (base)                         | feature/nan-boxing (head) |
+-----------------+---------------------------------------+---------------------------+
| Bounce          | 87.11 ms ± 2.20 (84.52..90.68)        | 1.00x ± 0.05 (0.94..1.07) |
| BubbleSort      | 123.71 ms ± 7.04 (116.51..140.07)     | 0.99x ± 0.11 (0.85..1.09) |
| DeltaBlue       | 68.40 ms ± 2.62 (65.30..74.32)        | 0.96x ± 0.06 (0.90..1.06) |
| Dispatch        | 91.38 ms ± 1.82 (89.29..94.82)        | 1.06x ± 0.03 (1.03..1.11) |
| Fannkuch        | 67.40 ms ± 12.44 (56.09..86.29)       | 1.21x ± 0.26 (0.98..1.35) |
| Fibonacci       | 155.57 ms ± 3.01 (152.47..162.36)     | 0.90x ± 0.09 (0.72..0.96) |
| FieldLoop       | 212.93 ms ± 8.28 (200.94..227.04)     | 1.38x ± 0.10 (1.21..1.47) |
| GraphSearch     | 41.52 ms ± 4.46 (37.67..50.78)        | 0.91x ± 0.12 (0.83..1.05) |
| IntegerLoop     | 160.50 ms ± 1.81 (158.55..164.58)     | 1.05x ± 0.02 (1.02..1.10) |
| JsonSmall       | 100.39 ms ± 5.33 (93.78..113.70)      | 1.00x ± 0.06 (0.92..1.03) |
| List            | 119.10 ms ± 2.13 (116.46..122.75)     | 0.88x ± 0.13 (0.65..1.01) |
| Loop            | 217.92 ms ± 19.36 (201.56..265.78)    | 1.06x ± 0.14 (0.91..1.16) |
| Mandelbrot      | 135.36 ms ± 8.10 (128.33..155.52)     | 1.07x ± 0.07 (1.03..1.13) |
| NBody           | 90.60 ms ± 2.78 (87.32..96.59)        | 0.95x ± 0.11 (0.73..1.04) |
| PageRank        | 141.75 ms ± 5.32 (136.60..152.48)     | 0.92x ± 0.16 (0.62..1.02) |
| Permute         | 135.05 ms ± 7.34 (126.41..151.20)     | 1.05x ± 0.13 (0.86..1.15) |
| Queens          | 98.83 ms ± 5.63 (94.02..113.54)       | 1.05x ± 0.07 (0.98..1.11) |
| QuickSort       | 34.64 ms ± 1.09 (33.09..36.69)        | 1.03x ± 0.08 (0.91..1.11) |
| Recurse         | 127.24 ms ± 7.27 (121.79..146.41)     | 0.95x ± 0.08 (0.86..1.02) |
| Richards        | 1705.71 ms ± 47.35 (1648.36..1805.16) | 0.97x ± 0.04 (0.89..1.01) |
| Sieve           | 191.29 ms ± 5.58 (184.62..202.07)     | 1.02x ± 0.06 (0.94..1.09) |
| Storage         | 37.55 ms ± 1.82 (34.90..40.54)        | 1.00x ± 0.12 (0.81..1.11) |
| Sum             | 78.53 ms ± 1.29 (77.34..81.14)        | 1.02x ± 0.07 (0.92..1.12) |
| Towers          | 144.80 ms ± 2.66 (142.14..150.08)     | 1.12x ± 0.05 (1.01..1.17) |
| TreeSort        | 54.66 ms ± 1.64 (52.41..57.80)        | 0.91x ± 0.12 (0.72..1.09) |
| WhileLoop       | 212.61 ms ± 7.89 (199.99..229.61)     | 1.01x ± 0.07 (0.92..1.08) |
|                 |                                       |                           |
| Average Speedup |              (baseline)               | 1.02x ± 0.02 (0.88..1.38) |
+-----------------+---------------------------------------+---------------------------+

The raw ReBench data files are available for download here: baseline and head

The benchmarks were run using ReBench v1.2.0
The statistical analysis was done using rebench-tabler v0.1.0

@smarr
Contributor

smarr commented Nov 22, 2023

SOM-RS Benchmarker? :) building your own GitHub tooling? Very interesting!
The NaN boxing and GC work is of course also very nice to see.

@Hirevo
Owner Author

Hirevo commented Nov 22, 2023

Yeah, it is just a little thing I built to get these benchmark summary tables without running the benchmarks manually and managing all the .data files myself.
It is a very simple system built using shell scripts and a webhook-handling server that I built a while back and had running anyway.

I am not even sure if it will stay, given that it is a bit redundant with ReBenchDB, but maybe having the results inline in the PRs can make the results a bit easier to access.

I have also updated my ReBenchDB instance and noticed a new compare page, but it seems to have an issue right now displaying the results (like in this page), so I'll investigate that.

Also, it seems that ReBenchDB has a feature to post GitHub comments automatically which could replace my little setup.
I'll have to get around to setting it up soon.

> The NaN boxing and GC work is of course also very nice to see.

Thank you, although it's advancing slower than I would like (like the GC, which hasn't really improved in a while).
I am currently reading a lot of material on ways to improve interpreters, but I find it harder than expected to integrate some of these ideas into SOM-RS.

@smarr
Contributor

smarr commented Nov 22, 2023

> I am not even sure if it will stay, given that it is a bit redundant with ReBenchDB, but maybe having the results inline in the PRs can make the results a bit easier to access.

It's something I always wanted for ReBenchDB ;)

> I have also updated my ReBenchDB instance and noticed a new compare page, but it seems to have an issue right now displaying the results (like in this page), so I'll investigate that.

Hmm. So, yeah, that could be a bug, or somehow the configuration between runs changed in ways that are not expected?

It's all reimplemented in TypeScript, and the R is removed from the project.
This makes things a bit faster, but at the same time, there's a lot of rather complex ugly code now to try to process and reshape the data.

> Also, it seems that ReBenchDB has a feature to post GitHub comments automatically which could replace my little setup. I'll have to get around to setting it up soon.

Yes, there's a "feature", but it is broken, at least on my setup, and I haven't had time to debug it since upgrading rebench.dev...

> The NaN boxing and GC work is of course also very nice to see.

> Thank you, although it's advancing slower than I would like (like the GC, which hasn't really improved in a while). I am currently reading a lot of material on ways to improve interpreters, but I find it harder than expected to integrate some of these ideas into SOM-RS.

Is Rust getting in your way? :)

@Hirevo Hirevo force-pushed the feature/nan-boxing branch from bcf43ac to c07e966 Compare May 8, 2024 13:44
@som-rs-benchmarker

som-rs-benchmarker bot commented May 8, 2024

Here are the benchmark results for feature/nan-boxing (commit: c07e966):

AST interpreter
+-----------------+----------------------------------------+---------------------------+
| Benchmark       | master (base)                          | feature/nan-boxing (head) |
+-----------------+----------------------------------------+---------------------------+
| Bounce          | 185.58 ms ± 11.51 (173.57..208.52)     | 0.98x ± 0.08 (0.90..1.06) |
| BubbleSort      | 254.45 ms ± 17.04 (238.33..294.32)     | 1.00x ± 0.08 (0.92..1.05) |
| DeltaBlue       | 148.46 ms ± 4.87 (140.96..156.67)      | 0.96x ± 0.10 (0.79..1.05) |
| Dispatch        | 179.93 ms ± 8.00 (171.06..195.05)      | 1.02x ± 0.07 (0.93..1.08) |
| Fannkuch        | 122.70 ms ± 11.24 (113.72..147.90)     | 1.00x ± 0.11 (0.87..1.05) |
| Fibonacci       | 362.05 ms ± 10.68 (348.34..381.90)     | 1.05x ± 0.04 (1.03..1.10) |
| FieldLoop       | 307.67 ms ± 7.91 (299.54..323.67)      | 1.00x ± 0.10 (0.79..1.06) |
| GraphSearch     | 87.78 ms ± 13.77 (76.57..121.88)       | 1.09x ± 0.20 (0.92..1.23) |
| IntegerLoop     | 310.70 ms ± 25.21 (292.31..369.39)     | 0.99x ± 0.09 (0.92..1.06) |
| JsonSmall       | 183.38 ms ± 9.01 (173.31..200.77)      | 1.03x ± 0.06 (0.97..1.06) |
| List            | 226.56 ms ± 16.41 (214.70..265.03)     | 1.01x ± 0.09 (0.91..1.08) |
| Loop            | 414.59 ms ± 28.56 (382.10..460.44)     | 1.05x ± 0.09 (0.93..1.09) |
| Mandelbrot      | 243.40 ms ± 9.44 (231.55..262.19)      | 0.97x ± 0.07 (0.89..1.03) |
| NBody           | 193.37 ms ± 4.93 (190.04..205.91)      | 0.96x ± 0.05 (0.90..1.01) |
| PageRank        | 281.45 ms ± 11.94 (271.82..311.19)     | 0.96x ± 0.08 (0.84..1.03) |
| Permute         | 291.63 ms ± 20.57 (274.45..347.34)     | 0.93x ± 0.08 (0.89..0.99) |
| Queens          | 228.87 ms ± 15.72 (208.00..254.67)     | 0.99x ± 0.08 (0.92..1.05) |
| QuickSort       | 69.23 ms ± 1.86 (66.19..71.57)         | 0.93x ± 0.10 (0.79..1.01) |
| Recurse         | 268.09 ms ± 17.18 (256.46..310.36)     | 1.02x ± 0.08 (0.97..1.09) |
| Richards        | 3859.18 ms ± 138.18 (3711.90..4188.74) | 0.99x ± 0.06 (0.90..1.04) |
| Sieve           | 391.74 ms ± 7.94 (380.17..402.79)      | 0.99x ± 0.04 (0.95..1.07) |
| Storage         | 79.09 ms ± 3.05 (76.13..85.86)         | 0.90x ± 0.13 (0.67..1.01) |
| Sum             | 154.05 ms ± 2.57 (150.48..158.50)      | 1.02x ± 0.03 (0.98..1.06) |
| Towers          | 302.00 ms ± 9.45 (292.08..318.36)      | 0.94x ± 0.08 (0.81..1.06) |
| TreeSort        | 146.26 ms ± 4.72 (140.41..157.79)      | 0.84x ± 0.08 (0.68..0.94) |
| WhileLoop       | 349.88 ms ± 18.98 (331.06..393.56)     | 1.00x ± 0.08 (0.85..1.06) |
|                 |                                        |                           |
| Average Speedup |               (baseline)               | 0.99x ± 0.02 (0.84..1.09) |
+-----------------+----------------------------------------+---------------------------+

The raw ReBench data files are available for download here: baseline and head

Bytecode interpreter
+-----------------+---------------------------------------+---------------------------+
| Benchmark       | master (base)                         | feature/nan-boxing (head) |
+-----------------+---------------------------------------+---------------------------+
| Bounce          | 78.71 ms ± 8.04 (70.61..94.73)        | 1.27x ± 0.16 (1.11..1.44) |
| BubbleSort      | 97.94 ms ± 2.70 (95.11..101.91)       | 1.15x ± 0.14 (0.88..1.31) |
| DeltaBlue       | 61.87 ms ± 9.30 (55.73..87.66)        | 1.17x ± 0.22 (1.02..1.41) |
| Dispatch        | 82.37 ms ± 15.45 (73.87..125.17)      | 1.42x ± 0.28 (1.25..1.52) |
| Fannkuch        | 46.44 ms ± 1.91 (43.58..49.21)        | 1.07x ± 0.18 (0.87..1.38) |
| Fibonacci       | 130.35 ms ± 4.79 (125.05..138.36)     | 1.16x ± 0.09 (1.05..1.24) |
| FieldLoop       | 167.48 ms ± 10.10 (158.22..186.16)    | 1.50x ± 0.23 (1.09..1.69) |
| GraphSearch     | 33.96 ms ± 2.87 (32.21..41.90)        | 1.01x ± 0.23 (0.67..1.19) |
| IntegerLoop     | 140.92 ms ± 7.84 (129.82..150.09)     | 1.32x ± 0.12 (1.18..1.43) |
| JsonSmall       | 79.34 ms ± 6.22 (74.22..95.49)        | 1.01x ± 0.23 (0.66..1.17) |
| List            | 101.89 ms ± 6.65 (94.91..117.06)      | 1.21x ± 0.10 (1.12..1.30) |
| Loop            | 182.97 ms ± 16.53 (167.38..220.81)    | 1.32x ± 0.19 (1.08..1.48) |
| Mandelbrot      | 123.16 ms ± 21.04 (106.08..166.40)    | 1.34x ± 0.28 (1.01..1.49) |
| NBody           | 78.92 ms ± 3.85 (74.55..87.14)        | 1.23x ± 0.13 (1.03..1.42) |
| PageRank        | 123.35 ms ± 21.25 (108.65..163.91)    | 1.23x ± 0.27 (0.88..1.37) |
| Permute         | 109.58 ms ± 7.65 (103.68..125.93)     | 1.27x ± 0.12 (1.12..1.41) |
| Queens          | 87.51 ms ± 4.24 (81.50..95.83)        | 1.31x ± 0.09 (1.19..1.44) |
| QuickSort       | 30.34 ms ± 6.03 (26.43..47.31)        | 1.21x ± 0.35 (0.82..1.52) |
| Recurse         | 111.72 ms ± 5.83 (106.23..125.29)     | 1.11x ± 0.13 (0.94..1.28) |
| Richards        | 1435.33 ms ± 69.83 (1373.79..1576.64) | 1.12x ± 0.08 (1.02..1.20) |
| Sieve           | 161.99 ms ± 12.50 (149.47..192.75)    | 1.23x ± 0.13 (1.10..1.35) |
| Storage         | 31.71 ms ± 1.22 (29.68..33.66)        | 0.98x ± 0.28 (0.65..1.38) |
| Sum             | 68.14 ms ± 6.29 (64.21..84.45)        | 1.12x ± 0.12 (1.05..1.23) |
| Towers          | 112.44 ms ± 6.69 (106.97..130.18)     | 1.15x ± 0.16 (0.89..1.33) |
| TreeSort        | 45.67 ms ± 1.51 (44.31..49.55)        | 1.03x ± 0.13 (0.86..1.20) |
| WhileLoop       | 163.02 ms ± 6.94 (150.01..173.15)     | 1.19x ± 0.14 (1.01..1.39) |
|                 |                                       |                           |
| Average Speedup |              (baseline)               | 1.20x ± 0.04 (0.98..1.50) |
+-----------------+---------------------------------------+---------------------------+

The raw ReBench data files are available for download here: baseline and head

The benchmarks were run using ReBench v1.2.0
The statistical analysis was done using rebench-tabler v0.1.0

The source code of this benchmark runner is available as a GitHub Gist for more details about the setup
