Inspired by copy-and-patch, but what if we didn't have to patch anything?
The basic idea is that we use the compiler to generate 'templates' for us, that we can then copy in place. This relies heavily on continuation passing, meaning that all operations defined by the jit library have to allow for continuation passing optimizations to happen. In copy-and-patch, the templates are filled in at runtime with whatever values the user chooses, but since this relies on parsin ELF relocations, the library will have to be ported to different platforms. Not a huge issue, but if we could avoid runtime patching of relocations we could (in theory) build a jit library that's architecture agnostic, while having very low latencies.
make
This will produce a simple test program, copyjit
, which just generates a
procedure that sums the first billion numbers.
The idea here is to generate a number of continuation passing procedures,
and string them together at runtime. To avoid patching at runtime, all
procedures have a fixed signature, defined by DEFINE_OP
in ops/common.h
.
Each procedure defines one operation, and each operation passes the execution on to operations that directly follow it in memory. In other words, the generated procedures are copied into place in memory to generate executable code.
This implementation uses four pseudo-registers and one stack pointer.
One pseudo-register is just an argument that operations pass to eachother, but
since most architectures pass arguments in registers, a pseudo-register will
likely map to a real register. The registers are a
, x
, y
, o
.
They are loosely based on a 6502-style architecture where the operations are
defined in terms of fixed registers, as this seems the easiest to implement.
a
is intended as an accumulator, x
and y
are general purpose and o
is
for offsets or immediate values. Procedures are possible to implement, but I
don't have any examples at the moment. The argument passing would likely be done
on the stack, though, meaning that this jit method is likely on the slower end
of jit compilers.
I don't know if the architecture I'm going with is necessarily optimal for the task, it's just the first one that popped into my mind.
Since we would optimally like to just copy pre-generated binary code one block after another, each operation should jump to the address following the last instruction in the operation. However, due to continuation passing, this jump is usually the last step in an operation, so we would like to filter it out whenever possible to in crease execution speed and minimize binary size.
Essentially, we generate an empty operation lib/empty
that just falls through to another operation, and then we can compare this empty
operation to other operations. If we find a procedure epilogue that matches the
empty operation, we can (fairly safely?) filter the epilogue out, saving on code
size and increasing runtime performance. GCC's -fno-schedule-insns
and
-fno-schedule-insn2
make it more unlikely that the epilogue has other
instructions mixed in, increasing the chances of succesful filtration.
The code should still work without the filtration, but is a bit slower.
Implementing this filtering is arguably the most hacky part of the project.
The first implementation took the address of a label as the continuation function to produce a jump that technically lands within the procedure itself, with the effect of just falling directly through. Essentially:
void some_operation(int arg0, int arg2)
{
do_work(arg0, arg2);
((op_call)(&&_next_op))(arg0, arg2);
_next_op:
}
This requires a GCC extension as taking the address of a label is not part of standard C.
Using a label as a continuation point seems to have uncovered some bugs in the GNU assembler, as the compiler itself generates seemingly 'correct' assembly but the final binaries are incorrect. This method works on x86 with the GNU toolchain, with at least RISC-V and ARM showing the previously mentioned issues. LLVM seems to work on RISC-V and x86, but fails on ARM. I haven't tested other systems yet.
Instead of relying on labels, we try to use a custom linker script that places a symbol directly after the procedure. This is the current implementation and seems to work on x86 and RISCV without modifications. AARCH64 does work, though unfortunately I had to add AARCH64 specific compilation options to force the tiny memory model. Not a huge issue, but considering the dream is to have a completely architecture agnostic jit library, this sort of goes against that.
Still haven't tested other architectures. While somewhat hacky, the method seems to be more reliable than the first attempt.
We just 'allocate' some extrabytes in the .text
sections that the operation refers to,
and whose address is known so that the jit compiler could populate the area with some
data. While this is kind of patching at runtime, it can still be done
architecture agnostically. In pseudo-assembly form, something like:
1: does_some_work
2: load 4
3: jump 5
4: 0x1234 (data) <-- last four bytes of an operation reserved for data
5: next_operation
All GCC test cases were cross-compiled with Debian's prebuilt compiler packages
(gcc-aarch64-linux-gnu
etc.) and run with qemu usermode emulation with
Debian's prebuilt qemu-user
package. Similarly, LLVM test cases were compiled
with the same architecture string as the GCC tests.
For instance, a GCC test case:
make CROSS_COMPILE=riscv64-linux-gnu- ARCH=riscv64 -j$(nproc) examples
qemu-riscv64 -L /usr/riscv64-linux-gnu ./examples/copyjit
And an LLVM test case:
make CROSS_COMPILE=riscv64-linux-gnu- ARCH=riscv64 -j$(nproc) examples
qemu-riscv64 -L /usr/riscv64-linux-gnu ./examples/copyjit
Some of the 'broken' architectures may end up working with some compiler flag tweaks, similar to aarch64.
The test setup is far from optimal and ideally I would set up some automatic CI that would probably be easier to parse. Additionally, I haven't yet put that much time into figuring out why each arch fails, but I added some of my thoughts as to what the issue seems to be.
Linux:
GCC toolchain:
OK:
x86_64
i686
riscv64
aarch64
arm
Broken:
mips64el
(ISA doesn't contain pc-relative instructions, disallowing copying code directly) (mips64r6 has the instructions, gcc needs at least this patch to use them properly though)mips64
mipsel
mips
sparc64
(fails to generate continuation passing)sh4
(fails to generate continuation passing (I think?))s390x
(doesn't crash, but seems to jump over some code or something?)m68k
(segfault, unclear why)powerpc64le
(fails to generate continuation passing) (does work with power10, though)powerpc64
powerpc
Fails to compile:
hppa64
(missing string.h?)hppa
('undefined reference to$$dyncall
)alpha
('internal compiler error: in extract_insn, at recog.cc:2791'
)
Untestable (by me at least):
arc
(compiles but doesn't have a qemu user emulator)
LLVM toolchain:
OK:
x86_64
aarch64
riscv64
Broken:
i686
(fails to generate continuation passing)arm
(unsure, possibly some branch issue)
TODO: continue testing LLVM