-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
s390x: initial port #91
Conversation
Let me know when the code is ready for review. Considering the missing parts, this patch definitely not closes #89. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch is quite big, will take time to review (I will do it in multiple steps). I noticed a large amount of syntax errors, I would recommend to fix them in the meantime. Please also squash the commits.
@@ -715,6 +720,12 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr); | |||
#define SLJIT_NUMBER_OF_SAVED_REGISTERS 5 | |||
#define SLJIT_LOCALS_OFFSET_BASE 0 | |||
|
|||
#elif (defined SLJIT_CONFIG_S390X && SLJIT_CONFIG_S390X) | |||
|
|||
#define SLJIT_NUMBER_OF_REGISTERS 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only 12? I expected 14 based on the abi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was guessing that is used for making sure the 2 "reserved" registers are not touched (indeed SLJIT_R0 == r2), but I admit it is one of those things from the original code that would had been useful to discuss with the original author (@mundaym).
same goes for the FPU avoidance and the use of the constant pool (eventhough I think I had figured out that last part, and the cache comment was just deceiving, hence why it was removed and replaced by an empty CACHE_CLEAR as it is done in the other CISC architecture)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this @carenas and @zherczeg, it's good to see it being worked on again. I am happy to answer questions. Feel free to tag me.
I was guessing that is used for making sure the 2 "reserved" registers are not touched
Yes, I set the number of registers to 12 because some of the registers are 'reserved' to hold temporary values during codegen.
same goes for the FPU avoidance
I think we just ignored it in our initial port because (at least at the time) PCRE2JIT didn't use the FPU and we never finished it.
the use of the constant pool
The constant pool is probably overengineered. You could emit a LLILF-IIHF pair for each constant load. There are definitely some gotchas when modifying instructions like that though. In particular the effect on the performance of the instruction cache can be bad (they usually aren't heavily optimized for self modifying code - I think this is true on x86 too though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mundaym!
Thank you for joining. I forgot about the stack pointer, but usually 2 temporary registers are enough to implement al instructions. I hope at some point we will only use r0 (since it cannot be used for addressing) and r14 (link register).
On the fly code modification is not frequent, but they often worth it, so their cost should be negligible in the long run.
I used constant pool on ARMv5, but I decided to never use it again. It make things way too complex.
sljit_src/sljitNativeS390X.c
Outdated
|
||
/* General Purpose Registers [0-15]. */ | ||
typedef sljit_uw sljit_gpr; | ||
const sljit_gpr r0 = 0; /* 0 in address calculations; reserved */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reg_map should be used here.
sljit_src/sljitNativeS390X.c
Outdated
return 0; | ||
} | ||
|
||
static SLJIT_INLINE long get_hwcap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this function does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns a 64bit bitmap with "capabilities" that can be used to determine what the CPU supports (ex: VECTOR operations)
the original code has several of those, that seem to have been done statically at compile time and also at runtime, probably to optimize them but also to provide wide enough support.
since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO) but it is better to keep it for now since it will be useful in step 3 and beyond (even if supporting QEMU is likely not possible, but there might be some IBM customers with older hardware)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some background: the HWCAP bit vector gives you information about what hardware features the Linux kernel supports. This is important since STFLE might tell you that vector instructions are available on the machine but if the kernel does not advertise support for them then they can't be used (the kernel needs to save/restore vector registers during context switches amongst other things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO)
Yeah, if targeting z14 and above then I'd just drop facility checks like genext
(maybe do them at startup to give a nice error message on older machines). When initially starting the port I added the checks but as it became clear that it was not going to be easy to support older machines I started to leave them out which is why they are a bit patchy.
This port of SLJIT has (at least) the following limitations: - Only works on recent machines (z12 and above) - No 64-bit multiply overflow detection on machines before z14 - Poor flag performance due to emulated flag registers https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental
33a7bba
to
3e6f500
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will continue the review later.
|
||
/* RR form instructions. */ | ||
#define SLJIT_S390X_RR(name, pattern) \ | ||
SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer these to be macros (since they are not even static functions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are defined as an inlined static function (i fixed that on my first patchset) through the (non C89 compatible variadic macro) in line 333
https://github.com/carenas/sljit/blob/s390x/sljit_src/sljitNativeS390X.c#L334-L335
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to do it without variadic macro.
#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added as TODO for later as it will require undef of this macro on each redefinition and we might have a different solution long term (ex: non function macros)
note though that this code will only be built with gcc/clang and both seem to support this feature on this platform, so strict c89 compliance might not be a requirement IMHO (most other issues pointed out from pedantic were cleaned in the first batch)
return push_inst(compiler, iihf(target, (sljit_u32)(v>>32))); | ||
} | ||
/* TODO(mundaym): instruction sequences that don't use extended immediates */ | ||
abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make have_eimm
mandatory. Crashing here probably not preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those abort() were left explicitly (together with a matching TODO) as the target for this codebase are still developers (also why is disabled for auto detection).
once step2 is completed, it will be mandatory because the code will only be enabled if targeting z14 (or most likely z13, since the Ubuntu instances from Travis can't build for anything newer than that), but leaving it there allows us to develop and test and implementation (once a developer removes the safeguards)
if the community fails to materialize, or there is no demand for it (which will require people finding this after they try to force it and crash) then we can remove it and make the "officially unsupported" claim and we should make that choice before this code is enabled for end users (with auto detection on) but I think it is premature now.
also I am no mainframe expert (maybe @edelsohn could find one) and while it is likely eimm should be available in all cases (since it comes with z10 or higher) there might be cases were it is not enabled (maybe needs a license, or is an option that can be turned off at the hypervisor) so crashing can be useful (with the constrains explained above) and removing it makes us lose visibility of places where the codebase might need improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused about this Travis comment because Sleef is testing in Travis and getting z15. There only are z15 systems. Maybe this is because the Travis yml configuration is requesting xenial instead of bionic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little confusing to have technical questions scattered through a pull request review. Maybe the technical questions could be in the feature request issue itself to keep them localized? I definitely can forward questions or ask mainframe colleagues to join to answer questions. And when the community starts to look at performance issues, the IBM Linux on Z team definitely can assist with performance analysis and advice.
3ad8d2c
to
4b8dfd1
Compare
6990937
to
aeb2f2d
Compare
Hopefully no more style issues; code still has plenty of TODOs and abort() but that is expected as it only meant to be a starting base for further development, with a focus of supporting enough JIT to allow PCRE to run and a target of developers (not final users) it does have enough support to run cleanly both SLJITs and PCREs tests in recent enough hardware though (z15 based VMs/containers from Linux ONE) using both gcc and clang, and RedHat 8 (compiled for z14) and two recent versions of Ubuntu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are at halfway now!
sljit_src/sljitConfigInternal.h
Outdated
|
||
#define SLJIT_NUMBER_OF_REGISTERS 12 | ||
#define SLJIT_NUMBER_OF_SAVED_REGISTERS 8 | ||
#define SLJIT_LOCALS_OFFSET_BASE 160 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
160 bytes are quite big. Does it really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The platform ABI specifies a 160 byte area allocated in the callers stack frame for callee-save registers. So when calling C functions up to 160 bytes of the callers stack frame might be clobbered:
: local :
: variables :
| |
+---------------+ <- caller stack pointer + 160
| |
| callee-save |
| registers |
| |
+---------------+ <- caller stack pointer
| |
: callee stack :
: frame :
Having such a large callee-save area can be a bit wasteful but changing it would obviously cause compatability issues and in practice it works ok.
There is more detail about the (non-vector) calling convention here: http://legacy.redhat.com/pub/redhat/linux/7.1/es/os/s390x/doc/lzsabi0.pdf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is only needed if the function is not a leaf function (hence why I was expecting to get to it on step 3 with the other optimizations, as more of our functions should be leaf functions IMHO)
since sljit only supports up to 4 arguments (with the current implementation only managing 3) we are not going to need to go to the stack to get any parameters, but as @mundaym pointed out then our generated code won't be fully ABI compatible either (the part that worries me more, is the mention of allocating a frame with space for the back chain + some 8 bytes of reserved space which is both marked as optional and as required in pages 24-26 of the legacy ABI document that was shared) but mainly because the creation of the frame is done by the caller instead of the callee which might be unique to this ABI
also important to mention that will be even bigger once vector instructions are used
|
||
/* RR form instructions. */ | ||
#define SLJIT_S390X_RR(name, pattern) \ | ||
SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to do it without variadic macro.
#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)
SLJIT_S390X_RILA(aih, 0xcc0800000000, sljit_s32) /* TODO(mundaym): high-word facility? */ | ||
|
||
/* ADD LOGICAL IMMEDIATE */ | ||
SLJIT_S390X_RILA(alfi, 0xc20b00000000, sljit_u32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there warnings for these long constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no (and I was surprised as well), the C standard allows for literals to be assigned an integer type (depending on the architecture) up to unsigned long and at least in the gcc/clang that comes with RHEL8/s390x it automatically gets it to the right type without the need for suffixes or triggering warnings.
$ cat t.c
#include <stdio.h>
#define I 0xffffffff
#define L 0xffffffffffffffff
int main(void) {
printf("%zu %zu\n", sizeof(I), sizeof(L));
return 0;
}
$ gcc -Wall -Wextra -o t t.c
$ clang -Weverything -o t t.c
$ ./t
4 8
sljit_ins *ibuf = (sljit_ins *)buf->memory; | ||
for (i = 0; i < len; ++i, ++j) { | ||
sljit_ins ins = ibuf[i]; | ||
if (ins & sljit_ins_const) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have 64 bit, where 16 bit is never used, we could use more flags to handle the cases below in a similar way. Unless we want to reduce the instructon space in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the use of instruction tags instead of the linked lists was an oddity of the original code that I punted changing originally to avoid unnecessary refactoring, but as you pointed out might be worth considering as well in all other generators, as well as for all types in this one.
since we need max 6 bytes for all instructions, using 8 in general makes sense (unlike x86 where we could have up to 15); everything else is RISC so having one single instruction size typedef works like a charm.
then again I think that is probably not worth refactoring right now either IMHO, in retrospective I probably should had documented more verbosely those things like the original author did by adding TODOs, but I was of the mentality of removing them instead of adding more ;)
|
||
/* emit the code */ | ||
j = 0; | ||
for (buf = compiler->buf; buf != NULL; buf = buf->next) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other implementations does not need two loops. Cannot we compute the size in a similar way to others? Furthermore it seems only one constant pool is created. What happens if the offsets are too long? Can be a security issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the two loops was another of those things that looked like worth fixing in step 3 (or later) instead of now, and I have to admit that by doing it in 2 loops is easier to understand for new developers IMHO. note also that the first loop is needed because that is the only way we can calculate the code size BEFORE the code memory is allocated for now, and unlike other implementations this one does jumps and put_labels in only one loop ;)
the constant pool is usually created before the function code and there is one per function; you are correct this implementation is different but since they are 32bit wide shouldn't be a problem unless we have a JIT function bigger than 4GB, added a comment/TODO though
if (args >= 2) | ||
FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S1), gpr(SLJIT_R1)))); | ||
if (args >= 3) | ||
FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S2), gpr(SLJIT_R2)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fourth argument mapped to s3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, through the reg_map; but not sure if more is needed as I had to admit I just found out about this fourth argument from your answer to #49
references to SLJIT_S3 and SLJIT_S4 in the testsuite seem to work fine so I am hoping it is, unless there is a deficiency there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw other issues, but I will try to do those in follow-up patches.
sljit_src/sljitNativeS390X.c
Outdated
#define tmp0 r0 | ||
#define tmp1 r1 | ||
|
||
/* TODO(carenas): flags should move, there is also a[2-15] available */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is extremely important. Most cpus predict inidirect branch to link register as "return", and they use a special stack to predict the target address of these. For other indirect jumps, they will not use this stack which has a high performance costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend deleting the comment about a[2-15]
being available. Access registers aren't really suitable for flags since there aren't any instructions to set the condition code or branch based on their value. Moving values between access registers and general purpose registers is also likely to be a multi-cycle operation (the scheduling information in LLVM is a good source of estimates for this sort of thing: https://github.com/llvm/llvm-project/blob/819c1651f723b51490eb98b3fc11cbd7dafc7831/llvm/lib/Target/SystemZ/SystemZScheduleZ14.td#L699).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the comment wasn't about moving the flags there but on probably using them if we run out of registers for other things.
I have an old branch that was using r13 for it, but I couldn't get it not to crash (after a very long night and too many wild chases) and couldn't figure out why either since the generated code looked sane (considering the different registers used, of course) and abandoned.
my plan was to do that and use an "a" registers as last resort (but for regular use only) since we were running quickly out of registers once we took seriously all those "reserved" by the ABI, but remember that I didn't document the extra registers while changing that line ;)
disable autodetection for s390x, since it still a work in progress and not yet optimized, but add put_label and support for alternate allocators while rebasing/updating the code base. adjust syntax to match the style used by the rest of the codebase for easy of maintenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still many improvements needed, but lets use this as first version. I will try to get an account for IBM cloud to work on them. I suggest to do the code quality improvements first before we add new features.
implements (eventhough not in the cleanest or more performant way) all functionality that is required to pass all tests in both sljit and pcre2, but still lacking :
to enable will need to explicitly define SLJIT_CONFIG_S390X at compile time and it is known to work with z12 or later (using a Linux ONE remote machine[0], either directly or through CI) with a expected target of z13 or z14 for the needed vector support that is required for a high performance JIT in PCRE2.
Partially closes: #89
[0] https://linuxone.cloud.marist.edu/#/register?flag=VM