s390x: initial port #91

carenas · 2020-09-05T08:16:08Z

implements (eventhough not in the cleanest or more performant way) all functionality that is required to pass all tests in both sljit and pcre2, but still lacking :

support for FPU and related functions (including vector operations)
several unimplemented or partially implemented opcodes/features (obviously none needed by PCRE with the target CPU)
no optimizations or profiling (including plenty of TODO that might need tackling before this can be released to users)

to enable will need to explicitly define SLJIT_CONFIG_S390X at compile time and it is known to work with z12 or later (using a Linux ONE remote machine[0], either directly or through CI) with a expected target of z13 or z14 for the needed vector support that is required for a high performance JIT in PCRE2.

Partially closes: #89

[0] https://linuxone.cloud.marist.edu/#/register?flag=VM

zherczeg · 2020-09-05T19:32:26Z

Let me know when the code is ready for review. Considering the missing parts, this patch definitely not closes #89.

carenas · 2020-09-05T20:39:15Z

Let me know when the code is ready for review. Considering the missing parts, this patch definitely not closes #89.

it is ready for review for step 1, and while I agree it doesn't fully close #89, I was hoping to close at least the "redefined" objective on it so I can get the work for step 2 going

zherczeg

This patch is quite big, will take time to review (I will do it in multiple steps). I noticed a large amount of syntax errors, I would recommend to fix them in the meantime. Please also squash the commits.

.travis.yml

zherczeg · 2020-09-06T11:29:53Z

sljit_src/sljitConfigInternal.h

@@ -715,6 +720,12 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr);
 #define SLJIT_NUMBER_OF_SAVED_REGISTERS 5
 #define SLJIT_LOCALS_OFFSET_BASE 0

+#elif (defined SLJIT_CONFIG_S390X && SLJIT_CONFIG_S390X)
+
+#define SLJIT_NUMBER_OF_REGISTERS 12


Only 12? I expected 14 based on the abi.

I was guessing that is used for making sure the 2 "reserved" registers are not touched (indeed SLJIT_R0 == r2), but I admit it is one of those things from the original code that would had been useful to discuss with the original author (@mundaym).

same goes for the FPU avoidance and the use of the constant pool (eventhough I think I had figured out that last part, and the cache comment was just deceiving, hence why it was removed and replaced by an empty CACHE_CLEAR as it is done in the other CISC architecture)

Thanks for taking a look at this @carenas and @zherczeg, it's good to see it being worked on again. I am happy to answer questions. Feel free to tag me.

I was guessing that is used for making sure the 2 "reserved" registers are not touched

Yes, I set the number of registers to 12 because some of the registers are 'reserved' to hold temporary values during codegen.

same goes for the FPU avoidance

I think we just ignored it in our initial port because (at least at the time) PCRE2JIT didn't use the FPU and we never finished it.

the use of the constant pool

The constant pool is probably overengineered. You could emit a LLILF-IIHF pair for each constant load. There are definitely some gotchas when modifying instructions like that though. In particular the effect on the performance of the instruction cache can be bad (they usually aren't heavily optimized for self modifying code - I think this is true on x86 too though).

Hi @mundaym!
Thank you for joining. I forgot about the stack pointer, but usually 2 temporary registers are enough to implement al instructions. I hope at some point we will only use r0 (since it cannot be used for addressing) and r14 (link register).

On the fly code modification is not frequent, but they often worth it, so their cost should be negligible in the long run.

I used constant pool on ARMv5, but I decided to never use it again. It make things way too complex.

sljit_src/sljitLir.c

zherczeg · 2020-09-06T11:53:45Z

sljit_src/sljitNativeS390X.c

+
+/* General Purpose Registers [0-15]. */
+typedef sljit_uw sljit_gpr;
+const sljit_gpr r0 = 0; /* 0 in address calculations; reserved */


reg_map should be used here.

zherczeg · 2020-09-06T12:00:31Z

sljit_src/sljitNativeS390X.c

+	return 0;
+}
+
+static SLJIT_INLINE long get_hwcap()


What this function does?

returns a 64bit bitmap with "capabilities" that can be used to determine what the CPU supports (ex: VECTOR operations)

the original code has several of those, that seem to have been done statically at compile time and also at runtime, probably to optimize them but also to provide wide enough support.

since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO) but it is better to keep it for now since it will be useful in step 3 and beyond (even if supporting QEMU is likely not possible, but there might be some IBM customers with older hardware)

Just some background: the HWCAP bit vector gives you information about what hardware features the Linux kernel supports. This is important since STFLE might tell you that vector instructions are available on the machine but if the kernel does not advertise support for them then they can't be used (the kernel needs to save/restore vector registers during context switches amongst other things).

since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO)

Yeah, if targeting z14 and above then I'd just drop facility checks like genext (maybe do them at startup to give a nice error message on older machines). When initially starting the port I added the checks but as it became clear that it was not going to be easy to support older machines I started to leave them out which is why they are a bit patchy.

sljit_src/sljitNativeS390X.c

This port of SLJIT has (at least) the following limitations: - Only works on recent machines (z12 and above) - No 64-bit multiply overflow detection on machines before z14 - Poor flag performance due to emulated flag registers https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental

zherczeg

I will continue the review later.

sljit_src/sljitLir.c

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-07T17:55:04Z

sljit_src/sljitNativeS390X.c

+
+/* RR form instructions. */
+#define SLJIT_S390X_RR(name, pattern) \
+SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \


I would prefer these to be macros (since they are not even static functions).

they are defined as an inlined static function (i fixed that on my first patchset) through the (non C89 compatible variadic macro) in line 333

https://github.com/carenas/sljit/blob/s390x/sljit_src/sljitNativeS390X.c#L334-L335

It is possible to do it without variadic macro.

#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins #define SLJIT_S390X_RR(name, pattern) \ SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)

added as TODO for later as it will require undef of this macro on each redefinition and we might have a different solution long term (ex: non function macros)

note though that this code will only be built with gcc/clang and both seem to support this feature on this platform, so strict c89 compliance might not be a requirement IMHO (most other issues pointed out from pedantic were cleaned in the first batch)

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-07T17:59:23Z

sljit_src/sljitNativeS390X.c

+		return push_inst(compiler, iihf(target, (sljit_u32)(v>>32)));
+	}
+	/* TODO(mundaym): instruction sequences that don't use extended immediates */
+	abort();


We can make have_eimm mandatory. Crashing here probably not preferred.

those abort() were left explicitly (together with a matching TODO) as the target for this codebase are still developers (also why is disabled for auto detection).

once step2 is completed, it will be mandatory because the code will only be enabled if targeting z14 (or most likely z13, since the Ubuntu instances from Travis can't build for anything newer than that), but leaving it there allows us to develop and test and implementation (once a developer removes the safeguards)

if the community fails to materialize, or there is no demand for it (which will require people finding this after they try to force it and crash) then we can remove it and make the "officially unsupported" claim and we should make that choice before this code is enabled for end users (with auto detection on) but I think it is premature now.

also I am no mainframe expert (maybe @edelsohn could find one) and while it is likely eimm should be available in all cases (since it comes with z10 or higher) there might be cases were it is not enabled (maybe needs a license, or is an option that can be turned off at the hypervisor) so crashing can be useful (with the constrains explained above) and removing it makes us lose visibility of places where the codebase might need improvement.

I'm a little confused about this Travis comment because Sleef is testing in Travis and getting z15. There only are z15 systems. Maybe this is because the Travis yml configuration is requesting xenial instead of bionic.

It's a little confusing to have technical questions scattered through a pull request review. Maybe the technical questions could be in the feature request issue itself to keep them localized? I definitely can forward questions or ask mainframe colleagues to join to answer questions. And when the community starts to look at performance issues, the IBM Linux on Z team definitely can assist with performance analysis and advice.

carenas · 2020-09-08T07:35:33Z

Hopefully no more style issues; code still has plenty of TODOs and abort() but that is expected as it only meant to be a starting base for further development, with a focus of supporting enough JIT to allow PCRE to run and a target of developers (not final users)

it does have enough support to run cleanly both SLJITs and PCREs tests in recent enough hardware though (z15 based VMs/containers from Linux ONE) using both gcc and clang, and RedHat 8 (compiled for z14) and two recent versions of Ubuntu

zherczeg

We are at halfway now!

zherczeg · 2020-09-08T17:49:38Z

sljit_src/sljitConfigInternal.h

+
+#define SLJIT_NUMBER_OF_REGISTERS 12
+#define SLJIT_NUMBER_OF_SAVED_REGISTERS 8
+#define SLJIT_LOCALS_OFFSET_BASE 160


160 bytes are quite big. Does it really needed?

The platform ABI specifies a 160 byte area allocated in the callers stack frame for callee-save registers. So when calling C functions up to 160 bytes of the callers stack frame might be clobbered:

: local : : variables : | | +---------------+ <- caller stack pointer + 160 | | | callee-save | | registers | | | +---------------+ <- caller stack pointer | | : callee stack : : frame :

Having such a large callee-save area can be a bit wasteful but changing it would obviously cause compatability issues and in practice it works ok.

There is more detail about the (non-vector) calling convention here: http://legacy.redhat.com/pub/redhat/linux/7.1/es/os/s390x/doc/lzsabi0.pdf.

it is only needed if the function is not a leaf function (hence why I was expecting to get to it on step 3 with the other optimizations, as more of our functions should be leaf functions IMHO)

since sljit only supports up to 4 arguments (with the current implementation only managing 3) we are not going to need to go to the stack to get any parameters, but as @mundaym pointed out then our generated code won't be fully ABI compatible either (the part that worries me more, is the mention of allocating a frame with space for the back chain + some 8 bytes of reserved space which is both marked as optional and as required in pages 24-26 of the legacy ABI document that was shared) but mainly because the creation of the frame is done by the caller instead of the callee which might be unique to this ABI

also important to mention that will be even bigger once vector instructions are used

zherczeg · 2020-09-08T17:57:56Z

sljit_src/sljitNativeS390X.c

+
+/* RR form instructions. */
+#define SLJIT_S390X_RR(name, pattern) \
+SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \


It is possible to do it without variadic macro.

#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins #define SLJIT_S390X_RR(name, pattern) \ SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)

zherczeg · 2020-09-08T18:01:02Z

sljit_src/sljitNativeS390X.c

+SLJIT_S390X_RILA(aih,   0xcc0800000000, sljit_s32) /* TODO(mundaym): high-word facility? */
+
+/* ADD LOGICAL IMMEDIATE */
+SLJIT_S390X_RILA(alfi,  0xc20b00000000, sljit_u32)


Is there warnings for these long constants?

no (and I was surprised as well), the C standard allows for literals to be assigned an integer type (depending on the architecture) up to unsigned long and at least in the gcc/clang that comes with RHEL8/s390x it automatically gets it to the right type without the need for suffixes or triggering warnings.

$ cat t.c #include <stdio.h> #define I 0xffffffff #define L 0xffffffffffffffff int main(void) { printf("%zu %zu\n", sizeof(I), sizeof(L)); return 0; } $ gcc -Wall -Wextra -o t t.c $ clang -Weverything -o t t.c $ ./t 4 8

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-08T18:20:02Z

sljit_src/sljitNativeS390X.c

+		sljit_ins *ibuf = (sljit_ins *)buf->memory;
+		for (i = 0; i < len; ++i, ++j) {
+			sljit_ins ins = ibuf[i];
+			if (ins & sljit_ins_const) {


If we have 64 bit, where 16 bit is never used, we could use more flags to handle the cases below in a similar way. Unless we want to reduce the instructon space in the future.

the use of instruction tags instead of the linked lists was an oddity of the original code that I punted changing originally to avoid unnecessary refactoring, but as you pointed out might be worth considering as well in all other generators, as well as for all types in this one.

since we need max 6 bytes for all instructions, using 8 in general makes sense (unlike x86 where we could have up to 15); everything else is RISC so having one single instruction size typedef works like a charm.

then again I think that is probably not worth refactoring right now either IMHO, in retrospective I probably should had documented more verbosely those things like the original author did by adding TODOs, but I was of the mentality of removing them instead of adding more ;)

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-08T18:25:58Z

sljit_src/sljitNativeS390X.c

+
+	/* emit the code */
+	j = 0;
+	for (buf = compiler->buf; buf != NULL; buf = buf->next) {


The other implementations does not need two loops. Cannot we compute the size in a similar way to others? Furthermore it seems only one constant pool is created. What happens if the offsets are too long? Can be a security issue.

the two loops was another of those things that looked like worth fixing in step 3 (or later) instead of now, and I have to admit that by doing it in 2 loops is easier to understand for new developers IMHO. note also that the first loop is needed because that is the only way we can calculate the code size BEFORE the code memory is allocated for now, and unlike other implementations this one does jumps and put_labels in only one loop ;)

the constant pool is usually created before the function code and there is one per function; you are correct this implementation is different but since they are 32bit wide shouldn't be a problem unless we have a JIT function bigger than 4GB, added a comment/TODO though

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-08T18:29:30Z

sljit_src/sljitNativeS390X.c

+	if (args >= 2)
+		FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S1), gpr(SLJIT_R1))));
+	if (args >= 3)
+		FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S2), gpr(SLJIT_R2))));


Is the fourth argument mapped to s3?

yes, through the reg_map; but not sure if more is needed as I had to admit I just found out about this fourth argument from your answer to #49

references to SLJIT_S3 and SLJIT_S4 in the testsuite seem to work fine so I am hoping it is, unless there is a deficiency there

zherczeg

I saw other issues, but I will try to do those in follow-up patches.

sljit_src/sljitNativeS390X.c

zherczeg · 2020-09-11T05:46:53Z

sljit_src/sljitNativeS390X.c

+#define tmp0	r0
+#define tmp1	r1
+
+/* TODO(carenas): flags should move, there is also a[2-15] available */


Yes, this is extremely important. Most cpus predict inidirect branch to link register as "return", and they use a special stack to predict the target address of these. For other indirect jumps, they will not use this stack which has a high performance costs.

I recommend deleting the comment about a[2-15] being available. Access registers aren't really suitable for flags since there aren't any instructions to set the condition code or branch based on their value. Moving values between access registers and general purpose registers is also likely to be a multi-cycle operation (the scheduling information in LLVM is a good source of estimates for this sort of thing: https://github.com/llvm/llvm-project/blob/819c1651f723b51490eb98b3fc11cbd7dafc7831/llvm/lib/Target/SystemZ/SystemZScheduleZ14.td#L699).

yes, the comment wasn't about moving the flags there but on probably using them if we run out of registers for other things.

I have an old branch that was using r13 for it, but I couldn't get it not to crash (after a very long night and too many wild chases) and couldn't figure out why either since the generated code looked sane (considering the different registers used, of course) and abandoned.

my plan was to do that and use an "a" registers as last resort (but for regular use only) since we were running quickly out of registers once we took seriously all those "reserved" by the ABI, but remember that I didn't document the extra registers while changing that line ;)

sljit_src/sljitNativeS390X.c

disable autodetection for s390x, since it still a work in progress and not yet optimized, but add put_label and support for alternate allocators while rebasing/updating the code base. adjust syntax to match the style used by the rest of the codebase for easy of maintenance.

zherczeg

There are still many improvements needed, but lets use this as first version. I will try to get an account for IBM cloud to work on them. I suggest to do the code quality improvements first before we add new features.

zherczeg requested changes Sep 6, 2020

View reviewed changes

carenas force-pushed the s390x branch 3 times, most recently from 33a7bba to 3e6f500 Compare September 7, 2020 03:38

carenas requested a review from zherczeg September 7, 2020 03:43

carenas force-pushed the s390x branch from 3e6f500 to fd6abb0 Compare September 7, 2020 07:05

zherczeg requested changes Sep 7, 2020

View reviewed changes

carenas marked this pull request as draft September 7, 2020 20:39

carenas force-pushed the s390x branch 10 times, most recently from 3ad8d2c to 4b8dfd1 Compare September 8, 2020 02:34

carenas marked this pull request as ready for review September 8, 2020 02:37

carenas force-pushed the s390x branch 3 times, most recently from 6990937 to aeb2f2d Compare September 8, 2020 07:12

carenas requested a review from zherczeg September 8, 2020 07:12

carenas force-pushed the s390x branch from aeb2f2d to 8f07d76 Compare September 8, 2020 18:27

zherczeg requested changes Sep 8, 2020

View reviewed changes

carenas force-pushed the s390x branch from 8f07d76 to 5af8798 Compare September 10, 2020 06:18

carenas requested a review from zherczeg September 10, 2020 16:00

carenas force-pushed the s390x branch from 5af8798 to e81a006 Compare September 11, 2020 05:28

zherczeg requested changes Sep 11, 2020

View reviewed changes

carenas force-pushed the s390x branch from e81a006 to 24ba574 Compare September 11, 2020 12:04

carenas requested a review from zherczeg September 11, 2020 12:22

zherczeg approved these changes Sep 12, 2020

View reviewed changes

zherczeg merged commit 9541992 into zherczeg:master Sep 12, 2020

carenas deleted the s390x branch September 14, 2020 08:01

carenas mentioned this pull request Feb 13, 2021

Basic enablement of s390x JIT #100

Closed

s390x: initial port #91

s390x: initial port #91

Conversation

carenas commented Sep 5, 2020 • edited Loading

zherczeg commented Sep 5, 2020

carenas commented Sep 5, 2020

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mundaym Sep 7, 2020 • edited Loading

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas commented Sep 8, 2020 • edited Loading

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 10, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carenas Sep 10, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

carenas commented Sep 5, 2020 •

edited

Loading

carenas Sep 6, 2020 •

edited

Loading

carenas Sep 6, 2020 •

edited

Loading

mundaym Sep 7, 2020 •

edited

Loading

carenas Sep 7, 2020 •

edited

Loading

carenas Sep 7, 2020 •

edited

Loading

carenas commented Sep 8, 2020 •

edited

Loading

carenas Sep 10, 2020 •

edited

Loading

carenas Sep 10, 2020 •

edited

Loading