Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s390x: initial port #91

Merged
merged 2 commits into from
Sep 12, 2020
Merged

s390x: initial port #91

merged 2 commits into from
Sep 12, 2020

Conversation

carenas
Copy link
Contributor

@carenas carenas commented Sep 5, 2020

implements (eventhough not in the cleanest or more performant way) all functionality that is required to pass all tests in both sljit and pcre2, but still lacking :

  • support for FPU and related functions (including vector operations)
  • several unimplemented or partially implemented opcodes/features (obviously none needed by PCRE with the target CPU)
  • no optimizations or profiling (including plenty of TODO that might need tackling before this can be released to users)

to enable will need to explicitly define SLJIT_CONFIG_S390X at compile time and it is known to work with z12 or later (using a Linux ONE remote machine[0], either directly or through CI) with a expected target of z13 or z14 for the needed vector support that is required for a high performance JIT in PCRE2.

Partially closes: #89

[0] https://linuxone.cloud.marist.edu/#/register?flag=VM

@zherczeg
Copy link
Owner

zherczeg commented Sep 5, 2020

Let me know when the code is ready for review. Considering the missing parts, this patch definitely not closes #89.

@carenas
Copy link
Contributor Author

carenas commented Sep 5, 2020

Let me know when the code is ready for review. Considering the missing parts, this patch definitely not closes #89.

it is ready for review for step 1, and while I agree it doesn't fully close #89, I was hoping to close at least the "redefined" objective on it so I can get the work for step 2 going

Copy link
Owner

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is quite big, will take time to review (I will do it in multiple steps). I noticed a large amount of syntax errors, I would recommend to fix them in the meantime. Please also squash the commits.

.travis.yml Outdated Show resolved Hide resolved
@@ -715,6 +720,12 @@ SLJIT_API_FUNC_ATTRIBUTE sljit_sw sljit_exec_offset(void* ptr);
#define SLJIT_NUMBER_OF_SAVED_REGISTERS 5
#define SLJIT_LOCALS_OFFSET_BASE 0

#elif (defined SLJIT_CONFIG_S390X && SLJIT_CONFIG_S390X)

#define SLJIT_NUMBER_OF_REGISTERS 12
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 12? I expected 14 based on the abi.

Copy link
Contributor Author

@carenas carenas Sep 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was guessing that is used for making sure the 2 "reserved" registers are not touched (indeed SLJIT_R0 == r2), but I admit it is one of those things from the original code that would had been useful to discuss with the original author (@mundaym).

same goes for the FPU avoidance and the use of the constant pool (eventhough I think I had figured out that last part, and the cache comment was just deceiving, hence why it was removed and replaced by an empty CACHE_CLEAR as it is done in the other CISC architecture)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this @carenas and @zherczeg, it's good to see it being worked on again. I am happy to answer questions. Feel free to tag me.

I was guessing that is used for making sure the 2 "reserved" registers are not touched

Yes, I set the number of registers to 12 because some of the registers are 'reserved' to hold temporary values during codegen.

same goes for the FPU avoidance

I think we just ignored it in our initial port because (at least at the time) PCRE2JIT didn't use the FPU and we never finished it.

the use of the constant pool

The constant pool is probably overengineered. You could emit a LLILF-IIHF pair for each constant load. There are definitely some gotchas when modifying instructions like that though. In particular the effect on the performance of the instruction cache can be bad (they usually aren't heavily optimized for self modifying code - I think this is true on x86 too though).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mundaym!
Thank you for joining. I forgot about the stack pointer, but usually 2 temporary registers are enough to implement al instructions. I hope at some point we will only use r0 (since it cannot be used for addressing) and r14 (link register).

On the fly code modification is not frequent, but they often worth it, so their cost should be negligible in the long run.

I used constant pool on ARMv5, but I decided to never use it again. It make things way too complex.

sljit_src/sljitLir.c Outdated Show resolved Hide resolved
sljit_src/sljitLir.c Outdated Show resolved Hide resolved

/* General Purpose Registers [0-15]. */
typedef sljit_uw sljit_gpr;
const sljit_gpr r0 = 0; /* 0 in address calculations; reserved */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reg_map should be used here.

return 0;
}

static SLJIT_INLINE long get_hwcap()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this function does?

Copy link
Contributor Author

@carenas carenas Sep 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returns a 64bit bitmap with "capabilities" that can be used to determine what the CPU supports (ex: VECTOR operations)

the original code has several of those, that seem to have been done statically at compile time and also at runtime, probably to optimize them but also to provide wide enough support.

since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO) but it is better to keep it for now since it will be useful in step 3 and beyond (even if supporting QEMU is likely not possible, but there might be some IBM customers with older hardware)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some background: the HWCAP bit vector gives you information about what hardware features the Linux kernel supports. This is important since STFLE might tell you that vector instructions are available on the machine but if the kernel does not advertise support for them then they can't be used (the kernel needs to save/restore vector registers during context switches amongst other things).

Copy link
Contributor

@mundaym mundaym Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we will be targeting z14, it is mostly noise (and indeed the coverage for the !have_genext() needs work IMHO)

Yeah, if targeting z14 and above then I'd just drop facility checks like genext (maybe do them at startup to give a nice error message on older machines). When initially starting the port I added the checks but as it became clear that it was not going to be easy to support older machines I started to leave them out which is why they are a bit patchy.

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Show resolved Hide resolved
This port of SLJIT has (at least) the following limitations:

- Only works on recent machines (z12 and above)
- No 64-bit multiply overflow detection on machines before z14
- Poor flag performance due to emulated flag registers

https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental
Copy link
Owner

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will continue the review later.

sljit_src/sljitLir.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved

/* RR form instructions. */
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer these to be macros (since they are not even static functions).

Copy link
Contributor Author

@carenas carenas Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are defined as an inlined static function (i fixed that on my first patchset) through the (non C89 compatible variadic macro) in line 333

https://github.com/carenas/sljit/blob/s390x/sljit_src/sljitNativeS390X.c#L334-L335

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to do it without variadic macro.

#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added as TODO for later as it will require undef of this macro on each redefinition and we might have a different solution long term (ex: non function macros)

note though that this code will only be built with gcc/clang and both seem to support this feature on this platform, so strict c89 compliance might not be a requirement IMHO (most other issues pointed out from pedantic were cleaned in the first batch)

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
return push_inst(compiler, iihf(target, (sljit_u32)(v>>32)));
}
/* TODO(mundaym): instruction sequences that don't use extended immediates */
abort();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make have_eimm mandatory. Crashing here probably not preferred.

Copy link
Contributor Author

@carenas carenas Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those abort() were left explicitly (together with a matching TODO) as the target for this codebase are still developers (also why is disabled for auto detection).

once step2 is completed, it will be mandatory because the code will only be enabled if targeting z14 (or most likely z13, since the Ubuntu instances from Travis can't build for anything newer than that), but leaving it there allows us to develop and test and implementation (once a developer removes the safeguards)

if the community fails to materialize, or there is no demand for it (which will require people finding this after they try to force it and crash) then we can remove it and make the "officially unsupported" claim and we should make that choice before this code is enabled for end users (with auto detection on) but I think it is premature now.

also I am no mainframe expert (maybe @edelsohn could find one) and while it is likely eimm should be available in all cases (since it comes with z10 or higher) there might be cases were it is not enabled (maybe needs a license, or is an option that can be turned off at the hypervisor) so crashing can be useful (with the constrains explained above) and removing it makes us lose visibility of places where the codebase might need improvement.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about this Travis comment because Sleef is testing in Travis and getting z15. There only are z15 systems. Maybe this is because the Travis yml configuration is requesting xenial instead of bionic.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing to have technical questions scattered through a pull request review. Maybe the technical questions could be in the feature request issue itself to keep them localized? I definitely can forward questions or ask mainframe colleagues to join to answer questions. And when the community starts to look at performance issues, the IBM Linux on Z team definitely can assist with performance analysis and advice.

@carenas carenas marked this pull request as draft September 7, 2020 20:39
@carenas carenas force-pushed the s390x branch 10 times, most recently from 3ad8d2c to 4b8dfd1 Compare September 8, 2020 02:34
@carenas carenas marked this pull request as ready for review September 8, 2020 02:37
@carenas carenas force-pushed the s390x branch 3 times, most recently from 6990937 to aeb2f2d Compare September 8, 2020 07:12
@carenas carenas requested a review from zherczeg September 8, 2020 07:12
@carenas
Copy link
Contributor Author

carenas commented Sep 8, 2020

Hopefully no more style issues; code still has plenty of TODOs and abort() but that is expected as it only meant to be a starting base for further development, with a focus of supporting enough JIT to allow PCRE to run and a target of developers (not final users)

it does have enough support to run cleanly both SLJITs and PCREs tests in recent enough hardware though (z15 based VMs/containers from Linux ONE) using both gcc and clang, and RedHat 8 (compiled for z14) and two recent versions of Ubuntu

Copy link
Owner

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are at halfway now!


#define SLJIT_NUMBER_OF_REGISTERS 12
#define SLJIT_NUMBER_OF_SAVED_REGISTERS 8
#define SLJIT_LOCALS_OFFSET_BASE 160
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

160 bytes are quite big. Does it really needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The platform ABI specifies a 160 byte area allocated in the callers stack frame for callee-save registers. So when calling C functions up to 160 bytes of the callers stack frame might be clobbered:

    :     local     :
    :   variables   :
    |               |
    +---------------+ <- caller stack pointer + 160
    |               |
    |  callee-save  |
    |   registers   |
    |               |
    +---------------+ <- caller stack pointer
    |               |
    :  callee stack :
    :     frame     :

Having such a large callee-save area can be a bit wasteful but changing it would obviously cause compatability issues and in practice it works ok.

There is more detail about the (non-vector) calling convention here: http://legacy.redhat.com/pub/redhat/linux/7.1/es/os/s390x/doc/lzsabi0.pdf.

Copy link
Contributor Author

@carenas carenas Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is only needed if the function is not a leaf function (hence why I was expecting to get to it on step 3 with the other optimizations, as more of our functions should be leaf functions IMHO)

since sljit only supports up to 4 arguments (with the current implementation only managing 3) we are not going to need to go to the stack to get any parameters, but as @mundaym pointed out then our generated code won't be fully ABI compatible either (the part that worries me more, is the mention of allocating a frame with space for the back chain + some 8 bytes of reserved space which is both marked as optional and as required in pages 24-26 of the legacy ABI document that was shared) but mainly because the creation of the frame is done by the caller instead of the callee which might be unique to this ABI

also important to mention that will be even bigger once vector instructions are used


/* RR form instructions. */
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION(name, sljit_gpr dst, sljit_gpr src) \
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to do it without variadic macro.

#define SLJIT_S390X_INSTRUCTION static SLJIT_INLINE sljit_ins
#define SLJIT_S390X_RR(name, pattern) \
SLJIT_S390X_INSTRUCTION name(sljit_gpr dst, sljit_gpr src)

SLJIT_S390X_RILA(aih, 0xcc0800000000, sljit_s32) /* TODO(mundaym): high-word facility? */

/* ADD LOGICAL IMMEDIATE */
SLJIT_S390X_RILA(alfi, 0xc20b00000000, sljit_u32)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there warnings for these long constants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no (and I was surprised as well), the C standard allows for literals to be assigned an integer type (depending on the architecture) up to unsigned long and at least in the gcc/clang that comes with RHEL8/s390x it automatically gets it to the right type without the need for suffixes or triggering warnings.

$ cat t.c
#include <stdio.h>

#define I	0xffffffff
#define L	0xffffffffffffffff

int main(void) {
  printf("%zu %zu\n", sizeof(I), sizeof(L));
  return 0;
}
$ gcc -Wall -Wextra -o t t.c
$ clang -Weverything -o t t.c
$ ./t
4 8

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Show resolved Hide resolved
sljit_ins *ibuf = (sljit_ins *)buf->memory;
for (i = 0; i < len; ++i, ++j) {
sljit_ins ins = ibuf[i];
if (ins & sljit_ins_const) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have 64 bit, where 16 bit is never used, we could use more flags to handle the cases below in a similar way. Unless we want to reduce the instructon space in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use of instruction tags instead of the linked lists was an oddity of the original code that I punted changing originally to avoid unnecessary refactoring, but as you pointed out might be worth considering as well in all other generators, as well as for all types in this one.

since we need max 6 bytes for all instructions, using 8 in general makes sense (unlike x86 where we could have up to 15); everything else is RISC so having one single instruction size typedef works like a charm.

then again I think that is probably not worth refactoring right now either IMHO, in retrospective I probably should had documented more verbosely those things like the original author did by adding TODOs, but I was of the mentality of removing them instead of adding more ;)

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved

/* emit the code */
j = 0;
for (buf = compiler->buf; buf != NULL; buf = buf->next) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other implementations does not need two loops. Cannot we compute the size in a similar way to others? Furthermore it seems only one constant pool is created. What happens if the offsets are too long? Can be a security issue.

Copy link
Contributor Author

@carenas carenas Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the two loops was another of those things that looked like worth fixing in step 3 (or later) instead of now, and I have to admit that by doing it in 2 loops is easier to understand for new developers IMHO. note also that the first loop is needed because that is the only way we can calculate the code size BEFORE the code memory is allocated for now, and unlike other implementations this one does jumps and put_labels in only one loop ;)

the constant pool is usually created before the function code and there is one per function; you are correct this implementation is different but since they are 32bit wide shouldn't be a problem unless we have a JIT function bigger than 4GB, added a comment/TODO though

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
if (args >= 2)
FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S1), gpr(SLJIT_R1))));
if (args >= 3)
FAIL_IF(push_inst(compiler, lgr(gpr(SLJIT_S2), gpr(SLJIT_R2))));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fourth argument mapped to s3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, through the reg_map; but not sure if more is needed as I had to admit I just found out about this fourth argument from your answer to #49

references to SLJIT_S3 and SLJIT_S4 in the testsuite seem to work fine so I am hoping it is, unless there is a deficiency there

Copy link
Owner

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw other issues, but I will try to do those in follow-up patches.

sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
#define tmp0 r0
#define tmp1 r1

/* TODO(carenas): flags should move, there is also a[2-15] available */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is extremely important. Most cpus predict inidirect branch to link register as "return", and they use a special stack to predict the target address of these. For other indirect jumps, they will not use this stack which has a high performance costs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend deleting the comment about a[2-15] being available. Access registers aren't really suitable for flags since there aren't any instructions to set the condition code or branch based on their value. Moving values between access registers and general purpose registers is also likely to be a multi-cycle operation (the scheduling information in LLVM is a good source of estimates for this sort of thing: https://github.com/llvm/llvm-project/blob/819c1651f723b51490eb98b3fc11cbd7dafc7831/llvm/lib/Target/SystemZ/SystemZScheduleZ14.td#L699).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the comment wasn't about moving the flags there but on probably using them if we run out of registers for other things.

I have an old branch that was using r13 for it, but I couldn't get it not to crash (after a very long night and too many wild chases) and couldn't figure out why either since the generated code looked sane (considering the different registers used, of course) and abandoned.

my plan was to do that and use an "a" registers as last resort (but for regular use only) since we were running quickly out of registers once we took seriously all those "reserved" by the ABI, but remember that I didn't document the extra registers while changing that line ;)

sljit_src/sljitNativeS390X.c Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
sljit_src/sljitNativeS390X.c Outdated Show resolved Hide resolved
disable autodetection for s390x, since it still a work in progress
and not yet optimized, but add put_label and support for alternate
allocators while rebasing/updating the code base.

adjust syntax to match the style used by the rest of the codebase
for easy of maintenance.
Copy link
Owner

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still many improvements needed, but lets use this as first version. I will try to get an account for IBM cloud to work on them. I suggest to do the code quality improvements first before we add new features.

@zherczeg zherczeg merged commit 9541992 into zherczeg:master Sep 12, 2020
@carenas carenas deleted the s390x branch September 14, 2020 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port PCRE2 JIT to Linux on IBMz (s390x)
4 participants