Performance: Pallene checks in for-loops are preventing good compiler optimizations from being applied (loop unrolling & autovectorization) #564
Wow, these are very interesting numbers. Thanks for measuring them! One thing I want to add to Pallene is a separate array datatype that's backed by a C buffer instead of by a Lua table. Similarly to how we have two kinds of records: Pallene records implemented as userdata and Lua records implemented as Lua tables. A separate type for arrays would avoid run-time tag checks and would also use less RAM (which also speeds things up). The catch is that such arrays cannot be directly manipulated from Lua. If Lua wants to read or write to these arrays it would need to invoke a method or metamethod. What do you think? Would this do what you're looking for?
I'd rather not make Pallene unsafe; it's asking for trouble.
Maybe! In theory we could split things into two loops. First we check all the type tags and then the second time around we can speed through the loop without any type checks. However, it's possible that having to go through the array twice will cost more than going through it only once.
Cool idea! And it might be more practical than you think. Roberto told me he's considering changing the implementation of arrays so that type tags are stored separately from the values. Instead of TVTVTVTV, it would be TTTTVVVV. (The current implementation wastes some RAM due to padding, because tags are smaller than values.) If Lua ends up doing that, maybe we could represent homogeneous arrays as just TVVVV, with a single shared tag, which would be even more of a space savings.
That's an interesting theory. If you would be willing to test some manual loop unrolling, it would be valuable info.
Thinking a bit more, there are some challenges due to tag variants. Lua uses an 8-bit tag with 4 bits for the type and 4 bits for the "variant". Although lists often have a homogeneous type, the variants might often differ. For example, boolean true and false are two different variants, as are Lua function closures vs C function closures. Another tricky case is arrays with optional or missing values (nil or false). But maybe there's a way to do it...
I meant to respond a week ago, but after doing a bunch of experiments, it took me way longer to clean everything up and write up the results:
This didn't help as much as I hoped. You can see that the manual unroll-by-2 is not much different, though the unroll-by-4 does get a small performance boost. But my conclusion is that the checks themselves are the real bottleneck. Removing those checks yields far more speed, and it seems to also remove the need to manually unroll the loops, since the compiler then does it for us.
P.S. I think pallene_renormalize_array can be simplified from:
to:
I did not see any noticeable impact on performance with this change. It was just one of the things that I tried when trying to manually inline to make things shorter. Thought I would mention it though.
2b. A C API to interact with these Pallene arrays would be very desirable. I would love the ability to say, memcpy, between C and Pallene arrays to transfer data between them really fast. And in extreme cases, being able to get access to the array address in C and know the size and offsets so it can be directly read or written to would be amazing. (My thinking about such a feature is somewhat influenced by how we copy memory buffers from the CPU to the GPU in APIs like OpenGL.)
3a. Already, because of my previous crashing bugs, I am frequently switching between Pallene and Lua. If my code had to care about the array type and it wasn't transparent, I would be in big trouble.

3b. Particularly for arrays of primitives, it is very nice working with others who are not advanced coders and only deal with Lua. Allowing them to write code as if it were just a normal float array in Lua makes things so easy.

3c. Similarly, it is nice to be able to pull off-the-shelf existing functions/modules that operate on regular Lua arrays (of numbers).

3d. My hope is that, at least for arrays of primitives, you can both provide the index accessor metamethods, e.g. a[i], and also make these fast. My takeaway from the Lua/C API is that making C arrays accessible to Lua with index accessors is really slow because every access crosses the FFI bridge. So in the end, we lose more performance than we gain if the user frequently accesses individual elements on the scripting side. The problem is that this message is not conveyed well in general, and entire ecosystems sometimes get built up around complete misunderstandings of how things work and hand-wavy beliefs that "C/native is always faster". The Lua community has been better at this than most. Most people using Python don't really understand what's going on under the hood, and they cling to the trope that if they use X (substitute X with NumPy, Pandas, PyPy, a C module, etc.) for all their arrays, everything will get magically fast (it doesn't). So with Pallene, I really hope we can make the plain and simple for-loop on arrays of numbers as fast as what you would expect from an optimizing C compiler on the backend. This would avoid recreating all the strange mysticism seen in Python about needing magical libraries, hand-written C modules, or a JIT to make things fast.
So back to Pallene/Lua arrays: what I hope you can do is use your private Pallene-internal access to Lua internals to make individual element access on a Pallene array about on par with a regular Lua array. (I think it is reasonable to treat arrays of records differently from arrays of primitives, or even arrays of numbers, if this can't easily be extended universally.)

3e. This is my first time using LuaLanes. Lanes has limitations in how it handles userdata across lanes, and it requires people to implement a custom clone metamethod for their userdata if they want it to be handled by Lanes. Because Lanes is popular enough, I think some modules implement this metamethod. But this made me really appreciate how I was using native Lua data types with Pallene and didn't have to worry about this. It's not that I don't want better arrays for Pallene; this is just one of those subtle trade-offs I've only recently come to appreciate. (And I can already see the desire for "Pallene Power Patches" to add the clonable metamethod to Pallene arrays so they can be copied across Lanes :P)

Bottom line: I think these checks inside the for-loop have an outsized negative performance impact, and we should try to find ways to hoist them out of the loops. I believe it is reasonable to assume:
So Pallene loops are a potential source of magnified gains or losses, since loops are one of the key places where modern CPUs are designed to extract outsized gains if you use them right. One pseudo-algorithm I was thinking of: I presume the Lua internals have a common function point for adding or changing an element in the array part. If so, then if we create a homogeneity bit flag somewhere to track the state, it can be used exclusively for true homogeneity (and thus weed out the subvariants). I'm also focusing only on the "array part" of the Lua table; my thesis is that we only care about the array part. The two events to handle are adding to (or changing) the array part, and removing an array element. Also, for pointer types, maybe the checks aren't as bad, since there will be pointer chasing which will probably kill performance anyway. But for numerical arrays/loops, I presume this is where we can get outsized performance gains if we keep the code lean. I am attaching more source code examples which include some of the experiments I ran. As a reminder, in Compiler Explorer, set the compiler flags appropriately as a starting point; other flags can also be interesting. Also, switching to arm64, I believe, automatically utilizes NEON SIMD instructions.
Sorry, I also had to wait a week to reply :) I was really busy moving to my new apartment, and this discussion deserved a well-reasoned response.
Is it 24 seconds for the two loops (about 12 seconds per loop)? If so, it would be an impressive speedup regardless, albeit not the 2x speedup we'd need to blindly apply this code transformation everywhere.
Had the compiler been able to vectorize the loop, I would have expected a larger performance boost. Maybe it's still not being able to vectorize it?
Typically
At a first glance I would expect that the compiler would produce identical code. Am I missing something here?
The lack of a module system bites us again. This is not at all a fundamental limitation of Pallene; we just haven't had the time to properly implement this feature. (If you want to help with this, it would be very much appreciated; someone who's already writing multi-file Pallene programs should have very useful insights about how the module system should work.)
I feel that this is an argument in favor of at least adding support for arrays of simple non-pointer types, as a first step.
Let me know! My main question is whether special arrays fit better as an FFI thing (because one might want to use arrays of structs made by C) or as a core part of Pallene language (like records).
Good point. Although it sounds more like a Pallene API to interact with C arrays.
You hit the nail on the head! This might be the biggest trade-off between Lua tables and malloc-ed arrays. With records, we do not expose an __index metamethod, so the Lua side is warned that accessing a field will cost a function call.
For sure. But we do have to live with the fact that some optimizations are harder to do on Lua arrays...
I'm afraid that fast metamethods might be impossible. The issue is less about the FFI bridge and more about the metamethod call itself. A Lua function call to implement the metamethod is more expensive than a raw GETINDEX operation.
In theory, Pallene for loops can be fast when accessing a Lua table and even faster if we add a custom array type. Lua loops will be regular speed for Lua tables but much slower for a custom array type.
Noted.
This is a laudable goal, although for now I'm not sure if it's something we can do...
One problem is that if at any point the array "escapes" or we call a Lua function, then we can no longer optimize the tag checks because Lua might have modified the array behind our backs. This can make the optimization fragile, because small changes to the program might prevent the optimization from kicking in.
Just to be clear, this change would be inside the Lua interpreter or inside the code emitted by the Pallene compiler?
Right.
This is a bit of a follow-up to the other performance thread I posted at #561.
I've been looking a little deeper into the compiler's ability to unroll loops and autovectorize Pallene generated C code.
For context, the era of data-level parallelism via SIMD is already well upon us, and this is one of the few ways to harvest outsized gains on CPUs that are no longer ramping up clock speeds. This trend only seems to be continuing. So we should be constantly on the lookout for ways to take advantage of the hardware when there are opportunities (especially when the C compiler/optimizer is willing to do most of the hard work for us if we just do the "right" things).
TLDR: I believe the checks generated in Pallene for-loops are preventing the compiler/optimizer from generating much better code, sacrificing non-trivial performance gains we might otherwise get for free. I'm hoping we can find a way to change things so that the compiler can apply its much better optimizations.
So I experimented with a very simple for-loop that iterates through an array of doubles and adds up all the values in the array.
I wrote 2 reference programs in C, and a program in Pallene.
This is my simplified interpretation of what the internal Lua array looks like. (The real Lua uses a union as the first field, so this is simplified to reduce the union down to a double.)
So I believe this case should represent the theoretical max of what we hope the emitted Pallene code can achieve.
In all the results, the "loop time" is the important part. Startup time doesn't count.
3a. pallenec, no extra optimizations
3b. pallenec with CFLAGS:
3c. hand modified --emit-c file removing both checks inside for-loop:
3d. hand modified --emit-c, removing only first check:
3e. hand modified --emit-c, removing only second check:
To sum up: after I remove both checks inside the for-loop in the Pallene-generated code, we get performance that matches the pure C counterpart. But with the checks left in, the optimizer is unable to unroll and vectorize.
So we are looking at a ~23% slowdown caused by the two checks inside the loop.
*Additional note: I didn't benchmark with more aggressive processor optimization flags, such as -mavx2, -mavx512f, but I did look at the generated assembly, and you can definitely see an impactful change in how it uses the SIMD registers. I think gcc by default is using something that looks like -msse2 on my computer. This makes sense because every x86-64 CPU can do SSE2.
Some possible ideas to improve this:
Add a compiler optimization flag (e.g. "--unchecked-loops") that removes these checks. This would be opt-in, only for users who know what they are doing.
Is there a way to generate these checks that doesn't cause the compiler to give up on the optimizations?
a. Assuming there isn't a different way to write the same exact behavior that doesn't cause the optimizer to give up, one idea is that maybe for the check, we just keep track of a "found_an_error" boolean through the loop. Then after the end of the loop, we check the flag, and then outside the loop, we finally trigger the error. It comes a little late, but maybe this can be part of an optional compiler flag.
b. Additionally, with this deferred-error idea, maybe we only apply it to certain types that are less likely to cause dangerous errors before the end of the loop. For example, floats and integers might be able to defer the error, while for other types we always keep the checks. The idea is that the program will eventually trigger an error anyway, so hopefully garbage numbers won't be able to do real damage, especially behind an opt-in compiler flag.
Is there a way we can hoist these checks outside the loop?
a. Is there something clever we can do before or after the loop to check?
b. Maybe we can iterate through the loop twice. The first pass does just the checks; the second pass runs the code without the checks. We'll need to benchmark, but the hope/presumption is that the performance gains won from the optimizer will vastly outweigh the cost of looping through the array twice.
This will probably require cooperation with Lua. So down the line, if Lua decides Pallene will be an official feature moving forward, maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not. If the array is known to be homogeneous, the checks can be skipped. If Lua adds or changes a value, resulting in a heterogeneous array, it can set a flag that Pallene can read. This would allow Pallene to choose between two different code paths, with-checks and without-checks. Presumably, once Lua determines the array is no longer homogeneous, it can't unset the flag unless the array is emptied (or down to 1 element).
Pallene compiler/optimizer manually unrolls loops itself. Rather than completely relying on the underlying C compiler, Pallene could attempt to unroll some loops itself and try to generate output that might make it easier for the C autovectorizer to find operations it can combine into SIMD registers.
Most modern CPUs (both desktop and mobile) support at least 128-bit wide SIMD instructions. So assuming Pallene/Lua is using 64-bit ints and floats, I think an unrolling factor of 2 might be sufficient. Some flexibility to unroll more (e.g. 4, 8, or 16), controlled by compiler flag options, would be nice to have for more powerful architectures, for Lua compiled with 32-bit number types, or just for performance experimentation. Even when the C compiler isn't able to fully vectorize, the loop unrolling by itself sometimes gives a big enough performance boost to make it worth it. Since Lua is also used on embedded hardware, which may benefit more from small code size and may lack SIMD, compiler flags to turn unrolling on/off are desirable.
I'm attaching my example files, plus the assembly output of the Pallene code, with and without the checks removed.
You can use the -S flag to generate the assembly output.
https://godbolt.org is also great, and I recommend pasting in the C programs there. (The Pallene program is harder to use with Compiler Explorer because of the extra headers needed for lua.h.)
UNROLL_VEC.zip