Latest Mindcode optimizations - demonstration #106
Closed
cardillan
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
There were some rather complex compiler optimizations added to Mindcode lately. While I believe these optimizations are very efficient, evaluating the effects of optimizations is rather difficult, as it strongly depends on the kind of the optimized program. To demonstrate the effects of these new optimizations, I've created a program which measures the item levels in the core (or theoretically any container) and displays them on a large display. As Mindustry Logic doesn't support printing text on the logic display directly, the program needs to draw each digit separately using primitive drawing operations (in our case, lines and rectangles). This makes for a very draw-heavy program, which, in my experience, can often benefit a lot from the optimizations I want to showcase here.
The entire program can be found here. There's the
drawDigit
function, which is at the core of the program and is responsible for drawing individual digits. (Please ignore thedrawDigitSlanted
function - its purpose is to draw digits using less drawing operations and therefore draw the entire display not just a bit faster, but also in one go - there is a limit on the number of drawing operations that can be made before sending the result to the display usingdrawflush
instruction. Digits printed by this function are a bit less readable.)The program works nicely - give it a try in Mindustry yourself. When run on a Logic Processor, which executes 480 instruction per second, the program, as compiled with all optimizations on current Mindcode, takes about 2.5 to 2.8 seconds to redraw the screen, depending mainly on the numbers being actually drawn. Left and right columns are updated separately, so the actual update period isn't perceived to be that long.
Measuring efficiency of various versions of the program created using different optimization settings by timing the runs in Mindustry is a bit tedious and quite imprecise, as the timings varies depending on actual numbers being drawn and probably some other factors. For these reasons I've created a simplified version which omits some initializations and a few drawing operations (such as setting up colors). This version of the program is then run on a simulated processor, which allows us to count the number of instructions executed and therefore obtain the measurements with absolute precision.
The benchmarked program draws a single number in all positions, for simplicity and repeatability. As different numbers are drawn using different code paths, I've run the simulations using three different numbers: 21600, 12345 and 13579. The first number is the fastest to draw of the three (0 and 1 are both drawn using one drawing operation), and was chosen as it is most commonly displayed when the core is full of materials. The remaining numbers are similarly complex to draw, but the last one usually takes the most instructions, due to going through less efficient branches in the
case
expression selecting the number to be drawn.There isn't an intention to test all possible combinations of optimizations. The tests starts with optimizations turned completely off and then proceeds to turn them on one by one in a predefined order. It must be noted that the optimizations can (and do) interact with each other, so activating them in different order might lead to different intermediate results - and might lead us to ascribe improvements that are a result of two or more optimizations interacting, to a single optimization that was the latest added to me mix.
So, with the above precautions, I hereby present the results of the benchmark described above:
Each row of the table consists of one configuration tested. Column 2 contains the size of the code compiled using given optimizations. Columns 3 to 5 contain the number of steps the benchmarked program needed to run while drawing the numbers contained in the header. Columns 6 to 8 contain the difference between the current and previous row, in other words the effect of the additional optimization(s) applied in given configuration. The last triplet of columns shows the number of steps taken to execute the given program as a percentage of the original, non-optimized code. Also, some optimizations are missing in the table, because they had no effect on this particular benchmark.
In this treatise, I want to concentrate of the latest optimizations that were added to Mindcode. They're capable of significantly improving code execution speeds - some of them are aimed at making the code faster at the expense of increasing code size (see optimization for speed).
The first of those is Data Flow Optimization. This optimizer analyzes variables usage and values assigned to them, and is able to reuse identical expressions that are present in the source code. As such, it is very helpful in drawing-heavy programs such as this benchmark. Let's have a look at the code drawing the
5
digit:All these
x + WIDTH
,y + HEIGHT
expressions are computed just once and then reused in the rest of the code block. This optimization could be emulated by the programmer by creating variables for all those reused expressions, such asx_width = x + WIDTH
,y_height = y + HEIGHT
and using those, but letting the optimizer do it helps the programmer, makes the source code cleaner and might even sometimes spot possibilities for optimizations that the programmer missed. Since the benchmark is really draw-heavy, this optimizer has, by far, the largest effect on it.Function Inlining doesn't do much at this moment, but we'll revisit it later.
Loop Unrolling in the basic form provides modest improvement per iteration, but since the innermost loop, that was also unrolled, is executed 80 times, the improvements do add up nicely.
Case Switching is the hero in this story. It dramatically speeds up drawing of the higher digits, at the expense of slower drawing of lower digits. If the displayed numbers were drawn evenly from the range of all 5-digit numbers, the improvement in the average case would be significant (we can regard the 13579 number slightly worse than an average case.) The number drawn probably most often, 21600, has a markedly worse performance. However, on average, the performance of drawing different numbers is evened out, which might be regarded as a positive effect in itself.
What's the best we could theoretically do? When inspecting the resulting code, we can see that there are still loops that could be unrolled, if we could go over the 1000 instruction limit. Compiling the code with
#set instruction-limit = 10000
produces 4848 instructions, but results into all loops unrolled. The processor emulator isn't limited to 1000 instructions, so from the emulated run we get:That's a reduction of 30% over the best case so far! How does the code actually look like? I'm going to show only a small excerpt:
This is the code that prints "9" at a fixed position on the screen -- since every function was inlined and every loop unrolled, the result is a linear code which contains a separate switch statement for printing a given digit at every possible position on the screen. This allows Mindcode to precompute every single expression that relates to drawing operations, the only variables left in the program are those concerning the actual values being printed.
Can this help us somehow? We obviously can't use this code, it's too large for Mindustry processors. Let's review the code printing "9" in the best usable version:
(Note: the x coordinate is represented by
__fn2_x
, while the y coordinate byy
. This might look strange, but it is because the Data Flow Optimizer replaces variables as it sees fit. For some reason it let the x coordinate be represented by the inlined function variable, while replacing__fn2_y
with they
variable passed in as an argument.)Each drawing coordinate must be computed from the basic position provided by the variables/function arguments,
__fn2_x
andy
.By further inspecting logs produced by command-line compiler, we also notice that there's a loop at line 86 which is unrolled:
The loop at line 86 is a loop which prints all five digits of a particular number:
This means that there's a separate sequence of code for each digit of the 5-digit numbers being printed. On top of this, there's a loop which prints numbers in a grid consisting of two columns and eight rows. We obviously can't do anything about the eight rows - the code currently has over 500 instructions, even if only a quarter was the drawing loop (and it is more), unrolling would multiply that number by eight, taking us over the 1000 instructions limit.
Can we do something about columns? Well, if we could create a separate code path for each column, the x positions of each digit could be evaluated fully down to a constant. That could save us some instructions. So, let's have a look at the loop in question:
This loop prints both columns - the left one is printed first, until index is equal to 8, at which point the display is flushed and a coordinate for the right column is computed.
displayItem
is again inlined here, because it is a single function call. To turn it into two function calls with different, but constant values for thex
argument, we can do this:It's an improvement, but not what we expected. In the logs, we find this:
Mindcode cannot inline
displayItem
now, because it would need 449 instructions to do so, and only 396 are available at this moment. Bummer.Wait a moment! We expected that after the inlining all the computations of the
x
coordinates should be replaced by constant values, reducing the code size somewhat. Mindcode can't see that far (yet), but perhaps if we gave it more space, the functions could be inlined and still make it under 1000 instruction limits? We see that we're only some 53 instructions short, so we'll give it a shot by increasing the instruction limit just a bit:and we get:
Hooray! Allowing Mindcode to go over the 1100 instruction limit, all the optimizations we wanted were applied. Just to make sure, let's check the resulting code:
The code looks as expected: all x coordinates are replaced by a constant, and only the y coordinates are computed. And this is also the part where Function Inlining gets into the spotlight - the
drawDigit
method is called multiple times in the source code now and therefore isn't inlined during code generation. It's the combination of loop unrolling and function inlining, which allows all the x coordinates in the drawing instructions to be evaluated down to a constant.Conclusion
All in all, optimization for speed together with some manual tweaking gave us a speedup of about 30% compared to the code without any optimizations for speed. While this improvement is quite impressive, it must be noted that I've demonstrated a type of program suitable for optimization for speed. The same level of improvement cannot be expected on every program.
Beta Was this translation helpful? Give feedback.
All reactions