Latest Mindcode optimizations - demonstration #106

cardillan · 2023-07-17T17:24:12Z

cardillan
Jul 17, 2023
Maintainer

There were some rather complex compiler optimizations added to Mindcode lately. While I believe these optimizations are very effective, evaluating their effects is rather difficult, as it strongly depends on the kind of program being optimized. To demonstrate the effects of these new optimizations, I've created a program which measures the item levels in the core (or theoretically any container) and displays them on a large display. As Mindustry Logic doesn't support printing text on the logic display directly, the program needs to draw each digit separately using primitive drawing operations (in our case, lines and rectangles). This makes for a very draw-heavy program, which, in my experience, can often benefit a lot from the optimizations I want to showcase here.

The entire program can be found here. There's the drawDigit function, which is at the core of the program and is responsible for drawing individual digits. (Please ignore the drawDigitSlanted function. Its purpose is to draw digits using fewer drawing operations, which makes drawing the entire display not just a bit faster, but also possible in one go: there is a limit on the number of drawing operations that can be made before sending the result to the display using the drawflush instruction. Digits printed by this function are a bit less readable.)

The program works nicely - give it a try in Mindustry yourself. When run on a Logic Processor, which executes 480 instructions per second, the program, as compiled with all optimizations on current Mindcode, takes about 2.5 to 2.8 seconds to redraw the screen, depending mainly on the numbers actually being drawn. Left and right columns are updated separately, so the actual update period isn't perceived to be that long.
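As a rough cross-check of those timings, the redraw time can be converted into an instruction count. The sketch below assumes only the 480 instructions per second quoted above; the helper name is made up for illustration.

```python
# Rate quoted in the text for a Logic Processor.
INSTRUCTIONS_PER_SECOND = 480


def instructions_for(seconds: float) -> int:
    """Instructions executed in the given time (illustrative helper)."""
    return round(seconds * INSTRUCTIONS_PER_SECOND)


if __name__ == "__main__":
    # Redraw times mentioned in the text: 2.5 and 2.8 seconds.
    print(instructions_for(2.5), instructions_for(2.8))
```

So a full redraw corresponds to roughly 1200 to 1344 executed instructions.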

Measuring the efficiency of various versions of the program created using different optimization settings by timing the runs in Mindustry is a bit tedious and quite imprecise, as the timings vary depending on the actual numbers being drawn and probably some other factors. For these reasons I've created a simplified version which omits some initializations and a few drawing operations (such as setting up colors). This version of the program is then run on a simulated processor, which allows us to count the number of instructions executed and therefore obtain the measurements with absolute precision.

The benchmarked program draws a single number in all positions, for simplicity and repeatability. As different numbers are drawn using different code paths, I've run the simulations using three different numbers: 21600, 12345 and 13579. The first number is the fastest to draw of the three (0 and 1 are both drawn using one drawing operation), and was chosen as it is most commonly displayed when the core is full of materials. The remaining numbers are similarly complex to draw, but the last one usually takes the most instructions, due to going through less efficient branches in the case expression selecting the number to be drawn.

There isn't an intention to test all possible combinations of optimizations. The tests start with optimizations turned completely off and then proceed to turn them on one by one in a predefined order. It must be noted that the optimizations can (and do) interact with each other, so activating them in a different order might lead to different intermediate results - and might lead us to ascribe improvements that are a result of two or more optimizations interacting to the single optimization that was the latest added to the mix.

So, with the above precautions, I hereby present the results of the benchmark described above:

Optimizations	Size	21600	12345	13579	Diff 21600	Diff 12345	Diff 13579	Pct 21600	Pct 12345	Pct 13579
None	364	3180	3628	3948				100%	100%	100%
+ Jump Optimization	357	3068	3516	3836	-112	-112	-112	96%	97%	97%
+ Jump Straightening	347	2924	3276	3436	-144	-240	-400	92%	90%	87%
+ Case Expression Optimization	346	2844	3196	3356	-80	-80	-80	89%	88%	85%
+ If Expression Optimization	344	2842	3194	3354	-2	-2	-2	89%	88%	85%
+ Jump Threading	344	2838	3190	3350	-4	-4	-4	89%	88%	85%
+ Temp Variables Elimination	336	2597	2949	3109	-241	-241	-241	82%	81%	79%
+ Dead Code Elimination	311	2318	2670	2830	-279	-279	-279	73%	74%	72%
+ Single Step Elimination	307	2302	2654	2798	-16	-16	-32	72%	73%	71%
+ Loop Optimization	307	2238	2590	2734	-64	-64	-64	70%	71%	69%
+ Data Flow Optimization	238	1550	1758	1870	-688	-832	-864	49%	48%	47%
+ Function Inlining	222	1528	1736	1848	-22	-22	-22	48%	48%	47%
+ Loop Unrolling	508	1172	1380	1492	-356	-356	-356	37%	38%	38%
+ Case Switching	533	1268	1380	1332	96	0	-160	40%	38%	34%

Each row of the table represents one tested configuration. Column 2 contains the size of the code compiled using the given optimizations. Columns 3 to 5 contain the number of steps the benchmarked program needed to run while drawing the numbers contained in the header. Columns 6 to 8 contain the difference between the current and previous row, in other words the effect of the additional optimization(s) applied in the given configuration. The last triplet of columns shows the number of steps taken to execute the given program as a percentage of the original, non-optimized code. Also, some optimizations are missing from the table, because they had no effect on this particular benchmark.
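For readers who want to recompute the derived columns, here is a minimal Python sketch using the first three rows of the table. The function name and data layout are my own; rounding to the nearest whole percent is an assumption that happens to reproduce the published figures.

```python
# Step counts copied from the table: (21600, 12345, 13579).
ROWS = [
    ("None", (3180, 3628, 3948)),
    ("+ Jump Optimization", (3068, 3516, 3836)),
    ("+ Jump Straightening", (2924, 3276, 3436)),
]


def derive(rows):
    """Return (name, diffs, percentages) for every row after the first."""
    baseline = rows[0][1]  # the non-optimized run is the 100% reference
    result = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        # Difference against the previous row, per benchmarked number.
        diffs = tuple(c - p for c, p in zip(cur, prev))
        # Steps as a percentage of the non-optimized code.
        pcts = tuple(round(100 * c / b) for c, b in zip(cur, baseline))
        result.append((name, diffs, pcts))
    return result


if __name__ == "__main__":
    for name, diffs, pcts in derive(ROWS):
        print(name, diffs, pcts)
```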

In this treatise, I want to concentrate on the latest optimizations that were added to Mindcode. They're capable of significantly improving code execution speeds - some of them are aimed at making the code faster at the expense of increasing code size (see optimization for speed).

The first of those is Data Flow Optimization. This optimizer analyzes how variables are used and which values are assigned to them, and is able to reuse identical expressions that are present in the source code. As such, it is very helpful in drawing-heavy programs such as this benchmark. Let's have a look at the code drawing the digit 5:

            line(x, y, x + WIDTH, y)
            line(x + WIDTH, y, x + WIDTH, y + HALF)
            line(x, y + HALF, x + WIDTH, y + HALF)
            line(x, y + HALF, x, y + HEIGHT)
            line(x, y + HEIGHT, x + WIDTH, y + HEIGHT)

All these x + WIDTH, y + HEIGHT expressions are computed just once and then reused in the rest of the code block. This optimization could be emulated by the programmer by creating variables for all those reused expressions, such as x_width = x + WIDTH, y_height = y + HEIGHT and using those, but letting the optimizer do it saves the programmer work and keeps the source code cleaner, and the optimizer might even spot possibilities for optimization that the programmer missed. Since the benchmark is really draw-heavy, this optimizer has, by far, the largest effect on it.
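To get a feel for how much there is to reuse, the following Python sketch counts the additions in the five line() calls as written, and how many would remain if each distinct expression were computed only once. It models the source text only, not what the Mindcode optimizer actually emits; the helper name is hypothetical.

```python
from collections import Counter

# Arguments of the five line() calls for the digit 5, as written in the source.
DIGIT_5 = [
    ("x", "y", "x + WIDTH", "y"),
    ("x + WIDTH", "y", "x + WIDTH", "y + HALF"),
    ("x", "y + HALF", "x + WIDTH", "y + HALF"),
    ("x", "y + HALF", "x", "y + HEIGHT"),
    ("x", "y + HEIGHT", "x + WIDTH", "y + HEIGHT"),
]


def addition_counts(calls):
    """Return (additions as written, additions if each distinct one is computed once)."""
    uses = Counter(arg for call in calls for arg in call if "+" in arg)
    return sum(uses.values()), len(uses)


if __name__ == "__main__":
    print(addition_counts(DIGIT_5))
```

Under this simple count, twelve additions in the source collapse to three distinct ones.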

Function Inlining doesn't do much at this moment, but we'll revisit it later.

Loop Unrolling in the basic form provides modest improvement per iteration, but since the innermost loop, that was also unrolled, is executed 80 times, the improvements do add up nicely.

Case Switching is the hero in this story. It dramatically speeds up drawing of the higher digits, at the expense of slower drawing of lower digits. If the displayed numbers were drawn evenly from the range of all 5-digit numbers, the improvement in the average case would be significant (we can regard the number 13579 as slightly worse than the average case). The number probably drawn most often, 21600, performs markedly worse. However, on average, the performance of drawing different numbers is evened out, which might be regarded as a positive effect in itself.
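The evening-out effect can be quantified from the table. This small Python sketch compares the spread (slowest minus fastest) and the mean of the three measured numbers before and after Case Switching; note that it averages only these three samples, not all 5-digit numbers.

```python
# Steps for 21600, 12345 and 13579, copied from the table.
BEFORE = (1172, 1380, 1492)  # "+ Loop Unrolling" row
AFTER = (1268, 1380, 1332)   # "+ Case Switching" row


def spread(steps):
    """Difference between the slowest and the fastest measured number."""
    return max(steps) - min(steps)


def mean(steps):
    """Arithmetic mean of the measured step counts."""
    return sum(steps) / len(steps)


if __name__ == "__main__":
    print(spread(BEFORE), spread(AFTER))
    print(mean(BEFORE), mean(AFTER))
```

The spread shrinks from 320 to 112 steps, and the mean over the three samples drops slightly as well.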

What's the best we could theoretically do? When inspecting the resulting code, we can see that there are still loops that could be unrolled, if we could go over the 1000 instruction limit. Compiling the code with #set instruction-limit = 10000 produces 4848 instructions, but results in all loops being unrolled. The processor emulator isn't limited to 1000 instructions, so from the emulated run we get:

Optimizations	Size	21600	12345	13579	Diff 21600	Diff 12345	Diff 13579	Pct 21600	Pct 12345	Pct 13579
...
+ Case Switching	533	1268	1380	1332	96	0	-160	40%	38%	34%
Instruction limit 10000	4848	896	976	944	-372	-404	-388	28%	27%	24%

That's a reduction of about 30% over the best case so far! What does the code actually look like? I'm going to show only a small excerpt:

draw line 113 12 120 12 0 0
draw line 120 12 120 24 0 0
draw line 113 18 120 18 0 0
draw line 113 18 113 24 0 0
draw line 113 24 120 24 0 0

This is the code that prints "9" at a fixed position on the screen -- since every function was inlined and every loop unrolled, the result is linear code which contains a separate switch statement for printing a given digit at every possible position on the screen. This allows Mindcode to precompute every single expression that relates to drawing operations; the only variables left in the program are those concerning the actual values being printed.
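To illustrate what "precomputing every expression" means here, the Python sketch below evaluates the five lines of a "9" for a fixed position. The digit shape and the dimensions (7, 6 and 12) are reconstructed from the compiled excerpt above, so treat them as assumptions rather than the actual drawDigit source.

```python
# Assumed digit dimensions, inferred from the compiled excerpt in this post.
WIDTH, HALF, HEIGHT = 7, 6, 12


def digit9_lines(x, y):
    """Line segments (x1, y1, x2, y2) of a '9' whose lower-left corner is (x, y)."""
    return [
        (x, y, x + WIDTH, y),
        (x + WIDTH, y, x + WIDTH, y + HEIGHT),
        (x, y + HALF, x + WIDTH, y + HALF),
        (x, y + HALF, x, y + HEIGHT),
        (x, y + HEIGHT, x + WIDTH, y + HEIGHT),
    ]


if __name__ == "__main__":
    # With both coordinates known up front, every argument is a plain number.
    for segment in digit9_lines(113, 12):
        print("draw line", *segment, 0, 0)
```

With x = 113 and y = 12 this reproduces the five draw line instructions from the excerpt.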

Can this help us somehow? We obviously can't use this code, it's too large for Mindustry processors. Let's review the code printing "9" in the best usable version:

op add __tmp96 __fn2_x 7
draw line __fn2_x y __tmp96 y 0 0
op add __tmp99 y 12
draw line __tmp96 y __tmp96 __tmp99 0 0
op add __tmp100 y 6
draw line __fn2_x __tmp100 __tmp96 __tmp100 0 0
draw line __fn2_x __tmp100 __fn2_x __tmp99 0 0
draw line __fn2_x __tmp99 __tmp96 __tmp99 0 0

(Note: the x coordinate is represented by __fn2_x, while the y coordinate by y. This might look strange, but it is because the Data Flow Optimizer replaces variables as it sees fit. For some reason it let the x coordinate be represented by the inlined function variable, while replacing __fn2_y with the y variable passed in as an argument.)

Each drawing coordinate must be computed from the basic position provided by the variables/function arguments, __fn2_x and y.

By further inspecting the logs produced by the command-line compiler, we also notice that there's a loop at line 86 which is unrolled:

  * Loop Unrolling: unroll loop at line 86                       cost   357, benefit     6250,0, efficiency       17,5 (+355 instructions)

The loop at line 86 is a loop which prints all five digits of a particular number:

    for i in 0 ... 5
        drawDigit(x, y, amount % 10)
        x -= SPACE_X
        amount \= 10
        if amount == 0 break end
    end

This means that there's a separate sequence of code for each digit of the 5-digit numbers being printed. On top of this, there's a loop which prints numbers in a grid consisting of two columns and eight rows. We obviously can't do anything about the eight rows - the code currently has over 500 instructions, and even if only a quarter of that was the drawing loop (and it is more), unrolling would multiply that number by eight, taking us over the 1000 instruction limit.
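For readers less familiar with Mindcode, the digit loop shown above can be mirrored in Python roughly as follows. The function name and the concrete starting position and spacing are placeholders; the real values of x and SPACE_X come from the program.

```python
def digit_positions(amount, x, space_x):
    """Mirror the digit loop: return (x, digit) pairs, least significant digit first."""
    placed = []
    for _ in range(5):                   # 0 ... 5 excludes the upper bound
        placed.append((x, amount % 10))  # stands in for drawDigit(x, y, amount % 10)
        x -= space_x                     # move one digit position to the left
        amount //= 10                    # integer division, like \= in the loop above
        if amount == 0:
            break                        # no leading zeros are drawn
    return placed


if __name__ == "__main__":
    # Placeholder position and spacing, for illustration only.
    print(digit_positions(21600, 100, 10))
```

Once the loop is unrolled, each of the (up to) five iterations becomes its own copy of the body, which is why the compiled program contains a separate code sequence per digit position.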

Can we do something about columns? Well, if we could create a separate code path for each column, the x positions of each digit could be evaluated fully down to a constant. That could save us some instructions. So, let's have a look at the loop in question:

x = LEFT1 + OFFSET
y = TOP
eraseNumbers(true)
index = 0

for item in (@sand, @copper, @lead, @coal, @titanium, @thorium, @scrap, @graphite, @metaglass, @silicon,
          @pyratite, @blast-compound, @plastanium, @phase-fabric, @surge-alloy, @spore-pod)
    image(0, 0, item, IMAGE, 0)
    displayItem(x, y)
    index += 1
    y -= SPACE_Y
    if index == 8
        drawflush(display1)
        y = TOP
        x = LEFT2 + OFFSET
        eraseNumbers(false)
    end
end

This loop prints both columns - the left one is printed first, until index is equal to 8, at which point the display is flushed and a coordinate for the right column is computed. displayItem is again inlined here, because it is a single function call. To turn it into two function calls with different, but constant values for the x argument, we can do this:

y = TOP
eraseNumbers(true)
index = 0

for item in (@sand, @copper, @lead, @coal, @titanium, @thorium, @scrap, @graphite, @metaglass, @silicon,
          @pyratite, @blast-compound, @plastanium, @phase-fabric, @surge-alloy, @spore-pod)
    image(0, 0, item, IMAGE, 0)
    if index < 8
        displayItem(LEFT1 + OFFSET, y)
    else
        displayItem(LEFT2 + OFFSET, y)
    end
    index += 1
    y -= SPACE_Y
    if index == 8
        drawflush(display1)
        y = TOP
        eraseNumbers(false)
    end
end

Optimizations	Size	21600	12345	13579	Diff 21600	Diff 12345	Diff 13579	Pct 21600	Pct 12345	Pct 13579
...
+ Case Switching	533	1268	1380	1332	96	0	-160	40%	38%	34%
Rewriting code	573	1240	1352	1304	-28	-28	-28	39%	37%	33%

It's an improvement, but not what we expected. In the logs, we find this:

Pass 1: speed optimization selection (cost limit 396):
  * Loop Unrolling: unroll iteration loop at line 28             cost    97, benefit       47,0, efficiency        0,5 (+18 instructions)
    Loop Unrolling: unroll iteration loop at line 45             cost   289, benefit       47,0, efficiency        0,2
  ! Function Inlining: inline function displayItem               cost   449, benefit      100,0, efficiency        0,2
  ! Function Inlining: inline function call at line 49           cost   450, benefit       50,0, efficiency        0,1
  ! Function Inlining: inline function call at line 51           cost   450, benefit       50,0, efficiency        0,1

Mindcode cannot inline displayItem now, because it would need 449 instructions to do so, and only 396 are available at this moment. Bummer.
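The shortfall can be read straight off the log. The Python sketch below only compares each logged cost with the logged cost limit; it does not attempt to model how Mindcode actually ranks or selects candidates, and the function name is made up.

```python
COST_LIMIT = 396  # "cost limit 396" from the log header

# (description, cost) pairs copied from the log lines above.
CANDIDATES = [
    ("unroll iteration loop at line 28", 97),
    ("unroll iteration loop at line 45", 289),
    ("inline function displayItem", 449),
    ("inline function call at line 49", 450),
    ("inline function call at line 51", 450),
]


def over_budget(candidates, limit):
    """Return {description: shortfall} for candidates whose cost exceeds the limit."""
    return {name: cost - limit for name, cost in candidates if cost > limit}


if __name__ == "__main__":
    for name, missing in over_budget(CANDIDATES, COST_LIMIT).items():
        print(f"{name}: {missing} instructions short")
```

The three entries reported as over budget are the same ones the log flags with an exclamation mark, and inlining displayItem is 53 instructions short.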

Wait a moment! We expected that after the inlining all the computations of the x coordinates should be replaced by constant values, reducing the code size somewhat. Mindcode can't see that far (yet), but perhaps if we gave it more space, the functions could be inlined and the result would still make it under the 1000 instruction limit? We see that we're only some 53 instructions short, so we'll give it a shot by increasing the instruction limit just a bit:

#set instruction-limit = 1100

and we get:

Optimizations	Size	21600	12345	13579	Diff 21600	Diff 12345	Diff 13579	Pct 21600	Pct 12345	Pct 13579
...
+ Data Flow Optimization	238	1550	1758	1870	-688	-832	-864	49%	48%	47%
...
+ Case Switching	533	1268	1380	1332	96	0	-160	40%	38%	34%
Rewriting code + 1100 instructions limit	859	1138	1250	1202	-130	-130	-130	36%	34%	30%

Hooray! With the instruction limit raised to 1100, all the optimizations we wanted were applied, and the final code, at 859 instructions, still fits under Mindustry's 1000 instruction limit. Just to make sure, let's check the resulting code:

draw line 113 y 120 y 0 0
op add __tmp109 y 12
draw line 120 y 120 __tmp109 0 0
op add __tmp110 y 6
draw line 113 __tmp110 120 __tmp110 0 0
draw line 113 __tmp110 113 __tmp109 0 0
draw line 113 __tmp109 120 __tmp109 0 0

The code looks as expected: all x coordinates are replaced by a constant, and only the y coordinates are computed. And this is also the part where Function Inlining gets into the spotlight - the drawDigit method is called multiple times in the source code now and therefore isn't inlined during code generation. It's the combination of loop unrolling and function inlining that allows all the x coordinates in the drawing instructions to be evaluated down to a constant.

Conclusion

All in all, optimization for speed together with some manual tweaking gave us a speedup of about 30% compared to the code without any optimizations for speed. While this improvement is quite impressive, it must be noted that I've demonstrated a type of program suitable for optimization for speed. The same level of improvement cannot be expected on every program.
