Replies: 4 comments
-
Thanks @schuhumi, very interesting thread.
-
@schuhumi Great collection of techniques, thanks! I noticed a significant impact from pipelining when implementing the vector extension functions, though it's sometimes a bit hard to see why one instruction order is faster than another...
-
Thank you both for the remarks. @vroland but how do you get to code with well-working pipelining? Brute force, by taking rounds of educated guesses and profiling them? I also have some learnings about preloading, as well as three more techniques, the first of which I forgot to present in the initial post:

Cache Preload robustness
Meanwhile, by frustrating experience, I learned that manual DCache preloads only work reliably when the requested region is expanded to whole cache-line blocks (and at least one block), and when a still-running preload is dealt with first. So I wrap the preload like this:
#include <stdint.h>
#include <stdbool.h>
#include "sdkconfig.h"
#include "esp32s3/rom/cache.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE is either 64, 32 or 16.
// The bit mask that indicates the addresses inside a block can be calculated like this:
#define DATA_CACHE_OVERHANG_BITS ((uint32_t)CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE - 1)
// The other bits denote the block number:
#define DATA_CACHE_BLOCK_BITS (~DATA_CACHE_OVERHANG_BITS)

bool dcache_preload_active = false;

void Foolproof_Cache_Start_DCache_Preload(uint32_t start, uint32_t len, bool finish_previous_preload) {
    // Start a manual preload of a piece of external RAM. The requested region is padded such that
    // the preloading always works (expanded to whole blocks and one block at a minimum).
    // If a previous preload is still ongoing, the finish_previous_preload parameter decides what happens:
    // - true: the call blocks until the previous preload is finished. The task yields while
    //         waiting so that other independent things can get done.
    // - false: the previous preload is stopped prematurely, and the new one starts immediately.
    if(dcache_preload_active) {
        if(Cache_DCache_Preload_Done()) {
            dcache_preload_active = false;
        } else {
            if(finish_previous_preload) {
                while(!Cache_DCache_Preload_Done()) {
                    taskYIELD();
                }
            } else {
                Cache_End_DCache_Preload(0);  // Do not start auto-preload
            }
            dcache_preload_active = false;
        }
    }
    uint32_t block_start = start & DATA_CACHE_BLOCK_BITS;
    uint32_t block_end = start + len;  // This is exclusive, i.e. the first address we do not require
    if((block_end & DATA_CACHE_OVERHANG_BITS) != 0) {
        // The end of our desired area happens to not fit the block grid
        // -> we need to include the block that our end overhangs into
        block_end = (block_end | DATA_CACHE_OVERHANG_BITS) + 1;
    }
    if(((block_start ^ block_end) & DATA_CACHE_BLOCK_BITS) == 0) {
        // Our start and end block are the same one, which makes a length of 0 blocks.
        // We need to request 1 block at a minimum though.
        block_end += CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE;
    }
    // With CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE==16 as an example, at this point we should have:
    // block_start = 0b...aaaaa0000
    // block_end   = 0b...bbbbb0000
    // where the block number indicated by the b bits is at least 1 larger than the one indicated by the a bits.
    Cache_Start_DCache_Preload(
        block_start,             // start address of the preload region
        block_end - block_start, // size of the preload region, should not exceed the size of DCache
        0                        // the preload order, 0 for positive, other for negative
    );
    dcache_preload_active = true;
}
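
For illustration, here is a sketch of how such a preload can be overlapped with the actual work; framebuffer, FRAMEBUFFER_SIZE, CHUNK_SIZE and process_chunk() are made-up placeholders, not code from my project:

#define FRAMEBUFFER_SIZE (960 * 540)
#define CHUNK_SIZE       960

extern uint8_t* framebuffer;                              // lives in external RAM
extern void process_chunk(uint8_t* chunk, uint32_t len);  // whatever work has to be done per chunk

void process_framebuffer(void) {
    for(uint32_t offset = 0; offset < FRAMEBUFFER_SIZE; offset += CHUNK_SIZE) {
        uint32_t next = offset + CHUNK_SIZE;
        if(next < FRAMEBUFFER_SIZE) {
            // Kick off caching of the next chunk; don't block on a still-running preload.
            Foolproof_Cache_Start_DCache_Preload((uint32_t)&framebuffer[next], CHUNK_SIZE, false);
        }
        process_chunk(&framebuffer[offset], CHUNK_SIZE);  // this chunk is (hopefully) already in cache
    }
}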

Code in RAM instead of FLASH
This one is very simple, and the ESP32 docs explain how to do it: code can be placed in internal RAM (IRAM) instead of flash, e.g. by marking functions with IRAM_ATTR. It has the most impact in hot code paths, obviously.
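A tiny sketch of what that looks like (fill_row() is just a made-up example of a hot function):

#include <stdint.h>
#include "esp_attr.h"

// IRAM_ATTR places the function in internal RAM, so it does not have to be fetched from flash.
IRAM_ATTR void fill_row(uint8_t* row, uint32_t len, uint8_t value) {
    for(uint32_t i = 0; i < len; i++) {
        row[i] = value;
    }
}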

Exploit shared caches
The cache-structure figure in the technical reference manual shows that both CPU cores share the data cache. So when you need to do two operations on (roughly) the same memory areas, you can get massive speed gains by doing them in parallel AND having to cache the data only once for both processors. Of course you need to be a bit careful not to get any race conditions, so subdivide your memory area into chunks such that several of them fit into the cache at once. In my application (a computer screen) those chunks are pixel rows of the screen. I use the low-level functions, and therefore have to calculate the states of the pixels after a refresh myself. At the same time I often want to decompress new image data into roughly the same areas (imagine dragging a window around, for example). I still need to come up with some heuristics to promote this happy coincidence, but it often happens that one core requests preloading of a row of pixels to calculate the new pixel states, which it then does, and then, before the cached row of the framebuffer gets evicted, the second core can decompress new image data into it. This is really speedy, and I only need a single pixel-row counter to make sure the decompressor stays (closely) behind the calculator of the new pixel states. This approach is also nice because if you tried to make good use of the cache on a single core, you'd either loop through each pixel row twice, or you'd have one loop with expensive branches in it to decide what action is necessary for each specific pixel.

Queues/Buffers with minimal copies
I'm not really happy with the buffers that come with FreeRTOS, because none of them works without copying the payload data while also handing the writer pointers to memory areas where it can dump variable-length data. If you have used the USB library, for example, you'll know that for received data it gives you a function that copies the data to a target memory area. I do not want to copy that into an intermediary buffer just so that some FreeRTOS buffer can copy it again from there into itself, because that is expensive af. So this became a recurring structure in my code:
// wait for the parallel task to finish
while(xStreamBufferBytesAvailable(items_empty) < ITEMS_ARRAY_LEN) {
    taskYIELD();
}
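
The fragment above is only the synchronization part (waiting until the parallel task has returned every slot). A minimal sketch of the whole pattern I mean looks roughly like this; only items_empty and ITEMS_ARRAY_LEN correspond to the snippet above, while item_t, items_full, the sizes and the two tasks are made up for illustration. The payload slots are allocated once, and only one-byte slot indices travel through the stream buffers, so the payload itself never gets copied around:

#include <stdint.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/stream_buffer.h"
#include "freertos/task.h"

#define ITEMS_ARRAY_LEN 8
#define ITEM_MAX_SIZE   512

typedef struct {
    uint32_t len;
    uint8_t data[ITEM_MAX_SIZE];
} item_t;

static item_t items[ITEMS_ARRAY_LEN];
static StreamBufferHandle_t items_empty;  // carries indices of free slots
static StreamBufferHandle_t items_full;   // carries indices of filled slots

void items_init(void) {
    items_empty = xStreamBufferCreate(ITEMS_ARRAY_LEN, 1);
    items_full = xStreamBufferCreate(ITEMS_ARRAY_LEN, 1);
    for(uint8_t i = 0; i < ITEMS_ARRAY_LEN; i++) {
        xStreamBufferSend(items_empty, &i, 1, portMAX_DELAY);  // initially every slot is free
    }
}

void writer_task(void* arg) {  // e.g. the task that receives data from the USB driver
    (void)arg;
    for(;;) {
        uint8_t i;
        xStreamBufferReceive(items_empty, &i, 1, portMAX_DELAY);  // grab a free slot
        // Let the data source write directly into the slot, no intermediary buffer.
        // A placeholder payload stands in for e.g. tud_cdc_read(items[i].data, ITEM_MAX_SIZE).
        memset(items[i].data, 0xAB, 64);
        items[i].len = 64;
        xStreamBufferSend(items_full, &i, 1, portMAX_DELAY);      // hand the filled slot over
    }
}

void reader_task(void* arg) {  // the parallel task, consuming the data in place
    (void)arg;
    for(;;) {
        uint8_t i;
        xStreamBufferReceive(items_full, &i, 1, portMAX_DELAY);
        // ... process items[i].data / items[i].len in place ...
        xStreamBufferSend(items_empty, &i, 1, portMAX_DELAY);     // return the slot
    }
}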
-
Hi there.
Your explanation is correct about the way the pipeline works. This is true for all instructions which may have a latency > 1 clock cycle.
Correct. No auto-vectorization, and no intrinsics in gcc.
I definitely prefer the inline assembly approach because 1) it's easier to use when all you really want is a few asm instructions in the inner-most loop of a bigger function, and 2) it enables the compiler to optimize and/or inline functions containing the inline asm.
GCC knows about and uses the zero-overhead loops, but it does so very conservatively; among other things, it will not use them when there's any inline assembly or any function call in a loop's body, which is why you have to inline-assemble the ZOLs too in order to use SIMD inline assembly inside a ZOL.
I don't think the crypto accelerators are of any use for compression. The SIMD instructions, however, can be; see e.g. https://github.com/BitsForPeople/esp-tamp
-
The ESP32 is a kinda smallish chip, and some of the screens epdiy can drive have many, many pixels, so naturally one would like to put the chip's capabilities to good use. Since getting into epdiy, I've had a couple of eureka moments in terms of getting things done quickly. I'm quite sure, though, that I haven't found all the ways to speed things up, and in any case it would be nice to collect some approaches, since I didn't find any holistic take on this endeavor for the ESP32(-S3). So I'll list what I've discovered so far, and I'd be very happy about information / links on further measures.
(Sidenote: the Technical Reference Manual is a great place to look up the details of some of the following topics.)
Utilize 32 bits
For starters, something easy: if you're working on an array of 8- or 16-bit values, think about whether you can typecast your pointer to uint32_t and work in batches of 2 or 4 values per access. This is pretty much a no-brainer for bitmask stuff, for example. One great property of this method is that you do not need to keep track of any memory alignment.
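As a tiny, made-up example, clearing the upper nibble of every byte in a buffer can be done four bytes at a time:

#include <stddef.h>
#include <stdint.h>

// Clear the upper nibble of every byte, handling four bytes per AND instead of one.
// For brevity this sketch assumes len is a multiple of 4.
void mask_low_nibbles(uint8_t* buf, size_t len) {
    uint32_t* buf32 = (uint32_t*)buf;
    for(size_t i = 0; i < len / 4; i++) {
        buf32[i] &= 0x0F0F0F0F;
    }
}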

Internal vs. external memory speed
Apparently, there's a really large speed difference between the two. See this esp32-s3 memory copy test repo for some numbers. There are things that do not fit into internal memory, for example the framebuffer when using larger displays. I discovered two ways to move data you'll need soon into internal memory without blocking the CPU for the copy action.
Interleave memory access and compute
To my understanding, with a straightforward loop that loads buf[i], runs work() on it and stores the result back, you end up with a situation where first your CPU is idle waiting for buf[i] to turn up in one of its registers, then your memory interface is idle waiting for work(), and then your CPU is idle again while writing the result from the register back to buf[i]. In the memorycopy repo there's a PIE 128-bit (16 byte loop) example that works this way, and then there's a PIE 128-bit (32 byte loop) variant that is faster because it interleaves the loads and stores, so the memory interface and the CPU are kept busy at the same time. In the memorycopy example this is surprisingly faster even though there isn't any work() to do. I haven't benchmarked yet how much of a difference it makes for regular non-vector instructions; the sketch below shows the general idea in plain C.
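Here is a rough illustration of the two patterns; work() is just a placeholder computation, the actual memorycopy examples do this with 128-bit PIE assembly, and the compiler may of course rearrange plain C like this on its own:

#include <stddef.h>
#include <stdint.h>

static inline uint32_t work(uint32_t x) { return x ^ 0xA5A5A5A5; }  // placeholder computation

// Naive: load, compute, store. The CPU stalls on every load before it can do anything.
void process_naive(uint32_t* buf, size_t n) {
    for(size_t i = 0; i < n; i++) {
        buf[i] = work(buf[i]);
    }
}

// Interleaved: start fetching the next element before working on the current one,
// so the memory access overlaps with the computation.
void process_interleaved(uint32_t* buf, size_t n) {
    if(n == 0) return;
    uint32_t current = buf[0];
    for(size_t i = 0; i < n; i++) {
        uint32_t next = (i + 1 < n) ? buf[i + 1] : 0;  // this load is in flight while we compute
        buf[i] = work(current);
        current = next;
    }
}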

Vector Instructions / Single Instruction Multiple Data (SIMD) / Processor Instruction Extensions (PIE)
All the same thing, and Valentin is working on integrating them into epdiy. These allow working with 128 bits per instruction. I haven't tackled them myself yet, because I haven't cleared all the challenges they pose in my use case yet.
Zero-overhead loops
Another nugget I found in the memorycopy repo mentioned above is the "zero-overhead" loop that uses the LOOPNEZ instruction from the Xtensa Instruction Set Architecture (page 477). I haven't investigated this myself yet, but since the author of the memorycopy repo took the effort to write an explicit function to utilize it, I do not think the compiler does it automagically. A rough sketch of what using it from inline assembly could look like is below.
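Something like this made-up word-copy routine, which assumes the Xtensa loop option of the ESP32-S3 and which I haven't benchmarked myself:

#include <stdint.h>

// Copy n 32-bit words using the LOOPNEZ zero-overhead loop: no compare/branch per iteration,
// the loop count lives in the LCOUNT special register. LOOPNEZ skips the body if n is 0.
static inline void zol_copy_words(uint32_t* dst, const uint32_t* src, uint32_t n) {
    uint32_t tmp;
    asm volatile(
        "loopnez %[n], .Lzol_end%=  \n"  // hardware loop over the next four instructions
        "l32i    %[tmp], %[src], 0  \n"
        "s32i    %[tmp], %[dst], 0  \n"
        "addi    %[src], %[src], 4  \n"
        "addi    %[dst], %[dst], 4  \n"
        ".Lzol_end%=:               \n"
        : [tmp] "=&r" (tmp), [src] "+r" (src), [dst] "+r" (dst)
        : [n] "r" (n)
        : "memory");
}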
That's it from the top of my head. Do you have any more tricks up your sleeves? Things that seem interesting but I haven't grasped yet: