Replies: 4 comments
-
Thanks @schuhumi, very interesting thread.
-
@schuhumi Great collection of techniques, thanks! I noticed a significant impact from pipelining when implementing the vector extension functions, though it's sometimes a bit hard to see why one instruction order is faster than another...
-
Thank you both for the remarks. @vroland but how do you get to code with well-working pipelining? Brute force, by taking rounds of educated guesses and profiling them? I also have some learnings about preloading, as well as three more techniques, the first of which I forgot to present in the initial post:

Cache Preload robustness
Meanwhile, by frustrating experience, I learned that manual DCache preloads only work reliably when the requested region is expanded to whole cache-line blocks (and at least one block), and when a still-running preload is dealt with first. So I wrap the preload like this:
#include <stdint.h>
#include <stdbool.h>
#include "sdkconfig.h"
#include "esp32s3/rom/cache.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE is either 64, 32 or 16.
// The bit mask that indicates the addresses inside a block can be calculated like this:
#define DATA_CACHE_OVERHANG_BITS ((uint32_t)CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE - 1)
// The other bits denote the block number:
#define DATA_CACHE_BLOCK_BITS (~DATA_CACHE_OVERHANG_BITS)

bool dcache_preload_active = false;

void Foolproof_Cache_Start_DCache_Preload(uint32_t start, uint32_t len, bool finish_previous_preload) {
    // Start a manual preload of a piece of external RAM. The requested region is padded such that
    // the preloading always works (expanded to whole blocks and one block at a minimum).
    // If a previous preload is still ongoing, the finish_previous_preload parameter decides what happens:
    // - true: the call blocks until the previous preload is finished. The task yields while
    //         waiting so that other independent things can get done.
    // - false: the previous preload is stopped prematurely, and the new one starts immediately.
    if(dcache_preload_active) {
        if(Cache_DCache_Preload_Done()) {
            dcache_preload_active = false;
        } else {
            if(finish_previous_preload) {
                while(!Cache_DCache_Preload_Done()) {
                    taskYIELD();
                }
            } else {
                Cache_End_DCache_Preload(0);  // Do not start auto-preload
            }
            dcache_preload_active = false;
        }
    }
    uint32_t block_start = start & DATA_CACHE_BLOCK_BITS;
    uint32_t block_end = start + len;  // This is exclusive, i.e. the first address we do not require
    if((block_end & DATA_CACHE_OVERHANG_BITS) != 0) {
        // The end of our desired area happens to not fit the block grid
        // -> we need to include the block that our end overhangs into
        block_end = (block_end | DATA_CACHE_OVERHANG_BITS) + 1;
    }
    if(((block_start ^ block_end) & DATA_CACHE_BLOCK_BITS) == 0) {
        // Our start and end block are the same one, which makes a length of 0 blocks.
        // We need to request 1 block at a minimum though.
        block_end += CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE;
    }
    // With CONFIG_ESP32S3_DATA_CACHE_LINE_SIZE==16 as an example, at this point we should have:
    // block_start = 0b...aaaaa0000
    // block_end   = 0b...bbbbb0000
    // where the block number indicated by the b bits is at least 1 larger than the one indicated by the a bits.
    Cache_Start_DCache_Preload(
        block_start,             // start address of the preload region
        block_end - block_start, // size of the preload region, should not exceed the size of DCache
        0                        // the preload order, 0 for positive, other for negative
    );
    dcache_preload_active = true;
}
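
For illustration, here is a sketch of how such a preload can be overlapped with the actual work; framebuffer, FRAMEBUFFER_SIZE, CHUNK_SIZE and process_chunk() are made-up placeholders, not code from my project:

#define FRAMEBUFFER_SIZE (960 * 540)
#define CHUNK_SIZE       960

extern uint8_t* framebuffer;                              // lives in external RAM
extern void process_chunk(uint8_t* chunk, uint32_t len);  // whatever work has to be done per chunk

void process_framebuffer(void) {
    for(uint32_t offset = 0; offset < FRAMEBUFFER_SIZE; offset += CHUNK_SIZE) {
        uint32_t next = offset + CHUNK_SIZE;
        if(next < FRAMEBUFFER_SIZE) {
            // Kick off caching of the next chunk; don't block on a still-running preload.
            Foolproof_Cache_Start_DCache_Preload((uint32_t)&framebuffer[next], CHUNK_SIZE, false);
        }
        process_chunk(&framebuffer[offset], CHUNK_SIZE);  // this chunk is (hopefully) already in cache
    }
}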

Code in RAM instead of FLASH
This one is very simple, and the ESP32 docs explain how to do it: code can be placed in internal RAM (IRAM) instead of flash, e.g. by marking functions with IRAM_ATTR. It has the most impact in hot code paths, obviously.
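A tiny sketch of what that looks like (fill_row() is just a made-up example of a hot function):

#include <stdint.h>
#include "esp_attr.h"

// IRAM_ATTR places the function in internal RAM, so it does not have to be fetched from flash.
IRAM_ATTR void fill_row(uint8_t* row, uint32_t len, uint8_t value) {
    for(uint32_t i = 0; i < len; i++) {
        row[i] = value;
    }
}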

Exploit shared caches
The cache-structure figure in the technical reference manual shows that both CPU cores share the data cache. So when you need to do two operations on (roughly) the same memory areas, you can get massive speed gains by doing them in parallel AND having to cache the data only once for both processors. Of course you need to be a bit careful not to get any race conditions, so subdivide your memory area into chunks such that several of them fit into the cache at once. In my application (a computer screen) those chunks are pixel rows of the screen. I use the low-level functions, and therefore have to calculate the states of the pixels after a refresh myself. At the same time I often want to decompress new image data into roughly the same areas (imagine dragging a window around, for example). I still need to come up with some heuristics to promote this happy coincidence, but it often happens that one core requests preloading of a row of pixels to calculate the new pixel states, which it then does, and then, before the cached row of the framebuffer gets evicted, the second core can decompress new image data into it. This is really speedy, and I only need a single pixel-row counter to make sure the decompressor stays (closely) behind the calculator of the new pixel states. This approach is also nice because if you tried to make good use of the cache on a single core, you'd either loop through each pixel row twice, or you'd have one loop with expensive branches in it to decide what action is necessary for each specific pixel.

Queues/Buffers with minimal copies
I'm not really happy with the buffers that come with FreeRTOS, because none of them works without copying the payload data while also handing the writer pointers to memory areas where it can dump variable-length data. If you have used the USB library, for example, you'll know that for received data it gives you a function that copies the data to a target memory area. I do not want to copy that into an intermediary buffer just so that some FreeRTOS buffer can copy it again from there into itself, because that is expensive af. So this became a recurring structure in my code:
// wait for the parallel task to finish
while(xStreamBufferBytesAvailable(items_empty) < ITEMS_ARRAY_LEN) {
    taskYIELD();
}
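
The fragment above is only the synchronization part (waiting until the parallel task has returned every slot). A minimal sketch of the whole pattern I mean looks roughly like this; only items_empty and ITEMS_ARRAY_LEN correspond to the snippet above, while item_t, items_full, the sizes and the two tasks are made up for illustration. The payload slots are allocated once, and only one-byte slot indices travel through the stream buffers, so the payload itself never gets copied around:

#include <stdint.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/stream_buffer.h"
#include "freertos/task.h"

#define ITEMS_ARRAY_LEN 8
#define ITEM_MAX_SIZE   512

typedef struct {
    uint32_t len;
    uint8_t data[ITEM_MAX_SIZE];
} item_t;

static item_t items[ITEMS_ARRAY_LEN];
static StreamBufferHandle_t items_empty;  // carries indices of free slots
static StreamBufferHandle_t items_full;   // carries indices of filled slots

void items_init(void) {
    items_empty = xStreamBufferCreate(ITEMS_ARRAY_LEN, 1);
    items_full = xStreamBufferCreate(ITEMS_ARRAY_LEN, 1);
    for(uint8_t i = 0; i < ITEMS_ARRAY_LEN; i++) {
        xStreamBufferSend(items_empty, &i, 1, portMAX_DELAY);  // initially every slot is free
    }
}

void writer_task(void* arg) {  // e.g. the task that receives data from the USB driver
    (void)arg;
    for(;;) {
        uint8_t i;
        xStreamBufferReceive(items_empty, &i, 1, portMAX_DELAY);  // grab a free slot
        // Let the data source write directly into the slot, no intermediary buffer.
        // A placeholder payload stands in for e.g. tud_cdc_read(items[i].data, ITEM_MAX_SIZE).
        memset(items[i].data, 0xAB, 64);
        items[i].len = 64;
        xStreamBufferSend(items_full, &i, 1, portMAX_DELAY);      // hand the filled slot over
    }
}

void reader_task(void* arg) {  // the parallel task, consuming the data in place
    (void)arg;
    for(;;) {
        uint8_t i;
        xStreamBufferReceive(items_full, &i, 1, portMAX_DELAY);
        // ... process items[i].data / items[i].len in place ...
        xStreamBufferSend(items_empty, &i, 1, portMAX_DELAY);     // return the slot
    }
}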
-
Hi there.
Your explanation is correct about the way the pipeline works. This is true for all instructions which may have a latency > 1 clock cycle.
Correct. No auto-vectorization, and no intrinsics in gcc.
I definitely prefer the inline assembly approach because 1) it's easier to use when all you really want is a few asm instructions in the inner-most loop of a bigger function, and 2) it enables the compiler to optimize and/or inline functions containing the inline asm.
GCC knows about and uses the zero-overhead loops, but it does so very conservatively; among other things, it will not use them when there's any inline assembly or any function call in a loop's body, which is why you have to inline-assemble the ZOLs too in order to use SIMD inline assembly inside a ZOL.
I don't think the crypto accelerators are of any use for compression. The SIMD instructions, however, can be; see e.g. https://github.com/BitsForPeople/esp-tamp
-
The ESP32 is a kinda smallish chip, and some of the screens epdiy can drive have many, many pixels, so naturally one would like to put the chip's capabilities to good use. Since getting into epdiy, I've had a couple of eureka moments in terms of getting things done quickly. I'm quite sure, though, that I haven't found all the ways to speed things up, and in any case it would be nice to collect some approaches, since I didn't find any holistic take on this endeavor for the ESP32(-S3). So I'll list what I've discovered so far, and I'd be very happy about information / links on further measures.
(Sidenote: the Technical Reference Manual is a great place to look up the details of some of the following topics.)
Utilize 32 bits
For starters, something easy: if you're working on an array of 8- or 16-bit values, think about whether you can typecast your pointer to uint32_t and work in batches of 2 or 4 values per access. This is pretty much a no-brainer for bitmask stuff, for example. One great property of this method is that you do not need to keep track of any memory alignment.
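As a tiny, made-up example, clearing the upper nibble of every byte in a buffer can be done four bytes at a time:

#include <stddef.h>
#include <stdint.h>

// Clear the upper nibble of every byte, handling four bytes per AND instead of one.
// For brevity this sketch assumes len is a multiple of 4.
void mask_low_nibbles(uint8_t* buf, size_t len) {
    uint32_t* buf32 = (uint32_t*)buf;
    for(size_t i = 0; i < len / 4; i++) {
        buf32[i] &= 0x0F0F0F0F;
    }
}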

Internal vs. external memory speed
Apparently, there's a really large speed difference between the two. See this esp32-s3 memory copy test repo for some numbers. There are things that do not fit into internal memory, for example the framebuffer when using larger displays. I discovered two ways to move data you'll need soon into internal memory without blocking the CPU for the copy action.
Interleave memory access and compute
To my understanding, with a straightforward loop that loads buf[i], runs work() on it and stores the result back, you end up with a situation where first your CPU is idle waiting for buf[i] to turn up in one of its registers, then your memory interface is idle waiting for work(), and then your CPU is idle again while writing the result from the register back to buf[i]. In the memorycopy repo there's a PIE 128-bit (16 byte loop) example that works this way, and then there's a PIE 128-bit (32 byte loop) variant that is faster because it interleaves the loads and stores, so the memory interface and the CPU are kept busy at the same time. In the memorycopy example this is surprisingly faster even though there isn't any work() to do. I haven't benchmarked yet how much of a difference it makes for regular non-vector instructions; the sketch below shows the general idea in plain C.
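Here is a rough illustration of the two patterns; work() is just a placeholder computation, the actual memorycopy examples do this with 128-bit PIE assembly, and the compiler may of course rearrange plain C like this on its own:

#include <stddef.h>
#include <stdint.h>

static inline uint32_t work(uint32_t x) { return x ^ 0xA5A5A5A5; }  // placeholder computation

// Naive: load, compute, store. The CPU stalls on every load before it can do anything.
void process_naive(uint32_t* buf, size_t n) {
    for(size_t i = 0; i < n; i++) {
        buf[i] = work(buf[i]);
    }
}

// Interleaved: start fetching the next element before working on the current one,
// so the memory access overlaps with the computation.
void process_interleaved(uint32_t* buf, size_t n) {
    if(n == 0) return;
    uint32_t current = buf[0];
    for(size_t i = 0; i < n; i++) {
        uint32_t next = (i + 1 < n) ? buf[i + 1] : 0;  // this load is in flight while we compute
        buf[i] = work(current);
        current = next;
    }
}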

Vector Instructions / Single Instruction Multiple Data (SIMD) / Processor Instruction Extensions (PIE)
All the same thing, and Valentin is working on integrating them into epdiy. These allow working with 128 bits per instruction. I haven't tackled them myself yet, because I haven't cleared all the challenges they pose in my use case yet.
Zero-overhead loops
Another nugget I found in the memorycopy repo mentioned above is the "zero-overhead" loop that uses the LOOPNEZ instruction from the Xtensa Instruction Set Architecture (page 477). I haven't investigated this myself yet, but since the author of the memorycopy repo took the effort to write an explicit function to utilize it, I do not think the compiler does it automagically. A rough sketch of what using it from inline assembly could look like is below.
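Something like this made-up word-copy routine, which assumes the Xtensa loop option of the ESP32-S3 and which I haven't benchmarked myself:

#include <stdint.h>

// Copy n 32-bit words using the LOOPNEZ zero-overhead loop: no compare/branch per iteration,
// the loop count lives in the LCOUNT special register. LOOPNEZ skips the body if n is 0.
static inline void zol_copy_words(uint32_t* dst, const uint32_t* src, uint32_t n) {
    uint32_t tmp;
    asm volatile(
        "loopnez %[n], .Lzol_end%=  \n"  // hardware loop over the next four instructions
        "l32i    %[tmp], %[src], 0  \n"
        "s32i    %[tmp], %[dst], 0  \n"
        "addi    %[src], %[src], 4  \n"
        "addi    %[dst], %[dst], 4  \n"
        ".Lzol_end%=:               \n"
        : [tmp] "=&r" (tmp), [src] "+r" (src), [dst] "+r" (dst)
        : [n] "r" (n)
        : "memory");
}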
That's it from the top of my head. Do you have any more tricks up your sleeves? Things that seem interesting but I haven't grasped yet: