use binary instead of linear search #30
I have also upstreamed the following change (without a pull request) that uses lookup-tables to find glyphs: |
Thank you very much! I will create a branch from it, test it on the device and at least measure average power consumption and compare with linear and binary search. Already your '__flash' upgrade seems to save a little power, and the binary search as well. My way of measuring is not very accurate however, but it is quite realistic I think. Your solution with lookup-tables however is quite advanced and I hope it would not be ungrateful if I eventually keep it as a branch and leave it with '__flash' + binary search for now? |
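For readers following along, the binary search from the title could look roughly like the sketch below. The `glyph_t` struct, the table contents and `find_glyph` are hypothetical and only illustrate the technique; the project's real glyph table has more fields and would sit in flash (for example via `const __flash`), which is left out so the sketch also compiles on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical glyph record; the real project struct differs. */
typedef struct {
    uint8_t code;   /* character code; the table is sorted ascending by this */
    uint8_t width;  /* glyph width in pixels */
} glyph_t;

/* Hypothetical table, sorted by code (32, 48, 49, 50, 67, 176). */
static const glyph_t glyphs[] = {
    { ' ', 4 }, { '0', 8 }, { '1', 8 }, { '2', 8 }, { 'C', 9 }, { 0xB0, 5 },
};
#define GLYPH_COUNT (sizeof glyphs / sizeof glyphs[0])

/* Binary search: returns the glyph with the given code, or NULL if absent. */
static const glyph_t *find_glyph(uint8_t code) {
    size_t lo = 0, hi = GLYPH_COUNT;        /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (glyphs[mid].code == code)
            return &glyphs[mid];
        if (glyphs[mid].code < code)
            lo = mid + 1;                   /* continue in the upper half */
        else
            hi = mid;                       /* continue in the lower half */
    }
    return NULL;
}
```

With N glyphs this needs about log2(N) comparisons per lookup instead of up to N for the linear search, provided the table stays sorted by code.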
Of course. Also I don't know anything about the timings in the project, what the power consumption is in the different states, and how much the ATmega328P contributes. Presumably most of the power consumption is due to sensor activity. |
For reference I rebased it against main as use-glyph-lookup-tables. Some ideas to speed up the code or save power:
|
Thank you, I just "stole" your branch.
Created #34
I'll try that out.
No idea either :-) |
In README.md timings and power consumption in the different states are described. I think updating the display consumes most power: A full update takes about 3 seconds where display and MCU are busy and active. So the fast update taking about 1.5 seconds helped a lot to save power. Without any power management, the three AAA batteries would last only about a week. Now I am hoping they will last at least a year and so far, after two months, it seems realistic. |
So the power is consumed (mostly) by the display, not the µC?
What's the bottleneck? Slow display, slow transmission, or slow µC? If the bottleneck is the display, and the µC mostly waits for the display to chew on transmitted bytes, then maybe slow down the µC.
The power consumption of an idealized µC core is
P(f) = S + B·f
where f is F_CPU, S is the static power consumption (the power consumed even when the µC is clocked with 0 MHz), and B describes the power consumption added due to the µC's speed. Usually (at a constant supply voltage), power consumption is proportional to current consumption, but what counts is energy consumption, which is power consumption summed over time. This is the energy E (proportional to Coulombs or Ampère·hours). For a time interval Δt, with Ticks = f·Δt:
E(f) = P(f)·Δt = (S + B·f)·Δt = S·Δt + B·Ticks
If static power consumption S can be neglected, then S·Δt is small and energy consumption is mostly due to B·Ticks, i.e. it does not depend on the consumed time but only on the complexity of the code (Ticks ≈ number of instructions to execute).
This means that when the code must wait, e.g. due to slow peripherals, it might be advantageous to slow down F_CPU, because during a fixed time period Δt the µC executes more code (mostly waits) if F_CPU is high. The only constraint is that F_CPU is high enough to comply with the real-time requirements. |
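To make the model above concrete, here is a small sketch that evaluates E = (S + B·f)·Δt for the two situations discussed: a fixed amount of work (Δt = Ticks / f) and a fixed wake time (the core waits on something slow). The function names and all numbers are made up for illustration and are not measurements from the device.

```c
#include <assert.h>

/* Idealized core: P(f) = S + B*f, with S in mW and B in mW per MHz. */

/* Energy in microjoules when the core is awake for a fixed time dt_us,
   e.g. while busy-waiting on a slow peripheral. */
static double energy_fixed_time_uj(double s_mw, double b_mw_per_mhz,
                                   double f_mhz, double dt_us) {
    /* mW * us = nJ; the factor 1e-3 converts nJ to uJ. */
    return (s_mw + b_mw_per_mhz * f_mhz) * dt_us * 1e-3;
}

/* Energy in microjoules for a fixed amount of work of `ticks` cycles.
   The run time shrinks as f grows: dt = ticks / f. */
static double energy_fixed_work_uj(double s_mw, double b_mw_per_mhz,
                                   double f_mhz, double ticks) {
    assert(f_mhz > 0.0);
    double dt_us = ticks / f_mhz;   /* cycles / MHz = microseconds */
    return energy_fixed_time_uj(s_mw, b_mw_per_mhz, f_mhz, dt_us);
}
```

With S = 0, `energy_fixed_work_uj` gives the same result at 1 MHz and at 8 MHz (only B·Ticks remains), while `energy_fixed_time_uj` grows with f. As soon as S is not negligible, the fixed-work case favours the faster clock, because the S·Δt term shrinks with the shorter run time.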
I think so. The display consumes a lot of power for a long time.
It appears to be the nature of these relatively cheap E-Ink displays. The initialization is quite complex: hardware reset, software reset, sense temperature, load waveform from lookup table, write image data to RAM and update. The update itself happens in three phases I think (in full update mode), all needed to rearrange these tiny plastic pellets in some electrophoretic liquid. The lower the temperature, the longer the procedure takes. In the oven at 50°C it goes quicker than at 20°C and in the freezer it froze solid somewhere below -10°C :-)
Wow, thank you for explaining. So I will measure average power consumption at 1 MHz instead of 8. And also with -O2 instead of -Os.
Do you think it is still worth a try? |
I tried with 1MHz and adjusting the dividers of the ADC and SPI clocks so that both run at the same speed as with 8MHz. Average power consumption is 115µA vs. 90µA at 8MHz. I suppose this is because many things go slower and thus the MCU (and maybe even the display) has to be active for a longer time. I tried also compiling with |
Then UART should use
It's not only about optimizations per se but in many cases also about the cost model. At least it doesn't hurt size-wise:
So almost no change in size. Note: I am avoiding anything newer than v8, mostly due to PR90706 and PR110093. The
I also had a look at the code, and there are some places that can be sped up. One of them is in
if ((i + 1) % height == 0) {
    n = i / height + 1;
}
which is expensive. Maybe better keep track of
// Prior to the i loop.
uint16_t i_mod_height = 0, i_div_height = 0;
// Each time i is incremented:
if (++i_mod_height == height) {
    i_mod_height = 0;
    i_div_height += 1;
}
and replace So each iteration of |
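A quick way to convince yourself that the two counters really track `i % height` and `i / height` is a host-side check like the sketch below. `count_mismatches` is a hypothetical helper written for this illustration; it is not part of the project.

```c
#include <assert.h>
#include <stdint.h>

/* Walks i = 0 .. count-1 and counts how often the incrementally tracked
   values disagree with the division-based ones (expected result: 0). */
static uint16_t count_mismatches(uint16_t count, uint16_t height) {
    assert(height > 0);
    uint16_t i_mod_height = 0, i_div_height = 0, mismatches = 0;
    for (uint16_t i = 0; i < count; i++) {
        /* Reference values computed with the expensive operations. */
        if (i_mod_height != i % height || i_div_height != i / height)
            mismatches++;
        /* The original test (i + 1) % height == 0 becomes a compare. */
        if (((i + 1) % height == 0) != (i_mod_height + 1 == height))
            mismatches++;
        /* Each time i is incremented: */
        if (++i_mod_height == height) {
            i_mod_height = 0;
            i_div_height += 1;
        }
    }
    return mismatches;
}
```

The ATmega328P has no hardware divider, so every `/` or `%` with a non-constant divisor becomes a library call, whereas the tracked version needs only an increment and a compare per iteration.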
I don't understand much of what you are discussing in those bug reports, but it sounds like the avr backend in gcc is obsolescent and the assembly it produces becomes less and less efficient because it doesn't get much attention any more? There seems to be no measurable difference in average power consumption of the device between 5.4.0 and 13.2.0, and the sizes are:
I'll build a v8 and see what it does.
Now that is really cool, removing 3 divisions that are done very often! Again I learned something really useful - cheers! Do I see it right that the compiler effectively turns something like |
The problems are in the middle-end, which is the part of GCC that depends explicitly neither on the back-end (target) nor on the front-end (language). More and more parts of the middle-end work well for bolides, but decreasingly well for AVR. One example (like in the PRs) is the register allocator.
So the project is still in the range where the code is slower than the display / communications, which means further optimizing the code leads to immediate power savings. (If more efficient code no longer leads to less power consumption, then F_CPU can be reduced as explained earlier.)
Maybe the power consumption can be modeled? Knowing up-time of display, duration of CPU activity, etc.?
Should be the case for |
So I now have a more controlled way of measuring, and I am actually surprised how stable and reproducible the results are. I also had a look with the scope. Here is the full update sequence, which happens shortly after POR and then every 288 seconds:
The second part of "Buffer Frame", "Write Bitmaps", is the code that was optimized. This is where searching for glyphs and writing the bitmaps to the separate SRAM used as frame buffer is done. Here is just the "Buffer Frame" part in the previous unoptimized version:
And the optimized version:
So, ~360 vs. ~480 ms, quite an improvement I would say! And here are some measurements of the average power consumption. Buffering the frame without copying SRAM to display RAM and without updating the display (so just "framebuffering"):
In this scenario, the display stays in deep sleep mode all the time, consuming only about 1µA. The voltage regulator, SRAM and bus transceiver are powered on and take about 13µA. Just 2µA gained with the optimization doesn't seem much, but considering that the code is executed only once every 288 seconds, the speed-up from ~480 to ~360ms is quite impressive I think. The sensors account for only about 1-2µA of the average power consumption. Including copying SRAM to display RAM and updating the display, average power consumption is 99µA (optimized version). What also helps to save a couple of µA is letting the watchdog wake up the MCU only every 8 seconds instead of every second. Since the MCU is sleeping by far most of the time, and the code doing some work every 288 seconds is already optimized, I wouldn't know where to save more power. 12-15 months with 3 AAA batteries is already not so bad I'd say. |
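As a rough plausibility check of the numbers above, the sketch below does the duty-cycle arithmetic. Two inputs are assumptions and not measurements from this thread: a usable capacity of about 1000 mAh for AAA cells in series, and an active current of 4.8 mA, picked only because 120 ms less active time every 288 s then works out to the reported 2µA. The helper names are hypothetical.

```c
#include <assert.h>

/* Average current (uA) contributed by a load that draws `active_ua`
   for `active_s` seconds once every `period_s` seconds. */
static double average_current_ua(double active_ua, double active_s,
                                 double period_s) {
    assert(period_s > 0.0);
    return active_ua * active_s / period_s;
}

/* Runtime in days for a capacity in mAh and an average current in uA. */
static double runtime_days(double capacity_mah, double average_ua) {
    assert(average_ua > 0.0);
    double hours = capacity_mah * 1000.0 / average_ua;  /* mAh -> uAh */
    return hours / 24.0;
}
```

Under these assumptions, `average_current_ua(4800.0, 0.12, 288.0)` gives 2µA, and `runtime_days(1000.0, 99.0)` gives roughly 420 days, which sits inside the 12-15 month estimate. Self-discharge and the regulator's cut-off voltage are ignored here.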
I tried that some time ago and got ~70µA, so a bit optimistic. But now that I know the exact timings and have at least a guess on how much power is flowing during the different phases, I will try that again! |
So even if the code consumed zero time, the max power gain would be just 6 µA... Anyway, there are some places where the code can be improved. For example, char codes are typedef uint16/8_t code_t; Notice that if you ever use a code in Glyph[] that doesn't fit into 8 bits, you'll get a diagnostic from the compiler (error or warning depending on flags). Using Likewise, for widths and heights one could use Other optimizations would be to speed up variable bit shifts which require looping. But that speed-up would require using (inline) functions instead of |
So here is a tiny optimization. IMHO the code is not harder to comprehend than the old code. It can avoid one shift-with-variable-offset in
// rotate 8 x 8 pixel
uint16_t m = i / 8 * 8;
for (uint8_t r = 0; r < 8; r++) {
    uint8_t bit = (next & (1 << (7 - r))) ? 1 : 0;
    rotated[r] |= bit << (7 - i + m);
}
Replacement:
// rotate 8 x 8 pixel
// uint16_t m = i / 8 * 8;
// We have i - m = i - i/8 * 8 = i & 7, and this mask
// does not depend on r and can be computed before the loop.
uint8_t mask = 1u << (7 - (i & 7));
// No need to loop if `next' is empty.
for (uint8_t r = 0; next; r++) {
    // `next' may be consumed and is not used after this loop.
    if (next & (1u << 7))
        rotated[r] |= mask;
    next <<= 1;
}
Here is a call of
size_t sramWriteString(uint16_t startAddress, char *data) {
    size_t length = strlen(data);
    size_t written = 0;
    for (size_t i = 0; i < length; i++) {
        uint16_t address = startAddress + i;
        char c = *data++;
        sramWrite(address, c);
        written++;
        if (address == SRAM_HIGH) {
            break;
        }
    }
    return written;
}
Replacement:
size_t sramWriteString(uint16_t address, const char *data) {
    size_t written = 0;
    for ( ; address != SRAM_HIGH; ++address) {
        char c = *data++;
        if (c == 0) break;
        sramWrite(address, c);
        written++;
    }
    return written;
}
Finally, there is |
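For anyone who wants to check the rotate replacement before adopting it, the two inner loops can be compared exhaustively on a host. In the sketch below, `rotate_old`, `rotate_new` and `count_rotate_mismatches` are hypothetical wrappers around the two snippets above, and `next` is assumed to be a `uint8_t` (the new loop relies on it becoming 0 after at most 8 shifts).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Old inner loop: one variable-offset shift per row r. */
static void rotate_old(uint8_t rotated[8], uint16_t i, uint8_t next) {
    uint16_t m = i / 8 * 8;
    for (uint8_t r = 0; r < 8; r++) {
        uint8_t bit = (next & (1 << (7 - r))) ? 1 : 0;
        rotated[r] |= bit << (7 - i + m);
    }
}

/* New inner loop: the mask is computed once, `next` is shifted by 1. */
static void rotate_new(uint8_t rotated[8], uint16_t i, uint8_t next) {
    uint8_t mask = 1u << (7 - (i & 7));
    for (uint8_t r = 0; next; r++) {
        assert(r < 8);              /* holds because next has only 8 bits */
        if (next & (1u << 7))
            rotated[r] |= mask;
        next <<= 1;
    }
}

/* Compares both versions for all byte values and 64 column indices. */
static unsigned count_rotate_mismatches(void) {
    unsigned mismatches = 0;
    for (uint16_t i = 0; i < 64; i++) {
        for (unsigned v = 0; v < 256; v++) {
            uint8_t a[8] = {0}, b[8] = {0};
            rotate_old(a, i, (uint8_t)v);
            rotate_new(b, i, (uint8_t)v);
            if (memcmp(a, b, sizeof a) != 0)
                mismatches++;
        }
    }
    return mismatches;
}
```

The check starts from zeroed buffers each time. Since both versions only OR bits into `rotated`, agreement on empty buffers carries over to buffers that already hold pixels.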
Again, thank you for taking the time to look into this!
I've added a
Your version looks neat and - as usual - worked copy-paste. But to be honest, for me it is harder to comprehend - at least by just looking at it. So I need to have a closer look and until then, I'll leave it with the old version as I don't want to commit something that I'm unable to explain how it works :)
Would have never guessed that gcc will fold the 2nd log. But if you say it doesn't save much I'd leave it as it is so this calculation stays a plain implementation of the simplified Steinhart-Hart equation (I really like the name). |
In several places, var_mod_height can be of type Moreover, if codes fit in
The idea is not to test a bit |
Point taken. Hope I have them all now.
Thanks for explaining! I'll play with that so it really sinks in... And I finally realized that of course setting/clearing the SRAM framebuffer can be done in sequential mode too: This again saves some µA's - I'd say with all the optimizations done now the device consumes ~10% less, which I think is quite an improvement! You helped me a lot to improve the project, and much more valuable for me is what I have learned from you. If you are interested I'll send you a free "copy" of the thermometer - batteries included! :) |