
use binary instead of linear search #30

Merged
merged 2 commits into main
Sep 6, 2023

Conversation

gitdode
Owner

@gitdode gitdode commented Sep 5, 2023

No description provided.

@sprintersb
Contributor

sprintersb commented Sep 6, 2023

I have also upstreamed the following change (without a pull request) that uses lookup-tables to find glyphs:

sprintersb@3294f3b

@gitdode
Owner Author

gitdode commented Sep 6, 2023

Thank you very much! I will create a branch from it, test it on the device, and at least measure average power consumption and compare with linear and binary search. Your '__flash' upgrade already seems to save a little power, and so does the binary search. My way of measuring is not very accurate, but I think it is quite realistic.

Your solution with lookup-tables, however, is quite advanced, and I hope it is not ungrateful if I eventually keep it as a branch and stay with '__flash' + binary search for now?
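For readers following along: the glyph table is sorted by ascending character code, which is what lets lookup drop from O(n) to O(log n). A minimal host-side sketch of such a search (`Glyph`, `findGlyph` and `demoGlyphs` are illustrative names; the real structs in font.h differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative glyph record; the project's real Font/Glyph structs differ. */
typedef struct {
    uint16_t code;            /* character code, table sorted ascending */
    const uint8_t *bitmap;    /* glyph pixel data */
} Glyph;

/* Classic binary search: returns the glyph for `code`, or NULL if absent. */
static const Glyph *findGlyph(const Glyph *glyphs, uint8_t length, uint16_t code) {
    uint8_t lo = 0, hi = length;
    while (lo < hi) {
        uint8_t mid = lo + (hi - lo) / 2;   /* avoids overflow of lo + hi */
        if (glyphs[mid].code < code) {
            lo = mid + 1;
        } else if (glyphs[mid].code > code) {
            hi = mid;
        } else {
            return &glyphs[mid];
        }
    }
    return NULL;
}

/* Small demo table (codes sorted ascending). */
static const Glyph demoGlyphs[] = {
    { 32, NULL }, { 48, NULL }, { 65, NULL }, { 97, NULL },
};
```

At most 7 compares for a 128-entry font versus up to 128 for linear search, so the win grows with font size.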

@gitdode gitdode merged commit 1499b99 into main Sep 6, 2023
1 check passed
@sprintersb
Contributor

sprintersb commented Sep 7, 2023

Of course. Also, I don't know anything about the timings in the project, what the power consumption is in the different states, and how much the ATmega328P contributes. Presumably most of the power consumption is due to sensor activity.

@sprintersb
Contributor

For reference I rebased it against main as use-glyph-lookup-tables.

Some ideas to speed up the code or save power:

  • In main(), there is:

              powerOnSensors();
              // give the humidity sensor time to settle
              _delay_ms(100);
              enableADC();
              measureValues();
              disableADC();
              powerOffSensors();

    where the controller does nothing during the delay. So maybe a power-save mode could be used.

  • A different approach is to not output the measured values immediately, but to store them until the next 32-second round and only then output / display them instead of delaying, and then wait or idle until the 100 ms are complete.

  • In spi.c, there is:

    uint8_t transmit(uint8_t data) {
        SPDR = data;
        loop_until_bit_is_set(SPSR, SPIF);
        return SPDR;
    }

    This could be split into two versions, one that transmits only, and one that transmits and reads:

    void transmit(uint8_t data) {
        loop_until_bit_is_set(SPSR, SPIF);
        SPDR = data;
    }
    uint8_t transmit_and_read(uint8_t data) {
        loop_until_bit_is_set(SPSR, SPIF);
        SPDR = data;
        loop_until_bit_is_set(SPSR, SPIF);
        return SPDR;
    }

    Notice that writing to SPDR and waiting for SPIF have been swapped in transmit. That way, the code that follows transmit calls runs in parallel with transmission, instead of dead waiting. This is useful if most calls just transmit without reading back, which is the case (at least by looking at the code statically).

  • There's enough program memory left to use -O2, but I have no idea whether it would make a dent. Same for -flto.

@gitdode
Owner Author

gitdode commented Sep 7, 2023

For reference I rebased it against main as use-glyph-lookup-tables.

Thank you, I just "stole" your branch.

Some ideas to speed up the code or save power:

* In `main()`, there is:
  ```c++
            powerOnSensors();
            // give the humidity sensor time to settle
            _delay_ms(100);
            enableADC();
            measureValues();
            disableADC();
            powerOffSensors();
  ```
    
  where the controller does nothing during the `delay`. So maybe a power-save mode could be used.

Created #34

* A different approach is to not output the measured values immediately, but to store them until the next 32-second round and only then output / display them instead of `delay`, and then wait or idle until the 100 ms are complete.

* In `spi.c`, there is:
  ```c++
  uint8_t transmit(uint8_t data) {
      SPDR = data;
      loop_until_bit_is_set(SPSR, SPIF);
      return SPDR;
  }
  ```
    
  This could be split into two versions, one that transmits only, and one that transmits and reads:
  ```c++
  void transmit(uint8_t data) {
      loop_until_bit_is_set(SPSR, SPIF);
      SPDR = data;
  }
  uint8_t transmit_and_read(uint8_t data) {
      loop_until_bit_is_set(SPSR, SPIF);
      SPDR = data;
      loop_until_bit_is_set(SPSR, SPIF);
      return SPDR;
  }
  ```
    
  Notice that writing to `SPDR` and waiting for `SPIF` have been swapped in `transmit`. That way, the code that follows `transmit` calls runs in parallel with transmission, instead of dead waiting.  This is useful if most calls just transmit without reading back, which is the case (at least by looking at the code statically).

I'll try that out.

* There's enough program memory left to use `-O2`, but I have no idea whether it would make a dent. Same for `-flto`.

No idea either :-)

@gitdode
Owner Author

gitdode commented Sep 7, 2023

Of course. Also, I don't know anything about the timings in the project, what the power consumption is in the different states, and how much the ATmega328P contributes. Presumably most of the power consumption is due to sensor activity.

Timings and power consumption in the different states are described in README.md. I think updating the display consumes the most power: a full update takes about 3 seconds, during which display and MCU are busy and active. So the fast update, taking about 1.5 seconds, helped a lot to save power.
Disabling ADC and SPI before entering sleep mode also drastically reduces power consumption.

Without any power management, the three AAA batteries would last only about a week. Now I am hoping they will last at least a year and so far, after two months, it seems realistic.

@sprintersb
Contributor

I think updating the display consumes most power

So the power is consumed (mostly) by the display, not the µC?

A full update takes about 3 seconds

What's the bottleneck? Slow display, slow transmission, or slow µC?

If the bottleneck is the display, and the µC mostly waits for the display to chew on transmitted bytes, then maybe slow down the µC.

The power consumption of an idealized µC core is P(f) = S + B·f, where f is F_CPU, S is the static power consumption (the power consumed even when the µC is clocked at 0 MHz), and B describes the power consumption added due to the µC's speed. Usually, power consumption is proportional to current consumption, but what counts is energy consumption, which is power consumption summed over time. This is energy E (proportional to Coulombs or Ampère·hours). For a time interval Δt:

E(f) = P(f)·Δt = (S + B·f)·Δt = S·Δt + B·Ticks

If static power consumption S can be neglected, then S·Δt is small and energy consumption is mostly due to B·Ticks, i.e. does not depend on the consumed time but only on the complexity of the code (Ticks ≈ number of instructions to execute).

This means that when the code must wait, e.g. due to slow peripherals etc., then it might be advantageous to slow down F_CPU, because during some fixed time period Δt, the µC executes more code (mostly waits) if F_CPU is high.

The only constraint is that F_CPU is high enough to comply with the real-time requirements.
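The model above can be sanity-checked numerically. A host-side sketch, with made-up values for S and B (not ATmega328P datasheet figures):

```c
#include <assert.h>

/* E(f) = P(f)*dt = S*dt + B*f*dt, as in the comment above.
   S (static power) and B (per-Hz cost) are arbitrary illustration
   constants, not measured values. */
static double energy(double S, double B, double f, double dt) {
    return S * dt + B * f * dt;
}
```

For a wait-bound interval (fixed Δt), energy rises with f; for a CPU-bound task (fixed Ticks, so Δt = Ticks/f), a higher f amortizes the static term S·Δt.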

@gitdode gitdode deleted the binary-search branch September 9, 2023 19:01
@gitdode
Owner Author

gitdode commented Sep 9, 2023

So the power is consumed (mostly) by the display, not the µC?

I think so. The display consumes a lot of power for a long time.

What's the bottleneck? Slow display, slow transmission, or slow µC?

It appears to be the nature of these relatively cheap E-Ink displays. The initialization is quite complex: hardware reset, software reset, sense temperature, load waveform from lookup table, write image data to RAM and update. The update itself happens in three phases I think (in full update mode), all needed to rearrange these tiny plastic pellets in some electrophoretic liquid. The lower the temperature, the longer the procedure takes. In the oven at 50°C it goes quicker than at 20°C and in the freezer it froze solid somewhere below -10°C :-)

If the bottleneck is the display, and the µC mostly waits for the display to chew on transmitted bytes, then maybe slow down the µC.

The power consumption of an idealized µC core is P(f) = S + B·f, where f is F_CPU, S is the static power consumption (the power consumed even when the µC is clocked at 0 MHz), and B describes the power consumption added due to the µC's speed. Usually, power consumption is proportional to current consumption, but what counts is energy consumption, which is power consumption summed over time. This is energy E (proportional to Coulombs or Ampère·hours). For a time interval Δt:

E(f) = P(f)·Δt = (S + B·f)·Δt = S·Δt + B·Ticks

If static power consumption S can be neglected, then S·Δt is small and energy consumption is mostly due to B·Ticks, i.e. does not depend on the consumed time but only on the complexity of the code (Ticks ≈ number of instructions to execute).

This means that when the code must wait, e.g. due to slow peripherals etc., then it might be advantageous to slow down F_CPU, because during some fixed time period Δt, the µC executes more code (mostly waits) if F_CPU is high.

The only constraint is that F_CPU is high enough to comply with the real-time requirements.

Wow, thank you for explaining. So I will measure average power consumption at 1 MHz instead of 8.

And also with -O2 instead of -Os.
Even though I just read that:

-Os Optimize for size. -Os enables all -O2 optimizations except those that often increase code size:

-falign-functions  -falign-jumps
-falign-labels  -falign-loops
-fprefetch-loop-arrays  -freorder-blocks-algorithm=stc

Do you think it is still worth a try?

@gitdode
Owner Author

gitdode commented Sep 10, 2023

I tried 1 MHz, adjusting the dividers of the ADC and SPI clocks so that both run at the same speed as with 8 MHz. Average power consumption is 115 µA vs. 90 µA at 8 MHz. I suppose this is because many things go slower and thus the MCU (and maybe even the display) has to be active for a longer time.

I also tried compiling with -O2 instead of -Os, and that seems to save a couple of µA, which is nice, so I went with that.

@sprintersb
Contributor

If the bottleneck is the display, and the µC mostly waits for the display to chew on transmtted bytes, then maybe slow down the µC.
Wow, thank you for explaining. So I will measure average power consumption at 1 MHz instead of 8.

Then UART should use U2X = 1, otherwise the timing error is at 7% for 9600 Bd. With U2X = 1 the error is 0.2%.

And also with -O2 instead of -Os. Even though I just read that:

-Os Optimize for size. -Os enables all -O2 optimizations except those that often increase code size:
-falign-functions -falign-jumps -falign-labels -falign-loops
-fprefetch-loop-arrays -freorder-blocks-algorithm=stc
Do you think it is still worth a try?

It's not only about optimizations per se but in many cases also about the cost model. At least it doesn't hurt size-wise:

  • With avr-gcc v8 -Os, size is 16460 bytes.
  • With avr-gcc v8 -O2 -flto, size is 16466 bytes.

So almost no change in size. Note: I am avoiding anything newer than v8, mostly due to PR90706 and PR110093. The ld I am using with avr-gcc v8 complains about combining -r with -mrelax with LTO, and I have a gcc patch for that. Without it, you can work around with -Wl,--relax -Wa,--mlink-relax in place of -mrelax. Option --mlink-relax keeps the assembler from trying to be smarter than it is.

I also had a look at the code, and there are some places that can be sped up.

One of them is in display.c::bufferBitmap(): in each loop iteration over i, there are two 16-bit divisions with an unknown 16-bit divisor height:

if ((i + 1) % height == 0) {
    n = i / height + 1;
}

which is expensive. It may be better to keep track of i mod height like:

// Prior to the i loop.
uint16_t i_mod_height = 0, i_div_height = 0;

// Each time i is incremented:
if (++i_mod_height == height) {
    i_mod_height = 0;
    i_div_height += 1;
}

and replace (i + 1) % height == 0 with i_mod_height == height - 1, etc.
This saves 2·size divisions. Likewise, the calls to bufferByte() add 8·size divisions. So better to keep track of when w % height == 0 by counting like above, and pass that info down to bufferByte().

So each iteration of i costs ten 16-bit divisions, which is around 2000 ticks.
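The substitution can be verified on a host: the running counters track i % height and i / height exactly, at the cost of one compare and one increment per step (countersMatchDivMod is a test harness, not project code):

```c
#include <assert.h>
#include <stdint.h>

/* Strength reduction: replace i % height and i / height inside a loop
   by counters updated as i increments. The div/mod here exist only to
   check the counters; the optimized loop would drop them entirely. */
static int countersMatchDivMod(uint16_t size, uint16_t height) {
    uint16_t i_mod_height = 0, i_div_height = 0;
    for (uint16_t i = 0; i < size; i++) {
        if (i_mod_height != i % height) return 0;
        if (i_div_height != i / height) return 0;
        if (++i_mod_height == height) {   /* cheap running update */
            i_mod_height = 0;
            i_div_height += 1;
        }
    }
    return 1;
}
```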

@gitdode
Owner Author

gitdode commented Sep 11, 2023

So almost no change in size. Note: I am avoiding anything newer than v8, mostly due to PR90706 and PR110093. The ld I am using with avr-gcc v8 complains about combining -r with -mrelax with LTO, and I have a gcc patch for that. Without it, you can work around with -Wl,--relax -Wa,--mlink-relax in place of -mrelax. Option --mlink-relax keeps the assembler from trying to be smarter than it is.

I don't understand much of what you are discussing in those bug reports, but it sounds like the avr backend in gcc is obsolescent and the assembly it produces becomes less and less efficient because it doesn't get much attention any more?

There seems to be no measurable difference in average power consumption of the device between 5.4.0 and 13.2.0, and the sizes are:

avr-gcc 5.4.0
   text    data     bss     dec     hex filename
  16770      72     104   16946    4232 thermidity.elf

avr-gcc 13.2.0
   text    data     bss     dec     hex filename
  16698      72     104   16874    41ea thermidity.elf

I'll build a v8 and see what it does.

I also had a look at the code, and there are some places that can be sped up.

Now that is really cool, removing 3 divisions that are done very often! Again I learned something really useful - cheers!
And it again seems to save some µA's (which is significant because that's a couple of %). It is hard to measure though, because the display consumes a lot in a short time but infrequently. Probably changing the program to let the MCU do the heavy lifting repeatedly and skipping the display update would give more meaningful results.

Do I see it right that the compiler effectively turns something like n += width / 8 into n += width >> 3 so one doesn't have to worry about divisions with a divisor that is a power of 2?

@sprintersb
Contributor

I don't understand much of what you are discussing in those bug reports, but it sounds like the avr backend in gcc is obsolescent and the assembly it produces becomes less and less efficient because it doesn't get much attention any more?

The problems are in the middle-end, which is the part of GCC that depends explicitly neither on the back-end (target) nor on the front-end (language). There are more and more parts of the middle-end that work well for big machines, but decreasingly well for AVR. One example (as in the PRs) is the register allocator.

I also had a look at the code, and there are some places that can be sped up.

Now that is really cool, removing 3 divisions that are done very often! Again I learned something really useful - cheers! And it again seems to save some µA's (which is significant because that's a couple of %).

So the project is still in the range where the code is slower than the display / communications, which means further optimizing the code leads to immediate power savings. (If more efficient code no longer leads to less power consumption, then F_CPU can be reduced as explained earlier.)

It is hard to measure though, because the display consumes a lot in a short time but infrequently. Probably changing the program to let the MCU do the heavy lifting repeatedly and skipping the display update would give more meaningful results.

Maybe the power consumption can be modeled? Knowing up-time of display, duration of CPU activity, etc.?

Do I see it right that the compiler effectively turns something like n += width / 8 into n += width >> 3

Should be the case for unsigned. For signed values, a division is not the same as a shift.
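This can be demonstrated on any gcc host (the signed case relies on gcc's arithmetic right shift; right-shifting a negative value is implementation-defined in the C standard — the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* For unsigned operands, gcc emits the same shift for both forms. */
static int divIsShiftUnsigned(uint16_t w) {
    return (w / 8) == (w >> 3);
}

/* For signed operands the two differ on negatives: division truncates
   toward zero, while gcc's arithmetic shift rounds toward -infinity. */
static int divIsShiftSigned(int16_t w) {
    return (w / 8) == (w >> 3);
}
```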

@gitdode
Owner Author

gitdode commented Sep 14, 2023

So I now have a more controlled way of measuring, and I am actually surprised how stable and reproducible the results are. I also had a look with the scope.

Here the full update sequence which happens shortly after POR and then every 288 seconds:

FullSequence

The second part of "Buffer Frame", "Write Bitmaps", is the code that was optimized. This is where searching for glyphs and writing the bitmaps to the separate SRAM used as frame buffer is done.
Then the display is initialized and the contents of the SRAM are copied to the display's RAM. Then follows the long display update (fast mode), during which virtually no code is executed (just a few commands sent to the display driver).

Here is just the "Buffer Frame" part in the previous unoptimized version:

BufferingUnOptimized

And the optimized version:

BufferingOptimized

So, ~360 vs. ~480ms - quite an improvement I would say!

And here some measurements of the average power consumption.

Buffering the frame without copying SRAM to display RAM and without updating the display (so just "framebuffering"):

Unoptimized: 77µA
Optimized:   75µA

In this scenario, the display stays in deep sleep mode all the time, consuming only about 1 µA. The voltage regulator, SRAM and bus transceiver are powered on and take about 13 µA.

Just 2 µA gained from the optimization doesn't seem like much, but considering that the code is executed only once every 288 seconds and that it sped up from ~480 to ~360 ms, it is quite impressive I think.

The sensors account for only about 1-2µA of the average power consumption.

Including copying SRAM to display RAM and updating the display, average power consumption is 99µA (optimized version).

What also helps to save a couple of µA is letting the watchdog wake up the MCU only every 8 seconds instead of every second.

Since the MCU is sleeping by far most of the time, and the code doing some work every 288 seconds is already optimized, I wouldn't know where to save more power. 12-15 months with 3 AAA batteries is already not bad, I'd say.

@gitdode
Owner Author

gitdode commented Sep 14, 2023

Maybe the power consumption can be modeled? Knowing up-time of display, duration of CPU activity, etc.?

I tried that some time ago and got ~70µA, so a bit optimistic. But now that I know the exact timings and have at least a guess on how much power is flowing during the different phases, I will try that again!

@sprintersb
Contributor

So even if the code consumed zero time, the maximum power gain would be just 6 µA...

Anyway, there are some places where the code can be improved. For example, char codes are uint16_t. If you instead use code_t (like I did in the use-glyph-lookup-tables branch), then switching from 16-bit to 8-bit is a one-liner in font.h:

typedef uint16/8_t code_t;

Notice that if you ever use a code in Glyph[] that doesn't fit into 8 bits, you'll get a diagnostic from the compiler (error or warning depending on flags).

Using code_t makes the code a bit more comprehensible without adding any overhead, and using 8-bit types saves program memory and a bit of register pressure. Also, in expressions like (l+m)/2 in the binary search, l and m are promoted to 16 bits before being added.

Likewise, for widths and heights one could use width_t and height_t without any overhead; again, switching to 8-bit types saves flash in Glyph[] and reduces register pressure etc. Variables like i_mod_height would also be of type height_t, and in (width * height) / 8 the factors are promoted to 16 bits. (And there is already the limitation that width * height fits in 16 bits.)

Other optimizations would be to speed up variable bit shifts, which require looping on AVR. But that speed-up would require using (inline) functions instead of a << b, and would bring much less gain compared to kicking out the divisions.
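As an aside, the integer promotion mentioned for (l+m)/2 has a side effect that can be shown on any host: because uint8_t operands are promoted to int before the add, the sum cannot wrap at 256 (midpoint is an illustrative helper, not project code):

```c
#include <assert.h>
#include <stdint.h>

/* In C, uint8_t operands of + are promoted to int, so l + m does not
   wrap at 256 and the midpoint stays exact. Only assigning the sum
   itself back into a uint8_t would truncate. */
static uint8_t midpoint(uint8_t l, uint8_t m) {
    return (uint8_t)((l + m) / 2);   /* sum computed in promoted int */
}
```

On AVR, that promotion is exactly the 16-bit arithmetic (and register pressure) the comment above refers to.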

@sprintersb
Contributor

So here is a tiny optimization. IMHO the code is not harder to comprehend than the old code. It avoids one shift-with-variable-offset in bufferBitmap().
Old snippet:

        // rotate 8 x 8 pixel
        uint16_t m = i / 8 * 8;
        for (uint8_t r = 0; r < 8; r++) {
            uint8_t bit = (next & (1 << (7 - r))) ? 1 : 0;
            rotated[r] |= bit << (7 - i + m);
        }

Replacement:

        // rotate 8 x 8 pixel
        // uint16_t m = i / 8 * 8;
        // We have i - m = i - i/8 * 8 = i & 7, and this mask
        // does not depend on r and can be computed before the loop.
        uint8_t mask = 1u << (7 - (i & 7));
        // No need to loop if `next' is empty.
        for (uint8_t r = 0; next; r++) {
            // `next' may be consumed and is not used after this loop.
            if (next & (1u << 7))
                rotated[r] |= mask;
            next <<= 1;
        }
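Both variants can be compared on a host; the harness below (illustrative, not project code) checks that the old and new rotation write identical bytes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Original rotation: tests bit (7-r) of `next' and shifts it to a
   column position derived from i. */
static void rotateOld(uint8_t next, uint16_t i, uint8_t rotated[8]) {
    uint16_t m = i / 8 * 8;
    for (uint8_t r = 0; r < 8; r++) {
        uint8_t bit = (next & (1 << (7 - r))) ? 1 : 0;
        rotated[r] |= (uint8_t)(bit << (7 - i + m));
    }
}

/* Rewritten rotation: fixed mask, `next' consumed MSB-first by shifting. */
static void rotateNew(uint8_t next, uint16_t i, uint8_t rotated[8]) {
    uint8_t mask = 1u << (7 - (i & 7));
    for (uint8_t r = 0; next; r++) {
        if (next & (1u << 7))
            rotated[r] |= mask;
        next <<= 1;
    }
}

/* Returns 1 if both variants produce the same 8 output bytes. */
static int sameRotation(uint8_t next, uint16_t i) {
    uint8_t a[8] = {0}, b[8] = {0};
    rotateOld(next, i, a);
    rotateNew(next, i, b);
    return memcmp(a, b, 8) == 0;
}
```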

Here is a call to strlen that can be avoided, since the string is read byte-by-byte anyway:

size_t sramWriteString(uint16_t startAddress, char *data) {
    size_t length = strlen(data);
    size_t written = 0;
    for (size_t i = 0; i < length; i++) {
        uint16_t address = startAddress + i;
        char c = *data++;
        sramWrite(address, c);
        written++;
        if (address == SRAM_HIGH) {
            break;
        }
    }
    return written;
}

Replacement:

size_t sramWriteString(uint16_t address, const char *data) {
    size_t written = 0;
    for ( ; address != SRAM_HIGH; ++address) {
        char c = *data++;
        if (c == 0) break;
        sramWrite(address, c);
        written++;
    }
    return written;
}
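A quick host-side check of the rewrite (sramWrite here is a mock array store, and SRAM_HIGH an arbitrary test value; the real ones live in the project's SRAM driver):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock SRAM: an arbitrary small capacity for testing. */
#define SRAM_HIGH 16
static char sram[SRAM_HIGH + 1];

static void sramWrite(uint16_t address, char c) { sram[address] = c; }

/* The rewritten function: walks the string until NUL or SRAM_HIGH,
   with no up-front strlen pass. */
static size_t sramWriteString(uint16_t address, const char *data) {
    size_t written = 0;
    for ( ; address != SRAM_HIGH; ++address) {
        char c = *data++;
        if (c == 0) break;
        sramWrite(address, c);
        written++;
    }
    return written;
}
```

Note one boundary difference: the replacement stops before writing at SRAM_HIGH, while the original wrote that last byte and then broke out of the loop.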

Finally, there is log(resTh / TH_RESI), which can be replaced by log(resTh) - log(TH_RESI). gcc will fold the 2nd log, hence we replace a float division by a subtraction. But this only saves around 200 ticks, because a float subtraction is already relatively expensive (490 ticks for division vs. 180 ticks for subtraction).

@gitdode
Owner Author

gitdode commented Sep 17, 2023

Again, thank you for taking the time to look into this!

Anyway, there are some places where the code can be improved. For example, char codes are uint16_t. If you instead use code_t (like I did in the use-glyph-lookup-tables branch), then switching from 16-bit to 8-bit is a 1-liner in font.h

I've added a types.h with typedefs for code, width, height, row and col. Actually, all of them can be 8 instead of 16 bit, so there should be some savings in there.

So here is a tiny optimization. IMHO the code is not harder to comprehend than the old code. It avoids one shift-with-variable-offset in bufferBitmap()

Your version looks neat and, as usual, worked copy-paste. But to be honest, for me it is harder to comprehend, at least by just looking at it. So I need to have a closer look, and until then I'll leave the old version in, as I don't want to commit something that I can't explain :)

Finally, there is log(resTh / TH_RESI) that can be replaced by log(resTh) - log(TH_RESI). gcc will fold the 2nd log, hence we replace a float division by a subtraction.

I would have never guessed that gcc folds the 2nd log. But if you say it doesn't save much, I'd leave it as it is, so this calculation stays a plain implementation of the simplified Steinhart-Hart equation (I really like the name).

@sprintersb
Contributor

I've added a types.h with typedefs for code, width, height, row and col. Actually, all of them can be 8 instead of 16 Bit, so there should be some savings in there.

In several places, var_mod_height can be of type height_t because the modulus is in [0, height-1].

Moreover, if codes fit in code_t, then indices into the respective arrays also fit in code_t, because each code is present at most once. Hence, in the binary search for example, the indices l and m will fit in code_t (Font.length is already uint8_t).

Your version looks neat and - as usual - worked copy-paste. But to be honest, for me it is harder to comprehend

The idea is not to test a bit next.(7-r) at a variable offset that runs from 7 to 0, but instead to test bit next.7 and then left-shift next to get the next bit at position 7 until all bits are consumed.

@gitdode
Owner Author

gitdode commented Sep 20, 2023

In several places, var_mod_height can be of type height_t because the modulus is in [0, height-1].

Moreover, if codes fit in code_t, then indices into the respective arrays also fit in code_t, because each code is present at most once. Hence, in the binary search for example, the indices l and m will fit in code_t (Font.length is already uint8_t).

Point taken. Hope I have them all now.

The idea is not to test a bit next.(7-r) at a variable offset that runs from 7 to 0, but instead to test bit next.7 and then left-shift next to get the next bit at position 7 until all bits are consumed.

Thanks for explaining! I'll play with that so it really sinks in...

And I finally realized that of course setting/clearing the SRAM framebuffer can be done in sequential mode too:

ClearFrameSequential

This again saves some µA. With all the optimizations done, I'd say the device now consumes ~10% less, which I think is quite an improvement!

You helped me a lot to improve the project, and much more valuable for me is what I have learned from you. If you are interested I'll send you a free "copy" of the thermometer - batteries included! :)
