Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arduino LMIC stops after some time (6-9 hours) and doesn’t sends data #968

Open
mirhamza708 opened this issue Oct 5, 2024 · 16 comments
Open
Assignees
Labels

Comments

@mirhamza708
Copy link

mirhamza708 commented Oct 5, 2024

Hello guys,
I need help with this because I am trying for a month now to fix this. I have a node device and I have done all the configuration according to the github documentation of MCCI

I use RFM96 chip with stm32f103c8t6 and the pinmap is as follows

// Pin mapping
const lmic_pinmap lmic_pins = {
    .nss = PA4,
    .rxtx = LMIC_UNUSED_PIN,
    .rst = PC14,
    .dio = {PA1, PB13, PB11},
};

my platformio.ini file is as follows

; PlatformIO Project Configuration File
;
;   Build options: build flags, source filter
;   Upload options: custom upload port, speed and extra flags
;   Library options: dependencies, extra library storages
;   Advanced options: extra scripting
;
; Please visit documentation for the other options and examples
; https://docs.platformio.org/page/projectconf.html

[env:genericSTM32F103C8]
platform = ststm32
board = genericSTM32F103C8
framework = arduino
debug_tool = stlink
board_build.core = stm32
lib_deps = 
	adafruit/Adafruit ADS1X15@^2.5.0
	bblanchon/ArduinoJson@^7.2.0
	; matthijskooijman/IBM LMIC framework@^1.5.1
	mcci-catena/MCCI LoRaWAN LMIC library@^4.1.1
build_flags =
    -D ARDUINO_LMIC_PROJECT_CONFIG_H_SUPPRESS
    -D CFG_eu868=1
    -D CFG_sx1276_radio=1
    -D LMIC_LORAWAN_SPEC_VERSION=LMIC_LORAWAN_SPEC_VERSION_1_0_2

The devices functions properly just the Lorawan side of things doesnt seem right, here is an image from chirpstack server.

a screenshot from the serial port the join at 9:07 pm is here:

-157610058: EV_TXSTART
-157288904: EV_JOINED
netid: 0
devaddr: -------- (just to hide it)
AppSKey: --------------------------------------------------- (just to hide it)
NwkSKey: --------------------------------------------------- (just to hide it)
-157288075: EV_TXSTART
-157141837: EV_TXCOMPLETE (includes waiting for RX windows)

I don’t understand why the time gets overflown? its a signed integer in the library, can someone guide why used a signed integer for time keeping? I saw a comment in the source that the negative means the time has already passed. Its confusing so please if someone can guide I will be really thankful.

here is an image from the chirpstack server
8 Hours reset chirpstack

I have placed an 8 hours reset of the lorawan stack so it does this every 8 hours just to avoid a complete stop:

void lmicStart (void) {
      // LMIC init
  os_init();
  // Reset the MAC state. Session and pending data transfers will be discarded.
  LMIC_reset();
  // Set up the channels used by the Things Network, which corresponds
  // to the defaults of most gateways. Without this, only three base
  // channels from the LoRaWAN specification are used, which certainly
  // works, so it is good for debugging, but can overload those
  // frequencies, so be sure to configure the full frequency range of
  // your network here (unless your network autoconfigures them).
  // Setting up channels should happen after LMIC_setSession, as that
  // configures the minimal channel set.
  // NA-US channels 0-71 are configured automatically
  LMIC_setupChannel(0, 868100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(1, 868300000, DR_RANGE_MAP(DR_SF12, DR_SF7B), BAND_CENTI);      // g-band
  LMIC_setupChannel(2, 868500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(3, 867100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(4, 867300000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(5, 867500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(6, 867700000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(7, 867900000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(8, 868800000, DR_RANGE_MAP(DR_FSK,  DR_FSK),  BAND_MILLI);      // g2-band
  // TTN defines an additional channel at 869.525Mhz using SF9 for class B
  // devices' ping slots. LMIC does not have an easy way to define set this
  // frequency and support for class B is spotty and untested, so this
  // frequency is not configured here.

  // Disable link check validation
  LMIC_setLinkCheckMode(0);
  // // TTN uses SF9 for its RX2 window.
  // LMIC.dn2Dr = DR_SF9;
  // Set data rate and transmit power for uplink (note: txpow seems to be ignored by the library)
  LMIC_setDrTxpow(DR_SF10,14);
  LMIC_setClockError(MAX_CLOCK_ERROR * 10 /100);
  // Start job (sending automatically starts OTAA too)
  do_send(&mySendjob);
}

Without the 10% clock error (clock is setup at 24MHz using the 8MHz external oscillator and passed through the pll mul) set the join procedure takes 10-15 minutes. The hardware was designed by someone else so I can't really say much about it but the question is if it runs for 7-8 hours what's the reason that makes it stall.

Thank you guys, and if anyother information is required then please let me know.

@robepapp
Copy link

robepapp commented Oct 24, 2024

Hi,

Additional information as i think I have the same problem with an STM32L081CZ and an sx1272 radio.
The code based on the ttn-otaa example, tailored to my hardware. After 6-8 hours of successful operation, node misses sending data (or sends occasionally once per day)
In the log I can see similar minus values for the ostime like in the previous case.
Debug examples:

[2024-10-20 00:26:40.749] Current interrupts:1000
[2024-10-20 00:26:40.768] 2129580220: engineUpdate, opmode=0x908
[2024-10-20 00:26:40.783] EV_TXSTART
[2024-10-20 00:26:40.800] Leaving onEvent
[2024-10-20 00:26:40.800] 2129583168: TXMODE, freq=867100000, len=16, SF=12, BW=125, CR=4/5, IH=0
[2024-10-20 00:26:43.105] 2129727365: setupRx1 txrxFlags 0x20 --> 01
[2024-10-20 00:26:43.122] start single rx: now-rxtime: 14
[2024-10-20 00:26:43.139] 2129728819: RXMODE_SINGLE, freq=867100000, SF=12, BW=125, CR=4/5, IH=0
[2024-10-20 00:26:43.344] rxtimeout: entry: 2129742353 rxtime: 2129727983 entry-rxtime: 14370 now-entry: 14 rxtime-txend: 62375
[2024-10-20 00:26:44.105] 2129789863: setupRx2 txrxFlags 0x1 --> 02
[2024-10-20 00:26:44.122] start single rx: now-rxtime: 13
[2024-10-20 00:26:44.141] 2129790644: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0
[2024-10-20 00:26:44.345] rxtimeout: entry: 2129804851 rxtime: 2129790483 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 124875
[2024-10-20 00:26:44.395] 2129806828: processRx2DnData txrxFlags 0x2 --> 00
[2024-10-20 00:26:44.413] 2129808839: processDnData_norx txrxFlags 00 --> 20
[2024-10-20 00:26:44.446] EV_TXCOMPLETE (includes waiting for RX windows)
[2024-10-20 00:26:44.462] Leaving onEvent
[2024-10-20 00:26:44.479] 2129813202: engineUpdate, opmode=0x900
[2024-10-20 00:31:44.444] Prepare data for transmisson.
[2024-10-20 00:31:44.444] Current interrupts:1000
[2024-10-20 00:31:44.461] -2146406124: engineUpdate, opmode=0x908
[2024-10-20 00:31:44.494] -2146403958: engineUpdate, opmode=0x108
[2024-10-20 00:31:44.512] -2146403688: engineUpdate, opmode=0x108

Or the next with bit more details:

Current interrupts:1000
1332960928: engineUpdate, opmode=0x908
Enter onEvent
os_getTime = 1332963640 osTicks, 21327 sec
EV_TXSTART
Leaving onEvent
1332966051: TXMODE, freq=867700000, len=16, SF=12, BW=125, CR=4/5, IH=0
do_send finished
Tx result:0
1333110249: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 13
1333111708: RXMODE_SINGLE, freq=867700000, SF=12, BW=125, CR=4/5, IH=0
rxtimeout: entry: 1333125234 rxtime: 1333110866 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 62375
1333172749: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 13
1333174166: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0
rxtimeout: entry: 1333187734 rxtime: 1333173366 entry-rxtime: 14368 now-entry: 14 rxtime-txend: 124875
1333189482: processRx2DnData txrxFlags 0x2 --> 00
1333191613: processDnData_norx txrxFlags 00 --> 20
Enter onEvent
os_getTime = 1333193470 osTicks, 21331 sec
EV_TXCOMPLETE (includes waiting for RX windows)
Leaving onEvent
1333197636: engineUpdate, opmode=0x900
=0x108
-1096335584: engineUpdate, opmode=0x108
-1096335366: engineUpdate, opmode=0x108
-1096334167: engineUpdate, opmode=0x108
-1096333227: engineUpdate, opmode=0x108
-1096331804: engineUpdate, opmode=0x108
-1096329617: engineUpdate, opmode=0x108

The ostime_t is a uint32_t, it should not be an overflow, neither became a negative value so I dont understand that behavior.

@terrillmoore
Copy link
Member

ostime_t is not a uint32_t. It is an int32_t (see line 40 of oslmic_types.h). s4_t is int32_t and is signed.

Negative time values are expected and not a problem. The LMIC is coded taking that overflow into account. It's one of the two possible styles of handling time overflow. The classic one is using uint32_t, but the signed approach was taking in the original IBM Zurich code, and it was done correctly, as far as I can tell. I personally prefer the unsigned style, but in C with twos-complement and silent overflow, the two forms are 100% equivalent (though individual comparisons must be done differently).

@robepapp
Copy link

Thank you very much for your quick reply, I really appreciate it!
But actually the behavior of my node to stop sending messages apparently coincides with the event when the time value goes negative. And I can't find any solution to resolve it currently.

@terrillmoore
Copy link
Member

Sorry you're having problems. Because other users don't have this problem (specifically, because MCCI has run many devices for many months at a time), I have to guess that it's a system problem (the combination of LMIC and the rest of the firmware on the system) rather than a bug that is localized to the LMIC. To solve these, one needs to look at the entire code base. I suggest that you file a separate ticket, and attach a complete code base that demonstrates the problem -- of course, if you don't want to disclose your full code, you may need to develop a toy version that shows the bug. Sometimes preparing the toy version makes the bug go away; and that also gives a clue as to what to look for.

@robepapp
Copy link

Thank You, I will do so!

@mirhamza708
Copy link
Author

Thank You, I will do so!

Please remember to tag me there, I am also having the problem.

@robepapp
Copy link

robepapp commented Nov 3, 2024

In the last few days I have been testing my node with the basic ttn-otaa example program that I have attached too, but the problem still exists.

iM881aLoRaOtaa.txt

The behavior of the node is the same. The next is from the screenshot from the ChiprStack server DeviceData:

image

It can be seen that the scheduled data transmission is missed after a time (it is usually 7-10 hours) and the node start uploading data once or twice a day in apparently random cases. The original frequency of data transmission was 10 minutes, but the phenomenon is the same in case of 5 minutes. as well.

When restarting, everything starts from the beginning - that is the workaround applied by @mirhamza708

@terrillmoore
Copy link
Member

Once again. I'm sure everyone thinks they've given me an exact repro, but nobody has. It does not happen for me, with thousands of devices deployeed and a varient of applciations.

In order to debug this, I need;

  1. a complete app that demonstrates the problem. Attach a zip file to the ticket, please. Don't assume that I have it, please.
  2. exact information on your configuration (region, etc.) I see that some of you are using EU868.
  3. exact information on the device and Arduino BSP version that you are using, along with any other libraries.
  4. exact instructions on how to duplicate with The Things Network. Chirpstack has too many configuration knobs, I cannot set up Chirpstack to duplicatie anyone else's setup. I rely on TTN because they make most of the decisions and there aren't any knobs. Once we know that it works (or doesn't work) with TTN, we can explore other LNSs.

I know this might be too much work. But I won't be able to respond unless someone undertakes this. This is free software, folks: if you have a problem, the onus is partially on you to help replicate in order to get support.

Best regards,
--Terry

@mirhamza708
Copy link
Author

I will try to share the requested information. I have not used TTN but I will set this up. I will try to share a complete example with the issue being properly elaborated.

Thank you.

@robepapp
Copy link

Hi,
I configured one of the Lora nodes under the TTN network with the recommended default settings, but the error phenomenon is still the same - after about 8 hours of continuous, regular data transmission, the data transmission changed to random, once or twice daily upload.
If necessary, I will provide the contact information of the TTN Console.
I have also attached the relevant project files in a zip file.
STM32L081 Lora node PR.zip

Many thanks and Regards,
Robert

@terrillmoore
Copy link
Member

Thank you for posting the files. I have reviewed the code. I see you have serial debug output enabled. Can you please also post the debug output when things start to malfunction? Best regards, --Terry

@robepapp
Copy link

Hi,
The following is a debug example from a log, when a transmit procedure started, but not finished succesfully:

Current interrupts:1000
1332960928: engineUpdate, opmode=0x908
Enter onEvent
os_getTime = 1332963640 osTicks, 21327 sec
EV_TXSTART
Leaving onEvent
1332966051: TXMODE, freq=867700000, len=16, SF=12, BW=125, CR=4/5, IH=0
do_send finished
Tx result:0
1333110249: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 13
1333111708: RXMODE_SINGLE, freq=867700000, SF=12, BW=125, CR=4/5, IH=0
rxtimeout: entry: 1333125234 rxtime: 1333110866 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 62375
1333172749: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 13
1333174166: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0
rxtimeout: entry: 1333187734 rxtime: 1333173366 entry-rxtime: 14368 now-entry: 14 rxtime-txend: 124875
1333189482: processRx2DnData txrxFlags 0x2 --> 00
1333191613: processDnData_norx txrxFlags 00 --> 20
Enter onEvent
os_getTime = 1333193470 osTicks, 21331 sec
EV_TXCOMPLETE (includes waiting for RX windows)
Leaving onEvent
1333197636: engineUpdate, opmode=0x900
=0x108
-1096335584: engineUpdate, opmode=0x108
-1096335366: engineUpdate, opmode=0x108
-1096334167: engineUpdate, opmode=0x108
-1096333227: engineUpdate, opmode=0x108
-1096331804: engineUpdate, opmode=0x108
-1096329617: engineUpdate, opmode=0x108

Also, attached a whole debug log file containing when a node is starting until the malfunction happens.
teraterm2.log

Thanks & Regards,
Robert

@terrillmoore
Copy link
Member

Thank you for posting the log file.

At line 1509, I see EV_TXCOMPLETE. But after that I don't see any evidence that the code is calling the transmit path.

I see that ALL transmits in the recent path past prior to the failure are on channel 867,700,000 MHz. This behavior started at around line 269; prior to that it was using the other channels.

I don't know why the LMIC started to stick on one channel.

The device is transmitting at SF12. Based on the times printed, it appears that your message takes about 2.3 seconds of airtime.

The fact that there are no more launches of the transmission strongly indicates that sendjob has been lost from the queue. This can happen (in the sketch) if timing gets messed up. However, I suspect that lines were lost between 1511 and 1512; the line "=0x108" is almost certainly a fragment of the engineUpdate, opmode=0x108 printout that follows. Those lines are about 3.4ms apart (according to time stamp) and there's a huge jump between line 1511 (which is positive) and 1513 (which is negative). That's a jump of about 8 hours (coincidentally).

It is possible that on your BSP, something bad is happening to the LMIC's idea of time -- however, that wouldn't account for the missing text between 1511 and 1512.

In addition, by the way, you have an error in your do_send(). There are three paths in the if clauses; only one calls LMIC_setTxData2(). That means that if either of the other two paths are taken, the data won't be sent -- and that means you won't ever send again. The easy fix is to set DataTxError = 1 on the other two paths; that will at least result in a printout if it fails rather than printing

do_send finished
Tx result:0

suggesting that the sent was scheduled, even though it was not.

The opmode=0x108 printout is consistent with the LMIC waiting for channel bandwidth to become available (according to the duty cycle policy). If the clock is screwed up, it could wait approximately forever.

Try fixing the DataTxError problem, and try running again to see if you can get a full log with out missing lines around the time of the error. If necessary, stop (call abort, or just loop) if you see do_send returning non-zero.

@Gooseman42
Copy link

Gooseman42 commented Nov 15, 2024

Sadly I have the same problem here: getting stuck in a loop after 6-9h. I added some code inside the engineUpdate function to help me tag the problem (basically printing NOIDLE) but it did not help.. the code is basically still the same as the original ttn-otaa example.
ttn_otaa_stuckloop.zip
Serial log and code/libraries attached. If required I can provide more data.
Thanks for any pointers!
Achim

@robepapp
Copy link

robepapp commented Nov 18, 2024

Hi,

I modified the do_send() function of my otaa sketch to the same like in the ttn-otaa.ino example:

void do_send(osjob_t* j)
{
    // Check if there is not a current TX/RX job running
    if (LMIC.opmode & OP_TXRXPEND) {
        #ifdef SerialDbg 
        Serial1.println(F("OP_TXRXPEND, not sending"));
        #endif      
  }    
     else {
        // Prepare upstream data transmission at the next possible time.        
         #ifdef SerialDbg 
        Serial1.println(F("Prepare data for transmisson.")); 
        Serial1.print(F("Current interrupts:"));
        Serial1.println(inputCounter);
        #endif              
          mydata[0] = (int)((inputCounter & 0XFF));
          mydata[1] = (int)((inputCounter >> 8) & 0XFF);
          mydata[2] = (int)((inputCounter >> 16) & 0xFF);
          mydata[3] = (int)((inputCounter >> 24) & 0xFF);           
          LMIC_setTxData2(1, mydata, 3,0);
          #ifdef SerialDbg 
           Serial1.println(F("Pocket queued"));  
           #endif        
           } 
     #ifdef SerialDbg 
     Serial1.println(F("do_send finished"));  
     #endif  
    }

The test with the sketch ends with the same result new log file is uploaded.

However I have an important feedback regarding the time, when the problem occurs.
Please check the following log part with system time included:

[2024-11-14 20:55:49.721] Leaving onEvent
[2024-11-14 20:55:49.721] 2132085603: engineUpdate, opmode=0x900
[2024-11-14 21:05:49.705] Prepare data for transmisson.
[2024-11-14 21:05:49.705] Current interrupts:1000
[2024-11-14 21:05:49.721] -2125381671: engineUpdate, opmode=0x908
[2024-11-14 21:05:49.754] Pocket queued
[2024-11-14 21:05:49.754] do_send finished
[2024-11-14 21:05:49.771] -2125378661: engineUpdate, opmode=0x108
[2024-11-14 21:05:49.789] -2125376359: engineUpdate, opmode=0x108
[2024-11-14 21:05:49.805] -2125376140: engineUpdate, opmode=0x108
[2024-11-14 21:05:49.822] -2125375424: engineUpdate, opmode=0x108
[2024-11-14 21:05:49.855] -2125374001: engineUpdate, opmode=0x108

There are no lost lines, logging is continuous when the system time changes its sign.
The log from @Gooseman42 in this thread has that similar pattern, when regular, timed upload is interrupted and, in my case changes to occasional one-two logs per day.

otaa.log

Regards,
Robert

@terrillmoore
Copy link
Member

@robepapp Thanks for the log. @Gooseman42 sorry, wanted to get v5 out, and so that took all my time this weekend.

@robepapp Your code needs to capture (and print) the result of calling LMIC_setTxData2() -- it returns non-zero for failure. If it fails, you may not get a callback.

However, in this case, it probably didn't fail.

The opmode shows that the LMIC has chosen a new channel (900 -> 908 -> 108) and is waiting to send data (opmode & 0x0008).

I think the thing to do is confirm the path the code is taking. Please change line 2726 of lmic.c from:

    LMIC_X_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP);

to

    LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP);

I'm pretty certain that this path is being taken. If you are feeling brave, please add a couple of extra items to that print:

    LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t
        ". globalDutyRate=%" LMIC_PRId_ostime_t 
        " globalDutyAvail=%" LMIC_PRId_ostime_t 
        " txend=%" LMIC_PRId_ostime_t"
        " txcChnl=%d "
        "band=%d "
        "band.avail=%" LMIC_PRId_ostime_t
        "\n",
        now, txbeg-TX_RAMPUP, LMIC.globalDutyRate, LMIC.globalDutyAvail, LMIC.txend,
        LMIC.txChnl, LMIC.channelFreq[LMIC.txChnl] & 0x3,
        LMIC.bands[LMIC.channelFreq[LMIC.txChnl] & 0x3].avail);

This will show exactly what is going into the incorrect decision and will probably lead to a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants