[BUG] coreMQTT keep alive handling fails and never reconnects #48

Closed
lhammond opened this issue Sep 5, 2023 · 51 comments

Comments

@lhammond

lhammond commented Sep 5, 2023

Describe the bug
Please provide a clear and concise description explaining the bug.

System information

  • Hardware board: [ esp32s3 (Seeed Xiao) ]
  • IDE used: [ VSCode ]
  • Operating System: [ MacOS ]
  • Code version: [ v202212.00-23-gd25036b ]
  • Project/Demo: [ Temperature LED demo with AWS pub/sub ]

Expected behavior
Expected behavior would be for the MQTT subsystem to continue retries until reconnected.
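A minimal sketch of the expected "retry until reconnected" behavior, for illustration only. It is not taken from coreMQTT or from this repository: `connect_fn`, `sleep_fn`, `reconnect_until_connected`, the `fake_*` test doubles, and the backoff constants are all hypothetical names and values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: a real application would attempt the TLS + MQTT
 * connection in the connect hook and block the calling task in the sleep hook. */
typedef bool (*connect_fn)(void *ctx);
typedef void (*sleep_fn)(uint32_t ms);

/* Assumed values, chosen only for this example. */
#define BACKOFF_BASE_MS 500u
#define BACKOFF_MAX_MS  32000u

/* Doubles the delay after each failed attempt, capped at BACKOFF_MAX_MS. */
static uint32_t next_backoff_ms(uint32_t current_ms)
{
    if (current_ms >= BACKOFF_MAX_MS / 2u) {
        return BACKOFF_MAX_MS;
    }
    return current_ms * 2u;
}

/* Keeps retrying until the connect hook succeeds; it never gives up.
 * Returns the number of attempts made (at least 1). */
static uint32_t reconnect_until_connected(connect_fn try_connect, void *ctx,
                                          sleep_fn sleep_ms)
{
    assert(try_connect != NULL);
    assert(sleep_ms != NULL);

    uint32_t delay_ms = BACKOFF_BASE_MS;
    uint32_t attempts = 0u;

    for (;;) {
        attempts++;
        if (try_connect(ctx)) {
            return attempts;
        }
        sleep_ms(delay_ms);
        delay_ms = next_backoff_ms(delay_ms);
    }
}

/* Test double: fails until the counter pointed to by ctx reaches zero. */
static bool fake_connect(void *ctx)
{
    uint32_t *remaining_failures = ctx;
    if (*remaining_failures == 0u) {
        return true;
    }
    (*remaining_failures)--;
    return false;
}

/* Test double: records the requested delay instead of sleeping. */
static uint32_t total_slept_ms;
static void fake_sleep(uint32_t ms)
{
    total_slept_ms += ms;
}
```

Calling `reconnect_until_connected(fake_connect, &failures, fake_sleep)` with `failures` set to 3 makes four attempts and waits 500, 1000, and 2000 ms between them. The point of the sketch is that the loop has no exit other than a successful connection, which is the behavior this report says is missing.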

Screenshots or console output
image

Steps to reproduce bug
Example:
1. "I am using project [ ... ], and have configured with [ ... ]"
2. "When run on [ ... ], I observed that [ ... ]"

Code to reproduce bug

idf.py -p /dev/cu.usbmodem14101 flash monitor
@Skptak
Member

Skptak commented Sep 6, 2023

Hey @lhammond, thanks for submitting this issue!
It appears this is an issue many people are running into, since #41, #45, #46, and #47 all appear to be similar issues with the coreMQTT connection to the AWS IoT broker.

I've reached out to the team that works directly with coreMQTT and with the ESP32 boards to see if we can get to the bottom of what is causing these issues.

@lhammond
Author

lhammond commented Sep 6, 2023 via email

@lhammond
Author

@Skptak Is there anything we can do to help push towards resolution? Should I be monitoring this situation in another place? Thank you!

@rawalexe
Member

rawalexe commented Sep 11, 2023

@lhammond,
Can you please give a bit more insight into the problem? What version of esp-idf are you using? Have you flashed your credentials to the device, and are your Thing name and endpoint set for the correct account? Did you provision your device?

I was not able to reproduce your problem; here is a partial screenshot of my logs:
Screenshot 2023-09-11 at 4 44 40 PM

For your reference, I followed this README: https://github.com/FreeRTOS/iot-reference-esp32c3/blob/2dccbcad1a0e54ec2e32cc242d4bf4f4ab6c1274/GettingStartedGuide.md

@rawalexe
Member

Can you also please paste your sdkconfig file for the S3?

@gavin-hy

@rawalexe Have you tested it over a long period of time?
When I tested it, this happened within four hours! When I reconfigured the network, it was able to reconnect to the broker, but the above problem still occurred after a while. There is another problem: if I unplug the AP's network cable and plug it back in after a while, the device is no longer able to actively report information.
Below is a screenshot of my log and sdkconfig.
1694506767402
1694507127378

1694507127363
image

@lhammond
Author

Hi @rawalexe you can see the version at the top of this issue thread .. v202212.00-23-gd25036b
I am using the LED pub sub demo, not OTA.

You will see the issue I'm experiencing in the screenshot in the original post. If left alone, the device eventually disconnects and continues to output the "no command structure" forever.

@Skptak
Member

Skptak commented Sep 12, 2023

Hey @lhammond, sorry for the delay in getting back to you. The team has been looking into this issue to try and provide support. I've ordered an ESP32-S3 so I can try and replicate your exact environment as we can't seem to replicate this issue on the ESP32-C3.

While I wait for the board to get here I'm wondering if you tried this potential fix that @ActoryOu mentioned in #46?

It seems like 1 second timeout is not enough for the device to finish TLS flow.
Could you help to set Featured FreeRTOS IoT Integration -> TLS Transport Send / Receive timeout in milliseconds to 10000 by idf.py menuconfig and retest?

I'm wondering if the timeout on the TLS transport send/receives might be what is causing the MQTT agent to go down.
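As a rough illustration of why a short receive timeout can matter, here is a small simulation, assuming a reply (for example the CONNACK that follows the TLS handshake) that only becomes available some time after the wait begins. It is not the transport code from this repository: `sim_link_t`, `wait_for_reply`, and the 100 ms poll interval are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated link: the reply becomes available once the
 * simulated clock now_ms reaches ready_at_ms. */
typedef struct {
    uint32_t now_ms;
    uint32_t ready_at_ms;
} sim_link_t;

/* Assumed poll granularity for this example. */
#define POLL_INTERVAL_MS 100u

/* Polls until the reply is ready or timeout_ms has elapsed.
 * Returns true only if the reply arrived within the timeout. */
static bool wait_for_reply(sim_link_t *link, uint32_t timeout_ms)
{
    assert(link != NULL);

    uint32_t start_ms = link->now_ms;

    while ((link->now_ms - start_ms) < timeout_ms) {
        if (link->now_ms >= link->ready_at_ms) {
            return true;
        }
        /* Advancing the simulated clock stands in for a blocking delay. */
        link->now_ms += POLL_INTERVAL_MS;
    }
    return false;
}
```

With a link whose reply is ready after 2500 ms, `wait_for_reply(&link, 1000u)` returns false while `wait_for_reply(&link, 10000u)` on a fresh link returns true. Whether the real failure in this issue is caused by the timeout is exactly what the suggested retest is meant to find out.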

Thanks again for your patience with this!

@lhammond
Author

@gavin-hy what MCU are you running?

@lhammond
Author

Hey @Skptak .. I'm away from my lab for a day and will try the potential fix you mention upon return. Thanks!

@gavin-hy

@gavin-hy what MCU are you running?

ESP32-C3

@rawalexe
Member

rawalexe commented Sep 14, 2023

Hi @rawalexe you can see the version at the top of this issue thread .. v202212.00-23-gd25036b I am using the LED pub sub demo, not OTA.

You will see the issue I'm experiencing in the screenshot in the original post. If left alone, the device eventually disconnects and continues to output the "no command structure" forever.

Hello @lhammond and @gavin-hy
I am running all the demos, as you can see in my attached screenshot. I'll try to replicate your issue by running just the temp sub pub demo over a long period of time.

@lhammond
Author

lhammond commented Sep 14, 2023 via email

@anubhavrawal

@lhammond can you please provide your whole sdkconfig file for the S3, with your endpoint removed, so that I have a 1-1 match for replicating your issue on the S3?

Thank you,
AR

@rawalexe
Member

Hello @lhammond @gavin-hy,
I've tried killing the connection and then bringing it back up, and keeping it alive for long hours, and I am still not able to reproduce the bug.
Screenshot 2023-09-15 at 2 38 07 PM

Have you changed the code to any degree? Can you please send me a zip of your repo?

Best Regards,
AR

@lhammond
Author

@Skptak there was no change in behavior by changing the TLS Transport Send / Receive timeout to 10000

@lhammond
Author

@rawalexe yes, I have made some modifications. How can I send you the zip file?

@lhammond
Author

@lhammond can you please provide your whole sdkconfig file for S3. With your endpoint removed. So that I have a 1-1 for replication your issue for S3.

Thank you, AR

I am not using OTA demo nor S3 .. do you still need the sdkconfig?

@lhammond
Author

lhammond commented Sep 19, 2023

@rawalexe @anubhavrawal It's too big to email, so I just shared a Google Drive link to your email. Let me know if you can't download it. I'm happy to get on a Google Meet if you'd like.

My edits were intended to comment out the publish loop ( not using a temp sensor ) and add a few helper functions to control a neopixel 16 ring. Here's my git status

image

@FreeRTOS FreeRTOS deleted a comment from anubhavrawal Sep 19, 2023
@rawalexe
Member

rawalexe commented Sep 19, 2023

@lhammond ,
Thank you for sharing the code. I was running all the demos to see if any others fail as well. However, for my other runs I disabled the other demos in menuconfig and proceeded with the replication attempt by killing the internet connection.

Thank you for the file, I was able to download it and will try to replicate it today. I'll keep you posted on my progress

Best Regards,
AR

@lhammond
Author

lhammond commented Sep 19, 2023 via email

@rawalexe
Member

rawalexe commented Sep 21, 2023

@lhammond
Can you please give me a little more context on this repo? One of the things I immediately notice is that you are not using the tag that you mentioned at the start of the ticket; you are on the main branch. Also, what's your esp-idf version number?

If you try to use the tagged version at commit 2dccbca with esp-idf 4.4.5 commit ac5d805d0e do you still see the issue?

Best Regards,
AR

@lhammond
Author

@rawalexe
bash-3.2$ idf.py --version
ESP-IDF v5.0.3-230-g35c484324f-dirty

I will try with the versions above and let you know

@lhammond
Author

@rawalexe I am preparing to test with the new versions. I did want to point out that I am using a NeoPixel ring with (RMT - Addressable LED); this option does not appear in menuconfig for commit 2dccbca. I'm guessing all of that functionality is implemented in the demo's .c or app_driver.c and I can port it over.

image

@lhammond
Author

@rawalexe After adding component_compile_options(-Wno-error=format= -Wno-format) to the bottom of main/CMakeLists.txt (apparently needed due to espressif/esp-idf#9511 (comment)), I am seeing the below:

image

@lhammond
Author

@rawalexe I kept the publish while(true) loop but commented out the logic and it resolved the above sensor-related error. I have the pub/sub temperature LED demo running now with the versions you requested. I started it at 3:03 PM EST .. going to monitor it long term.

@lhammond
Author

lhammond commented Sep 21, 2023

About 11 minutes into the test I get the following. I will try increasing the TLS timeout.

image

@lhammond
Author

@rawalexe there is no TLS timeout in this version .. but maybe CONNACK is the same .. I made these changes and rerunning the test

image

@lhammond
Author

@rawalexe @anubhavrawal @Skptak

The versions above, with a CONNACK timeout of 10000, have been running for two days. I commented out the publish logic inside the while(true) loop.

So the question is, do I back port the LED demo logic to this version or is there a plan to fix latest branches to address the connectivity issue?

thanks

@rawalexe
Member

Hello @lhammond,
I see that there are a few bugs in the repo, but it will take us some time to look into it further and find a fix. If you can find the specific bug within the repo and submit a PR, the team will be happy to merge a permanent fix; that would be the quickest route.

Best Regards,
AR

@lhammond
Author

@rawalexe ok, I'll see what I can do. I need to push these production devices out asap, so will probably backport the LED control logic first and will try to find some time to look around for the connectivity issue. Do you have an idea of which repo to look in? Is it a submodule or in this repo?

Would you guys be looking to apply any fixes to version 5.x?

@rawalexe
Member

rawalexe commented Sep 26, 2023

The repo is aimed to work with the latest esp-idf. But after observing the issues for a while, it looks like having a single esp-idf submodule might be a better idea, with support for the latest esp-idf on a best-effort basis. The next fixes will be to ensure full compatibility with v5.x.

@txf-

txf- commented Sep 28, 2023

I can confirm that the commit d25036b is definitely the cause of it not reconnecting. I had previously reported here #34 (comment), when it was still a patch.

After reverting the changes to the previous version of the agent manager, the device reconnected on any timeout or disconnection.

@rawalexe
Member

rawalexe commented Oct 1, 2023

Hello @lhammond @txf-,
I have created this repo and tested it on my local device. Can either of you test whether it fits your use cases?

https://github.com/rawalexe/iot-reference-esp32c3/tree/newEsp

It has a few updated instructions and an esp-idf v5.1.1 submodule. There are some build warnings, but these will be improved further.

Best Regards,
AR

@lhammond
Author

lhammond commented Oct 1, 2023 via email

@txf-

txf- commented Oct 1, 2023

The reconnection issues were fixed by the reversion of the optimizations in core_mqtt_agent_manager.c.

I can't actually tell what changes were made in newEsp that affects this. Is this repo just adjustments to make it work with idf 5.x?

@rawalexe
Member

rawalexe commented Oct 1, 2023

Yes, the changes are the documentation on using Amazon's version of FreeRTOS and the submodule pointing to the latest esp-idf. I did not have any problems building or running the project, so I am looking for verification that this works in the previously problematic scenarios.

Best Regards,
AR

@lhammond
Author

lhammond commented Oct 3, 2023 via email

@lhammond
Author

lhammond commented Oct 3, 2023 via email

@rawalexe
Member

rawalexe commented Oct 5, 2023

Hello @lhammond ,
You would need esp-idf 5.0+ to be able to switch the kernel version. It should be under idf.py menuconfig > Component config > FreeRTOS > Kernel > Run the Amazon SMP FreeRTOS kernel instead (FEATURE UNDER DEVELOPMENT)

I would recommend using the submoduled esp-idf for standardization purposes, but any 5.0+ esp-idf should mostly behave the same.

I am sorry, but I was not able to see the attached image in the comments, as it only shows up as [image: image.png].

Thank you

Best Regards,
AR

@lhammond
Author

lhammond commented Oct 6, 2023

@rawalexe ok, finally got it sorted and just started a long running test. stay tuned!

@lhammond
Author

lhammond commented Oct 6, 2023

This happened after about 20 seconds. I increased the CONNACK timeout to 20000 and am trying again.

image

@rawalexe
Member

rawalexe commented Oct 8, 2023

That's quite unfortunate. Give me some time and I'll see if we can fix this.

Best Regards,
AR

@lhammond
Author

Hi @rawalexe .. I'm back on this project again. Have you made any progress?

@anubhavrawal

Hello @lhammond,
We forwarded this issue to espressif as they wanted this to be compatible with all the esp-idf versions 4.4+. I'll keep you posted once we hear back from them.

@rawalexe
Member

Hello @lhammond ,
After talking with espressif, they mentioned that adding the process loop was indeed a known problem, and noticed that the file you sent us didn't contain the commit reverting the process loop, commit id: f4fe11e27a7d686b7a2f22de278ece570b692ce9. I apologize for asking you to run these under different conditions, but I just want to make sure that the known issues aren't creating any problems. Can you please make sure that you are using the latest changes in main and are still facing the issue? I tried these changes on my device; I didn't repeat the test for long hours, but the demo ran as expected for 30 minutes or so.

Best Regards,
AR

@lhammond
Author

lhammond commented Nov 27, 2023 via email

@rawalexe
Member

rawalexe commented Dec 8, 2023

I provided the commit id to make sure that it's included in the repo you are testing with. If that commit id is in your git log history, your code should run without any problem. Now that you are actually running the demo, please let us know how it goes. If it fails, can you also make sure that your internet connection isn't down by visiting a website, just in case?

@lhammond
Author

lhammond commented Dec 15, 2023

@rawalexe The LED/temperature demo at commit hash f4fe11e has been running for about 3.5 days. I have not yet tried pulling the AP's network cable to check behavior, but this is encouraging. I will try that test sometime this weekend. If I understand your message, I need to make sure that commit hash is in git log. I will try with main now.

@rawalexe
Member

Hello @lhammond,
That's great news, and yes, you understand it correctly. Please let us know if this resolves your problem so that we can close the issue appropriately.

Best Regards,
AR

@rawalexe
Member

As there are no further concerns from you, I am closing this issue as resolved. If the problem persists, please feel free to reopen the issue or open a new one.


6 participants