SDK 5.3.1 WiFi still bugged in TCP/IP stack (IDFGH-14128) #14932

Open
3 tasks done
filzek opened this issue Nov 25, 2024 · 11 comments

filzek commented Nov 25, 2024

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

SDK 5.3.1 has a deep bug in the IP/WiFi stack: TCP gets stuck while UDP keeps working.

The WiFi also sometimes behaves strangely during de-handshake.

Using WebSocket makes the problem worse.

The only way to recover is to deinitialize lwIP and WiFi and recreate everything:

Merge branch 'bugfix/fix_some_wifi_bugs_241024_v5.3' into 'release/v5.3'
fix(wifi): fix some wifi bugs 241024 v5.3
See merge request espressif/esp-idf!34420

@espressif-bot espressif-bot added the Status: Opened Issue is new label Nov 25, 2024
@github-actions github-actions bot changed the title SDK 5.3.1 WiFi still bugged in TCP/IP stack SDK 5.3.1 WiFi still bugged in TCP/IP stack (IDFGH-14128) Nov 25, 2024
Contributor

AxelLin commented Nov 26, 2024

@filzek Could you share more detail to reproduce this issue?

Author

filzek commented Nov 27, 2024

We have not yet found a straightforward way to reproduce the problem. However, our analysis so far indicates that btm_rrm_t is not properly destroyed when the Wi-Fi/lwIP stack is reinitialized. As a result, new tasks keep being created, leading to task duplication and potential resource exhaustion.

Author

filzek commented Nov 29, 2024

The WiFi layer still gets corrupted and stops working under complex multitasking loads. We have had these problems across many versions, and we would like to know of any truly robust system running on the ESP32 right now without network issues. For the last 4 years the problem seems to have stayed the same: the WiFi stack halts, loses the connection, and never comes back. Now the TCP/IP layer has the same problems. SDK 5.3.1 is not okay.

We really want to understand why keeping the connection working is this badly broken. Why do we need to create so many patches to make the WiFi and IP stack barely workable in a production environment?

I am about to offer a bounty of thousands of USD to show that the solutions are not working at all at the development level inside Espressif, and that the CPUs being sold, which could be extremely effective, can't hold up in a production environment.

Things have already gone too far, and no true answer has come to the table to solve it. Everyone at Espressif passes it from one to another and no one there really takes ownership!

It's time for someone to step up and solve the problem with the WiFi and IP layer. Thousands of offline devices that need to be powered off and on again is not a real solution for this kind of service.

@euripedesrocha can someone step up and solve this problem for real?

Contributor

AxelLin commented Nov 29, 2024

> The WiFi layer still gets corrupted and stops working under complex multitasking loads. We have had these problems across many versions, and we would like to know of any truly robust system running on the ESP32 right now without network issues. For the last 4 years the problem seems to have stayed the same: the WiFi stack halts, loses the connection, and never comes back. Now the TCP/IP layer has the same problems. SDK 5.3.1 is not okay.

Do you mean the older sdk versions (e.g. 5.2.x, 5.1.x) also have the same issue?

Author

filzek commented Nov 30, 2024

> > The WiFi layer still gets corrupted and stops working under complex multitasking loads. We have had these problems across many versions, and we would like to know of any truly robust system running on the ESP32 right now without network issues. For the last 4 years the problem seems to have stayed the same: the WiFi stack halts, loses the connection, and never comes back. Now the TCP/IP layer has the same problems. SDK 5.3.1 is not okay.
>
> Do you mean the older sdk versions (e.g. 5.2.x, 5.1.x) also have the same issue?

@AxelLin No way. The last SDK with a stable, working WiFi/IP stack is SDK 3; in every later version the WiFi on the ESP32 is completely bugged and makes devices crash at random. Espressif knows all of this and has not publicly acknowledged the problem so far. The dev team tries to hide it as if nothing were happening. We always try to show them the problem so that they can fix it, but they turn a deaf ear instead of acknowledging that it is symptomatic and widespread, can occur at random, and needs a physical reset to make the hardware work again. This is why 3.1 is out for use, but what about the 1.1 and 3.0 hardware already on the market, and the customers whose devices are not stable because a software failure hangs the WiFi interface and leaves it unrecoverable?

Author

filzek commented Dec 2, 2024

So why can't the WiFi/TCP stack survive, and why does it start to degrade at random? This happens everywhere, so why is it not clear what to do and what not to do? We can release code that gets stuck and halts all internal ESP32 registers, even across a CPU restart; they stay in a bad state, so only a full power-down and power-up can recover them, and it is very simple to make this happen. This is related to the current chaos, but it is not the main point.

The big question is why the WiFi can't keep running and collapses. Why does it not report any information about the problem?

Author

filzek commented Dec 2, 2024

@AxelLin I think the problem could start with something related to the software/hardware BLE/WiFi coexistence; some things point in that direction.

@bryghtlabs-richard
Contributor

@filzek, could you attach an sdkconfig? We've also had some networking troubles, but coexistence seems to be working pretty well for us.

@hansw123
Collaborator

@filzek
Sorry for the late reply.
Could you provide the specific model of the AP you used, or a Wireshark capture of the WiFi traffic and the logs related to the issue?

Author

filzek commented Feb 25, 2025

Hi @hansw123 @bryghtlabs-richard @AxelLin

We are tracking the issue down to the lowest level possible, but we can't make the problem happen on the development bench; so far it only happens on released devices in the field.

In our tracking, the problem shows up as follows:
1 - The board loses WiFi and can't find the AP anymore.
We detect this and take action: stop the WiFi, restore the defaults, set the parameters, start it again, and try to connect again.

2 - After a WiFi reconnection, the IP is sometimes lost and cannot be obtained again, so a manual DHCP stop and start must be done, but the IP address must first be cleared to all zeroes.

3 - DHCP loses the IP while the WiFi layer is still connected.
Just repeat the same fix as in item 2.
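
The three cases above can be sketched as a small decision function. This is only an illustration in plain C: the `link_snapshot_t` structure, the `recover_action_t` enum, and `choose_recovery()` are hypothetical names, not part of ESP-IDF. The ESP-IDF calls mentioned in the comments (`esp_wifi_stop()`, `esp_wifi_restore()`, `esp_netif_dhcpc_stop()`, and so on) are an assumption about how each action might be carried out; the thread does not say which calls are actually used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the station interface. The fields are
 * illustrative and do not mirror any ESP-IDF structure. */
typedef struct {
    bool     wifi_connected; /* Wi-Fi layer reports an association with the AP */
    uint32_t ip_addr;        /* IPv4 address on the STA interface, 0 if none */
    bool     self_ping_ok;   /* pings to our own STA address are being answered */
} link_snapshot_t;

typedef enum {
    RECOVER_NONE,         /* link looks healthy, nothing to do */
    /* Case 1: on ESP-IDF this might map to esp_wifi_stop(),
     * esp_wifi_restore(), esp_wifi_set_config(), esp_wifi_start(),
     * esp_wifi_connect() (assumed mapping). */
    RECOVER_RESTART_WIFI,
    /* Cases 2 and 3: on ESP-IDF this might map to esp_netif_dhcpc_stop(),
     * esp_netif_set_ip_info() with an all-zero address, then
     * esp_netif_dhcpc_start() (assumed mapping). */
    RECOVER_RESTART_DHCP
} recover_action_t;

/* Pick the recovery step that matches the observed symptoms. */
recover_action_t choose_recovery(const link_snapshot_t *s)
{
    if (!s->wifi_connected) {
        return RECOVER_RESTART_WIFI;   /* case 1: AP lost */
    }
    if (s->ip_addr == 0 || !s->self_ping_ok) {
        return RECOVER_RESTART_DHCP;   /* cases 2 and 3: associated but no usable IP */
    }
    return RECOVER_NONE;
}
```

Keeping the decision separate from the driver calls makes it testable on a host machine, which may help given that the failure has only been seen in the field.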

Item 3 is tracked by continuously pinging the device's own IP on the STA interface: if it is connected, ping loss should be minimal and the stack is working, but if the IP stops answering pings, the lwIP/DHCP layer is somehow broken. Fix as in step 2.
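
The ping-based check can be sketched as a counter of consecutive lost replies, so that an occasional lost ping is tolerated and only a sustained loss triggers the DHCP fix. This is a minimal sketch in plain C; `ping_monitor_t`, its functions, and the threshold value are hypothetical and are not taken from the thread or from ESP-IDF.

```c
#include <stdbool.h>

/* Hypothetical monitor that counts consecutive lost self-pings. */
typedef struct {
    unsigned consecutive_failures; /* lost replies since the last success */
    unsigned threshold;            /* losses in a row that count as "broken" (tuning value, assumed) */
} ping_monitor_t;

void ping_monitor_init(ping_monitor_t *m, unsigned threshold)
{
    m->consecutive_failures = 0;
    m->threshold = threshold;
}

/* Feed one ping result. Returns true when the number of consecutive
 * losses reaches the threshold, meaning recovery should be run.
 * The counter is reset at that point so recovery fires once per streak. */
bool ping_monitor_report(ping_monitor_t *m, bool reply_received)
{
    if (reply_received) {
        m->consecutive_failures = 0;
        return false;
    }
    m->consecutive_failures++;
    if (m->consecutive_failures >= m->threshold) {
        m->consecutive_failures = 0;
        return true;
    }
    return false;
}
```

With a threshold of 3, for example, two lost pings followed by a reply do nothing, while three lost pings in a row request the DHCP restart.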

In the WiFi event handler, when problem 2 happens, the IP event fires but there is no IP on the interface, so applying the fix described above from another part of the code resolves the issue. The log says that it got the IP, but it really didn't, and there is no activity on the HTTP server, WebSocket, or any other part. DHCP simply doesn't work as intended, and forcing solution 2 fixes everything.

Item 1 is sometimes extremely difficult to track, because the WiFi simply stops working without triggering any disconnection or WiFi event handlers. This makes field-deployed devices very difficult to track, so a set of supervisory IPs, pings, external actions, and layer checks is used to detect the WiFi breakage, and then solution 1 is applied.

NimBLE in the latest SDK 5.4, at the latest commit as of Feb 24, 2025, is working as intended, but there are still problems with asserts.

Tomorrow I will add the sdkconfig here.

We have worked around the bugs with alternative corrections. It would be best if this could be fixed in the WiFi driver; with the tracking described above, it could be something easy for the WiFi/lwIP team to patch.

Author

filzek commented Feb 25, 2025

Updating findings:
Sometimes the WiFi never comes back alive: there is no reconnection, and it can't even find the AP anymore.
