-
-
Notifications
You must be signed in to change notification settings - Fork 507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execute MQTT client synchronously in main loop() #1201
Conversation
processing a published valued on a subscribed topic is currently running in a task that is not the task executing the main loop(). that's because the espMqttClient(Secure) was constructed without arguments, which selects the constructor with two arguments (priority and core), both of which have default values. that constructor selects espMqttClientTypes::UseInternalTask::YES, causing a task to be created in whose context the MQTT client loop is executed. MQTT subscribers assume they are running in the same context as the main loop(). most code assumes exactly that. as the scheduler is preemptive and very little (none at all?) code is interlocked, we have to make sure to meet the programmer's expectations. this changeset calls the MQTT client loop in the context of the main loop() and enforces the use of espMqttClientTypes::UseInternalTask::NO.
In this case you would fix the problem with the mqtt interface but not with the web api which also runs in a seperate thread... therefor this is not the solution for the entire problem.
In this case I would argue that the people learn it :) |
I'm with you on this train of thought. The whole project should work well with multiple tasks, which should improve responsiveness. And even help making use of all cores... Personally, I am not afraid of using mutexes. And I also already suspected that I will find the same problem in the web server implementation. Especially since multiple people in the downstream project are suspecting that the OpenDTU(-OnBattery) is unstable (reboots) if the web app is used for longer periods of time or with multiple clients at the same time. However, I don't see mutexes anywhere except for the Hoymiles library, and even there the public API is not yet thread-safe, as I pointed out. It takes significant effort to fix all of this. Not even the access to the global config is thread-safe. Until at least an attempt on thread-safety is made, I would prefer if we could agree that we need to reduce the contexts in which externally triggered events are handled to a single one. That includes MQTT subscribers and web application handlers in particular. What do you think? Start tackling thread-safety immediately, risking deadlocks and missing vulnerable blocks, or reduce the contexts to gain stability until we can find time to address thread-safety throughout the whole project, one component at a time? |
In a first step, only completly assembled commands should be placed into the queue. auto cmd = _radio->prepareCommand<ActivePowerControlCommand>();
cmd->setActivePowerLimit(limit, type);
cmd->setTargetAddress(serial());
_radio->enqueCommand(cmd); And the command queue is also thread safe now |
Since the entire web server is implemented to run in a separate thread, and I am unwilling to change it to a synchronous one, the root cause must be fixed in any case. |
Again, I am with you, I had this exact same aspect in mind when I saw that. It's not critical any more if the interface is locked by a mutex, but I think it is bad design to add an incomplete data structure to a container.
Hm. I really was hoping I could convince you to consider this a hotfix.
Yes. It's nice to see that you want it done properly. Yet, I don't see it happening short-term. So, what's the plan? I am sure you understood the issue and I think you are taking it seriously. How can we make progress? Do you think about tackling this by yourself? I saw you using |
I don't know if this is from the AhoyDTU legacy of some code or the same choise was made in OpenDTU too. What I recall is that the std::mutex and other std:: implementations would not work / compile on ESP8266 in Ahoy. That said it was determined that it is necessary to handle the NRF (and probably CMT) interrupts in due time and therefor the Mutex/Semaphore were probably impemented in OpenDTU and some other way in Ahoy. I find your idea to "reduce the contexts in which externally triggered events are handled to a single one" quite compelling. Though I do not know how this would work for the timing sensitive NRF and CMT code or whether this is required for the Async WebServer and TCP libraries which are included from upstream, as well ? IMHO this would require first a change in the used MQTT Client library too, in order to make it Async. Or are we using an Async MQTT library already ? |
Thats the reason why it is implemented differently now.
If you look at the commits from yesterday you will see some commits related to that issues. In total it should address more locations as a change of the mqtt library scheduler would have done.
The MQTT handling is currently running in a separate thread. It is not handled in the context of the main loop. (Same as AsyncTCP which is used by the webserver and also spawns a separate thread) |
Thanks for chiming in, @stefan123t. I am not an expert on the Hoymiles library, though I learned a lot the last weeks. As far as I can tell, all the heavy lifting of the Hoymiles library is done in main
Nice! Now let's move all lines More feedback:
None of it is important. Take it or leave it. I like exchanging notes, maybe you like getting feedback. I usually do. Are you planning to keep hunting these concurrency issues? |
@schlimmchen thanks for the explanation behind your reasoning. To be honest I and Lukas have been hunting (and haunted 😀) by these concurrency issues in the AhoyDTU code for quite some time. Though tbnobody's code is much higher quality and I do not know what his current overall status with regards to concurrency of various parts of the code is. The original code by Hoymiles for the DTU Pro has a couple of timers and timeouts implemented to get the low-level timing hopefully mostly right. Here are the two commits that tbnobody mentioned above: I have checked the current master code for the xSemaphoreTake / Give macros you described:
The parsers are used to read the response from the inverter where as Hoymiles.cpp (and cmt_spi3.c) are used to communicate with the inverter via NRF / CMT modules. Actually @LennartF22 and the creators of the openDTU Fusion v2.1 PCB are waiting for the merge from https://github.com/LennartF22/OpenDTU/tree/spi-eth and @dAjaY85 for a Display library update which would also include some changes to share the SPI for CMT, Display and NRF. Actually the NRF library is the reason tbnobody uses GPL-v2-or-later as being a derived work of https://github.com/nRF24/RF24 I do not know exactly what RF24 lib code maintainers actually use to make theirs IRQ / thread safe. Whereas the CMT code is from an application example. Now to your proposed changes:
Yes, this could probably be used directly, but I think tbnobody wants to make its use / reason explicit. While I can understand why you would want to reverse the order, ie. especially the last two lines: auto cmd = _radio->prepareCommand<AlarmDataCommand>();
cmd->setTime(now);
cmd->setTargetAddress(serial());
_radio->enqueCommand(cmd);
EventLog()->setLastAlarmRequestSuccess(CMD_PENDING); I would expect these two actions to be atomic and therefor enclosed in one of those xSemaphoreTake / Give macros to make this thread safe too. |
This change is clearly not wanted here. I can understand that. Thanks @tbnobody for making improvements regarding the enqueuing of Hoymiles radio messages and using locking mechanisms to guard the public interface of the Hoymiles lib. With the introduction of TaskScheduler, this change is also effectively gone in the downstream project OpenDTU-OnBattery. We will observe whether or not the problem we observed there is coming back. It probably does not, as you tightened down the public API of the Hoymiles library pretty well. |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns. |
processing a published valued on a subscribed topic is currently running in a task that is not the task executing the main
loop()
. that's because theespMqttClient(Secure)
was constructed without arguments, which selects the constructor with two arguments (priority and core), both of which have default values. that constructor selectsespMqttClientTypes::UseInternalTask::YES
, causing a task to be created in whose context the MQTT client loop is executed, which in turn executes all subscriber callbacks.MQTT subscribers assume they are running in the same context as the main
loop()
. most code assumes exactly that. as the scheduler is preemptive and very little code is interlocked, we have to make sure to meet the programmer's expectations.this changeset calls the MQTT client loop in the context of the main
loop()
and enforces the use ofespMqttClientTypes::UseInternalTask::NO
.this fixes a real-world issue observed by @the-lonely-one in hoylabs#268 (comment): using MQTT, the inverter power limit was set with an interval of ~20 seconds. after at most 15 hours, the inverter would not be sent any limit command any more.
that symptom could be observed due to #1093, which is already merged in the downstream project OpenDTU-OnBattery. the
CMD_PENDING
flag would be set after the command was queued and processed, due to preemptive scheduling. since the command is gone, theCMD_PENDING
flag is never reset.this change was already tested by me, @helgeerbe, and more importantly by @the-lonely-one, who confirmed that the observed issue is gone with this change.
please note that even if #1093 is never merged (:disappointed:), there is still an issue with the non-thread-safety of this code and other pieces of code:
if the scheduler changes contexts right after the first or second line, the second or third line will write freed memory when returning to the context that enqueued the command: the command will be dequeued in another context, rejected as it has no target serial set, and will then be deleted (
shared_ptr
goes out of scope).it might be beneficial to interlock this and all other public interfaces to the hoymiles library. until then, I consider this changeset an important bugfix. even if the public interface of the hoymiles library is at some point interlocked, I still argue that this changeset is important. as I wrote above: most people assume their code is executed in the context of the main
loop()
, and more problems are to be expected if MQTT subscribers run in a separate context.