-
Notifications
You must be signed in to change notification settings - Fork 818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High CPU usage when run Ctrie.insert function #841
Comments
We want to modify the Ctrie remove function and a thread can be used to automatically clear the TNode |
There's a possibility |
I found that the TNode's token is null. because so if the tnode remove failed, the number of TNode become more and more in the old version, but not in the new version the code as follows: old version:
|
What the scenarios does the system not automatically clear TNodes? Just set TNode successfully, but remove TNode failed. I can't figure out why. |
why do not just change to
|
the compareAndSet is because there may be many parallel threads trying to edit the same node at the same time, and only one can "win" and have it's changes be accepted. The other threads will have to retry. |
1、However, the synchronous simulation shows that the two threads enter the cleantomb function, and the later thread overwrites the result of the previous thread, and the return result is success. 2、I Just use sleep function to increase the probability of problems, and use two threads to call cleantomb at the same time. 3、The following is the test code: 5、the running result is: 6、the return result of two threads are OK,and not retry again |
1、The cause of the problem is that the copy method is a deep copy function. new ArrayList<>(children) is a deepcopy function.
2、for example:、 3、the running result is, and aList is not modified: |
Even if return repeat, the following method is executed, and no operation is performed, Action.OK is returned and the TNode node is not deleted.
|
Theoretically, this problem is solved in version 0.17. The token of the TNode node is no longer empty. Even if the token node is not deleted, the node will be restarted when the same token node is added next time, and the TNode node will be converted to the CNode node, In this case, the number of TNode nodes will not increase to 7 million, Unless the token nodes are not reused. private Action createNodeAndInsertSubscription(Topic topic, INode inode, CNode cnode, Subscription newSubscription) { |
Therefore, to avoid this situation, it is recommended that you directly remove the node, but you need to check whether the node exists in the child set. private Action cleanTomb(INode inode, INode iParent) { public boolean remove(INode node) { |
Are you testing this on the development version, or on a released version?
The problem is that So the code should be:
Can you please test that? |
1、the version 0.16, the link is 2、the line is 204:
3、 you can use the following test code: import java.util.concurrent.atomic.AtomicInteger; public class Main {
} 4、the result is only delete one node,not two nodes: |
In the multi-thread scenario, this problem is likely to occur. |
You should really use at least 0.17 for testing. A lot has changed since 0.16... With the fix I just posted, the result of the |
1、 iParent.compareAndSet(iParent.mainNode(), updatedCnode) ,the function is equivalent to iParent.mainNode().compareAndSet(iParent.mainNode(), newNode) private Action cleanTomb(INode inode, INode iParent) { class INode {
}
|
Yes, that's why I explained in my previous post why |
1、The latest version does the same.
2、it is recommended that you directly remove the node, and you need to check whether the node exists in the child set, and You can also give up using the the compareAndSet function. public boolean remove(INode node) { public class Main {
// CNode updatedCnode = origCnode.copy();
} 4、the modified of CNode.java 5、 the test result is that two nodes has been deleted |
Of course, I just posted the update for you to test here in the thread, I didn't make a PR yet, let alone that Andsel had time to review that PR and merge it... |
OK, thank you very much,I'll test it when you're done. |
You can test it right now by changing the |
1、yes,one of the threads fails to execute, the result is: 2、the test code is public class Main {
} 3、 but the threads that failed to execute, retry to run the program, the TNode can not be deleted. public void removeFromTree(Topic topic, String clientID) {
|
so it is recommended that change in this way to ensure that the TNode is deleted. |
That won't work, since When fixing a bug leads us to another bug, we'll just have to fix that one too. |
OK, if (cnode instanceof TNode) should cleantomb again, until deleted successfully. private Action remove(String clientId, Topic topic, INode inode, INode iParent) { |
Something else to check: if we remove the last INode from a level, is the parent also cleaned up? We should not be leaving empty nodes hanging around... |
Do all children nodes become TNodes? Only TNodes can be removed. |
After thinking about it, only one thread compareAndSet succeeds and the other threads compareAndSet fails. private Action remove(String clientId, Topic topic, INode inode, INode iParent) { |
There is a maximum of 1 thread per CPU-core doing subscriptions, so there is no way for any "recursion" to increase. |
The machine is a 24 U physical core, each physical core has four cores, and hyper-threading is 2, that is, 24 x 4 x 2 = 192 threads. |
There are multiple threads processing the parent node at the same time when cleantomb, once this parent node fails to remove child, the recursion will continue. |
That sounds like a great setup for gathering some statistics! In 0.17 this problem should be much less than in 0.16, since I sped up the copy procedure a lot and the copy procedure is the slow factor in the parallel process. Faster copies means less chance of a conflict. Note that they do not recurse, they only repeat. I guess the worst case in your case happens when many users "log off" and many userX nodes are removed at the same time succession. The root of the issue is the very flat tree you have. We may have to add a method to artificially deepen the tree in your case. |
1、If cleanTomb fails during the running of remove function, repeat is returned, indicating that the recursion continues. During the recursion, the CNode have changed to the TNode. When the CNode is determined as the TNode instance, the following process is performed: if (cnode instanceof TNode) { 2、If we do not modify the logic here, OK is returned and the recursion ends, but if we continue cleantomb here, cleantomb fails and the recursion continues until successfully: if (cnode instanceof TNode) { 3、The tree is flat. All devices are mounted under the users node. When the connection channel between the MQTT server and devices is inactive, the server automatically unsubscribes the topic,and By default, the connection is disconnected if the connection is inactive within 10 seconds. This is also added to avoid the increasing number of nodes and not automatically deleted.However, once these devices are simultaneously inactive, the cleantomb function will run simultaneously. At this point, only one thread is successfully compareAndSet on the same parent node. If other threads fail to be compareAndSet , the recursion continues until the modification is successful. void handleConnectionLost() { 4、There is a low probability that deletion fails. However, once a node fails to be deleted, the number of TNodes increases as time goes by. This is why this problem occurs after the system runs for about two months. More than 5 million TNodes are generated, resulting in high CPU usage. |
There is no way to avoid the flatness of the tree built , so it is better to have a separate thread to delete the TNode node. |
It is always possible to internally make a tree deeper, for instance by separating all nodes based on the first character, then by the second, etc. Having a separate thread won't make things more efficient, since that thread still has to content with the other threads creating new nodes. |
If this is the case, if only one thread can delete TNodes and other threads fail to delete TNodes, how should these TNodes be handled or not deleted? |
My working version with the fixes I made above can be found at: https://github.com/FraunhoferIOSB/moquette/tree/fix_841_cleanTomb_reference_loss |
OK. thank you , does this fix would be merged into the main branch? |
It will probably get merged once Andsel has had time to look at it. In the meantime it would help if you could test it to see if the problem is actually fixed. |
OK, thank you,However, the test period is long. This problem occurs occasionally. Therefore, I need to observe the problem for a long time. Capture the profile to check the memory and obtain the number of TNodes. and check whether the number of TNodes increases. |
I found that the fixs has some errors and it hasn't been merged: Error: Errors: |
Those build failures are caused by a timeout that is set too strict for the (quite slow) GitHub build systems. That's why it fails only sometimes and not always. |
#842 |
I've done a bit of testing with artificial-tree-deepening, with some interesting results. Described in #849 |
#842 only fixes a part of the problem, reopening :) |
#841 (comment) |
Expected behavior
I found that the cpu usage of mqtt broker is high after running for one month,the cpu usage increase from 9% to 95%
The number of CPU cores of the server is 24U
I found that the ctrie tree is very large when the cpu usage is high, the tree contains a large number of tomb nodes (TNode)
In this case, a large number of CPU resources are consumed by running function: CNode.anychidrenmatch.
1、Why do a large number of TNode nodes exist in the system and are not cleared?
2、Hope that the TNode node can be automatically cleared or a thread can be used to automatically clear the TNode when the TNode is not used.
3、What the following scenarios does the system not automatically clear TNodes?
if (cnode instanceof TNode) {
// this inode is a tomb, has no clients and should be cleaned up
// Because we implemented cleanTomb below, this should be rare, but possible
// Consider calling cleanTomb here too
return Action.OK;
}
Actual behavior
when the usage is 5%, the number of TNodes is 151, but when the cpu usage is up to 95% , the number of TNodes is 7,065,591
The TNodes are not cleared, and the number is been increasing over time.
Steps to reproduce
Minimal yet complete reproducer code (or URL to code) or complete log file
Moquette MQTT version
0.16
https://github.com/moquette-io/moquette/archive/refs/tags/v0.16.tar.gz
JVM version (e.g.
java -version
)1.8
OS version (e.g.
uname -a
)The text was updated successfully, but these errors were encountered: