Investigate performance issues of Web UI #23
Comments
@romain-jacob could you please tell me where exactly the issues occur? I do see that the status messages come in too slowly. Potentially this needs some re-engineering of the protocol, especially concerning timeouts.
Just about everything is slow. The plotter (i.e., Dash) is fine, but the rest of the GUI lags a lot, and I can't manage to get the IPs of the clients. It looks like things just time out. Data still comes in fine and I can see the clients registered in the database, so that part is alright. I quickly checked the VM resources, and neither CPU nor memory seems to run out. So it seems to be an issue in the protocol running behind the web front-end.
Ok. I assume the issue is the synchronous waiting on the server side, potentially here: Lines 113 to 124 in 1911af7
Probably the correct way would be not to wait on the server, but rather to forward requests via SSE (Server-Sent Events) to the browser (this feels slightly better, but still not great, since it doesn't encapsulate functionality correctly). I think we'd need to come up with another design here.
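For reference, here is a minimal sketch of what that SSE forwarding could look like, assuming a Flask-based front-end (Dash is built on Flask). The `/events` route, `status_queue`, and `push_status_update` are illustrative names, not actual mmserver code:

```python
# Sketch only: push client status updates to the browser via Server-Sent
# Events instead of blocking a request handler until the client replies.
# Names and routes are assumptions for illustration.
import json
import queue

from flask import Flask, Response

app = Flask(__name__)
status_queue: "queue.Queue[dict]" = queue.Queue()


def push_status_update(update: dict) -> None:
    """Called from the gRPC handling code whenever a client reports status."""
    status_queue.put(update)


@app.route("/events")
def events() -> Response:
    def stream():
        while True:
            update = status_queue.get()  # blocks only this streaming response
            yield f"data: {json.dumps(update)}\n\n"

    return Response(stream(), mimetype="text/event-stream")
```

The browser would subscribe once with `new EventSource("/events")` and update the UI as messages arrive, so no server-side handler has to wait synchronously for a client reply.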
Yes, that's most likely the culprit. Still, I'm a bit surprised that things timed out already. There are only so many clients... not sure why it takes so long to get a reply from the client.
I've just restarted mmserver on the VM and I see that autopower16 keeps re-registering. There could be an issue either with this device or with the server.
Another thing could be that the clients are hanging somewhere in the management thread. This method: Line 617 in 1911af7
is potentially blocking somewhere. The Web UI only shows one device as registered, which is clearly not expected (the devices do upload data). Or it's (thread) contention on the server.
I have the feeling that this needs some testing with an actual setup.
I think I pinned the issue down: 10 workers for gRPC aren't enough for that many clients: Line 573 in 1911af7
I've set the number higher on the server, and now it should work. Not the best solution, but it'll do for now.
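For context, a rough sketch of how the worker pool size caps concurrency in a Python gRPC server; the worker count, port, and the commented-out servicer registration are placeholders, not the actual mmserver configuration:

```python
# Sketch: every in-flight RPC occupies one thread from this pool. With
# long-lived RPCs (e.g. one per connected client), a pool of 10 is
# exhausted quickly and new requests queue up until a worker frees up.
from concurrent import futures

import grpc

server = grpc.server(futures.ThreadPoolExecutor(max_workers=64))
# add_<Service>Servicer_to_server(servicer, server)  # project-specific registration
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```

Raising `max_workers` buys headroom, but an async server (`grpc.aio`) or fewer long-blocking RPCs would likely scale better in the long run.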
I don't get what the problem you mention about autopower16 is. It looks fine to me. On the other hand, I have now twice observed clients that used to be registered lose their connection to the server (autopower9 for a long time, and I just lost autopower8 last night). I'm not sure what happens there (and I can't SSH into them since they are behind a NAT). That should be looked at more closely eventually, but that's the wrong issue for this...
Oh, I see what you are saying about autopower16 now. It looks like it keeps registering new measurements... strange.
autopower16 also constantly registers with the server. If you run cli.py on the server, you'll see what I mean: you'll get messages saying that the client re-registered.
I'll try rebooting 16 and see if that fixes it.
Bottleneck 1:
Bottleneck 2: (probably; not sure how to validate this)
I think it makes sense to discuss whether we should rewrite the server in C++ too (I am slightly in favor of doing so). It makes sense for scalability reasons. The server's code is not that large.
I'm not against it. Since the client is already in C++, that would make sense. I don't have a good sense of the amount of work that would represent, though.
Writing it down here to make sure I don't forget: right now, it seems we poll every client as soon as the managementUI is opened. That's quite wasteful and unnecessary. Let's talk about it.
--> Only display DB-related info, no status requests.
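A minimal sketch of what such a DB-only overview could look like, assuming a SQLite database on the server; the table and column names (`clients`, `last_seen`, `last_measurement_ts`) are made up for illustration:

```python
# Sketch: the management page reads the last known state from the DB
# instead of issuing a status request to every client on page load.
import sqlite3


def get_client_overview(db_path: str = "mmserver.db") -> list[dict]:
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT client_uid, last_seen, last_measurement_ts FROM clients"
        ).fetchall()
    finally:
        con.close()
    return [
        {"client_uid": uid, "last_seen": seen, "last_measurement": ts}
        for uid, seen, ts in rows
    ]
```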
I had another thought about the data upload speedup and the acking strategy. I think that if we don't do commit, ack, send ack off for each measurement, but rather write everything into the DB, ack everything, and then send the acks off to the client, we always have the potential for duplicates.
Okay. As I said, I don't understand this process well enough to have an opinion at the moment. You can try out something you think is promising, and we can test what happens with the two clients that are currently at ZhdK.
Ok. I've now implemented the suggestion, with duplicate checks in SQL: a9d6f14. I could not produce duplicates, and the client-side DB table correctly sets the was_uploaded flag. The upload is also way faster now. I believe this should be good now.
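For the record, a minimal sketch of the batch-then-ack idea with a duplicate guard in SQL; the schema and column names are assumptions for illustration and may differ from what a9d6f14 actually does:

```python
# Sketch: a UNIQUE constraint lets re-sent samples be dropped silently,
# so the server can write a whole batch, commit once, and ack once.
import sqlite3

con = sqlite3.connect("mmserver.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS measurements (
           client_uid TEXT NOT NULL,
           measurement_ts INTEGER NOT NULL,
           value REAL NOT NULL,
           UNIQUE (client_uid, measurement_ts)
       )"""
)


def store_batch(samples: list[tuple[str, int, float]]) -> None:
    # Insert the whole batch, skipping duplicates, then commit and ack once.
    con.executemany(
        "INSERT OR IGNORE INTO measurements (client_uid, measurement_ts, value) "
        "VALUES (?, ?, ?)",
        samples,
    )
    con.commit()
    # ...send a single ack covering the batch back to the client,
    # which then sets was_uploaded on its side...
```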
After some more investigation, the main performance bottleneck is the check whether a client is authorized as a management client. This slows down management requests.
Do you understand why this check takes so long? Shouldn't it be "just" one RTT between the client and the server?
The password verification is designed to be slow. This is done to mitigate brute-force attacks in case the hash ever leaks. However, we need a way to authenticate management clients vs. normal clients, so the implementation needs to be different.
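One possible direction, purely as an assumption and not the project's current design: pay the deliberately slow hash once at login and hand the management client a session token, so subsequent management requests only need a constant-time token check:

```python
# Sketch: slow password check once, cheap token check per request.
# bcrypt is an assumed hashing library; the real one may differ.
import hmac
import secrets

import bcrypt

_sessions: set[str] = set()


def login(password: str, stored_hash: bytes) -> str | None:
    # Deliberately slow, but only paid once per management session.
    if bcrypt.checkpw(password.encode(), stored_hash):
        token = secrets.token_urlsafe(32)
        _sessions.add(token)
        return token
    return None


def is_authorized(token: str) -> bool:
    # Constant-time comparison against each stored session token.
    return any(hmac.compare_digest(token, t) for t in _sessions)
```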
The Web UI probably needs some overhaul and an investigation of its performance bottlenecks.
I strongly suspect that the issue is not hardware.
TODO: