
Investigate performance issues of Web UI #23

Open
2 tasks
UsualSpec opened this issue Sep 19, 2024 · 22 comments
@UsualSpec
Collaborator

UsualSpec commented Sep 19, 2024

The Web UI probably needs an overhaul and an investigation of its performance bottlenecks.
I strongly suspect that the issue is not hardware.
TODO:

  • Find out where potential bottlenecks occur (probably wrong timeouts): Web UI, Server or Plotter?
  • Is there any issue related to slow DB queries?
@UsualSpec
Collaborator Author

@romain-jacob could you please tell me where exactly the issues occur? I do see that the status messages come in too slowly. This potentially needs some re-engineering of the protocol, especially concerning timeouts.

@romain-jacob
Contributor

romain-jacob commented Sep 19, 2024

Just about everything is slow. The plotter (so, Dash) is fine, but the rest of the GUI lags a lot, and I don't manage to get the IPs of the clients. It looks like things just time out. Data still comes in fine and I can see the clients registered in the database, so that part is alright.

I quickly checked the VM resources, and neither CPU nor memory seems to run out. So this looks like an issue in the protocol running behind the web front-end.

@UsualSpec
Collaborator Author

Ok. I assume that the issue is the synchronous waiting on the server side. Potentially:

autopower/server/server.py, lines 113 to 124 in 1911af7:

def waitForResponseTo(self, requestNo):
    self.responsedictlock.acquire()
    while not self.responseArrived(requestNo):
        # wait_for is new in version 3.2 but docs do not specify if wait_for also returns a boolean on timeout (https://docs.python.org/3/library/threading.html#threading.Condition.wait_for)
        noTimeout = self.responsedictcv.wait(timeout=30)  # times out after 30 seconds
        if not noTimeout:
            self.responsedictlock.release()
            return False  # on timeout give back issue
    # block until requestNo has arrived
    self.responsedictlock.release()
    return True

Probably the correct way would be not to wait on the server, but rather to forward requests via SSE (Server-Sent Events) to the browser (this feels slightly better, but still not great, since it doesn't encapsulate the functionality correctly). I think we'd need to come up with another design here.
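
As a side note on the question in the snippet above: threading.Condition.wait_for does return the final value of its predicate, i.e. False when it times out, so the loop could already be written more compactly before any redesign. A minimal self-contained sketch (not the actual mmserver code; the names only mirror the snippet above):

```python
import threading

class ResponseWaiter:
    """Sketch only, not the actual server code: the same wait logic expressed
    with Condition.wait_for, which returns False if the timeout expires."""

    def __init__(self):
        self.responsedictlock = threading.Lock()
        self.responsedictcv = threading.Condition(self.responsedictlock)
        self.responses = {}

    def responseArrived(self, requestNo):
        return requestNo in self.responses

    def addResponse(self, requestNo, payload):
        # called by the gRPC handler when a client's reply comes in
        with self.responsedictcv:
            self.responses[requestNo] = payload
            self.responsedictcv.notify_all()

    def waitForResponseTo(self, requestNo, timeout=30):
        # wait_for re-evaluates the predicate on every notify and returns its
        # final value, so False here means the timeout expired first.
        with self.responsedictcv:
            return self.responsedictcv.wait_for(
                lambda: self.responseArrived(requestNo), timeout=timeout
            )
```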

@romain-jacob
Contributor

Yes, that's most likely the culprit. Still, I'm a bit surprised that things timed out already. There are so many clients... not sure why it takes so long to get a reply from the client.

@UsualSpec
Collaborator Author

I've just restarted mmserver on the VM and I see that autopower16 constantly re-registers. There could be an issue with either this device or the server.

@UsualSpec
Collaborator Author

UsualSpec commented Sep 19, 2024

Another thing could be that the clients are hanging somewhere in the management thread.

This method:

void AutopowerClient::manageMsmt() {

is potentially blocking somewhere. The Web UI shows only one registered device, while this is clearly not expected (the devices do upload data).

Or it's (thread) contention on the server.

@UsualSpec
Collaborator Author

I have the feeling that this needs some testing with an actual setup.

@UsualSpec
Collaborator Author

UsualSpec commented Sep 19, 2024

I think I've pinned the issue down: 10 gRPC workers aren't enough for that many clients:

grpcServer = grpc.server(futures.ThreadPoolExecutor(max_workers=10))

I've set the number higher on the server, so it should work now. Not the best solution, but it should do for the moment.
autopower16 still seems to be misbehaving; I suspect a client-side issue there.
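
For reference, a minimal sketch of that mitigation (the worker count of 64 is an arbitrary illustrative value, not necessarily what is now deployed on the VM):

```python
from concurrent import futures
import grpc

# With synchronous, potentially long-blocking RPC handlers, 10 workers are
# exhausted quickly once many clients connect; a larger thread pool hides the
# problem until the blocking design itself is reworked.
grpcServer = grpc.server(futures.ThreadPoolExecutor(max_workers=64))
```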

@romain-jacob
Contributor

I don't get the problem you mention with autopower16. Looks fine to me.

On the other hand, I have now observed twice that clients which used to be registered lost their connection to the server (autopower9 for a long time, and I just lost autopower8 last night). I'm not sure what happens there (and I can't SSH into those, since they are behind a NAT). That should be looked at more closely eventually, but this is the wrong issue for it...

@romain-jacob
Contributor

Oh, I see what you are saying about 16 now. It looks like it keeps registering new measurements... strange.

@UsualSpec
Collaborator Author

autopower16 also constantly registers with the server. If you run cli.py on the server, you'll see what I mean: you'll get messages saying that the client re-registered.

@romain-jacob
Contributor

I'll try rebooting 16 and see if that fixes it.

@UsualSpec
Collaborator Author

UsualSpec commented Oct 5, 2024

Bottleneck 1:
Password verification for management clients. Authentication needs some rethinking.

Bottleneck 2 (probably; not sure how to validate this):
Streaming in registerClient(). According to https://grpc.io/docs/guides/performance/#python, Python is slow for streaming RPCs. This is Python-specific.

I think it makes sense to discuss whether we should rewrite the server in C++ too (I am slightly in favor of doing so). It makes sense for scalability reasons, and the server's code is not that large.

@romain-jacob
Contributor

I'm not against it. Since the client is already in C++, that would make sense. I don't have a good sense of how much work that would represent, though.

@romain-jacob
Contributor

Writing it down here to make sure I don't forget:

Right now it seems we poll every client as soon as the managementUI is opened. That's quite wasteful and unnecessary. Let's talk about it.

@UsualSpec
Collaborator Author

--> Only display DB-related info, no status requests.

@UsualSpec
Collaborator Author

UsualSpec commented Oct 16, 2024

I had another thought about the data upload speedup and the acking strategy. If we ack only after writing into the DB (i.e. not commit, ack, and send the ack off per measurement, but write everything into the DB, ack everything, and then send the acks off to the client), we always have the potential of duplicates.
However, if we also introduce an index on ack_id and server_measurement_id on the server, the duplicate check should be fairly fast. The index gives us something like a hash table lookup, which works in roughly constant time...
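
A standalone illustration of that idea (SQLite is used here only to keep the example runnable; the real server DB, schema, and all column names other than ack_id are assumptions): a unique index turns the duplicate check into an index lookup, and INSERT OR IGNORE drops re-uploaded rows, so acking a whole batch after one commit cannot create duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (client_uid TEXT, ack_id INTEGER, value REAL)")
conn.execute("CREATE UNIQUE INDEX measurement_ack_idx ON measurements (client_uid, ack_id)")

def store_batch(client_uid, batch):
    """Write a whole upload batch in one transaction, then ack every ack_id."""
    with conn:  # commit once for the entire batch
        conn.executemany(
            "INSERT OR IGNORE INTO measurements (client_uid, ack_id, value) VALUES (?, ?, ?)",
            [(client_uid, ack_id, value) for ack_id, value in batch],
        )
    return [ack_id for ack_id, _ in batch]

# A client re-sending the same batch (e.g. because the acks got lost) does not
# create duplicate rows:
store_batch("clientA", [(1, 3.5), (2, 3.7)])
store_batch("clientA", [(1, 3.5), (2, 3.7)])
assert conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0] == 2
```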

@romain-jacob
Contributor

Okay. As I said, I don't understand this process well enough to have an opinion at the moment. You can try out something you think is promising, and we can test what happens with the two clients that are currently at ZhdK.

@UsualSpec
Collaborator Author

Ok, I've now implemented the suggestion, with duplicate checks in SQL: a9d6f14

I could not produce duplicates, and the client-side DB table correctly sets the was_uploaded flag. The upload is also way faster now. I believe this should be good.
Still, it's worth a test, e.g. as you suggested.

@UsualSpec
Collaborator Author

After some more investigation, the main performance bottleneck is the check whether a client is authorized as a management client. This slows down management requests.
We need a way to only allow management clients to issue requests to "dangerous" functions such as starting/stopping measurements.
The question is whether we can verify the connection instead of each call.

@romain-jacob
Contributor

Do you understand why this check takes so long? Shouldn't it be "just" one RTT between the client and the server?

@UsualSpec
Collaborator Author

The password verification is designed to be slow. This is done to prevent brute-force attacks in case the hash ever leaks.
We don't really have this issue.

However, we need a way to authenticate management clients vs. normal clients, so the implementation needs to be different.
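
One possible direction, sketched here purely as an assumption (none of these names exist in the autopower code): run the expensive password hash check once when a management client connects, hand it a random session token, and verify that token on each call with a cheap constant-time comparison (e.g. in a gRPC server interceptor).

```python
import hmac
import secrets

MGMT_TOKENS = set()  # in-memory session tokens for authenticated management clients

def issue_mgmt_token(password_ok):
    # password_ok is the result of the slow hash verification, which now runs
    # only once per management session instead of once per RPC.
    if not password_ok:
        return None
    token = secrets.token_urlsafe(32)
    MGMT_TOKENS.add(token)
    return token

def is_mgmt_request(presented_token):
    # compare_digest is a constant-time comparison, so checking the token is
    # cheap compared to re-hashing a password on every management call.
    return any(hmac.compare_digest(presented_token, t) for t in MGMT_TOKENS)
```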
