
Investigate performance issues of Web UI #23

Open
2 tasks
UsualSpec opened this issue Sep 19, 2024 · 22 comments
@UsualSpec
Collaborator

UsualSpec commented Sep 19, 2024

The Web UI probably needs an overhaul and an investigation of its performance bottlenecks.
I strongly suspect that the issue is not hardware.
TODO:

  • Find out where potential bottlenecks occur (probably wrong timeouts): Web UI, Server or Plotter?
  • Is there any issue related to slow DB queries?
@UsualSpec
Collaborator Author

@romain-jacob could you please tell me where exactly the issues occur? I do see that the status messages come in too slowly. This potentially needs some re-engineering of the protocol, especially concerning timeouts.

@romain-jacob
Contributor

romain-jacob commented Sep 19, 2024

Just about everything is slow. The plotter (so, Dash) is fine, but the rest of the GUI lags a lot, and I don't manage to get the IPs of the clients. It looks like things just time out. Data still comes in fine and I can see the clients registered in the database, so that part is alright.

I quickly checked the VM resources, and neither CPU nor memory seems to run out. So this looks like an issue in the protocol running behind the web front-end.

@UsualSpec
Collaborator Author

Ok. I assume that the issue is the synchronous waiting on the server side. Potentially:

autopower/server/server.py, lines 113 to 124 in 1911af7:

def waitForResponseTo(self, requestNo):
    self.responsedictlock.acquire()
    while not self.responseArrived(requestNo):
        # wait_for is new in version 3.2 but docs do not specify if wait_for also returns a boolean on timeout (https://docs.python.org/3/library/threading.html#threading.Condition.wait_for)
        noTimeout = self.responsedictcv.wait(timeout=30)  # times out after 30 seconds
        if not noTimeout:
            self.responsedictlock.release()
            return False  # on timeout give back issue
    # block until requestNo has arrived
    self.responsedictlock.release()
    return True

Probably the correct way would be not to wait on the server, but rather to forward requests via SSE (Server-Sent Events) to the browser (this feels slightly better, but still not great, since it doesn't encapsulate the functionality correctly). I think we'd need to come up with another design here.
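
As a side note on the question in the snippet above: threading.Condition.wait_for does return the final value of its predicate, i.e. False when it times out, so the loop could already be written more compactly before any redesign. A minimal self-contained sketch (not the actual mmserver code; the names only mirror the snippet above):

```python
import threading

class ResponseWaiter:
    """Sketch only, not the actual server code: the same wait logic expressed
    with Condition.wait_for, which returns False if the timeout expires."""

    def __init__(self):
        self.responsedictlock = threading.Lock()
        self.responsedictcv = threading.Condition(self.responsedictlock)
        self.responses = {}

    def responseArrived(self, requestNo):
        return requestNo in self.responses

    def addResponse(self, requestNo, payload):
        # called by the gRPC handler when a client's reply comes in
        with self.responsedictcv:
            self.responses[requestNo] = payload
            self.responsedictcv.notify_all()

    def waitForResponseTo(self, requestNo, timeout=30):
        # wait_for re-evaluates the predicate on every notify and returns its
        # final value, so False here means the timeout expired first.
        with self.responsedictcv:
            return self.responsedictcv.wait_for(
                lambda: self.responseArrived(requestNo), timeout=timeout
            )
```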

@romain-jacob
Contributor

Yes, that's most likely the culprit. Still, I'm a bit surprised that things timed out already. There are so many clients... not sure why it takes so long to get a reply from the client.

@UsualSpec
Collaborator Author

I've just restarted mmserver on the VM and I see that autopower16 constantly re-registers. There could be an issue with either this device or the server.

@UsualSpec
Collaborator Author

UsualSpec commented Sep 19, 2024

Another thing could be that the clients are hanging somewhere in the management thread.

This method:

void AutopowerClient::manageMsmt() {

is potentially blocking somewhere. The Web UI shows only one registered device, while this is clearly not expected (the devices do upload data).

Or it's (thread) contention on the server.

@UsualSpec
Collaborator Author

I have the feeling that this needs some testing with an actual setup.

@UsualSpec
Collaborator Author

UsualSpec commented Sep 19, 2024

I think I've pinned the issue down: 10 gRPC workers aren't enough for that many clients:

grpcServer = grpc.server(futures.ThreadPoolExecutor(max_workers=10))

I've set the number higher on the server, so it should work now. Not the best solution, but it should do for the moment.
autopower16 still seems to be misbehaving; I suspect a client-side issue there.
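
For reference, a minimal sketch of that mitigation (the worker count of 64 is an arbitrary illustrative value, not necessarily what is now deployed on the VM):

```python
from concurrent import futures
import grpc

# With synchronous, potentially long-blocking RPC handlers, 10 workers are
# exhausted quickly once many clients connect; a larger thread pool hides the
# problem until the blocking design itself is reworked.
grpcServer = grpc.server(futures.ThreadPoolExecutor(max_workers=64))
```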

@romain-jacob
Contributor

I don't get the problem you mention with autopower16. Looks fine to me.

On the other hand, I have now observed twice that clients which used to be registered lost their connection to the server (autopower9 for a long time, and I just lost autopower8 last night). I'm not sure what happens there (and I can't SSH into those, since they are behind a NAT). That should be looked at more closely eventually, but this is the wrong issue for it...

@romain-jacob
Contributor

Oh, I see what you are saying about 16 now. It looks like it keeps registering new measurements... strange.

@UsualSpec
Collaborator Author

autopower16 also constantly registers with the server. If you run cli.py on the server, you'll see what I mean: you'll get messages saying that the client re-registered.

@romain-jacob
Contributor

I'll try rebooting 16 and see if that fixes it.

@UsualSpec
Collaborator Author

UsualSpec commented Oct 5, 2024

Bottleneck 1:
Password verification for management clients. Authentication needs some rethinking.

Bottleneck 2 (probably; not sure how to validate this):
Streaming in registerClient(). According to https://grpc.io/docs/guides/performance/#python, Python is slow for streaming RPCs. This is Python-specific.

I think it makes sense to discuss whether we should rewrite the server in C++ too (I am slightly in favor of doing so). It makes sense for scalability reasons, and the server's code is not that large.

@romain-jacob
Contributor

I'm not against it. Since the client is already in C++, that would make sense. I don't have a good sense of how much work that would represent, though.

@romain-jacob
Contributor

Writing it down here to make sure I don't forget:

Right now it seems we poll every client as soon as the managementUI is opened. That's quite wasteful and unnecessary. Let's talk about it.

@UsualSpec
Collaborator Author

--> Only display DB-related info, no status requests.

@UsualSpec
Collaborator Author

UsualSpec commented Oct 16, 2024

I had another thought about the data upload speedup and the acking strategy. If we ack only after writing into the DB (i.e. not commit, ack, and send the ack off per measurement, but write everything into the DB, ack everything, and then send the acks off to the client), we always have the potential of duplicates.
However, if we also introduce an index on ack_id and server_measurement_id on the server, the duplicate check should be fairly fast. The index gives us something like a hash table lookup, which works in roughly constant time...
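
A standalone illustration of that idea (SQLite is used here only to keep the example runnable; the real server DB, schema, and all column names other than ack_id are assumptions): a unique index turns the duplicate check into an index lookup, and INSERT OR IGNORE drops re-uploaded rows, so acking a whole batch after one commit cannot create duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (client_uid TEXT, ack_id INTEGER, value REAL)")
conn.execute("CREATE UNIQUE INDEX measurement_ack_idx ON measurements (client_uid, ack_id)")

def store_batch(client_uid, batch):
    """Write a whole upload batch in one transaction, then ack every ack_id."""
    with conn:  # commit once for the entire batch
        conn.executemany(
            "INSERT OR IGNORE INTO measurements (client_uid, ack_id, value) VALUES (?, ?, ?)",
            [(client_uid, ack_id, value) for ack_id, value in batch],
        )
    return [ack_id for ack_id, _ in batch]

# A client re-sending the same batch (e.g. because the acks got lost) does not
# create duplicate rows:
store_batch("clientA", [(1, 3.5), (2, 3.7)])
store_batch("clientA", [(1, 3.5), (2, 3.7)])
assert conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0] == 2
```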

@romain-jacob
Contributor

Okay. As I said, I don't understand this process well enough to have an opinion at the moment. You can try out something you think is promising, and we can test what happens with the two clients that are currently at ZhdK.

@UsualSpec
Collaborator Author

Ok, I've now implemented the suggestion, with duplicate checks in SQL: a9d6f14

I could not produce duplicates, and the client-side DB table correctly sets the was_uploaded flag. The upload is also way faster now. I believe this should be good.
Still, it's worth a test, e.g. as you suggested.

@UsualSpec
Collaborator Author

After some more investigation, the main performance bottleneck is the check whether a client is authorized as a management client. This slows down management requests.
We need a way to only allow management clients to issue requests to "dangerous" functions such as starting/stopping measurements.
The question is whether we can verify the connection instead of each call.

@romain-jacob
Contributor

Do you understand why this check takes so long? Shouldn't it be "just" one RTT between the client and the server?

@UsualSpec
Collaborator Author

The password verification is designed to be slow. This is done to prevent brute-force attacks in case the hash ever leaks.
We don't really have this issue.

However, we need a way to authenticate management clients vs. normal clients, so the implementation needs to be different.
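
One possible direction, sketched here purely as an assumption (none of these names exist in the autopower code): run the expensive password hash check once when a management client connects, hand it a random session token, and verify that token on each call with a cheap constant-time comparison (e.g. in a gRPC server interceptor).

```python
import hmac
import secrets

MGMT_TOKENS = set()  # in-memory session tokens for authenticated management clients

def issue_mgmt_token(password_ok):
    # password_ok is the result of the slow hash verification, which now runs
    # only once per management session instead of once per RPC.
    if not password_ok:
        return None
    token = secrets.token_urlsafe(32)
    MGMT_TOKENS.add(token)
    return token

def is_mgmt_request(presented_token):
    # compare_digest is a constant-time comparison, so checking the token is
    # cheap compared to re-hashing a password on every management call.
    return any(hmac.compare_digest(presented_token, t) for t in MGMT_TOKENS)
```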
