SQL Server and PostgreSQL can report 0 as the queue length value #4557
Conversation
The query interval was originally increased because of how much it was hammering the database, which was theorised to potentially increase the likelihood of connection starvation. I think this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. so that if there is no reading in an interval it takes the previous value.
@tmasternak do the changes made to the transport address the original reason why this interval was decreased?
@jpalac @PhilBastian great questions. Let me give the best answers I have.

TL;DR: I propose to go with …

Longer version:

Let me start with SQL Server, which is potentially easier to answer. The value for the queue length measurement interval was changed relatively recently (~2 months ago), and my understanding is that it was done to align the value with PostgreSQL. For SQL Server we never observed connection starvation, nor received feedback from users that the value is problematic. As a result, I propose to revert the change and make the interval for SQL Server …

When it comes to PostgreSQL, I assumed that the queue length measurement mechanism would never use more than one connection at any given point. Looking at the code, it seems that we open a single connection and then update queue sizes, batching multiple select statements in a single client request.

ServiceControl/src/ServiceControl.Transports.PostgreSql/QueueLengthProvider.cs Lines 93 to 99 in 76fa2e0
As a result, I think that decreasing the query interval should not affect the connection pooling problem (to be verified during pre-release testing). We could still be hammering the database too frequently, but as mentioned previously …

Finally, when it comes to:
I fully agree. I don't think this can be done as part of a bug fix, though; the work needed is too big. I propose opening a dedicated improvement-type issue to capture that future work.
From memory, the issue with PostgreSQL was that even though the connection was released, by the time a new one was requested (200 ms later), the released one was not yet available, so a new one would be used. As a result a number of connections (up to 8?) were being used, which is significant when the limit in Postgres is 100. Again, perhaps the work you've done on the transport has reduced that impact. Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?
We could try testing that, but I'm not sure we can do that reliably in the time we have.
@jpalac yes, this might be the best approach.
Ok - perhaps let's move ahead with that then?
I've seen the same result in the Learning transport during the NDC Porto conference while I was running the demo continuously. This makes me think that the solution might be deeper. As far as I understand, the current assumption is that we need to query frequently enough to fill each bucket with data. Maybe another solution would be to distinguish buckets for which we have no data from those that have a value of 0. Then we could assume that when we have no data, the value is the same as the previous value or, if we have values on both sides, set the bucket to the average of the two neighbours?
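To make the idea concrete, a minimal sketch of that interpolation rule (a hypothetical helper, not the actual ServiceControl code) could distinguish "no reading" (`None`) from a genuine reading of 0:

```python
def fill_gaps(buckets):
    """Fill buckets that have no reading (None).

    Hypothetical sketch of the rule described above:
    - a None bucket with known readings on both sides becomes the
      average of the two neighbouring readings;
    - a None bucket with only a preceding value repeats that value;
    - a genuine reading of 0 stays 0, so "empty queue" is never
      confused with "no data".
    """
    filled = list(buckets)
    for i, value in enumerate(filled):
        if value is not None:
            continue
        # nearest known value before i (possibly already filled)
        prev = next((filled[j] for j in range(i - 1, -1, -1)
                     if filled[j] is not None), None)
        # nearest original reading after i
        nxt = next((buckets[j] for j in range(i + 1, len(buckets))
                    if buckets[j] is not None), None)
        if prev is not None and nxt is not None:
            filled[i] = (prev + nxt) / 2
        elif prev is not None:
            filled[i] = prev
    return filled
```

For example, `fill_gaps([4, None, 2])` averages the neighbours, while `fill_gaps([5, None])` carries the last reading forward.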
@SzymonPobiega I've created a dedicated issue and captured your solution idea there: Particular/ServicePulse#2156
@jpalac I've done some more testing and I think we should be good with 200 ms for PostgreSQL. I've tweaked the connection strings of the endpoints in the demo setup and of all ServiceControl instances, adding an "ApplicationName" part to each connection string. This allowed me to measure the number of active connections coming from the various parts of this simple system. The query used was:
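The exact query did not survive in this thread; a query along these lines against PostgreSQL's standard `pg_stat_activity` view (which exposes `application_name` per connection) would produce the per-application counts described below — a hedged reconstruction, not necessarily the query actually used:

```sql
-- Hypothetical reconstruction of the measurement query:
SELECT application_name, count(*) AS active_connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name
ORDER BY active_connections DESC;
```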
Here are the results: [screenshot of query results omitted] This indicates that connection pooling for queue length monitoring works as expected, i.e. there is only one active connection in the pool at any given point. Secondly, there are 53 active connections in total, with endpoints taking 8 connections each, which means that we could spin up ~5 more endpoints and not reach the default 100 active connections limit in PostgreSQL.
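A back-of-envelope check of that headroom claim, using only the numbers from the comment above:

```python
# Numbers reported in the comment above; not measured here.
max_connections = 100   # PostgreSQL default connection limit
active_total = 53       # measured active connections in the demo
per_endpoint = 8        # connections held by each endpoint

extra_endpoints = (max_connections - active_total) // per_endpoint
print(extra_endpoints)  # 5
```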
Fixes #4556