
SQL Server and PostgreSQL can report 0 as the queue length value #4557

Merged
1 commit merged into master from decrease-ql-interval-for-sql on Nov 6, 2024

Conversation

@tmasternak
Member

Fixes #4556

@PhilBastian
Contributor

The query interval was originally increased because of how much it was hammering the database, which was theorised to potentially increase the likelihood of connection starvation. I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

@andreasohlund @SzymonPobiega

@jpalac
Contributor

jpalac commented Oct 24, 2024

The query interval was originally increased because of how much it was hammering the database, which was theorised to potentially increase the likelihood of connection starvation. I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

@tmasternak do the changes made to the transport address the original reason why this interval was changed?

@tmasternak
Member Author

tmasternak commented Oct 24, 2024

@jpalac @PhilBastian great questions. Let me give the best answers I have.

TL;DR:

I propose to go with 200 ms for SQL Server and PostgreSQL. I think this should not affect connection starvation. If we get feedback from users that the load on the database is too high, we can provide an option to change the value (as we do for ASB). Let's capture re-working the query mechanism in a dedicated improvement issue.
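
For illustration only, such an opt-in override could be as small as the sketch below; the environment variable name and the loop shape are assumptions, not existing Monitoring settings.

// Illustrative sketch: a query loop defaulting to the proposed 200 ms interval,
// with a hypothetical opt-in override via an environment variable.
static async Task RunQueueLengthQueryLoop(Func<CancellationToken, Task> queryOnce, CancellationToken token)
{
    var interval = TimeSpan.FromMilliseconds(200); // proposed default for SQL Server and PostgreSQL

    var overrideMs = Environment.GetEnvironmentVariable("MONITORING_QUEUE_LENGTH_QUERY_INTERVAL_MS"); // hypothetical setting name
    if (int.TryParse(overrideMs, out var ms) && ms > 0)
    {
        interval = TimeSpan.FromMilliseconds(ms);
    }

    using var timer = new PeriodicTimer(interval);
    while (await timer.WaitForNextTickAsync(token))
    {
        await queryOnce(token); // issue the queue length queries for this tick
    }
}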

Longer version

Let me start with SQL Server, which is potentially easier to answer. The value for the queue length measurement interval was changed relatively recently (~2 months ago), and my understanding is that it was done to align the value with PostgreSQL. For SQL Server we never observed connection starvation, nor did we receive feedback from users that the value was problematic. As a result, I propose to revert the change and make the interval for SQL Server 200 ms, as it was previously.

When it comes to PostgreSQL, I assumed that the queue length measurement mechanism would never use more than one connection at any given point. Looking at the code, it seems that we open a single connection and then update queue sizes, batching multiple SELECT statements in a single client request.

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync(cancellationToken);
foreach (var chunk in chunks)
{
    await UpdateChunk(connection, chunk, cancellationToken);
}
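
For illustration, here is a hypothetical sketch of what batching several queue-length SELECTs into one round trip on that single connection could look like; the UpdateChunk signature, table names, and column layout are assumptions rather than the actual implementation.

// Hypothetical sketch only: batch several queue-length SELECTs into a single round trip
// on the already-open connection. Table names and the method signature are assumptions.
static async Task UpdateChunk(NpgsqlConnection connection, string[] queueTables, CancellationToken cancellationToken)
{
    var sql = string.Join("\nUNION ALL\n",
        queueTables.Select(table => $"SELECT '{table}' AS queue, count(*) AS length FROM \"{table}\""));

    await using var command = new NpgsqlCommand(sql, connection);
    await using var reader = await command.ExecuteReaderAsync(cancellationToken);

    while (await reader.ReadAsync(cancellationToken))
    {
        var queue = reader.GetString(0);
        var length = reader.GetInt64(1);
        // record (queue, length) for the corresponding bucket here
    }
}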

As a result, I think that decreasing the query interval should not affect the connection pooling problem (to be verified for sure during pre-release testing).

We could still be querying the database too frequently, but as mentioned previously the 200 ms value seems to be fine for most, if not all, SQL Server users, so I would assume the same should be the case for PostgreSQL.

Finally, when it comes to:

I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

I fully agree, though I don't think this can be done as part of a bug fix; the work needed is too big. I propose opening a dedicated improvement issue to capture that future work.

@jpalac
Contributor

jpalac commented Oct 24, 2024

From memory, the issue with PostgreSQL was that even though the connection was released, by the time a new one was requested (200 ms) the released one was not yet available, hence a new one would be used. As a result a number of connections (up to 8?) were being used, which is significant when the limit in Postgres is 100.

Again, perhaps the work you've done on the transport has reduced that impact.

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@tmasternak
Member Author

From memory, the issue with PostgreSQL was that even though the connection was released, by the time a new one was requested (200 ms) the released one was not yet available, hence a new one would be used. As a result a number of connections (up to 8?) were being used, which is significant when the limit in Postgres is 100.

We could try testing that, but I'm not sure we can reliably do that with the time we have.

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@jpalac yes, this might be the best approach.
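
A minimal sketch of what that split could look like; the transport names and the PostgreSQL value below are purely illustrative, not agreed numbers.

// Illustrative sketch only: revert SQL Server to 200 ms while letting PostgreSQL
// query less often. The transport names and the PostgreSQL value are placeholders.
static TimeSpan QueueLengthQueryIntervalFor(string transport) => transport switch
{
    "SqlServer" => TimeSpan.FromMilliseconds(200), // back to the previous SQL Server value
    "PostgreSql" => TimeSpan.FromSeconds(1),       // hypothetical longer interval to reduce load
    _ => TimeSpan.FromMilliseconds(200)
};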

@jpalac
Contributor

jpalac commented Oct 24, 2024

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@jpalac yes, this might be the best approach.

OK, perhaps let's move ahead with that then?

@SzymonPobiega
Member

I've seen the same result in the Learning transport during the NDC Porto conference while I was running the demo continuously. This makes me think that the solution might need to go deeper. As far as I understand, the current assumption is that we need to query frequently enough to fill each bucket with data.

Maybe another solution would be to distinguish buckets for which we have no data from those that have a value of 0. Then we could assume that when we have no data the value is the same as the previous value or, if we have values on both sides, set the bucket to the average of the two neighbors?
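
A minimal sketch of that idea, assuming a null entry means "no reading" for a bucket; the types and names are illustrative, not the actual report code.

// Illustrative sketch: keep genuine 0 readings, carry the previous value forward,
// and average the neighbors when a gap has readings on both sides.
static double?[] FillGaps(double?[] buckets)
{
    var result = (double?[])buckets.Clone();

    for (var i = 0; i < result.Length; i++)
    {
        if (result[i].HasValue)
        {
            continue; // a real reading, including a genuine 0, is kept as-is
        }

        double? previous = i > 0 ? result[i - 1] : null;

        // find the next bucket that has a reading
        double? next = null;
        for (var j = i + 1; j < result.Length; j++)
        {
            if (result[j].HasValue)
            {
                next = result[j];
                break;
            }
        }

        result[i] = (previous, next) switch
        {
            (not null, not null) => (previous.Value + next.Value) / 2, // data on both sides: average the neighbors
            (not null, null) => previous,                              // no later reading: carry the previous value forward
            _ => next                                                  // leading gap: backfill from the next reading, or stay empty
        };
    }

    return result;
}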

@tmasternak
Member Author

Maybe another solution would be to distinguish buckets for which we have no data from those that have a value of 0. Then we could assume that when we have no data the value is the same as the previous value or, if we have values on both sides, set the bucket to the average of the two neighbors?

@SzymonPobiega I've created a dedicated issue and captured your solution idea there: Particular/ServicePulse#2156

@tmasternak
Member Author

Again, perhaps the work you've done on the transport has reduced that impact.

@jpalac I've done some more testing and I think we should be good with 200 ms for PostgreSQL. I've tweaked the connection strings of the endpoints in the demo setup and of all ServiceControl instances to include an "ApplicationName" value. This allowed me to measure the number of active connections coming from various parts of this simple system.
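
For reference, a minimal sketch of tagging a component's connections this way, assuming Npgsql's standard ApplicationName connection string setting; the component name and credentials below are placeholders.

// Minimal sketch: give each component's connections a distinct application name so they
// can be told apart in pg_stat_activity. The name and credentials below are placeholders.
var builder = new NpgsqlConnectionStringBuilder("Host=localhost;Database=nservicebus;Username=monitor;Password=secret")
{
    ApplicationName = "ServiceControl.Monitoring.QueueLength"
};

await using var connection = new NpgsqlConnection(builder.ConnectionString);
await connection.OpenAsync(cancellationToken);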

The query used was:

SELECT application_name, count(*)
FROM pg_stat_activity
WHERE datname = 'nservicebus'
GROUP BY application_name;

Here are the results:

[screenshot: pg_stat_activity results showing active connection counts per application_name]

This indicates that connection pooling for queue length monitoring works as expected, i.e. there is only one active connection in the pool at any given point. Secondly, there are 53 active connections in total, with endpoints taking 8 connections each, which means that we could spin up ~5 more endpoints and still not reach the default limit of 100 active connections in PostgreSQL.

WilliamBZA merged commit e3b0b66 into master on Nov 6, 2024
31 checks passed
WilliamBZA deleted the decrease-ql-interval-for-sql branch on November 6, 2024 at 11:06