
SQL Server and PostgreSQL can report 0 as the queue length value #4557

Merged
1 commit merged into master from decrease-ql-interval-for-sql on Nov 6, 2024

Conversation

@tmasternak
Member

Fixes #4556

@PhilBastian
Contributor

The query interval was originally increased because of how much it was hammering the database, which was theorised to potentially increase the likelihood of connection starvation. I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

@andreasohlund @SzymonPobiega

@jpalac
Contributor

jpalac commented Oct 24, 2024

The query interval was originally increased because of how much it was hammering the database, which was theorised to potentially increase the likelihood of connection starvation. I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

@tmasternak do the changes made to the transport address the original reason why this interval was changed?

@tmasternak
Member Author

tmasternak commented Oct 24, 2024

@jpalac @PhilBastian great questions. Let me give the best answers I have.

TL;DR:

I propose to go with 200 ms for SQL Server and PostgreSQL. I think this should not affect connection starvation. If we get feedback from users that the load on the database is too high, we can provide an option to change the value (as we do for ASB). Let's capture re-working the query mechanism in a dedicated improvement issue.
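
For illustration only, such an opt-in override could be as small as the sketch below; the environment variable name and the loop shape are assumptions, not existing Monitoring settings.

// Illustrative sketch: a query loop defaulting to the proposed 200 ms interval,
// with a hypothetical opt-in override via an environment variable.
static async Task RunQueueLengthQueryLoop(Func<CancellationToken, Task> queryOnce, CancellationToken token)
{
    var interval = TimeSpan.FromMilliseconds(200); // proposed default for SQL Server and PostgreSQL

    var overrideMs = Environment.GetEnvironmentVariable("MONITORING_QUEUE_LENGTH_QUERY_INTERVAL_MS"); // hypothetical setting name
    if (int.TryParse(overrideMs, out var ms) && ms > 0)
    {
        interval = TimeSpan.FromMilliseconds(ms);
    }

    using var timer = new PeriodicTimer(interval);
    while (await timer.WaitForNextTickAsync(token))
    {
        await queryOnce(token); // issue the queue length queries for this tick
    }
}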

Longer version

Let me start with SQL Server, which is potentially easier to answer. The value for the queue length measurement interval was changed relatively recently (~2 months ago), and my understanding is that it was done to align the value with PostgreSQL. For SQL Server we never observed connection starvation, nor did we receive feedback from users that the value was problematic. As a result, I propose to revert the change and make the interval for SQL Server 200 ms, as it was previously.

When it comes to PostgreSQL, I assumed that the queue length measurement mechanism would never use more than one connection at any given point. Looking at the code, it seems that we open a single connection and then update queue sizes, batching multiple SELECT statements in a single client request.

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync(cancellationToken);
foreach (var chunk in chunks)
{
    await UpdateChunk(connection, chunk, cancellationToken);
}
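
For illustration, here is a hypothetical sketch of what batching several queue-length SELECTs into one round trip on that single connection could look like; the UpdateChunk signature, table names, and column layout are assumptions rather than the actual implementation.

// Hypothetical sketch only: batch several queue-length SELECTs into a single round trip
// on the already-open connection. Table names and the method signature are assumptions.
static async Task UpdateChunk(NpgsqlConnection connection, string[] queueTables, CancellationToken cancellationToken)
{
    var sql = string.Join("\nUNION ALL\n",
        queueTables.Select(table => $"SELECT '{table}' AS queue, count(*) AS length FROM \"{table}\""));

    await using var command = new NpgsqlCommand(sql, connection);
    await using var reader = await command.ExecuteReaderAsync(cancellationToken);

    while (await reader.ReadAsync(cancellationToken))
    {
        var queue = reader.GetString(0);
        var length = reader.GetInt64(1);
        // record (queue, length) for the corresponding bucket here
    }
}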

As a result, I think that decreasing the query interval should not affect the connection pooling problem (to be verified for sure during pre-release testing).

We could still be querying the database too frequently, but as mentioned previously the 200 ms value seems to be fine for most, if not all, SQL Server users, so I would assume the same should be the case for PostgreSQL.

Finally, when it comes to:

I'm thinking this needs to be looked at in greater depth, potentially changing how the interval logic works, e.g. changing it so that if there is no reading in an interval it will take the previous value.

I fully agree, though I don't think this can be done as part of a bug fix; the work needed is too big. I propose opening a dedicated improvement issue to capture that future work.

@jpalac
Contributor

jpalac commented Oct 24, 2024

From memory, the issue with PostgreSQL was that even though the connection was released, by the time a new one was requested (200 ms) the released one was not yet available, hence a new one would be used. As a result a number of connections (up to 8?) were being used, which is significant when the limit in Postgres is 100.

Again, perhaps the work you've done on the transport has reduced that impact.

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@tmasternak
Member Author

From memory, the issue with PostgreSQL was that even though the connection was released, by the time a new one was requested (200 ms) the released one was not yet available, hence a new one would be used. As a result a number of connections (up to 8?) were being used, which is significant when the limit in Postgres is 100.

We could try testing that, but I'm not sure we can reliably do that with the time we have.

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@jpalac yes, this might be the best approach.
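
A minimal sketch of what that split could look like; the transport names and the PostgreSQL value below are purely illustrative, not agreed numbers.

// Illustrative sketch only: revert SQL Server to 200 ms while letting PostgreSQL
// query less often. The transport names and the PostgreSQL value are placeholders.
static TimeSpan QueueLengthQueryIntervalFor(string transport) => transport switch
{
    "SqlServer" => TimeSpan.FromMilliseconds(200), // back to the previous SQL Server value
    "PostgreSql" => TimeSpan.FromSeconds(1),       // hypothetical longer interval to reduce load
    _ => TimeSpan.FromMilliseconds(200)
};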

@jpalac
Contributor

jpalac commented Oct 24, 2024

Is there an issue with changing the SQL Server one back, but having the Postgres one run less often?

@jpalac yes, this might be the best approach.

OK, perhaps let's move ahead with that then?

@SzymonPobiega
Member

I've seen the same result in the Learning transport during the NDC Porto conference while I was running the demo continuously. This makes me think that the solution might need to go deeper. As far as I understand, the current assumption is that we need to query frequently enough to fill each bucket with data.

Maybe another solution would be to distinguish buckets for which we have no data from those that have a value of 0. Then we could assume that when we have no data the value is the same as the previous value or, if we have values on both sides, set the bucket to the average of the two neighbors?
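
A minimal sketch of that idea, assuming a null entry means "no reading" for a bucket; the types and names are illustrative, not the actual report code.

// Illustrative sketch: keep genuine 0 readings, carry the previous value forward,
// and average the neighbors when a gap has readings on both sides.
static double?[] FillGaps(double?[] buckets)
{
    var result = (double?[])buckets.Clone();

    for (var i = 0; i < result.Length; i++)
    {
        if (result[i].HasValue)
        {
            continue; // a real reading, including a genuine 0, is kept as-is
        }

        double? previous = i > 0 ? result[i - 1] : null;

        // find the next bucket that has a reading
        double? next = null;
        for (var j = i + 1; j < result.Length; j++)
        {
            if (result[j].HasValue)
            {
                next = result[j];
                break;
            }
        }

        result[i] = (previous, next) switch
        {
            (not null, not null) => (previous.Value + next.Value) / 2, // data on both sides: average the neighbors
            (not null, null) => previous,                              // no later reading: carry the previous value forward
            _ => next                                                  // leading gap: backfill from the next reading, or stay empty
        };
    }

    return result;
}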

@tmasternak
Member Author

Maybe another solution would be to distinguish buckets for which we have no data from those that have a value of 0. Then we could assume that when we have no data the value is the same as the previous value or, if we have values on both sides, set the bucket to the average of the two neighbors?

@SzymonPobiega I've created a dedicated issue and captured your solution idea there: Particular/ServicePulse#2156

@tmasternak
Member Author

Again, perhaps the work you've done on the transport has reduced that impact.

@jpalac I've done some more testing and I think we should be good with 200 ms for PostgreSQL. I've tweaked the connection strings of the endpoints in the demo setup and of all ServiceControl instances to include an "ApplicationName" value. This allowed me to measure the number of active connections coming from various parts of this simple system.
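
For reference, a minimal sketch of tagging a component's connections this way, assuming Npgsql's standard ApplicationName connection string setting; the component name and credentials below are placeholders.

// Minimal sketch: give each component's connections a distinct application name so they
// can be told apart in pg_stat_activity. The name and credentials below are placeholders.
var builder = new NpgsqlConnectionStringBuilder("Host=localhost;Database=nservicebus;Username=monitor;Password=secret")
{
    ApplicationName = "ServiceControl.Monitoring.QueueLength"
};

await using var connection = new NpgsqlConnection(builder.ConnectionString);
await connection.OpenAsync(cancellationToken);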

The query used was:

SELECT application_name, count(*)
FROM pg_stat_activity
WHERE datname = 'nservicebus'
GROUP BY application_name;

Here are the results:

[screenshot: pg_stat_activity results showing active connection counts per application_name]

This indicates that connection pooling for queue length monitoring works as expected, i.e. there is only one active connection in the pool at any given point. Secondly, there are 53 active connections in total, with endpoints taking 8 connections each, which means that we could spin up ~5 more endpoints and still not reach the default limit of 100 active connections in PostgreSQL.

WilliamBZA merged commit e3b0b66 into master on Nov 6, 2024
31 checks passed
WilliamBZA deleted the decrease-ql-interval-for-sql branch on November 6, 2024 at 11:06