We are seeing an issue with drainLoop in SimpleDequePool that is similar to this issue. We see the following pattern:
Elevated errors over a period of hours (in our case, we see a high count of StackOverflowErrors). We are still attempting to understand the cause of these errors.
We observe a Too many permits returned error in our logs. In our case, the connection pool is configured with 500 connections, so the error is Too many permits returned: returned=1, would bring to 501/500.
Idle and active connections go to 0 (observed via metrics) and pending connections continue to build (we've seen many thousands).
No more outgoing requests are observed.
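To make the arithmetic behind the error message concrete, here is a minimal sketch of a bounded permit counter. PermitCounter is a hypothetical stand-in written for this illustration, not the reactor-pool class, and it only assumes that the real strategy rejects a return that would exceed the configured maximum. It shows that returning the same permit twice against a pool of 500 produces a message of the shape seen in our logs.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a size-based allocation strategy; NOT the reactor-pool class.
class PermitCounter(private val max: Int) {
    private val permits = AtomicInteger(max)

    fun available(): Int = permits.get()

    // Hands out up to `desired` permits, never more than are currently available.
    fun getPermits(desired: Int): Int {
        while (true) {
            val current = permits.get()
            val granted = minOf(desired, current)
            if (granted <= 0) return 0
            if (permits.compareAndSet(current, current - granted)) return granted
        }
    }

    // Rejects any return that would push the count above the configured maximum.
    fun returnPermits(returned: Int) {
        while (true) {
            val current = permits.get()
            require(current + returned <= max) {
                "Too many permits returned: returned=$returned, would bring to ${current + returned}/$max"
            }
            if (permits.compareAndSet(current, current + returned)) return
        }
    }
}

fun main() {
    val counter = PermitCounter(500)
    check(counter.getPermits(1) == 1)   // 499 permits left
    counter.returnPermits(1)            // back to 500
    try {
        counter.returnPermits(1)        // the same permit is returned a second time
    } catch (e: IllegalArgumentException) {
        println(e.message)              // Too many permits returned: returned=1, would bring to 501/500
    }
}
```

In this model the only way to reach 501/500 is a return that has no matching acquisition, which is why we suspect a double release somewhere rather than a sizing problem.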
We have manually attempted to reproduce this issue by implementing a custom AllocationStrategy that delegates to an actual allocation strategy but throws an exception in returnPermits (either randomly, or after a given number of calls). After this exception is thrown, we observe the same behaviour as above. We don't currently understand how we reach a state where the PERMITS count is off. We are unable to enable DEBUG logging in production due to PII concerns, but we are looking to deploy some changes that will log the number of permits as well as the state of the connection pool. Any pointers you have would be appreciated.
Another option is that it's hitting this exception in destroyPoolable.
Expected Behavior
Ideally, the connection pool would continue to function. I don't know if that is realistic given that PERMITS would continue to be wrong.
Actual Behavior
No more connections are made.
Steps to Reproduce
We realize this is a contrived example, but it matches what we end up seeing in production.
class ThrowingAllocationStrategy(...) {
    private val allocationStrategyDelegate = // your actual allocation strategy here, e.g. Http2AllocationStrategy, SizeBasedAllocationStrategy

    // implement the remainder by just delegating the calls to the underlying allocation strategy

    override fun returnPermits(p0: Int) {
        // fail before delegating, so the permit never reaches the real strategy
        randomlyThrow()
        allocationStrategyDelegate.returnPermits(p0)
    }

    // throws on roughly 1 in 100 calls
    private fun randomlyThrow() {
        if (Random.nextInt(100) == 0) {
            throw RuntimeException("Boom!")
        }
    }
}
<SNIP>
// init custom connection provider with the custom allocation strategy
return ConnectionProvider
    .allocationStrategy(loggingAllocationStrategy!!)
    .build()
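For a deterministic reproduction, the "after a given number of calls" variant can be sketched as below. This is a self-contained illustration only: PermitStrategy and FixedSizeStrategy are hypothetical, reduced stand-ins so the snippet runs without reactor-pool on the classpath, and ThrowAfterNStrategy is a name made up for this sketch. In the real reproduction the class would implement the library's allocation strategy interface and delegate to a real strategy instead.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical, reduced stand-in for the allocation strategy interface.
interface PermitStrategy {
    fun getPermits(desired: Int): Int
    fun returnPermits(returned: Int)
    fun estimatePermitCount(): Int
}

// Trivial single-threaded delegate, present only to make the sketch runnable.
class FixedSizeStrategy(max: Int) : PermitStrategy {
    private var available = max

    override fun getPermits(desired: Int): Int {
        val granted = minOf(desired, available).coerceAtLeast(0)
        available -= granted
        return granted
    }

    override fun returnPermits(returned: Int) {
        available += returned
    }

    override fun estimatePermitCount(): Int = available
}

// Delegates every call to `delegate`, except that returnPermits throws on call number `failOnCall`.
class ThrowAfterNStrategy(
    private val delegate: PermitStrategy,
    private val failOnCall: Int
) : PermitStrategy by delegate {
    private val calls = AtomicInteger()

    override fun returnPermits(returned: Int) {
        if (calls.incrementAndGet() == failOnCall) {
            throw RuntimeException("Boom!")
        }
        delegate.returnPermits(returned)
    }
}

fun main() {
    val strategy = ThrowAfterNStrategy(FixedSizeStrategy(2), failOnCall = 2)
    strategy.getPermits(2)                      // both permits handed out
    strategy.returnPermits(1)                   // first return succeeds
    runCatching { strategy.returnPermits(1) }   // second return throws before reaching the delegate
    println(strategy.estimatePermitCount())     // prints 1: one permit was never handed back
}
```

Kotlin's `by delegate` clause forwards the remaining interface methods automatically, so only returnPermits needs an explicit override. Note that in this reduced model the failure leaves the count one permit short; it does not by itself explain how the real count ends up above the maximum.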
Possible Solution
Given that the PERMITS count is off, it's unclear to me what the proper resolution would be. We noticed that disposing of the connection provider did resolve the issue as expected, presumably because the connection pool would be fresh and not in a bad state. I'm not sure what the implications would be of continuing to use the existing connection pool, as I would expect to see the permits exception being thrown repeatedly.
@violetagg Interesting, that is unexpected for sure. I will look into why this version is being used.
EDIT: Turns out the wrong reactor-pool version was an artifact of some local debugging steps I took, but doesn't reflect the version in the application.
Taking a glance at the current code though, it's similar enough as to have the same issue if this exception were to be thrown. I imagine it's pretty rare.
One hypothesis is that the exception is not handled, and WIP is not decremented. When doAcquire() is called, elements are added to the pending queue as before, but drainLoop cannot be entered because WIP will not be 0.
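The WIP hypothesis can be illustrated with a miniature work-in-progress guarded drain loop. This is a sketch under stated assumptions, not the SimpleDequePool code: MiniPool, onDrain, pending and served are hypothetical names, and the example is single-threaded for clarity. It only demonstrates the general pattern in which an exception escaping the loop skips the WIP decrement, so later callers never enter the loop again.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical miniature of a WIP-guarded drain loop; NOT the SimpleDequePool implementation.
// Driven from a single thread here, so the plain collections are sufficient.
class MiniPool(private val onDrain: () -> Unit) {
    private val wip = AtomicInteger()
    val pending = ArrayDeque<String>()
    val served = mutableListOf<String>()

    fun acquire(borrower: String) {
        pending.addLast(borrower)
        drain()
    }

    private fun drain() {
        // Only the caller that moves WIP from 0 to 1 enters the loop.
        if (wip.getAndIncrement() != 0) return
        var missed = 1
        while (true) {
            while (pending.isNotEmpty()) {
                onDrain()                        // may throw, e.g. from returnPermits
                served.add(pending.removeFirst())
            }
            missed = wip.addAndGet(-missed)      // never reached if onDrain() threw
            if (missed == 0) return
        }
    }
}

fun main() {
    var failNext = false
    val pool = MiniPool { if (failNext) throw RuntimeException("Boom!") }

    pool.acquire("a")                     // served normally, WIP returns to 0
    failNext = true
    runCatching { pool.acquire("b") }     // exception escapes the loop, WIP stays at 1
    failNext = false
    pool.acquire("c")                     // drain() returns immediately because WIP != 0
    pool.acquire("d")

    println("served=${pool.served} pending=${pool.pending}")
}
```

After the single failure, only a has been served, while b, c and d stay in the pending queue even though the failure condition is gone. That matches the shape of what we see in production: pending grows without bound and nothing is ever served again.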
Your Environment
Also tried the latest main commit for reactor-core and reactor-pool.
Other relevant libraries versions (e.g. netty, ...): netty == 4.1.x
JVM version (java -version): 21, 17
OS and version (e.g. uname -a):