Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

returnPermits exception stalls connection pool #255

Open
erikbeerepoot opened this issue Dec 9, 2024 · 2 comments
Open

returnPermits exception stalls connection pool #255

erikbeerepoot opened this issue Dec 9, 2024 · 2 comments

Comments

@erikbeerepoot
Copy link

erikbeerepoot commented Dec 9, 2024

We are seeing an issue with drainLoop in SimpleDequePool that is similar to this issue. We see the following pattern:

  1. Elevated errors over a period of hours (in our case, we see a high count of StackOverflowErrors). We are still attempting to understand the cause of these errors.
  2. We observe a Too many permits returned error in our logs. In our case, the connection pool is configured with 500 connections, so the error is Too many permits returned: returned=1, would bring to 501/500..
  3. Idle an active connections go to 0 (observed via metrics) and pending connections continue to build (we've seen many thousands).
  4. No more outgoing request are observed.

We have manually attempted to reproduce this issue by implementing a custom AllocationStrategy that delegates to an actual allocation strategy but throws an exception in returnPermits (either randomly, or after a given number of calls). After this exception is thrown, we observe the same behaviour as above. We don't currently understand how reach a state where the PERMITS count is off (we are unable to enable DEBUG logging in production due to PII concerns, but are looking to deploy some changes that will add logging on the number of permits as well as the state of the connection pool. Any pointers you have would be appreciated).

One hypothesis is that the exception is not handled, and WIP is not decremented. When doAcquire() is called, elements are added to the pending queue as before, but drainLoop cannot be entered as WIP will not be 0.

Another option is that it's hitting this exception in destroyPoolable.

Expected Behavior

Ideally, the connection pool would continue to function. I don't know if that is realistic given that PERMITS would continue to be wrong.

Actual Behavior

No more connections are made.

Steps to Reproduce

We realize this is a contrived example, but matches what we end up seeing in production.

class ThrowingAllocationStrategy(...) {
    private val allocationStrategyDelegate = // your actually, allocation strategy here, eg. Http2AllocationStrategy, SizeBasedAllocationStrategy
    // implement the remainder by just delegating the calls to the underlying allocation strategy

    override fun returnPermits(p0: Int) {
        randomlyThrow()

        allocationStrategyDelegate.returnPermits(p0)
    }

    private fun randomlyThrow() {
        if (Random.nextInt(100) == 0) {
            throw RuntimeException("Boom!")
        }
    }
}

<SNIP>
// init custom connection provider with the custom allocation strategy
return ConnectionProvider
        .allocationStrategy(loggingAllocationStrategy!!)
        .build()
  

Possible Solution

Given the the PERMITS count is off, it's unclear to me what the proper resolution would be. We noticed that disposing of the connection provider did resolve the issue as expected, presumably because the connection pool would be fresh and not in a bad state. I'm not sure what the implications would be of continuing to use the existing connection pool, as I would expect to see the permits exception being thrown repeatedly.

Your Environment

reactor-pool == 0.2.13
io.projectreactor:reactor-core:3.5.20 -> 3.6.9
org.springframework:spring-webflux -> 6.1.12
io.projectreactor.netty:reactor-netty-core:1.1.22
netty:4.1.112.Final

Also tried the latest main commit for reactor-core and reactor-pool.

  • Other relevant libraries versions (eg. netty, ...): netty, netty == 4.1.x,
  • JVM version (java -version): 21, 17.
  • OS and version (eg uname -a):
@violetagg
Copy link
Member

@erikbeerepoot Something is off here, Reactor Netty 1.1.x uses Reactor Pool 1.0.x.
https://github.com/reactor/reactor-netty/blob/1.1.x/gradle.properties

Also Reactor Pool 0.2.x reached OSS EOL https://github.com/reactor/.github/blob/main/SUPPORT.adoc#support-timeline

@erikbeerepoot
Copy link
Author

erikbeerepoot commented Jan 16, 2025

@violetagg Interesting, that is unexpected for sure. I will look into why this version is being used.

EDIT: Turns out the wrong reactor-pool version was an artifact of some local debugging steps I took, but doesn't reflect the version in the application.

Taking a glance at the current code though, it's similar enough as to have the same issue if this exception were to be thrown. I imagine it's pretty rare.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants