
Using Jetty with Spring Webclient causes thread leak #22

Closed
thomas-br opened this issue Aug 13, 2020 · 33 comments

@thomas-br

Hi,

I am currently looking into using the Jetty HttpClient to replace Netty for my Spring WebClient, which is responsible for external calls. Because there is a hard limit after which these external requests must be cancelled, I am using a timeout in the WebClient execution call:

ResponseEntity<String> response = request.retrieve()
    .toEntity(String.class)
    .timeout(Duration.ofMillis(timeoutMillis))
    .block();

I am using Jetty as the ClientHttpConnector (simplified):

HttpClient httpClient = new HttpClient(SSL_CONTEXT_FACTORY);
WebClient.builder()
    .clientConnector(new JettyClientHttpConnector(httpClient))
...

In a load test scenario where all calls are cancelled before they can finish, I am observing a thread leak in the Jetty HttpClient threads:
[image: graph of Jetty HttpClient thread count during the load test]
Even after a long cooldown the threads do not go away, and at the peak new requests fail because no free resources are available.

Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.base/java.lang.Thread.start0(Native Method)
	at java.base/java.lang.Thread.start(Thread.java:804)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:660)
@sbordet
Member

sbordet commented Aug 13, 2020

Can you take a thread dump to see what those threads are doing?

@thomas-br
Author

Can you take a thread dump to see what those threads are doing?

Sure. You can find the thread dump here

@thomas-br
Author

@sbordet Could you get any insight out of it?

@sbordet
Member

sbordet commented Aug 25, 2020

@thomas-br sorry for the late reply, I was on vacation.
Looking at the dumps now.

@sbordet
Member

sbordet commented Aug 25, 2020

From your logs, there are 439 different instances of HttpClient.

That is not good, as there should be only 1.
Each instance will require selector threads, schedulers, etc., so 439 of them will consume a lot of threads.

I suspect there is something really wrong in the way you set up your application.

@thomas-br
Author

@sbordet
Thanks for your reply. Okay, yes, indeed I am creating an HttpClient for each request.

On the one hand, the same procedure does not cause such thread behavior with the Netty HttpClient (see graph below).
[image: graph of Netty HttpClient thread count during the same load test]

On the other hand, your point that ideally a singleton HttpClient instance should be used sounds generally fair. But the issue I see is that the proxy configuration can only be applied globally on the HttpClient, whereas in my case it may differ per request. So I do not see how to do that thread-safely per (parallel) request without creating new instances.

@rstoyanchev

When creating the JettyClientHttpConnector you can also pass a JettyResourceFactory to re-use the same resources across different client instances. Or you can configure HttpClient yourself with the same resources; see what that constructor does. The convenience of the JettyClientHttpConnector is that it is designed to be declared in Spring configuration and to automatically start and stop the resources with the Spring ApplicationContext lifecycle.
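
For illustration, a minimal sketch of that setup as Spring configuration (assuming Spring Framework 5.2+, where JettyClientHttpConnector accepts an HttpClient together with a JettyResourceFactory; the class and bean names are placeholders):

import org.eclipse.jetty.client.HttpClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.JettyClientHttpConnector;
import org.springframework.http.client.reactive.JettyResourceFactory;
import org.springframework.web.reactive.function.client.WebClient;

@Configuration
public class JettyWebClientConfig {

    // Shared Jetty resources (executor, byte buffer pool, scheduler),
    // started and stopped with the Spring ApplicationContext lifecycle.
    @Bean
    public JettyResourceFactory jettyResourceFactory() {
        return new JettyResourceFactory();
    }

    @Bean
    public WebClient webClient(JettyResourceFactory resourceFactory) {
        HttpClient httpClient = new HttpClient(); // configure SSL, proxy, etc. as needed
        return WebClient.builder()
                .clientConnector(new JettyClientHttpConnector(httpClient, resourceFactory))
                .build();
    }
}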

@sbordet
Member

sbordet commented Aug 26, 2020

But the issue I see is that the proxy configuration can only be applied globally on the HttpClient, whereas in my case it may differ per request.

Can you detail this?

Typically when using a proxy, you want to say that requests to a particular origin should be proxied - see for example how they are configured in a browser.

Do you have a case where, for the same origin, you want to use a proxy for one request, but not for another?

@thomas-br
Author

But the issue I see is that the proxy configuration can only be applied globally on the HttpClient, whereas in my case it may differ per request.

Can you detail this?

Typically when using a proxy, you want to say that requests to a particular origin should be proxied - see for example how they are configured in a browser.

Do you have a case where, for the same origin, you want to use a proxy for one request, but not for another?

Yes, sure. Basically the use case is that the service itself does not know any request information at application startup; everything is dynamic configuration per "job". At runtime it receives the task to execute a given request with the passed configuration. We cannot assume any similarity, for example the same hosts.

Imagine the first request to example.com is configured to use no proxy, the second request is configured to use a certain proxy, and a third one to the same host/URL can even have a completely different proxy setting in place. The same applies to the expected timeout behavior, which is another problem with a single HttpClient instance. So every job is completely independent of any other and should behave statelessly, without any interference.

Still, for testing purposes I have tried a singleton HttpClient instance with a static proxy configuration. Yes, this solves the threading problem initially reported, but during the same load test I have seen another issue. The load test setup is the following (a minimal sketch follows the list):

  • The attempted HTTP request downloads a large file, e.g. 20 MB
  • On the Spring WebClient we set a timeout (see the code above) of 1 s (for simplicity), thus expecting all requests to be cancelled after that time
  • 10 threads run in parallel, issuing new requests after each cancellation, all to the same host (for simplicity)
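
A minimal sketch of that setup (assuming a webClient built with the JettyClientHttpConnector as above, and a hypothetical LARGE_FILE_URI constant):

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical reproducer: 10 parallel callers, each request aborted by the
// 1 s timeout() before the large download can complete.
ExecutorService pool = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
    pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                webClient.get().uri(LARGE_FILE_URI)
                        .retrieve()
                        .toEntity(String.class)
                        .timeout(Duration.ofMillis(1000))
                        .block();
            } catch (Exception expected) {
                // A TimeoutException here is the expected outcome.
            }
        }
    });
}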

The thread / memory behavior seems to be okay, but from a certain point on I receive the following error:

Max requests queued per destination 1024 exceeded for HttpDestination[https://<target>.com]@7da592a2,queue=1024,pool=DuplexConnectionPool@43c321e[c=0/64/64,a=64,i=0]

This is acceptable during the load period. But after the load stops, all WebClient requests finish after at most 1 s. Still this message stays for more than 30 min, making the instance unusable in such a scenario.

It seems that although the WebClient request is cancelled, internally something is still queued. Why is this the case? Why doesn't the timeout triggered by Spring WebClient on the Mono lead to a cleanup of everything related to the initial HttpClient request?

@sbordet
Member

sbordet commented Aug 27, 2020

This is the first time I have encountered the need for different proxy configurations for different requests to the same origin server (browsers, for example, do not support this).

Jetty's HttpClient currently does not support this because the proxy configuration is per-HttpClient, not per-request.
You may want to open an issue to https://github.com/eclipse/jetty.project/issues for this.

The workaround is to use different HttpClient instances each with different proxy configurations.
You can even get by with an HttpClient per request, but you need to create, configure, and start the HttpClient instance, then send the request, and when the request is complete you must stop the HttpClient instance to shut down the thread pool and all the other components. One HttpClient per request will impact performance heavily - you can perhaps improve things by creating a new HttpClient instance only when you have a proxy configuration (or a new proxy configuration).
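
As a rough sketch of that workaround (assuming Jetty 9.4 APIs; the URL, proxy host, and port are placeholders):

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.HttpProxy;
import org.eclipse.jetty.client.api.ContentResponse;

public class PerRequestClient {
    // Create, configure and start a dedicated HttpClient, send one request, then stop it.
    static int fetchStatusViaProxy(String url, String proxyHost, int proxyPort) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.getProxyConfiguration().getProxies().add(new HttpProxy(proxyHost, proxyPort));
        httpClient.start();
        try {
            ContentResponse response = httpClient.GET(url);
            return response.getStatus();
        } finally {
            // Stopping releases this instance's thread pool, scheduler and selectors.
            httpClient.stop();
        }
    }
}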

About the load test you describe in the above comment, I am not sure I understand "Still this message stays for more than 30min".
From the HttpClient point of view, if you abort a request, it's gone and won't be retried.
Also, Jetty's reactive HttpClient wrapper does not retry requests.
Can you verify there is no automatic retry mechanism in Spring WebClient?

@thomas-br
Author

thomas-br commented Aug 27, 2020

Okay, accepting that the proxy configuration is bound at the HttpClient level.

For the other issue:

About the load test you describe in the above comment, I am not sure I understand "Still this message stays for more than 30min".
From the HttpClient point of view, if you abort a request, it's gone and won't be retried.
Also, Jetty's reactive HttpClient wrapper does not retry requests.
Can you verify there is no automatic retry mechanism in Spring WebClient?

It seems to be a deadlock. I am not talking about any retry behavior, but about the mechanism by which the Mono of a WebClient request is cancelled by the timeout property. It seems that when the timeout occurs at the Mono level, the request, or rather the resources inside the Jetty HttpClient, are not terminated and freed properly.

My observation is that heavy load of the same request to the same host, with all requests terminated by the Mono timeout, produces a deadlock of leaked queued requests. In my example all WebClient request tasks are finished, but the HttpDestination queue within the DuplexConnectionPool stays full and blocks new requests to the same host until the application restarts. I have observed the instance for a couple of hours now, and the queue is not getting cleared.

@sbordet
Member

sbordet commented Aug 27, 2020

Any chance that you can write a simple reproducer?

@thomas-br
Author

Any chance that you can write a simple reproducer?

Sure, will probably find some time next week :)

@thomas-br
Author

thomas-br commented Sep 3, 2020

Hi @sbordet

A simple reproducer project can be found here.

Simply insert into the URI constant an arbitrarily large file that takes more than 1 s to load.
The background is that we want to hit the timeout case.

Fire a lot of requests, for example with JMeter; I used 10 threads in parallel.
After a few minutes you will see the exception described above, showing that the host of your file is blocked, and this persists even after cooldown.

sbordet added a commit that referenced this issue Sep 7, 2020
Added test case - there appear to be no issue though.

Signed-off-by: Simone Bordet <[email protected]>
@sbordet
Member

sbordet commented Sep 7, 2020

@thomas-br I simplified your reproducer and I obtain this with 5.2.8:

reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 1000ms in 'flatMap' (and no fallback has been configured)

When I fix the timeout() call to be .timeout(Duration.ofMillis(timeout), Mono.just(...)), the test passes cleanly.

All seems good to me.

@thomas-br
Author

@sbordet I think you did not get my point. Let me try to clarify it again:

java.util.concurrent.TimeoutException is the expected behavior in my example, so it is not about avoiding it!

The issue is just that when a lot of long requests to the same host are cancelled with .timeout, at some point this produces the following error:

Max requests queued per destination 1024 exceeded for HttpDestination[https://<target>.com]@7da592a2,queue=1024,pool=DuplexConnectionPool@43c321e[c=0/64/64,a=64,i=0]

And apparently, because all requests are cancelled by .timeout, they are not removed from the queue.

So, in a nutshell, my observation is that an attacker could force a lot of timeouts toward, for example, github.com; the result would be that even after the attack is over, github.com would no longer be reachable by that instance, because the HttpDestination appears to have reached its maximum of queued requests, which are never removed due to the cancellation.

@sbordet
Member

sbordet commented Sep 7, 2020

@thomas-br so I can confirm that it's Spring that does not call jettyRequest.abort(..) when a timeout happens.

Because Spring wraps and hides the Jetty APIs, application code does not have, IIUC, any API to manually call jettyRequest.abort(..) with the TimeoutException produced by Spring.

I'm leaning towards this being a Spring issue rather than an issue for this project - they should either call request.abort(..) internally, or allow applications to do so.
Another option would be to forward the .timeout(..) call to the jettyRequest, but then there will be 2 scheduled tasks that will likely race each other.

@rstoyanchev

rstoyanchev commented Sep 15, 2020

@sbordet I would expect the Jetty reactive client to handle Reactive Streams signals. In this case the timeout() operator sends a TimeoutException downstream so the application can continue accordingly, e.g. via .onErrorResume(). It also sends a cancel signal upstream, and that reaches ResponseListenerProcessor#cancel(), which then has the opportunity to abort the request.

By comparison, Spring also wraps Reactor Netty as an alternative, which does handle cancellations, and we don't have to intercept the cancellation signal ourselves. Likewise, any other Subscriber (not related to Spring or Reactor) should be able to cancel the subscription, and that should abort the request.

@sbordet
Member

sbordet commented Sep 15, 2020

@rstoyanchev the problem with cancel() is that the reason for the cancellation is totally lost.
There is no indication whether it's a timeout, I/O, runtime, etc. exception.

That is why I think it's probably best for the Reactor implementation to actually abort the request with the actual exception that is known to Reactor, also because Reactor completely wraps the Jetty request, so it has access to both the Jetty request and the failure.

I can react to ResponseListenerProcessor.cancel() with a generic IOException failure, but that is a less optimal solution.
Do you have a pointer to the Netty code that implements cancel()?

@rstoyanchev

rstoyanchev commented Sep 16, 2020

@sbordet I don't see why it matters whether it is a timeout or not. A cancellation means the Subscriber is asking the Publisher to stop, and by Reactive Streams rules the Publisher must not publish any further signals, which means the response is not going to be consumed. I don't see how having a reason changes any of that.

That is why I think it's probably best for the Reactor implementation to actually abort the request with the actual exception that is known to Reactor

The request is wrapped by Spring's JettyClientHttpConnector in order to plug it into Spring's WebClient, which supports higher-level objects for input and output, among other things. This connector is also unaware of the exact reason, which in this case originates from the timeout operator; that operator is further downstream and is not aware (and not meant to be aware) of what the upstream Publisher is or does.

I can react to ResponseListenerProcessor.cancel() with a generic IOException failure

Such an exception cannot be passed to the Subscriber, which has already cancelled, and throwing it won't do much good either. In a reactive chain exceptions cannot be thrown; they are typically caught and passed downstream as error signals, but again the downstream has already cancelled, so the error cannot be passed and will be dropped. In any case, when a clear signal has been given to stop, reacting to that with an exception doesn't make sense to me.

Do you have a pointer to the Netty code that implements cancel()?

This is in Reactor Netty, which is a project on top of Netty that adds Reactive Streams semantics and back-pressure; it is comparable to how jetty-reactive-httpclient does the same for Jetty's HttpClient. The code is a bit more involved and it supports more than just HTTP, but this is where the equivalent connect Publisher is, and this is where the cancellation is handled.

@sbordet
Member

sbordet commented Sep 16, 2020

@rstoyanchev wrote:

I don't see why it matters whether it is a timeout or not. A cancellation means the Subscriber is asking the Publisher to stop

But it's not a cancellation. It's a timeout.
The fact that the timeout gets turned into a cancellation is irrelevant and only necessary due to the limited API that RS exposes.

Plus, a single HTTP request is not really a "stream", so when you timeout an HTTP request/response that has already been passed to the subscriber (in the form of a response element), and the subscriber calls cancel() what does that exactly mean?
From the RS point of view it means that the publisher should not publish more. It does not mean that the publisher should "cancel" the current element that is being processed - from the point of view of the publisher, that element is already gone and is now owned by the subscriber.

And even if you stretch the RS semantic to "cancel" the current element, the publisher may need to know the failure type because it may need to dispose resources in one case of failure, but not in others.

Having the implementation correctly abort HTTP requests, rather than relying on the limited semantic of RS, would be best IMHO.

Having said that, I can abort with a generic exception in ResponseListenerProcessor.cancel() if that's what all the other implementations do.

@rstoyanchev

Apologies for the slow reply.

It's true "cancel" can lack certainty, i.e. does the downstream want to abort or is it simply not interested in consuming the rest? In my experience the distinction rarely matters. The only exception I've encountered so far is a database transaction but even there it's more intuitive to interpret a cancellation as aborting the the transaction. In the context of an HTTP response I can see that "cancel" can be interpreted to mean close the connection immediately (disposing of resources) or consume and release the rest of the response without passing it downstream. Is that the concern here, or if not what is the choice of what could be done?

What I don't think is okay is to ignore a cancel signal altogether and do nothing, so the question to me is: what is the appropriate default action to take? Another thing to mention: in this example it's only the Reactor .timeout() operator that is aware of the cause for the cancellation, but Reactor is just a general-purpose library for composing logic (much like java.util.stream.Stream for composing logic on collections) and it has no knowledge of what the application is doing. Likewise, there is no other code upstream, all the way to Jetty's ResponseListenerProcessor, that has any more knowledge about the reason for the cancellation.

a single HTTP request is not really a "stream", so when you timeout an HTTP request/response that has already been passed to the subscriber (in the form of a response element), and the subscriber calls cancel() what does that exactly mean?

I realize this part is not obvious if you think only of a Publisher and a Subscriber, but from the higher-level perspective of ReactiveX-style composition it's easier to see that there is no distinction for the application:

WebClient.builder().clientConnector(new JettyClientHttpConnector(this.httpClient)).build()
        .get().uri(new URI(URI))
        .retrieve()
        .toEntity(String.class)
        .timeout(Duration.ofMillis(1000)) // 1sec to "fail" fast

Effectively the response publisher (which is wrapped in a Mono) flat-maps the response content publisher (Flux), and the output downstream is a single chain with the response content, either a single value or possibly a stream of values. A timeout() at that point applies to the entire joined upstream chain.

Having the implementation correctly abort HTTP requests, rather than relying on the limited semantic of RS, would be best IMHO.

I think by "implementation" you mean Spring of Reactor but as I mentioned neither Reactor has knowledge of what the application does upstream, nor does any code upstream have any knowledge of the reason for the cancellation.

Having said that, I can abort with a generic exception in ResponseListenerProcessor.cancel() if that's what all the other implementations do.

I think it would be better to do something than to completely ignore it.

@sbordet
Member

sbordet commented Sep 30, 2020

@rstoyanchev my turn to apologize for the delay.

I have implemented the cancellation, but now I am facing another problem: normal requests always fail due to cancellation.

I seem to be hitting MonoNext.NextSubscriber.onNext() which, when done==false, immediately cancels the subscription.

The test that fails is: https://github.com/jetty-project/jetty-reactive-httpclient/blob/1.1.x/src/test/java/org/eclipse/jetty/reactive/client/ReactorTest.java#L42

Help?

@rstoyanchev

I'll have a look.

@rstoyanchev

@sbordet, both tests in ReactorTest are green for me on the 1.1.x branch. Should I be expecting a failure somewhere?

@sbordet
Member

sbordet commented Oct 1, 2020

@rstoyanchev I did not commit your suggested change to ResponseListenerProcessor.cancel() because it breaks.
The change is simple, just add request.abort(new CancellationException()) in there.
I can commit to a branch if that helps.

@rstoyanchev

rstoyanchev commented Oct 1, 2020

Okay thanks, I see the problem on our side. We're using Mono#from, which cancels after the first item because with a Publisher there is no way to know how many items it might produce. However, in this case we know it will only produce one item and there is no need to cancel proactively, so we'll use Mono#fromDirect instead.

I have confirmed locally that after the fix all tests in ReactorTest are green again.
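
A small illustration of the difference (a sketch assuming Reactor 3.x; the doOnCancel hook only makes the eager cancel visible):

import org.reactivestreams.Publisher;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class MonoFromVsFromDirect {
    public static void main(String[] args) {
        Publisher<String> single = Flux.just("response")
                .doOnCancel(() -> System.out.println("upstream cancelled"));

        // Mono.from() takes the first item and then cancels the source,
        // because an arbitrary Publisher could emit more items.
        Mono.from(single).subscribe(System.out::println);

        // Mono.fromDirect() trusts the source to emit at most one item,
        // so it relays onNext/onComplete without the eager cancel.
        Mono.fromDirect(single).subscribe(System.out::println);
    }
}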

@sbordet
Member

sbordet commented Oct 1, 2020

@rstoyanchev not to beat a dead horse, but this is actually the example where an implementation of onNext(), after receiving the item, is entitled to call cancel() - what would be wrong with that?

However, with the "fix" the producer holds on to the item it just passed to onNext() just to "cancel" it.

And the solution of "oh, I know this producer will only produce one item [how do you know?], so let's make a special case, and this time only not cancel the subscription".

All of this seems extremely fragile IMHO, and I'd like to reiterate that aborting the request upon a timeout generated and controlled by an external subscriber should be done by that external subscriber, not by the implementation as a reaction to a generic call to cancel() which, as demonstrated above, may not indicate a failure at all.

@rstoyanchev

rstoyanchev commented Oct 2, 2020

I know, this doesn't seem very reassuring :) ... but allow me to elaborate.

JettyClientHttpConnector relies on the response method with a callback to transform the response. That callback turns the response into Mono<JettyClientHttpResponse>, which is a Publisher of 1:

ReactiveRequest reactiveRequest = ReactiveRequest.newBuilder(request).build();

Publisher<JettyClientHttpResponse> publisher = reactiveRequest.response((response, chunks) -> {
    Flux<DataBuffer> content = Flux.from(chunks).map(this::toDataBuffer);
    return Mono.just(new JettyClientHttpResponse(response, content));
});

So we know only one item will come, but since the Jetty API returns Publisher, which can theoretically emit any number of items, we turn it back into a Mono so it can be used further in a Reactor processing chain, where you have to choose between Mono and Flux for the code to compile. To do that we use Mono.from(publisher), but that's meant for use with a Publisher of an unknown number of items. That's not the case here, where we expect only one item, and for that Mono#fromDirect is the right choice: it consumes the first item and completes without an eager cancel.

However, with the "fix" the producer holds on to the item it just passed to onNext() just to "cancel" it.

Not sure what you mean. If you're referring to Mono#fromDirect, that doesn't cancel.

Reactive Streams does not have any specialization for a Publisher of 0..1, see reactive-streams/reactive-streams-jvm#490. It leaves that as a separate exercise for reactive libraries, and this is why virtually all reactive libraries that offer declarative composition have a special type for a Publisher of 1 or 0..1 vs many, because that distinction is so important for expressing APIs. If there were such a type in Reactive Streams, I imagine the Jetty reactive client would have two response methods with a callback: one for a Publisher of 1 and another for a Publisher of many.

Feel free to conclude this as you prefer. If you decide to leave it as is, we'll certainly address it on the Spring side, for Spring applications at least, but it seems a fix could be available more broadly. At the very least the Javadoc should state what is or is not done on cancel.

I still believe it is better to provide default handling for cancel than to ignore it. In the former case you may be surprised by a cancel you didn't expect resulting in the connection being closed, which can then be corrected (as we are going to do with spring-projects/spring-framework#25849). In the latter case it leads to the leak reported here.

@kimec

kimec commented Nov 24, 2020

Greetings,
could the title of this issue be changed to something that reflects the real problem discussed here: eager abort of in-flight/enqueued requests on an RS cancellation signal, or similar?

There is no thread leak in the Jetty reactive client or Spring's WebClient when they are used properly. We've been using jetty-reactive-httpclient with Spring's WebClient via JettyClientHttpConnector for close to a year now in a production deployment. We are doing on average some 4.5 million HTTP requests per day using both libraries. Our request rate is some 120 requests per second during peak hours, and we have never experienced a thread leak, or a memory leak for that matter, in either of the libraries.

I would also like to state that the error Max requests queued per destination 1024 exceeded for HttpDestination is perfectly OK. More often than not, it is a symptom that you are doing something wrong.

For instance, if you are using non-blocking I/O, why do you even care about an early timeout? All the more so if you want to protect the callee (the server being called) from traffic spikes. If anything, an early timeout increases the likelihood of your application hitting the callee/server with even more requests, because you reply to your caller that much faster with a negative response. The whole point of non-blocking I/O is to consume fewer resources in your application, apart from the file descriptor count, so that you can soak up much, much bigger concurrent traffic/load without hassle. You can surely afford to wait a little longer.

If you are dealing with a slow callee/server, you can use idleTimeout to drain the request queue more quickly. If the server accepted your request but is not responding, the connection will eventually become idle and Jetty will kill it, giving you clean abort semantics with a proper exception. You should use keep-alive, tweak the connection count, and tune the max requests queued per destination. All of those knobs are configurable; that is, in fact, exactly what we do in our application. Using the timeout() operator and expecting that RS semantics automagically solve everything feels a bit wrong to me. That being said, I am not that familiar with the RS spec, so I may be wrong.
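
A minimal sketch of tuning those knobs on Jetty's HttpClient (the values are purely illustrative, not recommendations):

import org.eclipse.jetty.client.HttpClient;

HttpClient httpClient = new HttpClient();
// Connections with no activity are closed after this period, so requests stuck
// on an unresponsive server eventually fail with a proper exception.
httpClient.setIdleTimeout(30_000); // 30 s, illustrative
// Bound the connection pool and the per-destination queue instead of
// cancelling requests early from the caller side.
httpClient.setMaxConnectionsPerDestination(64);
httpClient.setMaxRequestsQueuedPerDestination(1024);
httpClient.start();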

Also note that Jetty tries to be a good citizen. The examples tell you that you should consume what you asked for, even if the payload is big. I think that is a good default behaviour. It does not feel right that you should be able to abort in-flight requests on a whim just because you can with the timeout() operator. A real friend listens to the whole story, no matter how boring, one may say.

@sbordet
Member

sbordet commented Dec 3, 2020

I still think that forcing publishers to hold on to the current element in case a subscriber cancels the stream, with the unspecified meaning of "cancelling" the current element, is an ad-hoc interpretation of Reactive Streams that has no basis in the specification.

Having said that, I'm willing to add a boolean configuration parameter to ReactiveRequest.Builder, something like abortOnCancel, false by default, that the Spring WebFlux implementation can set to true.

Hopefully this satisfies everybody; if not, shout here rather soon -- I'll probably make a release early next week.
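
A sketch of how such an option could be used from application code (assuming the builder method ends up being named abortOnCancel, as proposed above; the URL is a placeholder):

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.reactive.client.ReactiveRequest;

HttpClient httpClient = new HttpClient();
httpClient.start();

Request request = httpClient.newRequest("https://example.com/resource");

// Opt in to aborting the underlying HTTP exchange when the content
// subscriber cancels (e.g. because of a downstream timeout() operator).
ReactiveRequest reactiveRequest = ReactiveRequest.newBuilder(request)
        .abortOnCancel(true)
        .build();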

@rstoyanchev

It's good to have the choice and we'll certainly put it to use in WebFlux. Thanks!

@sbordet
Member

sbordet commented Dec 3, 2020

@thomas-br can you please try the latest code in branch 1.1.x?
Do you still see leaked threads?

sbordet added a commit that referenced this issue Dec 3, 2020
Introduced abortOnCancel in ReactiveRequest.Builder to configure
the behavior in case of cancellation from the content subscriber.

Signed-off-by: Simone Bordet <[email protected]>
@sbordet sbordet closed this as completed Dec 12, 2020