
admin interface hangs on 24-core machine #523

Closed
jbyler opened this issue Apr 3, 2014 · 10 comments · May be fixed by spotify/helios#1302 or spotify/helios#1304
jbyler commented Apr 3, 2014

Primary symptom: on a machine with 22 or more cores and the default server configuration, the admin interface accepts TCP connections and then never processes the requests, causing a browser to hang forever. This can happen on any machine given the wrong config parameters.

Details:

  • Server thread configuration must adhere to the following invariants, determined through debugging and trial and error:
    • maxThreads > ∑ (acceptorThreads + selectorThreads) over all applicationConnectors
    • adminMaxThreads > ∑ (acceptorThreads + selectorThreads) over all adminConnectors
  • Presumably what this means is that maxThreads includes all the acceptorThreads and selectorThreads, and what's left over is used for handling requests. If there's nothing left over, the requests queue up and never get handled.
  • If this invariant is not satisfied, the applicationConnectors or the adminConnectors (respectively) will accept TCP connections but never handle them.
  • Dropwizard's defaults for maxThreads (1024) and adminMaxThreads (64) are fixed, while the defaults for acceptorThreads (#CPUs/2) and selectorThreads (#CPUs) vary from machine to machine.
  • So on a machine with 22 or more cores and an admin interface with 2 connectors (one for HTTP, one for HTTPS), the invariant doesn't hold for the admin interface. With 342 or more cores and 2 connectors, it won't hold for the application interface.

There are potentially three parts to this bug (plus an alternative):

  • Dropwizard's defaults should probably be chosen so that they work on all modern hardware.
  • Dropwizard should probably validate the configuration and fail to start if the invariant isn't satisfied, rather than silently hanging.
  • Perhaps the invariant should be documented.
  • Alternatively, the meaning of the parameters could be changed so that they are independent. Using maxThreads only for processing requests (and not for selector and acceptor threads) might be more intuitive.
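The fail-fast idea in the second bullet can be sketched as a small standalone check. All names here are hypothetical; this is not Dropwizard's actual validation API, just the arithmetic of the invariant above:

```java
import java.util.List;

public class ThreadBudgetCheck {

    /**
     * Returns the number of threads left over for request handling, or
     * throws if the invariant maxThreads > sum(acceptors + selectors)
     * over all connectors is violated. Each connector is given as
     * {acceptorThreads, selectorThreads}.
     */
    static int requestThreads(int maxThreads, List<int[]> connectors) {
        int overhead = 0;
        for (int[] c : connectors) {
            overhead += c[0] + c[1];
        }
        if (maxThreads <= overhead) {
            throw new IllegalStateException(
                "maxThreads=" + maxThreads + " leaves no threads for requests"
                + " (acceptors+selectors=" + overhead + ")");
        }
        return maxThreads - overhead;
    }

    public static void main(String[] args) {
        // 24-core defaults: acceptors = 24/2 = 12, selectors = 24 per connector.
        // Two admin connectors consume 2 * (12 + 24) = 72 > adminMaxThreads (64).
        try {
            requestThreads(64, List.of(new int[]{12, 24}, new int[]{12, 24}));
        } catch (IllegalStateException e) {
            System.out.println("invalid: " + e.getMessage());
        }
        // A single connector on the same machine leaves 64 - 36 = 28 threads.
        System.out.println("remaining=" + requestThreads(64, List.of(new int[]{12, 24})));
    }
}
```

Running this prints the failure for the two-connector admin case and `remaining=28` for the single-connector case, matching the 22-core threshold described above.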

Tested with version v0.7.0.rc3.

How to reproduce: use the dropwizard-example application on a 24-core machine, or use the following server config on any machine:

server:
  minThreads: 2
  maxThreads: 2
  applicationConnectors:
    - type: http
      port: 8080
      acceptorThreads: 1
      selectorThreads: 1
  adminConnectors:
    - type: http
      port: 8081
      acceptorThreads: 1
      selectorThreads: 1

App starts successfully, but requests to either the application or the admin interface hang.
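For comparison, a minimal variant of the repro config that satisfies the invariant (assuming, per the accounting above, that maxThreads only needs to exceed the application connectors' acceptor+selector total, and that the default adminMaxThreads of 64 comfortably covers the admin connector's 2 threads):

```yaml
server:
  minThreads: 2
  maxThreads: 8        # > 1 acceptor + 1 selector, leaving threads for requests
  applicationConnectors:
    - type: http
      port: 8080
      acceptorThreads: 1
      selectorThreads: 1
  adminConnectors:
    - type: http
      port: 8081
      acceptorThreads: 1
      selectorThreads: 1
```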

nicktelford added the bug label Apr 3, 2014
reines commented May 12, 2014

So this behaviour actually seems to be coming from Jetty. Specifically, the ServerConnector accepts a single Executor that is used both to run the acceptor and selector loops and to handle requests.

The only solutions (without changes to Jetty) that I can see are:

  • Validating the configuration and failing to start.
  • Counting instances of HttpConnectorFactory when constructing the server's thread pool and adjusting its size accordingly. This is clunky because it requires the ServerFactory to care about the type and internal workings of the connectors.
  • Having the HttpConnectorFactory, during building, grow the server's thread pool by enough threads for its acceptors and selectors. This is clunky because it requires the HttpConnectorFactory to care about the type and internal workings of the ThreadPool.

A patch to allow Jetty to use different executors for handling requests versus acceptors and selectors would be fairly simple; perhaps that is the best route here?
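The separate-executors idea can be illustrated with plain java.util.concurrent. This is only a sketch of the concept; the real fix would live inside Jetty's ServerConnector, which takes a single Executor for everything:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SeparateExecutors {
    public static void main(String[] args) throws Exception {
        int acceptors = 1, selectors = 1;
        // Dedicated, exactly-sized pool for the long-running accept/select loops.
        ExecutorService ioPool = Executors.newFixedThreadPool(acceptors + selectors);
        // Request-handling pool sized independently (Dropwizard's maxThreads).
        ExecutorService workerPool = Executors.newFixedThreadPool(4);

        // The long-running loops permanently occupy the I/O pool...
        for (int i = 0; i < acceptors + selectors; i++) {
            ioPool.submit(() -> {
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            });
        }
        // ...while request tasks still make progress, because they use a
        // different pool and cannot be starved by the accept/select loops.
        Future<String> response = workerPool.submit(() -> "handled");
        System.out.println(response.get(1, TimeUnit.SECONDS));

        ioPool.shutdownNow();
        workerPool.shutdown();
    }
}
```

With a shared pool of size 2, the same two loop tasks would consume every thread and the "handled" task would never run, which is exactly the hang described in this issue.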

reines commented Jun 10, 2014

I've submitted a bug for this against Jetty, though it's arguable what the correct solution is, and indeed whether this is a bug in Jetty at all or whether they expect us to be doing this calculation ourselves.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=436987

nicktelford commented:

@reines I haven't had a chance to look, but does this issue still apply in the latest version of Jetty (9.2.0)?

reines commented Jun 10, 2014

I believe so; at least it does in master. I didn't specifically check the 9.2.0 tag.

nicktelford commented:

OK, let's see what the Jetty folks say and take things from there.

darkjune commented Jul 4, 2014

I think the original design of the thread pool is intentional: share all threads in one pool to avoid thread context switching costs.

bramp commented Aug 10, 2014

It looks like Jetty has changed their defaults to be sensible (jetty/jetty.project@2d52280), resolving this issue in jetty-9.2.2.v20140723.

jplock commented Sep 17, 2014

This should be fixed when #453 gets merged in

joschi commented Sep 28, 2014

The issue should be fixed in the current master which is using Jetty 9.2.3.v20140905 (commit 93d3ee5).

Please add a comment if the problem is still occurring.

joschi closed this as completed Sep 28, 2014
joschi added this to the 0.8.0 milestone Sep 28, 2014
joschi self-assigned this Sep 28, 2014
vishvananda commented:

This seems to be fixed for maxThreads but not adminMaxThreads.

If maxThreads is set too low, it errors as expected.

If adminMaxThreads is set far too low (say, 2), the service will happily start without any error message from Jetty and then be unable to handle requests. It also hangs on shutdown for the full 30-second timeout while waiting for Jetty to shut down.

I suspect either the admin connector takes a different route through Jetty and skips the exception, or the exception is getting swallowed somewhere on the Dropwizard side.

Using Dropwizard 0.9.1, FWIW.

arteam added a commit that referenced this issue Jan 17, 2017
The issue is reported in #523. The OS on CI Linux machines reports an
astonishingly large number of CPUs (~128). Jetty changed their algorithm
for calculating the maximum number of selector and acceptor threads,
see `eclipse/jetty.project@2d52280`. But in Dropwizard we still set
the number of acceptors to #CPUs/2 and selectors to #CPUs. It looks like
that's too much for Jetty, and it can't handle such a large number of
threads. Because of that we have random errors on our CI environment
and, what's worse, possibly hurt users who actually run their
applications on machines with many CPUs. A solution is to delegate
calculating these counts to Jetty (which has saner defaults) and
document the defaults.
jplock pushed a commit that referenced this issue Jan 17, 2017
* Upgrade Jetty to 9.4.0.v20161208

Hopefully, it's just a version bump.

Resolves #1874.

* Increase waiting interval for batch HTTP/2 test

* Delegate calculation of acceptor and selector thread counts to Jetty

davidxia added commits to spotify/helios that referenced this issue Sep 28 and Oct 6, 2020