
admin interface hangs on 24-core machine #523

Closed
jbyler opened this issue Apr 3, 2014 · 10 comments · May be fixed by spotify/helios#1302 or spotify/helios#1304
jbyler commented Apr 3, 2014

Primary symptom: on a machine with 22 or more cores and the default server configuration, the admin interface accepts TCP connections and then never processes the requests, causing a browser to hang forever. This can happen on any machine given the wrong config parameters.

Details:

  • Server thread configuration must adhere to the following invariants, determined through debugging and trial and error:
    • maxThreads > ∑ (acceptorThreads + selectorThreads) over all applicationConnectors
    • adminMaxThreads > ∑ (acceptorThreads + selectorThreads) over all adminConnectors
  • Presumably what this means is that maxThreads includes all the acceptorThreads and selectorThreads, and what's left over is used for handling requests. If there's nothing left over, the requests queue up and never get handled.
  • If this invariant is not satisfied, the applicationConnectors or the adminConnectors (respectively) will accept TCP connections but never handle them.
  • Dropwizard's defaults for maxThreads (1024) and adminMaxThreads (64) are fixed, while the defaults for acceptorThreads (#CPUs/2) and selectorThreads (#CPUs) vary from machine to machine.
  • So on a machine with 22 or more cores and an admin interface with 2 connectors (one for HTTP, one for HTTPS), the invariant doesn't hold for the admin interface. With 342 or more cores and 2 connectors, it won't hold for the application interface.

There are potentially three parts to this bug (plus an alternative):

  • Dropwizard's defaults should probably be chosen so that they work on all modern hardware.
  • Dropwizard should probably validate the configuration and fail to start if the invariant isn't satisfied, rather than silently hanging.
  • Perhaps the invariant should be documented.
  • Alternatively, the meaning of the parameters could be changed so that they are independent. Using maxThreads only for processing requests (and not for selector and acceptor threads) might be more intuitive.
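The fail-fast idea in the second bullet can be sketched as a small standalone check. All names here are hypothetical; this is not Dropwizard's actual validation API, just the arithmetic of the invariant above:

```java
import java.util.List;

public class ThreadBudgetCheck {

    /**
     * Returns the number of threads left over for request handling, or
     * throws if the invariant maxThreads > sum(acceptors + selectors)
     * over all connectors is violated. Each connector is given as
     * {acceptorThreads, selectorThreads}.
     */
    static int requestThreads(int maxThreads, List<int[]> connectors) {
        int overhead = 0;
        for (int[] c : connectors) {
            overhead += c[0] + c[1];
        }
        if (maxThreads <= overhead) {
            throw new IllegalStateException(
                "maxThreads=" + maxThreads + " leaves no threads for requests"
                + " (acceptors+selectors=" + overhead + ")");
        }
        return maxThreads - overhead;
    }

    public static void main(String[] args) {
        // 24-core defaults: acceptors = 24/2 = 12, selectors = 24 per connector.
        // Two admin connectors consume 2 * (12 + 24) = 72 > adminMaxThreads (64).
        try {
            requestThreads(64, List.of(new int[]{12, 24}, new int[]{12, 24}));
        } catch (IllegalStateException e) {
            System.out.println("invalid: " + e.getMessage());
        }
        // A single connector on the same machine leaves 64 - 36 = 28 threads.
        System.out.println("remaining=" + requestThreads(64, List.of(new int[]{12, 24})));
    }
}
```

Running this prints the failure for the two-connector admin case and `remaining=28` for the single-connector case, matching the 22-core threshold described above.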

Tested with version v0.7.0.rc3.

How to reproduce: use the dropwizard-example application on a 24-core machine, or use the following server config on any machine:

server:
  minThreads: 2
  maxThreads: 2
  applicationConnectors:
    - type: http
      port: 8080
      acceptorThreads: 1
      selectorThreads: 1
  adminConnectors:
    - type: http
      port: 8081
      acceptorThreads: 1
      selectorThreads: 1

App starts successfully, but requests to either the application or the admin interface hang.
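For comparison, a minimal variant of the repro config that satisfies the invariant (assuming, per the accounting above, that maxThreads only needs to exceed the application connectors' acceptor+selector total, and that the default adminMaxThreads of 64 comfortably covers the admin connector's 2 threads):

```yaml
server:
  minThreads: 2
  maxThreads: 8        # > 1 acceptor + 1 selector, leaving threads for requests
  applicationConnectors:
    - type: http
      port: 8080
      acceptorThreads: 1
      selectorThreads: 1
  adminConnectors:
    - type: http
      port: 8081
      acceptorThreads: 1
      selectorThreads: 1
```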

nicktelford added the bug label Apr 3, 2014
reines commented May 12, 2014

So this behaviour actually seems to be coming from Jetty. Specifically, the ServerConnector accepts a single Executor that is used both to run the acceptor and selector loops and to handle requests.

The only solutions (without changes to Jetty) that I can see are:

  • Validating the configuration and failing to start.
  • Counting instances of HttpConnectorFactory when constructing the server's thread pool and adjusting its size accordingly. This is clunky because it requires the ServerFactory to care about the type and internal workings of the connectors.
  • Having the HttpConnectorFactory, during building, grow the server's thread pool by enough threads for its acceptors and selectors. This is clunky because it requires the HttpConnectorFactory to care about the type and internal workings of the ThreadPool.

A patch to allow Jetty to use different executors for handling requests versus acceptors and selectors would be fairly simple; perhaps that is the best route here?
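The separate-executors idea can be illustrated with plain java.util.concurrent. This is only a sketch of the concept; the real fix would live inside Jetty's ServerConnector, which takes a single Executor for everything:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SeparateExecutors {
    public static void main(String[] args) throws Exception {
        int acceptors = 1, selectors = 1;
        // Dedicated, exactly-sized pool for the long-running accept/select loops.
        ExecutorService ioPool = Executors.newFixedThreadPool(acceptors + selectors);
        // Request-handling pool sized independently (Dropwizard's maxThreads).
        ExecutorService workerPool = Executors.newFixedThreadPool(4);

        // The long-running loops permanently occupy the I/O pool...
        for (int i = 0; i < acceptors + selectors; i++) {
            ioPool.submit(() -> {
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            });
        }
        // ...while request tasks still make progress, because they use a
        // different pool and cannot be starved by the accept/select loops.
        Future<String> response = workerPool.submit(() -> "handled");
        System.out.println(response.get(1, TimeUnit.SECONDS));

        ioPool.shutdownNow();
        workerPool.shutdown();
    }
}
```

With a shared pool of size 2, the same two loop tasks would consume every thread and the "handled" task would never run, which is exactly the hang described in this issue.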

reines commented Jun 10, 2014

I've submitted a bug for this against Jetty, though it's arguable what the correct solution is, and indeed whether this is a bug in Jetty at all or whether they expect us to be doing this calculation ourselves.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=436987

nicktelford commented:

@reines I haven't had a chance to look, but does this issue still apply in the latest version of Jetty (9.2.0)?

reines commented Jun 10, 2014

I believe so; at least it does in master. I didn't specifically check the 9.2.0 tag.

nicktelford commented:

OK, let's see what the Jetty folks say and take things from there.

darkjune commented Jul 4, 2014

I think the original design of the thread pool is intentional: share all threads in one pool to avoid thread context switching costs.

bramp commented Aug 10, 2014

It looks like Jetty has changed their defaults to be sensible (jetty/jetty.project@2d52280), resolving this issue in jetty-9.2.2.v20140723.

jplock commented Sep 17, 2014

This should be fixed when #453 gets merged in

joschi commented Sep 28, 2014

The issue should be fixed in the current master which is using Jetty 9.2.3.v20140905 (commit 93d3ee5).

Please add a comment if the problem is still occurring.

joschi closed this as completed Sep 28, 2014
joschi added this to the 0.8.0 milestone Sep 28, 2014
joschi self-assigned this Sep 28, 2014
vishvananda commented:

This seems to be fixed for maxThreads but not adminMaxThreads.

If maxThreads is set too low, it errors as expected.

If adminMaxThreads is set far too low (say, 2), the service will happily start without any error message from Jetty and then be unable to handle requests. It also hangs on shutdown for the full 30-second timeout while waiting for Jetty to shut down.

I suspect either the admin connector takes a different route through Jetty and skips the exception, or the exception is getting swallowed somewhere on the Dropwizard side.

Using Dropwizard 0.9.1, FWIW.

arteam added a commit that referenced this issue Jan 17, 2017
The issue is reported in #523. The OS on CI Linux machines reports an
astonishingly large number of CPUs (~128). Jetty changed their algorithm
for calculating the maximum number of selector and acceptor threads,
see `eclipse/jetty.project@2d52280`. But in Dropwizard we still set
the number of acceptors to #CPUs/2 and selectors to #CPUs. It looks like
that's too much for Jetty, and it can't handle such a large number of
threads. Because of that we have random errors on our CI environment
and, what's worse, possibly hurt users who actually run their
applications on machines with many CPUs. A solution is to delegate
calculating these counts to Jetty (which has saner defaults) and
document the defaults.
jplock pushed a commit that referenced this issue Jan 17, 2017
* Upgrade Jetty to 9.4.0.v20161208

Hopefully, it's just a version bump.

Resolves #1874.

* Increase waiting interval for batch HTTP/2 test

* Delegate calculation of acceptor and selector thread counts to Jetty

davidxia added commits to spotify/helios that referenced this issue Sep 28 and Oct 6, 2020