Road To Heroku #2105
-
Thanks a lot for the detailed report. @bjfish can reproduce the gradual slowdown on macOS. For RSB with the Rack app, I did not observe such a slowdown on Linux.
I'm not sure it is more realistic; it probably ends up measuring much more how fast the kernel is at creating new connections, since the application response time is as small as it is here.
That's simply something for the RSB harness; running wrk for 150 seconds beforehand (for RSB I did it in batches of 10 seconds) should have the same effect.
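For example, a minimal sketch of that kind of manual warmup with plain `wrk` (the URL, port, and durations here are placeholders, not the exact values used in this thread):

```sh
# Throwaway warmup run so the JIT has time to compile the hot paths
# (9292 is only the default rackup/puma port; adjust to your setup).
wrk -t1 -c1 -d150s http://localhost:9292/

# The run whose numbers are actually reported.
wrk -t1 -c1 -d120s http://localhost:9292/
```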
-
In the flamegraph, 11.8% of the total time across all threads is spent in
-
@ssnickolay Do you have some instructions on how you run
-
I did some local testing with TruffleRuby native (also with 32 GB RAM) using different memory/cexts-lock settings, and this worked well for me:
I could also observe degraded performance over time, and this combination of settings alleviated the issue.
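The exact settings from that test are not shown above; purely as an illustration of how memory and C-extension-lock options can be set for TruffleRuby native (the heap size and flag values below are assumptions, not the settings referenced in this comment):

```sh
# Illustrative placeholders only, not the commenter's actual settings.
# --vm.Xmx sets the native image's max heap; --cexts-lock=false disables the
# global lock around C extension calls (an experimental option).
truffleruby --vm.Xmx4g --experimental-options --cexts-lock=false -S puma config.ru
```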
-
headius found an issue in monitor.rb related to interrupts; I wonder if it might be related.
-
I chose this title for the discussion to reflect the end goal. However, I expect it will not be as easy as I would like.
The catalyst for the question is my attempt to run a tiny production Cuba application with TruffleRuby on Heroku. I managed to do it, but the first load tests showed a significant performance difference between CRuby and TruffleRuby, not in TruffleRuby's favor. Based on the talk from the last Kaigi, I expected a different result.
Step by step, I simplified the application to find the reason for the slow performance, and this is what I have:
- `config.ru` is a copy-paste of https://github.com/eregon/rsb/blob/bench/rack_test_app/config.ru
- `wrk` is run without keep-alive. This is more realistic (a sketch of such a run follows this list).
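For reference, a minimal sketch of what such a no-keep-alive run looks like with plain `wrk` (port, duration, and connection count are placeholders, not the exact values used in these tests):

```sh
# Sending "Connection: Close" makes wrk open a new TCP connection per request,
# i.e. no HTTP keep-alive, which also exercises connection setup in the kernel.
wrk -t1 -c5 -d10s -H "Connection: Close" http://localhost:9292/
```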
The results (the `wrk` commands are run one after the other within the test, without restarting the server!):

CRuby 2.6.6 (Test#0)
TruffleRuby (native)

I used the locally built version, but also tested with `asdf` (there is no difference):

Test#1

Test#2
TruffleRuby (JVM)
The command:
Test#3
Test#4
Questions & Notes
1. For `Test#1` I added two `wrk` launches, but I did more (3-4), and the results of runs 2+ are pretty much the same (so this is not a warmup issue).
2. The RSB command:

   ```
   ./runners/current_ruby_cli.rb --warmup-seconds 150 --benchmark-seconds 120 --no-wrk-close-connection --wrk-concurrency 1 --wrk-connections 1 --server-ruby-opts '--experimental-options --cexts-lock=false --engine.CompilerThreads=-1 --engine.SplittingGrowthLimit=10.0 --engine.SplittingMaxNumberOfSplitNodes=200000000' rails puma
   ```

   so I used it with native (via `TRUFFLERUBYOPT="..."`); that is still not as fast as CRuby, but almost the same (see the sketch after this list).
3. How to pass `--warmup-seconds 150` to the puma server?
4. Perhaps the most important question: if you take a look at `Test#2` and `Test#4`, you'll see that under heavy load the server begins to degrade. Perhaps this is not related to the load itself but to the number of requests, and with low `wrk` settings (`-c5`) it just takes longer to reproduce. Ultimately, the server stops accepting any requests and gets stuck. It does not shut down on signals (`Ctrl+C`) and I have to kill the process. I reproduced this many times, and it does not depend on the TruffleRuby version (it is reproducible on both native and JVM).
5. I prepared a flamegraph of the execution with the native version. Unfortunately, I cannot make a flamegraph when the server gets stuck, because the report is not returned. https://drive.google.com/file/d/1tDpOF2p4H_7DRM_q1sq4UHcCrLcif8Vu/view?usp=sharing
6. This error is shown when starting puma in cluster mode: 1) on Heroku, 2) in Docker with Debian, 3) locally on macOS. Is that expected?
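On point 2, a minimal sketch of passing those options via `TRUFFLERUBYOPT` when launching puma directly (the puma invocation itself is an assumption, not the exact command used):

```sh
# Illustrative sketch: reuses the --server-ruby-opts flags from the RSB command
# quoted in point 2; the puma command line itself is a placeholder.
export TRUFFLERUBYOPT="--experimental-options --cexts-lock=false --engine.CompilerThreads=-1 --engine.SplittingGrowthLimit=10.0 --engine.SplittingMaxNumberOfSplitNodes=200000000"
bundle exec puma config.ru
```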