
Failed response decoding breaks RPC connection #1409

Closed · nazar-pc opened this issue Jun 13, 2024 · 10 comments

@nazar-pc

After changing the JSON serialization for one of the data structures containing bytes to use a hexadecimal string instead of a decimal array, users that didn't upgrade the server noticed that the connection breaks in an interesting way.

The server is a Substrate node using jsonrpsee 0.22.5; the client uses 0.23.1.

When the client received a message like [0,0] instead of the expected "0000", it failed to decode the message with the following error:

Parse error: invalid type: sequence, expected a hex encoded string

However, after this, notifications for other subscriptions stopped arriving as well, without any visible errors on the client side.

I don't think the connection should break at all in this case, but even if it did, it should be more explicit.
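
For context, here is a minimal sketch of the kind of serde mismatch involved (a hypothetical HexBytes type, not the actual Subspace or jsonrpsee code): the upgraded client only accepts a hex-encoded string, while a non-upgraded server still sends a decimal byte array.

    // A minimal sketch with a hypothetical type (not the actual Subspace code)
    // showing why [0,0] no longer decodes once the field expects a hex string.
    use std::fmt;

    use serde::de::{self, Deserializer, Visitor};
    use serde::Deserialize;

    #[derive(Debug)]
    struct HexBytes(Vec<u8>);

    impl<'de> Deserialize<'de> for HexBytes {
        fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
            struct HexVisitor;

            impl<'de> Visitor<'de> for HexVisitor {
                type Value = HexBytes;

                fn expecting(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                    // This is the message that surfaces in the parse error above.
                    f.write_str("a hex encoded string")
                }

                fn visit_str<E: de::Error>(self, v: &str) -> Result<Self::Value, E> {
                    hex::decode(v).map(HexBytes).map_err(de::Error::custom)
                }
            }

            deserializer.deserialize_str(HexVisitor)
        }
    }

    fn main() {
        // New format from an upgraded server decodes fine.
        assert!(serde_json::from_str::<HexBytes>(r#""0000""#).is_ok());
        // Old format from a non-upgraded server fails with
        // "invalid type: sequence, expected a hex encoded string".
        assert!(serde_json::from_str::<HexBytes>("[0,0]").is_err());
    }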

@niklasad1
Member

> I don't think the connection should break at all in this case, but even if it did, it should be more explicit.

It shouldn't; are you sure that nothing else closed the connection?
Which logging level are you using?

Can you enable RUST_LOG=jsonrpsee=trace,soketto=trace (it's noisy, but for this case it should be fine)?

I can double-check tomorrow, but this should be fixed ASAP if it's really happening.

@nazar-pc
Author

That is what the user logs indicate, though it is not 100% guaranteed, of course.

I'm working on a reproduction myself, trying to understand exactly what is happening and why.

@nazar-pc
Author

Well, I tried and wasn't able to reproduce exactly the issue as described locally; it does generate errors, but the subscription remains intact.

@nazar-pc
Author

Okay, I think I've gotten closer to an answer. Unfortunately, jsonrpsee didn't produce useful logs on the server except the periodic jsonrpsee-server: Send ping.

What I hit is probably yet another variation of #1189.

From the client logs I see that 256 requests are made and it takes 22 seconds for the server to respond with 256 ~2 MiB responses (hex-encoded ~1 MiB binary payloads). Almost no subscription notifications made it through in the meantime. Once all responses are received, subscriptions start working again, but not all of them; some remain stuck forever.

Here are the logs: node-farmer-log.zip

The node is the Substrate-based node/server, the farmer is the client. On the client there are 18 subspace_subscribeSlotInfo subscriptions, one for each farm (farm_index = 0..=17).

One of the problematic subscriptions is for farm_index=13 (you can just grep for that). You'll see that slot notifications were received successfully every second until 2024-06-15T01:01:17, when the event that triggers a lot of the mentioned requests started.

The next 3 notifications, instead of arriving at 2024-06-15T01:01:18, 2024-06-15T01:01:19 and 2024-06-15T01:01:20, were received at 2024-06-15T01:01:22, 2024-06-15T01:01:25 and 2024-06-15T01:01:39. After that, this subscription died even though the RPC connection was alive and some other subscriptions continued to work fine.

I didn't expect that large RPC messages would cause so much trouble, especially because the method is tagged as blocking on the server, on the client different subscriptions and requests are made from different threads, and decoding is not that expensive: it is just ~0.5 GiB of very simple JSON.
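
For reference, each farm's consumption of the slot info subscription is roughly shaped like the sketch below (placeholder SlotInfo fields; the unsubscribe method name is an assumption, not the real API):

    // A rough sketch, not the actual farmer code: placeholder SlotInfo fields,
    // assumed unsubscribe method name, jsonrpsee 0.23-style client API.
    use jsonrpsee::core::client::SubscriptionClientT;
    use jsonrpsee::rpc_params;
    use jsonrpsee::ws_client::WsClientBuilder;
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct SlotInfo {
        slot_number: u64, // placeholder field
    }

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = WsClientBuilder::default()
            .build("ws://127.0.0.1:9944")
            .await?;

        // One such subscription exists per farm (farm_index = 0..=17).
        let mut slot_info_subscription = client
            .subscribe::<SlotInfo, _>(
                "subspace_subscribeSlotInfo",
                rpc_params![],
                "subspace_unsubscribeSlotInfo", // assumed unsubscribe method name
            )
            .await?;

        // Notifications normally arrive every second; during the burst of large
        // responses they stall, and for some farms they stop arriving entirely.
        while let Some(slot_info) = slot_info_subscription.next().await {
            println!("slot info: {:?}", slot_info?);
        }

        Ok(())
    }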

@niklasad1
Member

niklasad1 commented Jun 16, 2024

Thanks for the data and detailed report.

> Okay, I think I've gotten closer to an answer. Unfortunately, jsonrpsee didn't produce useful logs on the server except the periodic jsonrpsee-server: Send ping.

I think I removed that logging in Substrate after we moved those logs to a separate middleware, but it would probably be good to enable it again or add it as a CLI option.

> From the client logs I see that 256 requests are made and it takes 22 seconds for the server to respond with 256 ~2 MiB responses (hex-encoded ~1 MiB binary payloads). Almost no subscription notifications made it through in the meantime. Once all responses are received, subscriptions start working again, but not all of them; some remain stuck forever.

OK, I suspected that. You get that Subscription dropped... message on the server side, and then the subscription is just closed down by the server because the client "lagged"; you won't get notified about it on the client side. So, the client will probably wait "until the connection is closed" because nothing is sent on that subscription again.

It looks like you are using pipe_from_stream. If you can, use something else, such as sending out a notification when the subscription is closed (that broke polkadot-js, so I decided not to break it in Substrate), or just don't drop the subscription at all if that's OK.

Maybe this buffer is just too small for the ordinary use case in Substrate, because some other folks have reported similar problems.

As a workaround, you can increase the total message buffer for each connection on the server with --rpc-message-buffer-capacity-per-connection <RPC_MESSAGE_BUFFER_CAPACITY_PER_CONNECTION>.

On the client side, are you creating lots of subscriptions per connection, or doing something expensive such as synchronous I/O while processing the subscription items, such that the tokio runtime is busy with other stuff? The subscription dropping should only occur if the client can't keep up with the server, which seems odd to me unless the server is very busy serving lots of connections, making the network slow.

As you pointed out, this could also be happening because reading the responses in the client is single-threaded, but it is also worth trying two different tokio runtimes: one for your own logic and one for the jsonrpsee stuff... though maybe that's not that easy because jsonrpsee just does tokio::spawn.
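
To illustrate the failure mode without claiming anything about jsonrpsee's internals, here is a plain bounded-channel analogy of the "lagging client" behaviour: when the per-connection buffer is full, the producer gives up and the subscription is closed without the client being told, while the connection itself stays open.

    // Not jsonrpsee's internals, just a bounded-channel analogy for the
    // "lagging client" behaviour described above.
    use std::time::Duration;

    use tokio::sync::mpsc;
    use tokio::time::sleep;

    #[tokio::main]
    async fn main() {
        // Stand-in for the per-connection message buffer (Substrate default: 64).
        let (tx, mut rx) = mpsc::channel::<String>(64);

        // Producer: the server-side subscription task emitting slot notifications.
        let producer = tokio::spawn(async move {
            for slot in 0u64.. {
                if tx.try_send(format!("slot notification {slot}")).is_err() {
                    // Buffer full: the real server closes the subscription in this
                    // situation, and the client is never told about it.
                    eprintln!("client lagged, dropping subscription at slot {slot}");
                    return;
                }
                sleep(Duration::from_millis(10)).await;
            }
        });

        // Consumer: a "slow" client, stuck processing large responses for a while.
        sleep(Duration::from_secs(2)).await;
        while let Some(notification) = rx.recv().await {
            println!("{notification}");
        }
        let _ = producer.await;
    }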

@nazar-pc
Author

> OK, I suspected that. You get that Subscription dropped... message on the server side, and then the subscription is just closed down by the server because the client "lagged"; you won't get notified about it on the client side. So, the client will probably wait "until the connection is closed" because nothing is sent on that subscription again.

I see, that explains what I'm observing. I just didn't expect this to fail silently without breaking the whole connection.

> On the client side, are you creating lots of subscriptions per connection, or doing something expensive such as synchronous I/O while processing the subscription items, such that the tokio runtime is busy with other stuff? The subscription dropping should only occur if the client can't keep up with the server, which seems odd to me unless the server is very busy serving lots of connections, making the network slow.

Well, the number of subscriptions per connection isn't that large, mostly under 20. However, there are sometimes bursts of requests that cause these issues. For example, in addition to those 256 large requests/responses, the client will potentially make a few hundred more small requests. I'm not exactly sure how they are processed by the jsonrpsee server, but I see no reason why the CPUs our users are running are not capable of handling a few hundred requests per second very easily, even on a single connection, unless things are heavily sequential on the server (the client already makes requests from different tasks and even dedicated threads concurrently).

Specifically, I mentioned it took 22 seconds to get 256 responses, but making those requests without other concurrent requests and with a single subscription takes ~1.5 seconds even without using tokio::spawn on the client, and much less than a second with tokio::spawn for each request. This makes me think that something on the server could be improved to drastically increase its performance, something like spawning a task for each incoming request (unless that is done already).
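
Roughly, the client-side test I mean looks like this sketch (the method name and parameter are placeholders, not the real Subspace RPC API): 256 requests issued concurrently, one tokio task per request.

    // A rough sketch of issuing the 256 large requests concurrently, one tokio
    // task per request. The method name and parameter are placeholders.
    use std::sync::Arc;

    use jsonrpsee::core::client::ClientT;
    use jsonrpsee::rpc_params;
    use jsonrpsee::ws_client::WsClientBuilder;

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = Arc::new(
            WsClientBuilder::default()
                .build("ws://127.0.0.1:9944")
                .await?,
        );

        let handles: Vec<_> = (0..256u32)
            .map(|index| {
                let client = Arc::clone(&client);
                tokio::spawn(async move {
                    // Each response is a ~2 MiB hex-encoded payload.
                    client
                        .request::<String, _>("placeholder_largeRequest", rpc_params![index])
                        .await
                })
            })
            .collect();

        for handle in handles {
            let _payload = handle.await??;
        }

        Ok(())
    }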

@niklasad1
Member

niklasad1 commented Jun 17, 2024

> Well, the number of subscriptions per connection isn't that large, mostly under 20. However, there are sometimes bursts of requests that cause these issues. For example, in addition to those 256 large requests/responses, the client will potentially make a few hundred more small requests. I'm not exactly sure how they are processed by the jsonrpsee server, but I see no reason why the CPUs our users are running are not capable of handling a few hundred requests per second very easily, even on a single connection, unless things are heavily sequential on the server (the client already makes requests from different tasks and even dedicated threads concurrently).

> Specifically, I mentioned it took 22 seconds to get 256 responses, but making those requests without other concurrent requests and with a single subscription takes ~1.5 seconds even without using tokio::spawn on the client, and much less than a second with tokio::spawn for each request. This makes me think that something on the server could be improved to drastically increase its performance, something like spawning a task for each incoming request (unless that is done already).

So, the jsonrpsee server is already spawning each method call as a separate task, but I think you are reaching the max buffer size of the bounded channel and things are getting backpressured. By default in Substrate, --rpc-message-buffer-capacity-per-connection is 64 and that's global per connection; can you increase it to something very big and check whether you still get these?
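
For reference, here is a hedged sketch of configuring that buffer when building a jsonrpsee server directly; it assumes ServerBuilder::set_message_buffer_capacity is the knob that Substrate's --rpc-message-buffer-capacity-per-connection flag feeds into, so treat the exact wiring as an assumption rather than the literal Substrate code.

    // A hedged sketch of configuring the per-connection message buffer when
    // building a jsonrpsee server directly. Assumes
    // ServerBuilder::set_message_buffer_capacity is the knob that Substrate's
    // --rpc-message-buffer-capacity-per-connection flag feeds into.
    use jsonrpsee::server::{RpcModule, ServerBuilder};

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let server = ServerBuilder::default()
            // Substrate's default is 64 buffered messages per connection; a larger
            // value delays the point where a "lagging" subscription is dropped.
            .set_message_buffer_capacity(1024)
            .build("127.0.0.1:9944")
            .await?;

        // An empty module is enough to start the server for this sketch.
        let module = RpcModule::new(());
        let handle = server.start(module);
        handle.stopped().await;

        Ok(())
    }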

@nazar-pc
Author

> So, the jsonrpsee server is already spawning each method call as a separate task, but I think you are reaching the max buffer size of the bounded channel and things are getting backpressured. By default in Substrate, --rpc-message-buffer-capacity-per-connection is 64 and that's global per connection; can you increase it to something very big and check whether you still get these?

I can see how we can exceed that limit in this case; just a slight delay somewhere and we'll easily shoot above it. I can work with that, thanks! I'm still confused about why it takes so long to process those requests on the RPC server, though, but I do not have a reproduction on my machine; it is usually reported by users.

@nazar-pc
Author

We have refactored the code to multiplex and de-duplicate subscriptions on the client side, and it seems to have mostly resolved this issue. Feel free to close this if you don't see anything to do here, but it is confusing behavior that things silently break in such a way.
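
Conceptually, the multiplexing is along the lines of the sketch below (not the actual refactor): a single upstream subscription's notifications fanned out to per-farm consumers via a broadcast channel, so there is one RPC subscription per method instead of 18.

    // Not the actual refactor, just a sketch: one upstream subscription's
    // notifications fanned out to every farm via a tokio broadcast channel,
    // so there is a single RPC subscription per method instead of 18.
    use std::time::Duration;

    use tokio::sync::broadcast;
    use tokio::time::sleep;

    #[derive(Clone, Debug)]
    struct SlotInfo {
        slot_number: u64, // placeholder field
    }

    #[tokio::main]
    async fn main() {
        // The single upstream subscription feeds this channel...
        let (tx, _) = broadcast::channel::<SlotInfo>(16);

        // ...and each farm gets its own receiver instead of its own subscription.
        for farm_index in 0..18u16 {
            let mut rx = tx.subscribe();
            tokio::spawn(async move {
                while let Ok(slot_info) = rx.recv().await {
                    println!("farm {farm_index}: slot {}", slot_info.slot_number);
                }
            });
        }

        // Stand-in for notifications arriving on the upstream subscription.
        for slot_number in 0..3 {
            let _ = tx.send(SlotInfo { slot_number });
            sleep(Duration::from_millis(100)).await;
        }
    }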

@niklasad1
Member

Ok, cool.

Mm, I'll see whether I can find some time to look at how difficult it would be to terminate the connection instead of "silently dropping the subscription". That would be better.

Meanwhile, we added another option to pipe_from_stream in Substrate where you can use a ring buffer instead, which replaces the oldest item if the client can't keep up with the server rather than dropping the subscription. Not sure if it's acceptable for your use case though:

    sink.pipe_from_stream(stream, RingBuffer::new(3)).await;
