Failed response decoding breaks RPC connection #1409
Comments
It shouldn't. Are you sure that nothing else closed the connection? Can you enable `RUST_LOG=jsonrpsee=trace,soketto=trace` (it's noisy, but for this case it should be fine)? I can double-check tomorrow, but this should be fixed ASAP if it's really happening....
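For reference, a minimal sketch of enabling those trace targets programmatically with `tracing_subscriber` (assuming its `env-filter` feature is enabled); it has the same effect as setting `RUST_LOG` on the command line as suggested above:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Equivalent to RUST_LOG=jsonrpsee=trace,soketto=trace; an explicitly
    // set RUST_LOG still wins because we try the environment first.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("jsonrpsee=trace,soketto=trace"));

    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .init();

    // ... start the jsonrpsee client/server here ...
}
```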
That is what user logs indicate, though it is not 100% guaranteed of course. I'm working on a reproduction myself, trying to understand what is happening exactly and why.
Well, I tried and wasn't able to reproduce exactly that issue as described locally; it does generate errors, but the subscription remains intact.
Okay, I think I've gotten closer to the answer. Unfortunately, what I hit is probably yet another variation of #1189.

From client logs I see that 256 requests are made and it takes 22 seconds for the server to respond with 256 ~2MiB responses (hex-encoded ~1M binary payloads). Almost no subscription notifications made it through in the meantime. Once all responses are received, subscriptions start working again, but not all of them; some remain stuck forever. Here are the logs: node-farmer-log.zip. The node is a Substrate-based node/server, the farmer is the client.

On the client there are 18 … One of the problematic subscriptions is for … The next 3 notifications, instead of being received at 2024-06-15T01:01:18, 2024-06-15T01:01:19 and 2024-06-15T01:01:20, were received at 2024-06-15T01:01:22, 2024-06-15T01:01:25 and 2024-06-15T01:01:39. After that the subscription died, even though the RPC connection was alive and some other subscriptions continued to work fine.

I didn't expect that large RPC messages would cause so much trouble (especially because the method is tagged as …).
Thanks for the data and detailed report.
I think I removed that logging in Substrate after we moved those logs to a separate middleware, but it would probably be good to enable it again or add it as a CLI option.
Ok, I suspected that; you get that … It looks like you are using pipe_from_stream. If you can, use something else, such as … to send out a notification when it's closed (that broke polkadot-js, so I decided not to break it in Substrate), or just don't drop the subscription at all if that's ok for you.

Maybe this buffer is just too small for the ordinary use case in Substrate, because some other folks have reported similar problems. As a workaround, you can increase the total message buffer for each connection on the server by: …

On the client side, are you creating lots of subscriptions per connection, or doing something expensive/synchronous I/O while processing the subscription items, such that the tokio runtime is busy with other stuff? The subscription dropping should only occur if the client can't keep up with the server, which seems odd to me, unless the server is very busy serving lots of connections and the network becomes slow....

As you pointed out, this could also be happening because reading the responses in the client is single-threaded. It's also worth trying two different tokio runtimes, one for your own logic and one for the jsonrpsee stuff... but maybe that's not so easy, because jsonrpsee is just doing …
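A rough sketch of the server-side workaround mentioned above: bumping the per-connection message buffer. This assumes the jsonrpsee `ServerBuilder` in the version you use exposes `set_message_buffer_capacity`; treat the method name and value as illustrative and check the docs for your release.

```rust
use jsonrpsee::server::ServerBuilder;

// Sketch only: `set_message_buffer_capacity` is assumed to be the knob that
// controls the bounded per-connection message buffer; the default may be too
// small for bursts of large responses.
async fn build_server() -> Result<(), Box<dyn std::error::Error>> {
    let server = ServerBuilder::default()
        .set_message_buffer_capacity(1024)
        .build("127.0.0.1:9944")
        .await?;

    // ... register an RpcModule and call `server.start(module)` here ...
    let _ = server;
    Ok(())
}
```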
I see, that explains what I observe then. I just didn't expect this to fail silently and not break the whole connection.
Well, the number of subscriptions per connection isn't that large, mostly under 20. However, there are occasional bursts of requests that cause these issues. For example, in addition to those 256 large requests/responses, the client will potentially make a few hundred more small requests. I'm not exactly sure how they are processed by the jsonrpsee server, but I see no reason why the CPUs our users run are not capable of handling a few hundred requests per second very easily, even on a single connection, unless things are heavily sequential on the server (the client already makes requests from different tasks and even dedicated threads concurrently). Specifically, I mentioned it took 22 seconds to get 256 responses, but making those requests without other concurrent requests and with a single subscription takes ~1.5 seconds, even without using …
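For context, a simplified sketch of how a client can fire that kind of request burst concurrently over a single connection. The method name and URL are placeholders, and it assumes jsonrpsee's WebSocket client (`ws-client` feature) plus the `futures` and `serde_json` crates:

```rust
use futures::future;
use jsonrpsee::core::client::ClientT;
use jsonrpsee::rpc_params;
use jsonrpsee::ws_client::WsClientBuilder;

async fn burst_requests() -> Result<(), Box<dyn std::error::Error>> {
    let client = WsClientBuilder::default()
        .build("ws://127.0.0.1:9944")
        .await?;

    // 256 requests fired concurrently over the same connection. Responses
    // are read back by a single background task inside the client, so huge
    // responses can delay subscription notifications on that connection.
    let client_ref = &client;
    let calls = (0..256).map(|_| async move {
        client_ref
            .request::<serde_json::Value, _>("placeholder_method", rpc_params![])
            .await
    });

    let _responses = future::join_all(calls).await;
    Ok(())
}
```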
So, the jsonrpsee server is already spawning each method call as a separate task, but I think you are hitting the max buffer size of the bounded channel and things are getting backpressured. By default in Substrate …
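The failure mode being described boils down to a bounded channel plus a non-blocking send. A minimal tokio illustration of the same effect (not jsonrpsee's actual internals):

```rust
use tokio::sync::mpsc;

fn main() {
    // Small bounded buffer, standing in for the per-connection message buffer.
    let (tx, mut rx) = mpsc::channel::<String>(4);

    // Fast producer: a subscription source emitting notifications.
    for i in 0..10 {
        // `try_send` fails once the reader falls behind and the buffer is
        // full; a server that reacts by dropping the subscription will
        // silently stop delivering notifications for it.
        if let Err(e) = tx.try_send(format!("notification {i}")) {
            eprintln!("buffer full, subscription would be dropped: {e}");
            break;
        }
    }

    // Slow consumer: drains whatever made it into the buffer.
    while let Ok(msg) = rx.try_recv() {
        println!("delivered: {msg}");
    }
}
```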
I can see how we can exceed that limit in this case. Just a slight delay somewhere and we'll easily shoot above it. I can work with that, thanks! I'm still confused why it takes so long to process those requests on the RPC server though, but I don't have a reproduction on my machine; it is usually reported by users.
We have refactored the code to multiplex and de-duplicate subscriptions on the client side, and it seems to have mostly resolved this issue. Feel free to close this if you don't see anything to do here, but it is confusing behavior that things silently break in such a way.
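For anyone hitting the same thing, a rough sketch of that client-side mitigation under the stated assumptions: keep a single upstream RPC subscription per topic and fan it out locally, so duplicate subscribers don't multiply traffic on the connection (the type and method names here are made up for illustration):

```rust
use tokio::sync::broadcast;

/// One upstream RPC subscription per topic, fanned out to any number of
/// local consumers. Duplicate subscription requests reuse the same upstream
/// stream instead of opening another one on the connection.
struct SharedSubscription<T: Clone> {
    tx: broadcast::Sender<T>,
}

impl<T: Clone> SharedSubscription<T> {
    fn new(capacity: usize) -> Self {
        let (tx, _rx) = broadcast::channel(capacity);
        Self { tx }
    }

    /// Called by the single task that reads the upstream subscription.
    fn publish(&self, item: T) {
        // A lagging local consumer only loses its own items; the upstream
        // subscription and the other consumers are unaffected.
        let _ = self.tx.send(item);
    }

    /// Every local consumer gets its own receiver.
    fn subscribe(&self) -> broadcast::Receiver<T> {
        self.tx.subscribe()
    }
}
```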
Ok, cool. Mm, I'll see whether I can get some time to look at how difficult it would be to terminate the connection instead of "silently dropping the subscription". That would be better. Meanwhile, we added another option to pipe_from_stream in Substrate where you can use a ring buffer instead, which replaces the oldest item if the client can't keep up with the server rather than dropping the subscription. Not sure if it's acceptable for your use case though: `sink.pipe_from_stream(stream, RingBuffer::new(3)).await;`
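As a rough illustration of the ring-buffer strategy (not the actual Substrate implementation): keep at most N pending items and overwrite the oldest one when the consumer lags, so a slow consumer sees gaps but the subscription itself survives.

```rust
use std::collections::VecDeque;

/// Bounded buffer that overwrites the oldest pending item instead of
/// rejecting new ones.
struct RingBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        Self { items: VecDeque::with_capacity(capacity), capacity }
    }

    fn push(&mut self, item: T) {
        if self.items.len() == self.capacity {
            // Drop the oldest notification rather than the whole subscription.
            self.items.pop_front();
        }
        self.items.push_back(item);
    }

    fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}
```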
After changing the JSON serialization of one of the data structures containing bytes to use a hexadecimal string instead of a decimal array, users that didn't upgrade the server noticed that the connection breaks in an interesting way.
The server is a Substrate node using jsonrpsee 0.22.5; the client uses 0.23.1.
When the client received a message like `[0,0]` instead of the expected `"0000"`, it failed to decode the message with the following error returned: …

However, after this, notifications for other subscriptions stopped arriving as well, without any visible errors on the client side.
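A small sketch of the serialization difference behind the incompatibility: the old server emits bytes as a JSON array of numbers, while the new client expects a hex string. The field name is made up, and this assumes serde/serde_json plus the `hex` crate with its `serde` feature enabled:

```rust
use serde::{Deserialize, Serialize};

// Old wire format: bytes as a JSON array of numbers, e.g. [0,0].
#[derive(Serialize, Deserialize)]
struct OldPayload {
    bytes: Vec<u8>,
}

// New wire format: bytes as a hex string, e.g. "0000".
#[derive(Serialize, Deserialize)]
struct NewPayload {
    #[serde(with = "hex")]
    bytes: Vec<u8>,
}

fn main() {
    let old = serde_json::to_string(&OldPayload { bytes: vec![0, 0] }).unwrap();
    assert_eq!(old, r#"{"bytes":[0,0]}"#);

    let new = serde_json::to_string(&NewPayload { bytes: vec![0, 0] }).unwrap();
    assert_eq!(new, r#"{"bytes":"0000"}"#);

    // A client on the new format fails to decode the old server's output,
    // because it expects a string where an array was sent.
    if let Err(err) = serde_json::from_str::<NewPayload>(&old) {
        eprintln!("decode error: {err}");
    }
}
```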
I don't think the connection should break at all in this case, but even if it did, it should be more explicit.