Tokio runtime being blocked by nitox. #24

Open · thedodd opened this issue Feb 12, 2019 · 7 comments

thedodd commented Feb 12, 2019

I observed some very interesting behavior while using this crate in a few services of mine: running multiple tasks on the tokio runtime alongside a nitox client causes the other tasks to never be executed, or to receive very little execution time, while nitox is running.

Of course, first I suspected that it was my code, but I reviewed it thoroughly and did not find anything obvious which would block the runtime. After reviewing the networking code in this crate, it looks like parking_lot::RwLock is used as a fairly fundamental building block for the nats client and the connection multiplexer. This is what is causing the blocking.

The parking_lot crate does provide smaller and more efficient locking mechanisms than the std::sync primitives, but they still block the calling thread when attempting to acquire read/write access, and the documentation explicitly states this. Take write for example: parking_lot's RwLock is a type alias over lock_api::RwLock under the hood, and its docs state:

Locks this RwLock with exclusive write access, blocking the current thread until it can be acquired. [...]

The blocking locks are used primarily in the client, net/connection & net/connection_inner.
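To make the failure mode concrete, here is a toy reproduction (not nitox code) showing that a contended parking_lot write acquisition parks the calling thread; if that thread happens to be a tokio event-loop worker, every other task scheduled on it stalls in the meantime:

```rust
use parking_lot::RwLock;
use std::{sync::Arc, thread, time::Duration};

fn main() {
    let lock = Arc::new(RwLock::new(()));

    // Another thread grabs the write lock and holds it for a while...
    let held = lock.clone();
    thread::spawn(move || {
        let _guard = held.write();
        thread::sleep(Duration::from_secs(1));
    });
    thread::sleep(Duration::from_millis(100));

    // ...so this call parks the current thread until the lock is released.
    // Inside a future being polled by the runtime, that means the whole
    // worker thread (and every task on it) is stalled.
    let _guard = lock.write();
    println!("write lock finally acquired");
}
```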

what do we do about this?

First, I think that this is not a good thing. We definitely do not want nitox to block the event loop, because then it essentially defeats the purpose of using an event loop. Pretty sure we can all agree with that.

As a path forward, I would propose that we use a non-blocking, message-passing-based approach, a pattern which I have used quite successfully in the past. Details below.

message passing instead of locks

  • the public nats client type will continue to present roughly the same interface.
  • when the nats client is instantiated/connected, it will spawn a new private task onto the runtime which owns the connection. Just for reference here, we can call that spawned task the daemon.
  • when the daemon is first spawned, it will be given a futures mpsc receiver. This mpsc channel will communicate oneshot channels of Result<NatsClient, NatsError> or the like.
  • the public nats client will send oneshot receivers over the channel to request a clone of the nats connection which the daemon is managing.
  • because the daemon owns the connection, the memory does not need to be shared and therefore no locks are needed; this applies to the multiplexer as well. The daemon's communication channel (or channels) will be able to receive commands to update the multiplexer for new subscriptions &c. (see the sketch after this list).
  • we can also set up the public nats client in such a way that when it is dropped, it will issue a command to the daemon which will cause it to shut down and clean up its resources. This will resolve Clients never disconnect, even after manual drop. #22 & How to Disconnect Client #6 (which is not actually resolved).
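For concreteness, here is a minimal sketch of what the daemon side could look like, assuming futures 0.1 / tokio 0.1 (matching the futures::sync::{mpsc, oneshot} types above). Command, Subscribe, and spawn_daemon are hypothetical names for illustration, not nitox's actual API:

```rust
use futures::{Stream, sync::{mpsc, oneshot}};

// Commands sent from the public client to the daemon. Each request carries a
// oneshot sender so the daemon can reply directly to the caller.
enum Command {
    Subscribe {
        subject: String,
        reply: oneshot::Sender<Result<(), ()>>,
    },
    Shutdown,
}

// Spawns the daemon onto the runtime and hands back the sender that the
// public client keeps (and clones) in order to talk to it.
fn spawn_daemon() -> mpsc::UnboundedSender<Command> {
    let (tx, rx) = mpsc::unbounded::<Command>();

    let daemon = rx.for_each(|cmd| match cmd {
        Command::Subscribe { subject, reply } => {
            // The daemon exclusively owns the connection and the multiplexer,
            // so mutating them here needs no locking at all.
            let _ = subject;
            let _ = reply.send(Ok(()));
            Ok(())
        }
        // Returning an error ends the loop; the owned connection and
        // multiplexer are dropped and cleaned up.
        Command::Shutdown => Err(()),
    });

    tokio::spawn(daemon);
    tx
}
```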
OtaK (Contributor) commented Feb 12, 2019

Maybe we could simply enable lock fairness as described here: https://docs.rs/lock_api/0.1.5/lock_api/struct.RwLockWriteGuard.html#method.unlock_fair

I think it's worth trying when the lock guard is held for too long.
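For reference, a rough sketch of what the fair unlock looks like with parking_lot (whose guards are lock_api re-exports); the lock and the value here are just placeholders:

```rust
use parking_lot::{RwLock, RwLockWriteGuard};

fn main() {
    let state = RwLock::new(0u32);

    let mut guard = state.write();
    *guard += 1;

    // Fair unlock: the lock is handed to the next queued reader/writer instead
    // of letting this thread barge back in and re-acquire it immediately.
    RwLockWriteGuard::unlock_fair(guard);
}
```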

OtaK (Contributor) commented Feb 12, 2019

Also, I want to apologize for the relative inactivity, we are currently super swamped on R&D here and can't afford extra time to maintain code that actually already works super well for us in production.

We'll get back to nitox super hard when we're working on backend stuff on the current project :)

thedodd (Author) commented Feb 12, 2019

I definitely understand being swamped. I was able to find a workaround, so I'm not blocked by this.

As far as unlock_fair goes, that won't actually solve the problem. It may reduce the amount of time the event loop is blocked, but it only takes effect when there are readers and writers stacked up behind an already acquired write. The fundamental problem is that it is a thread-blocking primitive inside of a parent abstraction which requires that nothing block (in order for that abstraction to remain effective).

I would sooner reach for tokio_threadpool::blocking, except that we can't really make the assumption that users will be using a tokio threadpool; they may be using a single-threaded pattern. It would actually be better to just use a blocking network stack, keep the connection on a single thread, and then expose a futures based interface on top of that.
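For context, a sketch of what the tokio_threadpool::blocking escape hatch looks like; it only works when the wrapping future is polled from inside a threadpool worker, which is exactly the assumption we can't make here:

```rust
use futures::future::poll_fn;
use tokio_threadpool::{blocking, BlockingError};

// Wraps a thread-blocking section (e.g. a blocking lock acquisition) so the
// threadpool can compensate by waking up another worker. Polling this outside
// a tokio_threadpool runtime (e.g. on current_thread) yields a BlockingError.
fn blocking_section() -> impl futures::Future<Item = (), Error = BlockingError> {
    poll_fn(|| {
        blocking(|| {
            // ...thread-blocking work goes here...
        })
    })
}
```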

However, the channels based approach mentioned above is very simple to implement, and not having to use locks tends to simplify the code. IMHO, it would be the way to go.

thedodd (Author) commented Feb 12, 2019

@OtaK also, I'm happy to implement the proposed channels based pattern. I just don't want to go vigilante and implement a bunch of changes which the maintainers don't want :). Let me know what you all think of the design. I'm happy to flesh it out a bit more if that would help.

OtaK (Contributor) commented Feb 21, 2019

I've looked a bit into it and:

  • The usages of RwLock in connection.rs are not a problem since the non-blocking try_{read/write} versions are used
  • The other usages of blocking write are not used in streams, and are an acceptable cost since they're limited to infrequent, one-off operations.
  • The only problem I can see is in NatsClient::subscribe where we setup the stream, and for each message incoming, we do lock the subs_tx on the multiplexer to check for automatic max message count unsubscribe. That's where we should act I think. Maybe we could make use of the stream_cancel work started in Clients never disconnect, even after manual drop. #22?
    Like, on max count reached, drop the Trigger and that would also fuse the stream as a consequence (see the sketch below).
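For reference, a minimal sketch of that idea with the stream_cancel crate; the stream here is just a stand-in for the per-subscription message stream built in subscribe, and the exact wiring into the multiplexer is left out:

```rust
use futures::stream;
use stream_cancel::Valved;

fn main() {
    // Stand-in for the per-subscription message stream.
    let messages = stream::iter_ok::<_, ()>(vec![1u32, 2, 3]);

    // Wrap the stream; dropping `trigger` terminates `messages` for whoever is
    // consuming it, with no lock on the multiplexer in the per-message path.
    let (trigger, messages) = Valved::new(messages);

    // e.g. once the max message count has been reached:
    drop(trigger);
    let _ = messages;
}
```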

As much as I like message passing, I think that the overhead it'd create and the introduced complexity are kinda not worth it if it's possible to fix this problem by automatically dropping streams with the last option.

thedodd (Author) commented Mar 11, 2019

@OtaK so, using a futures::sync::{mpsc, oneshot} based approach may actually simplify things quite a bit, as no locking would be needed inside of the type which owns the connections and the multiplexer.

The use of channels would be completely opaque to users of the crate as well. The interface could arguably stay almost exactly the same. It would only be used internally for communication between the client instances and the "daemon" type which actually owns the connection.

A big benefit of this as well is that we wouldn't have to worry about blocking issues at all. The whole thing would be purely non-blocking. The main reason I am still pushing towards this idea is because the workaround for dealing with code which may periodically block is to spawn it on a separate tokio runtime instance, which can be a bit of a resource overhead. And if the library isn't actually designed at its base to be non-blocking, subtle issues may come up which cause requests to be dropped for no obvious reason, and then days of debugging ensue ☠️ (which is how I discovered this issue).
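To make that concrete, a minimal sketch of the public side (hypothetical names again, not nitox's current code): the method stays future-based, and internally it only pushes a command over the channel to the daemon that owns the connection:

```rust
use futures::{Future, sync::{mpsc, oneshot}};

// Hypothetical command enum shared with the daemon task.
pub enum Command {
    Publish {
        subject: String,
        payload: Vec<u8>,
        reply: oneshot::Sender<Result<(), ()>>,
    },
}

pub struct Client {
    daemon_tx: mpsc::UnboundedSender<Command>,
}

impl Client {
    // Same shape as a lock-based API, but no lock is ever taken here: enqueue
    // the command, then wait on the oneshot reply from the daemon.
    pub fn publish(&self, subject: String, payload: Vec<u8>) -> impl Future<Item = (), Error = ()> {
        let (reply, on_reply) = oneshot::channel();
        let _ = self
            .daemon_tx
            .unbounded_send(Command::Publish { subject, payload, reply });
        // A dropped daemon (oneshot::Canceled) is mapped to a plain error.
        on_reply.then(|res| res.unwrap_or(Err(())))
    }
}
```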

thedodd (Author) commented Mar 13, 2019

Either way, I'm definitely thinking that the stream cancel & streaming support PRs should land first. Maybe even a quick update to the code base overall to cut over to the 2018 edition. We can use cargo fix for that.

After those items land, I'll experiment with both options: one that just updates to using try_(read|write) on the locks, and another which moves the multiplexer & the main connection type into a single owner type and uses mpsc to manipulate them, completely sidestepping the need for locks.
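For the first option, the non-blocking attempts would look roughly like this with parking_lot (placeholder data, not the real multiplexer state):

```rust
use parking_lot::RwLock;

fn main() {
    let subs = RwLock::new(Vec::<String>::new());

    // try_write returns immediately with None instead of parking the thread,
    // so the caller can yield back to the event loop and retry on a later poll.
    match subs.try_write() {
        Some(mut guard) => guard.push("some.subject".to_string()),
        None => { /* e.g. return Async::NotReady and poll again later */ }
    }
}
```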
