Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GILProtected possibly allowing concurrent access #218

Open
stuhood opened this issue Apr 21, 2020 · 7 comments
Open

GILProtected possibly allowing concurrent access #218

stuhood opened this issue Apr 21, 2020 · 7 comments

Comments

@stuhood
Copy link
Contributor

stuhood commented Apr 21, 2020

I believe that I have observed GILProtected (and thus, surprisingly, the GIL) allowing concurrent access to a RefCell that it is protecting, thus causing a panic.

To investigate, I added a small wrapper around the RefCell borrows, and observed:

...
22:46:18 [WARN] +++ 83a5671f6c6fbe9f ThreadId(3) generator_send_get
22:46:18 [WARN] --- 83a5671f6c6fbe9f
22:46:18 [WARN] +++ 81f2b10deb81ce84 ThreadId(3) key_for
22:46:18 [WARN] +++ a09bb2c19d0b0962 ThreadId(5) key_for
22:46:18 [ERROR] panic at 'already borrowed: BorrowMutError', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/cell.rs:878
22:46:18 [ERROR] Please set RUST_BACKTRACE=1, re-run, and then file a bug at https://github.com/pantsbuild/pants/issues.
22:46:18 [ERROR] panic at 'already mutably borrowed: BorrowError', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/cell.rs:798
22:46:18 [ERROR] Please set RUST_BACKTRACE=1, re-run, and then file a bug at https://github.com/pantsbuild/pants/issues.
22:46:18 [ERROR] panic at 'already mutably borrowed: BorrowError', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/cell.rs:798
22:46:18 [ERROR] Please set RUST_BACKTRACE=1, re-run, and then file a bug at https://github.com/pantsbuild/pants/issues.
22:46:18 [ERROR] panic at 'already mutably borrowed: BorrowError', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/cell.rs:798
22:46:18 [ERROR] Please set RUST_BACKTRACE=1, re-run, and then file a bug at https://github.com/pantsbuild/pants/issues.
22:46:18 [WARN] --- 81f2b10deb81ce84
...

What I believe that this is showing is that while ThreadId(3) [1] is holding the GIL with the acquisition id 81f2b10deb81ce84, and before it has dropped the RefCell (and thus definitely before it has dropped the GIL), an attempt made by a different thread (ThreadId(5)) to acquire the same RefCell (acquisition a09bb2c19d0b0962) fails. I do not know how to explain the further panics after the first one though.

This code is using cpython = 0.5 with the extension-module feature.


Thank you for creating this library. I've thoroughly enjoyed porting this CFFI code to use it, and this is (hopefully) the last issue to track down before committing the port.

[1] these are all tokio runtime threads, but am not sure that that is relevant

@gracinet
Copy link
Collaborator

Hi,

the GIL would protect you from concurrency between Python calls, not from concurrency arising from inner Rust code. Since you're spawning tokio threads, it looks to me like an indication you might fall in the latter case.

This is seen from a distance, I suppose we'd need to see the actual code to help you further.

@dgrunwald
Copy link
Owner

before it has dropped the RefCell (and thus definitely before it has dropped the GIL)

The part in the parentheses is not a valid conclusion. Every function taking a Python argument is allowed to temporarily release the GIL, as long as it is re-acquired before the function returns.
This happens every time you call into Python code, as the interpreter will regularly release the GIL in order to give other threads a chance to run.

Yes, there can be multiple threads concurrently having a reference to the interior of the GILProtected, but all except one of them will be executing a Python operation, and thus can't be using that reference at that moment. So data races aren't possible. Basically any call into the python interpreter that might release the GIL, can instead be understood as the python interpreter performing a recursive call back to another entry point of your crate. And of course unexpected recursive calls can lead to trouble with RefCell.

@dgrunwald
Copy link
Owner

Note that this applies not just to GILProtected, but also to the data members of a py_class!.

@stuhood
Copy link
Contributor Author

stuhood commented Apr 22, 2020

Interesting. Thank you for the explanation... it makes sense, although I think that I thought that all of the calls occurring under the critical section were "simple enough" (__hash__, etc) that the GIL would not be released.

Yes, there can be multiple threads concurrently having a reference to the interior of the GILProtected, but all except one of them will be executing a Python operation, and thus can't be using that reference at that moment.

I wonder how GILProtected could be marked or made more clearly not safe for this usecase. It seems that it can only be used with Python objects... but even with a Python object under the RefCell, it would still be possible for the RefCell to panic on you in the same way when multiple threads are involved. I double checked that the calls in this case were not recursive under the RefCell borrow (using debug_cell), so it's not clear what would prevent this issue from also occurring with a Python object.

@dgrunwald
Copy link
Owner

dgrunwald commented Apr 22, 2020

Any call into the Python interpreter could do one of the following:

  • release the last reference to an object with a __del__ implemented in Python (-> the Python code running within __del__ may call back into your extension module)
  • hand over the GIL to another thread (-> Python code on another thread may call into your extension module)
  • allocate memory, triggering the garbage collector (-> the runtime calls all your tp_traverse methods + calls __del__ of whatever objects end up being collected)

Python doesn't have the concept of unique ownership, so it's quite tricky to ensure a RefCell isn't borrowed multiple times.

Whenever possible, try to avoid calls into the Python interpreter while having borrowed a RefCell. Even the implicit PyObject::drop calls may cause unexpected recursion (if the __del__ implementation calls back into your extension module) or may allow other threads to run (if __del__ implementation involves any interpreted code or otherwise calls temporarily releases the GIL).

If you can't do this consistently for all uses of a RefCell, you'll have to handle the "already borrowed" error in some way.

@stuhood
Copy link
Contributor Author

stuhood commented Apr 23, 2020

Thank you for the explanation.

I think that given what you've said, it is impossible to use GILProtected<RefCell<T>> safely in the presence of concurrency, because the RefCell is not actually protected from concurrent access by other threads. Is that right? If that's the case, it would be ideal for GILProtected's bounds to be changed such that it was impossible to construct a GILProtected<RefCell<T>>.

Hypothetically, you could use GILProtected<Mutex<RefCell<T>>> to prevent races, and that would collapse to just GILProtected<Mutex<T>> (but I know from experimenting with this that deadlocks would be inevitable: see below). So the next question is why not just Mutex<T>...?


I had initially used GILProtected to try to avoid strange interleavings of the GIL and a RwLock by "merging" the locks under the GIL (which would otherwise cause deadlocks). Since it can't be used that way, I ended up applying the following pattern: https://github.com/pantsbuild/pants/blob/447339ea3247659b9d79c60ecb403147c0d9d224/src/rust/engine/src/externs.rs#L364-L373
...basically: acquire the GIL so that we can release it before acquiring the RwLock so that you can guarantee that the RwLock is the first thing acquired. Inlining usages, it ends up looking like:

let gil = Python::acquire_gil();
gil.python().allow_threads(|| {
  let interns = INTERNS.read();
  let gil = Python::acquire_gil();
  // .. do things with the interns and the gil
})

A "static" Python::allow_threads method would avoid the initial acquisition, which would be handy.

@dgrunwald
Copy link
Owner

GILProtected<RefCell<T>> can still work if the Python code takes care not to call into the extension module from multiple threads concurrently. I don't think this is a safety problem as in Rust's unsafe -- just that the mismatch between Python's and Rust's ownership semantics can fundamentally cause "already borrowed" errors, which the extension module could expose to the Python code via exceptions. As long as the Python code doesn't use an object from multiple threads / from __del__ impementations, this can work fine.

If you need to implement an extension module with shared state that can safely be used from multiple threads, you somehow need to deal with these issues. One option is to avoid calling back into the Python interpreter while having the RefCell borrowed.

Basically, as usual, the GIL only protects the C/Rust-level state, any higher-level state (that can be interacted with from Python) must use separate synchronization.

You could do that with a Python threading.Lock, or with a Rust mutex. You need to release the GIL while waiting to acquire the lock, otherwise you'll deadlock (threading.Lock does the same internally).

allow_threads internally uses PyEval_SaveThread+PyEval_RestoreThread, which requires holding the GIL. I guess there could be a function that uses PyGILState_Check to only release the GIL if it's currently acquired (such a function would need a new name and would only be available with Python 3).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants