
Use a mutex to lock heap memory access #94

Merged: 5 commits, merged into numpy:main on Nov 10, 2023

Conversation
Conversation

ngoldbaum (Member)

This adds an allocator_lock field to the StringDTypeObject struct. This is a PyThread_type_lock mutex that must be acquired before manipulating the allocator.

I defined NPY_STRING_ACQUIRE_ALLOCATOR (as well as variants that accept 2 and 3 descriptors) and NPY_STRING_RELEASE_ALLOCATOR macros that handle locking and unlocking the mutex.
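For reference, a minimal sketch of what that looks like (the exact field layout and macro bodies in the PR may differ; the allocator member type here is a placeholder):

#include <Python.h>
#include <pythread.h>            /* PyThread_type_lock and friends */
#include "numpy/arrayobject.h"   /* PyArray_Descr */

typedef struct {
    PyArray_Descr base;
    /* ... other stringdtype fields ... */
    void *allocator;                    /* placeholder for the heap allocator state */
    PyThread_type_lock allocator_lock;  /* must be held while touching `allocator` */
} StringDTypeObject;

/* single-descriptor variants; the PR also defines 2- and 3-descriptor ones */
#define NPY_STRING_ACQUIRE_ALLOCATOR(descr) \
    PyThread_acquire_lock(((StringDTypeObject *)(descr))->allocator_lock, WAIT_LOCK)
#define NPY_STRING_RELEASE_ALLOCATOR(descr) \
    PyThread_release_lock(((StringDTypeObject *)(descr))->allocator_lock)

The lock itself would be created with PyThread_allocate_lock() when the descriptor is constructed and destroyed with PyThread_free_lock() in stringdtype_dealloc.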

There's nothing in the API that prevents someone from touching the allocator field before acquiring the allocator_lock. Maybe it would be nicer to wrap a type around the allocator, or is there a more idiomatic way to write this in C?

Most of this PR is then applying these macros where appropriate, including in error-handling cases. A few places where I didn't exit from error cases, as well as other similar bugs, are fixed too.
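To illustrate the error-handling point, here's a sketch of the pattern (the function and its signature are hypothetical; it uses the macros sketched above): every path out of the critical section has to release the mutex, which is easiest to guarantee with a single exit label.

static int
pack_string(PyArray_Descr *descr, PyObject *obj)
{
    int result = -1;

    NPY_STRING_ACQUIRE_ALLOCATOR(descr);
    if (obj == NULL) {
        /* error path: fall through to the release so the mutex isn't leaked */
        goto done;
    }
    /* ... use the allocator to pack the string into the array buffer ... */
    result = 0;
done:
    NPY_STRING_RELEASE_ALLOCATOR(descr);
    return result;
}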

I wasn't sure what to do in stringdtype_dealloc. Can we assume because of the GIL that a dtype is deallocated by only one thread?

I haven't benchmarked the single-threaded performance hit or added threaded tests, but I'm planning to add both before merging.

I based this implementation off of this tutorial.

@ngoldbaum (Member Author)

This also shows that putting a mutex on a dtype doesn't seem to be a big deal in the single-threaded case. For ufuncs and casts we can acquire and release the mutex outside the core loop, so it scales nicely too. I wonder if we could fix a ton of threading bugs in numpy by explicitly acquiring a per-dtype mutex before releasing the GIL.
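A sketch of that pattern for a strided inner loop (the loop body is elided and the function name is illustrative, but the shape follows numpy's experimental strided-loop API): the mutex is taken once per call to the inner loop rather than once per element.

static int
string_add_strided_loop(PyArrayMethod_Context *context, char *const data[],
                        npy_intp const dimensions[], npy_intp const strides[],
                        NpyAuxData *auxdata)
{
    PyArray_Descr *descr = context->descriptors[0];
    npy_intp N = dimensions[0];

    NPY_STRING_ACQUIRE_ALLOCATOR(descr);
    for (npy_intp i = 0; i < N; i++) {
        /* ... load both operands, concatenate, and store the result,
               all while the allocator lock is held ... */
    }
    NPY_STRING_RELEASE_ALLOCATOR(descr);
    return 0;
}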

@lysnikolaou (Member)

Can we assume because of the GIL that a dtype is deallocated by only one thread?

My understanding of the current version of PEP 703 is that yes, we can assume that.

@ngoldbaum (Member Author)

I just added a test for multithreaded access and mutation and things seem to be working correctly. I also fixed two unrelated bugs I encountered while writing that test, and fixed how I release the mutex: I need to account for cases where the input descriptors are identical objects and avoid acquiring or releasing the mutex more than once in those cases.
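The identical-descriptor case matters because a PyThread_type_lock is not re-entrant: acquiring the same lock twice from one thread deadlocks. A sketch of what the two-descriptor variants have to do (macro names are illustrative):

/* acquire both allocators, but only lock once if both descriptors are the same object */
#define NPY_STRING_ACQUIRE_ALLOCATOR2(descr1, descr2)      \
    do {                                                   \
        NPY_STRING_ACQUIRE_ALLOCATOR(descr1);              \
        if ((void *)(descr1) != (void *)(descr2)) {        \
            NPY_STRING_ACQUIRE_ALLOCATOR(descr2);          \
        }                                                  \
    } while (0)

#define NPY_STRING_RELEASE_ALLOCATOR2(descr1, descr2)      \
    do {                                                   \
        NPY_STRING_RELEASE_ALLOCATOR(descr1);              \
        if ((void *)(descr1) != (void *)(descr2)) {        \
            NPY_STRING_RELEASE_ALLOCATOR(descr2);          \
        }                                                  \
    } while (0)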

Review thread on the new test (diff context):

np.testing.assert_array_equal(arr, uarr)


def test_threaded_access_and_mutation(dtype, random_string_list):
@ngoldbaum (Member Author)
@seberg I based this off a test you wrote in numpy. Thanks for the example! Do you think there are any other sorts of operations that would be good to add to this test, or problems with this implementation? It definitely does catch threading errors: after adding it, I noticed that the original version of this PR wasn't releasing mutexes correctly.

seberg (Member)

IIRC this test was specifically designed for setitem/getitem, I think (maybe only setitem) with structured dtypes, because it is a hairy recursing monster.

I suppose it might be good to have a single test that runs a ufunc (reading) in one thread and assigns to the same array in another? To have one basic ufunc locking test? (of course you can also use a ufunc to assign to the same array with out= additionally)

seberg (Member)
Hmmm, I suppose it already runs a ufunc, so whatever :). (I half suspect I didn't write this, but took it/adjusted it from an issue :))

@ngoldbaum (Member Author), Nov 9, 2023
This is adapted from the structured dtype test, not just a copy/paste, so it's not quite the same.

Thanks for giving it a look!

BTW, once this PR is merged the stringdtype prototype is functionally done and I'm ready to move to integrating it into numpy. I'll probably update the NEP first though. Maybe I'll also include a call for testing the prototype when I update the mailing list?

@ngoldbaum (Member Author)

Performance impact for single-threaded use seems minimal. Here's a test on main with numpy and stringdtype compiled in release mode:

In [3]: arr = np.array(["hello", "world"], dtype=StringDType())

In [4]: %timeit np.add(arr, arr)
1.65 µs ± 33.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

And with this PR:

In [4]: arr = np.array(["hello", "world"], dtype=StringDType())

In [5]: %timeit np.add(arr, arr)
1.78 µs ± 20.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

This kind of tiny array is the worst case for a ufunc too, since we acquire and release the mutex outside the inner loop.

@ngoldbaum merged commit da64667 into numpy:main on Nov 10, 2023 (1 check passed)
@seberg (Member) commented on Nov 10, 2023:

This kind of tiny array is the worst case for a ufunc

Not quite true; there are some cases where NumPy will opt to call the inner loop with few items, but many times. where= is the worst; the other is reductions.

But... whatever, the only alternative would be to use the auxdata to lock for the whole operation. And that might not even work easily, since we may be using casts at the same time.
