Thread-local arenas #8692

Open
wants to merge 1 commit into main

Conversation

@kddnewton (Author)

Summary

Currently, all threads share a single arena for imaging memory. With enough workers, that shared arena puts the GIL under heavy contention in regular Python, and the arena mutex under heavy contention in free-threaded Python.

This commit instead introduces lockless thread-local arenas for environments that support them. For environments that do not support thread-locals (or where we couldn't determine support at compile time), we fall back to the GIL, or to a mutex when there is no GIL.

This has some implications for statistics, as statistics are now thread-specific. This could be solved in a couple of ways (in C or in Python), or left unsolved and simply documented. I think either way is fine.
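One C-side possibility, purely as a sketch with hypothetical names (the PR does not implement this), would be to keep a registry of per-thread statistics and sum them on demand, so the lock only sits on the rarely-used reporting path rather than on allocation:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch: each thread-local arena registers its counters
 * once, and a rarely-called reader sums them under a lock that never
 * touches the allocation fast path. */
struct arena_stats {
  size_t blocks_allocated;
  struct arena_stats *next; /* intrusive list through the registry */
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct arena_stats *registry = NULL;

/* Called once per thread, when its arena is created. */
void stats_register(struct arena_stats *stats) {
  pthread_mutex_lock(&registry_lock);
  stats->next = registry;
  registry = stats;
  pthread_mutex_unlock(&registry_lock);
}

/* Called only when Python asks for statistics. */
size_t stats_total_blocks(void) {
  size_t total = 0;
  pthread_mutex_lock(&registry_lock);
  for (struct arena_stats *s = registry; s != NULL; s = s->next) {
    total += s->blocks_allocated;
  }
  pthread_mutex_unlock(&registry_lock);
  return total;
}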

Code

Most of the code doesn't actually need to change. The bulk of the changes were getting setup.py to emit the proper compilation definitions so that we could check which kind of thread-local declarations were supported at compile-time. Other than that, the declaration of the default arena now has the thread-local declaration and the places where we previously locked the mutex the macro name has changed to reflect that it is specific to the thread-local arena.
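As an illustration of what that compile-time check enables, here is a minimal sketch; the HAVE_* definitions and the ARENA_* macro names are hypothetical stand-ins, not the actual names emitted by setup.py:

/* Hypothetical sketch: choose a thread-local qualifier based on
 * definitions emitted by the build, falling back to a shared arena
 * (guarded by the GIL or a mutex) when no support was detected. */
#if defined(HAVE_C11_THREAD_LOCAL)
#define ARENA_THREAD_LOCAL _Thread_local /* C11 */
#elif defined(HAVE_GCC_THREAD)
#define ARENA_THREAD_LOCAL __thread /* GCC/Clang extension */
#elif defined(HAVE_DECLSPEC_THREAD)
#define ARENA_THREAD_LOCAL __declspec(thread) /* MSVC */
#else
#define ARENA_THREAD_LOCAL /* no thread-local support: shared arena */
#define ARENA_NEEDS_LOCK 1
#endif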

Benchmarks

For regular Python, this didn't make much of a difference (the difference between the samples wasn't statistically significant at a 95% CI). For free-threaded Python, however, the difference was fairly massive: roughly a 70% reduction in runtime.

v3.13.0 on main

Max: 0.439743 Mean: 0.355661 Min: 0.305783
Max: 0.415384 Mean: 0.361710 Min: 0.304075
Max: 0.427207 Mean: 0.366160 Min: 0.300381
Max: 0.460026 Mean: 0.388431 Min: 0.316797
Max: 0.419853 Mean: 0.361484 Min: 0.309495
Max: 0.393699 Mean: 0.350330 Min: 0.302294
Max: 0.443584 Mean: 0.372369 Min: 0.311351
Max: 0.404041 Mean: 0.355057 Min: 0.309706
Max: 0.420880 Mean: 0.341415 Min: 0.280980
Max: 0.408922 Mean: 0.320707 Min: 0.228622

v3.13.0t on main

Max: 0.218140 Mean: 0.143962 Min: 0.091831
Max: 0.195644 Mean: 0.124187 Min: 0.079139
Max: 0.169986 Mean: 0.124365 Min: 0.081508
Max: 0.194228 Mean: 0.136258 Min: 0.103134
Max: 0.192837 Mean: 0.131196 Min: 0.094301
Max: 0.180463 Mean: 0.126546 Min: 0.079336
Max: 0.181516 Mean: 0.126875 Min: 0.083507
Max: 0.178397 Mean: 0.120558 Min: 0.083620
Max: 0.182262 Mean: 0.129299 Min: 0.087499
Max: 0.167291 Mean: 0.114647 Min: 0.074147

v3.13.0 on branch

Max: 0.429302 Mean: 0.362776 Min: 0.314723
Max: 0.406314 Mean: 0.355255 Min: 0.299485
Max: 0.438540 Mean: 0.378539 Min: 0.308898
Max: 0.425942 Mean: 0.368141 Min: 0.310095
Max: 0.408924 Mean: 0.365672 Min: 0.313756
Max: 0.419717 Mean: 0.361498 Min: 0.307699
Max: 0.418639 Mean: 0.355136 Min: 0.314148
Max: 0.426816 Mean: 0.377236 Min: 0.321773
Max: 0.424230 Mean: 0.358225 Min: 0.291148
Max: 0.421029 Mean: 0.363783 Min: 0.315103

v3.13.0t on branch

Max: 0.103066 Mean: 0.041306 Min: 0.018575
Max: 0.121496 Mean: 0.043042 Min: 0.018622
Max: 0.129727 Mean: 0.040726 Min: 0.014389
Max: 0.124282 Mean: 0.037581 Min: 0.018034
Max: 0.112015 Mean: 0.042051 Min: 0.017231
Max: 0.123254 Mean: 0.042117 Min: 0.019646
Max: 0.129165 Mean: 0.043886 Min: 0.017393
Max: 0.157608 Mean: 0.045151 Min: 0.017874
Max: 0.117050 Mean: 0.043070 Min: 0.016238
Max: 0.131859 Mean: 0.044563 Min: 0.017736

Script

Below is the script that I used to run these benchmarks.

bench.py
import concurrent.futures
import os
import threading
import time

from PIL import Image

num_threads = 16
num_images = 1024


def operation():
    # Allocate a batch of small images, then convert each one, which
    # exercises the arena on both the allocation and conversion paths.
    images = []
    for i in range(num_images):
        img = Image.new(
            "RGB", (100, 100), color=(i % 256, (i // 256) % 256, (i // 65536) % 256)
        )
        images.append(img)

    for img in images:
        img = img.convert("CMYK")

    images.clear()


def worker(barrier):
    # Wait for every worker so all threads hit the arena at once.
    barrier.wait()
    runtimes = []

    for _ in range(5):
        start_time = time.time()
        operation()
        end_time = time.time()
        runtimes.append(end_time - start_time)

    return runtimes


def benchmark():
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        barrier = threading.Barrier(num_threads)
        futures = [executor.submit(worker, barrier) for _ in range(num_threads)]

        run_times = []
        for future in concurrent.futures.as_completed(futures):
            try:
                run_times.extend(future.result())
            except IndexError:
                os._exit(-1)

        min_time = min(run_times)
        max_time = max(run_times)
        mean_time = sum(run_times) / len(run_times)
        print(f"Max: {max_time:.6f} Mean: {mean_time:.6f} Min: {min_time:.6f}")


benchmark()

@aclark4life
Copy link
Member

aclark4life commented Jan 13, 2025

@kddnewton Can we say "environment" instead of "arena" here? Otherwise, thank you for the PR! Oh, or alternatively please explain what "arena" is in this context, haven't heard that one before. At a glance, it looks like either "arena" is another word for "project" or it's an imaging term I'm not familiar with 😄

@hugovk hugovk added the Free-threading PEP 703 support label Jan 13, 2025
@kddnewton (Author)

@aclark4life No problem! Arenas in this context are memory arenas, which are already in use inside Pillow. The general idea is that they represent large contiguous blocks of memory from which you can then manually allocate, avoiding the cost of repeated malloc/free calls. Below is a super simplified example:

#include <stdlib.h>

struct my_struct { int value; }; /* placeholder payload */

int main(void) {
  struct my_struct *s1 = malloc(sizeof(struct my_struct));
  struct my_struct *s2 = malloc(sizeof(struct my_struct));
  struct my_struct *s3 = malloc(sizeof(struct my_struct));

  /* do something */

  free(s1);
  free(s2);
  free(s3);

  return EXIT_SUCCESS;
}

In this example we manually allocate memory for all three structs, and then manually free them. This can cause heap fragmentation and results in a lot of calls into the system allocator. Instead:

#include <stdint.h>
#include <stdlib.h>

struct my_struct { int value; }; /* placeholder payload */

struct my_arena {
  uint8_t *memory;
  size_t size;
};

void *my_malloc(struct my_arena *arena, size_t size) {
  /* hand out the next unused chunk, then bump the offset */
  void *result = arena->memory + arena->size;
  arena->size += size;
  return result;
}

int main(void) {
  struct my_arena arena = { .memory = malloc(1024), .size = 0 };

  struct my_struct *s1 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct *s2 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct *s3 = my_malloc(&arena, sizeof(struct my_struct));

  /* do something */

  free(arena.memory);

  return EXIT_SUCCESS;
}

In this example we make a single memory allocation and then a single free, which means all of the memory is contiguous (helping with locality) and only two calls into the system allocator are made (more efficient). I'm omitting a couple of details here about bookkeeping, but that's the general gist.

This is already in place in Pillow. There is a single global arena that is used for all memory allocations. This is great, and helps a lot in terms of performance. However, the downside is that in a multi-threaded environment, the lock that guards access to the arena (be it the GIL, or a mutex in free-threaded Python) comes under a lot of contention, because every thread is trying to use the same arena.

This commit instead makes a separate arena for each thread, so that each thread manages its own memory. The arena is never contended, and you can see in the benchmarks that this drastically speeds up free-threaded Python, because it never has to take a lock.
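To make the difference concrete, here is a minimal sketch of the two allocation paths, reusing my_arena and my_malloc from the example above (the alloc_* names are illustrative, not Pillow's; arena setup is omitted):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct my_arena {
  uint8_t *memory;
  size_t size;
};

static void *my_malloc(struct my_arena *arena, size_t size) {
  void *result = arena->memory + arena->size; /* bump allocation */
  arena->size += size;
  return result;
}

/* One arena shared by everyone: every thread contends on this lock. */
static struct my_arena shared_arena;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

void *alloc_shared(size_t size) {
  pthread_mutex_lock(&shared_lock);
  void *result = my_malloc(&shared_arena, size);
  pthread_mutex_unlock(&shared_lock);
  return result;
}

/* One arena per thread: private, so no lock is ever taken. */
static _Thread_local struct my_arena local_arena;

void *alloc_local(size_t size) {
  return my_malloc(&local_arena, size);
}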

I hope I explained that sufficiently, let me know if there's anything I can clear up!

@hugovk (Member) commented Jan 13, 2025

cc @lysnikolaou who's been helping with the free-threaded work.

@kddnewton force-pushed the thread-local-arenas-2 branch from e799ace to 8445d50 on January 13, 2025 15:32
@lysnikolaou (Contributor) left a comment

Looks great @kddnewton! Only suggested a change to setup.py, so that it's a bit clearer.

@kddnewton force-pushed the thread-local-arenas-2 branch 2 times, most recently from 350283e to e76b4d4 on January 13, 2025 16:22
@kddnewton force-pushed the thread-local-arenas-2 branch 2 times, most recently from 51a476e to 20f2f4c on January 13, 2025 17:23
@kddnewton force-pushed the thread-local-arenas-2 branch from 20f2f4c to f751960 on January 13, 2025 17:37
@kddnewton (Author)

@lysnikolaou are those test failures related to my changes? It doesn't seem like it, but since setup.py infects everything, I'm not so sure.

@radarhere (Member)

The test failures should be fixed by #8686
The docs failure has been fixed in main by #8691

@hugovk (Member) commented Jan 13, 2025

> The test failures should be fixed by #8686

Just merged, please update this PR from main.

@kddnewton force-pushed the thread-local-arenas-2 branch from f751960 to fa6a6b0 on January 13, 2025 19:24
Currently, all threads use the same arena for imaging. This can
result in a lot of contention when there are enough workers and
the mutex is constantly being checked.

This commit instead introduces lockless thread-local arenas for
environments that support it.
@kddnewton force-pushed the thread-local-arenas-2 branch from fa6a6b0 to cfb2dcd on January 13, 2025 19:26
@kddnewton (Author)

@hugovk done!

@wiredfool (Member) commented Jan 13, 2025

The ImagingMemoryArena is an implicit default for the image -- it's not recorded anywhere that I see. What happens if an image is passed from thread to thread?

This is the image struct:

struct ImagingMemoryInstance {

And this is where the memory is released back into the pool:

ImagingDestroyArray(Imaging im) {

@kddnewton (Author) commented Jan 13, 2025

@wiredfool do you have an example of passing it from thread to thread? I'm not sure if I know how that would happen.

@wiredfool (Member) commented Jan 13, 2025

(sorry, managed to edit rather than comment)

I've done this in the past: I had an app where all of the processing was offloaded to worker threads via queues. Scanner -> initial processing -> thumbnailing -> uploading were all done off the main thread.

Anything where you're doing something with a UI main thread and processing elsewhere -- there are a bunch of operations that will create a new image. If you then hold a reference on the main thread, you won't be able to release it.

I'm also thinking that it's going to interfere with the lifetimes for Arrow support, because that memory could potentially be freed from a thread that's not even part of our process.

Actually -- is memory in thread local storage actually available outside of the thread?

@kddnewton (Author)

The honest answer is I'm not sure. I think we should test this out. Just so that I can properly replicate what you're saying, are you describing: create images in the parent thread, have child threads pick them off a queue and process them, child threads exit, parent thread resumes?

As for TLS being visible outside of the thread, I think the answer is that it depends on the implementation. Linux has actual instructions for TLS, whereas macOS implements it in a library, from what I understand. I imagine this would affect the answer.

@aclark4life (Member) commented Jan 13, 2025

I was going to raise "does this help #1888?" so I'm curious to know the answer too … thanks all!

@wiredfool (Member)

I think something like:

  1. Open image in parent thread
  2. call image.resize in a child thread (e.g. via threading.Thread)
  3. Return resized image to parent thread

would probably be enough to do it.

Actually, looking at GCC's TLS documentation:

> lifetimes that match the thread lifetime, and destructors that cleanup the unique per-thread storage

That concerns me on a couple of fronts --

The image memory is probably accessible outside the thread, since it's malloc'd, and it's just the original struct that's going to be in TLS storage. However, if we have a one-shot thread that passes the image off, the arena will be deallocated before the image is freed.

So not only will the mutex and the arena likely be wrong in the child; there's also going to be a pretty significant memory leak, because we won't necessarily get a chance to clean up the malloc'd items.
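To illustrate the hazard being described here, a minimal standalone sketch that uses a pthread TLS destructor to stand in for per-thread arena teardown (all names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_key_t arena_key;

/* Runs automatically when the owning thread exits, like the TLS
 * destructors mentioned in the GCC documentation quoted above. */
static void arena_destroy(void *memory) {
  free(memory);
}

static void *child(void *unused) {
  (void)unused;
  char *arena = malloc(1024);
  pthread_setspecific(arena_key, arena); /* register for teardown */
  strcpy(arena, "pixels");
  return arena; /* a pointer into the arena escapes the thread */
}

int main(void) {
  pthread_key_create(&arena_key, arena_destroy);

  pthread_t thread;
  void *escaped;
  pthread_create(&thread, NULL, child, NULL);
  pthread_join(thread, &escaped);

  /* arena_destroy has already run: this reads freed memory. */
  printf("%s\n", (char *)escaped);
  return EXIT_SUCCESS;
}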

@kddnewton (Author)

@wiredfool Okay, that helps a lot. I'll put together some example code and see what I can see. Maybe we'll need to add some logic around moving between threads to ensure everything works properly. In the meantime, let's put a pause on this PR until I can answer your questions.

@wiredfool (Member)

OK, some thoughts here --

  1. I don't think that tying image storage to the life of the thread is a good idea. It breaks how we think about Python objects. However, failing tests caused by that on this branch that pass on main should be added to main, because this is clearly an undertested corner. I suspect some of the tests might only fail under valgrind.
  2. Alternatively, it might be possible to reduce the scope of the locks, so that we're only locking things that actually modify the arena struct. E.g., we don't need to lock around the (re)malloc, only the insertion into the block list. Reads are probably OK, and comparisons to mostly static values like blocks_max and block_size probably don't need locks. More fine-grained locks might reduce contention.
  3. Or a set of 8 or 16 or n memory arenas, choosing which one of them to use with a hash of some thread id (see the sketch below). We'd need to store a pointer to the arena in the image struct, though, and follow that for destruction. If a thread goes away, we don't actually lose any arena. There could potentially be arenas that never get allocated from, or that get allocated from and then never drained. The settings for the block cache are per arena, so there's the potential for n times the expected memory to be retained when images are freed.
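A minimal sketch of option 3, with hypothetical names (Pillow's actual structs differ): hash the current thread id into a fixed pool of arenas, keep a per-arena lock so contention is divided rather than eliminated, and record the chosen arena in the image so destruction can find it from any thread:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_COUNT 8

struct arena {
  pthread_mutex_t lock; /* per-arena, so contention is divided by ARENA_COUNT */
  uint8_t *memory;
  size_t size;
};

/* Assume the locks and memory in arenas[] are initialized at startup. */
static struct arena arenas[ARENA_COUNT];

struct image {
  struct arena *arena; /* remembered so any thread can free into the right pool */
  uint8_t *pixels;
};

static struct arena *arena_for_current_thread(void) {
  /* pthread_t is opaque; treating it as an integer is a sketch-level
   * shortcut, not portable code. */
  uintptr_t id = (uintptr_t)pthread_self();
  return &arenas[(id >> 4) % ARENA_COUNT];
}

struct image *image_new(size_t pixel_bytes) {
  struct image *img = malloc(sizeof(struct image));
  img->arena = arena_for_current_thread();

  pthread_mutex_lock(&img->arena->lock);
  img->pixels = img->arena->memory + img->arena->size; /* bump allocation */
  img->arena->size += pixel_bytes;
  pthread_mutex_unlock(&img->arena->lock);

  return img;
}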

Labels
Free-threading PEP 703 support
6 participants