Zarr-Python → Benchmarking and Performance #1479

MSanKeys963 · 2023-07-31T00:12:13Z

Maintainer

Hi everyone! 👋🏻

Recently, we had meetings on July 6th and 7th, 2023 (meeting notes) led by @rabernat to discuss and decide the path forward for Zarr-Python development. After a good discussion with the attendees and gauging their interest, we decided to split the larger group into two working groups:

Benchmarking and performance working group 📈 → led by @JackKelly
Refactoring working group 🧑🏻‍💻 → led by @jhamman

Thank you @JackKelly and @jhamman, for stepping up! 🙏🏻

What's next? 👀

This discussion thread aims to kick off the 📈 Benchmarking and Performance 📈 working group and hold any top-level discussions related to the development work.

Here's the project board to organise and track progress: https://github.com/orgs/zarr-developers/projects/4

Thank you, everyone, for joining the meetings and sharing your insights. Please feel free to ask any questions.

@JackKelly, please take it from here.

JackKelly · 2023-07-31T16:42:08Z

Thanks so much for kicking off this discussion, @MSanKeys963 & @rabernat.

Meetings

To everyone else: If you're interested in helping to measure & improve Zarr's performance then please join our half-hour meetings, which will be held every two weeks, starting in September.

Please don't worry if you don't have time to write code! It will still be very useful to hear different perspectives and use-cases in these meetings.

Please fill out this poll to let us know which times would be convenient for the first meeting in September. (Select your timezone. Then click-and-drag to select all the times that are convenient for you. Then fill out your name (bottom right) and click "SEND RESPONSE".) UPDATE: Voting has now closed. Please see this comment for dates of the two kick-off meetings.

Let's work on the assumption that we'll hold our subsequent meetings every two weeks, at the same time-of-day and day-of-week as our first meeting. (Although, if that's not convenient then by all means raise that in the first meeting).

Very quick intro to me

Finally, I should give a super-quick intro: I'm a co-founder of Open Climate Fix. I've been using (and loving) Zarr since early 2019. I've written a lot of Python code over the last ~12 years (most recently: data-processing and machine learning for solar power forecasting). But I haven't contributed to Zarr before, so I'm a newbie in terms of the Zarr-Python codebase. I'm super-excited about Rust, and have been learning Rust because I hope that Rust could help us speed up Zarr, but I'm still a few months away from being productive in Rust. I love reading about high-performance code (maximising CPU cache hits, io_uring, the performance characteristics of SSDs vs HDDs etc.) but I haven't actually written that much high-performance IO code. Which is all to say: I'm eager to help. But I don't make any claims about being especially knowledgeable! So the community's help & guidance will be essential 🙂

0 replies

clbarnes · 2023-08-01T10:35:23Z

Some thoughts I've had while writing zarr3:

For the data I've interacted with, blosc is probably going to be the best way to handle compression, but won't be available on all targets (getting it to work in WASM seems like it should be possible, but would probably need a new crate). In fact, blosc2 seems to be eating a lot of zarr's lunch in terms of goals.
Not having a native async read/write trait is a pain. I know that tokio is by far the most common way of doing non-trivial async but it feels iffy to force that stack downstream.
Thanks to the trait system, whatever ends up doing our low-level IO can (and should) be a completely different crate to anything zarr-related. "I want IO, fast" is a concern much larger than just us! Which makes me think we might not be best placed to actually do it - I sort of assumed that greater minds than mine would have expended more effort than me on that front.
object_store doesn't support suffix ranges, which we need for the sharding spec (I've just raised object_store: range request with suffix apache/arrow-rs#4611 ). It might also be nice to have multipart ranges (just raised object_store: multipart ranges for HTTP apache/arrow-rs#4612 ). I have a crate in progress for representing byte range responses (which are wildly variable depending on the server) but it's probably not very efficient, and I don't necessarily know how to make it more efficient while doing what we want.
object_store is also a bit of a chocolate kettle on WASM.
I suspect that a lot of things which technically improve performance in rust (reducing allocations etc.) are going to be negligible compared to the absolute cost of network round trips and other IO operations, so exactly as you suggested, getting benchmarks in place first is going to be important!
Several layers of caching will be important, but I'm struggling to envision a rust-y way of orchestrating configurable caches at different levels
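
For readers unfamiliar with the two kinds of range request mentioned in the list above, here is a minimal sketch of the HTTP `Range` header values involved. It is plain Python rather than Rust, it is not part of `object_store` or any crate, and the helper names `suffix_range` and `multipart_range` are hypothetical. A suffix range asks for the last N bytes of an object without knowing its length.

```python
def suffix_range(n_bytes: int) -> str:
    """Range header value asking for the last `n_bytes` of an object."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    return f"bytes=-{n_bytes}"


def multipart_range(ranges: list[tuple[int, int]]) -> str:
    """Range header value for several (start, end_inclusive) byte ranges."""
    parts = []
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f"invalid range: {(start, end)}")
        parts.append(f"{start}-{end}")
    return "bytes=" + ",".join(parts)


print(suffix_range(500))                        # bytes=-500
print(multipart_range([(0, 99), (200, 299)]))   # bytes=0-99,200-299
```

How a server answers these (a single body, a multipart body, or the whole object) varies, which is the variability in byte range responses referred to above.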

14 replies

JackKelly Aug 5, 2023

Thank you so much for your detailed answer, @FrancescAlted!

As you have mentioned, this current discussion thread is quite long. And it's rather meandering (which is entirely my fault! 🙂) So your comment above might not be seen by the bulk of the Zarr community. To maximise the chance that the rest of the Zarr community sees your suggestions, I would suggest copying-and-pasting your above comment into a new issue in the Zarr-Specs repo. This suggestion is based on text in zarr-developers/zarr-specs#254 (A review of the most recent Zarr sharding proposal) which says:

Specific technical feedback on sharding should be made via narrowly scoped issues on the zarr-specs repository that link to [issue 254].

I'm very new to the Zarr dev community - so it's highly likely that I'm mistaken 🙂 - but my impression is that it might be a bit late in the day to modify the Zarr sharding proposal in any substantive way. But please post in Zarr-Specs to get feedback from folks who are far more knowledgeable than I am.

jbms Aug 5, 2023

The zarr v3 sharding in ZEP 2 allows nested partitioning by nesting the sharding_indexed codec, which I think would be basically equivalent to the blosc2 double partitioning. It seems to me that nested partitioning could clearly be useful to avoid an excessively large shard index that could occur when using just a single level of partitioning and a very large number of partitions, especially with certain sparsity (missing sub-chunks) patterns. It is less clear to me, though, under what circumstances it would be important for performance otherwise, if the implementation is unconstrained. The zarr v3 sharding_indexed format already allows the implementation to choose the order in which sub-chunks are written, and if e.g. Morton order is used then there is a similar opportunity for coalescing reads as with multiple levels of partitioning.
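
To make the Morton-order remark concrete, here is a minimal sketch of ordering sub-chunk grid coordinates by Morton (Z-order) index. It is not taken from any Zarr implementation; `morton_index` is a hypothetical helper, and the 4 x 4 grid is an arbitrary illustration.

```python
def morton_index(coords: tuple[int, ...], bits: int = 10) -> int:
    """Interleave the bits of each coordinate into a single Z-order index."""
    index = 0
    ndim = len(coords)
    for bit in range(bits):
        for axis, c in enumerate(coords):
            index |= ((c >> bit) & 1) << (bit * ndim + axis)
    return index


# Order the sub-chunks of a 4 x 4 grid by Morton index.
grid = [(y, x) for y in range(4) for x in range(4)]
ordered = sorted(grid, key=morton_index)

# The first four entries form one 2 x 2 block of neighbouring sub-chunks.
print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Because spatially nearby sub-chunks end up next to each other in the write order, a read of a small region tends to touch a few nearby byte ranges, which is what makes coalescing possible.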

FrancescAlted Aug 7, 2023

@JackKelly I really don't want to interfere in your roadmap, so I won't add my suggestions there. Actually, we already had this discussion last year, and the outcome was that Zarr's sharding and Blosc2's double partition are different beasts with different goals. At any rate, and regardless of whether these can cooperate or are essentially incompatible, Zarr could at least leverage Blosc2 as yet another codec; this should be low-hanging fruit, IMO.

akshaysubr Sep 6, 2023

I'm seeing this discussion quite late, so apologies if I missed something. One of my common use cases is to read from zarr stores to feed a DL training pipeline where the data is needed on the GPU. In that context, we are quite resource constrained on the CPU and using GPU decompression has a significant performance upside. However, that requires allowing for concurrent decompression of multiple chunks at once rather than in a bulk-synchronous fashion. There's more discussion about this in this issue: #1398. In this context, thinking of CPU-GPU compatibility (compressing on one, decompressing on the other) is critically important and there isn't a clean way to do this currently. It would be great to address that topic as part of this effort as well. This issue has some details about how we might allow for this compatibility and this PR has an initial example using a workaround.

weiji14 Oct 13, 2023

@akshaysubr, I've been running some benchmarks on reading Zarr with kvikIO (NVIDIA GPUDirect Storage) and have opened a discussion at zarr-developers/zarr-benchmark#14 (to not clutter this thread). Would be happy to discuss more on this as I'm interested in a fully GPU-native Zarr reader too!

normanrz · 2023-08-01T12:04:42Z

Maintainer

Of course it always depends on the use case and array configuration. From my experience with zarrita and sharding, the performance bottleneck is less at the IO layer and more in the (inner) chunk handling. With sharding, we write out relatively few, fairly large objects. With an async IO implementation, this is already fairly fast in Python.
However, assembling the shards is very time consuming. For example, we have a configuration with shard_shape = (1024, 1024, 1024) and chunk_shape = (32, 32, 32). That means we have 32k chunks per shard that all need to be compressed individually and written into a buffer before writing out to IO. This results in a lot of heap allocations which eat up performance. Reading entire shards has the same problem. Optimizing this (e.g. with a Rust crate) would be more worthwhile than replacing fsspec, in my opinion.
(Although, I would love to see a truly async local filesystem library for Python.)
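
As a quick check of the "32k chunks per shard" figure, the arithmetic can be sketched as follows. The 1 byte per voxel is an assumption for illustration only; the dtype is not stated above.

```python
from math import prod

shard_shape = (1024, 1024, 1024)
chunk_shape = (32, 32, 32)

# Number of inner chunks along each axis, then in total.
chunks_per_axis = tuple(s // c for s, c in zip(shard_shape, chunk_shape))
chunks_per_shard = prod(chunks_per_axis)

print(chunks_per_axis)   # (32, 32, 32)
print(chunks_per_shard)  # 32768

# ASSUMPTION: 1 byte per voxel (e.g. uint8).
bytes_per_chunk = prod(chunk_shape) * 1
bytes_per_shard = prod(shard_shape) * 1
print(bytes_per_chunk, bytes_per_shard)  # 32768 1073741824
```

Under that assumption each uncompressed chunk is 32 KiB and each shard is 1 GiB, so one shard means 32,768 separate compress-and-append steps.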

10 replies

JackKelly Aug 1, 2023

Whether to coalesce or parallelize the IO ops is probably always a tradeoff that depends on the access pattern and underlying storage.

Great point. Hopefully we can give the users control over these tradeoffs (so the user can pick the best optimisations for their use-case, and their hardware).

jbms Aug 1, 2023

For 4KB chunks heap allocations might start to be an issue but I expect even then they won't be a major concern if you have a fast memory allocator. Python is slow for a lot of reasons of which memory allocation is, I think, just a small part. Potentially a thread-local arena allocator (i.e. scratch buffer) could be used but that makes it difficult to migrate work between threads.

clbarnes Aug 1, 2023

I feel like rustc won't be a big fan of writing to a shared array in parallel 😅 ndarray supports a mutable chunk iterator but not ndarray::parallel - not sure if the chunks end up Send/Sync-able.

kylebarron Aug 1, 2023

won't be a big fan of writing to a shared array in parallel

I'm not familiar with ndarray but with a normal Vec you can use split_at_mut to get multiple mutable slices that can each be mutated separately from threads.

JackKelly Aug 2, 2023

Yeah, I'm not 100% sure how best to allow concurrent writes to the "shared final" array in Rust! Especially given that writing a Zarr chunk to the final array will often not be a single contiguous memcopy operation (even though the elements of a single chunk are - by definition - contiguous in n-dimensional space, the elements probably won't be contiguous in the serial memory layout of the final array).

A related issue to be aware of: Depending on actual measured performance, we may find that we have to be careful not to invalidate the CPU caches: if two CPUs write to the same cache line (usually 64 bytes wide) then it'll invalidate the CPU-specific caches (which has a large time penalty) (see "false sharing"). So we may want to deliberately schedule the writes so that writes to nearby addresses in the serial structure of the final array never happen concurrently. Although this might be a premature optimisation 🙂.
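
Both points can be illustrated with a small sketch in plain Python (not Rust, and not from any Zarr implementation). It lists the contiguous runs that a 2D chunk occupies inside a row-major array, then checks which 64-byte cache lines two neighbouring chunks touch. The helper names are hypothetical, and the code assumes 1-byte elements and an array whose first byte is aligned to a cache-line boundary.

```python
CACHE_LINE = 64  # bytes; the typical width mentioned above


def contiguous_runs(array_shape, chunk_origin, chunk_shape):
    """Yield (flat_offset, run_length) for each contiguous run of a 2D chunk
    inside a row-major (C-order) array. Offsets are in elements."""
    _, n_cols = array_shape
    row0, col0 = chunk_origin
    chunk_rows, chunk_cols = chunk_shape
    for r in range(row0, row0 + chunk_rows):
        yield r * n_cols + col0, chunk_cols


def cache_lines(runs, itemsize=1):
    """Set of cache-line indices touched by the given runs.
    ASSUMPTION: the array's first byte sits on a cache-line boundary."""
    lines = set()
    for offset, length in runs:
        first = (offset * itemsize) // CACHE_LINE
        last = ((offset + length) * itemsize - 1) // CACHE_LINE
        lines.update(range(first, last + 1))
    return lines


runs = list(contiguous_runs((1024, 1024), (0, 32), (32, 32)))
print(len(runs))  # 32 separate copies for one 32 x 32 chunk
print(runs[:2])   # [(32, 32), (1056, 32)]

left = cache_lines(contiguous_runs((1024, 1024), (0, 0), (32, 32)))
right = cache_lines(contiguous_runs((1024, 1024), (0, 32), (32, 32)))
print(len(left & right))  # 32
```

In this toy layout, two horizontally adjacent chunks share a cache line on every row, so writing them from two cores at the same time is exactly the pattern that can cause false sharing. Whether it matters in practice is something only measurement can tell.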

martindurant · 2023-08-02T13:18:44Z

Maintainer

I am only just seeing this long thread, so please allow me some time to catch up. I am interested, please include me.

I thought I'd quickly point out that rfsspec already supports suffix ranges; and that cramjam shows a nice pattern for encoding/decoding compressors with optional python bindings. I think it's a mistake to attach to arrow, you will never get them to make the changes we require.

1 reply

JackKelly Aug 2, 2023

Cool, no rush! Please don't feel under any pressure - I know you're super-busy! TBH, for me personally, I'm going to be away from 8th Aug until the end of Aug, so please take your time! (But please don't let that stop you folks discussing stuff!)

JackKelly · 2023-08-02T13:54:26Z

One random thought: This thread has already highlighted that there are loads of use-cases and platforms and software dependencies to consider - and I'm sure many more will come to light soon.

I don't know how you folks feel, but it feels to me like we have more questions than answers at the moment (which is good!). So I think we shouldn't put undue pressure on ourselves to try to architect the "perfect" all-singing all-dancing high-performance Zarr stack in one go. Instead, after we benchmark some existing Zarr implementations, my guess is that the next step will be to follow the lead of projects like rfsspec, Zarrita and zarr3, and implement a handful of prototypes (that might be specific to a single use-case) to start to answer some of our many questions (questions such as: Can we use blosc in a WASM blob? Is io_uring useful for Zarr? Is it even vaguely sane to aim for 1 million IO operations per second? Which existing software packages are useful? etc. etc.). This is also nice because it allows individuals to work on self-contained prototypes (to answer specific questions) without much communications overhead. And hopefully it'll be fun, too, which is very important 🙂.

How does that sound?

5 replies

normanrz Aug 2, 2023
Maintainer

I think that makes a lot of sense. It will be much easier to make informed decisions about optimizations with some benchmarks.
I would recommend collecting use cases for these benchmarks first. I'll start by putting in our primary use case:
We work with 3D bioimaging data. We store the data in small chunks (e.g. 32 ** 3) in larger shards (e.g. 1024 ** 3). We have both images (hardly compressible) and segmentation maps (highly compressible). Our access pattern is reading/writing entire shards in an HPC environment (mostly Python code) and random access to individual chunks from an interactive visualization tool (web-based, no Python).

JackKelly Aug 2, 2023

Great, thanks, I'll start collecting use-cases in the "Benchmarking Zarr" doc.

Please may I ask some quick follow-up questions about the use-case:

Sorry if I'm being dumb but please could you explain the chunk sizes and shard sizes? 🙂

What storage medium do you store your Zarr data on? (HDDs in your HPC environment?)

Which compression codec(s) do you use for your Zarr data?

Do you use Xarray / Dask in your stack?

Which OS? (Linux?)

For the interactive visualisation, I assume the single most important performance characteristic for your users is latency?

normanrz Aug 2, 2023
Maintainer

Sorry if I'm being dumb but please could you explain the chunk sizes and shard sizes? 🙂

Sorry, my markdown garbled the **. I just updated the post.

What storage medium do you store your Zarr data on? (HDDs in your HPC environment?)

Typically a GPFS (distributed, posix-like FS) with many HDDs.

Which compression codec(s) do you use for your Zarr data?

transpose(F), endian(little), blosc(zstd)

Do you use Xarray / Dask in your stack?

No, but we have our own multi-node scheduler that is similar to dask.

Which OS? (Linux?)

Yes, Linux.

For the interactive visualisation, I assume the single most important performance characteristic for your users is latency?

Latency and throughput are both important. However, we really don't have any plans to implement this software in Python.

clbarnes Aug 2, 2023

Our use case is also 3D imaging, generally single-channel, primarily 8- or 16-bit. Mainly poorly-compressing images, but moving towards using some segmentation maps too. This is all in N5 using gzip for the moment as we didn't know any better when we set it all up, but I'd like us to move towards "proper" zarr as v3 matures, and benchmarking suggests some combination of blosc filters is much better for speed/compression.

Most of our storage is on RAIDZ2/3 and served over HTTP(S) or over a network mount. We would like to put more of it on our institutional cephfs cluster, which will require sharding due to inode limits. Most of our day-to-day interaction with the data is human scrolling, so latency is most important, but throughput is a limiting factor when it comes to registering and transforming the images (mainly done in the java stack) and we are steadily upscaling our automated segmentation pipelines.

I recommend dask and, increasingly, xarray when interacting with our data, but that's generally for small-scale stuff.

JackKelly Aug 2, 2023

Great stuff, thank you! I've started a new thread for collecting use-cases. (Chris and Norman, please don't worry about copying-and-pasting your use-cases to the new thread!)

MSanKeys963 · 2023-08-03T00:10:56Z

Maintainer Author

Update: I've included the link for the project board in the description.

0 replies

JackKelly · 2023-08-03T12:30:53Z

Some quick updates:

`object_store`

I've started an issue on object_store to discuss whether object_store is the right place to implement batched, parallel load-decompress-copy. At the time of writing, it's looking like it might not be. (Which is what I expected, and is fine! 🙂).

Work-stealing async executors (like Tokio)

After giving it more thought, I'm wondering if maybe work-stealing isn't necessary or desirable for Zarr?

Work-stealing may not be necessary because, when the user requests a Zarr slice, we know exactly how many tasks need to be completed: each chunk needs to be loaded, decompressed, and moved to the final array. Our tasks won't spawn other tasks, right? And decompression and copying the chunk to the final array won't pause. When a worker thread gets a chunk from disk, that thread is guaranteed to be working flat-out until it's ready to process another chunk from disk.

And work-stealing may not be desirable for us because work-stealing is not very friendly to CPU caches? (I need to read Acar et al. 2000, The Data Locality of Work Stealing).

Which maybe suggests that a simple thread pool may be better?
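
A minimal sketch of the "simple thread pool" idea, in Python rather than Rust and using only the standard library. Everything here is illustrative: the in-memory dict stands in for chunks on disk, zlib stands in for the real codec, and the function names are hypothetical. The point is that the full task list is known up front and no task spawns further tasks.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 16  # bytes per decompressed chunk (illustrative)


def make_fake_store(n_chunks):
    """Stand-in for chunks on disk: chunk i is CHUNK_SIZE copies of byte i."""
    return {i: zlib.compress(bytes([i % 256]) * CHUNK_SIZE) for i in range(n_chunks)}


def load_decompress_copy(store, chunk_id, out):
    """One fixed task: load a chunk, decompress it, copy it into the final array."""
    compressed = store[chunk_id]          # "load"
    data = zlib.decompress(compressed)    # decompress
    start = chunk_id * CHUNK_SIZE
    out[start:start + CHUNK_SIZE] = data  # each task writes a disjoint region


def read_all(store, n_workers=4):
    out = bytearray(len(store) * CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The task list is fixed before any work starts.
        list(pool.map(lambda cid: load_decompress_copy(store, cid, out), store))
    return out


store = make_fake_store(8)
result = read_all(store)
print(len(result) == 8 * CHUNK_SIZE, result[3 * CHUNK_SIZE])  # True 3
```

This says nothing about whether a plain pool actually beats a work-stealing executor; that is a question for the benchmarks.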

`io_uring`

If we were to implement our own IO backend using io_uring and a thread pool, we might first submit our queue of, say, 1 million read operations to the kernel. Then we'd have a thread pool with roughly as many threads as there are logical CPU cores. Each worker thread would run a loop which starts by grabbing data from the io_uring completion queue, then immediately decompresses the chunk, and then - while the decompressed data is still in the CPU cache - writes the decompressed chunk into the final array in RAM.

But I'm also acutely aware that I'm perhaps overly obsessed with io_uring 🙂. And any implementation which relies on io_uring will only work on Linux. It won't work on WASM, MacOS, or Windows (or even a Linux VM hosted on MacOS or Windows) (although Windows Subsystem for Linux does support io_uring).

Multiple IO backends, each of which provides batched IO, de(compression), and merging into a final array?

Perhaps a path forwards is to have multiple IO backends (which share a common API, of course): one backend uses io_uring for maximum speed, but is limited to running on Linux. Another backend is perhaps just a thin wrapper around object_store. And so on.

But, in order to exploit backends which can load & process chunks in parallel, perhaps the API should be batched: i.e. users of the API would be able to say "please get these million chunks, and decompress them, and move them into the final array for me", so the backend can do these steps in parallel if possible? (Similar to what has been proposed multiple times before, e.g. in issues #536, #547, and #1398). This could also open up the possibility of the backend providing batched decompression on the GPU, and copying the decompressed chunks into a final array located on the GPU (discussed in issue #1398).
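
One possible shape for such a batched API, sketched in Python under heavy assumptions. None of these names (`ChunkRequest`, `BatchedBackend`, `SerialDictBackend`, `read_into`) exist in Zarr-Python or object_store; they are hypothetical. The caller hands over the whole batch plus the destination buffer, and the backend decides how to parallelise.

```python
import zlib
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ChunkRequest:
    key: str         # where the compressed chunk lives in the store
    out_offset: int  # byte offset of the chunk in the final buffer


class BatchedBackend(Protocol):
    """Common API: load, decompress and place a whole batch of chunks."""

    def read_into(self, requests: list[ChunkRequest], out: bytearray) -> None: ...


class SerialDictBackend:
    """Simplest possible backend: zlib-compressed chunks in a dict, handled
    one at a time. A faster backend could do the same work in parallel
    behind the same method."""

    def __init__(self, store: dict[str, bytes]):
        self.store = store

    def read_into(self, requests, out):
        for req in requests:
            data = zlib.decompress(self.store[req.key])
            out[req.out_offset:req.out_offset + len(data)] = data


store = {"c/0": zlib.compress(b"a" * 4), "c/1": zlib.compress(b"b" * 4)}
out = bytearray(8)
backend: BatchedBackend = SerialDictBackend(store)
backend.read_into([ChunkRequest("c/0", 0), ChunkRequest("c/1", 4)], out)
print(bytes(out))  # b'aaaabbbb'
```

Real chunks are n-dimensional and rarely map to a single byte offset, so a real request type would need to carry an output selection rather than one offset.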

1 reply

clbarnes Aug 3, 2023

Our tasks won't spawn other tasks, right?

They might do - for example compression/decompression codecs may parallelise internally. Whether we want that will, I suppose, depend on whether we're writing a few big chunks or many small chunks. Does it even matter for our use case? I guess it might do if they're using the same outer executor. The benefit of an async implementation where we don't care about extra tasks being spawned at lower levels is that we can use basically the same mental model for "write this region into an array", "split this region into chunks and write each one to an array chunk", "divide up these chunks according to the sharding spec and write as necessary" etc.

But it's certainly possible that there is a better concurrency model for us - there are a bunch of actor model implementations in rust (some of which build on top of tokio), is that more what we're after?

JackKelly · 2023-08-04T12:46:48Z

Meeting times for our kick-off meetings!

Thank you to everyone who filled out the poll. Unfortunately there is no single time that everyone can make (because timezones 🙂). In fact, 3 people have 3 completely disjoint availabilities! So I'd propose that we have two kick-off meetings in Sept:

Monday 18 September 2023, 12:00 UTC UPDATE 2nd Oct: We're considering scrapping these Monday meetings. Please see discussion here.
- Can make it: Andreas Poehlmann (@ap--), Chris Barnes (@clbarnes), Norman Rzepka (@normanrz), Sanket Verma (@MSanKeys963)
- Can't make it: Akshay Subramaniam (@akshaysubr), Jeremy Maitin-Shepard (@jbms), Josh Moore (@joshmoore), Max Jones (@maxrjones)
- Subsequent meetings: Mon 2nd Oct, Mon 16th Oct, Mon 30th Oct, etc.
Thursday 28 September 2023, 16:00 UTC
- Can make it: Akshay Subramaniam, Andreas Poehlmann, Jeremy Maitin-Shepard , Max Jones, Sanket Verma
- Can't make it: Chris Barnes, Josh Moore, Norman Rzepka
- Subsequent meetings: Thu 12th Oct, Thu 26 Oct, etc.
  - From Sunday 5th Nov, 16:00 UTC will become 08:00 Pacific time. Which is too early! So maybe we could change to 17:00 UTC for Thu 9th Nov onwards?

If you can make both meetings, then please don't feel under any obligation to attend both meetings!

@joshmoore, I'm really sorry but neither of these meeting times works for you. Would 12:00 UTC work for you on Mon 2nd Oct and/or Mon 16th Oct?

11 replies

JackKelly Sep 14, 2023

Just a quick reminder that our first meeting (for folks in European timezones) is on Monday at 12 UTC! Looking forward to it! Please shout if you'd like to attend but you're not yet on the calendar invite.

olimcc Sep 18, 2023

I didn't fill out the poll either, but I'd like to join if possible? I can't make the 18th, but the 28th would be great. Thank you!

JackKelly Sep 18, 2023

Great, I've just added you to the invite for the 28th!

martindurant Sep 18, 2023
Maintainer

Please add me too

JackKelly Sep 18, 2023

Great, I've added you too, Martin!

Soon, @MSanKeys963 will move these regular calls to the Zarr Community Calendar, and set up Zoom for these calls.

martindurant · 2023-08-07T13:21:15Z

Maintainer

I wonder if we shouldn't have an explicit mention of super-zarr parallelism, particularly dask? The number of threads, async IO and batching will surely have a different impact when dask is also parallelising over the top. For instance, dask has long set thread spawning in the libraries it calls to one, since there is already about one thread per core at work.

Also, there are other parallel libraries out there - they may have the same concerns, but maybe not.

5 replies

JackKelly Aug 7, 2023

Sounds good! When we specify a set of "benchmarking workloads", should some of those workloads include dask?

(I'd guess it should. But I'm also wary of a combinatorial explosion of benchmarks!)

martindurant Aug 8, 2023
Maintainer

Yes, I think we should explicitly test against dask in a couple of configurations (threaded scheduler, distributed default). Are we in a place to know if other parallelism libraries are being used in conjunction with zarr?

JackKelly Aug 8, 2023

Sounds good - thanks for the suggestion.

Are we in a place to know if other parallelism libraries are being used in conjunction with zarr?

I regularly use Zarr with pytorch's DataLoader with num_workers set to greater than 1 (so pytorch uses multiple processes to load data in parallel). So maybe we should also have some benchmarks which do that, too.

JackKelly Aug 8, 2023

BTW, we are likely to also want to benchmark "other" Zarr implementations (TensorStore, Zarrita, etc.) So we may need a "universal", core set of benchmark workloads which can be run against all Zarr implementations. And then have an additional set of benchmark workloads which are just for Zarr implementations which play nicely with Dask and/or PyTorch (like zarr-python). Does that sound OK?

akshaysubr Sep 6, 2023

I also use pytorch DataLoader regularly with asynchronous workers and also use DALI frequently with zarr inside an ExternalSource operator.

JackKelly · 2023-08-08T17:38:43Z

(Just to flag up that I'll be going on family holiday within the next few days, and will then be on holiday for the rest of August. But please don't let that stop others discussing things! I just didn't want you to think I was being rude by not replying!)

0 replies

MSanKeys963 · 2023-09-19T22:57:34Z

Maintainer Author

Update: As discussed during the first meeting on 9/18, I've added bi-weekly meetings for the benchmarking & performance group to the Zarr Community Calendar. See #1479 (reply in thread).

Meeting details here: https://zarr.dev/community-calls/

0 replies

martindurant · 2023-09-28T17:32:34Z

Maintainer

My personal takeaways from the benchmarking presentation today, as I understood it. The workflow was a single pass over a large amount of uncompressed zarr data on local disk.

The majority of the IO time was in

zarr.storage.DirectoryStore._from_file, f.read()
zarr.core.Array._process_chunk, out[out_selection] = tmp (unaligned memcopy)

This suggests to me the low-hanging fruit of:

implement partial reads for uncompressed data (would also impact remote uncompressed reads)
implement f.readinto to avoid the memcopy in the case of contiguous selection (note that some libraries, particularly cramjam, offer decompress_into for the case of contiguous AND whole chunk selection)

Whether or not the memcopy would be noticeably faster could be tested by preallocating and reusing a sufficiently large numpy buffer and .readinto that, which is safe if the IO happens in a single thread. For strided copy, as in the benchmark workflow, I suspect it makes no difference.
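
A minimal sketch of that test setup, using only the standard library. A `bytearray` stands in for the preallocated numpy buffer, `read_chunk_into` is a hypothetical helper (not a Zarr-Python function), and the file is a throwaway temp file rather than a real chunk.

```python
import os
import tempfile


def read_chunk_into(path, buffer: bytearray) -> memoryview:
    """Read a file straight into a preallocated, reusable buffer and
    return a view of the bytes actually read."""
    view = memoryview(buffer)
    total = 0
    # buffering=0 gives a raw file object, so readinto fills `buffer` directly.
    with open(path, "rb", buffering=0) as f:
        while total < len(view):
            n = f.readinto(view[total:])
            if not n:  # 0 means end of file
                break
            total += n
    return view[:total]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk.bin")
    payload = bytes(range(256)) * 16
    with open(path, "wb") as f:
        f.write(payload)

    scratch = bytearray(1 << 16)  # allocated once, reused for every chunk
    view = read_chunk_into(path, scratch)
    ok = bytes(view) == payload

print(ok, len(view))  # True 4096
```

As noted above, reusing one scratch buffer like this is only safe while the IO happens in a single thread.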

I suspect that dask threading over the IO would make little or no difference since disk operations block, and the memcopy is probably saturating the bus, but it ought to be tried.

10 replies

jbms Sep 30, 2023

@MSanKeys963 @JackKelly Perhaps it would make sense at this point to create an initial repo, e.g. zarr-developers/zarr-benchmarks, and then we can start creating PRs to add benchmarks, etc.? Even if what initially goes into the repo is ultimately entirely replaced by something else, in order to better structure the benchmarks, capture more metrics, etc., I still think just having some benchmarks we can run will be helpful.

JackKelly Oct 1, 2023

I agree!

I've started work on a general framework for running IO-centric benchmarks here. It's not quite ready yet. But the API is shaping up (and is discussed here). But please do review and submit issues / PRs! 🙂 As a very rough timeline, I'd hope we'll be ready to start writing & running Zarr-specific benchmarks by mid-October.

But, I agree, it'd be even better to have a new git repo under zarr-developers. @MSanKeys963 if you can help set up a repo, that'd be great!

Should we have two repos (one for a general framework for running IO-centric benchmarks, and another repo for the Zarr-specific 'recipes' for creating benchmark datasets & running benchmarks)? That would hopefully enforce a strict separation of concerns, and would allow other folks to use the general framework. But maybe having two repos requires a tiny bit more admin than merging all the code into a single repo. Personally, I'd lean towards having two repos. But please shout if you'd prefer a single repo!

If we do have two repos, should they both be under GitHub.com/zarr-developers?

JackKelly Oct 2, 2023

Hurray! Following today's Zarr Benchmarking & Performance meeting (Europe-friendly time), the zarr-developers GitHub org now has two new repos:

perfcapture - A general-purpose benchmarking framework, focused on IO-intensive workloads. perfcapture doesn't contain any zarr-specific code.
zarr-benchmark - The Zarr-specific code for defining Zarr datasets & Zarr workloads. Built on top of perfcapture.

Huge thanks to @joshmoore for helping get this set up!

If, at a later date, we want to merge these two repos into one, then we can! My hope is that, by separating them, we enforce a strict separation of concerns, and allow other folks to use the general framework.

Now that we have these two new repos, I'd propose that we move away from using this huge Discussion thread, because it's pretty terrifyingly long for new folks! For general discussions around Zarr benchmarking, including discussing logistics for our Zoom meetings, I'd propose we use the Discussions forum under the new zarr-benchmark repo. For technical discussions, I'd propose we move to GitHub issues for perfcapture or zarr-benchmark.

JackKelly Oct 3, 2023

On the topic of Vincent's demo from last week:

I've been chatting to Vincent over email. Vincent has been super helpful!

With Vincent's guidance, I've implemented a very minimal benchmark workload which appears to reproduce the slow memmove_unaligned behaviour that Vincent was telling us about.

Here's the "recipe" which defines the benchmark dataset & workload.

And here are some results from Intel VTune on my local machine.

TBH, this is my first time running VTune, so I'm far from certain as to what all these cool measurements mean! But I'm excited to learn more! To my untrained eyes, it looks like VTune is telling us that there's lots of headroom for improvement 🙂

Technically, you could run this benchmark locally on your machines right now. But I'd suggest waiting at least a week or two for me to develop the code a bit more! It's in a very immature state right now!

martindurant Oct 3, 2023
Maintainer

It does at least make clear: the workload for that particular benchmark is not crossing chunk boundaries on each read.

martindurant · 2023-10-06T13:05:22Z

Maintainer

It occurs to me that parallel decompression in the existing async chunk loading logic should be pretty simple: use run_in_executor to farm such CPU tasks out to threads. So long as the algorithm releases the GIL, this would be enough.

8 replies

martindurant Oct 10, 2023
Maintainer

run_in_executor will use a ProcessPoolExecutor which uses multiprocessing internally.

no, you can use threads https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor

JackKelly Oct 10, 2023

edit Martin & I wrote our replies at (almost) the same time! The reply below is replying to Chris' post.

Great point. I agree: any solution which involves pickling chunks to send them between processes is likely to be slow (and inefficient).

Instead of run_in_executor, can we use concurrent.futures.ThreadPoolExecutor to run the codecs in parallel (to avoid pickling chunks to pass them between processes)? If the codecs release the GIL, then ThreadPoolExecutor should use multiple CPU cores in parallel (if I've understood this discussion correctly)?

Forgive my ignorance, but does Zarr-Python load chunk n from disk, then decompress chunk n, then load chunk n+1, then decompress chunk n+1?

Or does Zarr-Python load all the chunks from disk, and then decompress all the chunks?

martindurant Oct 10, 2023
Maintainer

(Note that you still need run_in_executor if you want to offload from async, since it gives back an awaitable; but the executor can be the thread one.)

Forgive my ignorance, but does Zarr-Python load chunk n from disk, then decompress chunk n, then load chunk n+1, then decompress chunk n+1?

For some storage backends, it fetches all the chunks required for the current call (i.e., always within one array) asynchronously. This might be the fsspec backend only. Right now, it then goes on to decompress these sequentially. If we are already fetching concurrently, it makes sense to parallelise the decompression even though one is IO and one CPU bound.

If not running async (e.g. local disk), it might make sense to decompress one chunk in a thread while the next is loading; but the "loading" one will I think always hold the GIL, so this might not help anything.
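To make the non-async idea concrete, here is a rough sketch of overlapping a blocking read with the decompression of the previous chunk. `load_pipelined` and the `read_chunk` callable are hypothetical, zlib stands in for the codec, and whether this helps in practice depends on whether the read releases the GIL.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def load_pipelined(read_chunk, n_chunks):
    """Decompress chunk i in a worker thread while chunk i+1 is being read.

    read_chunk(i) is a hypothetical blocking reader returning compressed bytes.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # future for the chunk currently being decompressed
        for i in range(n_chunks):
            compressed = read_chunk(i)  # blocking read on the main thread
            if pending is not None:
                results.append(pending.result())  # collect the previous chunk
            pending = pool.submit(zlib.decompress, compressed)
        if pending is not None:
            results.append(pending.result())  # collect the last chunk
    return results
```

Only one decompression is in flight at a time, so memory use stays at roughly two chunks.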

akshaysubr Oct 12, 2023

Would it make sense to add this multi-threaded decoding in the numcodecs.Codec base class? If the base class implements a decode_batch method like in this proposal, then zarr-python can just call that on as many chunks as it has available. Potentially even making that a tunable parameter.

This way, all the parallel decompression stuff is encapsulated in numcodecs and zarr doesn't have to prescribe the parallelism approach. All existing codecs, for example, will transparently see the benefit of this parallelism, and any codecs that want to use a different parallelization approach can still just stick to numcodecs and override the decode_batch method.

One thing I'm not sure about though is if this approach can also potentially allow making the decode_batch method an async generator and if it does, can that enable more effective pipelining?
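A rough sketch of what the decode_batch idea could look like. This is an assumption-laden illustration: `BatchCodec`, `ZlibCodec` and `SerialCodec` are hypothetical classes written for this example and are not the real numcodecs.Codec API, which (as far as this thread indicates) only has the method as a proposal.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class BatchCodec:
    """Hypothetical base class; NOT the real numcodecs.Codec."""

    max_workers = 4  # the "tunable parameter" mentioned above (assumed name)

    def decode(self, buf):
        raise NotImplementedError

    def decode_batch(self, bufs):
        # Default: decode every buffer in a thread pool. This only pays off
        # for codecs whose decode() releases the GIL.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(self.decode, bufs))


class ZlibCodec(BatchCodec):
    """Inherits the threaded decode_batch without any extra code."""

    def decode(self, buf):
        return zlib.decompress(buf)


class SerialCodec(ZlibCodec):
    """A codec that opts out and supplies its own batching strategy."""

    def decode_batch(self, bufs):
        return [self.decode(b) for b in bufs]
```

The caller only ever calls `codec.decode_batch(list_of_buffers)`, so the choice between threads, a GPU batch, or a plain loop stays inside the codec.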

martindurant Oct 12, 2023
Maintainer

Would it make sense to add this multi-threaded decoding in the numcodecs.Codec base class?

Maybe? (It should only be used for codecs that we know release the GIL)
But in the case that the data chunks are coming in in parallel, this would mean something two-stage: wait for all the bytes, then decompress them in parallel.

Going back up the thread a little, I would like to repeat my earlier observations from the last benchmarking presentation, of the low-hanging fruit that will improve performance:

partial load (particularly for uncompressed, but also for anything else that might support it)
load/decompress directly into the output buffer (some codecs do this, we should enumerate which)
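For the uncompressed case, both points can be illustrated together with only the standard library: seek to the byte range that is needed and read it straight into the destination buffer, with no intermediate copy. `read_partial_into` and the file name `chunk0` are hypothetical; this is a sketch of the technique, not how Zarr-Python does it.

```python
import os
import tempfile


def read_partial_into(path, offset, out):
    """Read len(out) bytes starting at `offset` directly into the writable buffer `out`."""
    view = memoryview(out)
    with open(path, "rb") as f:
        f.seek(offset)
        filled = 0
        while filled < len(view):
            n = f.readinto(view[filled:])  # writes into `out`, no temporary bytes object
            if not n:
                raise EOFError("chunk file shorter than requested range")
            filled += n
    return filled


# Usage: copy bytes 10..19 of an uncompressed chunk into positions 5..14 of the output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk0")
    with open(path, "wb") as f:
        f.write(bytes(range(100)))
    output = bytearray(30)
    read_partial_into(path, 10, memoryview(output)[5:15])
```

The same pattern applies to a NumPy output array, since a contiguous array slice also exposes a writable buffer.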

MSanKeys963 · 2023-10-16T16:20:20Z

MSanKeys963
Oct 16, 2023
Maintainer Author

Hi everyone! 👋🏻

Update: We're trying to find a new time for the bi-weekly meetings for this group to avoid ongoing conflicts.
Please fill out the poll here: https://whenisgood.net/2k5a3jy.

I've also emailed everyone from the group, but in case I missed you, please refer here.

After the results, I'll update the community calendar as well. Thank you everyone for your time and efforts! Appreciate it!

0 replies

JackKelly · 2023-10-20T18:10:12Z

JackKelly
Oct 20, 2023

I've published some detailed performance analyses (using Intel VTune and Zarr-Benchmark) of Zarr-Python (and numpy) here: zarr-developers/zarr-benchmark#22

0 replies

JackKelly · 2023-11-09T11:01:49Z

JackKelly
Nov 9, 2023

Just a quick reminder that we have a Zarr Benchmarking & Performance Zoom meeting today! Details are on the Zarr Community Calendar. And here's the agenda. Looking forward to it 🙂

0 replies

JackKelly · 2023-12-07T13:11:19Z

JackKelly
Dec 7, 2023

Oooh, it turns out that the NVMe v2 standard enables SSDs to provide key-value storage (and compression) on the device. And it can support tiny chunks (down to 1 byte!). So the SSD can basically do everything we need for a Zarr Store! More details here (sorry for cross-posting):

0 replies
