Zarr is a cloud-native, chunked, compressed, and hierarchical array data format.
- Existing resources
- Introductory videos
- Zarr V3
- Libraries
- Platforms
- Articles
- Talks & Videos
- Life sciences
The Zarr website is already an excellent resource for learning about Zarr and its ecosystem. This page complements it with a curated, opinionated selection of resources.
The focus is on Geo/Earth sciences, but the list is not limited to that domain.
Lists
- The Zarr website already contains great lists: Zarr Implementations, Zarr Datasets, Zarr metadata conventions
- Zarr tutorials (zarr-developers/tutorials)
- Projects using Zarr (zarr-developers/community#19)
- Beautiful Zarr (zarr-developers/beautiful-zarr)
- See playlists & lists in Talks & Videos
Introductory talks YouTube playlist
Two excellent and up-to-date introductory talks:
Zarr V3 is the upcoming version of Zarr. It is a major update that will bring many new features and improvements.
If you're getting into Zarr now, it might be a good idea to start with Zarr V3.
For an excellent in-depth overview, see the ESIP series of talks:
- 2023-03-27 ESIP Cloud Computing Cluster: Zarr - The Next Generation
- 2023-04-24 ESIP Cloud Computing Cluster: Next Generation of Zarr Part 2/3 GeoZarr and Zarr Sharding
- 2023-05-22 ESIP CCC: Next Gen Zarr Part 3/3: accumulation proposal, Kerchunk and Pangeo-Forge
This section lists libraries that directly relate to Zarr.
For implementations of Zarr, see Zarr Implementations.
- kerchunk, see kerchunk section
- xpublish: Serving and consuming Zarr through a REST API
- See also routers at xpublish-community, e.g. xpublish-opendap
- Improving Access to NOAA NOS Model Data with Kerchunk and Xpublish
- ndpyramid: utility for generating ND array pyramids using Xarray and Zarr
Storage & I/O
- TensorStore and xarray-tensorstore: library for efficiently reading and writing large multi-dimensional arrays, with a Zarr driver
- KvikIO: C++ and Python bindings to cuFile, enabling GPUDirect Storage
- rechunker: disk-to-disk transformation for chunked arrays
- xpartition: writing large xarray datasets to Zarr. Works around shortcomings of Dask (distributed#6360)
ETL
- Xarray: Zarr is commonly written and accessed through xarray's API.
- Xarray has its own Zarr Encoding Specification
- xarray-beam: Integration of xarray and Apache Beam built using Zarr.
- Pangeo-forge: Open-source data platform for transforming datasets into analysis-ready cloud-optimized formats.
Developer-oriented
- numcodecs: Compression and transformation codecs used by Zarr
- pydantic-zarr: Pydantic models for Zarr objects
- traverzarr: Traversing Zarr JSON as if it's a filesystem
- zarr_checksum: Calculating checksum information from Zarr
- zarrdump: Describe zarr stores from the command line
Visualization: for visualization tools & libraries, see the visualization section
Kerchunk allows you to efficiently read chunked data formats such as GRIB, NetCDF, and COGs by exposing them as a Zarr store.
Talks and tutorials
- All you need is Zarr
- 2022 ESIP Kerchunk Tutorial
- Accessing NetCDF and GRIB file collections as cloud-native virtual datasets using Kerchunk
In the future, Kerchunk will be split into upstream functionality in Zarr itself and a new VirtualiZarr package.
- Kerchunk JSON references will become a part of the Chunk manifest
- For a full overview, see Upstreaming Kerchunk
- What's Next for Kerchunk
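The core idea behind Kerchunk is a small JSON "reference set": Zarr metadata is stored inline, while chunk keys point at byte ranges inside the original files. The sketch below shows the version-1 reference layout with hypothetical paths, offsets, and lengths; libraries such as fsspec's `ReferenceFileSystem` turn a mapping like this into a store that Zarr readers can open directly.

```python
import json

# A minimal kerchunk reference set (version 1 layout). All file paths,
# offsets, and lengths below are hypothetical placeholders.
refs = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temperature/.zarray": json.dumps({
            "zarr_format": 2, "shape": [720, 1440], "chunks": [720, 1440],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "order": "C", "filters": None,
        }),
        # Chunk keys map to [url, offset, length]: read those bytes
        # from the original (e.g. NetCDF) file instead of copying them.
        "temperature/0.0": ["s3://bucket/data.nc", 20480, 4147200],
    },
}

print(json.dumps(refs)[:80])
```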
- Arraylake: a data lake platform based on Zarr. The company behind it, Earthmover, was founded by core Zarr developers.
- NASA IMPACT: Zarr Visualization Report
- Earthmover: cloud-native data loaders for machine learning using Zarr and Xarray
- Zarr Sprint Recap: relevant overviews
Existing lists
- Zarr Developers playlists, namely
- Zarr Talks
- Introductory videos in this list
Talks
- Earthmover Webinar: Building a Planetary Scale Earth Observation Data Cube in Zarr with code repository and slides
- Earthmover Webinar: Analysis-ready Weather Forecast Data Cubes with Zarr with code repository and slides
- Presentation | Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage
- Presentations for Sanket Verma's talks: SciPy 2023 and PyCon DE 2023
Zarr has seen great adoption in the life sciences domain.
- bdz: Zarr-based format for storing quantitative biosystems dynamics data
- ome-zarr-py: Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
- ez_zarr: Easy, high-level access to OME-Zarr filesets
- hdmf-zarr: Zarr I/O backend for HDMF
Talks and resources
- Zarr | Life Science Lightning Talk | Trevor Manz | Dask Summit 2021
- Accelerating Single-cell Bioinformatics with N-dimensional Arrays in the Cloud | ISMMS
- What are next-generation file formats (NGFF)?
Zarr has seen most work on visualization in the bioimaging community:
- List: Image viewers with OME-Zarr support
- WEBKNOSSOS: web-based visualization & annotation tool, supports OME-Zarr
- Napari: interactive viewer
- Vizarr: interactive viewer built using viv (OME-Zarr and OME-TIFF)
- Neuroglancer: WebGL-based viewer for volumetric data
- BigDataViewer
For a general overview, see
Essentially all other common array data formats can be exposed as Zarr. See Kerchunk.
Zarr, NetCDF, and HDF5 are three separate data formats that nonetheless relate to each other in multiple ways.
- Zarr inherits its hierarchical structure from HDF5.
- Zarr is commonly accessed through Xarray, whose data model is based on the NetCDF data format.
- NetCDF4 files use HDF5 as their underlying storage format.
- NCZarr is an extension of the Zarr format that maps it to a subset of the NetCDF data model.
Resources
- A Comparison of HDF5, Zarr, and netCDF4 in Performing Common I/O Operations
- Pangeo: HDF5 at the speed of Zarr
- Joe Jevnik: Zarr vs. HDF5 | PyData New York 2019
Zarr and N5 are two similar array data formats that share common goals and development.
The Zarr V3 spec aims to provide a common implementation target (sources: 1, 2)
Links
- n5
- zarr.n5
- z5: C++ and Python interface for datasets in zarr and n5 format
- Zarr N5 spec diff (zarr-specs#3)
GeoZarr is a proposal for a Zarr-based geospatial data format, being submitted as an OGC standard
GeoZarr will define a metadata convention for Zarr stores that contain geospatial data.
It will also define the relationship of Zarr with CF and NetCDF.
Links
STAC provides a common structure for describing and cataloging spatiotemporal assets.
With its hierarchical structure and key-value metadata support, Zarr's capabilities overlap significantly with STAC.
The communities have not yet converged on a canonical representation of Zarr datasets through STAC.
Today, a good example of exposing Zarr in STAC is Planetary Computer
- Reading Zarr Data
- STAC collection: Daymet Annual North America
- STAC collection: CIL Global Downscaled Projections for Climate Impacts Research
- xstac: STAC from xarray
- Related STAC extensions: xarray-assets, datacube
More discussion & Related links
- Pangeo: Metadata duplication on STAC zarr collections
- geozarr-spec#32: Integration of Zarr with STAC Catalogs
- stac-spec#781: Zarr Extension?
- Tom Augspurper: STAC and Kerchunk
- Presentation | Daniel Jahn: STAC vs Zarr
- Arraylake: a data lake platform that is arguably the first example of a pure Zarr data catalog
In the future, the Zarr V3 Spec and GeoZarr convention will likely enable greater interoperability between STAC and Zarr.