read_zarr failed in rust: incompatible fill value 0 for data type string #15

Open
zqfang opened this issue Jan 28, 2025 · 6 comments
Labels
help wanted Extra attention is needed


@zqfang
Contributor

zqfang commented Jan 28, 2025

Hi Kai,

Thank you so much for the amazing implementation of anndata in rust. It helps a lot with my research.

I need your help with reading the Zarr file.

I've been testing Zarr input, but reading the file fails with the error below. I can't figure out what the error means, and I hope you can help me with it.

called `Result::unwrap()` on an `Err` value: incompatible fill value 0 for data type string

The code I used was:

        let path = "/Users/fangzq/Data/NK/ZZM_panNK_test5.zarr";
        let pbuf = Zarr::open::<PathBuf>(path.to_string());
        let adata = AnnData::<Zarr>::open(pbuf.unwrap()).unwrap();

I used scanpy to convert an h5ad file to Zarr. The h5ad file works great in Rust, but the Zarr version does not.

The Zarr file I used is attached:

ZZM_panNK_test5.zarr.zip

@kaizhang added the "help wanted" (Extra attention is needed) label Jan 29, 2025
@kaizhang
Owner

The Zarr backend is still in beta. There is no guarantee that it is compatible with Zarr files generated by other programs. Zarr development is not on my priority list, because one of the main issues with Zarr is that it generates a lot of small files, which degrades hard drive performance. Because of this, Zarr is less convenient to use than HDF5.

@zqfang
Contributor Author

zqfang commented Jan 29, 2025

I see. I can help if you can give me some hints or directions to work with.

Zarr support is important because it does not depend on the HDF5 C library, which makes anndata-rs more portable across platforms.

@LDeakin

LDeakin commented Jan 30, 2025

@zqfang that error is raised by zarrs because 0 is not interpreted as a valid fill value for Zarr V2 string data. It isn't really in the spec, but I'd welcome a PR to support it. A workaround similar to the one for bool data with an integer fill value is needed:

https://github.com/LDeakin/zarrs/blob/d5a5021dbc5e117ecf51c9e72ce019f107dc091b/zarrs_metadata/src/v2_to_v3.rs#L113-L127
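To make the idea concrete, here is a minimal Python sketch of that kind of compatibility rule. It is not the zarrs implementation: the function names are hypothetical, the string detection is a heuristic, and mapping a `0` fill value to an empty string for string arrays is an assumption about what a sensible default would be.

```python
from typing import Any


def is_v2_string_array(meta: dict) -> bool:
    """Heuristic check on Zarr V2 array metadata (the parsed .zarray JSON).

    Treats fixed-length "S"/"U" dtypes, and object dtype ("|O") with a
    "vlen-utf8" filter, as string arrays.
    """
    dtype = meta.get("dtype")
    if not isinstance(dtype, str):
        return False  # structured dtypes are lists; out of scope here
    if dtype[1:2] in ("S", "U"):
        return True
    filters = meta.get("filters") or []
    return dtype == "|O" and any(f.get("id") == "vlen-utf8" for f in filters)


def normalize_v2_fill_value(meta: dict) -> Any:
    """Return a fill value that is consistent with the array's dtype."""
    fill = meta.get("fill_value")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_plain_int = isinstance(fill, int) and not isinstance(fill, bool)
    if meta.get("dtype") == "|b1" and is_plain_int:
        return fill != 0  # integer fill value on bool data -> bool
    if is_v2_string_array(meta) and is_plain_int and fill == 0:
        return ""  # ASSUMPTION: treat 0 on string data as the empty string
    return fill  # everything else is passed through unchanged
```

For example, `normalize_v2_fill_value({"dtype": "|b1", "fill_value": 1})` returns `True`, while numeric arrays such as `{"dtype": "<i4", "fill_value": 0}` are left untouched.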


> one of the main issue with zarr is that it generates a lot of small files which degrades the hard drive performance

@kaizhang that could be addressed with the sharding codec when anndata/scanpy support Zarr V3 (scverse/anndata#1726).
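As a rough illustration of why sharding helps with the small-files problem, the sketch below counts how many chunk objects a directory store could hold with and without sharding. The array shape, chunk shape, and shard grouping are hypothetical numbers chosen for the arithmetic, and the counts are upper bounds, since stores may skip chunks that contain only the fill value.

```python
import math


def count_chunks(shape, chunk_shape):
    """Number of chunks needed to tile an array (one file each without sharding)."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))


def count_shards(shape, chunk_shape, chunks_per_shard):
    """Number of shard files when several chunks are packed into each shard."""
    shard_shape = [c * n for c, n in zip(chunk_shape, chunks_per_shard)]
    return count_chunks(shape, shard_shape)


# Hypothetical cells x genes matrix with 1000 x 1000 chunks.
shape = (100_000, 30_000)
chunk_shape = (1_000, 1_000)

print(count_chunks(shape, chunk_shape))            # 3000
print(count_shards(shape, chunk_shape, (10, 10)))  # 30
```

With these numbers, grouping 10 x 10 chunks per shard reduces the file count from 3000 to 30 while keeping the same chunk granularity for reads.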

@zqfang
Contributor Author

zqfang commented Jan 30, 2025

@LDeakin, thank you so much for your insights! You saved my day!

So the issue is that anndata writes a fill value of 0 for string arrays in Zarr V2, and zarrs does not accept this.

After I changed the fill_value of the string arrays with the code below, the Rust Zarr backend of anndata read my file successfully.

import base64

import zarr

# Change the fill value from 0 to "MA==".
# See the fill value encoding rules: https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html
# "the fill value MUST be encoded as an ASCII string using the standard Base64 alphabet"
print(base64.b64encode(b"0").decode())  # prints MA==

# Open the existing store for reading and writing ("r+" is the documented mode for this).
store = zarr.open("ZZM_panNK_test5.zarr", mode="r+")
# Update the fill value; this is just one of the string arrays.
store["obs/_index"].fill_value = "MA=="
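Since the snippet above patches only one array, a helper like the following could list every other array that needs the same treatment. It is a sketch under a few assumptions: the store is a plain Zarr V2 directory store (metadata in `.zarray` JSON files), string arrays are either fixed-length `S`/`U` dtypes or object dtype with a `vlen-utf8` filter, and the function name is hypothetical. It only reads metadata and does not modify the store.

```python
import json
from pathlib import Path


def find_string_arrays_with_zero_fill(store_dir):
    """Return paths (relative to the store root) of string arrays whose fill value is 0."""
    root = Path(store_dir)
    hits = []
    for zarray in sorted(root.rglob(".zarray")):
        meta = json.loads(zarray.read_text())
        dtype = meta.get("dtype")
        filters = meta.get("filters") or []
        # Heuristic string detection; structured dtypes (lists) are skipped.
        is_string = isinstance(dtype, str) and (
            dtype[1:2] in ("S", "U")
            or (dtype == "|O" and any(f.get("id") == "vlen-utf8" for f in filters))
        )
        fill = meta.get("fill_value")
        # Exclude booleans, because False == 0 in Python.
        if is_string and not isinstance(fill, bool) and fill == 0:
            hits.append(zarray.parent.relative_to(root).as_posix())
    return hits
```

Each returned path, for example `obs/_index`, could then be used as the key in the `store[...]` assignment shown above.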

@ilan-gold

Just to chime in here, zarr v3 is still missing a number of features "needed" for anndata: https://zarr.readthedocs.io/en/stable/user-guide/v3_migration.html#work-in-progress

But we can definitely start hacking, and maybe provide our own fill-in implementations, although the first step was simply upgrading the version of the Python package before the format. For example, I'm not sure how important things like structured arrays really are, and even so, we might be able to write a Python codec or the like to provide a bridge. I'm excited to try out sharding as well :)

@ilan-gold

I just checked to be sure, and most of the failures with Zarr file-format version 3 are related to structured arrays. So if this package does not rely on them (I'm not sure how it would, since I don't think zarrs handles them), some tests could in theory be added and maybe checked against scverse/anndata#1726 if you wanted to see performance/interop. I intend to release this in the next anndata release (0.12), but structured arrays in Zarr v3 may take longer, although there is an active discussion happening: zarr-developers/zarr-python#2134


4 participants