Cloud store support #129

Open · jeromekelleher opened this issue Jan 30, 2025 · 10 comments
@jeromekelleher (Contributor)

We need to have cloud store support soon, so it would be good to spec out the options here.

From an interface perspective, I guess we'll need to support various side-channels for passing authentication tokens.

I guess the first choice is whether to use fsspec or to work directly with S3Map, ABSStore, etc.

fsspec would seem like less work, and should be performant enough for our purposes here?

Hmm, it looks like fsspec doesn't support Azure directly, though. Since S3 and Azure are the most immediately important targets here, I wonder whether there's much actual value in using fsspec.
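For context, a rough sketch of what the two options might look like for reading a vcz from S3 (assuming Zarr-Python v2 and s3fs; bucket names and credentials are placeholders):

```python
import s3fs
import zarr

# Option 1: go through fsspec and let it resolve the URL; credentials are
# passed via storage_options (or picked up from the environment).
root = zarr.open_group(
    "s3://example-bucket/sample.vcf.vcz",
    mode="r",
    storage_options={"key": "...", "secret": "..."},
)

# Option 2: work directly with s3fs's S3Map and hand Zarr the mapping.
fs = s3fs.S3FileSystem(key="...", secret="...")
store = s3fs.S3Map(root="example-bucket/sample.vcf.vcz", s3=fs, check=False)
root = zarr.open_group(store, mode="r")
```

Either way the authentication side-channel ends up being a dict of credentials or environment variables; the difference is mostly who constructs the filesystem object.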

@tomwhite any thoughts here?

@tomwhite (Contributor)

We could consider obstore, which supports S3 and Azure: https://developmentseed.org/obstore/latest/

It's being integrated into Zarr here: zarr-developers/zarr-python#1661 - that PR is very close to completion, I think. We might consider moving vcztools to use Zarr-Python v3, which would be low-risk since vcztools only needs to read Zarr. (Bio2zarr writes Zarr, so it is best left on Zarr-Python v2 for the time being.) Or we could at least require v3 for cloud support?
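If that PR lands roughly in its current shape, reading from S3 could look something like the following hypothetical sketch (the ObjectStore wrapper name and constructor arguments come from the PR and may well change):

```python
import zarr
from obstore.store import S3Store
from zarr.storage import ObjectStore  # per zarr-developers/zarr-python#1661

# obstore speaks S3 and Azure natively (there is an AzureStore counterpart);
# the bucket name and region here are placeholders.
store = S3Store("example-bucket", region="us-east-1")
root = zarr.open_group(ObjectStore(store, read_only=True), mode="r")
```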

@jeromekelleher (Contributor, Author)

Yeah, v3 for vcztools does sound sensible, as we'll want async chunk downloading too.

Obstore looks promising. I guess we'd have to try out a few of these and see how well they work with the key clouds.
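On the async point: obstore exposes async variants of its operations, which should fit concurrent chunk downloading. A hypothetical sketch, assuming obstore's get_async/bytes_async API (the chunk keys are placeholders):

```python
import asyncio

import obstore
from obstore.store import S3Store

async def fetch_chunks(store, keys):
    # Issue all chunk requests concurrently rather than one at a time.
    results = await asyncio.gather(*(obstore.get_async(store, k) for k in keys))
    return [await r.bytes_async() for r in results]

# e.g. asyncio.run(fetch_chunks(S3Store("example-bucket"), ["call_genotype/0.0.0"]))
```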

@tomwhite (Contributor) commented Feb 3, 2025

As a proof of concept, I ran some of the unit tests using obstore (and the Zarr PR #1661) against files on the local filesystem, and they passed.

I also uploaded a vcz to S3 and ran the following, using this modification to vcztools:

```
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
vcztools view s3://sgkit-dev-data/sample.vcf.vcz/
```

Which produced the expected output:

```
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=X>
##vcztools_viewCommand=view s3://sgkit-dev-data/sample.vcf.vcz/; Date=2025-02-03 14:04:31.277127
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
19	111	.	A	C	9.6	.	.	GT:HQ	0|0:10,15	0|0:10,10	0/1:3,3
19	112	.	A	G	10	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
20	14370	rs6054257	G	A	29	PASS	H2;AF=0.5;NS=3;DB;DP=14	GT:HQ:GQ:DP	0|0:51,51:48:1	1|0:51,51:48:8	1/1:.,.:43:5
20	17330	.	T	A	3	q10	AF=0.017;NS=3;DP=11	GT:HQ:GQ:DP	0|0:58,50:49:3	0|1:65,3:3:5	0/0:.,.:41:3
20	1110696	rs6040355	A	G,T	67	PASS	AF=0.333,0.667;NS=2;DB;DP=10;AA=T	GT:HQ:GQ:DP	1|2:23,27:21:6	2|1:18,2:2:0	2/2:.,.:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:HQ:GQ:DP	0|0:56,60:54:.	0|0:51,51:48:4	0/0:.,.:61:2
20	1234567	microsat1	G	GA,GAC	50	PASS	AC=1,1;NS=3;AN=4;DP=9;AA=G	GT:GQ:DP	0/1:.:4	0/2:17:2	./.:40:3
20	1235237	.	T	.	.	.	.	GT	0/0	0|0	./.
X	10	rsTest	AC	A,ATG,C	10	PASS	.	GT	0	0/1	0|2
```

So it looks promising. It should be straightforward to use obstore only for cloud stores for now and leave local files on the current code paths, along the lines of the dispatch sketch below.
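A hypothetical sketch of that dispatch (open_cloud_store and the scheme set are illustrative, not existing vcztools code):

```python
from urllib.parse import urlparse

import zarr

CLOUD_SCHEMES = {"s3", "az", "gs"}  # illustrative; extend as needed

def open_vcz(path):
    # Route cloud URLs through an obstore-backed path; keep local files
    # on the existing code path.
    if urlparse(str(path)).scheme in CLOUD_SCHEMES:
        return open_cloud_store(path)  # hypothetical obstore-backed helper
    return zarr.open_group(str(path), mode="r")
```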

@jeromekelleher (Contributor, Author)

Amazing! Is this on Zarr 2 or Zarr 3?

@tomwhite (Contributor) commented Feb 3, 2025

Zarr-Python 3 - the obstore integration only works with that.

@tomwhite (Contributor) commented Feb 3, 2025

I did another test, and running the same command with Zarr-Python v2 and fsspec (specifically pip install s3fs) works with no changes to vcztools:

```
vcztools view s3://sgkit-dev-data/sample.vcf.vcz/
```
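For what it's worth, the reason no changes are needed (as I understand it) is that Zarr-Python v2 routes URL-style store paths through fsspec, so once s3fs is installed the s3:// prefix resolves automatically:

```python
import zarr

# With s3fs installed and AWS credentials in the environment, Zarr-Python v2
# opens the s3:// URL via fsspec with no vcztools changes.
root = zarr.open_group("s3://sgkit-dev-data/sample.vcf.vcz", mode="r")
```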

@jeromekelleher (Contributor, Author)

> I did another test, and running the same command with Zarr-Python v2 and fsspec (specifically pip install s3fs) works with no changes to vcztools:

Wow, literally no changes at all??

Something to discuss, but I think it might be better to stick with one Zarr version across the different packages if we can, so maybe the fsspec route is the easiest path for now.

@tomwhite (Contributor) commented Feb 3, 2025

> Something to discuss, but I think it might be better to stick with one Zarr version across the different packages if we can, so maybe the fsspec route is the easiest path for now.

I agree. We could release what we have now (which will work with S3), and then integrate obstore for Azure (and asyncio) later.

@tomwhite (Contributor) commented Feb 3, 2025

For this release we'll just document how to run on S3 using fsspec.

@benjeffery (Contributor)

There are some good tools for mocking out S3 and Azure that we could use in CI: moto and pytest-azure.
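A minimal sketch of what a moto-backed test could look like (assuming moto >= 5 and its mock_aws context manager; bucket and key names are placeholders):

```python
import boto3
import pytest
from moto import mock_aws

@pytest.fixture
def s3_client(monkeypatch):
    # Fake credentials so boto3 never talks to real AWS (per moto's docs).
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    with mock_aws():
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket="test-vcz")
        yield client

def test_round_trip_object(s3_client):
    s3_client.put_object(Bucket="test-vcz", Key="chunk", Body=b"data")
    body = s3_client.get_object(Bucket="test-vcz", Key="chunk")["Body"].read()
    assert body == b"data"
```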
