Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support fixed and dynamic length allele strings in IO readers #98

Closed
eric-czech opened this issue Aug 7, 2020 · 2 comments · Fixed by #116
Closed

Support fixed and dynamic length allele strings in IO readers #98

eric-czech opened this issue Aug 7, 2020 · 2 comments · Fixed by #116
Labels
IO Issues related to reading and writing common third-party file formats

Comments

@eric-czech
Copy link
Collaborator

We've made special cases in the plink/bgen readers to coerce alleles to a fixed length string type, e.g. sgkit-dev/sgkit-plink#12. This requires a full scan of the alleles to get the max length so it adds a minor delay to loading times, but makes storage in-memory or otherwise more compact if the allele lengths don't vary widely. Apart from the delay for in building what would otherwise be a fully lazily defined data structure, it can also be a poor default. The maximum allele size in UKB PLINK data is 256 so the fixed-length strings are suboptimal.

This has also come up with vcf here: https://github.com/pystatgen/sgkit/pull/40#issuecomment-669948502.

I propose that we change create_genotype_call_dataset to allow fixed or object types for alleles (as @alimanfoo suggested above) and make object type the default for alleles in IO readers. As far as I know it will be more common to start a workflow with a mix of variant types and then possibly filter to SNVs so any conversion to fixed-length should take place after that. Perhaps we can broaden the scope of https://github.com/pystatgen/sgkit/issues/90 a bit to make that function a part of the top-level API.

@jeromekelleher
Copy link
Collaborator

I think using object types is a good idea. In practise, the amount of space we save by having fixed size types for alleles is minimal and it creates a lot of awkward corner-cases. It's a premature optimisation, basically.

@hammer hammer added the IO Issues related to reading and writing common third-party file formats label Aug 12, 2020
@tomwhite
Copy link
Collaborator

Relevant Zarr documentation: https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Issues related to reading and writing common third-party file formats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants