We've made special cases in the plink/bgen readers to coerce alleles to a fixed-length string type, e.g. sgkit-dev/sgkit-plink#12. This requires a full scan of the alleles to get the max length, so it adds a minor delay to loading times, but it makes storage, in memory or otherwise, more compact if the allele lengths don't vary widely. Apart from the delay in building what would otherwise be a fully lazily defined data structure, it can also be a poor default. The maximum allele size in UKB PLINK data is 256, so fixed-length strings, which pad every allele to that length, are suboptimal there.
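To make the trade-off concrete, here is a minimal NumPy sketch. The allele list is made up for illustration (mostly single-base alleles plus one 256-character allele) and is not taken from UKB or from any sgkit reader; it only shows how a single long allele inflates a fixed-length array.

```python
import numpy as np

# Hypothetical alleles: 1000 single-base alleles plus one 256-character allele.
alleles = ["A", "C", "G", "T"] * 250 + ["A" * 256]

# Fixed-length bytes: the longest allele sets the item size for every element,
# and finding that length needs a full pass over the data.
max_len = max(len(a) for a in alleles)
fixed = np.array(alleles, dtype=f"S{max_len}")

# Object dtype: the array stores one pointer per element. Note that nbytes
# counts only those pointers, not the Python str objects they refer to.
obj = np.array(alleles, dtype=object)

print("fixed-length item size:", fixed.dtype.itemsize)
print("fixed-length array bytes:", fixed.nbytes)
print("object array pointer bytes:", obj.nbytes)
```

With lengths this skewed, almost all of the fixed-length buffer is padding. When every allele has the same length, the fixed-length layout is the more compact one.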
I propose that we change create_genotype_call_dataset to allow fixed or object types for alleles (as @alimanfoo suggested above) and make object type the default for alleles in IO readers. As far as I know, it will be more common to start a workflow with a mix of variant types and then possibly filter to SNVs, so any conversion to fixed-length should take place after that. Perhaps we can broaden the scope of https://github.com/pystatgen/sgkit/issues/90 a bit to make that function a part of the top-level API.
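As a rough sketch of the "filter first, then convert" order, the following uses plain NumPy. `to_fixed_length` is a hypothetical helper written for this example, not an existing sgkit function, and the small `variant_allele` array is invented data.

```python
import numpy as np


def to_fixed_length(alleles: np.ndarray) -> np.ndarray:
    """Hypothetical helper: copy an object-dtype allele array into fixed-length bytes."""
    # Full scan to find the longest allele; fall back to 1 for empty input.
    max_len = max((len(a) for a in alleles.ravel()), default=1)
    return np.array(alleles.tolist(), dtype=f"S{max(max_len, 1)}")


# Invented example: three variants with two alleles each, stored as objects.
variant_allele = np.array(
    [["A", "C"], ["G", "GTTA"], ["T", "A"]], dtype=object
)

# Keep only variants where every allele is a single base (SNVs).
is_snv = np.array([all(len(a) == 1 for a in row) for row in variant_allele])

# Convert after filtering, so the item size is 1 byte rather than 4.
snv_alleles = to_fixed_length(variant_allele[is_snv])
print(snv_alleles.dtype, snv_alleles.shape)
```

Converting after the filter means the scan for the maximum length only touches the variants that are kept, and the resulting item size reflects them rather than the longest allele in the original file.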
I think using object types is a good idea. In practice, the amount of space we save by having fixed-size types for alleles is minimal and it creates a lot of awkward corner cases. It's a premature optimisation, basically.
This has also come up with VCF here: https://github.com/pystatgen/sgkit/pull/40#issuecomment-669948502.