Review genotype storage methods by access to samples #33
-
VCF/bcftools
Data is stored as per-variant records, so full decoding is required to access the data for any single sample.
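To make the cost concrete, here is a toy sketch (not bcftools itself, just an illustration with a made-up in-memory layout) of why per-sample access is expensive in a variant-major layout like VCF/BCF: every record interleaves all samples, so pulling one sample's column still means touching every record in full.

```python
# Genotype matrix stored as per-variant records (variant-major), as in VCF/BCF.
# Each record holds the genotype calls for all samples at one variant.
records = [
    [(0, 0), (0, 1), (1, 1)],  # variant 0: genotypes for samples 0..2
    [(0, 1), (0, 0), (0, 1)],  # variant 1
    [(1, 1), (0, 1), (0, 0)],  # variant 2
]

def genotypes_for_sample(records, sample_index):
    """Column access: must decode every record to extract one sample."""
    return [record[sample_index] for record in records]

print(genotypes_for_sample(records, 2))  # sample 2 across all variants
```

The point is the loop shape: the work is proportional to the whole matrix, not to the one column requested.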
-
Plink
The plink .bed format is stored in "variant-major" order, but can also be written in "individual-major" mode (discussed here). Plink 1 seems to support both layouts, but plink 2 uses variant-major only; it seems that very old versions of plink were individual-major, and the option was kept for compatibility. Plink uses a 2-bit encoding per genotype (e.g. 00 = homozygous for the first allele in the .bim file). Currently in the text we're saying that this is a "memory-map" format because it could be mmap'd and computed on directly, but I don't know if that's how plink actually works.
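A minimal sketch of decoding one variant record under the plink 1 .bed 2-bit encoding (variant-major mode). Each byte packs four samples, low-order bit pairs first; the code values follow the plink 1 spec. The function name is mine, not plink's.

```python
# plink 1 .bed 2-bit genotype codes (variant-major mode):
#   00 hom. first allele (.bim), 01 missing, 10 het, 11 hom. second allele
CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}

def decode_variant_record(data: bytes, n_samples: int):
    """Decode ceil(n_samples / 4) packed bytes for one variant."""
    genotypes = []
    for byte in data:
        for shift in (0, 2, 4, 6):  # low-order bit pairs come first
            if len(genotypes) == n_samples:
                break
            genotypes.append(CODES[(byte >> shift) & 0b11])
    return genotypes

# 6 samples -> 2 bytes; the trailing bits of the last byte are padding
print(decode_variant_record(bytes([0b11100100, 0b00000010]), 6))
```

Sample-major ("individual-major") mode uses the same byte encoding but with the roles of variants and samples swapped.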
-
Savvy: Sparse allele vectors and the savvy software suite
So, Savvy also stores per-variant records, like BCF. There is no mention of access by sample in the paper.
-
spVCF: Sparse Project VCF: efficient encoding of population genotype matrices
A text-based encoding, mostly compatible with VCF. Focused on interoperability and compression.
-
PBWT: Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
Represents a large collection of haplotypes (i.e. genotypes in sample-major form) compactly, and in a way that makes some types of searches fast. Very influential, and the basis of many current approaches.
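A minimal sketch of the core PBWT construction (Durbin 2014, Algorithm 1): at each site the haplotype indices are re-sorted by their reversed prefixes, via a stable partition on the current allele, so that haplotypes sharing long matches ending at that site become adjacent. Function and variable names here are mine.

```python
def pbwt_orders(haplotypes):
    """Yield the positional prefix array after each site."""
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))   # a_0: original order
    for k in range(n_sites):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones               # stable partition by allele at site k
        yield order

# Haplotypes in sample-major form: one binary vector per haplotype.
haps = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
for k, order in enumerate(pbwt_orders(haps)):
    print(k, order)
```

The compression win comes from the adjacency: in the permuted order, the allele column at each site is highly runny and run-length encodes well, which is also what makes set-maximal match queries fast.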
-
BGT: efficient and flexible genotype query across many samples
From the repo:
The matrix is written in (roughly) sample-major format via PBWT, so per-sample queries should be fast. It's not clear how well it works for per-site queries.
-
SeqArray: SeqArray—a storage-efficient high-performance data format for WGS variant calls
An R package, built on the Genomic Data Structure (GDS) format:
Actually it has quite a lot in common with sgkit:
It's not clear if they do chunking, but it clearly has a lot of the same ideas.
-
GQT: Efficient genotype compression and analysis of large genetic-variation data sets
So, this is essentially a "sample-major" format with some tweaks. It provides a CLI, but no programmatic access. Focused on simple queries and filtering, with output to VCF.
-
BGEN: a binary file format for imputed genotype and haplotype data
A variant-major format specialised for imputed data. Used for the UK Biobank genotype data.
-
XSI: a genotype compression tool for compressive genomics in large biobanks
Within a block, genotype data is encoded by a combination of methods (PBWT for common variants). So it seems that each block needs to be decoded to get at the variant data for a specific sample? This also sounds like a relatively costly operation, because the per-block PBWTs need to be unravelled. It provides a C API.
-
GTC: how to maintain huge genotype collections in a compressed form
GTC takes a similar strategy to XSI. The genotype matrix is broken up into blocks of k variants, and within each block the haplotypes are permuted to maximise similarity between adjacent samples. The paper has an interesting analysis of query time by variant and by sample. Of course, the query time measured there is the time to output a VCF; what you'd actually do with that is another question.
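A hypothetical sketch of the blocking-and-permutation idea: split the matrix into blocks of k variants, then within each block reorder the haplotypes so that neighbours differ in as few sites as possible, leaving runny columns that compress well. The greedy nearest-neighbour heuristic below is my stand-in; GTC's actual ordering strategy differs.

```python
def hamming(a, b):
    """Number of positions where two haplotype slices differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_permute(block_columns):
    """Greedy ordering: repeatedly append the closest remaining haplotype.

    block_columns: one vector (the k variants of this block) per haplotype.
    """
    remaining = list(range(len(block_columns)))
    order = [remaining.pop(0)]
    while remaining:
        last = block_columns[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(block_columns[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# 4 haplotypes over a block of k = 3 variants
cols = [[0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(greedy_permute(cols))
```

The per-block permutation must be stored alongside the data, which is why any query, per-variant or per-sample, has to decode and un-permute the whole block first.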
-
TGC: Genome compression: a novel approach for large collections
Purely compression-focused: transformations of the VCF, plus some additional compression. Requires complete decompression of the file to get any information out.
-
GTRAC: fast retrieval from compressed collections of genomic variants
Based on TGC, extended to facilitate querying. Uses very specialised encoding schemes for the binary variant matrix. Supports efficient row and column access via dedicated algorithms. The implementation is clearly just there to support the paper, and not intended for reuse.
-
VCFShark: how to squeeze a VCF file
Stores the VCF column-wise, using specialised compressors for each field type, with genotypes treated specially. Only supports full decompression.
-
Genozip: a fast and efficient compression tool for VCF files
Focuses on compressing the non-genotype data: the claim is that the majority of the data in a VCF is not actually genotypes, it's all the other fields. Only VCF output is provided. It looks like they use both sample blocks and variant blocks:
Has some strategies for entropy reduction:
-
GTShark: genotype compression in large projects
Based on PBWT, but doesn't use run-length encoding.
The code only supports decompressing the entire file or extracting a single sample.
-
GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species
Lots of specialised machinery for compressing genotype data; they claim best-in-class compression performance. Emphasises parallelism and accessibility, with a Java API.
The main query output is VCF. Compared against plink for LD calculation.
-
Hail
Genotype storage is done via the MatrixTable, "a distributed two-dimensional extension of a Table". The MatrixTable is generic and supports access in both dimensions. I can't find any obvious documentation about how it all works, though.
-
An important part of the narrative emerging from #50 is that it's generally hard to get access to per-sample data from existing tools. This issue is for listing existing methods and linking to papers/repos so that we can easily assess the support for this operation across alternatives.
I'll organise this as one comment per tool, so that we can discuss further in threads if need be.