Review genotype storage methods by access to samples #33
-
VCF/bcftools
Data is stored as per-variant records, so full decoding is required to access the data for any single sample.
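To make the cost concrete, here is a toy sketch (not bcftools itself, just an illustration with a made-up in-memory layout) of why per-sample access is expensive in a variant-major layout like VCF/BCF: every record interleaves all samples, so pulling one sample's column still means touching every record in full.

```python
# Genotype matrix stored as per-variant records (variant-major), as in VCF/BCF.
# Each record holds the genotype calls for all samples at one variant.
records = [
    [(0, 0), (0, 1), (1, 1)],  # variant 0: genotypes for samples 0..2
    [(0, 1), (0, 0), (0, 1)],  # variant 1
    [(1, 1), (0, 1), (0, 0)],  # variant 2
]

def genotypes_for_sample(records, sample_index):
    """Column access: must decode every record to extract one sample."""
    return [record[sample_index] for record in records]

print(genotypes_for_sample(records, 2))  # sample 2 across all variants
```

The point is the loop shape: the work is proportional to the whole matrix, not to the one column requested.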
-
Plink
The plink .bed format is stored in "variant-major" order, but can also be written in "individual-major" mode (discussed here). Plink 1 seems to support both layouts, but plink 2 uses variant-major only; it seems that very old versions of plink were individual-major, and the option was kept for compatibility. Plink uses a 2-bit encoding per genotype (e.g. 00 = homozygous for the first allele in the .bim file). Currently in the text we're saying that this is a "memory-map" format because it could be mmap'd and computed on directly, but I don't know if that's how plink actually works.
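A minimal sketch of decoding one variant record under the plink 1 .bed 2-bit encoding (variant-major mode). Each byte packs four samples, low-order bit pairs first; the code values follow the plink 1 spec. The function name is mine, not plink's.

```python
# plink 1 .bed 2-bit genotype codes (variant-major mode):
#   00 hom. first allele (.bim), 01 missing, 10 het, 11 hom. second allele
CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}

def decode_variant_record(data: bytes, n_samples: int):
    """Decode ceil(n_samples / 4) packed bytes for one variant."""
    genotypes = []
    for byte in data:
        for shift in (0, 2, 4, 6):  # low-order bit pairs come first
            if len(genotypes) == n_samples:
                break
            genotypes.append(CODES[(byte >> shift) & 0b11])
    return genotypes

# 6 samples -> 2 bytes; the trailing bits of the last byte are padding
print(decode_variant_record(bytes([0b11100100, 0b00000010]), 6))
```

Sample-major ("individual-major") mode uses the same byte encoding but with the roles of variants and samples swapped.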
-
Savvy: Sparse allele vectors and the savvy software suite
So, Savvy also stores per-variant records, like BCF. There is no mention of access by sample in the paper.
-
spVCF: Sparse Project VCF: efficient encoding of population genotype matrices
A text-based encoding, mostly compatible with VCF. Focused on interoperability and compression.
-
PBWT: Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)
Represents a large collection of haplotypes (i.e. genotypes in sample-major form) compactly, and in a way that makes some types of searches fast. Very influential, and the basis of many current approaches.
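A minimal sketch of the core PBWT construction (Durbin 2014, Algorithm 1): at each site the haplotype indices are re-sorted by their reversed prefixes, via a stable partition on the current allele, so that haplotypes sharing long matches ending at that site become adjacent. Function and variable names here are mine.

```python
def pbwt_orders(haplotypes):
    """Yield the positional prefix array after each site."""
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))   # a_0: original order
    for k in range(n_sites):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones               # stable partition by allele at site k
        yield order

# Haplotypes in sample-major form: one binary vector per haplotype.
haps = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
for k, order in enumerate(pbwt_orders(haps)):
    print(k, order)
```

The compression win comes from the adjacency: in the permuted order, the allele column at each site is highly runny and run-length encodes well, which is also what makes set-maximal match queries fast.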
-
BGT: efficient and flexible genotype query across many samples
From the repo:
The matrix is written in (roughly) sample-major format via PBWT, so per-sample queries should be fast. It's not clear how well it works for per-site queries.
-
SeqArray: SeqArray—a storage-efficient high-performance data format for WGS variant calls
An R package, built on the Genomic Data Structure (GDS) format:
Actually it has quite a lot in common with sgkit:
It's not clear if they do chunking, but it clearly has a lot of the same ideas.
-
GQT: Efficient genotype compression and analysis of large genetic-variation data sets
So, this is essentially a "sample-major" format with some tweaks. It provides a CLI, but no programmatic access. Focused on simple queries and filtering, with output to VCF.
-
BGEN: a binary file format for imputed genotype and haplotype data
A variant-major format specialised for imputed data. Used for the UK Biobank genotype data.
-
XSI: a genotype compression tool for compressive genomics in large biobanks
Within a block, genotype data is encoded by a combination of methods (PBWT for common variants). So it seems that each block needs to be decoded to get at the variant data for a specific sample? This also sounds like a relatively costly operation, because the per-block PBWTs need to be unravelled. It provides a C API.
-
GTC: how to maintain huge genotype collections in a compressed form
GTC takes a similar strategy to XSI. The genotype matrix is broken up into blocks of k variants, and within each block the haplotypes are permuted to maximise similarity between adjacent samples. The paper has an interesting analysis of query time by variant and by sample. Of course, the query time measured there is the time to output a VCF; what you'd actually do with that is another question.
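A hypothetical sketch of the blocking-and-permutation idea: split the matrix into blocks of k variants, then within each block reorder the haplotypes so that neighbours differ in as few sites as possible, leaving runny columns that compress well. The greedy nearest-neighbour heuristic below is my stand-in; GTC's actual ordering strategy differs.

```python
def hamming(a, b):
    """Number of positions where two haplotype slices differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_permute(block_columns):
    """Greedy ordering: repeatedly append the closest remaining haplotype.

    block_columns: one vector (the k variants of this block) per haplotype.
    """
    remaining = list(range(len(block_columns)))
    order = [remaining.pop(0)]
    while remaining:
        last = block_columns[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(block_columns[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# 4 haplotypes over a block of k = 3 variants
cols = [[0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(greedy_permute(cols))
```

The per-block permutation must be stored alongside the data, which is why any query, per-variant or per-sample, has to decode and un-permute the whole block first.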
-
TGC: Genome compression: a novel approach for large collections
Purely compression-focused: transformations of the VCF, plus some additional compression. Requires complete decompression of the file to get any information out.
-
GTRAC: fast retrieval from compressed collections of genomic variants
Based on TGC, extended to facilitate querying. Uses very specialised encoding schemes for the binary variant matrix. Supports efficient row and column access via dedicated algorithms. The implementation is clearly just there to support the paper, and not intended for reuse.
-
VCFShark: how to squeeze a VCF file
Stores the VCF column-wise, using specialised compressors for each field type, with genotypes treated specially. Only supports full decompression.
-
Genozip: a fast and efficient compression tool for VCF files
Focuses on compressing the non-genotype data: the claim is that the majority of the data in a VCF is not actually genotypes, it's all the other fields. Only VCF output is provided. It looks like they use both sample blocks and variant blocks:
Has some strategies for entropy reduction:
-
GTShark: genotype compression in large projects
Based on PBWT, but doesn't use run-length encoding.
The code only supports decompressing the entire file or extracting a single sample.
-
GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species
Lots of specialised machinery for compressing genotype data; they claim best-in-class compression performance. Emphasises parallelism and accessibility, with a Java API.
The main query output is VCF. Compared against plink for LD calculation.
-
Hail
Genotype storage is done via the MatrixTable, "a distributed two-dimensional extension of a Table". The MatrixTable is generic and supports access in both dimensions. I can't find any obvious documentation about how it all works, though.
-
An important part of the narrative emerging from #50 is that it's generally hard to get access to per-sample data from existing tools. This issue is for listing existing methods and linking to papers/repos so that we can easily assess the support for this operation across alternatives.
I'll organise this as one comment per tool, so that we can discuss further in threads if need be.