Analysis of chunk size, compressor and filters on 1000 genomes data #74
-
I can take a look at this and run the experiments.
-
Note above that the
-
I've added the generalised
So, it seems like we're doing a pretty good job here on everything except the PL fields (which is discussed in #53). In particular, we're getting a compression ratio of 110 on
@shz9, is this something you'd like to look into?
-
Update: here's the same for chr2:
Overall, the pattern is basically the same.
-
Would an approach like the following work for testing this systematically, without generating schemas?
import itertools
import zarr
import numcodecs
import pandas as pd
import numpy as np
from tqdm import tqdm
def generate_chunksize_variations(shape, min_chunksize=1000):
"""
Generate variations for the chunk sizes
"""
dim_chunksizes = []
for sdim in shape:
if sdim <= min_chunksize:
dim_chunksizes.append([sdim])
else:
dim_chunksizes.append([min_chunksize])
while True:
if dim_chunksizes[-1][-1]*10 < sdim:
dim_chunksizes[-1].append(dim_chunksizes[-1][-1]*10)
else:
break
dim_chunksizes[-1].append(sdim)
return list(itertools.product(*dim_chunksizes))
def generate_blosc_variations():
"""
Generate variations for the Blosc compressor
"""
from numcodecs.blosc import list_compressors
compressors = list_compressors()
bit_shuffle = list(range(-1, 3))
c_levels = [5, 7, 9]
return [
{'cname': cn, 'shuffle': bits, 'clevel': cl}
for cn, bits, cl in itertools.product(compressors, bit_shuffle, c_levels)
]
def test_compression_variations(z_arr):
"""
Test the impact of compressor options + chunksizes on compression ratios
of Zarr arrays.
"""
chunksize_var = generate_chunksize_variations(z_arr.shape)
blosc_var = generate_blosc_variations()
comp_results_table = []
for cs_var, bcomp_var in tqdm(itertools.product(chunksize_var, blosc_var),
total=len(chunksize_var)*len(blosc_var),
desc='Compression variations'):
new_compressor = numcodecs.Blosc(**bcomp_var)
z2 = zarr.empty_like(z_arr, chunks=cs_var, compressor=new_compressor)
z2[:] = z_arr[:]
comp_results_table.append({**bcomp_var,
**{'chunksize': cs_var,
'CompressionRatio': float(dict(z2.info_items())['Storage ratio'])}})
# If the array is of type bool, try the PackBits codec:
if z_arr.dtype == bool:
for cs_var in chunksize_var:
z2 = zarr.empty_like(z_arr, chunks=cs_var, compressor=numcodecs.PackBits())
z2[:] = z_arr[:]
comp_results_table.append({
'cname': 'PackBits', 'shuffle': None, 'clevel': None,
'chunksize': cs_var,
'CompressionRatio': float(dict(z2.info_items())['Storage ratio'])
})
return pd.DataFrame(comp_results_table).sort_values('CompressionRatio', ascending=False)
def test_vcf2zarr_compression_variations(z_group, keys=None):
"""
Test the impact of compressor options + chunksizes on compression ratios
of vcf2zarr zarr arrays.
"""
keys = keys or list(z_group.keys())
tabs = []
for k in keys:
tabs.append(test_compression_variations(z_group[k]))
tabs[-1]['ArrayName'] = k
return pd.concat(tabs)
I tested this on my end with the tiny VCF that I've been working with and it seems to work. I don't have a lot of samples/variants in this file, so I couldn't test many variations of the chunk sizes. I'll see if I can test it on larger files over the weekend. This is how I invoke it on already converted VCF files (if you want to test it on your converted files):
import zarr
z = zarr.open("tmp/sample.zarr/")
res = test_vcf2zarr_compression_variations(z, ['call_PL', 'call_genotype', 'call_genotype_mask'])
I get something like this:
cname shuffle clevel chunksize CompressionRatio ArrayName
52 zstd 0 7 (1000, 91, 28) 24.0 call_PL
112 zstd 0 7 (6515, 91, 28) 24.0 call_PL
115 zstd 1 7 (6515, 91, 28) 22.9 call_PL
109 zstd -1 7 (6515, 91, 28) 22.9 call_PL
55 zstd 1 7 (1000, 91, 28) 22.8 call_PL
49 zstd -1 7 (1000, 91, 28) 22.8 call_PL
...
cname shuffle clevel chunksize CompressionRatio ArrayName
112 zstd 0 7 (6515, 91, 2) 32.5 call_genotype
115 zstd 1 7 (6515, 91, 2) 32.5 call_genotype
49 zstd -1 7 (1000, 91, 2) 31.2 call_genotype
58 zstd 2 7 (1000, 91, 2) 31.2 call_genotype
109 zstd -1 7 (6515, 91, 2) 30.4 call_genotype
118 zstd 2 7 (6515, 91, 2) 30.4 call_genotype
52 zstd 0 7 (1000, 91, 2) 29.9 call_genotype
55 zstd 1 7 (1000, 91, 2) 29.9 call_genotype
No huge variations in compression ratio from varying the
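In case it helps with reading these tables, here's one way to summarise the results DataFrame (assuming the res object returned by the invocation shown earlier):
# Best compression ratio per array and codec/shuffle setting, across all chunk sizes tested.
summary = (res.groupby(['ArrayName', 'cname', 'shuffle'])['CompressionRatio']
           .max()
           .reset_index()
           .sort_values(['ArrayName', 'CompressionRatio'], ascending=[True, False]))
print(summary)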
-
It turns out Alistair did a really nice analysis on this in 2016 for genotypes. His conclusion is that zstd is great, and the bit shuffle filter adds quite a bit of extra compression. We should probably adopt this as our default for genotypes. Can you try bit shuffle, @shz9? I think the open question here is how we can do better with things like GQ and DP, which aren't compressing very well currently (again, ignoring PL as it requires special treatment).
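For reference, here's a minimal sketch of what that configuration looks like with numcodecs/zarr; the shape, dtype and clevel=7 are just assumptions for illustration:
import numcodecs
import numpy as np
import zarr
# Genotype-like data: small integers, shape (variants, samples, ploidy).
gt = np.random.randint(0, 2, size=(10_000, 1_000, 2), dtype=np.int8)
# zstd with the bit shuffle filter enabled inside Blosc.
compressor = numcodecs.Blosc(cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE)
z = zarr.array(gt, chunks=(10_000, 1_000, 2), compressor=compressor)
print(dict(z.info_items())["Storage ratio"])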
-
Here are some preliminary results on this from my analysis of
Here's the latest version of the script I'm using, if anyone else would like to experiment:
import itertools
import zarr
import numcodecs
import pandas as pd
import numpy as np
from tqdm import tqdm
import argparse
def generate_shape_permutations(shape):
return list(itertools.permutations(range(len(shape))))
def generate_chunksize_variations(shape,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None)):
"""
Generate variations for the chunk sizes
"""
if isinstance(min_chunksize, int):
min_chunksize = list(np.repeat(min_chunksize, len(shape)))
if isinstance(max_chunksize, int):
max_chunksize = list(np.repeat(max_chunksize, len(shape)))
dim_chunksizes = []
for sdim, min_cs, max_cs in itertools.zip_longest(shape, min_chunksize, max_chunksize):
if min_cs is None:
min_cs = sdim
if max_cs is None or max_cs > sdim:
max_cs = sdim
if sdim <= min_cs:
dim_chunksizes.append([sdim])
else:
dim_chunksizes.append([min_cs])
while True:
if dim_chunksizes[-1][-1] * 10 <= max_cs:
dim_chunksizes[-1].append(dim_chunksizes[-1][-1] * 10)
else:
break
return list(itertools.product(*dim_chunksizes))
def generate_blosc_variations():
"""
Generate variations for the Blosc compressor
"""
from numcodecs.blosc import list_compressors
compressors = ['zstd'] # list_compressors()
bit_shuffle = list(range(-1, 3))
c_levels = [7] #[5, 7, 9]
return [
{'cname': cn, 'shuffle': bits, 'clevel': cl}
for cn, bits, cl in itertools.product(compressors, bit_shuffle, c_levels)
]
def test_compression_variations(z_arr,
max_chunks=10,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None),
dry_run=False):
"""
Test the impact of compressor options + chunksizes on compression ratios
of Zarr arrays.
:param z_arr: A Zarr array.
:param max_chunks: Only extract up to `max_chunks` from the original array for processing
(to save memory)
:param max_chunksize: The maximum chunksize (along any dimension) to test.
:param dry_run: If True, generate the variations table without copying any data.
"""
shape_perm = generate_shape_permutations(z_arr.shape)
chunksize_var = generate_chunksize_variations(z_arr.shape,
min_chunksize=min_chunksize,
max_chunksize=max_chunksize)
blosc_var = generate_blosc_variations()
comp_results_table = []
# Determine the shape to extract based on the chunksize variations and
# the max_chunks parameter:
max_idx = np.minimum(np.array(chunksize_var).max(axis=0) * max_chunks,
np.array(z_arr.shape))
print(max_idx)
slices = tuple(slice(0, n) for n in max_idx)
if not dry_run:
# Extract the data
arr = z_arr[slices]
for cs_var, bcomp_var, dim_order in tqdm(itertools.product(chunksize_var, blosc_var, shape_perm),
total=len(chunksize_var) * len(blosc_var) * len(shape_perm),
desc='Compression variations'):
reshaped_chunks = tuple(np.array(cs_var)[np.array(dim_order)])
if not dry_run:
arr_reshaped = arr.transpose(dim_order)
new_compressor = numcodecs.Blosc(**bcomp_var)
z2 = zarr.empty_like(arr_reshaped,
chunks=reshaped_chunks,
compressor=new_compressor,
dtype=z_arr.dtype)
z2[:] = arr_reshaped
n_chunks = z2.nchunks
compress_ratio = float(dict(z2.info_items())['Storage ratio'])
else:
n_chunks = None
compress_ratio = None
comp_results_table.append({**bcomp_var,
**{'chunksize': reshaped_chunks,
'nchunks': n_chunks,
'dim_order': dim_order,
'CompressionRatio': compress_ratio}})
return pd.DataFrame(comp_results_table).sort_values('CompressionRatio', ascending=False)
def test_vcf2zarr_compression_variations(z_group,
keys=None,
max_chunks=10,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None),
dry_run=False):
"""
Test the impact of compressor options + chunksizes on compression ratios
of vcf2zarr zarr arrays.
:param z_group: A Zarr group
:param keys: A list of keys to test the compression variations on. If None,
it tests on all the Zarr array in the hierarchy.
:param max_chunks: Only extract up to `max_chunks` from the original array for processing
(to save memory)
:param max_chunksize: The maximum chunksize (along any dimension) to test.
:param dry_run: If True, generate the variations table without copying any data.
"""
keys = keys or list(z_group.keys())
tabs = []
for k in keys:
print("Testing:", k)
tabs.append(test_compression_variations(z_group[k],
max_chunks=max_chunks,
min_chunksize=min_chunksize,
max_chunksize=max_chunksize,
dry_run=dry_run))
tabs[-1]['ArrayName'] = k
return pd.concat(tabs)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='vcf2zarr compression benchmarks')
parser.add_argument('-i', '--input', dest='input_zarr_group', type=str, required=True,
help='The path to the input Zarr group.')
parser.add_argument('-o', '--output', dest='output_file', type=str, required=True,
help='The path to the output CSV file.')
parser.add_argument('--keys', dest='keys', type=str, nargs='+', default=None,
help='The keys to test the compression variations on. If not provided, '
'it tests on all the Zarr arrays in the hierarchy.')
parser.add_argument('--max-chunks', dest='max_chunks', type=int, default=10,
help='Only extract up to `max_chunks` from the original array for processing.')
parser.add_argument('--dry-run', dest='dry_run', action='store_true', default=False,
help='If True, generate the variations table without copying any data.')
args = parser.parse_args()
results_df = test_vcf2zarr_compression_variations(zarr.open(args.input_zarr_group),
keys=args.keys,
max_chunks=args.max_chunks,
dry_run=args.dry_run)
results_df.to_csv(args.output_file, index=False)
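For reference, this is how the script can be invoked from the command line (assuming it's saved as compression_benchmarks.py; the Zarr path and keys here are just placeholders):
python compression_benchmarks.py -i tmp/sample.zarr -o compression_results.csv --keys call_DP call_GQ --max-chunks 10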
-
Here are the compression results for all the
I'm also including a version of the plot that only has the dimensions in the default order.
From these results, here are my recommendations for the compressors/chunk sizes for these fields:
I think what remains to explore here is the idea of applying filters before compression, such as PackBits for boolean fields or Quantize for floats.
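As a rough sketch of what applying such filters would look like (the shapes, dtypes and the 1-decimal-digit quantization are assumptions for illustration):
import numcodecs
import numpy as np
import zarr
compressor = numcodecs.Blosc(cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE)
# Boolean field (e.g. call_genotype_mask): pack 8 booleans per byte before compressing.
mask = np.random.rand(10_000, 1_000, 2) < 0.01
z_mask = zarr.array(mask, chunks=(10_000, 1_000, 2), filters=[numcodecs.PackBits()], compressor=compressor)
# Float field: keep roughly 1 decimal digit of precision before compressing (lossy).
vals = np.random.rand(10_000, 1_000).astype(np.float32)
z_vals = zarr.array(vals, chunks=(10_000, 1_000), filters=[numcodecs.Quantize(1, vals.dtype)], compressor=compressor)
print(dict(z_mask.info_items())["Storage ratio"], dict(z_vals.info_items())["Storage ratio"])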
-
Here are the results for the
It seems to help quite substantially, especially when
For
This quantization may be too aggressive, but it doubles/triples the compression ratio, especially when coupled with
Finally, here's the final version of the benchmarking script (compression_benchmarks.py):
import itertools
import zarr
import numcodecs
import pandas as pd
import numpy as np
from tqdm import tqdm
import argparse
def generate_shape_permutations(shape):
return list(itertools.permutations(range(len(shape))))
def generate_chunksize_variations(shape,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None)):
"""
Generate variations for the chunk sizes
"""
if isinstance(min_chunksize, int):
min_chunksize = list(np.repeat(min_chunksize, len(shape)))
if isinstance(max_chunksize, int):
max_chunksize = list(np.repeat(max_chunksize, len(shape)))
dim_chunksizes = []
for sdim, min_cs, max_cs in itertools.zip_longest(shape, min_chunksize, max_chunksize):
if min_cs is None:
min_cs = sdim
if max_cs is None or max_cs > sdim:
max_cs = sdim
if sdim <= min_cs:
dim_chunksizes.append([sdim])
else:
dim_chunksizes.append([min_cs])
while True:
if dim_chunksizes[-1][-1] * 10 <= max_cs:
dim_chunksizes[-1].append(dim_chunksizes[-1][-1] * 10)
else:
break
return list(itertools.product(*dim_chunksizes))
def generate_blosc_variations():
"""
Generate variations for the Blosc compressor
"""
from numcodecs.blosc import list_compressors
compressors = ['zstd'] # list_compressors()
bit_shuffle = list(range(3))
c_levels = [7] #[5, 7, 9]
return [
{'cname': cn, 'shuffle': bits, 'clevel': cl}
for cn, bits, cl in itertools.product(compressors, bit_shuffle, c_levels)
]
def test_compression_variations(z_arr,
max_chunks=10,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None),
dry_run=False):
"""
Test the impact of compressor options + chunksizes on compression ratios
of Zarr arrays.
:param z_arr: A Zarr array.
:param max_chunks: Only extract up to `max_chunks` from the original array for processing
(to save memory)
:param max_chunksize: The maximum chunksize (along any dimension) to test.
:param dry_run: If True, generate the variations table without copying any data.
"""
shape_perm = generate_shape_permutations(z_arr.shape)
chunksize_var = generate_chunksize_variations(z_arr.shape,
min_chunksize=min_chunksize,
max_chunksize=max_chunksize)
blosc_var = generate_blosc_variations()
comp_results_table = []
# Determine the shape to extract based on the chunksize variations and
# the max_chunks parameter:
max_idx = np.minimum(np.array(chunksize_var).max(axis=0) * max_chunks,
np.array(z_arr.shape))
print(max_idx)
slices = tuple(slice(0, n) for n in max_idx)
if not dry_run:
# Extract the data
arr = z_arr[slices]
for cs_var, bcomp_var, dim_order in tqdm(itertools.product(chunksize_var, blosc_var, shape_perm),
total=len(chunksize_var) * len(blosc_var) * len(shape_perm),
desc='Compression variations'):
reshaped_chunks = tuple(np.array(cs_var)[np.array(dim_order)])
if not dry_run:
arr_reshaped = arr.transpose(dim_order)
new_compressor = numcodecs.Blosc(**bcomp_var)
if z_arr.filters is not None:
object_codec = z_arr.filters[0]
else:
object_codec = None
z2 = zarr.empty_like(arr_reshaped,
chunks=reshaped_chunks,
compressor=new_compressor,
dtype=z_arr.dtype,
object_codec=object_codec)
z2[:] = arr_reshaped
n_chunks = z2.nchunks
compress_ratio = float(dict(z2.info_items())['Storage ratio'])
else:
n_chunks = None
compress_ratio = None
comp_results_table.append({**bcomp_var,
**{'chunksize': reshaped_chunks,
'nchunks': n_chunks,
'dim_order': dim_order,
'CompressionRatio': compress_ratio}})
# Test the combination of filters + compressor:
if z_arr.dtype == bool:
if not dry_run:
z2 = zarr.empty_like(arr_reshaped,
chunks=reshaped_chunks,
compressor=new_compressor,
filters=[numcodecs.PackBits()],
dtype=z_arr.dtype,
object_codec=object_codec)
z2[:] = arr_reshaped
n_chunks = z2.nchunks
compress_ratio = float(dict(z2.info_items())['Storage ratio'])
else:
n_chunks = None
compress_ratio = None
comp_results_table.append({**bcomp_var,
**{'chunksize': reshaped_chunks,
'nchunks': n_chunks,
'dim_order': dim_order,
'CompressionRatio': compress_ratio}})
comp_results_table[-1]['cname'] += '+PackBits'
elif np.issubdtype(z_arr.dtype, np.floating):
if not dry_run:
z2 = zarr.empty_like(arr_reshaped,
chunks=reshaped_chunks,
compressor=new_compressor,
filters=[numcodecs.Quantize(1, arr.dtype)],
dtype=z_arr.dtype,
object_codec=object_codec)
z2[:] = arr_reshaped
n_chunks = z2.nchunks
compress_ratio = float(dict(z2.info_items())['Storage ratio'])
else:
n_chunks = None
compress_ratio = None
comp_results_table.append({**bcomp_var,
**{'chunksize': reshaped_chunks,
'nchunks': n_chunks,
'dim_order': dim_order,
'CompressionRatio': compress_ratio}})
comp_results_table[-1]['cname'] += '+Quantize'
return pd.DataFrame(comp_results_table).sort_values('CompressionRatio', ascending=False)
def test_vcf2zarr_compression_variations(z_group,
keys=None,
max_chunks=10,
min_chunksize=(1000, 100),
max_chunksize=(10_000, None),
dry_run=False):
"""
Test the impact of compressor options + chunksizes on compression ratios
of vcf2zarr zarr arrays.
:param z_group: A Zarr group
:param keys: A list of keys to test the compression variations on. If None,
it tests on all the Zarr array in the hierarchy.
:param max_chunks: Only extract up to `max_chunks` from the original array for processing
(to save memory)
:param max_chunksize: The maximum chunksize (along any dimension) to test.
:param dry_run: If True, generate the variations table without copying any data.
"""
keys = keys or list(z_group.keys())
tabs = []
for k in keys:
print("Testing:", k)
tabs.append(test_compression_variations(z_group[k],
max_chunks=max_chunks,
min_chunksize=min_chunksize,
max_chunksize=max_chunksize,
dry_run=dry_run))
tabs[-1]['ArrayName'] = k
return pd.concat(tabs)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='vcf2zarr compression benchmarks')
parser.add_argument('-i', '--input', dest='input_zarr_group', type=str, required=True,
help='The path to the input Zarr group.')
parser.add_argument('-o', '--output', dest='output_file', type=str, required=True,
help='The path to the output CSV file.')
parser.add_argument('--keys', dest='keys', type=str, nargs='+', default=None,
help='The keys to test the compression variations on. If not provided, '
'it tests on all the Zarr arrays in the hierarchy.')
parser.add_argument('--max-chunks', dest='max_chunks', type=int, default=10,
help='Only extract up to `max_chunks` from the original array for processing.')
parser.add_argument('--dry-run', dest='dry_run', action='store_true', default=False,
help='If True, generate the variations table without copying any data.')
args = parser.parse_args()
results_df = test_vcf2zarr_compression_variations(zarr.open(args.input_zarr_group),
keys=args.keys,
max_chunks=args.max_chunks,
dry_run=args.dry_run)
results_df.to_csv(args.output_file, index=False)
And the plotting functions/utils (plotting_functions.py):
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import ast
arr = np.array(['v', 's', 'p'], dtype=str)  # labels for the (variants, samples, ploidy) dimensions
def map_chunksize(row):
arr_r = arr[np.array(ast.literal_eval(row.dim_order))]
arr_c = np.array(ast.literal_eval(row.chunksize)).astype(str)
return ",".join([r + c for r, c in zip(arr_r, arr_c)])
def extract_chunksize_per_dim(row):
arr_r = list(ast.literal_eval(row.dim_order))
arr_c = list(ast.literal_eval(row.chunksize))
return pd.Series([arr_c[arr_r.index(0)], arr_c[arr_r.index(1)]])
df = pd.read_csv("PATH_TO_OUTPUT_CSV_FILE")
df['named_chunksize'] = df.apply(map_chunksize, axis=1)
df[['variant_cs', 'sample_cs']] = df.apply(extract_chunksize_per_dim, axis=1)
df = df.loc[df.shuffle != -1]
df_plot = df.loc[df.dim_order.isin(['(0, 1)', '(0, 1, 2)'])]
g = sns.catplot(df_plot, kind='bar', row='cname', col='shuffle',
x='named_chunksize', y='CompressionRatio', sharex=False, sharey='row')
for ax in g.fig.axes:  # rotate the x tick labels on every facet
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.subplots_adjust(hspace=1.)
plt.show()
-
One thing I wondered about slightly idly, @shz9, is whether having power-of-two sized chunks would make any difference. In principle, aligning with OS page boundaries (usually 4 KiB) may give an overall performance improvement. In practice, it probably doesn't make much difference. If it's not too much hassle, I wonder if we could try something close to 10_000 x 1000 which is 4 KiB aligned?
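A quick sketch of the arithmetic involved, assuming int8 genotype calls with ploidy 2 (8192 x 1024 is just one nearby power-of-two candidate):
import numpy as np
PAGE = 4096  # typical OS page size in bytes
for chunks in [(10_000, 1_000, 2), (8_192, 1_024, 2)]:
    nbytes = int(np.prod(chunks)) * np.dtype(np.int8).itemsize
    print(chunks, nbytes, "4 KiB aligned:", nbytes % PAGE == 0)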
-
It would be very useful to have a high-level overview of the effect of chunk size, compressor and filters on some recent VCF data. The 1000 genomes NYGC resequencing data is as good a choice as any.
Let's just look at the FORMAT fields, as the 1D stuff doesn't really contribute much and makes no real difference. For chr20 I have:
As discussed in #53, the PL values are a major outlier here. Let's exclude those too.
So, for the remaining fields, what I'd like to do is systematically try out combinations of
A good way to proceed here would be to
vcf2zarr encode -s [your schema] --max-variant-chunks=10
to encode the first 10 variant chunks. We should get enough information from the first 10 variant chunks, but we can make that bigger if it seems necessary.
(Note the generalised inspect command for Zarr datasets discussed in #39 would be useful for step 5 here. Basically it would just use the zarr info method and tabulate the output.)
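As a rough sketch of what that tabulation could look like (not the actual inspect implementation; the Zarr path is a placeholder):
import pandas as pd
import zarr
def tabulate_zarr_info(path):
    # One row per array, built from zarr's info_items() output.
    rows = []
    for name, arr in zarr.open(path, mode="r").arrays():
        info = dict(arr.info_items())
        rows.append({"ArrayName": name,
                     "Shape": info["Shape"],
                     "Chunks": info["Chunk shape"],
                     "Compressor": info["Compressor"],
                     "StorageRatio": info["Storage ratio"]})
    return pd.DataFrame(rows)
print(tabulate_zarr_info("tmp/sample.zarr"))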