-
Notifications
You must be signed in to change notification settings - Fork 14
Whole genome calling
vg call
and vg genotype
do not scale beyond graphs of a few megabases. The next generation of callers will hopefully be able to leverage the newer indexes in vg to get around this limitation, but in the meantime we split the input up with vg chunk
a priori. This also helps to distribute computation to different processes.
To call a whole genome, pass the gam with --gams
and a list of chromosome names with --chroms
, along with the xg and sample name. Toil boilerplate as described in the README applies for AWS:
toil-vg call $TOIL_JS PATH-TO-XG SAMPLE-NAME $TOIL_OS --gams WHOLE-GENOME.gam --chroms $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}; done)
Use the --recall
option to call SVs (that are already in the graph).
The steps performed by toil-vg call
are described below (see https://github.com/vgteam/toil-vg/blob/master/src/toil_vg/vg_call.py for the exact steps), or run toil-vg call
with --realTimeLogging
enabled on a small example and grep Docker Run
from the output to see the command-line tools run.
Most commands take -t
to specify the number of threads. I've hardcoded a few options below, and probably forgot others. The config and Python are where to find the more complete options and descriptions.
In summary, vg chunk
is used to split the input xg and GAM into a bunch of chunks. Each one has a .vg and .gam. A vcf is made for each chunk with vg call
and they are merged together.
Update toil-vg now supports the --genotype_vcf
option in order to genotype variants in the VCF. This VCF must have been used to construct the graph with toil-vg construct --alt_path_gam_index
. This option is preferred over the --recall
option, as the output is much easier to interpret and it is slightly more accurate.
First, the GAM must be sorted and indexed with vg gamsort
vg gamsort INPUT-GAM -i INPUT-GAM.sorted.gai > INPUT-GAM.sorted.gam
This can take several hours, but only needs to be done once. If toil-vg call
finds a .gai, it skips this step. The GAM can be a whole genome or a single chromosome, the remaining steps are unchanged.
Next, the xg and indexed GAM are split up into overlapping 2.5 Megabase chunks with vg chunk
.
Note that PATH_LIST is a text file with the name of each chromosome to call, one per line, ex
chr1
chr2
chr3 ...
Update toil-vg now uses vg paths --xg --lengths
to get a list of all paths and their lengths from the XG. If no paths are specified on the command line, all the paths are used. If the user specified paths with --chroms
, this command is still used to determine their lengths (which are used below).
vg chunk -x INPUT-XG \
-a INPUT-GAM.sorted.gam \
-P PATH_LIST \
-c 50 \
-g \
-s 2500000 \
-o 100000 \
-b call_chunk \
-t THREADS \
-E OUTPUT_BED \
-f
The OUTPUT_BED
file will contain the information for each calling job. The first 3 columns are the BED coordinates of the chunk, the 4th column is the path of the GAM file for the chunk (replace .gam in this path to get the path of the .vg file of the chunk).
Update When using --genotype_vcf
, the (indexed) alt path GAM (created with toil-vg construct/index --alt_path_gam_index
) should be passed to vg chunk
using via -a <alt_paths>.gam
. Multiple occurrences of -a
are now supported by vg chunk
.
A line from the BED file above contains all the information necessary to call the chunk. The default options for everything are specified in vg_config.py, and are different for SNPs and SVs. The first step to calling is filtering the gam. This step isn't as useful as it used to be due to improvements in vg, but still improves calling small variants on some benchmarks
vg index chunk.vg -x chunk.xg
vg filter chunk.gam -r 0.9 -fu -m 1 -q 15 -D 999 -x chunk.xg > filtered.gam
Then augmenting the graph
Update When using --genotype_vcf
, the graph must first be updated to include the alt paths. These paths are in a GAM chunk (same as alignment chunk, but with -1 added to the name's prefix): vg augment -i chunk.vg chunk-1.gam > chunk_alts.gam && mv chunk_alts.gam chunk.gam
vg augment chunk.vg filtered.gam -q 10 -Z chunk.trans -S chunk.support > chunk_aug.vg
Then the output can be used to make the VCF
vg call chunk_aug.vg -z chunk.trans -s chunk.support -c CHROMOSOME-NAME -r CHROMOSOME-NAME -l CHROMOSOME-LENGTH -o OFFSET > chunk.vcf
CHROMOSOME-NAME
can be inferred from the BED file (first column). CHROMOSOME-LENGTH
is the total length of the chromosome. It can be inferred from the BED file too (Update: toil-vg now gets path information from the xg -- see above). toil-vg does this by reading each line of the bed file and storing the minimum and maximum position for each chromosome, and computing size using that (see path_bounds
and path_size
variables in vg_call.py
). OFFSET
can be inferred from the BED file (second column).
vg genotype
can be used instead of vg call
. It is slower and takes more memory, but if the GAM is passed in with -G
, vg augment
is not necessary. It usually performs better on SNPs and small indels than vg call
. Only vg call
can be used to SVs.
Update When using --genotype_vcf
, the input VCF gets passed to vg call with the -f
option. All the other options are the same as the defaults using --recall
. The VCF passed in via -f
should be clipped as below in Step 5). The vg call output does not need to be clipped again.
Now sort the VCF
head -10000 chunk.vcf | grep "^#"; cat $@ | grep -v "^#" | sort -k1,1d -k2,2n | bgzip > sorted.vcf.gz
Then index it
tabix -f -p vcf sorted.vcf.gz
Because of the overlap, we must clip chunks so they exactly represent the regions.
The clipped offset of the chunk_ith BED chunk would be
clipped_chunk_offset = chunk_i * 2500000 - chunk_i * 100000
That's enough to extract the region of interest from our VCF
bcftools view sorted.vcf.gz -t CHROMOSOME-NAME:START-END > clipped.vcf
where
START=clipped_chunk_offset + 1 (+ 50000 if not the first chunk)
END=clipped_chunk_offset + 2500000 -1 (-50000 if not last chunk)
The offset computation here is using the chunk size and half the overlap. In hindsight, it could be simplified by not creating overlapping chunks and boosting the context expansion (-c) in vg chunk, but this hasn't been tested.
Update Clipping not required at this point when using --genotype_vcf
. It needs to be done before vg call
instead.
bcftools concat clipped1.vcf.gz clipped2.vcf.gz ....