-
Notifications
You must be signed in to change notification settings - Fork 194
VCF export with vg deconstruct
VCF is a standard, text-based format for representing genome variation. Since VCF is so widely used and supported, it's common to convert from vg to VCF to leverage existing linear-reference-based tools to study variation from a graph.
Graphs are exported to VCF using vg deconstruct
. This tools creates a site (VCF line) for every snarl (bubble) in the graph. The allele information (reference and alts) for these sites is taken from paths in the graph. This is important: if your graph does not have sufficient path information for both reference and alt alleles, and/or these paths are not named correctly, then vg deconstruct
will probably not work properly.
vg deconstruct
will work best if all paths in your graph follow the "PanSN" naming convention described here since writing VCF requires parsing out SAMPLE
, HAPLOTYPE
and CONTIG
fields from the path names.
The reference path forms the basis of all the coordinates in the VCF. A single path can be specified with -p
or a set of paths with a common prefix with -P
. By default, all REFERENCE
-sense paths in the graph are used. This behaviour makes sense when running on a GBZ with a single reference, such as those often produced by Minigraph-Cactus or vg autoindex
, but it's usually prodent to explicitly set the reference with -P
.
The ALT alleles will be formed from all other paths in the graph (that were not selected as reference paths as described above). Unlike for reference paths, these paths can be HAPLOTYPE
-sense paths from GBZ or GBWTs. To get meaningful genotypes in the output VCF, these paths must include sample and haplotype information either with the proper naming scheme or by virtue of being encoded in a valid GBZ/GBWT.
Cycles in the reference path as found, for example, in graphs from PGGB are supported. A site is written for each loop through the path, with alt alleles assigned to the appropriate sites by measuring similarity in their flanking context.
The ID
field in VCF is the two node IDs and their orientations in the graph that correspond to the two endpoints of the snarl. Note that when running on GBZ
input that was derived from a GFA, you can have IDs written in terms of the original GFA coordinates with -OP
.
By default, all top-level sites (ie not nested in other sites) along the chosen reference path(s) are output. The -a
option will write a site for every single snarl that is spanned by a reference path. Nesting information is stored in the following INFO
tags in the VCF
-
LV
: The level in the snarl tree.0
refers to "top-level" snarls.1
refers to snarls that are nested inside top-level snarls,2
to snarls nested inside snarls ofLV=1
, etc. -
PS
: Name of the parent snarl. Note that the snarl hierarchy is a tree, so each snarl is either top-level (LV=0
) or has exactly one parent.
vg deconstruct
writes base-level resolved alleles. So two SV insertions that differ by a single base will have separate alleles in the output. This can lead to a proliferation of multi-allelic sites and make SVs difficult to count.
-L
is a new option that clusters together alleles based on how similar they are in the graph. More precisely, two alleles are merged if the Jaccard coefficient of their (basewide) graph positions is greater or equal to the value given. By default it is 1
, meaning alleles are only merged if they are identical.
When -L
is specified, and it is <1
, then two additional FORMAT
tags are written - these apply to each genotyped allele:
-
TS
: The similarity between the actual graph allele and the allele represented in the VCF. This is the Jaccard coefficient mentioned above. -
TL
: The difference in length between the graph allele and the VCF allele in bp (positive when the true allele is bigger than the VCF allele)
-n
is a new option for exporting nested sites that addresses some shortcomings with the -a
option:
- It has a simpler interface for nested insertions: off-reference paths do not be selected by the user, they will be chosen by the lexicographically first non-reference path through a given site. Note: GBZ haplotype paths are not currently supported -- insertions will only work if all paths are
REFERENCE
-sense. - It uses a bunch of new VCF tags to link nested alleles together. These can help, for example, to simply query variants within nested insertions by where their "root" site lies on the reference.
- (with
-R
) It uses star-alleles to represent spanning deletions. So a haplotype that spans a parent site but is not present in a child site will be represented this way, as opposed to getting a.
(missing) genotype.
The new INFO
TAGs are (work in progress):
-
PA
: The allele number of this site's reference haplotype in its parent site. -
PL
: The length of this site's reference allele in its parent site. -
RL
: The length of the reference allele of this site's parent site. -
PR
: The length of the allele in the top-level reference site in which this site is nested -
RC
: The name of the reference contig (as found in theCONTIG
field of the top-level site containing this site) -
RS
: The start position on the top-level reference (as found in thePOS
field of the top-level site containing this site) -
RD
: The end position on the top-level reference (as found in thePOS
field of the top-level site containing this site)
vg deconstruct test/nesting/nested_snp_in_del.gfa -p x -n | tail -4
##contig=<ID=x,length=5>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a
x 1 >1>6 CATG CAAG,C 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>1>2>3>5>6,>1>2>4>5>6,>1>6;NS=1;PA=0;PL=4;PR=4;RC=x;RD=4;RL=4;RS=1;LV=0 GT 1|2
x 3 >2>5 T A,* 60 . AC=1,1;AF=0.5,0.5;AN=2;AT=>2>3>5,>2>4>5,.;NS=1;PA=0;PL=4;PR=4;RC=x;RD=4;RL=4;RS=1;LV=1;PS=>1>6 GT 1|2
vg deconstruct test/nesting/nested_snp_in_ins2.gfa -p x -n
##contig=<ID=a#1#y0,length=5>
##contig=<ID=x,length=2>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a b
a#1#y0 3 >2>5 A T,* 60 . AC=1,1;AF=0.25,0.25;AN=4;AT=>2>4>5,>2>3>5,.;NS=2;PA=1;PL=4;PR=1;RC=x;RD=1;RL=4;RS=1;LV=1;PS=>1>6 GT 0|1 0|2
x 1 >1>6 C CAAG,CATG 60 . AC=2,1;AF=0.5,0.25;AN=4;AT=>1>6,>1>2>4>5>6,>1>2>3>5>6;NS=2;PA=0;PL=1;PR=1;RC=x;RD=1;RL=1;RS=1;LV=0 GT 1|2 1|0