Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chromosome path not found in graph or haplotypes index #4411

Open
zwh82 opened this issue Sep 26, 2024 · 9 comments
Open

Chromosome path not found in graph or haplotypes index #4411

zwh82 opened this issue Sep 26, 2024 · 9 comments

Comments

@zwh82
Copy link

zwh82 commented Sep 26, 2024

1. What were you trying to do?
I used vg rna to build a spliced pangenome and a pantranscriptome with a GFA file from PGGB and a gff3 file. Similarly, I used vg autoindex to build index for mpmap and rpvg.

2. What did you want to happen?
Get a spliced pangenome and a pantranscriptome from vg rna. Get index files from vg autoindex.

3. What actually happened?
I got a error ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2) from vg rna and vg autoindex.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

## vg rna
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00261929 seconds, 0.0147743 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
        ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).

## vg autoindex
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w.gfa -x mh_chr1.gff3
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
        ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).

5. What data and command can the vg dev team use to make the problem happen?
This is a small graph from PGGB. Two haplotypes in the graph are named MHChr01#0#chr01 and ZSChr01#1#chr01.
sub.gfa.gz

Then I convert to GFA with W lines.
sub_w.gfa.gz
command: vg convert -f sub.gfa > sub_w.gfa

This is a gff3 file.
mh_chr1.gff3.gz

commands for vg rna and vg autoindex

vg rna -p -t 64 -a -r -e -n mh_chr1.gff3 -s Parent -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_w.gfa > spliced.gfa

vg autoindex -p auto -w mpmap -w rpvg -g sub_w.gfa -x mh_chr1.gff3

The paths in sub_w.gfa

W	MHChr01	0	chr01	0	26966	>1>2>4
W	ZSChr01	1	chr01	0	45371	>1>3>4

The gff3 file

##gff-version 3
MHChr01	TBtools	gene	7552	15388	.	+	.	ID=OsMH_01G0000025
MHChr01	GSAman	mRNA	7552	15388	.	+	.	ID=OsMH_01T0000025.1;Parent=OsMH_01G0000025;Name=OsMH_01T0000025.1;ManInfo=EVD:ISO/EVD:RNA/REF:Nipp/UTR TBD
MHChr01	GSAman	exon	7552	7831	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	TBtools	five_prime_UTR	7552	7831	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	7917	8179	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	TBtools	five_prime_UTR	7917	8011	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	8012	8179	.	+	0	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	8920	9018	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	8920	9018	.	+	0	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	10020	10123	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	10020	10123	.	+	0	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	11699	12507	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	11699	12507	.	+	1	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	12591	12713	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	12591	12713	.	+	2	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	12795	12883	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	12795	12883	.	+	2	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	12971	13171	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	12971	13171	.	+	0	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	13773	14184	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	13773	14184	.	+	0	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	14667	14750	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	CDS	14667	14677	.	+	2	Parent=OsMH_01T0000025.1
MHChr01	TBtools	three_prime_UTR	14678	14750	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	14837	14993	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	TBtools	three_prime_UTR	14837	14993	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	GSAman	exon	15067	15388	.	+	.	Parent=OsMH_01T0000025.1
MHChr01	TBtools	three_prime_UTR	15067	15388	.	+	.	Parent=OsMH_01T0000025.1

I don't know why it can't find chr paths. I tried to run vg autoindex with gfa_with_w_lines.gfa and gfa_with_w_lines.gtf in vg test and it can work.

Then I tried to rename two haplotypes to MHChr01 and ZSChr01, so the GFA file can only be P lines.

P	MHChr01	1+,2+,4+,5+
P	ZSChr01	1+,3+,4+,6+

Now vg rna can work.

$ vg1_59 rna -p -t 64 -a -r -e -n mh_chr1.gff3 -s Parent -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub.gfa > spliced.gfa
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00922727 seconds, 0.0147705 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
        Parsed 4 transcripts
        Constructed 4 reference transcript paths
        Updated graph with reference transcript paths
[vg rna] Transcripts parsed and graph updated in 0.0158957 seconds, 0.023922 GB
[vg rna] Projecting transcripts to haplotypes ...
        Parsed 4 transcripts
        Projected 3 haplotype-specific transcript paths
[vg rna] Haplotype-specific transcripts constructed in 0.00692912 seconds, 0.0252037 GB
[vg rna] Topological sorting graph and compacting node ids ...
        Sorted 550 nodes
[vg rna] Graph sorted and compacted in 0.000983271 seconds, 0.0252037 GB
[vg rna] Adding reference and projected transcripts as embedded paths in the graph ...
        Embedded 7 paths in graph
[vg rna] Transcript paths added in 0.00014651 seconds, 0.0252037 GB
[vg rna] Writing pantranscriptome transcripts to file(s) ...
[vg rna] Writing splicing graph to stdout ...
[vg rna] Graph and pantranscriptome written in 0.119253 seconds, 0.281044 GB

But vg autoindex still can't work.

$ vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub.gfa -x mh_chr1.gff3 
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub.gfa -x mh_chr1.gff3
[IndexRegistry]: Checking for haplotype lines in GFA.
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
        GTF/GFF
        Reference GFA
are insufficient to create target index Haplotype-Transcript GBWT

Could you help me see what the problem is? Thanks for your help.

6. What does running vg version say?

vg version v1.59.0 "Casatico"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by [email protected]
@jeizenga
Copy link
Contributor

vg autoindex --workflow rpvg and vg rna -b both require haplotypes, so you need to communicate to interface that you have a source of haplotypes. In your case, that would be W lines in the GFA. However, it appears that what you actually have are reference annotations, and you are simply reinterpreting a reference sequence as a haplotype. When you do that, it confuses the interface, because its looking for a reference sequence to match the annotations, not a haplotype sequence.

I think the correct solution in your case would be to either 1) only convert the non-reference sequence into W lines, or 2) add the RS tag to the GFA header to indicate that a certain sample corresponds to the reference sequence.

I should also note that the GFF for a given haplotype is expected to follow the Pan-SN naming specification, which has semantics sample#haplotye_number#contig. In your W and P lines, it looks to me like you are mixing the contig name into the sample, which is not really what's intended.

@zwh82
Copy link
Author

zwh82 commented Sep 27, 2024

I should also note that the GFF for a given haplotype is expected to follow the Pan-SN naming specification, which has semantics sample#haplotye_number#contig. In your W and P lines, it looks to me like you are mixing the contig name into the sample, which is not really what's intended.

I know that. I set it in this way solely to better represent my samples for now.

I think the correct solution in your case would be to either 1) only convert the non-reference sequence into W lines, or 2) add the RS tag to the GFA header to indicate that a certain sample corresponds to the reference sequence.

I tried these solutions.

  1. I convert the non-reference sequence into W lines, and rename reference sequence as MHChr01. The vg autoindex and vg rna can work.
    The paths in the GFA file.
P	MHChr01	1+,2+,4+,5+,6+,8+
W	ZSChr01	0	chr01	0	45371	>1>3>4>6>7>8
  1. I tried to only add the RS tag to the GFA header in sub_w.gfa. But both vg rna and vg autoindex can't work.

It seems that the GFA file has strict format. It must include the reference path with P line and haplotype paths, and the name of reference path cannot follow the Pan-SN naming specification.

In fact, for this species, I have multiple genomes and corresponding annotation files. This means that each haplotype may be considered as the reference.

I tried to run with two annotation files(MHChr01 gff3 and ZSChr01 gff3) .

  1. If I want to run vg autoindex, I must manually set all samples as reference and haplotype in the GFA file at the same time.
    The command is vg autoindex -p auto -w mpmap -w rpvg -g sub_chr01.gfa -x mh_chr1.gff3 -x zs_chr1.gff3 -a Parent
    The paths in the GFA file.
P	MHChr01	1+,2+,4+,5+,6+,8+
P	ZSChr01	1+,3+,4+,6+,7+,8+
W	MHChr01	0	chr01	0	26966	>1>2>4>5>6>8
W	ZSChr01	0	chr01	0	45371	>1>3>4>6>7>8

I got the result from auto.txorigin.tsv.

Name	Length	Transcripts	Haplotypes
OsMH_01T0000100.2_R1	2507	OsMH_01T0000100.2	MHChr01,MHChr01#0#chr01
OsMH_01T0000100.2_H1	2507	OsMH_01T0000100.2	ZSChr01#0#chr01
OsZS_01T0000100.1_R1	363	OsZS_01T0000100.1	ZSChr01,ZSChr01#0#chr01

The I used the same GFA file to run vg rna. The command is vgrna -p -t 64 -k 256 -a -r -e -n mh_chr1.gff3 -n zs_chr1.gff3 -s Parent -f sub_chr01_pantranscriptome.fa -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_chr01.gfa > spliced.gfa
I got the result from pantranscriptome.txt.

Name	Length	Transcripts	Haplotypes
OsMH_01T0000100.2_R1	2507	OsMH_01T0000100.2	MHChr01,MHChr01#0#chr01#0
OsMH_01T0000100.2_H1	2507	OsMH_01T0000100.2	ZSChr01,ZSChr01#0#chr01#0
OsZS_01T0000100.1_R1	363	OsZS_01T0000100.1	ZSChr01,ZSChr01#0#chr01#0
  1. When I want to run vg rna again, I found only set all samples as reference. However, vg autoindex can't treat the new GFA as input.

The paths in the new GFA file.

P	MHChr01	1+,2+,4+,5+,6+,8+
P	ZSChr01	1+,3+,4+,6+,7+,8+

I got the result from pantranscriptome.txt.

Name	Length	Transcripts	Haplotypes
OsMH_01T0000100.2_R1	2507	OsMH_01T0000100.2	MHChr01
OsMH_01T0000100.2_H1	2507	OsMH_01T0000100.2	ZSChr01
OsZS_01T0000100.1_R1	363	OsZS_01T0000100.1	ZSChr01

It seems that the results are the same, but Haplotypes column is different. Which is the best way to achieve it? I prefer to use the new GFA in the second way to build all the required files through vg autoindex. But it's a pity that it can't work.

@jeizenga
Copy link
Contributor

If you have annotations for the other haplotypes, then that's what the -j option in vg rna and -H in vg autoindex are intended for. Right now it seems like there's something getting mixed up that is both adding the annotations as reference transcripts and projecting onto other haplotypes (which already have the annotations).

Can you specify what the RS tag and the W lines looked like when you said it didn't work? Also what error did it give you? I expect that this will be the most direct route to getting the spliced pangenome graph that you want.

@zwh82
Copy link
Author

zwh82 commented Sep 28, 2024

Can you specify what the RS tag and the W lines looked like when you said it didn't work?

This is the GFA file includes RS tag and w lines. The MHChr01 annotation file has been given previously.

H	VN:Z:1.1	RS:Z:MHChr01
W	MHChr01	1	chr01	0	26966	>1>2>4>5>6>8
W	ZSChr01	2	chr01	0	45371	>1>3>4>6>7>8
  1. vg rna
$ vg1_59 rna -p -t 64 -k 256 -a -r -e -n mh_chr1.gff3  -s Parent -f sub_chr01_pantranscriptome.fa -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_w_set_ref.gfa > spliced.gfa
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00377116 seconds, 0.0202713 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
        ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).
  1. vg autoindex
$ vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gff3  -a Parent
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gff3 -a Parent
[IndexRegistry]: Checking for haplotype lines in GFA.
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
        GTF/GFF
        Reference GFA
are insufficient to create target index Haplotype-Transcript GBWT


Maybe I confused something. Before I figure it out, the GFA file sub.gfa I use now is like this.

P	MHChr01	1+,2+,4+,5+,6+,8+
P	ZSChr01	1+,3+,4+,6+,7+,8+

If you have annotations for the other haplotypes, then that's what the -j option in vg rna and -H in vg autoindex are intended for.

Based on this GFA files, I have no way to use vg autoindex, which must check the presence of references and haplotypes.

I don't know how to use -j option properly. What does the -j option mean? The explanation is use haplotype paths in GBWT index as reference sequences (disables projection). My understanding is that the annotation file of each genome only traverses the corresponding genome for annotation, and does not project to other haplotypes. For example, MHChr01.gff3 only annotates MHChr01, and ZSChr01.gff3 only annotates ZSChr01.

Now I tried to use vg rna. The command is vg1_59 rna -p -t 64 -j -l new_sub_chr01_pantranscriptome.gbwt -s Parent -n mh_chr1.gff3 -n zs_chr1.gff3 -a -r -b pan.gbwt -i pan.txt sub.gfa > spliced.gfa. The new_sub_chr01_pantranscriptome.gbwt file includes the index of the two haplotypes. When I tried to stop using -j and -l, it seemed like I got the same result. The command is vg1_59 rna -p -t 64 -s Parent -n mh_chr1.gff3 -n zs_chr1.gff3 -a -r -b pan.gbwt -i pan.txt sub.gfa > spliced.pg

Name	Length	Transcripts	Haplotypes
OsMH_01T0000025.1_R1	2943	OsMH_01T0000025.1	MHChr01
OsMH_01T0000075.1_R1	2283	OsMH_01T0000075.1	MHChr01
OsMH_01T0000100.1_R1	2611	OsMH_01T0000100.1	MHChr01
OsMH_01T0000100.2_R1	2507	OsMH_01T0000100.2	MHChr01
OsZS_01T0000100.1_R1	363	OsZS_01T0000100.1	ZSChr01
OsZS_01T0000200.1_R1	795	OsZS_01T0000200.1	ZSChr01

However, it may be difficult to know haplotype specific transcripts clearly. When I read the article Haplotype-aware pantranscriptome analyses using spliced pangenome graphs, I found that the construction of splicing pangenome is through VCF + reference FASTA, and then projected the reference annotation onto the haplotype to obtain haplotype-specific transcripts. For a transcript that can be projected onto a haplotype, it has the same length, the same start and stop sites on both the reference and haplotype. The difference between them on the reference and haplotype is only caused by the variation of some bases. I don't know if my understanding is correct?

When I have multiple genomes and their corresponding annotations, I don't know how I should build them correctly. Because I have multiple annotation files, the same gene may have different start and end positions and different lengths on two haplotypes. There are also some transcripts that appear on specific haplotypes. These are two types of haplotype specific transcripts in my ideal situation.

Don't worry, I'm just organizing my thoughts and ideas. I really look forward to your suggestions.

@zwh82
Copy link
Author

zwh82 commented Sep 30, 2024

When I read the article Haplotype-aware pantranscriptome analyses using spliced pangenome graphs, I found that the construction of splicing pangenome is through VCF + reference FASTA, and then projected the reference annotation onto the haplotype to obtain haplotype-specific transcripts. For a transcript that can be projected onto a haplotype, it has the same length, the same start and stop sites on both the reference and haplotype. The difference between them on the reference and haplotype is only caused by the variation of some bases.

I'm sorry. I misunderstood that transcripts don’t all have the same length. But I have a question: when I project the reference annotation onto the haplotype, not all transcript annotations are projected onto the haplotype. In what situations would that happen?

@jeizenga
Copy link
Contributor

I think to use the GFA as input for vg rna, you would need to first convert it to a GBZ file with vg gbwt and then provide the GBZ as input to vg rna with the -z option.

With the vg autoindex command, it seems to be not detecting that your GFA has haplotypes in it, but it's not totally clear to me why. It should be enough to have a single W line that is not designated as a reference path with the RS tag. It seems like you already have this boiled down to a minimal reproducing example, so if you want to provide the inputs to me, I can take a look.

Regarding the -j option, you are correct that vg rna will not do projection if you provide this option. For users that only have annotations for a reference genome, vg rna can project the reference transcripts onto haplotypes by finding haplotypes that traverse each exon. However, there are other tools/pipelines that are more capable at comparative annotation, so if you have external annotations for the haplotypes, that's generally preferred. Accordingly, providing -j to vg rna or -H to vg autoindex automatically disables projection.

@jjuhyunkim
Copy link

Hi,

I have the same issue here. I generated the graph using minigraph Cactus with different reference samples (see below).

<style> </style>
seqFile chm13,hg002.1, hg002.2 chm13,hg002mat, hg002pat hg002mat, hg002pat
reference(RS header in GFA) chm13 chm13 hg002mat hg002pat hg002mat hg002pat
vg path -L -x chm13#0#chr22 hg002#1#chr22#0 hg002#1#chr22#1588801 hg002#1#chr22#8063369 hg002#1#chr22#8801348 hg002#1#chr22#9233182 hg002#1#chr22#9296730 hg002#1#chr22#9502398 hg002#1#chr22#17529314 hg002#1#chr22#21081973 hg002#1#chr22#21309690 hg002#2#chr22#276 hg002#2#chr22#1640972 hg002#2#chr22#2109789 hg002#2#chr22#4393926 hg002#2#chr22#5153161 hg002#2#chr22#13673342 hg002#2#chr22#17402067 chm13#0#chr22 hg002mat#0#chr22 hg002mat#0#chr22[1588801] hg002mat#0#chr22[8063369] hg002mat#0#chr22[8801348] hg002mat#0#chr22[9233182] hg002mat#0#chr22[9296730] hg002mat#0#chr22[9502398] hg002mat#0#chr22[17529314] hg002mat#0#chr22[21081973] hg002mat#0#chr22[21309690] hg002pat#0#chr22[276] hg002pat#0#chr22[1640972] hg002pat#0#chr22[2109789] hg002pat#0#chr22[4393926] hg002pat#0#chr22[5153161] hg002pat#0#chr22[13673342] hg002pat#0#chr22[17402067] hg002mat#0#chr22 hg002pat#0#chr22[7613] hg002pat#0#chr22[696114] hg002pat#0#chr22[1580247] hg002pat#0#chr22[2109789] hg002pat#0#chr22[3713744] hg002pat#0#chr22[13674874] hg002pat#0#chr22[17239804]

I’ve tried to match the gene annotation file with the chromosome names, but I’m still encountering the same error message. How can I resolve this? Could you please provide me with an example of the gene annotation file?

My script looks like this:

vg rna -p --threads $SLURM_CPUS_PER_TASK \
				-q --transcripts test.gtf --gbz-format \
				--write-gbwt ${chr}.pantranscriptome.gbwt \
				--use-hap-ref \
				--write-info ${chr}.pantranscriptome.txt \
				-f ${chr}.pantranscriptome.fa \
				minigraph.chr22.gbz \
				--gbwt-bidirectional > ${chr}.spliced_graph.pg

@zwh82
Copy link
Author

zwh82 commented Oct 8, 2024

@jeizenga Sorry for the late response, I was on vacation for a week. It would be great if you could still take some time to take a look. Here are the GFA file and annotation file I used.
GFA file (W lines with RS tag):
sub_w_set_ref.gfa.gz
gbz file:
sub_chr01_pantranscriptome_set_ref.gbz.gz

two annotation files:
mh_chr1.gtf.gz
zs_chr1.gtf.gz

I can't work with the following commands with this GFA file.

# vg rna
vg rna -p -t 64 -k 256 -n mh_chr1.gtf -n zs_chr1.gtf -r -f chr01_pantranscriptome.fa -i chr01_pantranscriptome.txt -b chr01_pantranscriptome.gbwt sub_w_set_ref.gfa > spliced.gfa

# vg rna gbz
vg rna -p -t 64 -k 256 -n mh_chr1.gtf -n zs_chr1.gtf -r -f chr01_pantranscriptome.fa -i chr01_pantranscriptome.txt -b chr01_pantranscriptome.gbwt -z sub_chr01_pantranscriptome_set_ref.gbz > spliced.gfa

# vg autoindex
vg autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gtf -x zs_chr1.gtf

vg autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -H mh_chr1.gtf -H zs_chr1.gtf

Accordingly, providing -j to vg rna or -H to vg autoindex automatically disables projection.

Maybe I wasn’t clear before, but I find this a bit confusing. When I specify multiple annotation files without enabling projection, vg rna can annotate on the corresponding haplotype in two ways: 1) Without the -e option and other construction options, it will not do projection, and it will annotate on the corresponding haplotype. 2) -j must be specified together with -l, which is the GBWT index file of haplotypes. I also need to build an additional GBWT index file of haplotypes. Both methods yield the same results, but the second way seems more troublesome than the first way. I can't see the advantages of the second way.

@zwh82
Copy link
Author

zwh82 commented Oct 29, 2024

@jeizenga I think I have figured out how vg rna and vg autoindex run correctly. Here are the tests I conducted and the required documents. Perhaps it will be clearer.

  1. vg rna
# GFA: two haplotypes are written as W lines
# GTF: the name seems necessary to follow SAMPLE#HAPLOTYPE#LOCUS#PHASE_BLOCK(may be the old path metadata model of vg?) 
vg rna -p -n mh_chr1.gtf -n zs_chr1.gtf sub_w.gfa > spliced.gfa
# projection
vg rna -p -a -r -e -n mh_chr1.gtf sub_w.gfa -i hap.txt > spliced.gfa

The GFA file and GTF file used in vg rna.

# GFA 
H	VN:Z:1.1
W	MHChr01	1	chr01	0	26966	>1>2>4>5>6>8
W	ZSChr01	2	chr01	0	45371	>1>3>4>6>7>8

# GTF
MHChr01#1#chr01#0	GSAman	transcript	7552	15388	.	+	.	transcript_id "OsMH_01T0000025.1"; gene_id "OsMH_01G0000025"
ZSChr01#2#chr01#0	makerRMon	transcript	11754	12116	.	-	.	transcript_id "OsZS_01T0000100.1"; gene_id "OsZS_01G0000100"
  1. vg autoindex
# vg autoindex for rpvg must need one reference(written as P line or W line with RS tag) and one haplotype in GFA file, and specified the reference annotation file with -x, --tx-gff option
# GFA: The reference can be written as P line or W line with RS tag. The haplotype is written as W line.
# GTF: The name of reference written as P line is the same as the name of P line. The name of haplotype or reference with W line need to follow PanSN(SAMPLE#HAPLOTYPE#LOCUS/contig_name) instead of SAMPLE#HAPLOTYPE#LOCUS#PHASE_BLOCK
vg autoindex -p auto -w rpvg -w mpmap -g sub_w.gfa -x mh_chr1.gtf -H zs_chr1.gtf

The GFA file and GTF file used in vg autoindex.

# GFA 
H	VN:Z:1.1	RS:Z:MHChr01
W	MHChr01	1	chr01	0	26966	>1>2>4>5>6>8
W	ZSChr01	2	chr01	0	45371	>1>3>4>6>7>8

# GTF
MHChr01#1#chr01	GSAman	transcript	7552	15388	.	+	.	transcript_id "OsMH_01T0000025.1"; gene_id "OsMH_01G0000025"
ZSChr01#2#chr01	makerRMon	transcript	11754	12116	.	-	.	transcript_id "OsZS_01T0000100.1"; gene_id "OsZS_01G0000100"

Although vg autoindex can work now, it seems unable to perform haplotype projection and can only annotate the corresponding haplotypes based on the annotation file. Perhaps I haven't found the correct approach, and I would like to know whether vg autoindex can perform haplotype projection based on reference annotations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants