How to Generate a VCF File Showing Differences Between HG002 Haplotypes from the HPRC Pangenome #4419

Open
xuxingyubio opened this issue Oct 17, 2024 · 4 comments

Comments

@xuxingyubio

Hello VG Team,

I am currently working with the HPRC pangenome and aiming to construct a VCF file that highlights the differences between the two haplotypes (hap1 and hap2) of the HG002 sample. Specifically, I want to generate a VCF file that represents hap2 relative to hap1 for HG002.

So far, I have downloaded the HPRC pangenome data from the HPRC project, which includes multiple haplotypes for various samples, including HG002. I have attempted to use VG tools, such as vg convert to change the reference, but found that it doesn't seem to support operations targeting individual haplotypes, and vg deconstruct to obtain VCF files; however, it appears that it does not allow for processing single haplotypes separately. It seems that the current VG tools do not support operations on individual haplotypes within a sample.

I am specifically looking to extract the variant differences between hap1 and hap2 of HG002 and represent them in a VCF file. Could you please guide me on how to effectively generate a VCF file that captures the differences between the two haplotypes (hap1 and hap2) of the HG002 sample from the HPRC pangenome?

Thank you for your support!

@glennhickey
Contributor

HG002 was held out of the release HPRC graphs. If you want to make your own HG002-only graph, you can do so quite quickly with minigraph-cactus.

You'd feed it something like

HG002_hap1   HG002.hap1.fa.gz
HG002_hap2   HG002.hap2.fa.gz

And run with --reference HG002_hap1 HG002_hap2 --vcf --vcfReference HG002_hap1 HG002_hap2 among the usual options to get a pair of haploid VCFs, each comparing one haplotype with the other.
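
As a minimal sketch of the setup described above, the snippet below writes out the two-column seqfile and assembles (without running) a cactus-pangenome command line. The job store path, seqfile name, output directory and output name are hypothetical placeholders, and the flag spelling `--vcfReference` is assumed from the cactus-pangenome help text, so check `cactus-pangenome --help` for your installed version.

```python
import shlex

# Sample names and assembly paths as given in the comment above.
seqfile_rows = [
    ("HG002_hap1", "HG002.hap1.fa.gz"),
    ("HG002_hap2", "HG002.hap2.fa.gz"),
]


def seqfile_text(rows):
    """Render rows as 'name<TAB>path' lines, one sample per line."""
    return "".join(f"{name}\t{path}\n" for name, path in rows)


def cactus_command(jobstore, seqfile, out_dir, out_name, references):
    """Assemble (but do not execute) a cactus-pangenome argument list."""
    return [
        "cactus-pangenome", jobstore, seqfile,
        "--outDir", out_dir,
        "--outName", out_name,
        "--reference", *references,
        "--vcf",
        "--vcfReference", *references,
    ]


# Hypothetical paths and names, for illustration only.
cmd = cactus_command(
    "./js", "hg002.seqfile", "./hg002-pg", "hg002",
    ["HG002_hap1", "HG002_hap2"],
)

print(seqfile_text(seqfile_rows), end="")
print(shlex.join(cmd))
```

Listing both haplotypes under `--reference` and again under the VCF reference option is what requests one VCF per haplotype; the first name listed is treated as the primary reference.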

Otherwise, if you already have a graph with HG002 in it, then I think vg deconstruct -P will work. You may need to promote HG002 to a reference path as described here.

@xuxingyubio
Author

Thank you for your response. I followed your method and tried it out, but I noticed that the contig lengths in the generated VCF file do not match the original lengths, resulting in positional misalignment. Could this be due to some trimming performed during pangenome construction (minigraph-cactus)? Is it possible to obtain the trimmed FASTA file used in pangenome construction?

id=NA12878hap1|ptg000002l
1895202
##contig=<ID=NA12878hap1#0#ptg000002l,length=1894829>

@glennhickey
Contributor

Yeah, that's a known issue due to path fragmentation. The VCF itself is valid and the coordinates are correct; it's just that the contig lengths in the header can be too short. This only happens when multiple references are given, and only to references after the first (so hap2 in our example). Your options are:

  • fix the header manually
  • rerun with --reference HG002_hap2 --vcf --vcfReference HG002_hap2 to make the second VCF
  • try running with --vcf full to make a VCF of the unclipped graph. But note you will get some giant sites for the centromeres that you may want to remove yourself.
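
For the first option, fixing the header by hand, a small script can do the substitution. This is only a sketch: it assumes you already know the true contig lengths (for example from the input FASTA) and have keyed them by the contig ID exactly as it appears in the VCF header. The function name `fix_contig_lengths` is made up for illustration, and the example values are the ones quoted earlier in this thread.

```python
import re

# Matches header lines such as ##contig=<ID=name,length=123> and captures
# the prefix, the contig ID, the length, and whatever follows the length.
CONTIG_RE = re.compile(r"^(##contig=<ID=([^,>]+),length=)(\d+)(.*)$")


def fix_contig_lengths(vcf_lines, true_lengths):
    """Return a copy of vcf_lines with ##contig lengths replaced.

    vcf_lines: lines without trailing newlines (e.g. from str.splitlines()).
    true_lengths: dict mapping VCF contig ID -> correct length.
    Lines for contigs not present in true_lengths are left untouched.
    """
    fixed = []
    for line in vcf_lines:
        m = CONTIG_RE.match(line)
        if m and m.group(2) in true_lengths:
            line = f"{m.group(1)}{true_lengths[m.group(2)]}{m.group(4)}"
        fixed.append(line)
    return fixed


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=NA12878hap1#0#ptg000002l,length=1894829>",
]
fixed = fix_contig_lengths(header, {"NA12878hap1#0#ptg000002l": 1895202})
print("\n".join(fixed))
```

Only the header is rewritten; the variant records are passed through unchanged, which matches the statement above that the coordinates themselves are already correct.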

@xuxingyubio
Author

I used hg38 as the reference, then switched the reference using vg convert and constructed the VCF file with vg deconstruct. However, I noticed that the reference bases in the VCF file at the corresponding coordinates do not match the original bases in the input FASTA file. Can using an unclipped graph solve this problem?
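
To pin down where the REF bases and the FASTA disagree, a check along the following lines may help. It is a diagnostic sketch only, not something from the vg or cactus toolkits: `ref_mismatches` is a hypothetical helper, the sequences are assumed to be loaded into a dict keyed by the same contig names the VCF uses, and the records below are toy data.

```python
def ref_mismatches(vcf_lines, sequences):
    """List records whose REF allele differs from the FASTA sequence.

    vcf_lines: VCF lines (header lines starting with '#' are skipped).
    sequences: dict mapping contig name -> full sequence string.
    Returns tuples of (chrom, pos, vcf_ref, fasta_bases).
    """
    mismatches = []
    for line in vcf_lines:
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref = line.split("\t")[:4]
        seq = sequences.get(chrom)
        if seq is None:
            continue  # contig not loaded; nothing to compare against
        start = int(pos) - 1  # VCF POS is 1-based, Python slices are 0-based
        found = seq[start:start + len(ref)]
        if found.upper() != ref.upper():
            mismatches.append((chrom, int(pos), ref, found))
    return mismatches


# Toy data: the first record agrees with the sequence, the second does not.
sequences = {"ctg1": "ACGTACGT"}
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ctg1\t2\t.\tCG\tC\t.\t.\t.",
    "ctg1\t5\t.\tT\tG\t.\t.\t.",
]
print(ref_mismatches(records, sequences))
```

If every mismatch sits downstream of a particular position on a contig, that points to an offset introduced somewhere in the pipeline rather than to scattered base errors.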
