-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unnormal results ! #37
Comments
@umahsn can you please help me in that if you have some time, I will be very grateful |
Hi, yes I can help look into it. Is the dataset you are analyzing publicly available? If so, I can try to replicate the variant calling pipeline and see what's going on. |
If you could let me know what reference you are using, nanopore dataset as well as how you are aligning the reads that would be helpful. |
One thing I'd recommend is that if you are using a microbiome sample, it might be better to not use --haploid_genome mode. Because even though the genome itself is haploid, if the sample is taken from a microbial community with a mixture of potentially different strains, then it becomes a slightly different question and falls into the realm of metagenomics. The haploid model of NanoCaller was trained on CHM13, a haploid cell line, and is most suitable for samples that come from cell culture or from haploid chromosomes of a single organism, basically under the assumption that there should only be a single haplotype and the model is forced to choose between homozygous reference and homozygous alternative. If you have a mixture of strains, then the sample is somewhat "polyploid" in the sense that there are multiple haplotypes present, hence metagenomics. When you run NanoCaller under haploid model, then it will tend to report those variants that are confidently supporting a single haplotype assumption. In this case, I'd recommend using NanoCaller without --haploid_genome parameter, but I would warn you that we have not evaluated accuracy of NanoCaller in metagenomic variant calling. It might also be suitable to use a metagenomics pipeline, and there are some datasets outlined here for potentially benchmakring nanopore tools for microbial datasets: https://academic.oup.com/gigascience/article/8/5/giz043/5486468 You can also check https://academic.oup.com/gigascience/article/9/2/giaa007/5728470 for SNP pipelines for microbial datasets. |
Hello @umahsn , thanks for your reply. Thank you in advance. |
Yes I can take a look at the datasets. I have reached out via email. |
Hi, I ran NanoCaller on this sample under the haploid model, and there was an example of a clear variant at (137 NC_007197.1:c1781822-1780204) with >97% allele frequency that was missed by NanoCaller. When I looked at the intermediate probability output, the reference base had low confidence, but was still the highest probability prediction. I believe this was an error on part of However, when I used the diploid model, this variant was correctly identified. There were some other potential variant sites with <25% allele frequency but NanoCaller diploid model did not call those. The reason for that is that NanoCaller did not see any evidence for any other variants co-occurring on the same reads as these variants which can be tested by looking at an IGV plot as well. This indicates that such mismatches are likely to be errors, however this is still with the assumption of a diploid organism. If you are interested in just getting a list of sites with variant allele frequency in a certain range, such as 1-25% or something like that, it is possible to use samtools pileup. NanoCaller previously (v<3.0) did produce file containing a list of all sites with allele frequency above min_allele_freq threshold, but I removed it in the current version. I will look into reintroducing that feature. |
Hi, Just wanted to update that I have added more details into the VCF files, including for candidate sites that were determined to not be true variants by NanoCaller. There is a "unfiltered" VCF file for SNPs that contains these extra sites. In the INFO field, you can see the probability that each base is a true allele. These have only been implemented for the diploid model, which I think is more suitable for your scenario since it estimates probability of occurence of each base independently. There seems to be few problems with the haploid model which I will fix, but I am also working on releasing a new model for metagenomes, which might take a couple months to get released. |
Hello,
I have some ONT reads of Aspergillus fumigatus that I mapped against a reference sequence.
I am using Nanocaller for variants calling, My input files are a BAM file (555,1 Ko; 85 sequences mapped) and my ref is a fasta file (3,6 Ko, length = 1200pb)
The command that I am using is :
NanoCaller --mode snps --sequencing ont --haploid_genome --bam sorted_mapped.bam --ref gene.fna
I have few questions:
1/ Is it normal that my VCF output file contains only 8 snps ! Since, as I know the VCF file should mention all variations in all sequences compared to the ref seq, and when I have a look through IGV, I have more than 8 snps. (I think the tool mentioned only the variations that are present in almost all sequences! )? I tried to use other flags as
--min_allele_freq
to precise that the minimum allele frequency should be 0,01 and--mincov
to precise that coverage of 1 is good to call a variant, but nothing changed, always 8 variants !2/ Is there any way to perform some performance analysis on the tool? I was thinking to use a true dataset containing some known variants, but It seems that I didn't find the one until now.
I will be very grateful if you can help.
Thanks in advance!
have a good day!
The text was updated successfully, but these errors were encountered: