
Runtime #15

Open
Akazhiel opened this issue Jul 22, 2021 · 7 comments

Comments

@Akazhiel

Hello!

In the NanoCaller paper you have a table of runtimes for the different modes and technologies, and I noticed that for ONT in the mode that calls both SNPs and indels, the runtime on 16 CPUs was about 18 h. I've been running my data on 8 CPUs, since that's the maximum this machine has, and it has been going for 23 h already without reaching chromosome 3 yet. The data is ONT, run in the `both` mode. What could be the reason it's taking so long?

Best regards,

Jonatan

@umahsn
Collaborator

umahsn commented Jul 27, 2021

Hi Jonatan,

Thank you for pointing this out. We made some changes to the indel candidate site selection and reorganized the code to be more modular, and this might be causing an increase in runtime. I reran some tests and it does seem that runtime has increased. While I fix this issue, you can try an older release (before v0.4), which has similar indel performance and the same SNP performance. Additionally, I can try to create a branch that provides the older candidate selection method as an option while using the same API as the v0.4 release.

@umahsn
Collaborator

umahsn commented Jul 27, 2021

Another thing: if you used a human reference genome for variant calling, did you set the --exclude_bed option to hg19 or hg38? This parameter removes telomeric and centromeric regions from variant calling and can significantly increase speed, because those regions have very high alignment error rates and therefore produce far too many variant candidates; the chr1 centromere in particular can take several hours by itself.

In our paper, we used this parameter to report runtime.
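For reference, an invocation with this option might look like the sketch below. The file paths are placeholders, and the exact flag spellings vary between NanoCaller versions, so treat this as an assumption and check the tool's --help output before running:

```shell
# Sketch only: sample.bam, GRCh38.fa and the output prefix are
# placeholders; verify flag names with `python NanoCaller.py --help`.
python NanoCaller.py \
    -bam sample.bam \
    -ref GRCh38.fa \
    -mode both \
    -seq ont \
    -cpu 16 \
    -o calls \
    --exclude_bed hg38
```

Passing hg38 (or hg19) here uses the bundled exclusion BED for that build rather than a custom file.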

@Akazhiel
Author

Hello!

Thanks for all the input and tips. I'll try a previous version and check how much faster it goes. Eventually we'll run this on an HPC with access to more CPUs, which will speed things up a lot, but I still found it odd that it's so slow on 8 CPUs; it has taken 5 days just to complete one sample.

As for your second reply: yes, I did use the --exclude_bed option with hg38.

@umahsn
Collaborator

umahsn commented Jul 29, 2021

Just for context, can you tell me the coverage of your BAM file and, if you know it, which Guppy version was used to basecall the reads?

@Akazhiel
Author

Hello!

The average coverage of the BAM file is 20x, assuming I didn't calculate it the wrong way; there are many ways of computing it and I always fail to find an easy, straightforward one. As for Guppy, the version was 3.4.5.
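One straightforward definition of average coverage is the mean of per-base depths over all positions, which is what piping `samtools depth -a` into an averaging step computes. As a minimal sketch (the function name and input format mimic `samtools depth` output, which is an assumption about how the depths were produced), the averaging itself is just:

```python
def mean_depth(depth_lines):
    """Average the depth column of `samtools depth -a` output
    (tab-separated: chrom, pos, depth). The -a flag matters:
    without it, zero-depth positions are skipped and the mean
    is overestimated."""
    total, n = 0, 0
    for line in depth_lines:
        _chrom, _pos, depth = line.rstrip("\n").split("\t")
        total += int(depth)
        n += 1
    return total / n if n else 0.0

# Toy check on three positions:
print(mean_depth(["chr1\t1\t18", "chr1\t2\t22", "chr1\t3\t20"]))  # 20.0
```

In practice a dedicated tool such as `samtools coverage` or mosdepth reports the same per-chromosome statistic without the manual bookkeeping.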

@umahsn
Collaborator

umahsn commented Aug 3, 2021

Hi Jonatan,

It turns out that the problem was caused by this commit: 2546959, so I have reverted the changes from that commit in v0.4.1 (both in this repo and in the Docker image). You should see a ~40% reduction in runtime compared to v0.4.0, and performance will be similar to what we reported in our paper.

During this testing I found several other areas for runtime improvement, for instance replacing Biopython's pairwise alignment algorithm with one implemented in C. I will be releasing these improvements over the next few weeks.

Also, the NanoCaller logs report the coverage calculated for SNP calling. If you use NanoCaller_WGS.py, these logs will be in the output/logs/ directory; with NanoCaller.py they are simply printed to stdout, as in this example: https://github.com/WGLab/NanoCaller/blob/master/sample/log

@Akazhiel
Author

Akazhiel commented Aug 5, 2021

Hello Mian!

That's great news! Thanks for looking into it and fixing it so quickly. I reckon the Biopython-replacement improvements will come in a future version and are not yet included in v0.4.1? I'm looking forward to them!

On another note, I've checked the logs, and the average coverage was indeed 20x for the tested sample. In a discussion with the author of PEPPER, another SNP/indel caller, we agreed that 20x seems low for calling these types of variants, since the caller will report almost everything it finds. Hopefully the samples I've asked to be sequenced will have higher coverage and the calls will be more precise.

Best regards,

Jonatan
