
Runtime #15

Open
Akazhiel opened this issue Jul 22, 2021 · 7 comments

Comments

@Akazhiel

Hello!

In the NanoCaller paper you have a table of runtimes for the different modes and technologies, and I noticed that for ONT in the mode that calls both SNPs and indels, the runtime on 16 CPUs was about 18 h. I've been running my data on 8 CPUs, since that's the maximum this machine has, and it has been going for 23 h already without reaching chromosome 3 yet. The data is ONT, run in the `both` mode. What could be the reason it's taking so long?

Best regards,

Jonatan

@umahsn
Collaborator

umahsn commented Jul 27, 2021

Hi Jonatan,

Thank you for pointing this out. We made some changes to the indel candidate site selection and reorganized the code to be more modular, and this might be causing an increase in runtime. I reran some tests and it does seem that runtime has increased. While I fix this issue, you can try an older release (before v0.4), which has similar indel performance and the same SNP performance. Additionally, I can try to create a branch that provides the older candidate selection method as an option while using the same API as the v0.4 release.

@umahsn
Collaborator

umahsn commented Jul 27, 2021

Another thing: if you used a human reference genome for variant calling, did you set the --exclude_bed option to hg19 or hg38? This parameter removes telomeric and centromeric regions from variant calling and can significantly increase speed, because those regions have very high alignment error rates and therefore produce far too many variant candidates; the chr1 centromere in particular can take several hours by itself.

In our paper, we used this parameter to report runtime.
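For reference, an invocation with this option might look like the sketch below. The file paths are placeholders, and the exact flag spellings vary between NanoCaller versions, so treat this as an assumption and check the tool's --help output before running:

```shell
# Sketch only: sample.bam, GRCh38.fa and the output prefix are
# placeholders; verify flag names with `python NanoCaller.py --help`.
python NanoCaller.py \
    -bam sample.bam \
    -ref GRCh38.fa \
    -mode both \
    -seq ont \
    -cpu 16 \
    -o calls \
    --exclude_bed hg38
```

Passing hg38 (or hg19) here uses the bundled exclusion BED for that build rather than a custom file.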

@Akazhiel
Author

Hello!

Thanks for all the input and tips. I'll try a previous version and check how much faster it goes. Eventually we'll run this on an HPC with access to more CPUs, which will speed things up a lot, but I still found it odd that it's so slow on 8 CPUs; it has taken 5 days just to complete one sample.

As for your second reply: yes, I did use the --exclude_bed option with hg38.

@umahsn
Collaborator

umahsn commented Jul 29, 2021

Just for context, can you tell me the coverage of your BAM file and, if you know it, which Guppy version was used to basecall the reads?

@Akazhiel
Author

Hello!

The average coverage of the BAM file is 20x, assuming I didn't calculate it the wrong way; there are many ways of computing it and I always fail to find an easy, straightforward one. As for Guppy, the version was 3.4.5.
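One straightforward definition of average coverage is the mean of per-base depths over all positions, which is what piping `samtools depth -a` into an averaging step computes. As a minimal sketch (the function name and input format mimic `samtools depth` output, which is an assumption about how the depths were produced), the averaging itself is just:

```python
def mean_depth(depth_lines):
    """Average the depth column of `samtools depth -a` output
    (tab-separated: chrom, pos, depth). The -a flag matters:
    without it, zero-depth positions are skipped and the mean
    is overestimated."""
    total, n = 0, 0
    for line in depth_lines:
        _chrom, _pos, depth = line.rstrip("\n").split("\t")
        total += int(depth)
        n += 1
    return total / n if n else 0.0

# Toy check on three positions:
print(mean_depth(["chr1\t1\t18", "chr1\t2\t22", "chr1\t3\t20"]))  # 20.0
```

In practice a dedicated tool such as `samtools coverage` or mosdepth reports the same per-chromosome statistic without the manual bookkeeping.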

@umahsn
Collaborator

umahsn commented Aug 3, 2021

Hi Jonatan,

It turns out that the problem was caused by this commit: 2546959, so I have reverted the changes from that commit in v0.4.1 (both in this repo and in the Docker image). You should see a ~40% reduction in runtime compared to v0.4.0, and performance will be similar to what we reported in our paper.

During this testing I found several other areas for runtime improvement, for instance replacing Biopython's pairwise alignment algorithm with one implemented in C. I will be releasing these improvements over the next few weeks.

Also, the NanoCaller logs report the coverage calculated for SNP calling. If you use NanoCaller_WGS.py, these logs will be in the output/logs/ directory; with NanoCaller.py they are simply printed to stdout, as in this example: https://github.com/WGLab/NanoCaller/blob/master/sample/log

@Akazhiel
Author

Akazhiel commented Aug 5, 2021

Hello Mian!

That's great news! Thanks for looking into it and fixing it so quickly. I reckon the Biopython-replacement improvements will come in a future version and are not yet included in v0.4.1? I'm looking forward to them!

On another note, I've checked the logs, and the average coverage was indeed 20x for the tested sample. In a discussion with the author of PEPPER, another SNP/indel caller, we agreed that 20x seems low for calling these types of variants, since the caller will report almost everything it finds. Hopefully the samples I've asked to be sequenced will have higher coverage and the calls will be more precise.

Best regards,

Jonatan
