toil-vg call takes over two days on yeast graph #737
Comments
Thanks, David. I'll take a look. That's way too long, especially when compared to the human samples that can be done in 3 hours for 30x coverage.
On Fri, Mar 15, 2019 at 10:41 AM, David Heller wrote:
Hi,
I'm running toil-vg call on a cactus graph of 5 yeast strains (~12Mb genome size) and it takes surprisingly long (62 hours). I used the latest toil from 3 days ago and the second latest vg docker image (quay.io/vgteam/vg:v1.14.0-38-g4bd8aa5c-t290-run). These were my commands:
MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
toil clean aws:us-west-2:vgcall-yeast-cactus-four-jobstore
toil-vg call --realTimeStderr --config config.txt --nodeTypes r4.xlarge,r4.large --minNodes 0,0 --maxNodes 1,2 --provisioner aws --batchSystem mesos --mesosMaster=${MASTER_IP}:5050 --metrics aws:us-west-2:vgcall-yeast-cactus-four-jobstore component0.xg SRR4074413.recall.cactus.four aws:us-west-2:vgcall-yeast-cactus-four-outstore --gams SRR4074413.mapped.sorted.gam --recall --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI 2> SRR4074413.recall.cactus.four.log
All input data and the log output can be found in this directory: /public/groups/cgl/users/daheller/yeast_graph/graphs/cactus_four/speed_issue.
I would be very grateful for hints on why my run might have been so slow
and guidance on how to speed it up. Cheers!
David
Usually this kind of slowness comes down to how much context has to be pulled in around each chunk. Fixing this is on the todo list, and there's an issue here: vgteam/vg#2144. But @eldariont, do you have a sense of what the maximum insertion length being called in yeast is? This should be measurable from the VCF output of toil-vg call. If it's much smaller than 80kb, you can probably get away with lowering the context expansion. If you're dealing with very big insertions such that we can't drop the context, …
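A minimal sketch of how the maximum insertion length could be read off the call VCF; the file name is a placeholder, bcftools is assumed to be available, and symbolic ALT alleles such as <INS> are not handled:

# Sketch: largest insertion, taken as the biggest ALT-minus-REF allele length difference.
# SRR4074413.vcf.gz is a placeholder for the toil-vg call output VCF.
bcftools query -f '%REF\t%ALT\n' SRR4074413.vcf.gz \
  | awk -F'\t' '{ n = split($2, alts, ","); for (i = 1; i <= n; i++) { d = length(alts[i]) - length($1); if (d > max) max = d } } END { print "largest insertion (bp):", max + 0 }'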
PS @eldariont Always use …
I don't think the context thing is the whole story. I just checked my run, and the different chromosomes all seem to take about the same amount of time. Both chunking and calling seem to take 6 hours each, which seems excessive on both counts. It's especially ironic given that chunking isn't doing anything too useful here, since the chromosomes are smaller than the chunk size. Something like an r3.8xlarge with …
Hi Glenn,
If I divide length by nodes I get an average node length of 3.2 bp, which is probably much less than for a graph based on the human reference plus a few variants. The largest insertion call on the first sample was close to 50kb, but only 35 calls were larger than 1kb. Applying the formula …
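For reference, the arithmetic above can be reproduced roughly as sketched below; the graph file name is a placeholder and the counts in the final comment are only illustrative:

# Sketch: average node length = total sequence length / number of nodes.
# component0.vg is a placeholder; -z prints node/edge counts, -l prints total sequence length.
vg stats -z -l component0.vg
# e.g. a total length of 12,000,000 bp over 3,750,000 nodes gives 12000000 / 3750000 = 3.2 bp per node.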
Yeah, that's an order of magnitude more nodes at least than human, and a much higher average degree. I'll take a stab at improving this, starting with chunk, but I'm not sure we should hold up the paper for it. I'd say just move to bigger nodes. I use … I think it may be worth trying to call one sample with increased …
Thanks, Glenn. I restarted the pipeline with the parameters you recommended and it has been running smoothly so far.
I've run a few more tests (which involve lots of waiting because it's so slow). I think the chunking as currently implemented just won't work on your graph. On the human VCF-derived graphs, when I use vg chunk to grab, say, chr1:1-1000000 and expand the context 2500 steps, it pulls in a few insertions and maybe some deletions and stuff around the edges. There are no long-range or inter-chromosome events. But when doing the same on your yeast graph, it pulls in half the graph. The resulting chunks are large and complex, and it then takes several hours just to compute the snarls on them, which is the first step of calling.
I'll try adding an option to turn off chunking in toil-vg call. I think this is the best bet short term. The next version of the caller (hopefully to be started after the paper's done) ought to do away with explicit chunking altogether.
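To make that concrete, the kind of extraction being described looks roughly like the sketch below; the index name and path interval are placeholders, and the 2500-step context mirrors the number mentioned above:

# Sketch: pull out ~1 Mb of a reference path and expand the context by 2500 node steps.
# graph.xg and the chr1 interval are placeholders.
vg chunk -x graph.xg -p chr1:1-1000000 -c 2500 > chunk.vg
# On a VCF-derived human graph the chunk stays close to the requested region;
# on the yeast cactus graph the same expansion drags in a large fraction of the whole graph.
vg stats -z -l chunk.vg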
Yes, that is a very good explanation for the long runtimes. Is it feasible to skip the chunking and call on the entire yeast graph? If so, that seems like a very good solution for my experiments. Do you have a guess why the graph has characteristics like these, i.e. very small nodes and many long-range edges? Is it a necessary consequence of the cactus multiple genome alignment, or rather something that could be remedied with altered parameters to cactus?
I'm implementing an option to bypass chunking right now. Will let you know when it's tested. If it doesn't take too much memory, I think it'll be the way to go. I think the small nodes / long edges come in part from the high divergence between your strains. But Cactus being overzealous with what it aligns could be a factor too.
#738 helps. Here's an example that runs in two hours by disabling chunking:
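A sketch of what such an invocation might look like; the --call_chunk_size 0 option name is an assumption for the chunking-disable switch from #738 (check toil-vg call --help for the flag that actually landed), and the other options are carried over from the original command:

# Sketch only: the original call with chunking disabled.
# --call_chunk_size 0 is an assumed option name; verify against toil-vg call --help.
MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
toil-vg call --realTimeStderr --config config.txt --nodeTypes r4.xlarge,r4.large --minNodes 0,0 --maxNodes 1,2 \
    --provisioner aws --batchSystem mesos --mesosMaster=${MASTER_IP}:5050 \
    --call_chunk_size 0 \
    aws:us-west-2:vgcall-yeast-cactus-four-jobstore component0.xg SRR4074413.recall.cactus.four \
    aws:us-west-2:vgcall-yeast-cactus-four-outstore \
    --gams SRR4074413.mapped.sorted.gam --recall \
    --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI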
Don't think it ever used more than two nodes, and ran in …
Wow, it's now blazingly fast. Thanks, Glenn, for adding this option! 👍