toil-vg call takes over two days on yeast graph #737

Open
eldariont opened this issue Mar 15, 2019 · 12 comments

@eldariont

Hi,

I'm running toil-vg call on a cactus graph of 5 yeast strains (~12Mb genome size) and it takes surprisingly long (62 hours). I used the latest toil from 3 days ago and the second-latest vg docker image (quay.io/vgteam/vg:v1.14.0-38-g4bd8aa5c-t290-run). These were my commands:

MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
toil clean aws:us-west-2:vgcall-yeast-cactus-four-jobstore
toil-vg call --realTimeStderr --config config.txt --nodeTypes r4.xlarge,r4.large --minNodes 0,0 --maxNodes 1,2 --provisioner aws --batchSystem mesos --mesosMaster=${MASTER_IP}:5050 --metrics aws:us-west-2:vgcall-yeast-cactus-four-jobstore component0.xg SRR4074413.recall.cactus.four aws:us-west-2:vgcall-yeast-cactus-four-outstore --gams SRR4074413.mapped.sorted.gam --recall --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI 2> SRR4074413.recall.cactus.four.log

All input data and the log output can be found in this directory: /public/groups/cgl/users/daheller/yeast_graph/graphs/cactus_four/speed_issue.

I would be very grateful for hints on why my run might have been so slow and guidance on how to speed it up. Cheers!
David

@glennhickey
Collaborator

glennhickey commented Mar 15, 2019 via email

@glennhickey
Collaborator

Usually vg chunk takes about as much time as vg gamsort. For this data, gamsort took about 30 minutes, but chunk is still going more than an hour later. I'm wondering if the highly connected nature of the cactus graph (relative to the human graphs I've been looking at) is causing the context expansion phase of chunk to blow up. In --recall mode the context is jacked way up to 2500 nodes by default.

Fixing this is on the todo list, and there's an issue here: vgteam/vg#2144

But @eldariont, do you have a sense of the maximum insertion length you're calling in yeast? This should be measurable from the VCF output of toil-vg call. If it's much smaller than 80kb, you can probably get away with lowering recall-context in the config. In general, recall-context should be bigger than the maximum number of nodes you'd expect to find in an insertion in your graph. Something like max-insert-len / avg-node-size + some padding is probably fine.
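As a rough illustration of that rule of thumb, here's a shell sketch (the numbers are placeholders, not values measured from your data):

# back-of-the-envelope estimate of a safe recall context (hypothetical values)
MAX_INSERT_LEN=50000    # longest insertion you expect to call, in bp
AVG_NODE_SIZE=3         # average node length in your graph, in bp
PADDING=1500            # extra slack on top of the estimate
echo $(( MAX_INSERT_LEN / AVG_NODE_SIZE + PADDING ))    # prints 18166 (nodes)

Whatever that works out to is the value you'd use in place of the 2500-node default for the recall context in the toil-vg config.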

If you're dealing with very big insertions such that we can't drop the recall-context too much, I should be able to fix it by resolving that above vg issue on Monday...

@glennhickey
Collaborator

PS @eldariont Always use --realTimeLogging with toil-vg. It'll give you a much better sense of what's running when and for how long.

@glennhickey
Collaborator

I don't think the context thing is the whole story. I just checked my run:

Successfully ran vg gamsort -i SRR4074413.recall.cactus.four_chroms.gam.sorted.gam.gai SRR4074413.recall.cactus.four_chroms.gam --threads 16 in 6028.70355916 seconds.

Successfully ran vg chunk -x graph.vg.xg -a SRR4074413.recall.cactus.four_chroms.gam.sorted.gam -c 2500 -P path_list.txt -g -s 2500000 -o 100000 -b call_chunk_chroms -t 16 -E output_bed_chunks_chroms.bed -f in 21950.013551 seconds.

Successfully ran vg augment chunk_S288C.chrI_0.vg chunk_S288C.chrI_0.gam -t 4 --augmentation-mode pileup --translation chunk_S288C.chrI_0.trans --support chunk_S288C.chrI_0.support --recall in 1804.87139297 seconds.

Successfully ran vg call chunk_S288C.chrI_0_aug.vg -t 4 -S SRR4074413.recall.cactus.four -z chunk_S288C.chrI_0.trans -s chunk_S288C.chrI_0.support -b chunk_S288C.chrI_0.vg -u -n 0 -e 1000 -r S288C.chrI -c S288C.chrI -l 219929 -o 0 in 21920.575428 seconds.

The different chromosomes all seem to take about the same amount of time, so chunking and calling each take roughly 6 hours, which seems excessive on both counts. It's especially ironic since chunking isn't doing anything very useful here: the chromosomes are smaller than the chunk size. Something like an r3.8xlarge with --whole_genome_config should be enough to get your runtime under a day, but there's clearly some other stuff going on that needs investigating.

@eldariont
Author

Hi Glenn,
thank you so much for exploring this. I think that the cactus graph is special in a couple of regards. For instance, vg stats on the graph gives me the following statistics:

vg stats -z component0.vg
nodes	4861438
edges	6563304
vg stats -l component0.vg
length	15390507

If I divide length by nodes I get an average node length of 3.2 bp, which is probably much shorter than in a graph based on the human reference and a few variants.
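For reference, a small shell sketch of that calculation, parsing the two-column vg stats output shown above:

# average node length = total sequence length / node count
NODES=$(vg stats -z component0.vg | awk '$1 == "nodes" {print $2}')
LENGTH=$(vg stats -l component0.vg | awk '$1 == "length" {print $2}')
echo "scale=1; $LENGTH / $NODES" | bc    # ~3.2 bp per node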

The largest insertion call on the first sample was close to 50kb, but only 35 calls were larger than 1kb. Applying the formula max-insert-len / avg-node-size + some padding would give something like 17,000 nodes, while my current run used 2,500.

@glennhickey
Collaborator

Yeah, that's at least an order of magnitude more nodes than human, and a much higher average degree. I'll take a stab at improving this, starting with chunk, but I'm not sure we should hold up the paper for it. I'd say just move to bigger nodes. I use --maxNodes 20 --nodeTypes r3.8xlarge:0.53 --defaultPreemptable --whole_genome_config (note the syntax for spot-bid pricing).

I think it may be worth trying to call one sample with increased recall-context and seeing if it slows it down substantially and if you get any extra insertions. The other thing to try is reducing the call chunk size, but I don't know if that'll save you more than a couple hours per sample.
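If you do try the smaller chunk size, it's set with --call_chunk_size; a sketch of just that part of the command (the 1000000 is purely illustrative, versus the 2500000 your run used):

toil-vg call ... --recall --call_chunk_size 1000000 ...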

@eldariont
Author

Thanks, Glenn. I restarted the pipeline with the parameters you recommended, and it has been running smoothly so far.

@glennhickey
Collaborator

I've run a few more tests (which involve lots of waiting because it's so slow). I think the chunking as currently implemented just won't work on your graph. On the human VCF-derived graphs, when I use vg chunk to grab, say, chr1:1-1000000 and expand the context 2500 steps, it pulls in a few insertions and maybe some deletions and stuff around the edges. There are no long-range or inter-chromosome events. But when doing the same on your yeast graph, it pulls in half the graph.
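Concretely, the kind of query I mean looks something like this (a sketch; the -p path:range target here is assumed, so double-check it against vg chunk --help for your version):

# grab a 1 Mb region of a reference path and expand the context by 2500 steps
vg chunk -x graph.vg.xg -p chr1:1-1000000 -c 2500 > chr1_region.vg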

This makes vg chunk very slow because it needs to do a large search entailing many xg lookups and creation of intermediate protobuf objects.

The resulting chunks are large and complex, and it then takes several hours just to compute the snarls on them, which is the first step of vg call. This is exacerbated by all the dangling ends left by vg chunk, which the snarl code doesn't like.

I'll try adding an option to turn off chunking in toil-vg call. I think this is the best bet short term. The next version of the caller (hopefully to be started after the paper's done) ought to do away with explicit chunking altogether.

@eldariont
Author

Yes, that is a very good explanation for the long runtimes. Is it feasible to skip the chunking and call on the entire yeast graph? If so, that seems like a good solution for my experiments.

Do you have a guess why the graph has these characteristics, like very small nodes and many long-range edges? Is it a necessary consequence of the Cactus multiple genome alignment, or rather something that could be remedied with different Cactus parameters?

@glennhickey
Collaborator

I'm implementing an option to bypass chunking right now. Will let you know when it's tested. If it doesn't take too much memory, I think it'll be the way to go.

I think the small nodes / long edges come in part from the high divergence between your strains. But Cactus being overzealous with what it aligns could be a factor too.

@glennhickey
Collaborator

glennhickey commented Mar 20, 2019

#738 helps. Here's an example that runs in two hours by disabling chunking:

toil-vg call aws:us-west-2:glennhickey-jobstore-chn1 ./component0.xg SRR4074413.recall.cactus.four aws:us-west-2:glennhickey/outstore/speed-issue --gams ./SRR4074413.mapped.sorted.gam --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI --recall --call_chunk_size 0 --whole_genome_config --realTimeLogging --realTimeStderr --logFile mylog.log --provisioner aws --batchSystem mesos --defaultPreemptable --retryCount 3 --nodeTypes r3.8xlarge:0.53 --maxNodes 20

Don't think it ever used more than two nodes, and it ran in 7380.4997642 seconds, so total cost < $2.

@eldariont
Author

Wow, it's now blazingly fast. Thanks, Glenn, for adding this option! 👍
