toil-vg call takes over two days on yeast graph #737

Open
eldariont opened this issue Mar 15, 2019 · 12 comments

@eldariont

Hi,

I'm running toil-vg call on a cactus graph of 5 yeast strains (~12Mb genome size) and it takes surprisingly long (62 hours). I used the latest toil from 3 days ago and the second-latest vg docker image (quay.io/vgteam/vg:v1.14.0-38-g4bd8aa5c-t290-run). These were my commands:

MASTER_IP=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |awk -F: '{print $2}'`
toil clean aws:us-west-2:vgcall-yeast-cactus-four-jobstore
toil-vg call --realTimeStderr --config config.txt --nodeTypes r4.xlarge,r4.large --minNodes 0,0 --maxNodes 1,2 --provisioner aws --batchSystem mesos --mesosMaster=${MASTER_IP}:5050 --metrics aws:us-west-2:vgcall-yeast-cactus-four-jobstore component0.xg SRR4074413.recall.cactus.four aws:us-west-2:vgcall-yeast-cactus-four-outstore --gams SRR4074413.mapped.sorted.gam --recall --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI 2> SRR4074413.recall.cactus.four.log

All input data and the log output can be found in this directory: /public/groups/cgl/users/daheller/yeast_graph/graphs/cactus_four/speed_issue.

I would be very grateful for hints on why my run might have been so slow and guidance on how to speed it up. Cheers!
David

@glennhickey
Collaborator

glennhickey commented Mar 15, 2019 via email

@glennhickey
Collaborator

Usually vg chunk takes about as much time as vg gamsort. For this data, gamsort took about 30 minutes, but chunk is still going more than an hour later. I'm wondering if the highly connected nature of the cactus graph (relative to the human graphs I've been looking at) is causing the context expansion phase of chunk to blow up. In --recall mode the context is jacked way up to 2500 nodes by default.

Fixing this is on the todo list, and there's an issue here: vgteam/vg#2144

But @eldariont, do you have a sense of the maximum insertion length you're calling in yeast? This should be measurable from the VCF output of toil-vg call. If it's much smaller than 80kb, you can probably get away with lowering recall-context in the config. In general, recall-context should be bigger than the maximum number of nodes you'd expect to find in an insertion in your graph. Something like max-insert-len / avg-node-size + some padding is probably fine.
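As a rough illustration of that rule of thumb, here's a shell sketch (the numbers are placeholders, not values measured from your data):

# back-of-the-envelope estimate of a safe recall context (hypothetical values)
MAX_INSERT_LEN=50000    # longest insertion you expect to call, in bp
AVG_NODE_SIZE=3         # average node length in your graph, in bp
PADDING=1500            # extra slack on top of the estimate
echo $(( MAX_INSERT_LEN / AVG_NODE_SIZE + PADDING ))    # prints 18166 (nodes)

Whatever that works out to is the value you'd use in place of the 2500-node default for the recall context in the toil-vg config.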

If you're dealing with very big insertions such that we can't drop the recall-context too much, I should be able to fix it by resolving that above vg issue on Monday...

@glennhickey
Collaborator

PS @eldariont Always use --realTimeLogging with toil-vg. It'll give you a much better sense of what's running when and for how long.

@glennhickey
Collaborator

I don't think the context thing is the whole story. I just checked my run:

Successfully ran vg gamsort -i SRR4074413.recall.cactus.four_chroms.gam.sorted.gam.gai SRR4074413.recall.cactus.four_chroms.gam --threads 16 in 6028.70355916 seconds.

Successfully ran vg chunk -x graph.vg.xg -a SRR4074413.recall.cactus.four_chroms.gam.sorted.gam -c 2500 -P path_list.txt -g -s 2500000 -o 100000 -b call_chunk_chroms -t 16 -E output_bed_chunks_chroms.bed -f in 21950.013551 seconds.

Successfully ran vg augment chunk_S288C.chrI_0.vg chunk_S288C.chrI_0.gam -t 4 --augmentation-mode pileup --translation chunk_S288C.chrI_0.trans --support chunk_S288C.chrI_0.support --recall in 1804.87139297 seconds.

Successfully ran vg call chunk_S288C.chrI_0_aug.vg -t 4 -S SRR4074413.recall.cactus.four -z chunk_S288C.chrI_0.trans -s chunk_S288C.chrI_0.support -b chunk_S288C.chrI_0.vg -u -n 0 -e 1000 -r S288C.chrI -c S288C.chrI -l 219929 -o 0 in 21920.575428 seconds.

The different chromosomes all seem to take about the same amount of time, so chunking and calling each take roughly 6 hours, which seems excessive on both counts. It's especially ironic since chunking isn't doing anything very useful here: the chromosomes are smaller than the chunk size. Something like an r3.8xlarge with --whole_genome_config should be enough to get your runtime under a day, but there's clearly some other stuff going on that needs investigating.

@eldariont
Author

Hi Glenn,
thank you so much for exploring this. I think that the cactus graph is special in a couple of regards. For instance, vg stats on the graph gives me the following statistics:

vg stats -z component0.vg
nodes	4861438
edges	6563304
vg stats -l component0.vg
length	15390507

If I divide length by nodes I get an average node length of 3.2 bp, which is probably much shorter than in a graph based on the human reference and a few variants.
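For reference, a small shell sketch of that calculation, parsing the two-column vg stats output shown above:

# average node length = total sequence length / node count
NODES=$(vg stats -z component0.vg | awk '$1 == "nodes" {print $2}')
LENGTH=$(vg stats -l component0.vg | awk '$1 == "length" {print $2}')
echo "scale=1; $LENGTH / $NODES" | bc    # ~3.2 bp per node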

The largest insertion call on the first sample was close to 50kb, but only 35 calls were larger than 1kb. Applying the formula max-insert-len / avg-node-size + some padding would give something like 17,000 nodes, while my current run used 2,500.

@glennhickey
Collaborator

Yeah, that's at least an order of magnitude more nodes than human, and a much higher average degree. I'll take a stab at improving this, starting with chunk, but I'm not sure we should hold up the paper for it. I'd say just move to bigger nodes. I use --maxNodes 20 --nodeTypes r3.8xlarge:0.53 --defaultPreemptable --whole_genome_config (note the syntax for spot-bid pricing).

I think it may be worth trying to call one sample with increased recall-context and seeing if it slows it down substantially and if you get any extra insertions. The other thing to try is reducing the call chunk size, but I don't know if that'll save you more than a couple hours per sample.
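If you do try the smaller chunk size, it's set with --call_chunk_size; a sketch of just that part of the command (the 1000000 is purely illustrative, versus the 2500000 your run used):

toil-vg call ... --recall --call_chunk_size 1000000 ...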

@eldariont
Author

Thanks, Glenn. I restarted the pipeline with the parameters you recommended, and it has been running smoothly so far.

@glennhickey
Collaborator

I've run a few more tests (which involve lots of waiting because it's so slow). I think the chunking as currently implemented just won't work on your graph. On the human VCF-derived graphs, when I use vg chunk to grab, say, chr1:1-1000000 and expand the context 2500 steps, it pulls in a few insertions and maybe some deletions and stuff around the edges. There are no long-range or inter-chromosome events. But when doing the same on your yeast graph, it pulls in half the graph.
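Concretely, the kind of query I mean looks something like this (a sketch; the -p path:range target here is assumed, so double-check it against vg chunk --help for your version):

# grab a 1 Mb region of a reference path and expand the context by 2500 steps
vg chunk -x graph.vg.xg -p chr1:1-1000000 -c 2500 > chr1_region.vg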

This makes vg chunk very slow because it needs to do a large search entailing many xg lookups and creation of intermediate protobuf objects.

The resulting chunks are large and complex, and it then takes several hours just to compute the snarls on them, which is the first step of vg call. This is exacerbated by all the dangling ends left by vg chunk, which the snarl code doesn't like.

I'll try adding an option to turn off chunking in toil-vg call. I think this is the best bet short term. The next version of the caller (hopefully to be started after the paper's done) ought to do away with explicit chunking altogether.

@eldariont
Author

Yes, that is a very good explanation for the long runtimes. Is it feasible to skip the chunking and call on the entire yeast graph? If so, that seems like a good solution for my experiments.

Do you have a guess why the graph has these characteristics, like very small nodes and many long-range edges? Is it a necessary consequence of the Cactus multiple genome alignment, or rather something that could be remedied with different Cactus parameters?

@glennhickey
Collaborator

I'm implementing an option to bypass chunking right now. Will let you know when it's tested. If it doesn't take too much memory, I think it'll be the way to go.

I think the small nodes / long edges come in part from the high divergence between your strains. But Cactus being overzealous with what it aligns could be a factor too.

@glennhickey
Collaborator

glennhickey commented Mar 20, 2019

#738 helps. Here's an example that runs in two hours by disabling chunking:

toil-vg call aws:us-west-2:glennhickey-jobstore-chn1 ./component0.xg SRR4074413.recall.cactus.four aws:us-west-2:glennhickey/outstore/speed-issue --gams ./SRR4074413.mapped.sorted.gam --chroms S288C.chrI S288C.chrII S288C.chrIII S288C.chrIV S288C.chrV S288C.chrVI S288C.chrVII S288C.chrVIII S288C.chrIX S288C.chrX S288C.chrXI S288C.chrXII S288C.chrXIII S288C.chrXIV S288C.chrXV S288C.chrXVI --recall --call_chunk_size 0 --whole_genome_config --realTimeLogging --realTimeStderr --logFile mylog.log --provisioner aws --batchSystem mesos --defaultPreemptable --retryCount 3 --nodeTypes r3.8xlarge:0.53 --maxNodes 20

Don't think it ever used more than two nodes, and it ran in 7380.4997642 seconds, so total cost < $2.

@eldariont
Author

Wow, it's now blazingly fast. Thanks, Glenn, for adding this option! 👍
