To protect against wasted CPU time when jobs get evicted from Condor, we need to implement a checkpointing scheme so that jobs can be restarted from some working point.
@areeda, can you outline how this is done in the Matlab implementation?
Checkpointing is a pretty gross process currently. After trigger generation is complete, a file is written; if the job restarts and that file is there, it uses the existing triggers, otherwise it starts recalculating from the beginning. For the hveto run itself, Matlab has the ability to write out all the variables in the workspace very quickly, so before the first round and after each completed round the workspace is written out along with a flag.
If hveto is restarted and the flag exists, the preliminaries are skipped and we pick up after the last completed round. For the Omega scans we check whether the applicable output files exist; if so, we can skip that scan.
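For the Python implementation, a minimal sketch of that round-level checkpoint could look something like the following. The checkpoint file name, the contents of the state dictionary, and the `run_round` / `N_ROUNDS` placeholders are assumptions for illustration, not part of hveto's actual API:

```python
import os
import pickle

CHECKPOINT = "hveto-checkpoint.pkl"
N_ROUNDS = 5  # placeholder number of rounds

def save_checkpoint(state):
    """Atomically write the current analysis state to disk."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    # rename is atomic, so a partial write never clobbers a good checkpoint
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    """Return the saved state, or None if no checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return None

def run_round(n, state):
    """Placeholder for one hveto round; real code would do the veto analysis here."""
    state["results"].append({"round": n, "winner": None})
    return state

state = load_checkpoint()
if state is None:
    # no checkpoint found: run the preliminaries and start from round 1
    state = {"last_round": 0, "results": []}

for n in range(state["last_round"] + 1, N_ROUNDS + 1):
    state = run_round(n, state)
    state["last_round"] = n
    save_checkpoint(state)  # checkpoint after every completed round
```

On a restart the loop simply resumes from `last_round + 1`, which mirrors the "skip the preliminaries and pick up after the last completed round" behaviour described above.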
In preparation for the Gravity Spy enabling studies that we want to run over O1 and O2, we have resurrected and modified the program we used with the previous Matlab version.
The trigger collection is a very time-consuming operation, mainly because there are so many XML files to be processed. For O2 at LLO there are about 20 million XML files; just counting the files took 75 minutes.
So the current plan is to read through all of the XML files for O1 and O2, write them out in an intermediate format, and maybe reformat them one more time for efficient use. For example, we plan to run each category of Gravity Spy glitches over O1 and O2 at LHO and LLO.
This process checkpoints at each channel: while a channel is being processed its output is written to a '.tmp' file, and when complete it is renamed to a '.csv' file. If we are evicted and restarted, it is a very quick check to see which CSV files already exist.
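A minimal sketch of that per-channel write-then-rename checkpoint, assuming a flat output directory, a placeholder channel list, and a hypothetical `collect_triggers` function standing in for the slow XML-reading step:

```python
import csv
import os

OUTDIR = "triggers"
CHANNELS = ["L1:CHANNEL_A", "L1:CHANNEL_B"]  # placeholder channel names

def collect_triggers(channel):
    """Placeholder for reading this channel's triggers out of the XML files."""
    return [{"time": 0.0, "snr": 0.0, "frequency": 0.0}]

os.makedirs(OUTDIR, exist_ok=True)
for channel in CHANNELS:
    final = os.path.join(OUTDIR, channel.replace(":", "-") + ".csv")
    if os.path.exists(final):
        continue  # checkpoint hit: this channel finished before the eviction
    tmp = final + ".tmp"
    with open(tmp, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["time", "snr", "frequency"])
        writer.writeheader()
        writer.writerows(collect_triggers(channel))
    os.replace(tmp, final)  # only a fully written file ever gets the .csv name
```

Because the rename only happens after a channel is fully written, a restart never trusts a half-finished file and only reprocesses channels that have no '.csv' yet.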