To protect against wasted CPU time when jobs get evicted from Condor, we need to implement a checkpointing scheme so that jobs can be restarted from some working point.
@areeda, can you outline how this is done in the Matlab implementation?
Checkpointing is a pretty gross process currently. After trigger generation is complete, a file is written; if the job restarts and that file is there, it uses the existing triggers, otherwise it starts recalculating from the beginning. For the hveto run itself, Matlab has the ability to write out all the variables in the workspace very quickly, so before the first round and after each completed round the workspace is written out along with a flag.
If hveto is restarted and the flag exists, the preliminaries are skipped and we pick up after the last completed round. For the Omega scans we check whether the applicable output files exist; if so, we can skip that scan.
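For the Python implementation, a minimal sketch of that round-level checkpoint could look something like the following. The checkpoint file name, the contents of the state dictionary, and the `run_round` / `N_ROUNDS` placeholders are assumptions for illustration, not part of hveto's actual API:

```python
import os
import pickle

CHECKPOINT = "hveto-checkpoint.pkl"
N_ROUNDS = 5  # placeholder number of rounds

def save_checkpoint(state):
    """Atomically write the current analysis state to disk."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    # rename is atomic, so a partial write never clobbers a good checkpoint
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    """Return the saved state, or None if no checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return None

def run_round(n, state):
    """Placeholder for one hveto round; real code would do the veto analysis here."""
    state["results"].append({"round": n, "winner": None})
    return state

state = load_checkpoint()
if state is None:
    # no checkpoint found: run the preliminaries and start from round 1
    state = {"last_round": 0, "results": []}

for n in range(state["last_round"] + 1, N_ROUNDS + 1):
    state = run_round(n, state)
    state["last_round"] = n
    save_checkpoint(state)  # checkpoint after every completed round
```

On a restart the loop simply resumes from `last_round + 1`, which mirrors the "skip the preliminaries and pick up after the last completed round" behaviour described above.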
In preparation for the Gravity Spy enabling studies that we want to run over O1 and O2, we have resurrected and modified the program we used with the previous Matlab version.
The trigger collection is a very time-consuming operation, mainly because there are so many XML files to be processed. For O2 at LLO there are about 20 million XML files; just counting the files took 75 minutes.
So the current plan is to read through all of the XML files for O1 and O2, write them out in an intermediate format, and maybe reformat them one more time for efficient use. For example, we plan to run each category of Gravity Spy glitches over O1 and O2 at LHO and LLO.
This process checkpoints at each channel: while a channel is being processed its output is written to a '.tmp' file, and when complete it is renamed to a '.csv' file. If we are evicted and restarted, it is a very quick check to see which CSV files already exist.
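A minimal sketch of that per-channel write-then-rename checkpoint, assuming a flat output directory, a placeholder channel list, and a hypothetical `collect_triggers` function standing in for the slow XML-reading step:

```python
import csv
import os

OUTDIR = "triggers"
CHANNELS = ["L1:CHANNEL_A", "L1:CHANNEL_B"]  # placeholder channel names

def collect_triggers(channel):
    """Placeholder for reading this channel's triggers out of the XML files."""
    return [{"time": 0.0, "snr": 0.0, "frequency": 0.0}]

os.makedirs(OUTDIR, exist_ok=True)
for channel in CHANNELS:
    final = os.path.join(OUTDIR, channel.replace(":", "-") + ".csv")
    if os.path.exists(final):
        continue  # checkpoint hit: this channel finished before the eviction
    tmp = final + ".tmp"
    with open(tmp, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["time", "snr", "frequency"])
        writer.writeheader()
        writer.writerows(collect_triggers(channel))
    os.replace(tmp, final)  # only a fully written file ever gets the .csv name
```

Because the rename only happens after a channel is fully written, a restart never trusts a half-finished file and only reprocesses channels that have no '.csv' yet.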