Work in progress!
Feel welcome to submit a trouble ticket if you think we should write, rewrite, or extend one of its sections now.
> jl --help
Usage: jl <command> [options] [<joblist_file>]
Supported commands are
submit Submits a job to the underlying queuing system and adds it to the list
add Extracts job-ids from stdin and adds them to the list
wait Wait for a list of jobs to finish
resub Resubmit non-complete jobs with escalated scheduler parameters
status Prints various statistics and allows to create an html report for the list
cancel Removes all jobs of this list from the scheduler queue
up Moves a list of jobs to the top of a queue (if supported by the underlying scheduler)
reset Removes all information related to this joblist.
If no <joblist_file> is provided, jl will use '.jobs' as default, but to save typing it will remember
the last used joblist instance per directory.
All sub-commands provide more specific information (e.g. jl submit --help).
The basic workflow is as follows:
- Submit some jobs
jl submit "sleep 10" ## add a job
jl submit "sleep 1000" ## add another which won't finish in our default queue
- Wait for them to finish
jl wait
> 2 jobs in total; 0.0% complete; Remaining time <NA>; 0 done; 0 running; 2 pending; 0 killed; 0 failed
> 2 jobs in total; 0.0% complete; Remaining time <NA>; 0 done; 2 running; 0 pending; 0 killed; 0 failed
> 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 1 running; 0 pending; 0 killed; 0 failed
> 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 0 running; 0 pending; 1 killed; 0 failed
- Report status, render an html report and log information with
jl status
> 2 jobs in total; 50.0% complete; Remaining time ~10S; 1 done; 0 running; 0 pending; 1 killed; 0 failed
jl status --report
> .jobs: Exported statistics into .jobs.{runinfo|jc}.log
> .jobs: Rendering HTML report... done
- Resubmit non-complete jobs by escalating their scheduler configuration
jl resub --queue "long" ## wait and resubmit failing jobs to another queue
By using jl, workflows are decoupled from the underlying queuing system, i.e. jl-ified workflows will run on a Slurm system, an LSF cluster or simply locally on any desktop machine.
- Submit jobs to a queuing system or to a local scheduler
- Wait for lists of jobs to finish
- Monitor job submissions
Submit some jobs with bsub/sbatch as you're used to and use jl for blocking, monitoring and final status handling:
bsub "echo foo" | jl add
bsub "echo bar" | jl add
bsub "exit 1" | jl add
## wait for the jobs to finish; if some fail to complete, jl wait will exit with 1,
## so we can stop here (or do whatever feels most appropriate)
jl wait || exit 1
## print the captured stderr to understand why they failed
jl status --failed --logs err
All jl commands use .jobs as a default list, but you can provide your own for clarity:
bsub "sleep 3" | jl add .other_jobs
Or, to give a Slurm example:
echo '#!/bin/bash
touch test.txt' | sbatch -p my_queue -J test_job --time=00:20 | jl add
jl wait --report
Since jl startup time is limited by the underlying Java VM, subsequent invocations might be too slow to monitor or submit large numbers (i.e. >1k) of jobs. There are two options to overcome this limitation: batch submission and submission monitoring.
jl can read job definitions from a file or /dev/stdin and do a batch submission. By default it expects one job per line.
for job_nr in $(seq 1 10 | head -n2); do
echo "echo 'sleeping for $job_nr seconds'; sleep $job_nr"
done | jl submit --batch -
jl also allows using a custom separator regex pattern to separate and submit multi-line commands:
for job_nr in $(seq 1 10); do
echo "
## another job nr${job_nr}
sleepTime=$(perl -e 'print int(rand(20)) + 10');
sleep \$sleepTime;
echo slept for \$sleepTime seconds in job $job_nr
" | sed 's/^ *//' ## delete leading whitespace to allow for more robust regex
done | jl submit --batch - --bsep '^##'
The separator pattern is expected to be a Java regular expression, which will be matched unanchored against each line in order to split the input into job chunks.
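Batch definitions do not have to come from a pipe; as noted above, jl can also read them from a file. A minimal sketch, assuming the same '^##' separator convention (job_definitions.txt is just a placeholder name):
## create a small job definition file with two '##'-separated jobs
printf '## job 1\nsleep 5\n## job 2\nsleep 10\n' > job_definitions.txt
jl submit --batch job_definitions.txt --bsep '^##'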
To debug such loops you can use source /dev/stdin as a pipe sink:
for job_nr in $(seq 1 10); do
echo "hello nr${job_nr}"
done | source /dev/stdin
Differences to a simple eval are detailed in difference-between-eval-and-source-dev-stdin.
A simplified method to just monitor a large number of jobs is to capture the job IDs directly from stdout using jl add. Here's an example:
for job_nr in $(seq 1 10); do
bsub "sleep 10"
done | jl add
Jobs are resubmitted by escalating their base configuration.
Note: Only jobs that were submitted with jl submit can be resubmitted in case they fail. When using jl add, resubmission with jl resub is not possible.
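As a small sketch of this distinction (reusing the --queue escalation shown above; the queue name "long" is just an example):
jl submit "sleep 100" ## submitted via jl submit, so it is eligible for resubmission
jl wait
jl resub --queue "long" ## escalate any non-complete jobs to another queue

bsub "sleep 100" | jl add ## added via jl add, so jl resub won't work for this job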
jl can be globally configured by exporting some shell variables prior to launching jl.
JL_FORCE_LOCAL: Force the local scheduler.
JL_DISABLE_REMEMBER_ME: When set, always fall back to .jobs if no joblist name is given.
JL_MAX_LOCAL_JOBS: Maximum number of concurrent local jobs. If not set, jl will use the number of cores - 2 as default. Just affects the local scheduler.
Expert and Debugging Options:
JL_LOG_SUBMISSIONS: Print the actual shell command that is executed by jl when submitting a job.
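For example, to force local execution with a bounded number of parallel jobs, these variables could be exported before calling jl. A minimal sketch, assuming a non-empty value is enough to enable a flag:
export JL_FORCE_LOCAL=true ## force the local scheduler (assumption: any non-empty value enables it)
export JL_MAX_LOCAL_JOBS=4 ## cap concurrent local jobs at 4
jl submit "sleep 10"
jl wait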