Addresses SMART-Lab#160
IshmaelBelghazi committed Aug 8, 2017
1 parent e8a7d0a commit 151020e
Showing 3 changed files with 7 additions and 14 deletions.
2 changes: 1 addition & 1 deletion docs/source/autoresume.rst
@@ -10,5 +10,5 @@ tasks as soon as they hit the walltime. The caveat here is that your tasks
**must be resumable**, i.e. be capable of restoring their state after being
killed and rerun.

-You can engage the autoresumption by passing ``-m`` or ``--autoresume`` during
+You can engage the autoresumption by passing ``-r`` or ``--autoresume`` during
``smart-dispatch`` execution. See :doc:`usage` for details.
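
The "must be resumable" requirement can be illustrated with a minimal sketch of a task that checkpoints its own state. This is not part of Smart Dispatch: the names ``run_resumable``, ``save_checkpoint`` and the JSON checkpoint file are hypothetical, and the summing loop only stands in for real work.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_path, path)


def run_resumable(total_steps, checkpoint_path):
    """Run up to total_steps, continuing from the last saved step if any."""
    # Hypothetical state layout: the next step to run and a running total.
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        save_checkpoint(state, checkpoint_path)
    return state
```

A task written this way produces the same final state whether it runs once to completion or is killed at the walltime and rerun, which is the property autoresumption relies on.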
11 changes: 6 additions & 5 deletions docs/source/usage.rst
@@ -11,10 +11,11 @@ Hierarchy of generated files

In order to understand the contents of the generated folders/files, it's good to know how ``smart-dispatch`` deals with **commands** that a user requests to launch on the cluster:

* Each invocation of ``smart-dispatch`` creates a so-called **batch** of **jobs**. Smart Dispatch will do its best to create as many simultaneous jobs as possible so as to effectively utilize the allocated resources.
* Smart Dispatch will distribute commands to jobs such that each of the latter uses an entire node. Jobs may run many commands concurrently if necessary to use a maximum number of cores and GPUs. The distribution is based on number of cores per node / per command and number of GPUs per node / per command.

* Each job is basically a single PBS file that is run by the queue management system on the cluster (either ``msub`` or ``qsub``).
-* A job spawns mulitple concurrent **workers** that all cooperate to execute the requested commands.
-* Each worker (basically, a python script) is executing commands sequentially.
+* A job spawns multiple concurrent **workers** that all cooperate to execute the requested commands.
+* Each worker is executing commands sequentially.
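
The distribution rule in the list above (cores per node / per command and GPUs per node / per command) can be sketched as simple integer arithmetic. This is an illustration under assumptions, not the Smart Dispatch implementation: ``commands_per_job`` and ``jobs_needed`` are hypothetical helpers, and the sketch assumes every command asks for the same resources.

```python
import math


def commands_per_job(cores_per_node, gpus_per_node,
                     cores_per_command, gpus_per_command=0):
    """Number of commands that fit concurrently on one node (hypothetical)."""
    if cores_per_command < 1:
        raise ValueError("a command needs at least one core")
    fit = cores_per_node // cores_per_command
    # GPUs only constrain the packing when commands actually request them.
    if gpus_per_command > 0:
        fit = min(fit, gpus_per_node // gpus_per_command)
    return fit


def jobs_needed(n_commands, per_job):
    """Number of whole-node jobs required to cover all commands."""
    if per_job < 1:
        raise ValueError("a single command does not fit on one node")
    return math.ceil(n_commands / per_job)
```

For instance, on nodes with 8 cores and 2 GPUs, commands requesting 2 cores and 1 GPU each are limited by the GPUs, so two of them share a node.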

A typical hierarchy of ``./SMART_DISPATCH_LOGS/{batch_id}/`` should look like this: ::

@@ -58,7 +59,7 @@ Now let's go through the subdirectories.
This directory holds generated PBS files (``job_commands_{pbs_index}.sh``) as well as three command lists:

``commands.txt``:
-A list pending commands (this is where the workers are taking their next commands to execute from).
+A list of pending commands (this is where the workers are taking their next commands to execute from).
``running_commands.txt``:
A list of currently running commands.
``failed_commands.txt``:
@@ -68,7 +69,7 @@ This directory holds generated PBS files (``job_commands_{pbs_index}.sh``) as we
``logs/``
^^^^^^^^^

-Output and error logs in are saved in this directory. The root level contains logs for actual commands. There are also two additional subfolder:
+Output and error logs are saved in this directory. The root level contains logs for actual commands. There are also two additional subfolders:

``job/``:
Holds logs for the PBS files.
8 changes: 0 additions & 8 deletions smartdispatch.sublime-project

This file was deleted.
