Workflow Organization
How to organize your workflow
Job Dependency
Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the `--dependency` option to `sbatch` in the format:
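```bash
# general form; several dependencies can be combined, separated by commas
sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> <jobscript>
```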
Dependency types are:
| Condition | Explanation |
|---|---|
| `after:jobid[:jobid…]` | job can begin after the specified jobs have started |
| `afterany:jobid[:jobid…]` | job can begin after the specified jobs have terminated |
| `afternotok:jobid[:jobid…]` | job can begin after the specified jobs have failed |
| `afterok:jobid[:jobid…]` | job can begin after the specified jobs have run to completion with an exit code of zero (see the user guide for caveats) |
| `singleton` | jobs can begin execution after all previously launched jobs with the same name and user have ended. This is useful to collate results of an array job or similar. |
To set up pipelines using job dependencies, the most useful types are `afterany`, `afterok`, and `singleton`. The simplest way is to use the `afterok` dependency for single consecutive jobs. For example:
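```bash
# job1.sh / job2.sh are placeholder script names; 12345 is just an illustrative job ID
$ sbatch job1.sh
Submitted batch job 12345
$ sbatch --dependency=afterok:12345 job2.sh
```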
Now when `job1` ends with an exit code of zero, `job2` will become eligible for scheduling. However, if `job1` fails (ends with a non-zero exit code), `job2` will not be scheduled but will remain in the queue and needs to be canceled manually.
As an alternative, the `afterany` dependency can be used, and checking for successful execution of the prerequisites can be done in the job script itself.
Example in Bash
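A sketch of such a pipeline; `job1.sh` to `job3.sh` are placeholder script names:

```bash
#!/bin/bash

# submit the first job; sbatch prints "Submitted batch job <JOBID>"
id1=$(sbatch job1.sh)

# each subsequent job waits until its predecessor has terminated (regardless of
# the exit code); checking whether the predecessor actually succeeded is left
# to the job script itself
id2=$(sbatch --dependency=afterany:${id1##* } job2.sh)
id3=$(sbatch --dependency=afterany:${id2##* } job3.sh)
```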
The `${id1##* }` constructs above are there because `sbatch` returns a statement like `Submitted batch job <JOBID>`; the parameter expansion strips everything up to the last space, leaving only the numeric job ID.
Unsatisfiable Dependencies
Sometimes a dependency condition cannot be satisfied, for example when a job requires its predecessor to have finished successfully (`afterok:<jobid>`) and that predecessor fails.
In such a case, `--kill-on-invalid-dep=<yes|no>` can be specified to `sbatch`.
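For instance, to have the dependent job removed automatically instead of remaining pending forever (`job2.sh` being a placeholder):

```bash
sbatch --kill-on-invalid-dep=yes --dependency=afterok:<jobid> job2.sh
```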
Job Arrays
Purpose
According to the Slurm Job Array Documentation, “job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.” In general, job arrays are useful for applying the same processing routine to a collection of multiple input data files. Job arrays offer a very simple way to submit a large number of independent processing jobs.
We strongly recommend using parallel processing in addition and treating job arrays as a convenience feature, without neglecting performance optimization.
Maximum Job Array Size
You can look up the maximum job array size using the following command:
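```bash
# print the MaxArraySize limit from the Slurm configuration
scontrol show config | grep MaxArraySize
```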
The reason we do not document this value is that it is subject to change.
Why is there an array size limit at all?
If unlimited, or if the limit were huge, some users would see this as an invitation to submit a large number of otherwise unoptimized jobs. The idea behind job arrays is to ease workflows, particularly the submission of a larger number of jobs. However, the motivations to pool jobs and to optimize them still apply.
Job Arrays in Slurm
By submitting a single job array `sbatch` script, a specified number of “array-tasks” will be created based on this “master” `sbatch` script. An example job array script is given below.
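A minimal sketch; the input file pattern and the program name (`my_program`) are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=array_example
# %A in the file names is the master job ID, %a the array-task ID
#SBATCH --output=array_example_%A_%a.out
#SBATCH --error=array_example_%A_%a.err
# spawn 16 array-tasks, numbered 1..16
#SBATCH --array=1-16
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

# each array-task selects "its" input file via SLURM_ARRAY_TASK_ID
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```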
In the above example, the `--array=1-16` option will cause 16 array-tasks (numbered 1, 2, …, 16) to be spawned when this master job script is submitted. The “array-tasks” are simply copies of this master script that are automatically submitted to the scheduler on your behalf. However, in each array-task an environment variable called `SLURM_ARRAY_TASK_ID` will be set to a unique value (in this example, a number in the range 1, 2, …, 16). In your script, you can use this value to select, for example, a specific data file that each array-task will be responsible for processing.
Job array indices can be specified in several ways. For example:
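```bash
# a simple range of indices
#SBATCH --array=0-31
# an explicit list of indices
#SBATCH --array=1,3,5,7
# a range with a step size of 2 (i.e. 1, 3, 5, 7)
#SBATCH --array=1-7:2
```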
The `%A_%a` construct in the output and error file names is used to generate unique output and error files based on the master job ID (`%A`) and the array-task ID (`%a`). In this fashion, each array-task will be able to write to its own output and error file.
Limiting the number of concurrent jobs of an array
It is possible to limit the number of concurrently executed jobs of an array, e.g. to minimize I/O overhead, with this syntax:
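```bash
# an illustrative range of 200 array-tasks, of which at most 50 run at the same time
#SBATCH --array=1-200%50
```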
where a limit of 50 concurrent jobs would be in place.
Multiprog for “Uneven” Arrays
The `--multi-prog` option in `srun` allows you to assign a different program and arguments to each parallel task in your job. More information can be found in our documentation on [node-local scheduling](/docs/advanced/workflow-organization/#Script_Para).
Script Based Parallelization
There are some use cases where you would simply want to request a full cluster node from Slurm and then run many small tasks (e.g. many more than 64, each taking only a fraction of the total job runtime) on this full node. Then, of course, you will need some local scheduling on this node to ensure proper utilization of all cores.
To accomplish this, we suggest you use the GNU Parallel program. The program is installed to `/cluster/bin`, but you can also simply load the modulefile `software/gnu_parallel`, which additionally gives you access to its man page.
For more documentation on how to use GNU Parallel, please read `man parallel` and `man parallel_tutorial`, where you’ll find a great number of examples and explanations.
MOGON Usage Example
Let’s say we have some input data files that contain differing parameters that are going to be processed independently by our program:
Now of course we could submit 150 jobs using Slurm, or we could use one job which processes the files one after another, but the most elegant way would be to submit one job for 64 cores (e.g. a whole node on Mogon) and process the files in parallel. This is especially convenient since we can then use the `nodeshort` queue, which has better scheduling characteristics than `short` (while both show better scheduling compared to their `long` counterparts):
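```bash
#!/bin/bash
#SBATCH --job-name=parallel_wc
# a sketch only: a whole 64-core node on the 'nodeshort' queue; the input file
# layout (input/*.txt) is purely illustrative
#SBATCH --partition=nodeshort
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --time=01:00:00
#SBATCH --output=parallel_wc_%j.out

# make GNU Parallel (and its man page) available
module load software/gnu_parallel

# run 'wc' on every input file, keeping all 64 cores busy
parallel -j ${SLURM_NTASKS} "wc {} > {}.out" ::: input/*.txt
```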
After this job has run, we should have the results/output data (in this case, it’s just the output of `wc`, for demonstration).
Multithreaded Programs
Let’s further assume that our program can work in parallel itself using OpenMP. We determined that `OMP_NUM_THREADS=8` yields the best performance for one set of input data. This means we can launch 64/8 = 8 processes using GNU Parallel on the one node we have:
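```bash
#!/bin/bash
#SBATCH --job-name=parallel_omp
#SBATCH --partition=nodeshort
#SBATCH --nodes=1
# a sketch only: 8 GNU Parallel workers with 8 OpenMP threads each = 64 cores;
# the program and input file names are purely illustrative
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

module load software/gnu_parallel

export OMP_NUM_THREADS=8
# 8 instances of our (hypothetical) OpenMP-capable program run side by side
parallel -j 8 "./my_openmp_program {} > {}.out" ::: input/*.txt
```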
Running on several hosts
We do not recommend supplying a hostlist to GNU Parallel with the `-S` option, as GNU Parallel attempts to `ssh` into the respective nodes (including the master host) and therefore loses the environment. You can script around this, but you will run into quoting hell.
The number of tasks (given by `-n`) times the number of CPUs per task (given by `-c`) needs to be equal to the number of nodes (given by `-N`) times the number of CPUs per node (to be inferred from `scontrol show node`).
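As an illustration, assuming nodes with 64 CPUs each:

```bash
# 2 nodes * 64 CPUs per node = 128 CPUs in total,
# so the product of --ntasks and --cpus-per-task must also be 128
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
```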
SLURM multiprog for uneven arrays
The SLURM multiprog option in `srun` essentially resembles a master-slave setup. It needs to run within a SLURM job allocation; you trigger `srun` with the `--multi-prog` option and an appropriate multiprog file.
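A minimal sketch of such a job script, assuming a 3-task allocation:

```bash
#!/bin/bash
# 3 tasks, matching the task numbers used in multi.conf below
#SBATCH --ntasks=3
#SBATCH --time=00:10:00

# start the programs defined in multi.conf, one per allocated task
srun --multi-prog multi.conf
```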
Then, of course, the `multi.conf` file has to exist.
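A hypothetical configuration, with purely illustrative program names, might look like this:

```
# <task number>  <executable>   <arguments>
0                ./master_prog
1-2              ./worker_prog  --task-id %t --offset %o
```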
Indeed, as the naming suggests, you can use such a setup to emulate a master-slave environment. But then the processes have to take care of their communication (sockets, regular files, etc.) themselves. The most cumbersome aspect is that you have to maintain two files at all times: whenever the setup changes, both have to be adjusted and all parameters have to match.
The configuration file contains three fields, separated by blanks. These fields are:
- Task number
- Executable File
- Argument
Available parameters:
- `%t` - The task number of the responsible task
- `%o` - The task offset (task’s relative position in the task range).
Checkpoint Restarting
Introducing wall times is one measure to ensure a balanced distribution of resources in every HPC cluster. Yet, some applications need to have extremely long run times. The solution is Application Checkpointing, where a snapshot of the running application is saved at pre-defined intervals. This provides the ability to restart an application from the point where the checkpoint has been saved.
Third party tools
Checkpointing multithreaded applications with dmtcp
dmtcp is a versatile checkpointing application providing good documentation (incl. a video).
We provide at least one module for dmtcp, check:
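```bash
# list the available dmtcp module(s)
module avail dmtcp
```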