Reviewing Jobs
How to evaluate your jobs
- Which time limits should I choose?
- How much memory do my jobs really need?
These are the basic questions for any new tool to be used in batch jobs. We usually advise launching a few test jobs with representative parameterization (hence, no toy data). Based on these, a setup for production jobs can be chosen that includes a safety margin for the wall time and memory limits without throttling your own throughput¹.
SLURM provides an on-board script, `seff`, which can be used to evaluate finished jobs. To invoke it, run:
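```
$ seff <jobid>
```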
It will give output like the following (the values shown are illustrative, chosen to be consistent with the example discussed below):
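```
Job ID: 1234567
Cluster: <cluster>
User/Group: <user>/<group>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 01:15:41
CPU Efficiency: 86.00% of 01:28:00 core-walltime
Job Wall-clock time: 00:05:30
Memory Utilized: 2.50 GB
Memory Efficiency: 2.22% of 112.50 GB
```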
The fields have the following meaning:
Key | Interpretation |
---|---|
Job ID | the job ID passed to `seff <jobid>` |
Cluster | the cluster name |
User/Group | the user name for the job and a unix group (due to our mapping, that can be any of the groups the user belongs to) |
State | can be any of `COMPLETED`, `FAILED`, or `CANCELLED` |
Nodes | number of nodes reserved for the job |
Cores per node | number of cores per node for the job |
CPU Utilized | the overall CPU time used (time used per CPU × number of CPUs) |
CPU Efficiency | an apparent computation efficiency: the CPU time utilized over the core-walltime, where the core-walltime is the turn-around time of the job (including setup and cleanup) multiplied by the number of cores |
Job Wall-clock time | elapsed time of the job |
Memory Utilized | peak memory usage |
Memory Efficiency | see below for an explanation |
Obviously, the CPU efficiency should not be too low. In the example, 14 % of the CPU resources are apparently unused. Is this good or bad? And the reported "Memory Efficiency" is way below anything that could be considered "efficient", right?
- The CPU Efficiency takes into account the node preparation before job start and the subsequent cleanup time. Hence, the value will always be below 100 %. In the example, with a turn-around time of 5.5 minutes, two times 30 seconds for preparation and cleanup already account for 18 % of the time (see the worked numbers after this list). Hence, this particular example can be considered very efficient. For longer turn-around times, this preparation/cleanup overhead becomes negligible.
- Reporting a "Memory Efficiency" the way SLURM does uses absolutely the wrong term: the default memory reservation for the partition used here is 112.50 GB (actually GiB). Using less is not a sign of meager efficiency, but rather a sign of using the CPUs well without claiming all of the reserved memory.
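For reference, the 18 % overhead figure quoted above follows directly from the example numbers:

$$
\frac{2 \times 30\,\mathrm{s}}{5.5 \times 60\,\mathrm{s}} = \frac{60\,\mathrm{s}}{330\,\mathrm{s}} \approx 18\,\%
$$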
Still, the reported Memory Efficiency can be an important measure of the memory actually used: if you want to know your peak memory usage, it gives you a hint. However, please note that SLURM samples the memory usage in intervals, so short usage peaks may be missed.
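Once you have measured values from a test job, you can translate them into limits for production jobs. Below is a minimal sketch of an sbatch header, assuming the illustrative numbers from the example above (peak memory of 2.50 GB, run time of 5.5 minutes); the application name is a placeholder:

```bash
#!/bin/bash
# Measured run time ~5.5 min, plus a 10-15 % safety margin, rounded up:
#SBATCH --time=00:07:00
# Measured peak memory 2.50 GB, plus a few percent safety margin, rounded up:
#SBATCH --mem=3G

srun ./my_application   # placeholder for your actual workload
```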
Footnotes
1. As a rule of thumb, a safety margin of a few percent on top of the measured maximum memory and 10-15 % on top of the measured maximum time is sufficient. However, for a detailed analysis, please evaluate carefully or approach the HPC team.