Job Monitoring
Manage and Monitor Jobs using SLURM
Information on Jobs
| List jobs | Command |
|---|---|
| own active | squeue -u $USER |
| in <partition> | squeue -u $USER -p <partition> |
| show priority | sprio -l |
| list running | squeue -u $USER -t RUNNING |
| list pending | squeue -u $USER -t PENDING |
| show details | scontrol show jobid -dd <jobid> |
| status info on a running job | sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps |
| statistics on completed jobs (per job) | sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed |
| statistics on completed jobs (per username) | sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed |
| summary statistics on a completed job | seff <jobid> |
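As a quick illustration (the job ID 123456 is just a placeholder), checking on a job while it runs and after it has finished might look like this:

```bash
# Your own running and pending jobs
squeue -u $USER

# Resource usage of a job that is still running (all steps)
sstat --format=AveCPU,AveRSS,AveVMSize,JobID -j 123456 --allsteps

# Accounting data and an efficiency summary once the job has finished
sacct -j 123456 --format=JobID,JobName,MaxRSS,Elapsed
seff 123456
```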
Note that sacct only displays recent jobs unless the -S flag (the start date to search from) is specified. For example, -S 0901 looks up jobs from September 1st onward. See the man page for more information on time-related lookup options.
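For example, a minimal lookup of your own jobs since September 1st could be (MMDD is one of several date formats accepted by sacct):

```bash
# Your jobs started on or after September 1st
sacct -u $USER -S 0901 --format=JobID,JobName,State,Elapsed,MaxRSS
```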
Controlling Jobs
| Job operation | Command |
|---|---|
| cancel one | scancel <jobid> |
| cancel all | scancel -u <username> |
| cancel all your pending | scancel -u $USER -t PENDING |
| cancel one or more by name | scancel --name <myJobName> |
| pause one | scontrol hold <jobid> |
| resume one | scontrol resume <jobid> |
| requeue one | scontrol requeue <jobid> |
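A short sketch of typical control operations, again with a placeholder job ID (note that a held job is released with scontrol release, whereas scontrol resume applies to a suspended job):

```bash
# Cancel one job, or all of your pending jobs
scancel 123456
scancel -u $USER -t PENDING

# Put a pending job on hold so it is not scheduled, then release it again
scontrol hold 123456
scontrol release 123456

# Cancel and resubmit (requeue) a job
scontrol requeue 123456
```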
Modifying Pending Jobs
Sometimes squeue --start indicates a wrong requirement specification, e.g. BadConstraints. In this case a user can figure out the mismatch with scontrol show job <jobid> (which may require some experience). Wrong requirements can be fixed as follows:
| To correct a job’s | Command |
|---|---|
| memory requirement (per node) | scontrol update job <jobid> MinMemoryNode=<mem in MB> |
| memory requirement (per CPU) | scontrol update job <jobid> MinMemoryCPU=<mem in MB> |
| number of requested CPUs | scontrol update job <jobid> NumCPUs=<number> |
For more information see man scontrol.
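As a sketch, applying the commands above to a pending job with the placeholder ID 123456 (the values 4000 MB and 8 CPUs are arbitrary examples):

```bash
# Inspect the job's current requirements and the reason it is pending
scontrol show job 123456

# Lower the requested memory per node to 4000 MB
scontrol update job 123456 MinMemoryNode=4000

# Reduce the number of requested CPUs to 8
scontrol update job 123456 NumCPUs=8
```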
Job State Codes
| Status | Code | Description |
|---|---|---|
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing, but some processes are still active. |
| FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job is currently allocated to a node and running. |
| SUSPENDED | S | A running job has been stopped, with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped, with its cores retained. |
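The two-letter code is what squeue prints in its ST column; as an illustration, the format string below (just an example) selects the compact state code explicitly via %t:

```bash
# Job ID, partition, name, compact state code and elapsed time
squeue -u $USER -o "%.10i %.12P %.25j %.3t %.10M"
```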
Pending Reasons
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the squeue output). Here we show some of the more frequent reasons:
| Reason | Brief Explanation |
|---|---|
| Priority | At first, every job gets this reason. If it is not scheduled for a while (more than several minutes), the job simply lacks the priority to start. |
| AssocGrpCPURunMinutesLimit | Indicates that the quality-of-service (CPU time) budget associated with the partition is exhausted for the user or project account in question. This budget recovers over time. |
| QOSMaxCpuPerNode | May indicate a violation of the number of CPUs per node allowed in the chosen partition. |
| QOSMaxJobsPerUserLimit | For certain partitions the number of running jobs per user is limited. |
| QOSMaxJobsPerAccountLimit | For certain partitions the number of running jobs per account is limited. |
| QOSGrpGRESRunMinutes | For certain partitions the generic resources (e.g. GPUs) are limited. See GPU Queues. |
| QOSGrpMemLimit | The requested partition is limited in the fraction of resources it can take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| QOSMinMemory | The job does not request enough memory for the requested partition. |
| QOSGrpCpuLimit | The requested partition is limited in the fraction of resources it can take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| Resources | The job is eligible to run, but resources are not available at this time. This usually just means that your job will start once nodes are done with their current jobs. |
| ReqNodeNotAvail | Simply means that no node with the required resources is available. SLURM will list all unavailable nodes, which can be confusing. This reason is similar to Resources in that the job has to wait for a resource to be released. |
In addition, there are limits on the number of jobs a group (a.k.a. account) may run at a given time. More information on partitions can be found on their respective documentation pages.
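A minimal sketch for checking why your own jobs are pending: %r prints the reason shown in the table above, and --start adds the scheduler's estimated start time, if one is known.

```bash
# Pending jobs together with their reason codes
squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.3t %r"

# Estimated start times for pending jobs
squeue -u $USER --start
```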