Job Monitoring

Manage and Monitor Jobs using SLURM

Information on Jobs

List jobs                               Command
own active                              squeue -u $USER
in <partition>                          squeue -u $USER -p <partition>
show priority                           sprio -l
list running                            squeue -u $USER -t RUNNING
list pending                            squeue -u $USER -t PENDING
show details                            scontrol show jobid -dd <jobid>
status info                             sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
statistics on completed (per job)       sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
statistics on completed (per username)  sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
summary statistics on completed job     seff <jobid>
Completed jobs can only be seen with sacct. Note that only recent jobs are displayed unless you specify the -S flag (the start date to search from). For example, -S 0901 looks up jobs from September 1st onward. See the man page for more information on time-related lookup options.
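
The per-step memory figures that sacct reports can be reduced to a single peak value. The following is a minimal sketch of one way to do that with awk. It assumes the pipe-separated layout produced by sacct with --parsable2 --noheader --format=JobID,JobName,MaxRSS,Elapsed, and MaxRSS values that end in a K, M or G suffix. The function name peak_rss_kb and the sample lines are hypothetical.

```shell
# Print the job step with the highest MaxRSS and that value in K units.
# Reads pipe-separated lines (JobID|JobName|MaxRSS|Elapsed) from stdin.
peak_rss_kb() {
  awk -F'|' '
    $3 != "" {
      value = $3 + 0                    # numeric prefix, e.g. 1.5 from "1.5G"
      unit = substr($3, length($3))     # last character is taken as the unit
      if (unit == "M") value *= 1024
      else if (unit == "G") value *= 1024 * 1024
      if (value > peak) { peak = value; step = $1 }
    }
    END { if (step != "") printf "%s %d\n", step, peak }
  '
}

# Hypothetical sample in place of real sacct output.
peak_rss_kb <<'EOF'
4711|train||00:10:00
4711.batch|batch|204800K|00:10:00
4711.0|python|1.5G|00:09:50
EOF
```

On a cluster, the same function could be fed directly, for example: sacct -j <jobid> -S 0901 --parsable2 --noheader --format=JobID,JobName,MaxRSS,Elapsed | peak_rss_kb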

Controlling Jobs

Job operation                Command
cancel one                   scancel <jobid>
cancel all                   scancel -u <username>
cancel all your pending      scancel -u $USER -t PENDING
cancel one or more by name   scancel --name <myJobName>
hold one (keep it pending)   scontrol hold <jobid>
release a held one           scontrol release <jobid>
resume a suspended one       scontrol resume <jobid>
requeue one                  scontrol requeue <jobid>
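
Before cancelling several jobs at once, it can help to preview which jobs would be affected. The sketch below prints the scancel commands instead of running them. It assumes input lines of the form "jobid state name", as produced by squeue with -h -o "%i %T %j". The function name pending_cancel_cmds, the job name prefix and the sample lines are hypothetical.

```shell
# Print (not run) one scancel command per pending job whose name starts
# with the given prefix. Reads "jobid state name" lines from stdin.
pending_cancel_cmds() {
  prefix=$1
  while read -r jobid state name; do
    case "$state:$name" in
      PENDING:"$prefix"*) echo "scancel $jobid" ;;
    esac
  done
}

# Hypothetical sample in place of real squeue output.
pending_cancel_cmds sweep <<'EOF'
1001 RUNNING sweep_01
1002 PENDING sweep_02
1003 PENDING other
1004 PENDING sweep_03
EOF
```

On a cluster, the input could come from: squeue -u $USER -h -o "%i %T %j" | pending_cancel_cmds sweep. The printed commands can be reviewed first and then piped to sh if they look right.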

Modifying Pending Jobs

Sometimes squeue --start indicates a wrong requirement specification, e.g. BadConstraints. In this case, you can find the mismatch with scontrol show job <jobid> (which might require some experience). Wrong requirements can be corrected as follows:

To correct a job’s               Command
memory requirement (per node)    scontrol update job <jobid> MinMemoryNode=<mem in MB>
memory requirement (per CPU)     scontrol update job <jobid> MinMemoryCPU=<mem in MB>
number of requested CPUs         scontrol update job <jobid> NumCPUs=<number>

For more information see man scontrol.
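
Because the memory values are given in MB, a small helper can avoid conversion mistakes. The sketch below builds the update command from a value in GB and prints it instead of running it. It uses the JobId=<jobid> spelling from the scontrol man page. The function name mem_update_cmd and the sample job ID are hypothetical.

```shell
# Print (not run) an scontrol command that sets the per-node memory of a
# pending job. Takes the job ID and the memory in GB, converts to MB.
mem_update_cmd() {
  jobid=$1
  mem_gb=$2
  case "$mem_gb" in
    ''|*[!0-9]*) echo "memory must be a whole number of GB" >&2; return 1 ;;
  esac
  echo "scontrol update JobId=$jobid MinMemoryNode=$((mem_gb * 1024))"
}

# Hypothetical job ID; 8 GB becomes 8192 MB.
mem_update_cmd 4711 8
```

Printing the command first keeps the change reviewable. Once it looks right, it can be run by hand.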

Job State Codes

Status      Code  Description
COMPLETED   CD    The job has completed successfully.
COMPLETING  CG    The job is finishing but some processes are still active.
FAILED      F     The job terminated with a non-zero exit code and failed to execute.
PENDING     PD    The job is waiting for resource allocation. It will eventually run.
PREEMPTED   PR    The job was terminated because of preemption by another job.
RUNNING     R     The job is currently allocated to a node and is running.
SUSPENDED   S     A running job has been stopped with its cores released to other jobs.
STOPPED     ST    A running job has been stopped with its cores retained.
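
The compact codes can be turned back into their long names when summarizing a queue. The sketch below counts jobs per state. It assumes one compact code per line on stdin, as produced by squeue with -h -o %t, and covers only the codes listed in the table above. The function names state_name and count_states and the sample input are hypothetical.

```shell
# Expand a compact state code to its long name (codes from the table above).
state_name() {
  case "$1" in
    CD) echo COMPLETED ;;
    CG) echo COMPLETING ;;
    F)  echo FAILED ;;
    PD) echo PENDING ;;
    PR) echo PREEMPTED ;;
    R)  echo RUNNING ;;
    S)  echo SUSPENDED ;;
    ST) echo STOPPED ;;
    *)  echo "UNKNOWN($1)" ;;
  esac
}

# Count jobs per state. Reads one compact code per line from stdin.
count_states() {
  sort | uniq -c | while read -r count code; do
    echo "$(state_name "$code"): $count"
  done
}

# Hypothetical sample in place of real squeue output.
printf 'R\nPD\nR\nCG\nPD\nR\n' | count_states
```

On a cluster, the input could come from: squeue -u $USER -h -o %t | count_states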

Pending Reasons

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the squeue output). Here are some of the more frequent reasons:

Reason                      Brief Explanation
Priority                    At first, every job gets this reason. If the job is not scheduled for a while (more than several minutes), it simply lacks the priority to start.
AssocGrpCPURunMinutesLimit  Indicates that the partition’s associated quality of service (CPU time) is exhausted for the user account or project account in question. This number will recover.
QOSMaxCpuPerNode            This may indicate a violation of the number of CPUs per node allowed in the chosen partition.
QOSMaxJobsPerUserLimit      For certain partitions, the number of running jobs per user is limited.
QOSMaxJobsPerAccountLimit   For certain partitions, the number of running jobs per account is limited.
QOSGrpGRESRunMinutes        For certain partitions, the generic resources (e.g. GPUs) are limited. See GPU Queues.
QOSGrpMemLimit              The requested partition is limited in the fraction of resources it can take from the cluster, and this amount has been reached: running jobs need to end before new ones can start.
QOSMinMemory                The job isn’t requesting enough memory for the requested partition.
QOSGrpCpuLimit              The requested partition is limited in the fraction of resources it can take from the cluster, and this amount has been reached: running jobs need to end before new ones can start.
Resources                   The job is eligible to run, but resources aren’t available at this time. This usually just means that your job will start next, once nodes are done with their current jobs.
ReqNodeNotAvail             Simply means that no node with the required resources is available. SLURM will list all non-available nodes, which can be confusing. This reason is similar to Resources, as it means that the job has to wait for a resource to be released.
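
When many jobs are pending, a count per reason gives a quicker overview than the full queue listing. The sketch below is one way to produce it. It assumes input lines of the form "jobid reason", as produced by squeue with -h -t PENDING -o "%i %r", and it uses only the first word of the reason. The function name count_reasons and the sample lines are hypothetical.

```shell
# Count pending jobs per reason, most frequent first.
# Reads "jobid reason" lines from stdin.
count_reasons() {
  awk '
    {
      sub(/,$/, "", $2)      # drop a trailing comma after the reason, if any
      count[$2]++
    }
    END { for (r in count) print count[r], r }
  ' | sort -k1,1nr -k2
}

# Hypothetical sample in place of real squeue output.
count_reasons <<'EOF'
2001 Priority
2002 Resources
2003 Priority
2004 QOSMaxJobsPerUserLimit
2005 Priority
EOF
```

On a cluster, the input could come from: squeue -u $USER -h -t PENDING -o "%i %r" | count_reasons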

In addition, there are limits on the number of jobs a group (a.k.a. account) may run at a given time. More information on partitions can be found on their respective documentation pages.