Job Monitoring

Manage and Monitor Jobs using Slurm

Information on Jobs

| List jobs | Command |
|---|---|
| own active | squeue -u $USER |
| in <partition> | squeue -u $USER -p <partition> |
| show priority | sprio -l |
| list running | squeue -u $USER -t RUNNING |
| list pending | squeue -u $USER -t PENDING |
| show details | scontrol show jobid -dd <jobid> |
| status info (running job) | sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps |
| statistics on completed (per job) | sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed |
| statistics on completed (per username) | sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed |
| summary statistics on completed job | seff <jobid> |

Completed jobs can only be seen with sacct. Note that only recent jobs are displayed unless you specify the -S flag (the start date to search from). For example, -S 0901 looks up jobs from September 1st onward. See the man page for more information on time-related lookup options.
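
To illustrate the -S lookup, here is a minimal sketch that prints the peak memory of your job steps since a given start date. The helper names to_mib and peak_mem_since are hypothetical. The unit suffixes handled (K, M, G, T) are an assumption about how sacct prints MaxRSS on your system, and a value without a suffix is treated as bytes.

```shell
# Convert a memory value such as "2048K", "512M" or "3G" to MiB.
# Assumption: MaxRSS carries a K/M/G/T suffix; a bare number is treated as bytes.
to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0
        if (unit == "K") num = num / 1024
        else if (unit == "M") num = num
        else if (unit == "G") num = num * 1024
        else if (unit == "T") num = num * 1024 * 1024
        else num = num / (1024 * 1024)
        printf "%.1f\n", num
    }'
}

# Print JobID, JobName and peak memory (MiB) for your jobs since a start date.
# -n drops the header, -P prints fields separated by "|".
peak_mem_since() {
    sacct -u "$USER" -S "$1" -n -P --format=JobID,JobName,MaxRSS |
    while IFS='|' read -r id name rss; do
        # Lines without a MaxRSS value (typically the job allocation itself) are skipped.
        if [ -n "$rss" ]; then
            printf '%s %s %s MiB\n' "$id" "$name" "$(to_mib "$rss")"
        fi
    done
}

# Usage: peak_mem_since 0901
```

For a single job, seff <jobid> gives a similar summary without any scripting.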

Controlling Jobs

| Job operation | Command |
|---|---|
| cancel one | scancel <jobid> |
| cancel all of a user | scancel -u <username> |
| cancel all your pending | scancel -u $USER -t PENDING |
| cancel one or more by name | scancel --name <myJobName> |
| hold one (keep a pending job from starting) | scontrol hold <jobid> |
| release one (held job) | scontrol release <jobid> |
| requeue one | scontrol requeue <jobid> |
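
Because scancel acts immediately, it can help to preview a bulk cancellation first. The following is only a sketch: the function name cancel_pending_by_name and the DRY_RUN variable are hypothetical conveniences, not part of Slurm.

```shell
# Cancel your pending jobs with a given name.
# Set DRY_RUN=1 to print the scancel command instead of running it.
cancel_pending_by_name() {
    name="$1"
    if [ -z "$name" ]; then
        echo "usage: cancel_pending_by_name <jobname>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scancel -u $USER -t PENDING --name $name"
    else
        scancel -u "$USER" -t PENDING --name "$name"
    fi
}

# Usage:
#   DRY_RUN=1 cancel_pending_by_name myJobName   # preview only
#   cancel_pending_by_name myJobName             # actually cancel
```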

Modifying Pending Jobs

Sometimes squeue --start indicates a wrong requirement specification, e.g. BadConstraints. In this case you can track down the mismatch with scontrol show job <jobid> (which might require some experience). Wrong requirements can be fixed as follows:

| To correct a job's | Command |
|---|---|
| memory requirement (per node) | scontrol update job <jobid> MinMemoryNode=<mem in MB> |
| memory requirement (per CPU) | scontrol update job <jobid> MinMemoryCPU=<mem in MB> |
| number of requested CPUs | scontrol update job <jobid> NumCPUs=<number> |

For more information see man scontrol.
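
Reading the full scontrol show job output can be tedious, so the sketch below pulls out a few fields that often explain a mismatch. The helper names job_field and show_request are hypothetical. The field names are assumed to match the Key=Value output of scontrol show job on your system, and some of them may be absent depending on how the job was submitted.

```shell
# Print the value of one Key=Value field from `scontrol show job` output on stdin.
# Limitation: a value that contains spaces is cut at the first space.
job_field() {
    tr -s ' ' '\n' |
    awk -F= -v key="$1" '$1 == key && !seen { print substr($0, length(key) + 2); seen = 1 }'
}

# Show the requested resources and the pending reason of one job.
show_request() {
    info=$(scontrol show job "$1") || return 1
    for key in JobState Reason Partition NumNodes NumCPUs MinMemoryNode MinMemoryCPU Features; do
        value=$(printf '%s\n' "$info" | job_field "$key")
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$key" "$value"
        fi
    done
}

# Usage: show_request <jobid>
```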

Job State Codes

| Status | Code | Description |
|---|---|---|
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing but some processes are still active. |
| FAILED | F | The job terminated with a non-zero exit code or another failure condition. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job currently is allocated to a node and is running. |
| SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped with its cores retained. |
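
The state codes listed above can be turned into a quick per-state overview of your own jobs. This is a sketch with hypothetical helper names (state_name, jobs_per_state). It covers only the codes listed here, so any other code is reported as unknown.

```shell
# Translate a compact state code into the long state name.
state_name() {
    case "$1" in
        CD) echo COMPLETED ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        ST) echo STOPPED ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# Count your jobs per state. In squeue, %t is the compact state code
# and -h suppresses the header line.
jobs_per_state() {
    squeue -u "$USER" -h -o '%t' | sort | uniq -c |
    while read -r count code; do
        printf '%s %s\n' "$count" "$(state_name "$code")"
    done
}

# Usage: jobs_per_state
```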

Pending Reasons

So, why do my jobs not start? Slurm may list a number of reasons for pending jobs (those labelled PD in the squeue output). Here are some of the more frequent reasons:

| Reason | Brief Explanation |
|---|---|
| Priority | At first, every job gets this reason. If the job is still not scheduled after several minutes, it simply lacks the priority to start. |
| AssocGrpCPURunMinutesLimit | Indicates that the quality of service (CPU time) associated with the partition is exhausted for the user account or project account in question. This budget will recover. |
| QOSMaxCpuPerNode | This may indicate that the job requests more CPUs per node than the chosen partition allows. |
| QOSMaxJobsPerUserLimit | For certain partitions the number of running jobs per user is limited. |
| QOSMaxJobsPerAccountLimit | For certain partitions the number of running jobs per account is limited. |
| QOSGrpGRESRunMinutes | For certain partitions the generic resources (e.g. GPUs) are limited. See GPU Queues. |
| QOSGrpMemLimit | The requested partition is limited in the fraction of resources it can take from the cluster and this amount has been reached: jobs need to end before new ones can start. |
| QOSMinMemory | The job is not requesting enough memory for the requested partition. |
| QOSGrpCpuLimit | The requested partition is limited in the fraction of resources it can take from the cluster and this amount has been reached: jobs need to end before new ones can start. |
| Resources | The job is eligible to run, but resources are not available at this time. This usually just means that your job will start next, once nodes are done with their current jobs. |
| ReqNodeNotAvail | No node with the required resources is currently available. Slurm lists all unavailable nodes, which can be confusing. This reason is similar to Resources, as it means that the job has to wait for a resource to be released. |

In addition, there are limits on the number of jobs a group (a.k.a. account) may run at a given time. More information on partitions can be found on their respective documentation pages.
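
To see at a glance which reasons are holding back your pending jobs, you can tally them. The sketch below uses hypothetical helper names (count_reasons, pending_reasons). In squeue, %i is the job ID and %r is the reason. For reasons that carry extra detail after a comma, such as a node list, only the leading keyword is kept.

```shell
# Count pending jobs per reason. Expects lines of "<jobid> <reason>" on stdin.
count_reasons() {
    awk '{
        reason = $2
        sub(/,$/, "", reason)   # drop the comma before any trailing detail
        n[reason]++
    }
    END { for (r in n) printf "%d %s\n", n[r], r }' | sort -k1,1nr -k2,2
}

# Tally the reasons of your own pending jobs.
pending_reasons() {
    squeue -u "$USER" -t PENDING -h -o '%i %r' | count_reasons
}

# Usage: pending_reasons
```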

Job Priority

The Slurm scheduler determines the order in which jobs in the pending queue are executed by assigning a priority score. This score is calculated as a weighted sum of various factors, including age, fairshare, job size, and others, and is represented as an integer value. Generally, the job with the highest priority will be executed first, unless a smaller job can be scheduled earlier without delaying the higher-priority job, thanks to the backfilling mechanism.
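
As a worked illustration of the weighted sum, the following sketch adds up weight-times-factor pairs. The function name priority_score is hypothetical and the weights in the usage line are invented. Real weights are site-specific (sprio -w displays the configured ones), and each factor is assumed to lie between 0 and 1. Truncating the result to an integer is a simplification.

```shell
# Weighted sum of priority factors.
# Arguments are pairs: <weight> <factor>, e.g. for age, fairshare and job size.
priority_score() {
    awk 'BEGIN {
        total = 0
        for (i = 1; i + 1 < ARGC; i += 2) total += ARGV[i] * ARGV[i + 1]
        printf "%d\n", total
    }' "$@"
}

# Usage with invented numbers:
#   age weight 1000 at factor 0.5, fairshare weight 10000 at 0.25,
#   job size weight 2000 at 0.125
#   priority_score 1000 0.5 10000 0.25 2000 0.125   # prints 3250
```

In this invented example the fairshare term contributes 2500 of the 3250 points.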

To get insights into your job’s priority, you can use the following commands:

  • View your job’s priority: The sprio command shows the priority value for a specific job:

    sprio -j <job_id>
  • Check your fairshare: Use the sshare command to view the fairshare value for your account:

    sshare -A <account_name>
  • Estimate job start time: To check when your job is likely to start, use the squeue command with the --start flag:

    squeue -j <jobid> --start

    Please note that the estimated start time is based on the anticipated runtimes of other jobs, so it may not always be precise.