Job Monitoring
Manage and Monitor Jobs using SLURM
Information on Jobs
Job information | Command |
---|---|
list own active jobs | squeue -u $USER |
list jobs in <partition> | squeue -u $USER -p <partition> |
show priority | sprio -l |
list running | squeue -u $USER -t RUNNING |
list pending | squeue -u $USER -t PENDING |
show details | scontrol show jobid -dd <jobid> |
status info | sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps |
statistics on completed (per job) | sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed |
statistics on completed (per username) | sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed |
summary statistics on completed job | seff <jobid> |
Note that, without the -S flag (the start date to search from), sacct only displays recent jobs. For example, -S 0901 would look up jobs from September 1st onward. See the manpage for more information on time-related lookup options.
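For illustration, two sacct invocations with an explicit start date (dates and format columns below are placeholders; adjust them to your needs):

```bash
# Your completed jobs since September 1st (MMDD, current year)
sacct -u $USER -S 0901 --format=JobID,JobName,State,Elapsed,MaxRSS

# Restrict the search to an explicit date range (example dates)
sacct -u $USER -S 2024-09-01 -E 2024-09-30 --format=JobID,JobName,State,Elapsed
```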
Controlling Jobs
Job operation | Command |
---|---|
cancel one | scancel <jobid> |
cancel all | scancel -u <username> |
cancel all your pending | scancel -u $USER -t PENDING |
cancel one or more by name | scancel --name <myJobName> |
hold one (pause) | scontrol hold <jobid> |
release a held one | scontrol release <jobid> |
requeue one | scontrol requeue <jobid> |
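As a sketch (the job ID 123456 is a placeholder), a typical hold-inspect-release sequence could look like this:

```bash
# Prevent a pending job from starting, inspect it, then let it run again
scontrol hold 123456
scontrol show jobid -dd 123456
scontrol release 123456

# Cancel all of your still-pending jobs in one go
scancel -u $USER -t PENDING
```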
Modifying Pending Jobs
Sometimes squeue --start might indicate a wrong requirement specification, e.g. BadConstraints. In this case, a user can figure out the mismatch with scontrol show job <jobid> (which may require some experience). Wrong requirements can be corrected as shown below:
To correct a job’s | Command |
---|---|
memory requirement (per node) | scontrol update job <jobid> MinMemoryNode=<mem in MB> |
memory requirement (per CPU) | scontrol update job <jobid> MinMemoryCPU=<mem in MB> |
number of requested CPUs | scontrol update job <jobid> NumCPUs=<number> |
For more information see man scontrol.
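A minimal sketch of such a correction, assuming the job is still pending and using the explicit JobId= keyword (job ID and values are placeholders):

```bash
# Inspect the pending job to find the mismatching requirement
scontrol show job 123456

# Lower the per-node memory request to 4000 MB
scontrol update JobId=123456 MinMemoryNode=4000

# Reduce the number of requested CPUs to 8
scontrol update JobId=123456 NumCPUs=8
```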
Job State Codes
Status | Code | Description |
---|---|---|
COMPLETED | CD | The job has completed successfully. |
COMPLETING | CG | The job is finishing but some processes are still active. |
FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR | The job was terminated because of preemption by another job. |
RUNNING | R | The job currently is allocated to a node and is running. |
SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
STOPPED | ST | A running job has been stopped with its cores retained. |
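To see which state a finished job ended in (and its exit code), sacct can be used as follows (the job ID is a placeholder):

```bash
# Final state (e.g. COMPLETED, FAILED) and exit code of a finished job
sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed
```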
Pending Reasons
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the squeue output). Some of the more frequent reasons are listed below; an example of displaying the reason for your own jobs follows the table.
Reason | Brief Explanation |
---|---|
Priority | At first, every job gets this reason. If not scheduled for a while (> several minutes), the job simply lacks priority to start. |
AssocGrpCPURunMinutesLimit | Indicates that the partition-associated quality of service (CPU time) is exhausted for the user or project account in question. This budget recovers over time. |
QOSMaxCpuPerNode | The job requests more CPUs per node than the chosen partition allows. |
QOSMaxJobsPerUserLimit | For certain partitions the number of running jobs per user is limited. |
QOSMaxJobsPerAccountLimit | For certain partitions the number of running jobs per account is limited. |
QOSGrpGRESRunMinutes | For certain partitions the generic resources (e.g. GPUs) are limited. See GPU Queues |
QOSGrpMemLimit | The requested partition is limited in the fraction of the cluster's memory it may use, and this limit has been reached: jobs need to end before new ones may start. |
QOSMinMemory | The job isn’t requesting enough memory for the requested partition. |
QOSGrpCpuLimit | The requested partition is limited in the fraction of the cluster's CPUs it may use, and this limit has been reached: jobs need to end before new ones may start. |
Resources | The job is eligible to run but resources aren’t available at this time. This usually just means that your job will start next once nodes are done with their current jobs. |
ReqNodeNotAvail | Simply means that no node with the required resources is available. SLURM will list all non-available nodes, which can be confusing. This reason is similar to Resources as it means that a specific job has to wait for a resource to be released. |
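As mentioned above, the pending reason can be displayed directly with squeue; the format string below is only an example:

```bash
# Show your pending jobs together with their state code and reason (%r)
squeue -u $USER -t PENDING -o "%.12i %.20j %.2t %.25r"
```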
In addition, there are limits on the number of jobs a group (a.k.a. account) may run at a given time. More information on partitions can be found on their respective docs site.
Job Priority
The SLURM scheduler determines the order in which jobs in the pending queue are executed by assigning a priority score. This score is calculated as a weighted sum of various factors, including age, fairshare, job size, and others, and is represented as an integer value. Generally, the job with the highest priority will be executed first, unless a smaller job can be scheduled earlier without delaying the higher-priority job, thanks to the backfilling mechanism.
To get insights into your job’s priority, you can use the following commands:
- View your job’s priority: the sprio command shows the priority value for a specific job.
- Check your fairshare: use the sshare command to view the fairshare value for your account.
- Estimate job start time: to check when your job is likely to start, use the squeue command with the --start flag.

Please note that the estimated start time is based on the anticipated runtimes of other jobs, so it may not always be precise.
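For illustration, the three commands above could be invoked like this (the job ID 123456 is a placeholder; exact output may vary between SLURM versions):

```bash
# Priority value and its weighted components for a specific job
sprio -j 123456 -l

# Fairshare values for your own account associations
sshare -u $USER

# Estimated start time of a pending job
squeue -j 123456 --start
```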