Calculating on GPU nodes without using the accelerators / GPUs is prohibited! We reserve the right to terminate an account for abuse of these resources.
The m2_gpu is a single partition allowing a runtime of up to 5 days. To prevent single users or groups from flooding the entire partition with their long-running jobs, a limitation has been set so that other users also get the chance to run their jobs. This may result in pending reasons such as QOSGrpGRESRunMinutes. For other pending reasons, see our page on job management.
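You can check the reason a pending job is waiting with squeue (a sketch; the format string shown here is only one possible choice):

```bash
# list your jobs; for pending jobs the last column shows the reason, e.g. QOSGrpGRESRunMinutes
squeue -u "$USER" -o "%.12i %.12P %.25j %.8T %R"
```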
Unlike the login nodes, the s-nodes have Intel CPUs, which means that you have to compile your code on the GPU nodes; otherwise you may end up with illegal instruction errors or similar.
There is a partition m2_gpu-compile which allows one job per user with a maximum of 8 cores (on 1 CPU) and --mem=18000M for compiling your code. The maximum runtime for compile jobs is 60 minutes.
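A compile job could be submitted along these lines (a minimal sketch; the account placeholder, module name, and make invocation are assumptions to be adapted to your project and build system):

```bash
#!/bin/bash
#SBATCH -p m2_gpu-compile       # compile partition, one job per user
#SBATCH -A <your_account>       # placeholder: your project account
#SBATCH -c 8                    # at most 8 cores
#SBATCH --mem=18000M
#SBATCH -t 60                   # at most 60 minutes

# build on hardware matching the GPU nodes to avoid illegal instruction errors
module load <your_compiler_module>   # placeholder
make -j "$SLURM_CPUS_PER_TASK"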
To use a GPU you have to explicitly reserve it as a resource in the submission script:
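A minimal sketch of the relevant directives:

```bash
#SBATCH -p m2_gpu
#SBATCH --gres=gpu:<number>
```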
<number> can be anything from 1 to 6 on our GPU nodes, depending on the partition. In order to use more than 1 GPU, the application needs to support multiple GPUs, of course.
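Put together, a multi-GPU batch script might look like this (a sketch; account, module, runtime, and application name are placeholders, not a prescribed setup):

```bash
#!/bin/bash
#SBATCH -p m2_gpu
#SBATCH -A <your_account>        # placeholder: your project account
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --gres=gpu:2             # reserve 2 of the up to 6 GPUs per node
#SBATCH -t 2-00:00:00            # 2 days, within the 5-day limit

module load <your_cuda_module>   # placeholder

# the application itself must be able to use both GPUs
./my_gpu_application
```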
--gres-flags=enforce-binding is currently not working properly in our Slurm version. You may try to use it with multi-task GPU jobs, but it won't work for jobs reserving only part of a node. SchedMD appears to be working on a bug fix.
Most GPU programs know on their own which device to select; some do not. In any case, Slurm exports the environment variable CUDA_VISIBLE_DEVICES, which simply holds the comma-separated, enumerated devices allowed in the job environment, starting from 0.
So, when for instance another job occupies the first device and your job selects two GPUs, CUDA_VISIBLE_DEVICES might hold the value 1,2 and you can read this into an array (with a so-called HERE string):
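For example (a sketch; the array name gpus is arbitrary):

```bash
# split the comma-separated device list into a bash array using a HERE string
IFS=',' read -r -a gpus <<< "$CUDA_VISIBLE_DEVICES"

echo "${gpus[0]}"   # 1 in the example above
echo "${gpus[1]}"   # 2 in the example above
```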
Now, you can point your applications to the respective devices (assuming you start two applications rather than a single one that uses both GPUs):
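One way to do this (a sketch; my_gpu_application and its inputs are placeholders, and restricting each process via CUDA_VISIBLE_DEVICES is just one possible approach):

```bash
# run one instance per reserved GPU, each restricted to its own device
CUDA_VISIBLE_DEVICES=${gpus[0]} ./my_gpu_application input_a &
CUDA_VISIBLE_DEVICES=${gpus[1]} ./my_gpu_application input_b &
wait   # wait for both background processes to finish
```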