On MOGON, GPU tasks can only be run on clusters that have GPU resources available. The cluster needs to be specified explicitly by choosing its corresponding partition. The following partitions have GPU resources available:
A number of public partitions in the MOGON II cluster support GPU usage:
Computing on GPU nodes without using the accelerators/GPUs is prohibited! We reserve the right to terminate accounts that abuse these resources.
To find out which account to use for the m2_gpu partition, log in and run:
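A minimal sketch using the standard Slurm accounting tool; the exact command and output format documented for MOGON may differ:

```bash
# List your account/partition associations known to the Slurm database
sacctmgr show assoc user=$USER format=Account%20,Partition%20
```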
All accounts that show Partition=m2_gpu can be used to submit jobs to the GPU partition. To find information about other partitions, replace m2_gpu with the partition you are interested in.
Every group interested in using these GPUs that does not already have access can apply for it via the AHRP website (currently only for MOGON II).
All GPU partitions on MOGON NHR have a time limit of 6 days for all jobs. To prevent single users or groups from flooding the entire partition with long-running jobs, a limit has been set so that other users also get a chance to run their jobs.
This may result in jobs not starting due to so-called pending reasons such as QOSGrpGRESRunMinutes. For other pending reasons, see our page on job management.
The m2_gpu is a single partition allowing a runtime of up to 5 days. To prevent single users or groups from flooding the entire partition with long-running jobs, a limit has been set so that other users also get a chance to run their jobs.
This may result in pending reasons such as QOSGrpGRESRunMinutes. For other pending reasons, see our page on job management.
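To inspect why a job is still pending, one option is the reason column of squeue; the format string below is only an illustration:

```bash
# %R prints the pending reason (e.g. QOSGrpGRESRunMinutes) for queued jobs
squeue -u $USER -o "%.18i %.12P %.2t %.10M %R"
```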
Unlike the login nodes, the s-nodes have Intel CPUs, which means you have to compile your code on the GPU nodes; otherwise you may end up with illegal instruction errors or similar.
There is a partition m2_gpu-compile which allows one job per user, with a maximum of 8 cores, 1 CPU, and --mem=18000M, for compiling your code. The maximum runtime for compile jobs is 60 minutes.
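A compile-job sketch within these limits; the account name and build command are placeholders, and the mapping of the limits onto --ntasks/--cpus-per-task is an assumption you may need to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=compile
#SBATCH --partition=m2_gpu-compile   # compile partition, one job per user
#SBATCH --account=<your_account>     # placeholder: your Slurm account
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8            # at most 8 cores
#SBATCH --mem=18000M
#SBATCH --time=01:00:00              # compile jobs are limited to 60 minutes

# Build on the GPU-node architecture to avoid illegal instruction errors
make -j "$SLURM_CPUS_PER_TASK"
```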
To use a GPU, you have to explicitly reserve it as a resource in your submission script:
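For example, a single-node job script reserving GPUs via the --gres option might look like the following sketch; the account, walltime, and application name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=m2_gpu           # one of the GPU partitions listed above
#SBATCH --account=<your_account>     # placeholder: your Slurm account
#SBATCH --gres=gpu:<number>          # reserve <number> GPUs (see note below)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00              # placeholder walltime

srun ./my_gpu_application            # placeholder: your GPU program
```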
<number> can be anything from 1 to 6 on our GPU nodes, depending on the partition. To use more than one GPU, the application needs to support multi-GPU execution, of course.
--gres-flags=enforce-binding is currently not working properly in our Slurm version. You may try it with multi-task GPU jobs, but it will not work for jobs reserving only part of a node. SchedMD appears to be working on a bug fix.
Most GPU programs know on their own which device to select; some do not. In any case, Slurm exports the environment variable CUDA_VISIBLE_DEVICES, which holds the comma-separated, enumerated devices available in the job environment, starting from 0.
For instance, when another job occupies the first device and your job selects two GPUs, CUDA_VISIBLE_DEVICES might hold the value 1,2. You can read this into a bash array (with a so-called HERE string):
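A minimal sketch of that HERE-string read:

```bash
# Split CUDA_VISIBLE_DEVICES at the commas into the bash array "devices"
IFS=',' read -r -a devices <<< "$CUDA_VISIBLE_DEVICES"
echo "allocated GPU ids: ${devices[@]}"
```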
Now you can point your applications to the respective devices (assuming you start two applications rather than one that uses both):
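A sketch with two hypothetical applications started in parallel, each pinned to one of the allocated devices:

```bash
# app_one / app_two are placeholders for your own GPU programs
CUDA_VISIBLE_DEVICES=${devices[0]} ./app_one &
CUDA_VISIBLE_DEVICES=${devices[1]} ./app_two &
wait   # wait for both background processes to finish
```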