Job Settings

 Cluster


 Memory


 Job Duration


 Parallelization Paradigms

Deeplearning: max 8 GPUs
MI250: max 4 GPUs
SmallGPU: max 8 GPUs
A100DL: max 4 GPUs
A100AI: max 8 GPUs
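The GPU limits listed above can be checked before a job is submitted. The following is only a minimal sketch: `max_gpus` and `check_gpus` are hypothetical helper names (not part of any MOGON tooling), and matching the partition names case-insensitively is an assumption.

```shell
# Hypothetical helper: print the GPU limit listed above for a partition.
max_gpus() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        deeplearning|smallgpu|a100ai) echo 8 ;;
        mi250|a100dl)                 echo 4 ;;
        *) return 1 ;;  # partition is not in the list
    esac
}

# Succeeds only if the requested GPU count fits the partition's limit.
check_gpus() {
    local limit
    limit=$(max_gpus "$1") || return 2
    [ "$2" -ge 1 ] && [ "$2" -le "$limit" ]
}

check_gpus A100DL 4 && echo "A100DL with 4 GPUs: ok"
check_gpus MI250 8 || echo "MI250 with 8 GPUs: exceeds limit"
```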

  Partitions

Max Walltime for : (d-hh:mm:ss)
This partition is for former DGX users who have been migrated from M2 to MOGON-KI. GPUs are only usable if your project requested them.
Billing Weights: CPU=1.5*Num Mem=0.25*GB GPU=10*Num
(ki)-smallcpu is only available when 1 node is chosen. (ki)-smallcpu is the common queue for most users on MOGON and has the lowest billing overall.
Billing Weights: CPU=1.0*Num Mem=1*GB
(ki)-Parallel is an exclusive queue: you must pay for all the resources of your allocated node, even if you do not use them.
For jobs spanning more than 173 nodes, only 256 GB per node is available.
Billing Weights: CPU=128 Mem=1*256/512
(ki)-Longtime is a special queue for jobs that exceed the 6-day walltime limit of the other CPU partitions.
If your job requests less than 6 days of walltime, Slurm will not accept your script or schedule your job.
Billing Weights: CPU=1.25*Num Mem=1.0*GB
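The 6-day rule can be tested against a walltime given in the d-hh:mm:ss form used on this page. This is a sketch under assumptions: `walltime_seconds` and `exceeds_six_days` are hypothetical names, only the d-hh:mm:ss and hh:mm:ss forms are parsed, and a request of exactly 6 days is treated as not exceeding the limit.

```shell
# Hypothetical helper: convert d-hh:mm:ss (or hh:mm:ss) to seconds.
walltime_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'  # any other form is rejected
}

# Succeeds if the walltime is longer than 6 days (518400 seconds).
exceeds_six_days() {
    local secs
    secs=$(walltime_seconds "$1") || return 2
    [ "$secs" -gt 518400 ]
}

if exceeds_six_days 7-12:30:15; then
    echo "7-12:30:15 needs the Longtime queue"
fi
```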
(ki)-Largemem is a high-memory queue for jobs that exceed the standard node's 512 GB memory limit.
If your job needs more than 1 TB of RAM, only Hugemem is available.
If your job requires less than 512 GB, Slurm will not accept your script or schedule your job.
Billing Weights: CPU=1.0*Num Mem=1.6*GB
(ki)-Hugemem is a high-memory queue for jobs that exceed the 1 TB memory limit of Largemem.
If your job requires less than 1 TB, Slurm will not accept your script or schedule your job.
Billing Weights: CPU=1.0*Num Mem=2.8*GB
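The memory thresholds for Largemem and Hugemem can be turned into a simple selection rule. In this sketch `mem_partition` is a hypothetical helper, the request is a whole number of GB per node, 1 TB is taken as 1024 GB, and the handling of requests exactly on a threshold is an assumption. The printed names omit the optional (ki)- prefix, and `standard` stands for the regular 512 GB nodes.

```shell
# Hypothetical helper: suggest a memory class for a per-node request in GB.
mem_partition() {
    local gb=$1
    if [ "$gb" -gt 1024 ]; then
        echo hugemem      # above the 1 TB limit of Largemem
    elif [ "$gb" -gt 512 ]; then
        echo largemem     # above the standard node's 512 GB
    else
        echo standard     # fits on a regular node
    fi
}

for request in 256 768 2048; do
    echo "$request GB -> $(mem_partition "$request")"
done
```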
For the MI250 queue, you must compile your application with SYCL/HIP/OpenCL on the system with ROCm, not CUDA.
GPUs are only usable if your project requested them. The MI250 are dual GPUs; for best performance, 2 MPI processes per GPU often work best.
Billing Weights: CPU=1.0*Num Mem=1.5*GB GPU=9*Num
For the SmallGPU queue with A40, load CUDA modules. GPUs are only usable if your project requested them.
Billing Weights: CPU=1.0*Num Mem=1.5*GB GPU=7*Num
For the A100DL queue, load CUDA modules. GPUs are only usable if your project requested them.
Billing Weights: CPU=1.0*Num Mem=1.5*GB GPU=9*Num
For the A100AI queue, load CUDA modules. GPUs are only usable if your project requested them.
Billing Weights: CPU=1.0*Num Mem=3.0*GB GPU=17*Num
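The billing weights of the form CPU=w*Num Mem=w*GB GPU=w*Num can be evaluated for a planned request. This sketch assumes the billed value is the plain sum of the three weighted terms; the actual accounting depends on the site's Slurm configuration, and the fixed formula listed for (ki)-Parallel is not covered. `billing` is a hypothetical helper name.

```shell
# Hypothetical helper: billing = cpu_weight*cpus + mem_weight*GB + gpu_weight*gpus
# (assumption: the weighted terms are summed).
# Arguments: cpu_weight cpus mem_weight mem_gb [gpu_weight gpus]
billing() {
    awk -v cw="$1" -v cpus="$2" -v mw="$3" -v gb="$4" \
        -v gw="${5:-0}" -v gpus="${6:-0}" \
        'BEGIN { printf "%.2f\n", cw * cpus + mw * gb + gw * gpus }'
}

# SmallGPU weights from the list above: CPU=1.0, Mem=1.5 per GB, GPU=7
# Request: 8 CPUs, 32 GB, 1 GPU
billing 1.0 8 1.5 32 7 1
```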

 Modules

Currently only toolchain 2024a is supported, without any compatibility logic.
Some combinations will fail, so choose carefully!

 Executable Commands

 srun / mpirun
Use mpirun/mpiexec — NOT srun — for OpenMPI 5.x.x jobs.
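The launcher rule above can be expressed as a small switch on the Open MPI version. This is only a sketch: `pick_launcher` is a hypothetical helper, the version string is passed in by hand, and falling back to srun for other versions is an assumption that the note above does not state.

```shell
# Hypothetical helper: choose a launcher from an Open MPI version string.
pick_launcher() {
    case "$1" in
        5.*) echo mpirun ;;  # per the note above: mpirun/mpiexec for OpenMPI 5.x.x
        *)   echo srun ;;    # assumption: other versions keep using srun
    esac
}

launcher=$(pick_launcher 5.0.3)
echo "$launcher ./my_program"
```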
#!/bin/bash
#========[ + + + + MOGON Script Engine v26.6.2 + + + + ]========#
#
#  Documentation:  https://docs.hpc.uni-mainz.de
#   Chat Support:  https://mattermost.gitlab.rlp.net/hpc-support
# Ticket Support:  hpc@uni-mainz.de

#========[ + + + + Job Information + + + + ]========#
#SBATCH --mail-user=
#SBATCH --account=
#SBATCH --mail-type=
#SBATCH --job-name=
#SBATCH --comment=
#SBATCH --output=stdout_%x_%j.out
#SBATCH --error=stderr_%x_%j.err

#========[ + + + + Job Description + + + + ]========#
#SBATCH --partition=
#SBATCH --gres=gpu:
#SBATCH --time=:00
#SBATCH --signal=B:SIGUSR2@600
#SBATCH --ramdisk=M
#SBATCH --mem=
#SBATCH --mem-per-cpu=
#SBATCH --nodes=
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=
#SBATCH --array=1-
export OMP_NUM_THREADS=
export MKL_NUM_THREADS=

#========[ + + + + Localscratch & Ramdisk + + + + ]========#
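# Remember the submit directory and the node-local job directory
SAVEDPWD=$(pwd)
JOBDIR=/localscratch/$SLURM_JOB_ID
RAMDISK=$JOBDIR/ramdisk
# Copy results back to the submit directory (runs on SIGUSR2 and at normal job end)
cleanup(){
    cp "$JOBDIR"/output_file "$SAVEDPWD"/ & copy_out=$!
    cp "$JOBDIR"/restart_file "$SAVEDPWD"/ & copy_restart=$!
    # Wait only for the two copies, not for a program that may still be running
    wait "$copy_out" "$copy_restart"
    exit 0
}
# --signal=B:SIGUSR2@600 sends SIGUSR2 to this batch shell 600 seconds before the time limit
trap 'cleanup' SIGUSR2
# Stage the input files to local scratch
cp "$SAVEDPWD"/input_file "$JOBDIR"
cp "$SAVEDPWD"/restart_file "$JOBDIR"
# Run from local scratch. Bash defers traps while a foreground command runs,
# so the program is started in the background and waited for.
cd "$JOBDIR"
"$SAVEDPWD"/my_program & program=$!
wait "$program"
# Normal end of job: copy the results back
cleanup
######
# Ramdisk variant: stage input from the parallel file system into the ramdisk
cp <file_in_parallel_file_system> "$RAMDISK"/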

#========[ + + + + Modules + + + + ]========#
module purge
module use /apps/easybuild/current/cuda/modules/all
module load

#========[ + + + + Execution + + + + ]========#
Get your share: "sshare -A <account_name>"
   Total Resource Consumption
Total CPUs:
Total GPUs:
Total Memory:
Total CPU hours: h
Max Energy Consumption: up to for the Job
Billing: Your share costs
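The total CPU hours shown in the summary can be estimated by hand. This sketch assumes that CPU hours are simply the total CPU count multiplied by the requested walltime in hours; how the page itself derives the figure is not shown. `cpu_hours` is a hypothetical helper and accepts only the d-hh:mm:ss form.

```shell
# Hypothetical helper: CPU hours = total CPUs * walltime in hours.
# Arguments: total_cpus walltime (d-hh:mm:ss)
cpu_hours() {
    echo "$2" | awk -F'[-:]' -v cpus="$1" '
        NF == 4 { printf "%.1f\n", cpus * ($1 * 24 + $2 + $3 / 60 + $4 / 3600); next }
        { exit 1 }'  # any other walltime form is rejected
}

# 32 CPUs for 1 day and 12 hours
cpu_hours 32 1-12:00:00
```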