Compute Nodes
This page lists hardware specifications for the compute nodes currently used in the MOGON clusters, as well as our partitioning.
Available Resources
The following table displays all generally available nodes of MOGON KI. They are interconnected via Infiniband HDR and have $3.2\thinspace\text{TB}$ disk space. All nodes run on the same AMD EPYC 7713 CPU architecture.
| Nodes | S / C / T | RAM | Accelerator |
|---|---|---|---|
| cpu0xxx | $2/64/1$ | $33\times256\thinspace\text{GB}$, $14\times512\thinspace\text{GB}$, $2\times1024\thinspace\text{GB}$, $1\times2048\thinspace\text{GB}$ | - |
| gpu0101 - gpu0102 | $2/64/1$ | $1024\thinspace\text{GB}$ | $4\times$ AMD MI250 |
| gpu0201 - gpu0204 | $2/64/1$ | $2048\thinspace\text{GB}$ | $8\times$ Nvidia A100-SXM4 $80\thinspace\text{GB}$ |
| gpu0301 - gpu0307 | $2/64/1$ | $1024\thinspace\text{GB}$ | $8\times$ Nvidia A40 $48\thinspace\text{GB}$ |
S stands for sockets per node, C - cores per socket, and T - threads per core.
The following table displays all generally available nodes of MOGON NHR. They are interconnected via Infiniband HDR and have $3.2\thinspace\text{TB}$ disk space. All nodes run on the same AMD EPYC 7713 CPU architecture.
| Nodes | S / C / T | RAM | Accelerator |
|---|---|---|---|
| cpu0xxx | $2/64/1$ | $400\times256\thinspace\text{GB}$, $159\times512\thinspace\text{GB}$, $27\times1024\thinspace\text{GB}$, $4\times2048\thinspace\text{GB}$ | - |
| gpu0001 - gpu0010 | $2/64/1$ | $1024\thinspace\text{GB}$ | $4\times$ Nvidia A100-SXM4 $40\thinspace\text{GB}$ |
S stands for sockets per node, C - cores per socket, and T - threads per core.
The memory specified above is not to be confused with RAM available at runtime, as all nodes reserve some memory for basic services.
When you specify your memory reservation with Slurm, please use the RAM values in the partitioning table below.
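As a minimal sketch of how the table values translate into a Slurm request, the snippet below computes the total memory for a job on the smallcpu partition. It assumes the $1\thinspace930\thinspace\text{MiB}$ per-CPU figure from the partitioning table is the amount to request per allocated CPU; the helper name `mem_for_cpus` and the CPU count are purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: derive a Slurm memory request from the partitioning table.
# MIB_PER_CPU is taken from the smallcpu row (1930 MiB per CPU).
# mem_for_cpus is a hypothetical helper, not part of any MOGON tooling.
MIB_PER_CPU=1930

mem_for_cpus() {
    local cpus="$1"
    # total memory in MiB for the requested number of CPUs
    echo $(( cpus * MIB_PER_CPU ))
}

cpus=16   # illustrative job size
echo "--partition=smallcpu --cpus-per-task=${cpus} --mem=$(mem_for_cpus "$cpus")M"
# prints: --partition=smallcpu --cpus-per-task=16 --mem=30880M
```

The printed options can be passed to `sbatch` or placed in `#SBATCH` lines. Requesting `--mem-per-cpu=1930M` instead expresses the same limit per CPU rather than as a total.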
Partitioning
Individual compute nodes are grouped together into larger subsets of the cluster to form so-called partitions. Partitions group nodes based on characteristics or policies to ensure fairness and responsiveness.
Nodes: CPU-Nodes
MOGON KI
| Partition | Limit | RAM | Designated Use |
|---|---|---|---|
| ki-smallcpu | 6 days | $1\thinspace930\thinspace\text{MiB}$ per CPU | jobs using $\text{CPUs} \ll 128$; max. running jobs per user: $3\text{k}$ |
| ki-parallel | 6 days | $248\thinspace000\thinspace\text{MiB}$, $504\thinspace000\thinspace\text{MiB}$ | jobs using $\text{n}$ exclusive nodes, $\text{CPUs}=128\times\text{n}$ for $\text{n}\in[1,2,\ldots]$ |
| ki-longtime | 12 days | $248\thinspace000\thinspace\text{MiB}$, $504\thinspace000\thinspace\text{MiB}$ | long-running jobs $\ge \text{6 days}$ |
| ki-largemem | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | higher memory needs |
| ki-hugemem | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | higher memory needs |
MOGON NHR
| Partition | Limit | RAM | Designated Use |
|---|---|---|---|
| smallcpu | 6 days | $1\thinspace930\thinspace\text{MiB}$ per CPU | jobs using $\text{CPUs} \ll 128$; max. running jobs per user: $3\text{k}$ |
| parallel | 6 days | $248\thinspace000\thinspace\text{MiB}$, $504\thinspace000\thinspace\text{MiB}$ | jobs using $\text{n}$ exclusive nodes, $\text{CPUs}=128\times\text{n}$ for $\text{n}\in[1,2,\ldots]$ |
| longtime | 12 days | $248\thinspace000\thinspace\text{MiB}$, $504\thinspace000\thinspace\text{MiB}$ | long-running jobs $\ge \text{6 days}$ |
| largemem | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | higher memory needs |
| hugemem | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | higher memory needs |
Did you know?
The parallel partition allocates nodes exclusively, so even a 2-CPU job reserves a full node of 128 cores. To avoid waste, submit small jobs to the smallcpu partition.
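The rule of thumb above can be sketched as a small shell helper. It assumes 128 cores per node (2 sockets with 64 cores each, as in the hardware tables) and treats any job below one full node as a smallcpu job, which simplifies the $\text{CPUs} \ll 128$ guidance. The function name `suggest_partition` is hypothetical, and on MOGON KI the partition names carry the `ki-` prefix.

```shell
#!/usr/bin/env bash
# Sketch only: pick a partition from the requested CPU count.
# CORES_PER_NODE follows from the S / C / T column (2 sockets x 64 cores).
# suggest_partition is an illustrative helper, not an official tool.
CORES_PER_NODE=128

suggest_partition() {
    local cpus="$1"
    if [ "$cpus" -lt "$CORES_PER_NODE" ]; then
        # small jobs share nodes instead of reserving one exclusively
        echo "smallcpu"
    else
        # parallel allocates whole nodes, so round up to full nodes
        local nodes=$(( (cpus + CORES_PER_NODE - 1) / CORES_PER_NODE ))
        echo "parallel nodes=${nodes}"
    fi
}

suggest_partition 2     # prints: smallcpu
suggest_partition 256   # prints: parallel nodes=2
```

Rounding up makes the cost of exclusive allocation visible: a 200-CPU request still occupies two full nodes.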
Partitions supporting Accelerators
MOGON KI
| Partition | Nodes | Limit | RAM | Designated Use |
|---|---|---|---|---|
| mi250 | gpu010x | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | GPU requirement |
| a100ai | gpu020x | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | GPU requirement |
| a40 | gpu030x | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | GPU requirement |
| ki-gpu-devel | - | 2 hours | - | GPU testing |
Private Partitions
| Partition | Nodes | Limit | RAM | Accelerators |
|---|---|---|---|---|
| topml | gpu0601 | 6 days | $1\thinspace547\thinspace259\thinspace\text{MiB}$ | NVIDIA H100 80GB HBM3 |
| komet | floating partition | 6 days | $248\thinspace000\thinspace\text{MiB}$ | - |
| czlab | gpu0602 | 6 days | $1\thinspace031\thinspace580\thinspace\text{MiB}$ | NVIDIA L40 |
Partitions supporting Accelerators (MOGON NHR)
| Partition | Nodes | Limit | RAM | Designated Use |
|---|---|---|---|---|
| a100dl | gpu00xx | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | GPU requirement |
Hidden Partitions
Information on hidden partitions can be viewed by anyone. These partitions are hidden to keep the output of routine queries uncluttered: they are “private” to certain projects or groups and only of interest to those groups.
To show all jobs of a user in all partitions, supply the -a flag:
squeue -u $USER -a
Likewise, sinfo can be supplemented with -a to gather information on hidden partitions. All other commands work as expected without this flag.
Slurm Query
The tables on this page list key attributes of MOGON’s compute nodes grouped by partition. For a complete listing, you can also query this information with the following Slurm command:
scontrol show partition <name-of-partition> --clusters=<cluster>
For example:
scontrol show partition ki-parallel --clusters=mogonki
scontrol show partition parallel --clusters=mogonnhr
Slurm will display defaults as well as minimal and maximal settings for reservation time, memory capacity, etc. of the partition.
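Since the full scontrol output is long, a filter can narrow it down to the time and memory limits. The following is a sketch under the assumption that the partition record uses the usual `Key=Value` fields (for example `MaxTime` or `DefMemPerCPU`); the helper name `pick_limits` is hypothetical, and the query is only run when scontrol is available.

```shell
#!/usr/bin/env bash
# Sketch only: keep the time and memory fields of a partition record.
# pick_limits is an illustrative helper; it reads Key=Value pairs on stdin.
pick_limits() {
    tr ' ' '\n' |
        grep -E '^(PartitionName|DefaultTime|MaxTime|DefMemPer(CPU|Node)|MaxMemPer(CPU|Node))='
}

# Run the query only where Slurm is installed.
if command -v scontrol >/dev/null 2>&1; then
    # --oneliner prints each record on a single line
    scontrol --oneliner show partition parallel --clusters=mogonnhr | pick_limits
fi
```

Fields that are not set for a partition are simply absent from the filtered output.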
Memory Limits
You can also list all of our partitions with relevant limits using the sinfo command:
sinfo -e -o "%20P %16F %8z %.8m %.11l %G" -S "+P+m" --clusters=all
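To make the format string easier to read, the sketch below assembles it field by field. The field meanings follow the sinfo man page; the variable name `fmt` is arbitrary, and sinfo is only called when it is available.

```shell
#!/usr/bin/env bash
# Sketch only: build the sinfo output format one field at a time.
fmt="%20P"        # partition name, padded to 20 characters
fmt="$fmt %16F"   # nodes by state: allocated/idle/other/total
fmt="$fmt %8z"    # sockets:cores:threads per node (the S / C / T column)
fmt="$fmt %.8m"   # memory per node in megabytes, right-justified
fmt="$fmt %.11l"  # maximum time limit for a job, right-justified
fmt="$fmt %G"     # generic resources (GRES) such as GPUs

if command -v sinfo >/dev/null 2>&1; then
    # -e avoids grouping nodes whose reported configurations differ
    sinfo -e -o "$fmt" -S "+P+m" --clusters=all
else
    echo "sinfo not found; the format string would be: $fmt"
fi
```

The resulting string is identical to the one in the command above, so the output columns are partition, node counts, S:C:T, memory, time limit, and GRES.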