Compute Nodes

This page lists the hardware specifications of the compute nodes currently used in the MOGON clusters and describes how these nodes are grouped into partitions.

Available Resources

The following table lists all generally available nodes of MOGON KI. They are interconnected via InfiniBand HDR, and each node has $3.2\thinspace\text{TB}$ of disk space. All nodes use the same CPU, the AMD EPYC 7713.

| Nodes | S / C / T | RAM | Accelerator |
|---|---|---|---|
| cpu0xxx | $2/64/1$ | $33\times256\thinspace\text{GB}$, $14\times512\thinspace\text{GB}$, $2\times1024\thinspace\text{GB}$, $1\times2048\thinspace\text{GB}$ | - |
| gpu0101-gpu0102 | $2/64/1$ | $1024\thinspace\text{GB}$ | $4\times$ AMD MI250 |
| gpu0201-gpu0204 | $2/64/1$ | $2048\thinspace\text{GB}$ | $8\times$ Nvidia A100-SXM4 $80\thinspace\text{GB}$ |
| gpu0301-gpu0307 | $2/64/1$ | $1024\thinspace\text{GB}$ | $8\times$ Nvidia A40 $48\thinspace\text{GB}$ |

S stands for sockets per node, C for cores per socket, and T for threads per core.

The following table lists all generally available nodes of MOGON NHR. They are interconnected via InfiniBand HDR, and each node has $3.2\thinspace\text{TB}$ of disk space. All nodes use the same CPU, the AMD EPYC 7713.

| Nodes | S / C / T | RAM | Accelerator |
|---|---|---|---|
| cpu0xxx | $2/64/1$ | $400\times256\thinspace\text{GB}$, $159\times512\thinspace\text{GB}$, $27\times1024\thinspace\text{GB}$, $4\times2048\thinspace\text{GB}$ | - |
| gpu0001-gpu0010 | $2/64/1$ | $1024\thinspace\text{GB}$ | $4\times$ Nvidia A100-SXM4 $40\thinspace\text{GB}$ |

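Multiplying S, C, and T gives the number of CPUs per node, which is where the factor 128 in the partitioning tables below comes from. A minimal sketch of that arithmetic, using only the cpu0xxx node counts from the two tables above (the names `CPU_NODES`, `cpus_per_node`, and `cluster_totals` are illustrative, not part of any MOGON tool):

```python
from __future__ import annotations

# CPU-node counts per installed RAM size (GB), copied from the two tables above.
CPU_NODES = {
    "MOGON KI": {256: 33, 512: 14, 1024: 2, 2048: 1},
    "MOGON NHR": {256: 400, 512: 159, 1024: 27, 2048: 4},
}

SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 64, 1  # S / C / T


def cpus_per_node(s: int = SOCKETS, c: int = CORES_PER_SOCKET,
                  t: int = THREADS_PER_CORE) -> int:
    """Logical CPUs per node: S x C x T."""
    return s * c * t


def cluster_totals(cluster: str) -> tuple[int, int]:
    """Return (number of CPU nodes, total CPU cores) for one cluster."""
    nodes = sum(CPU_NODES[cluster].values())
    return nodes, nodes * cpus_per_node()


if __name__ == "__main__":
    for name in CPU_NODES:
        nodes, cores = cluster_totals(name)
        print(f"{name}: {nodes} CPU nodes, {cores} cores")
```

Run as a script, it prints 50 CPU nodes (6400 cores) for MOGON KI and 590 CPU nodes (75520 cores) for MOGON NHR. GPU nodes are not included in these totals.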

The RAM listed above is the installed capacity. It is not the same as the memory available to jobs at runtime, because every node reserves some memory for the operating system and basic services.

When you request memory with Slurm, use the RAM values from the partitioning tables below, not the installed capacities above.
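The gap between installed and requestable memory can be made concrete. A minimal sketch, assuming the GB figures in the hardware tables are binary gigabytes (GiB), that pairs each node size with its RAM value from the partitioning tables and builds the matching `sbatch --mem` line (the helper names are hypothetical):

```python
MIB_PER_GIB = 1024

# Installed RAM per node (GB, read here as GiB: an assumption) -> requestable
# RAM in MiB, as listed in the RAM column of the partitioning tables.
REQUESTABLE_MIB = {
    256: 248_000,
    512: 504_000,
    1024: 1_016_000,
    2048: 1_992_000,
}


def reserved_mib(installed_gib: int) -> int:
    """MiB a node of this size keeps back for the OS and basic services."""
    return installed_gib * MIB_PER_GIB - REQUESTABLE_MIB[installed_gib]


def mem_directive(installed_gib: int) -> str:
    """sbatch line requesting all memory a job may use on such a node."""
    return f"#SBATCH --mem={REQUESTABLE_MIB[installed_gib]}M"


if __name__ == "__main__":
    for size in REQUESTABLE_MIB:
        print(f"{size} GB node: {reserved_mib(size)} MiB reserved; {mem_directive(size)}")
```

On a 256 GB node, for example, this gives 262144 - 248000 = 14144 MiB held back, so a request for `--mem=256G` could not be placed on such a node.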

Partitioning

Individual compute nodes are grouped into larger subsets of the cluster, so-called partitions. Partitions group nodes by hardware characteristics or usage policies to keep scheduling fair and responsive.

CPU Nodes

MOGON KI:

| Partition | Limit | RAM | Designated Use |
|---|---|---|---|
| ki-smallcpu | 6 days | $1\thinspace930\thinspace\text{MiB}$ per CPU | jobs using $\text{CPUs} \ll 128$; at most $3\thinspace000$ running jobs per user |
| ki-parallel | 6 days | $248\thinspace000\thinspace\text{MiB}$ or $504\thinspace000\thinspace\text{MiB}$ | jobs using $n$ exclusive nodes, $\text{CPUs}=128\times n$ for $n = 1, 2, \ldots$ |
| ki-longtime | 12 days | $248\thinspace000\thinspace\text{MiB}$ or $504\thinspace000\thinspace\text{MiB}$ | long-running jobs ($\ge 6$ days) |
| ki-largemem | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | jobs with higher memory needs |
| ki-hugemem | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | jobs with the highest memory needs |

MOGON NHR:

| Partition | Limit | RAM | Designated Use |
|---|---|---|---|
| smallcpu | 6 days | $1\thinspace930\thinspace\text{MiB}$ per CPU | jobs using $\text{CPUs} \ll 128$; at most $3\thinspace000$ running jobs per user |
| parallel | 6 days | $248\thinspace000\thinspace\text{MiB}$ or $504\thinspace000\thinspace\text{MiB}$ | jobs using $n$ exclusive nodes, $\text{CPUs}=128\times n$ for $n = 1, 2, \ldots$ |
| longtime | 12 days | $248\thinspace000\thinspace\text{MiB}$ or $504\thinspace000\thinspace\text{MiB}$ | long-running jobs ($\ge 6$ days) |
| largemem | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | jobs with higher memory needs |
| hugemem | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | jobs with the highest memory needs |
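The table rows can be read as a rough decision procedure. A minimal sketch of one such heuristic, assuming a single per-node memory requirement and the limits exactly as listed above; `suggest_partition` is a hypothetical helper, not an official MOGON rule, and it ignores queue load and per-CPU memory trade-offs:

```python
# Limits copied from the CPU partition tables above.
MEM_PER_CPU_MIB = 1_930       # smallcpu: RAM per CPU
CPUS_PER_NODE = 128           # 2 sockets x 64 cores x 1 thread
DEFAULT_LIMIT_H = 6 * 24      # smallcpu, parallel, largemem, hugemem
LONGTIME_LIMIT_H = 12 * 24    # longtime
PARALLEL_MAX_MIB = 504_000    # largest RAM value in parallel/longtime
LARGEMEM_MIB = 1_016_000
HUGEMEM_MIB = 1_992_000


def suggest_partition(cpus: int, mem_per_node_mib: int, hours: float,
                      cluster: str = "nhr") -> str:
    """Heuristic mapping of a job's needs onto the CPU partitions.

    cluster="ki" adds the "ki-" prefix used on MOGON KI.
    """
    prefix = "ki-" if cluster == "ki" else ""
    if hours > LONGTIME_LIMIT_H:
        raise ValueError("no CPU partition allows more than 12 days")
    if hours > DEFAULT_LIMIT_H:
        if mem_per_node_mib > PARALLEL_MAX_MIB:
            raise ValueError("longtime offers at most 504000 MiB per node")
        return prefix + "longtime"
    if mem_per_node_mib > HUGEMEM_MIB:
        raise ValueError("more memory than any single node offers")
    if mem_per_node_mib > LARGEMEM_MIB:
        return prefix + "hugemem"
    if mem_per_node_mib > PARALLEL_MAX_MIB:
        return prefix + "largemem"
    if cpus < CPUS_PER_NODE and mem_per_node_mib <= cpus * MEM_PER_CPU_MIB:
        return prefix + "smallcpu"
    return prefix + "parallel"


if __name__ == "__main__":
    print(suggest_partition(4, 4 * MEM_PER_CPU_MIB, 2))               # smallcpu
    print(suggest_partition(256, 248_000, 48, cluster="ki"))          # ki-parallel
```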

Did you know?

The parallel partition (ki-parallel on MOGON KI) allocates nodes exclusively, so even a 2-CPU job reserves a full node with 128 CPUs. To avoid wasting resources, submit small jobs to the smallcpu partition (ki-smallcpu on MOGON KI).
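The cost of exclusive allocation comes from rounding up to whole nodes. A minimal sketch of that arithmetic, assuming the 128 CPUs per node from the hardware tables (the function names are hypothetical):

```python
CPUS_PER_NODE = 128  # 2 sockets x 64 cores x 1 thread on every CPU node


def reserved_cpus(requested_cpus: int) -> int:
    """CPUs blocked by an exclusive allocation: always whole nodes."""
    if requested_cpus < 1:
        raise ValueError("request at least one CPU")
    nodes = -(-requested_cpus // CPUS_PER_NODE)  # ceiling division
    return nodes * CPUS_PER_NODE


def idle_cpus(requested_cpus: int) -> int:
    """CPUs reserved but left unused by the job."""
    return reserved_cpus(requested_cpus) - requested_cpus


if __name__ == "__main__":
    for cpus in (2, 128, 130):
        print(f"{cpus} requested -> {reserved_cpus(cpus)} reserved, {idle_cpus(cpus)} idle")
```

A 2-CPU job thus leaves 126 CPUs idle, and so does a 130-CPU job, which spills onto a second node.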

Partitions supporting Accelerators

| Partition | Nodes | Limit | RAM | Designated Use |
|---|---|---|---|---|
| mi250 | gpu010x | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | jobs requiring GPUs |
| a100ai | gpu020x | 6 days | $1\thinspace992\thinspace000\thinspace\text{MiB}$ | jobs requiring GPUs |
| a40 | gpu030x | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | jobs requiring GPUs |
| ki-gpu-devel | - | 2 hours | - | GPU testing |
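When writing job scripts for these partitions, the GPU count per node and the time limit are the two values most easily gotten wrong. A minimal sketch that builds `#SBATCH` header lines and checks them against the table above; `gpu_job_header` is a hypothetical helper, and requesting GPUs through the generic resource name `gpu` is an assumption to verify against the cluster's GRES configuration:

```python
from __future__ import annotations

# Per-partition data copied from the accelerator table above.
GPU_PARTITIONS = {
    "mi250": {"gpus_per_node": 4, "max_hours": 6 * 24},
    "a100ai": {"gpus_per_node": 8, "max_hours": 6 * 24},
    "a40": {"gpus_per_node": 8, "max_hours": 6 * 24},
    "ki-gpu-devel": {"gpus_per_node": None, "max_hours": 2},  # nodes not listed
}


def gpu_job_header(partition: str, gpus_per_node: int, hours: int) -> list[str]:
    """Build #SBATCH lines for a GPU job, checking the limits from the table."""
    spec = GPU_PARTITIONS[partition]
    if hours > spec["max_hours"]:
        raise ValueError(f"{partition} allows at most {spec['max_hours']} hours")
    available = spec["gpus_per_node"]
    if available is not None and gpus_per_node > available:
        raise ValueError(f"{partition} nodes have only {available} GPUs")
    days, rem_hours = divmod(hours, 24)
    return [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        # Assumption: GPUs are exposed as the generic resource "gpu";
        # check the GRES column of the sinfo command on this page.
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={days}-{rem_hours:02d}:00:00",  # days-hours:min:sec
    ]


if __name__ == "__main__":
    print("\n".join(gpu_job_header("a40", 2, 24)))
```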

Private Partitions

| Partition | Nodes | Limit | RAM | Accelerators |
|---|---|---|---|---|
| topml | gpu0601 | 6 days | $1\thinspace547\thinspace259\thinspace\text{MiB}$ | NVIDIA H100 80GB HBM3 |
| komet | floating partition | 6 days | $248\thinspace000\thinspace\text{MiB}$ | - |
| czlab | gpu0602 | 6 days | $1\thinspace031\thinspace580\thinspace\text{MiB}$ | NVIDIA L40 |

| Partition | Nodes | Limit | RAM | Designated Use |
|---|---|---|---|---|
| a100dl | gpu00xx | 6 days | $1\thinspace016\thinspace000\thinspace\text{MiB}$ | jobs requiring GPUs |

Hidden Partitions

Anyone can view information on hidden partitions. These partitions are hidden only to avoid cluttering the output of every query: they are “private” to certain projects or groups and are of interest only to those groups.

To list a user's jobs in all partitions, including hidden ones, add the -a flag:

squeue -u $USER -a

Likewise, add -a to sinfo to include hidden partitions in its output. All other commands work as expected without this flag.
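Scripts that poll the queue need the same flag. A minimal Python sketch, assuming `squeue` and `sinfo` are on the `PATH` (for example on a MOGON login node); the wrapper names are hypothetical:

```python
from __future__ import annotations

import os
import shutil
import subprocess


def squeue_cmd(user: str, include_hidden: bool = True) -> list[str]:
    """squeue invocation listing a user's jobs; -a adds hidden partitions."""
    cmd = ["squeue", "-u", user]
    if include_hidden:
        cmd.append("-a")
    return cmd


def sinfo_cmd(include_hidden: bool = True) -> list[str]:
    """sinfo invocation; -a adds hidden partitions."""
    return ["sinfo", "-a"] if include_hidden else ["sinfo"]


def run(cmd: list[str]) -> str | None:
    """Run a Slurm command and return its output, or None if it is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    output = run(squeue_cmd(os.environ.get("USER", "")))
    print(output if output is not None else "squeue not found; run this on a login node")
```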

Slurm Query

The tables on this page list key attributes of MOGON’s compute nodes, grouped by partition. For a complete listing, query a partition directly with the following Slurm command:

scontrol show partition <name-of-partition> --clusters=<cluster>

For example:

scontrol show partition ki-parallel --clusters=mogonki
scontrol show partition parallel --clusters=mogonnhr

Slurm displays the partition's defaults as well as its minimum and maximum settings for run time, memory capacity, and other resources.
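The output of `scontrol show partition` consists of whitespace-separated `key=value` pairs, which makes it easy to post-process. A minimal sketch, assuming values contain no spaces and time limits use the `[days-]hours:minutes:seconds` form; the sample text is an abbreviated, illustrative excerpt (not real MOGON output) and the helper names are hypothetical:

```python
from __future__ import annotations

from datetime import timedelta

# Abbreviated, illustrative excerpt in the key=value layout of
# `scontrol show partition`; not copied from a real MOGON query.
SAMPLE = """PartitionName=parallel
   MaxTime=6-00:00:00
"""


def parse_scontrol(text: str) -> dict[str, str]:
    """Split whitespace-separated key=value tokens into a dict."""
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without '=', e.g. cluster header lines
            fields[key] = value
    return fields


def slurm_time(value: str) -> timedelta | None:
    """Convert '[days-]hours:minutes:seconds' to a timedelta (None if unlimited)."""
    if value in ("UNLIMITED", "INFINITE"):
        return None
    days, _, clock = value.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days or 0), hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    fields = parse_scontrol(SAMPLE)
    print(fields["PartitionName"], slurm_time(fields["MaxTime"]))
```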

Memory Limits

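You can also list all of our partitions with their relevant limits using the sinfo command. The format string prints, in order, the partition name, the node counts by state (allocated/idle/other/total), sockets:cores:threads, memory per node in MB, the time limit, and generic resources (GRES) such as GPUs. The -e flag lists nodes with differing configurations separately, and -S "+P+m" sorts by partition, then by memory: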

sinfo -e -o "%20P %16F %8z %.8m %.11l %G" -S "+P+m" --clusters=all
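To filter that output, for example by memory or GPU type, each data line can be split into the six requested fields. A minimal sketch, assuming exactly the format string above; `PartitionRow` and `parse_sinfo_line` are hypothetical helpers, and the sample line uses invented node counts:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PartitionRow:
    partition: str
    nodes_alloc: int
    nodes_idle: int
    nodes_other: int
    nodes_total: int
    sockets: int
    cores: int
    threads: int
    memory_mb: int
    time_limit: str
    gres: str


def parse_sinfo_line(line: str) -> PartitionRow | None:
    """Parse one data line of the sinfo format above; return None for headers."""
    parts = line.split()
    if len(parts) < 6 or parts[0] in ("PARTITION", "CLUSTER:"):
        return None
    name, nodes, sct, memory, limit = parts[:5]
    alloc, idle, other, total = (int(n) for n in nodes.split("/"))
    sockets, cores, threads = (int(n) for n in sct.split(":"))
    return PartitionRow(
        partition=name.rstrip("*"),         # '*' marks the default partition
        nodes_alloc=alloc, nodes_idle=idle, nodes_other=other, nodes_total=total,
        sockets=sockets, cores=cores, threads=threads,
        memory_mb=int(memory.rstrip("+")),  # '+' appears if node sizes differ
        time_limit=limit,
        gres=" ".join(parts[5:]),
    )


if __name__ == "__main__":
    # Hypothetical line with invented node counts, for illustration only.
    print(parse_sinfo_line("parallel  10/20/1/31  2:64:1  248000  6-00:00:00 (null)"))
```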