Manage IO Issues
How to prevent Memory and I/O issues
Local Scratch Space
On every node, local scratch space is available to your running jobs. Every job can therefore use a directory called /localscratch/${SLURM_JOB_ID}/ on the local disk. This also holds for job arrays: each array task gets its own directory /localscratch/${SLURM_JOB_ID}/, and the variable SLURM_ARRAY_TASK_ID is merely the index of the subjob within the array and is unrelated to $SLURM_JOB_ID.
If your job(s) in question merely read and write big files sequentially, there is no need to use a local scratch directory or a ramdisk. However, there are scenarios where using the local scratch might be beneficial:
- if your job produces many temporary files
- if your job reads a file or a set of files in a directory repeatedly during run time (multiple threads or concurrent jobs result in a random access pattern on the global file system, which is a true performance killer)
If your job runs on multiple nodes, you cannot use the local scratch space on one node from the other nodes.
If you need your input data on every node, please refer to the section Copy files to multiple nodes via job script.
For the further explanation on this page, we assume you have a program called my_program, which reads input data from ./input_file, writes output data to ./output_file and periodically writes a checkpoint file called ./restart_file.
The program shall be executed on a whole node with 64 processors, for example using OpenMP.
Assume you would normally start the program in the current working directory where it will read and write its data like this:
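A minimal sketch of such an invocation (my_program and its files are the example names introduced above; the thread count is an assumption for an OpenMP program on a 64-core node):

```bash
$ OMP_NUM_THREADS=64 ./my_program
```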
Now to get the performance of local disk access, you want to use the aforementioned local scratch space on the compute node.
Available Space
Please bear in mind that the free space in /localscratch/${SLURM_JOB_ID}/ at the time your job starts might be shared with other users. If you need the entire space to be available to you for the whole job, you should request the whole node, for example by allocating all of its CPUs.
Copy files via job script and signalling batch scripts with SLURM
The following example (see the script after the note on signalling below) submits a job script to which SLURM sends a signal shortly before the job ends. This enables the job script to collect the data written to the local scratch directory or directories.
Signalling in SLURM – difference between signalling submission scripts and applications
In SLURM, applications do not automatically receive a signal before hitting the walltime limit; it needs to be requested explicitly:
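A sketch of the corresponding sbatch directive (signal name and lead time chosen to match the description below):

```bash
#SBATCH --signal=USR2@600    # send SIGUSR2 600 s (10 min) before the walltime is reached
```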
This would send the signal SIGUSR2 to the application ten minutes before the walltime of the job is reached. Note that the SLURM documentation states that there is an uncertainty of up to one minute in the delivery time.
Usually this requires you to use the B: prefix, e.g. --signal=B:USR2@600, within a submission script in order to signal the batch script itself (without the prefix, the signal is sent to all children of the batch job, but not to the batch script). The reason is: if you use a submission script like the one above, you trap the signal within the script, not within the application.
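Putting this together, a sketch of such a job script, using the example program and file names from above (node layout, runtime and thread count are placeholders):

```bash
#!/bin/bash
#SBATCH -N 1                        # one whole node
#SBATCH -c 64                       # all 64 CPUs for the (OpenMP) program
#SBATCH --time=05:00:00             # placeholder walltime
#SBATCH --signal=B:USR2@600         # signal the batch script 10 min before the walltime

JOBDIR=/localscratch/${SLURM_JOB_ID}

# Copy the results from the local scratch back to the submit directory
# when SIGUSR2 arrives shortly before the walltime is reached.
cleanup()
{
    cp "${JOBDIR}/output_file" "${JOBDIR}/restart_file" "${SLURM_SUBMIT_DIR}/"
    exit 0
}
trap cleanup USR2

# Stage the input data into the local scratch directory
cp "${SLURM_SUBMIT_DIR}/input_file" "${JOBDIR}/"
cd "${JOBDIR}"

export OMP_NUM_THREADS=64

# Run the program in the background and wait for it, so that the
# trap can fire while the program is still running.
"${SLURM_SUBMIT_DIR}/my_program" &
wait

# Normal end: copy the results back as well
cp output_file restart_file "${SLURM_SUBMIT_DIR}/"
```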
Copy files to multiple nodes via job script
The following script can be used to ensure that input files are present in the job directory on all nodes.
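A sketch of such a script, reusing the example input file from above (node count, task layout and runtime are placeholders):

```bash
#!/bin/bash
#SBATCH -N 4                        # placeholder: four nodes
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00

# sbcast copies the file into the node-local job directory on every
# allocated node, so all tasks find it under the same local path.
sbcast "${SLURM_SUBMIT_DIR}/input_file" "/localscratch/${SLURM_JOB_ID}/input_file"

cd "/localscratch/${SLURM_JOB_ID}"
srun "${SLURM_SUBMIT_DIR}/my_program"
```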
The demonstrated sbcast command can also be used for the one-node example above.
Ramdisk Reservation
In addition to this section, there is a man page available:
Especially for I/O-intensive jobs that issue many system calls, the local disk as well as our GPFS fileserver can become a bottleneck. Staging files to a local RAM disk (ramdisk) can be a solution for these jobs.
In order to create a ramdisk for your job, you must specify an (additional) sbatch statement, where the size of the ramdisk is given in the usual units of megabytes, gigabytes or terabytes, abbreviated as M, G or T.
The ramdisk is created inside the job directory, at /localscratch/${SLURM_JOB_ID}/ramdisk.
For this example we assume submission like $ sbatch <jobscript> and a script like:
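A sketch of such a script. The option name --ramdisk is an assumption here, as the exact flag is defined by the site's SPANK plugin and not reproduced on this page:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH --time=02:00:00
#SBATCH --ramdisk=20G               # assumed option name; reserves a 20 GB ramdisk

RAMDISK=/localscratch/${SLURM_JOB_ID}/ramdisk

# Stage the input into the ramdisk, run there, and copy the results back
cp "${SLURM_SUBMIT_DIR}/input_file" "${RAMDISK}/"
cd "${RAMDISK}"
"${SLURM_SUBMIT_DIR}/my_program"
cp output_file restart_file "${SLURM_SUBMIT_DIR}/"
```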
The specification can be given on the command line, too:
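For example (again assuming --ramdisk as the option name):

```bash
$ sbatch --ramdisk=20G <jobscript>
```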
- Omitting the unit will result in an error informing you about the requirement to supply a unit.
- A unit like T for terabyte, as well as large values given with G, is to be used with caution: the current implementation will not check whether the selected nodes actually provide that much physical RAM.
The reserved disk size is stored in the environment variable SLURM_SPANK_JOB_RAMDISK. This can be used in job scripts and holds the reserved memory value in units of megabytes.
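For example, inside a job script:

```bash
# SLURM_SPANK_JOB_RAMDISK holds the reserved ramdisk size in megabytes
echo "Reserved ramdisk size: ${SLURM_SPANK_JOB_RAMDISK} MB"
```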
Multi Node Jobs
For jobs using multiple nodes, multiple ramdisks are created (one per node). To facilitate the stage-in phase, you can use the sbcast command:
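For example, inside a multi-node job script (reusing the example input file from above):

```bash
# Copy the input file into the ramdisk on every allocated node
sbcast "${SLURM_SUBMIT_DIR}/input_file" "/localscratch/${SLURM_JOB_ID}/ramdisk/input_file"
```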
Now, all nodes can access the file at the same path.
Special Filesystems
EtapFS
When using etapfs, you should access these file systems only from within your job and should not pass etapfs paths on the command line of SLURM commands (e.g. sbatch). In particular, stdout should not be directed to /etapfs, as this can prevent a job from being ended (even if the wall clock time limit is hit) if such a file system “hangs”.
This means that the -o, -e and -D/--workdir=<path> parameters of sbatch should not point to etapfs. However, copying to etapfs from within your script is fine, e.g.:
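For example (the etapfs target directory is a placeholder):

```bash
# Run and write the output locally, then copy the results to etapfs afterwards
"${SLURM_SUBMIT_DIR}/my_program"
cp output_file restart_file /etapfs/<your_project>/results/
```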
Likewise, directing the output of subshells to etapfs is an alternative solution. This helps the scheduler to clean up jobs.
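A sketch of that approach (the target path is again a placeholder); only the subshell's output goes to etapfs, while the job's own stdout/stderr (the -o/-e targets) stay on a safe file system:

```bash
( "${SLURM_SUBMIT_DIR}/my_program" ) > /etapfs/<your_project>/my_program.out 2>&1
```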
Also, you should not submit jobs while your current working directory at the time of submission is on an etapfs. Instead, an option such as -D/--workdir=<path>, pointing to a directory outside of etapfs, can intentionally be used to avoid the problems mentioned above while submitting from a working directory on etapfs.
Atlas - I/O bandwidth reservation
The GPFS fileserver of MOGON provides a file system which is exclusively reserved for ATLAS users. The maximum total I/O bandwidth of this file system is about $8000 MB/s$. Until the GPFS file system has been optimized for typical ATLAS ROOT jobs, the schedulable bandwidth is set to a lower value in order to prevent oversubscription of the provided bandwidth, which would result in unnecessarily long wall times. When submitting a job, the user must specify the expected bandwidth.
If the user needs an I/O bandwidth of e.g. $10 MB/s$, the SBATCH statements must provide the following parameter:
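The exact parameter is site-specific and not reproduced here. As an illustration only, if the reservation were modelled as a Slurm license named, hypothetically, atlasio (with one unit corresponding to $1 MB/s$), the request could look like this:

```bash
#SBATCH -L atlasio:10    # hypothetical license name; requests 10 MB/s of I/O bandwidth
```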
Jobs will only start running if there is enough I/O bandwidth available. The amount of available bandwidth can be checked via:
Here, the first number gives the total available bandwidth (in $MB/s$) and the second gives the currently reserved bandwidth. Hence, jobs requesting more bandwidth than is currently available will have to wait. (The following columns of the output also list the hosts managing this bandwidth.)
PLEASE NOTE
If the rusage parameter is omitted by a user, Slurm will automatically assume a bandwidth of $10 MB/s$.
For jobs that do not perform a reasonable amount of I/O (sum of all files < 100 MByte), the user should specify the parameter with $0 MB/s$; otherwise, as mentioned before, $10 MB/s$ will be assumed.