Mathematica
is a computational software system used in many scientific, engineering, mathematical and computing fields. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois. Beyond its core strength in numeric and symbolic computation, Mathematica also serves as a development platform that integrates computation into complete workflows, from initial ideas to deployed individual or enterprise solutions.
Members of the physics department of the Johannes Gutenberg University have access to an unlimited number of licenses; for them, licensing places no restriction on the number of jobs that can be submitted.
All other users, however, share the 10 licenses the university has obtained, so the number of concurrent jobs is severely limited. Jobs exceeding this limit may crash with a corresponding error message. To avoid this, limit the number of concurrently running jobs. If in doubt, contact the HPC team.
Please Note
The license availability check is performed upon starting Mathematica. A check prior to the start is not possible.
The Wolfram Language uses independent kernels as parallel processors. These kernels do not share common memory, even if they happen to reside on the same machine. However, the Wolfram Language provides functions that implement virtual shared memory for these remote kernels.
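As a rough sketch of this model, the following Wolfram Language snippet launches local kernels, distributes independent work over them, and uses a shared variable as virtual shared memory (the kernel count of 4 is only an example):

```mathematica
(* Launch parallel kernels; the count of 4 is only an example *)
LaunchKernels[4];
Print["Running with ", $KernelCount, " parallel kernels"];

(* Independent work distributed over the kernels *)
squares = ParallelTable[k^2, {k, 1, 16}];

(* Virtual shared memory: the variable is synchronized across all kernels *)
total = 0;
SetSharedVariable[total];

(* A critical section protects the read-modify-write on the shared variable *)
ParallelDo[
  CriticalSection[{totalLock}, total += k],
  {k, 1, 100}
];
Print["Sum computed via shared variable: ", total];
```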
Before you start parallelising with Mathematica on MOGON GPUs, you need to prepare your environment for GPU usage: your Slurm script must make Mathematica available (for example by loading the corresponding environment module), and the SBATCH options should be configured as for any GPU job.
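Independently of the batch script, it is worth checking from within Mathematica that the GPU is actually usable. A minimal sketch using the CUDALink package (which also provides the `CUDAMemoryLoad[]`/`CUDAMemoryGet[]` functions used below) could look like this:

```mathematica
(* Load the CUDALink package, which provides GPU functionality *)
Needs["CUDALink`"]

(* CUDAQ[] returns True if CUDA is supported and correctly set up *)
If[CUDAQ[],
  Print["GPU available: ", CUDAInformation[1, "Name"]],
  Print["No usable CUDA device found"]
]
```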
The test estimates how fast data can be sent to and read from the GPU. However, there is also some overhead included in the measurements, in particular the overhead for function calls and array allocation time. Because those are present in any “real” use of the GPU, it is reasonable to include them. Memory is allocated and data is sent to the GPU using CUDAMemoryLoad[]. Memory is allocated and data is transferred back to CPU memory using CUDAMemoryGet[].
The theoretical bandwidth per lane for PCIe 3.0 is $0.985\,\mathrm{GB/s}$. For the GTX 1080Ti (PCIe 3.0 x16) used in our MOGON GPU nodes, the 16-lane slot could theoretically give $15.754\,\mathrm{GB/s}$. ((This example was taken from the MATLAB Help Center and adapted.))
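A minimal Wolfram Language sketch of such a bandwidth measurement could look like the following; the array size (about 256 MB of single-precision data) is an arbitrary example choice:

```mathematica
Needs["CUDALink`"]

(* Example payload: 2^26 reals, stored as single precision (4 bytes each) on the GPU *)
data  = RandomReal[1., 2^26];
bytes = 4. * Length[data];

(* Host -> GPU: allocate GPU memory and send the data *)
{sendTime, gpuMem} = AbsoluteTiming[CUDAMemoryLoad[data, "Float"]];

(* GPU -> Host: transfer the data back into CPU memory *)
{getTime, hostCopy} = AbsoluteTiming[CUDAMemoryGet[gpuMem]];

Print["Send bandwidth:   ", bytes/sendTime/10.^9, " GB/s"];
Print["Gather bandwidth: ", bytes/getTime/10.^9, " GB/s"];

CUDAMemoryUnload[gpuMem]  (* free the GPU memory again *)
```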
The job script is pretty ordinary. In this example, we use only one GPU and start Mathematica with four threads. To do this, we request one process with four CPUs for multithreading:
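Since such a script is site specific, the following is only a sketch; the partition and account names, the module name, and the Wolfram script file name are placeholders that have to be adapted:

```bash
#!/bin/bash
#SBATCH --job-name=mathematica-gpu   # arbitrary job name
#SBATCH --partition=<gpu-partition>  # placeholder: GPU partition of your site
#SBATCH --account=<your-account>     # placeholder: your project account
#SBATCH --ntasks=1                   # one process ...
#SBATCH --cpus-per-task=4            # ... with four CPUs for multithreading
#SBATCH --gres=gpu:1                 # request one GPU
#SBATCH --time=00:30:00              # walltime (example value)

# Make Mathematica available, e.g. by loading the corresponding module
# (the exact name depends on the installation; check `module avail`)
# module load <mathematica-module>

# Run the Wolfram Language script in batch mode
wolframscript -file gpu_bandwidth.wl
```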
The job is submitted with the following command:
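Assuming the job script was saved as `mathematica_gpu.slurm` (the file name is only an example):

```bash
sbatch mathematica_gpu.slurm
```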
The job will be finished after a few minutes; you can then view the output as follows:
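Unless an output file was specified explicitly, Slurm writes the job output to `slurm-<jobid>.out` in the submission directory:

```bash
cat slurm-<jobid>.out   # replace <jobid> with the ID reported by sbatch
```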
The output should be similar to the following lines:
The script also generates a plot, which we would like to show here:
You might be familiar with this example if you have already come across our MATLAB article. At this point we would like to restate what we originally took from the MATLAB Help Center:
For operations where the number of floating-point computations performed per element read from or written to memory is high, the memory speed is much less important. In this case the number and speed of the floating-point units is the limiting factor. These operations are said to have high "computational density".
A good test of computational performance is a matrix-matrix multiply. For multiplying two $N \times N$ matrices, the total number of floating-point calculations is
$$ \mathrm{FLOPS}(N) = 2N^3 - N^2 $$
Two input matrices are read and one resulting matrix is written, for a total of $3N^2$ elements read or written. This gives a computational density of $(2N - 1)/3$ FLOP/element. Contrast this with plus as used above, which has a computational density of $1/2$ FLOP/element.
MATLAB Help Center, Measuring GPU Performance
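A minimal Wolfram Language sketch of such a matrix-matrix multiply benchmark could look like this; the matrix size $N = 4096$ and the use of single precision on the GPU are only example choices:

```mathematica
Needs["CUDALink`"]

n = 4096;  (* example matrix size *)
a = RandomReal[1., {n, n}];
b = RandomReal[1., {n, n}];

(* Load both input matrices into GPU memory as single-precision floats *)
gpuA = CUDAMemoryLoad[a, "Float"];
gpuB = CUDAMemoryLoad[b, "Float"];

(* Matrix-matrix multiply on the GPU *)
{gpuTime, gpuC} = AbsoluteTiming[CUDADot[gpuA, gpuB]];

(* FLOPS(N) = 2 N^3 - N^2, as derived above *)
flops = 2. n^3 - n^2;
Print["GPU matrix multiply: ", flops/gpuTime/10.^9, " GFLOP/s"];

(* The same multiplication on the CPU (double precision) for comparison *)
{cpuTime, c} = AbsoluteTiming[a.b];
Print["CPU matrix multiply: ", flops/cpuTime/10.^9, " GFLOP/s"];

CUDAMemoryUnload[gpuA];
CUDAMemoryUnload[gpuB];
```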
You can submit the job by executing:
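For example, assuming this second job script was saved as `matmul_gpu.slurm` (again, the name is arbitrary):

```bash
sbatch matmul_gpu.slurm
```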
The job will be completed after a couple of minutes and you can view the output with:
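As before, the Slurm output file can be inspected once the job has finished:

```bash
cat slurm-<jobid>.out
```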
The output should resemble the following lines:
The graphic generated in the script is shown below: