Lustre
MOGON II has several Lustre fileservers for different purposes:
- Project (`/lustre/project` or `/lustre/miifs01`)
- ATLAS (private, `/lustre/miifs02`)
- HIMster2_th (private, `/lustre/miifs04`)
- HIMster2_exp (private, `/lustre/miifs05`)
MOGON II projects get access to the project fileserver if access is requested during the application process.
NO BACKUP
There is NO BACKUP AT ALL on any of the Lustre fileservers. The fileservers are not an archive system. Please remove all data that is no longer needed from the fileservers.
Basics
- Random IO: bad
- Sequential IO: good
Architecture
A Lustre filesystem consists of multiple object storage targets (OSTs) and at least one metadata target (MDT). The actual data within a file is stored on the OSTs at the block level, while the MDT holds the metadata for the file. Servers providing the MDTs are called Metadata Servers (MDS), servers providing the OSTs are called Object Storage Servers (OSS).
In order to leverage Lustre's parallelization capabilities, it is possible to store single files across several OSTs. This enables a file to be read from multiple sources in parallel, reducing access times and increasing the available bandwidth for very large files.
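To see which MDTs and OSTs make up a filesystem and how full each target is, the standard `lfs df` command lists them all; a minimal sketch, using the project mount point from the list above:

```shell
# List all metadata and object storage targets of the project filesystem,
# together with their size and current usage (-h prints human-readable units).
lfs df -h /lustre/project
```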
Quota
The usage of the project fileserver is limited on a per-project basis. You can find out your project's quota and the amount you are currently using from the command line.
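One way to query this is the standard `lfs quota` command. A minimal sketch, assuming that quota is accounted per Unix group and that the group carries your project's name (replace `my_project` accordingly):

```shell
# Show the group quota of your project on the project filesystem
# (-h prints human-readable units, -g selects group-based accounting).
lfs quota -h -g my_project /lustre/project
```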
Such a query might show, for example, that the project uses $6.716TB$ of the assigned quota of $10TB$. If any limit is exceeded, the corresponding entry is marked with an asterisk.
If your project is above its quota, file creation will be prohibited!
Striping
The process of distributing file blocks across multiple storage targets is called striping.
We have implemented a default striping scheme for all files on the project fileserver. All files are striped across four OSTs, beginning at a size of $4GB$. This not only improves read performance for these files but also distributes load better across storage targets and storage servers.
You can find out the striping pattern of your files with:
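For example, with the standard `lfs getstripe` command (the file name is just a placeholder):

```shell
# Create an empty test file and show its striping layout,
# i.e. the stripe count, stripe size and the OSTs holding its objects.
touch testfile
lfs getstripe testfile
```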
The output shows that the file has a stripe count of 1 for the first $1GB$ and consists of 4 stripes afterwards. As the testfile is empty, there is only one object, created on OST 37, at this point. Should the file exceed $1GB$, more objects on other OSTs will be assigned.
The current striping pattern is a tradeoff that should give good performance in most scenarios. Feel free to change the striping layout with the `lfs setstripe` command if you find it necessary.
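As an illustration only (the stripe count, stripe size and directory name are placeholders, not a recommendation), a directory can be given its own layout so that all files created in it inherit that layout:

```shell
# Stripe all new files in this directory across 8 OSTs
# with a stripe size of 4 MiB.
mkdir -p wide_output
lfs setstripe -c 8 -S 4M wide_output
```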
Lustre usage guidelines
Avoid …
- a large amount of files in single directories
- random IO
  The project fileserver consists of spinning disks. Reading data randomly from files is considerably slower than reading it sequentially.
- using `ls` or wildcard operations
  Although it is not intuitive, invocations of `ls` on directories imply a series of operations that cause a high load on both MDS and OSS. Please avoid them where possible.
- too many parallel IO operations on the same file
  It is easy to start hundreds of parallel threads operating on the same file. It is not easy to actually process this IO on the server side due to the low throughput and high latency of spinning disks. This applies to read operations and is even more important for write operations due to lock contention.
- more than 1000 files per directory

Use the node-local scratch if you cannot avoid the issues listed above. Feel free to contact us any time if you have questions about your workload and would like advice on how to improve your IO.
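A common pattern is to copy the working set to the node-local scratch at the beginning of a job, run the IO-heavy part there, and copy the results back to Lustre once at the end. A minimal sketch of a batch script, assuming a Slurm batch system and a job-specific scratch directory such as `/localscratch/$SLURM_JOB_ID` (path, input archive and analysis program are placeholders; check the local documentation for the exact scratch location):

```shell
#!/bin/bash
#SBATCH --job-name=local-scratch-demo
#SBATCH --time=01:00:00

# Assumed job-private directory on the node-local disk.
SCRATCH=/localscratch/$SLURM_JOB_ID
mkdir -p "$SCRATCH"

# Stage the input once, sequentially, from Lustre to the local disk.
cp /lustre/project/my_group/input.tar "$SCRATCH"/

# Do the many small / random IO operations on the local disk.
cd "$SCRATCH"
tar xf input.tar
./run_analysis --input data/ --output results/

# Copy the results back to Lustre in one go.
cp -r results/ /lustre/project/my_group/
```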
Performance considerations …
- accessing files in Lustre involves significant overhead
- accessing a lot of files in the same directory causes file-locking conflicts
  These conflicts heavily reduce the efficiency and speed of the operations involved.
- reading small files is not efficient
  Every open and read operation on a file comes with considerable overhead and network latency. You should use bigger files instead of many small ones wherever possible.
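If your workflow produces or consumes many small files, packing them into a single archive turns thousands of metadata-heavy operations into one large sequential read or write. A minimal sketch using standard tools (all paths are placeholders):

```shell
# Pack many small files into one archive on Lustre:
# a single large, sequential write instead of thousands of small ones.
tar cf /lustre/project/my_group/results.tar results_dir/

# Later, restore them with one large sequential read,
# ideally onto node-local scratch rather than back onto Lustre.
tar xf /lustre/project/my_group/results.tar -C /path/to/local/workdir
```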
Recent changes
During February and March 2022, we extended the storage in the project filesystem by adding more OSTs. The available space increased significantly, which initially left the OSTs extremely unbalanced, since the new targets were still empty.
Prior to this addition we did not use any striping by default. Usage statistics had shown that the load was often quite unbalanced towards single OSTs, creating bottlenecks in terms of bandwidth and IOPS. As a data migration towards the new storage targets was necessary to restore balance, we also striped files according to the aforementioned striping pattern.
Please Note:
The access times of files that have been migrated may have changed to the date of migration. This is a natural side effect of the migration process and no cause for concern.