Archiving

Meeting the Archiving Requirement

Research projects are required by their respective funding organizations to archive their data (raw data, results, software workflows, etc.). In addition, HPC projects should not expect long-term storage of their research data on HPC filesystems and are hence advised to use the ZDV data management facilities.

Automatically Set Metadata

With every archiving operation, the following metadata are automatically associated with your data set:

  • Creator - the full user name of the person archiving the data in question
  • Publisher - always set to “Johannes Gutenberg-University”
  • Location - set to “Mainz, Germany”
  • Date - the date of the archiving operation, stored as the Unix timestamp of the operation
  • ExpiryDate - Date + 10 years
  • protected - set to “false” by default, which means that the data can still be changed (e.g. more data added to a collection)
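
You can inspect these metadata with the standard iRODS icommands; a minimal sketch, assuming the icommands are available and with /zone/home/<user>/<collection> as a placeholder for your actual collection path:

# list all attribute-value-unit (AVU) metadata attached to a collection
imeta ls -C /zone/home/<user>/<collection>

# query a single attribute, e.g. the expiry date
imeta ls -C /zone/home/<user>/<collection> ExpiryDate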

Metadata Stewardship with Schemas

To facilitate populating iRODS collections with metadata according to schemas, we provide a helper module.

You can create a schema file with an online tool:

JSON-Schemas to iRODS

Loading the module tools/imcs provides a script which can be called as follows:

schema2avu -j <json_file> -c <iRODS-path to iRODS collection>

Currently, nested schemas for complex data are not supported. As such nesting might be data specific, you may approach the HPC team to have the necessary feature added for your specific data.
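
As an illustration, a call might look like the following; the schema file name and the collection path are hypothetical placeholders:

module load tools/imcs
# populate the collection with AVU metadata derived from a flat JSON schema
schema2avu -j experiment_schema.json -c /zone/home/<user>/experiment_data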

Preparing to Archive

We suggest compressing and annotating data prior to archiving with the iRODS archive:

  • compressing saves transfer time
  • annotation eases the interpretation of retrieved data if an archive needs to be pulled back (see the sketch after this list).
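
Annotation can be as simple as a README file packed with the data, or AVU metadata attached to the archive after upload; a minimal sketch, assuming the iRODS icommands are available and with all names, paths and description texts as placeholders:

# pack a plain-text README alongside the data before compressing
echo "project X, raw data of runs 1-10, created $(date -I)" > <directoryname>/README.txt

# alternatively, attach descriptive AVU metadata to the uploaded archive
imeta add -d /zone/home/<user>/<archivename>.tar.gz description "raw data of runs 1-10"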

Compressing Directories

A smaller directory can be compressed in the standard way:

# assuming gzip compression
tar -czf <archivename>.tar.gz <directoryname>
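
Before uploading, a quick integrity check can save a round trip, for example:

# list the first few entries to verify the archive is readable
tar -tzf <archivename>.tar.gz | head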

You can speed up the compression on a login node by using a parallel compression tool like pigz:

module load tools/pigz
tar cf - <directoryname> | pigz -p 4 > <archivename>.tar.gz

If the directory you are working on is too big for a login node, you can run an interactive job instead:

module load tools/pigz
# an interactive job might look like:
srun -A <your account> -p parallel -C broadwell -t <appropriate time> -N 1 -c40 --pty bash -i
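  # on the allocated node; without -p, pigz uses all processors available to it by default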
  <some node>:$ tar -I pigz -cf <archivename>.tar.gz <directoryname>
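
After compression, transfer the archive into iRODS; a minimal sketch, assuming the icommands are available and with /zone/home/<user>/ as a placeholder for your archive collection:

# upload the archive and verify the transfer with a server-side checksum
iput -K <archivename>.tar.gz /zone/home/<user>/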