Archiving
Meeting the Archiving Requirement
Research projects are required by their respective funding organizations to archive their data (raw data, results, software workflows, etc.). In addition, HPC projects should not expect long-term storage of their research data on HPC filesystems and are therefore advised to use the ZDV data management facilities.
Automatically set Metadata
With every archiving act the following metadata are automatically associated with your data set:
- Creator: the full user name of the person archiving the data in question
- Publisher: always set to “Johannes Gutenberg-University”
- Location: set to “Mainz, Germany”
- Date: the date of the archiving act; contains the Unix timestamp of the act
- ExpiryDate: Date + 10 years
- protected: set to “false” by default, which means that the data can still be changed (e.g. more data added to a collection)
Metadata Stewardship with Schemas
To make it easier to populate iRODS collections with metadata according to schemas, we provide a helper module.
You can create a schema file with an online tool:
JSON-Schemas to iRODS
Loading the module tools/imcs will provide a script which can be called like:
schema2avu -j <json_file> -c <iRODS-path to iRODS collection>
Preparing to archive
We suggest compressing and annotating data prior to archiving with the iRODS archive:
- compressing saves transfer time
- annotation eases the interpretation of retrieved data (if an archive needs to be pulled back).
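One possible way to annotate a directory before compressing it is to add a short description file and a checksum manifest. The following is only a sketch under assumptions: the directory name demo_dataset, the file names README.txt and MANIFEST.sha256, and the README fields are hypothetical and not prescribed by the archive. It relies on GNU find, sort and sha256sum.

```shell
set -eu

# Hypothetical directory name; a small demo tree is created so the sketch runs as-is.
data_dir="demo_dataset"
mkdir -p "${data_dir}/results"
printf 'sample,value\nA,1\n' > "${data_dir}/results/table.csv"

# Free-text description of the data set (the field names are only a suggestion).
cat > "${data_dir}/README.txt" <<EOF
Title:       Demo data set
Created by:  ${USER:-unknown}
Created on:  $(date -u +%Y-%m-%dT%H:%M:%SZ)
Contents:    results/table.csv - example result table
EOF

# Checksum manifest of every file except the manifest itself.
(
  cd "${data_dir}"
  find . -type f ! -name MANIFEST.sha256 -print0 | sort -z | xargs -0 sha256sum > MANIFEST.sha256
)

# After retrieval, the same manifest can confirm that no file was altered.
(cd "${data_dir}" && sha256sum --check --quiet MANIFEST.sha256)
```

Because the description and the manifest live inside the directory, they end up in the compressed archive and are available again whenever the data is pulled back.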
Compressing Directories
A smaller directory can be compressed in the standard way:
# assuming gzip compression
tar -czf <archivename>.tar.gz <directoryname>
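Before transferring an archive it can be worth checking that it is readable. The sketch below is an illustration, not a required step: the names demo_dir and demo_dir.tar.gz are hypothetical stand-ins for <directoryname> and <archivename>, and keeping a .sha256 file next to the archive is merely one possible convention.

```shell
set -eu

# Hypothetical names; a tiny directory is created so the sketch runs as-is.
directoryname="demo_dir"
archivename="demo_dir"
mkdir -p "${directoryname}"
printf 'hello\n' > "${directoryname}/file.txt"

tar -czf "${archivename}.tar.gz" "${directoryname}"

# Test the gzip stream for corruption (no output means it is intact).
gzip -t "${archivename}.tar.gz"

# List the archive members and compare the count against the source tree.
# (The counts can differ if file names contain newlines.)
n_archive=$(tar -tzf "${archivename}.tar.gz" | wc -l)
n_source=$(find "${directoryname}" | wc -l)
echo "entries in archive: ${n_archive}, entries in source: ${n_source}"

# Record a checksum that can be kept alongside the archive.
sha256sum "${archivename}.tar.gz" > "${archivename}.tar.gz.sha256"
```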
You may speed up the compression on a login node by using a parallel compression tool like pigz:
module load tools/pigz
tar cf - <directoryname> | pigz -p 4 > <archivename>.tar.gz
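The same pipeline can be written so that it still works where pigz is not available. This is a minimal sketch with hypothetical names (demo_parallel); it assumes that falling back to single-threaded gzip is acceptable, and the thread count of 4 is simply taken from the command above.

```shell
set -eu

# Hypothetical names; a tiny directory is created so the sketch runs as-is.
directoryname="demo_parallel"
archivename="demo_parallel"
mkdir -p "${directoryname}"
printf 'some data\n' > "${directoryname}/data.txt"

# Thread count: 4 matches the command above; adjust to what the node allows.
threads=4

if command -v pigz >/dev/null 2>&1; then
    # Parallel compression with pigz.
    tar -cf - "${directoryname}" | pigz -p "${threads}" > "${archivename}.tar.gz"
else
    # Fallback: single-threaded gzip produces a compatible archive.
    tar -cf - "${directoryname}" | gzip > "${archivename}.tar.gz"
fi

# pigz writes ordinary gzip data, so plain tar can read the archive back.
tar -tzf "${archivename}.tar.gz"
```

Since pigz output is in gzip format, the resulting file can later be unpacked with a standard tar -xzf, whether or not pigz is installed on the machine doing the extraction.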
If the directory is too big to compress on a login node, you can run the compression in an interactive job instead:
module load tools/pigz
# an interactive job might look like:
srun -A <your account> -p parallel -C broadwell -t <appropriate time> -N 1 -c40 --pty bash -i
<some node>:$ tar -I pigz -cf <archivename>.tar.gz <directoryname>