by Matt Schultz and Kyle Felker
This is the seventh post in the bloggERS Script It! Series.
In 2016, bepress launched a new service to help Digital Commons subscribers archive the contents of their institutional repositories. Using Amazon Web Services (AWS), the service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket that is set up and hosted by the institution. Subscribers thus control a copy of all their institutional repository content, refreshed daily, outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket on their own schedules and for whatever purposes they deem appropriate.
Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.
Leveraging the ease of launching Amazon’s EC2 server instances and the AWS command line interface (CLI), GVSU Libraries was able to readily configure an EC2 “curation server” that syncs copies of their Digital Commons data directly from S3. The tools mentioned above are installed on that server, where they build preservation packages and send them back to S3 for nearline storage and on to Glacier for long-term storage. The entire process now begins and ends in the Cloud, rather than involving the download of data to a local workstation for processing.
Creating a digital preservation copy of GVSU’s ScholarWorks files involves four steps:
- Syncing data from S3 to EC2: Files cannot be processed in place on S3, so they must first be copied to the EC2 “curation server”. As part of this step, we wanted to reorganize the files into logical, alphabetized directories so that it would be easier to locate and process files and to organize the virus reports and the “bags” the process generates.
- Virus and format reports with Brunnhilde: The synced files are then run through Brunnhilde’s suite of tools. Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files. The reports stay with the files as they move through the process.
- Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool. This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
- Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage. From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.
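The reorganization mentioned in the first step can be pictured as sorting each synced item into a directory named for the first letter of its name. The Python sketch below is only an illustration under that assumption. The function names `bucket_for` and `alphabetize`, the `other` catch-all directory, and the idea of bucketing by first letter are all hypothetical; the actual scripts may organize files differently.

```python
import shutil
from pathlib import Path


def bucket_for(name: str) -> str:
    """Return the directory label for a file or folder name (assumed scheme)."""
    first = name[:1].lower()
    # Anything that does not start with an ASCII letter goes in a catch-all.
    return first if first.isascii() and first.isalpha() else "other"


def alphabetize(source: Path, dest: Path) -> dict[str, list[str]]:
    """Copy each top-level entry of `source` into dest/<bucket>/<entry>.

    Returns a mapping of bucket -> entry names, which is handy for logging.
    """
    placed: dict[str, list[str]] = {}
    for entry in sorted(source.iterdir()):
        bucket = bucket_for(entry.name)
        target_dir = dest / bucket
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / entry.name
        if entry.is_dir():
            # dirs_exist_ok lets a later run refresh an existing copy.
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
        placed.setdefault(bucket, []).append(entry.name)
    return placed
```

Calling `alphabetize(Path("synced"), Path("sorted"))` on a directory holding `apple.pdf` and a `Biology` folder would place them under `sorted/a/` and `sorted/b/` respectively; the directory names here are placeholders.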
After the initial run, new files are synced into the existing BagIt data directories on a quarterly basis and re-checked with Brunnhilde. The BagIt metadata must then be updated and re-verified using BagIt, and the changes synced to the destination S3 bucket.
Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).
These scripts can be found in our GitHub repo: https://github.com/gvsulib/curationscripts
Process_backup is the main script. It handles each of the four processing stages outlined above. As it does so, it stores the output of those tasks in log files so they can be examined later. In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.
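The pattern of running each stage in turn and keeping its output in a log file can be sketched in a few lines. The real script is a shell script; the Python below is a stand-in written for illustration. The names `run_stage` and `run_pipeline`, the one-log-per-stage layout, and the choice to stop at the first failing stage are assumptions, and the email notification step is left out.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_stage(name: str, command: list[str], log_dir: Path) -> int:
    """Run one stage, append its output to log_dir/<name>.log, return exit code."""
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / f"{name}.log"
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
        text=True,
    )
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_file.open("a", encoding="utf-8") as log:
        log.write(f"[{stamp}] {' '.join(command)} (exit {result.returncode})\n")
        log.write(result.stdout)
    return result.returncode


def run_pipeline(stages: list[tuple[str, list[str]]], log_dir: Path) -> list[str]:
    """Run stages in order, stopping at the first failure (an assumption).

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        if run_stage(name, command, log_dir) != 0:
            break
        completed.append(name)
    return completed
```

A sync stage might then be declared as `("sync", ["aws", "s3", "sync", "s3://example-source-bucket", "/data/sync"])`, where the bucket name and local path are placeholders rather than GVSU's actual values.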
After the first run, the metadata that BagIt generates has to be updated to reflect new data. The Python implementation of BagIt that we use can’t do this from the command line, but its API includes a call that updates existing “bag” metadata, so we wrote a small Python script (regen_BagIt_manifest.py) to do this. The shell script invokes it during the third stage if bags have already been created.
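To make concrete what updating bag metadata involves, here is a minimal sketch that rebuilds a payload manifest using only the Python standard library. It is not the contents of regen_BagIt_manifest.py and does not use the bagit-python API (where, as far as the library's documentation indicates, opening a bag with `bagit.Bag(path)` and calling `save(manifests=True)` performs this kind of update). The function names and the fixed choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def regenerate_manifest(bag_dir: Path) -> list[str]:
    """Rewrite bag_dir/manifest-sha256.txt from the current contents of data/.

    Returns the manifest lines that were written.
    """
    data_dir = bag_dir / "data"
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Manifest paths are relative to the bag root and use forward slashes.
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

A complete update would also refresh the tag manifest and the Payload-Oxum entry in bag-info.txt, which this sketch leaves alone; that is one reason to rely on a maintained BagIt library rather than hand-rolled code in practice.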
Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.
GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.
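Since bagit_check.py is still in progress, it is not shown here. As a rough illustration of what a validation audit checks once a bag has been retrieved, the standard-library sketch below compares a bag's payload against its SHA-256 manifest. The function name `audit_bag` and the shape of the report are hypothetical, retrieval from S3 is not covered, and a real audit would more likely lean on a BagIt library (bagit-python provides `Bag.validate()` for this purpose).

```python
import hashlib
from pathlib import Path


def audit_bag(bag_dir: Path) -> dict[str, list[str]]:
    """Compare data/ against manifest-sha256.txt.

    Returns paths that are missing, have a changed checksum, or are present
    in data/ but not listed in the manifest.
    """
    expected = {}
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            # Each line is "<checksum> <path relative to the bag root>".
            checksum, relative = line.split(maxsplit=1)
            expected[relative] = checksum.lower()

    report = {"missing": [], "changed": [], "unlisted": []}
    for relative, checksum in sorted(expected.items()):
        path = bag_dir / relative
        if not path.is_file():
            report["missing"].append(relative)
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != checksum:
            report["changed"].append(relative)

    on_disk = {
        p.relative_to(bag_dir).as_posix()
        for p in (bag_dir / "data").rglob("*")
        if p.is_file()
    }
    report["unlisted"] = sorted(on_disk - set(expected))
    return report
```

A bag passes this check when all three lists in the returned report are empty.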
Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.