GVSU Scripted IR Curation in the Cloud

by Matt Schultz and Kyle Felker

This is the seventh post in the bloggERS Script It! Series.

In 2016, bepress launched a new service to assist subscribers of Digital Commons with archiving the contents of their institutional repository. Using Amazon Web Services (AWS), the service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket owned by the institution. Subscribers thus control a real-time copy of all their institutional repository content outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket, on their own schedules and for whatever purposes they deem appropriate.

Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.

Leveraging the ease of launching Amazon’s EC2 server instances and the AWS command line interface (CLI), GVSU Libraries readily configured an EC2 “curation server” that syncs copies of their Digital Commons data directly from S3. On that server, the tools mentioned above build preservation packages and send them back to S3 and to Glacier for nearline and long-term storage. The entire process now begins and ends in the Cloud, rather than requiring data to be downloaded to a local workstation for processing.

Creating a digital preservation copy of GVSU’s ScholarWorks files involves four steps:

  1. Syncing data from S3 to EC2:  No processing can actually be done on files in place on S3, so they must be copied to the EC2 “curation server”. As part of this process, we wanted to reorganize the files into logical directories (alphabetized), so that it would be easier to locate and process files, and to better organize the virus reports and the “bags” the process generates.
  2. Virus and format reports with Brunnhilde:  The synced files are then run through Brunnhilde’s suite of tools.  Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files.  The reports stay with the files as they move through the process.
  3. Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool.  This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
  4. Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage.  From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.

After the initial run, new files are incorporated and re-synced quarterly into the BagIt data directories and re-checked with Brunnhilde. The BagIt metadata must then be updated and re-verified, and the changes synced to the destination S3 bucket.

Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).

These scripts can be found in our GitHub repo: https://github.com/gvsulib/curationscripts

Process_backup is the main script. It handles each of the four processing stages outlined above.  As it does so, it stores the output of those tasks in log files so they can be examined later.  In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.

After the first time the process is run, the metadata that BagIt generates has to be updated to reflect new data. The version of BagIt we are using (bagit-python) can’t do this from the command line, but its API can update existing “bag” metadata. So we created a small Python script to do this (regen_BagIt_manifest.py). The shell script invokes it at the third stage if bags have previously been created.

Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.
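As an example, a crontab entry for a quarterly run might look like the following (the schedule, paths, and log location here are illustrative, not GVSU’s actual configuration):

```shell
# Run the backup processing script at 2:00 a.m. on the first day of
# January, April, July, and October; append output to a log for review.
0 2 1 1,4,7,10 * /home/ubuntu/curationscripts/process_backup.sh >> /var/log/process_backup.log 2>&1
```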

GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.

Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.

Kyle Felker was born feral in the swamps of Louisiana, and a career in technology librarianship didn’t do much to domesticate him.  He currently works as an Application Developer at GVSU Libraries.


Small-Scale Scripts for Large-Scale Analysis: Python at the Alexander Turnbull Library

by Flora Feltham

This is the third post in the bloggERS Script It! Series.

The Alexander Turnbull is a research library that holds archives and special collections within the National Library of New Zealand. This means exactly what you’d expect: manuscripts and archives, music, oral histories, photographs, and paintings, but also artefacts such as Katherine Mansfield’s typewriter and a surprising amount of hair. In 2008, the National Library established the National Digital Heritage Archive (NDHA), and has been actively collecting and managing born-digital materials since. I am one of two Digital Archivists who administer the transfer of born-digital heritage material to the Library. We also analyse files to ensure they have all the components needed for long-term preservation and ingest collections to the NDHA. We work closely with our digital preservation system administrators and the many other teams responsible for appraisal, arrangement and description, and providing access to born-digital heritage.

Why Scripting?

As archivists responsible for safely handling and managing born-digital heritage, we use scripts to work safely and sanely at scale. Python provides a flexible yet reliable platform for our work: we don’t have to download and learn a new piece of software every time we need to accomplish a different task. The increasing size and complexity of digital collections often means that machine processing is the only way to get our work done. A human could not reliably identify every duplicate file name in a collection of 200,000 files… but Python can. To protect original material, too, the scripting we do during appraisal and technical analysis is done using copies of collections. Here are some of the extremely useful tasks my team scripted recently:

  • Transfer
    • Generating a list of files on original storage media
    • Transferring files off the original digital media to our storage servers
  • Appraisal
    • Identifying duplicate files across different locations
    • Adding file extensions so material opens in the correct software
    • Flattening complex folder structures to support easy assessment
  • Technical Analysis
    • Sorting files into groups based on file extension to isolate unknown files
    • Extracting file signature information from unknown files

Our most-loved Python script even has a name: Safe Mover. Developed and maintained by our Digital Preservation Analyst, Safe Mover will generate file checksums, maintain metadata integrity, and check file authenticity, all the while copying material off digital storage media. Running something somebody else wrote was a great introduction to scripting. I finally understood that: a) you can do nimble computational work from a text editor; and b) a ‘script’ is just a set of instructions you write for the computer to follow.

Developing Skills Slowly, Systematically, and as Part of a Group

Once we recognised that we couldn’t do our work without scripting, my team started regular ‘Scripting Sessions’ with other colleagues who code. At each meeting we solve a genuine processing challenge from someone’s job, which allows us to apply what we learn immediately. We also write the code together on a big screen which, like learning any language, helps us become comfortable making mistakes. Recently, I accidentally copied 60,000 spreadsheets to my desktop and then wondered aloud why my laptop ground to a halt.

Outside of these meetings, learning to script has been about deliberately choosing to problem-solve using Python rather than doing it ‘by hand’. Initially, this felt counter-intuitive because I was painfully slow: I really could have counted 200 folders on my fingers faster than I wrote a script to do the same thing. But luckily for me, my team recognised the overall importance of this skill set, and I regularly remind myself: “this will be so useful the next 5000 times I inevitably need to do this task”.

The First Important Thing I Remembered How to Write

Practically every script I write starts with import os. It became so simple once I’d done it a few times: import is a command, and ‘os’ is the name of the thing I want. os is a Python module that allows you to interact with and leverage the functionality of your computer’s operating system. In general terms, a Python module is just a pre-existing code library for you to use. Modules are usually grouped around a particular theme or set of tasks.

The function I use the most is os.walk(). You tell os.walk() where to start and then it systematically traverses every folder beneath that. For every folder it finds, os.walk() will record three things: 1: the path to that folder, 2: a list of any sub-folders it contains, and 3: a list of any files it contains. Once os.walk() has completed its… well… walk, you have access to the name and location of every folder and file in your collection.

You then use Python to do something with this data: print it to the screen, write it in a spreadsheet, ask where something is, or open a file. Having access to this information becomes relevant to archivists very quickly. Just think about our characteristic tasks and concerns: identifying and recording original order, gaining intellectual control, maintaining authenticity. I often need to record or manipulate file paths, move actual files around the computer, or extract metadata stored in the operating system.

An Actual Script

Recently, we received a Mac-formatted 1TB hard drive from a local composer and performer. When our #1 script, Safe Mover, stopped in the middle of transferring files, we wondered if there was a file path clash. Generally speaking, in a Windows-formatted file system there’s a limit of 255 characters to a file path (“D:\so_it_pays_to\keep_your_file_paths\niceand\short.doc”).

Some older Mac systems have no limit on the number of file path characters so, if they’re really long, there can be a conflict when you’re transferring material to a Windows environment. To troubleshoot this problem we wrote a small script:

import os

top = r"D:\example_folder_01\collection"
for root, dir_list, file_list in os.walk(top):
    for item in file_list:
        file_path = os.path.join(root, item)
        if len(file_path) > 255:
            print(file_path)

So that’s it. Running it over the hard drive gave us the answer we needed. But what is this little script actually doing?

# import the Python module 'os'.
import os

# tell our script where to start
top = r"D:\example_folder_01\collection"
# now os.walk will systematically traverse the directory tree starting from 'top' and retain information in the variables root, dir_list, and file_list.
# remember: root is 'path to our current folder', dir_list is 'a list of sub-folder names', and file_list is 'a list of file names'.
for root, dir_list, file_list in os.walk(top):
    # for every file in those lists of files...
    for item in file_list:
        # ...store its location and name in a variable called 'file_path'.
        # os.path.join joins the folder path (root) with the file name for every file (item).
        file_path = os.path.join(root,item)
        # now we do the actual analysis. We want to know 'if the length of any file path is greater than 255 characters'.
        if len(file_path) > 255:
            # if so: print that file path to the screen so we can see it!
            print (file_path)


All it does is find every file path longer than 255 characters and print it to the screen. The archivists can then eyeball the list and decide how to proceed. Or, in the case of our 1TB hard drive, exclude that as a problem because it didn’t contain any really long file paths. But at least we now know that’s not the problem. So maybe we need… another script?

Flora Feltham is the Digital Archivist at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa. In this role she supports the acquisition, ingest, management, and preservation of born-digital heritage collections. She lives in Wellington.

Electronic Records at SAA 2018

With just weeks to go before the 2018 SAA Annual Meeting hits the capital, here’s a round-up of the sessions that might interest ERS members in particular. This year’s schedule offers plenty for archivists who deal with the digital, tackling current gnarly issues around transparency and access, and format-specific challenges like Web archiving and social media. Other sessions anticipate the opportunities and questions posed by new technologies: blockchain, artificial intelligence, and machine learning.

And of course, be sure to mark your calendars for the ERS annual meeting! This year’s agenda includes lightning talks from representatives from the IMLS-funded OSSArcFlow and Collections as Data projects, and the DLF Levels of Access research group. There will also be a mini-unconference session focused on problem-solving current challenges associated with the stewardship of electronic records. If you would like to propose an unconference topic or facilitate a breakout group, sign up here.

Wednesday, August 15


Electronic Records Section annual business meeting (https://archives2018.sched.com/event/ESmz/electronic-records-section)

Thursday, August 16


105 – Opening the Black Box: Transparency and Complexity in Digital Preservation (https://archives2018.sched.com/event/ESlh)


Open Forums: Preservation of Electronic Government Information (PEGI) Project (https://archives2018.sched.com/event/ETNi/open-forums-preservation-of-electronic-government-information-pegi-project)

Open Forums: Safe Search Among Sensitive Content: Investigating Archivist and Donor Conceptions of Privacy, Secrecy, and Access (https://archives2018.sched.com/event/ETNh/open-forums-safe-search-among-sensitive-content-investigating-archivist-and-donor-conceptions-of-privacy-secrecy-and-access)


201 – Email Archiving Comes of Age (https://archives2018.sched.com/event/ESlo/201-email-archiving-comes-of-age)

204 – Scheduling the Ephemeral: Creating and Implementing Records Management Policy for Social Media (https://archives2018.sched.com/event/ESls/204-scheduling-the-ephemeral-creating-and-implementing-records-management-policy-for-social-media)

Friday, August 17

2:00 – 3:00

501 – The National Archives Aims for Digital Future: Discuss NARA Strategic Plan and Future of Archives with NARA Leaders (https://archives2018.sched.com/event/ESmP/501-the-national-archives-aims-for-digital-future-discuss-nara-strategic-plan-and-future-of-archives-with-nara-leaders)

502 – This is not Skynet (yet): Why Archivists should care about Artificial Intelligence and Machine Learning (https://archives2018.sched.com/event/ESmQ/502-this-is-not-skynet-yet-why-archivists-should-care-about-artificial-intelligence-and-machine-learning)

504 – Equal Opportunities: Physical and Digital Accessibility of Archival Collections (https://archives2018.sched.com/event/ESmS/504-equal-opportunities-physical-and-digital-accessibility-of-archival-collections)

508 – Computing Against the Grain: Capturing and Appraising Underrepresented Histories of Computing (https://archives2018.sched.com/event/ESmW/508-computing-against-the-grain-capturing-and-appraising-underrepresented-histories-of-computing)

Saturday, August 18

8:30 – 10:15

605 – Taming the Web: Perspectives on the Transparent Management and Appraisal of Web Archives (https://archives2018.sched.com/event/ESme/605-taming-the-web-perspectives-on-the-transparent-management-and-appraisal-of-web-archives)

606 – Let’s Be Clear: Transparency and Access to Complex Digital Objects (https://archives2018.sched.com/event/ESmf/606-lets-be-clear-transparency-and-access-to-complex-digital-objects)

10:30 – 11:30

704 – Blockchain: What Is It and Why Should You Care (https://archives2018.sched.com/event/ESmo/704-blockchain-what-is-it-and-why-should-we-care)


Restructuring and Uploading ZIP Files to the Internet Archive with Bash

by Lindsey Memory and Nelson Whitney

This is the second post in the bloggERS Script It! series.

This blog is for anyone interested in uploading ZIP files into the Internet Archive.

The Harold B. Lee Library at Brigham Young University has been a scanning partner with the Internet Archive since 2009. Any loose-leaf or oversized items go through the convoluted workflow depicted below, which can take hours of painstaking mouse-clicking if you have a lot of items like we do (think 14,000 archival issues of the student newspaper). Items must be scanned individually as JPEGs, each JPEG must be converted into a JP2, the JP2s must all be zipped into ZIP files, and then the ZIPs are (finally) uploaded one-by-one into the Internet Archive.

Workflow for uploading ZIP files to the Internet Archive.

Earlier this year, the engineers at the Internet Archive published a single line of Python code that allows scan centers to upload multiple ZIP files into the Internet Archive at once (see “Bulk Uploads”). My department has long dreamed of a script that could reformat Internet-Archive-bound items and upload them automatically, and the arrival of the Python code got us moving. I enlisted the help of the library’s senior software engineer, and we discussed ways to compress scans, how Python scripts communicate with the Internet Archive, and ways to reorganize the scans’ directory files to suit a basic Bash script.

The project was delegated to Nelson Whitney, a student developer. Nelson wrote the script, troubleshot it with me repeatedly, and helped author this blog. Below we present his final script in two parts, written in Bash for macOS in Spring 2018.

Part 1: makeDirectories.sh

This simple command, executed through Terminal on a Mac, takes a list of identifiers (list.txt) and generates a set of organized subdirectories for each item on that list. These subdirectories house the JPEGs and are structured such that later they streamline the uploading process.

#! /bin/bash

# Move into the directory "BC-100" (the name of our quarto scanner),
# then into the subdirectory named after whichever project we're scanning,
# then into a staging subdirectory.
cd BC-100/$1/$2

# Take the plain-text "list.txt," which is saved inside the staging
# subdirectory, and make a new subdirectory for each identifier on the list.
cat list.txt | xargs mkdir

# For each identifier subdirectory,
for d in */; do
  # move into that directory and
  cd "$d"
  # create three subdirectories inside named "01_JPGs_cropped,"
  mkdir 01_JPGs_cropped
  # "02_JP2s,"
  mkdir 02_JP2s
  # and "03_zipped_JP2_file," respectively.
  mkdir 03_zipped_JP2_file
  # Also deposit a blank text file in each identifier folder for
  # employees to write notes in.
  touch Notes.txt

  cd ..
done
The file structure generated for each identifier.

Part 2: macjpzipcreate.sh

This Terminal command recursively moves through the subdirectories, going into 01_JPGs_cropped first and turning all the JPEGs therein into JP2s. Terminal saves the JP2s into the subdirectory 02_JP2s, then zips the JP2s into a zip file and saves the zip in subdirectory 03_zipped_JP2_file. Finally, Terminal uploads the zip to the Internet Archive. Note that for the bulk upload to work, you must have configured Terminal with the “./ia configure” command and entered your IA admin login credentials.

#! /bin/bash

# Place the binary file "ia" into the project directory (this enables
# uploading to the Internet Archive).
cp ia BC-100/$1/$2
# Move into the directory "BC-100" (the name of our quarto scanner),
# then into the subdirectory named after whichever project we're scanning,
# then into a staging subdirectory.
cd BC-100/$1/$2

# For each identifier subdirectory,
for d in */; do
  # move into that identifier's directory and then into the directory
  # containing all the JPG files.
  cd "$d"
  cd 01_JPGs_cropped

  # For each jpg file in the directory,
  for jpg in *.jpg; do
    # convert the jpg file into jp2 format using the sips command
    # built into macOS.
    sips -s format jp2 --setProperty formatOptions best "$jpg" --out "../02_JP2s/$jpg.jp2"
  done

  cd ../02_JP2s
  # The directory variable contains a trailing slash. Remove it so the
  # zip file gets the correct name, then zip up all the JP2 files.
  id="${d%/}"
  zip "$id.zip" *
  # Move the zip file into the intended zip file directory.
  mv "$id.zip" ../03_zipped_JP2_file

  # Move back up to the project directory, where the ia binary lives.
  cd ../..
  # Use the Internet-Archive-provided tool to upload the zip file.
  ./ia upload "$id" "$id/03_zipped_JP2_file/$id.zip" --retries 10
  # Change the repub_state of the online identifier to 4, which marks
  # the item as done in the Internet Archive.
  ./ia metadata "$id" --modify=repub_state:4
done

The script has reduced the labor devoted to the Internet Archive by a factor of four. Additionally, it has bolstered Digital Initiatives’ relationship with IT. It was a pleasure working with Nelson; he gained real-time engineering experience working with a “client,” and I gained a valuable working knowledge of Terminal and the design of basic scripts.


Lindsey Memory is the Digital Initiatives Workflows Supervisor at the Harold B. Lee Library at Brigham Young University. She loves old books and the view from her backyard.




Nelson Whitney is an undergraduate pursuing a degree in Computer Science at Brigham Young University. He enjoys playing soccer, solving Rubik’s cubes, and spending time with his family and friends. One day, he hopes to do cyber security for the Air Force.

Meet the 2018 Candidates: Scott Kirycki

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Scott Kirycki

Running for: Steering Committee

What made you decide you wanted to become an archivist?

My journey to becoming an archivist began somewhere that did not, strictly speaking, exist: the fictional worlds of radio programs from the 1940s. When I was a boy, a local radio station replayed old shows such as The Lone Ranger, The Shadow, and The Jack Benny Program. The shows pulled me in with their appeal to the imagination, and I wanted to learn more about them. My parents and teachers had taught me well about the library, and my interest in radio programs (and soon my interest in the historical period that produced them) provided a new focus for the use of library resources. I checked out books, records, and tapes and studied the non-circulating reference material. Later, as I worked on more research projects for school assignments, I learned how to use The Reader’s Guide to Periodical Literature, and that led me to the treasures contained in bound volumes of magazines and on microfilm.

After starting college and choosing English as my major, I spent more time in libraries, particularly academic libraries with their special collections and emphasis on research. I grew to enjoy the research part of college work especially, which prompted me to pursue a master’s in English literature. When the time came for me to do a thesis for my degree, I picked a bibliographic research project over a literary interpretation or analysis. I created an annotated bibliography of the books advertised in The Tatler, an eighteenth-century British periodical.

Given my interest in research and the amount of time I spent in libraries, I considered following my master’s in English with a degree in library science. Since I liked historical material, I thought of studying to work in an archive. Although I looked into applying at some schools of library science, I was not wholeheartedly enthusiastic about additional years of schooling at that point in my life, so my journey took a turn to the business world.

The company where I began working after grad school turned out to be a good fit for me. The work connected to my interest in research because it involved making computerized databases for lawyers. I learned about database software, indexing, document scanning, and some electronic records management. In time, I became a department manager and later moved to project management. Regrettably for me, the company was eventually sold, and the new owners started a course of restructuring that culminated with the elimination of my position.

After exploring the job market, I reached the conclusion that further education would be a rewarding pursuit – rewarding not just from the standpoint of increasing the likelihood of landing a job, but also rewarding for personal growth and the opportunity to learn from other people. Because I had considered library science before and still had an interest in the things of the past, I decided to return to school to earn a degree in library science with a focus on archives. I was drawn to courses on digital content where I could continue to use and build on the experience and data-handling skills that I gained during my first career.

Though my decision to become an archivist was a long time coming, I am glad to have made it and look forward to continuing to discover the rewards of working in a field where I can benefit others by helping them connect to information.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

As I have been working on projects with the Records Management Team at the University of Notre Dame, I have become increasingly aware of how many electronic records consist of data points in enterprise-size content management systems rather than discrete files such as Word docs and PDFs. I anticipate that archiving databases as well as material from systems that were not necessarily designed with long-term preservation in mind will be a growing challenge for archivists. I would like to see the Electronic Records Section put forward guidance on best practices for meeting this challenge.

What cartoon character do you model yourself after?

The Tick (from the 1994 – 1996 Fox animated series)

Meet the 2018 Candidates: Kelsey O’Connell

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Kelsey O’Connell

Running for: Steering Committee

What made you decide you wanted to become an archivist?

As a history/English major in college, I knew I didn’t want to become a teacher or a lawyer so I began exploring other career options. I landed a position as a student assistant in my college library’s Special Collections and Archives department where I began processing collections. I immediately loved the organization, research, and learning that I participated in daily and realized I just wanted to do this for the rest of my life.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

My primary interest is employing ERS’s platform to influence SAA and related organizations to begin creating and formalizing documentation for electronic records. We all talk about documentation a lot, but we still haven’t rolled it out. I think a strategic approach to initiating much of this documentation is for ERS to survey, index, and prioritize the various policies, guidelines, standards, and frameworks needed for robust documentation on the management and care of electronic records. We’d be able to utilize Section members’ input and participation in the discussion, allowing us to have a comprehensive yet diverse perspective for our recommendations.

What cartoon character do you model yourself after?

I have to admit I had to ask my family and friends for help identifying this one. More than a few of them said Velma from Scooby Doo. I laughed because I assumed it was because I really can’t see without my glasses – like when my cat enjoys knocking them off my nightstand in the middle of the night and I have to search for them in the morning. But they all said it’s because I like figuring stuff out. Although mystery isn’t my favorite genre, there are some clear parallels to researching and employing some trial and error tactics with electronic records.